Sunday, October 23, 2022

The RISC Deprogrammer

I should write up a larger technical document on this, but in the meanwhile is this short (-ish) blogpost. Everything you know about RISC is wrong. It's some weird nerd cult. Techies frequently mention RISC in conversation, with other techies nodding their head in agreement, but it's all wrong. Somehow everyone has been mind controlled to believe in wrong concepts.

An example is this recent blogpost which starts out saying that "RISC is a set of design principles". No, it wasn't. Let's start from this sort of viewpoint to discuss this odd cult.

What is RISC?

Because of the march of Moore's Law, every year, more and more parts of a computer could be included onto a single chip. When chip densities reached the point where we could almost fit an entire computer on a chip, designers made tradeoffs, discarding unimportant stuff to make the fit happen. They made tradeoffs, deciding what needed to be included, what needed to change, and what needed to be discarded.

RISC is a set of creative tradeoffs, meaningful at the time (early 1980s), but which were meaningless by the late 1990s.

The interesting parts of CPU evolution are the three decades from 1964 with IBM's System/360 mainframe and 2007 with Apple's iPhone. The issue was a 32-bit core with memory-protection allowing isolation among different programs with virtual memory. These were real computers, from the modern perspective: real computers have at least 32-bit and an MMU (memory management unit).

The year 1975 saw the release of Intel 8080 and MOS 6502, but these were 8-bit systems without memory protection. This was at the point of Moore's Law where we could get a useful CPU onto a single chip.

In the year 1977 we saw DEC release it's VAX minicomputer, having a 32-bit CPU w/ MMU. Real computing had moved from insanely expensive mainframes filling entire rooms to less expensive devices that merely filled a rack. But the VAX was way too big to fit onto a chip at this time.

The real interesting evolution of real computing happened in 1980 with Motorola's 68000 (aka. 68k) processor, essentially the first microprocessor that supported real computing.

But this comes with caveats. Making microprocessor required creative work to decide what wasn't included. In the case of the 68k, it had only a 16-bit ALU. This meant adding two 32-bit registers required passing them twice through the ALU, adding each half separately. Because of this, many call the 68k a 16-bit rather than 32-bit microprocessor.

More importantly, only the lower 24-bits of the registers were valid for memory addresses. Since it's memory addressing that makes a real computer "real", this is the more important measure. But 24-bits allows for 16-megabytes of memory, which is all that anybody could afford to include in a computer anyway. It was more than enough to run a real operating system like Unix. In contrast, 16-bit processors could only address 64-kilobytes of memory, and weren't really practical for real computing.

The 68k didn't come with a MMU, but it allowed an extra MMU chip. Thus, the early 1980s saw an explosion of workstations and servers consisting of a 68k and an MMU. The most famous was Sun Microsystems launched in 1982, with their own custom designed MMU chip.

Sun and its competitors transformed the industry running Unix. Many point to IBM's PC from 1982 as the transformative moment in computer history, but these were non-real 16-bit systems that struggled with more than 64k of memory. IBM PC computers wouldn't become real until 1993 with Microsoft's Windows NT, supporting full 32-bits, memory-protection, and pre-emptive multitasking.

But except for Windows itself, the rest of computing is dominated by the Unix heritage. The phone in your hand, whether Android or iPhone, is a Unix computer that inherits almost nothing from the IBM PC.

These 32-bit Unix systems from the early 1980s still lagged behind DEC's VAX in performance. The VAX was considered a mini-supercomputer. The Unix workstations were mere toys in comparison. Too many tradeoffs were made in order to fit everything onto a single chip, too many sacrifices made.

Some people asked "What if we make different tradeoffs?"

Most people thought the VAX was the way of the future, and were all chasing that design. The 68k CPU was essentially a cut down VAX design. But history had anti-VAX designs that worked very differently, notably the CDC 6600 supercomputer from the 1960s and the IBM 801/ROMP processor from the 1970s.

It's not simply one tradeoff, but a bunch of inter-related tradeoffs. They snowball -- each choice you make changes the costs-vs-benefit analysis of other choices, changing them as well.

This is why people can't agree upon a single definition of RISC. It's not one tradeoff made in isolation, but a long list of tradeoffs, each part of a larger scheme.

In 1987, Motorola shipped its 68030 version of the 68k processor, chasing the VAX ideal. By then, we had ARM, SPARC, and MIPS processors that significantly outperformed it. Given a budget of roughly 100,000 transistors allowed by Moore's Law of the time, the RISC tradeoffs were better than VAX-like tradeoffs.

So really, what is RISC?

Let's define things in terms of 1986, comparing the [ARM, SPARC, MIPS] processors called "RISC" to the [68030, 80386] processors that weren't "RISC". They all supported full 32-bit processing, memory-management, and preemptive multitasking operating systems like Unix.

The major ways RISC differed were:

  • fixed-length instructions (32-bits or 4-bytes each)
  • simple instruction decoding
  • horizontal vs. vertical microcode
  • deep pipelines of around 5 stages
  • load/store aka reg-reg
  • simple address modes
  • compilers optimized code
  • more registers

If you are looking for the one thing that defines RISC, it's the thing that nobody talks about: horizontal microcode.

The VAX/68k/x86 architecture decoded external instructions into internal control ops that were pretty complicated, supporting such things as loops. Each external instruction executed an internal microprogram with a variable number of such operations.

The classic RISC worked differently. Each external instruction decoded into exactly 4 internal ops. Moreover, each op had a fixed purpose:

  1. read from two registers into the ALU (arithmetic-logic unit)
  2. execute a math operation in the ALU
  3. access memory (well, the L1 cache)
  4. write results back into one register

(This explanation has been fudged and simplified, btw).

This internal detail was expressed externally in the instruction set, simplifying decoding. The external instructions specified two registers to read, an ALU opcode, and one register to write. All of this was fit into a constant 32-bits. In contrast, the [68k/x86/VAX] model meant a complex decoding of instructions with a large ROM containing microprograms.

Roughly half (50%) of the 68000's transistors contained this complex decoding logic and ROM. In contrast, for RISC processors, it was closer to 1%. All those transistors could be dedicated to other things. See how tradeoffs snowball? Saving so many transistors involved in instruction decoding meant being able to support other features elsewhere. It's not clear this is a benefit, however. This meant that RISC needed multiple instructions to do the same thing as a single [68k/x86/VAX] instruction.

This meant instructions could be deeply pipelined. Instructions could be overlapped. When reading registers for the current instruction, we can simultaneously be fetching the next, and performing the ALU calculation on the previous instruction. The classic RISC pipeline had 5 stages (the 4 mentioned above plus 1 for fetching the next instruction). Each clock cycle would execute part of 5 instructions simultaneous, each at a different stage in the pipeline.

This was called scalar operation, In previous processor, it would take a variable number of clock cycles for an instruction to complete. In RISC, every instruction had 5 clock cycle latency from beginning to end. And since execution was overlapped/pipelined, executing 5 instructions at a time, the throughput was one instruction per clock cycle.

All CPUs are pipelined to some extent, but they need complex interlocks to prevent things from colliding with each other, such as two pipelined instructions trying to read registers at the same time. RISC removed most of those interlocks, by strictly regulation what an instruction could do in each stage of the pipeline. Removing these interlocks reduced transistor count and sped things up. This could be one possible definition of RISC that you never hear of: it got rid of all these interlocks found in other processors.

Some pipeline conflicts were worse. Because pipelining, the results of an instruction won't be available until many clock cycles later. What if one instruction writes its results to register #5 (r5), and the very next instruction attempts to read from register #5 (r5)? It's too soon, it has to wait more clock cycles for the result.

The answer: don't do that. Assembly language programmers need to know this complication, and are told to simply not write code that does this, because then the program won't work.

This was anathema of the time. Throughout history to this point, each new CPU architecture also had a new operating-system written in assembly language, with many applications written in assembly language. Thus, a programmer-friendly assembly language was considered one of the biggest requirements for any new system. Requiring programmers to know such quirks lead to buggy code was simply unacceptable. Everybody knew that programmer-hostile instruction-sets would never work in the market, even if they performed faster and cheaper.

But technology is littered with what everybody knowing being wrong. In this case, by 1980 we had the  C programming language that was essentially a "portable assembly language" and the Unix operating system written in C. The only people who needed to know about a quirky assembly language were the compiler writers. They would take care of all such problems.

That's why the history lesson above talks about Unix and real computing. Without Unix and C, RISC wouldn't have happened. An operating-system written in a high-level language was a prerequisite for RISC. It's as import an innovation as Moore's Law allowing 100,000 transistors to fit on a chip.

Because of the lack of complex decoding logic, the transistor budget was freed up to support such things as more registers. The Intel x86 architecture famously had 8 registers, while the RISC competitors typically had as many as 32. The limitation was decode space. It takes 5 bits to specify one of 32 possibilities. Given that most every instructions specified two registers to read from and one register to write to, that's 15 bits, or half of the instruction space, leaving 17 bits for other purposes.

The creators of the RISC, Hennessy and Patterson, wrote a textbook called a "Computer Architecture: A Quantitative Approach". It's horrible. It imagines a world where people need to be taught tradeoffs and transistor budgets. But there is no other approach than a quantitative one, it's like an economics textbook "Economics: A Supply And Demand Approach". While the textbook has a weird obsession with quantitative theory, it misses non quantitative tradeoffs, like the fact that RISC couldn't happen without C and Unix. 

Among the snowballing tradeoffs is the load/store architecture, while at the same time, having fewer addressing modes. It's here that we need to go back and discuss history -- what the heck is an "addressing mode"????

In the beginning, computers had only a single general purpose register, called the accumulator. All calculations, like adding two numbers together, involved reading the second value from memory and combining with the first value already in the accumulator. All calculations, whether arithmetic (add, subtract) or logical (AND, OR, XOR) involved one value already in the register, and another value from memory.

Addresses have to be calculated. For example, when accessing elements in a table, we have to take the row number, multiply it by the size of the table, add an offset into the row for desired column, then add all that to the address at the start of the table. Then after calculating this address, we often want to increment the index to fetch the next row.

If the table base address and row index are held in registers, we might get a complex instructions like the following. This calculates and address using two registers r10 and r11, fetches that value from memory, then adds it into register r9.

 ADD r9, [r10 + r11*8 + 4]

Such calculations embedded in the instruction-set were necessary for such early computers. While they had only a single general purpose register (the accumulator), they still had multiple special purpose registers used this way for address calculations. 

For complex computers like the VAX, such address modes imbedded in instructions were no longer necessary, but still desirable. Half the work of the computer is in calculating memory addresses. It's very tedious for programmers to do it manually, easier when the instruction-set takes care of common memory access patterns (like accessing cells within a table).

This leads us to the load/store issue.

With many registers, we no longer need to read another value from memory (a reg-mem calculation). We can instead perform the calculation using two registers (reg-reg). The VAX had such reg-reg instructions, but programmers still mostly used the reg-mem instructions with the complex address calculations.

RISC changed this. Calculations were now exclusively reg-reg, where math operations like addition could only operate on registers. To add something from memory, you needed first to load it from memory into a register, using a separate, explicit instruction. Likewise, writing back to memory required an explicit store operation.

This architecture can be called either reg-reg or load/store, with the second name being more popular.

With RISC, addressing modes were still desirable, but now they applied to only the two load and store instructions.

The available addressing modes were constrained by the limited RISC pipeline and limited 32-bit fixed-length instructions. Since the pipeline allowed for the reading of two registers at the start, adding two registers together to form the address was allowed. The example shown above was too complex, though.

What you are supposed to be reading from all of this is that all of these tradeoffs are linked. Each decision that diverges from the ideal VAX-like architecture snowballed into other decisions that drifted further and further from this ideal, until what we had was something that looked nothing like a VAX.

The upshot of these decisions was being able to reduce a 32-bit MMU CPU into roughly a single chip because it needed fewer transistors, while at the same time performing much faster. It required maybe twice as many instructions to perform the same tasks (mostly due to needing more complex address calculations due to lack of addressing modes), but performed them at maybe 5 times faster, for a significant speed up.

At the time, the VAX was the standard benchmark target. When Sun shipped it's first SPARC RISC systems (the Sun-4), they benchmarked about twice as fast as the latest VAX systems, while being considerably cheaper.

The end of RISC

By the late 1980s, everybody knew that RISC was the future. Sure, Intel continued with its x86 and Motorola with it's 68000, but that's because the market wanted backwards compatibility with legacy instruction-sets. Both attempted to build their own RISC alternatives, but failed. When backwards compatibility wasn't required, everybody created RISC processors, because for 32-bit MMU real computing, they were  better. And everybody knew it.

But of course, everybody was eventually wrong. Even as early as the 80486 in 1989, Intel was converting the innards of the processor into something that looked more RISC-like.

The nail in the coffin came in 1995 with Intel's Pentium Pro processor that supported out-of-order (or OoO) processing. Again, it wasn't really a new innovation. Out-of-order instructions first appeared on 1960s era supercomputers from CDC and IBM. This was the first time that transistor budgets allowed it to be practically used on single-chip microprocessors.

Transistor budgets were so high that designers no longer had to make basic painful tradeoffs. The decisions necessary trying to cram everything into 100,000 transistors were longer meaningful when you had more than 1-million transistors to work with. Instruction-set decoding requiring 20k transistors is important with small budgets, but meaningless with large budgets.

With OoO, the microarchitecture inside the chip looks roughly the same, regardless if it's an Intel x86, ARM, SPARC, or whatever.

This was proven in benchmarks. When Intel released its out-of-order Pentium Pro in 1995, it beat all the competing in-order RISC processors on the market.

Everybody was wrong -- RISC wasn't the future, the future was OoO.

One way of describing the Pentium Pro is that it "translates x86 into RISC-like micro-ops". What that really means is that instead of vertical microcode, it translated things into horizontal, pipelined micro-ops. Most of the typical math operations were split into two micro-ops, one a load/store operation, and the other a reg-reg operation. (Some x86 instructions need even more micro-ops: address calculation, then load/store, then ALU op).

Intel still has an "x86 tax" decoding complex instructions. But in terms of pipeline stages, that tax only applies to the first stage. Typical OoO processors have at least 10 more stages after that. Even RISC instruction-set processors like ARM must translate external instructions into internal micro-ops.

The only significant difference left is the fact that Intel's instructions are variable length. The fixed length instructions of RISC means that multiple can be fetched at once, and decoded all in parallel. This is impossible with Intel x86, they must at least partially be decoded serially, one before the next. You don't know where the next instruction starts until you've figured out the length of the current instruction.

Intel and AMD find crafty ways to get around this. For example, AMD has often put hints in its instruction cache (L1I) to so that decoders can know the length of instructions. Intel has played around with "loop caches" (so-called because they are most useful for loops) that track instructions after they've been decoded, so they don't need to be decoded again.

The upshot is that for most code, there's no inherent difference between x86 and RISC, they have essentially the same internal architecture for out-of-order (OoO) processors. No instruction-set has an inherent advantage over the other.

And it's been that way since 1995.

I mention this because bizarrely this cult has persisted for the last 30 years after OoO replaced RISC for high-end real computers. It ceased being a useful technical distinction, so what are techies still discussing it?

They persist in believing dumb things, for which no amount of deprogramming is possible. For example, they look at mobile (battery powered) devices and note that they use ARM chips to conserve power. They make the assumption that there must be some sort of inherent power-efficiency advantage.

This isn't true. These chips consume less power by simply being slower. Fewer transistors mean less power consumption. This meant while desktops/servers used power-hungry OoO processors, mobile phones went back to the transistor budgets of yesteryear, meaning back to in-order RISC.

But as Moore's Law turned, transistors got smaller, to the point where even mobile phones got OoO chips. They use clever tricks to keep that OoO chip powered down most of the time, often including an in-order chip that runs slower on less power for minor tasks.

We've reached the point where mobile and laptops now use the same chips, your MacBook uses (essentially) the same chip as your iPhone, which is the same chip as Apple desktops.

Now Apple's M1 ARM (and hence RISC) processor is much better at power consumption than it's older Intel x86 chip, but this isn't because it's RISC. Apple did a good job at analyzing what people do on mobile devices like laptops and phones and optimized for that. For example, they added a lot of great JavaScript features, cognizant of the ton of online and semi-offline apps that are written in JavaScript. In contrast, Intel attempts to optimize a chip simultaneously for laptops, desktops, and servers, leading poorly optimizations for laptops.

Apple also does crazy things like putting a high end GPU (graphics processor) on the same chip. This has the effect of making their M1 ARM CPU crazy good for desktops for certain applications, those requiring the sorts of memory-bandwidth normally needed by GPUs.

But overall, x86 chips from AMD and Intel are still faster on desktops and servers.

In addition to the fixed-length instructions providing a tiny benefit, ARM has another key advantage, but it has nothing to do with RISC. When they upgraded their instruction-set to support 64-bit instead of just 32-bit, they went back and redesigned it from scratch. This allowed them to optimize the new instruction-set for the OoO pipeline, such as removing some dependencies that slow things down.

This was something that Intel couldn't do. When it came time to support 64-bit, AMD simply extended the existing 32-bit instructions. A long sequence of code often looks identical between the 32-bit and 64-bit versions of the x86 instruction-sets, whereas they look completely different on ARM 32-bit vs. 64-bit.

What about RISC-V and ARM-on-servers?

We've reached the point in tech where the instruction-set doesn't matter. It's not simply that code is written in high-level language. It's mostly that micro-architectural details have converged.

Take byte-order, for example. Back in the 1980s, most of the major CPUs in the world were big-endian, while Intel bucked the trend being little-endian. The reason is that some engineer made a simple optimization back when the 8008 processor was designed for terminals, and because of backwards compatibility, the poor decision continues to plague x86 today.

Except when it annoys programmers debugging memory dumps, byte-order doesn't matter. Therefore, all the RISC processors allowed a simple bit to be set to switch processors from big-endian to little-endian mode.

Over time, that has caused everyone to match Intel's little-endianess, driven primarily by Linux. The kernel itself supports either mode, but a lot of drivers depend upon byte-order, and user-mode programs developed on x86 sometimes have byte-order bugs. As it was ported to architectures like ARM or PowerPC, most of the time it was done in little-endian mode. (You can get PowerPC Linux in big-endian, but the preference is little-endian, because drivers).

The same effect happens even in things that aren't strictly CPU related, like memory and I/O. The tech stack has converged so that processors look more and more alike except for the instruction-set.

The convergence of architecture is demonstrated most powerfully by Apple's M1 transition, where they stopped using Intel's processors in their computers in favor of their custom ARM processor they created for the iPhone.

The MacBook Air M1 looks identical on the outside compared to the immediately preceding x86 MacBook. But more the the point, it performs almost identically running x86 code -- it runs x86 code at native x86 speeds but on an ARM CPU. The processors are so similar architecturally that instruction-sets could be converted on the fly -- it simply reads the x86 program, converts to ARM transparently on the fly, then runs the ARM version. Previous code translation attempts have incurred massive slowdowns to account for architectural differences, but the M1 cheated by removing any differences that weren't instruction-set related, allowing smooth translation of the instructions.

Technically, instruction-sets don't matter, but for business reasons, they still do. Intel and AMD control x86, and prevent others from building compatible processors. ARM lets others build compatible processors (indeed, making no CPUs themselves), but charges them a license fee.

Especially for processors on the low-end, people don't want to pay license fees.

For that reason, RISC V has become popular. For low-end processors (in-order microcontrollers competing against ARM Cortex Ms) in the 100,000 transistor range, it matters that an instruction-set be RISC. The only free alternative is the aging MIPS. It has annoying quirks, like "delay slots", which are fixed by RISC V. Since RISC V is an open standard, free of license fees, those designing their own low end processor have adopted it.

For example, nVidia uses RISC V extensively throughout its technology. GPUs contain tiny embedded CPUs to manage things internally. They have ARM licenses, but they don't want to pay the pennies it would cost for every unit that ARM charges. Likewise, Western Digital (a big hard-drive maker) designed a RISC V core for its drives. 

There are a lot of RISC V fans due to the RISC cult who insist it should go everywhere, but it's not going anywhere for high-end processors. At the high-end, you are going to pay licensing fees for designs anyway. In other words, while big companies have the resources to design small in-order processors, they don't have the resources to design big OoO processors, and would therefore buy designs from others.

Amazon's AWS Graviton is a good example (ARM-based servers). They aren't licensing the instruction-set from ARM so much as the complete OoO CPU design. They include the ARM cores on a chip of Amazon's design, having memory, I/O, security features tailored to AWS use cases. Neither the instruction-set architecture or micro-architecture particularly matter to Amazon compared to all the other features of their chips.

Lots of big companies are getting into the custom CPU game, licensing ARM cores. Big tech companies tend to have their own programming language, their own operating systems, their own computer designs, and nowadays their own CPUs. This includes Microsoft, Google, Apple, Facebook, and so on. The advantage of ARM processors (or in the future, possibly RISC V processors) isn't their RISC nature, or their instruction-sets, but the fact they are big processor designs that others can included with their own chips. There is no inherent power efficiency or speed benefit -- only the business benefit.

Conclusion

This blogpost is in reaction to that blogpost I link above. That writer just recycles old RISC rhetoric of the past 30 years, like claiming it's a "design philosophy". It, it was a set of tradeoffs meaningful to small in-order chips -- the best way of designing a chip with 100,000 transistors.

The term "RISC" has been obsolete for 30 years, and yet this nonsense continues. One reason is the Penisy textbook that indoctrinates the latest college students. Another reason is the political angle, people hating whoever is dominant (in this case, Intel on the desktop). People believe in RISC, people evangelize RISC. But it's just a cult, it's all junk. Any conversation that mentions RISC can be improved by removing the word "RISC".

OoO has replaced RISC as the dominant architecture for CPUs, and it did so in 1995, and ever since then, the terminology "RISC" is obsolete. The only thing you care about when looking at chips is whether it's an in-order design or an out-of-order design. Well, that's if you care about theory. If you care about practice, you care about whether it supports your legacy tooling and code. In the real world, whether you use x86 or ARM or MIPS or PowerPC is simply because of legacy market conditions. We still launch rockets to Mars using PowerPC processors because that's what the market for radiation-hardened CPUs has always used.

No comments: