Wednesday, February 14, 2024

C can be memory safe, part 2

This post from last year was posted to a forum, so I thought I'd write up some rebuttals to their comments.

The first comment is by David Chisnall, creator of CHERI C/C++, which proposes we can solve the problem with CPU instruction set extensions. It's a good idea, but after 14 years, CPUs haven't had their instruction-sets upgraded. Even mainstream RISC V processors haven't been created using those extensions.

Chisnall: "If your safety requires you to insert explicit checks, it’s not safe". This is true from one perspective, false from another. My proposal includes compilers spitting out warnings whenever bounds information doesn't exist.

C is full of problems in theory that doesn't exist in practice because the compiler spits out warnings telling programmers to fix the problem. Warnings can also note cases where programmers probably made mistakes. We can't achieve perfect guarantees, because programmers can still make mistakes, but we can certainly achieve "good enough".

Chisnall: ....tread safety..... I'm not sure I full understand the comment. I understand that CHERI can guarantee atomicity of bounds checking, which would require multiple (interruptible) instructions otherwise. The number of cases where this is a problem, and the C proposal would be no worse than other languages like Rust.

Chisnall: Temporal safety.... A lot of Rust "ownership" techniques can be applied to C with these annotations, namely, marking which variables OWN allocated memory and which simply BORROW it. I've reviewed a lot of famous use-after-free and double-free bugs, and most can be trivially fixed by annotation.

Chisnall: If you are writing a blog never having actually tried to make large (million line or more) C codebases memory safe, you probably underestimate the difficulty by at least one order of magnitude. I'm both a programmer who has written a million lines of code in my lifetime as well as a hacker with decades of experience looking for such bugs. The goal isn't to pursue the ideal of 100% safe language, but of getting rid of 99% of safety errors. 1% less safe makes the goal an order of magnitude easier to reach.

snej: This post seems to epitomize the common engineer trait of seeing any problem you haven’t personally worked on as trivial. Sure bro, you’ll add a few patches to Clang and GCC and with those new attributes our C code will be safe. It’ll only take a few weeks and then no one will need Rust anymore. But I've spent decades working on this. The comment epitomizes the common trait of not realizing how much thought and expertise is behind the post. I few patches to clang and GCC will make make C safer. The solution is far less safe than Rust. In fact, my proposal makes code more interoperable and translatable into Rust. Right now, translating C into Rust creates just a bunch of 'unsafe' code that needs to be cleaned up. With such annotations, in a refactoring step using existing testing frameworks, results in code that can no be auto-translated safely in to Rust.

As for existing clang/gcc attributes, there are only a couple that match the macros I propose. They dod show how trivial it would be to actually go further.

danso: In addition to the criticisms I share with everyone else, I found this to be one of the most “talk is cheap, show me the code” posts I’ve ever read. The reason I wrote the post is because learning clang/gcc internals is a long process, and when asking for help, I needed something to point to "this is what I'm trying to achieve". I'm not trying to communicate what other people should do, I'm communicating what I'm trying to do. I still don't know clang/gcc internals enough to even get started ... any pointers would be helpful.

Wednesday, February 01, 2023

C can be memory-safe

The idea of memory-safe languages is in the news lately. C/C++ is famous for being the world's system language (that runs most things) but also infamous for being unsafe. Many want to solve this by hard-forking the world's system code, either by changing C/C++ into something that's memory-safe, or rewriting everything in Rust.

Forking is a foolish idea. The core principle of computer-science is that we need to live with legacy, not abandon it.

And there's no need. Modern C compilers already have the ability to be memory-safe, we just need to make minor -- and compatible -- changes to turn it on. Instead of a hard-fork that abandons legacy system, this would be a soft-fork that enables memory-safety for new systems.

Consider the most recent memory-safety flaw in OpenSSL. They fixed it by first adding a memory-bounds, then putting every access to the memory behind a macro PUSHC() that checks the memory-bounds:

A better (but currently hypothetical) fix would be something like the following:

size_t maxsize CHK_SIZE(outptr) = out ? *outlen : 0;

This would link the memory-bounds maxsize with the memory outptr. The compiler can then be relied upon to do all the bounds checking to prevent buffer overflows, the rest of the code wouldn't need to be changed.

An even better (and hypothetical) fix would be to change the function declaration like the following:

int ossl_a2ulabel(const char *in, char *out, size_t *outlen CHK_INOUT_SIZE(out));

That's the intent anyway, that *outlen is the memory-bounds of out on input, and receives a shorter bounds on output.

This specific feature isn't in compilers. But gcc and clang already have other similar features. They've only been halfway implemented. This feature would be relatively easy to add. I'm currently studying the code to see how I can add it myself. I could just mostly copy what's done for the alloc_size attribute. But there's a considerable learning curve, I'd rather just persuade an existing developer of gcc or clang to add the new attributes for me.

Once you give the programmer the ability to fix memory-safety problems like the solution above, you can then enable warnings for unsafe code. The compiler knew the above code was unsafe, but since there's no practical way to fix it, it's pointless nagging the programmer about it. With this new features comes warnings about failing to use it.

In other words, it becomes compiler-guided refactoring. Forking code is hard, refactoring is easy.

As the above function shows, the OpenSSL code is already somewhat memory safe, just based upon the flawed principle of relying upon diligent programmers. We need the compiler to enforce it. With such features, the gap is relative small, mostly just changing function parameter lists and data structures to link a pointer with its memory-bounds. The refactoring effort would be small, rather than a major rewrite.

This would be a soft-fork. The memory-bounds would work only when compiled with new compilers. The macro would be ignored on older systems. 

This memory-safety is a problem. The idea of abandoning C/C++ isn't a solution. We already have the beginnings of a solution in modern gcc and clang compilers. We just need to extend that solution.

Wednesday, January 25, 2023

I'm still bitter about Slammer

Today is the 20th anniversary of the Slammer worm. I'm still angry over it, so I thought I'd write up my anger. This post will be of interest to nobody, it's just me venting my bitterness and get off my lawn!!

Back in the day, I wrote "BlackICE", an intrusion detection and prevention system that ran as both a desktop version and a network appliance. Most cybersec people from that time remember it as the desktop version, but the bulk of our sales came from the network appliance.

The network appliance competed against other IDSs at the time, such as Snort, an open-source product. For much the cybersec industry, IDS was Snort -- they had no knowledge of how intrusion-detection would work other than this product, because it was open-source.

My intrusion-detection technology was radically different. The thing that makes me angry is that I couldn't explain the differences to the community because they weren't technical enough.

When Slammer hit, Snort and Snort-like products failed. Mine succeeded extremely well. Yet, I didn't get the credit for this.

The first difference is that I used a custom poll-mode driver instead of interrupts. This the now the norm in the industry, such as with Linux NAPI drivers. The problem with interrupts is that a computer could handle less than 50,000 interrupts-per-second. If network traffic arrived faster than this, then the computer would hang, spending all it's time in the interrupt handler doing no other useful work. By turning off interrupts and instead polling for packets, this problem is prevented. The cost is that if the computer isn't heavily loaded by network traffic, then polling causes wasted CPU and electrical power. Linux NAPI drivers switch between them, interrupts when traffic is light and polling when traffic is heavy.

The consequence is that a typical machine of the time (dual Pentium IIIs) could handle 2-million packets-per-second running my software, far better than the 50,000 packets-per-second of the competitors.

When Slammer hit, it filled a 1-gbps Ethernet with 300,000 packets-per-second. As a consequence, pretty much all other IDS products fell over. Those that survived were attached to slower links -- 100-mbps was still common at the time.

An industry luminary even gave a presentation at BlackHat saying that my claimed performance (2-million packets-per-second) was impossible, because everyone knew that computers couldn't handle traffic that fast. I couldn't combat that, even by explaining with very small words "but we disable interrupts".

Now this is the norm. All network drivers are written with polling in mind. Specialized drivers like PF_RING and DPDK do even better. Networks appliances are now written using these things. Now you'd expect something like Snort to keep up and not get overloaded with interrupts. What makes me bitter is that back then, this was inexplicable magic.

I wrote an article in PoC||GTFO 0x15 that shows how my portscanner masscan uses this driver, if you want more info.

The second difference with my product was how signatures were written. Everyone else used signatures that triggered on the pattern-matching. Instead, my technology included protocol-analysis, code that parsed more than 100 protocols.

The difference is that when there is an exploit of a buffer-overflow vulnerability, pattern-matching searched for patterns unique to the exploit. In my case, we'd measure the length of the buffer, triggering when it exceeded a certain length, finding any attempt to attack the vulnerability.

The reason we could do this was through the use of state-machine parsers. Such analysis was considered heavy-weight and slow, which is why others avoided it. State-machines are faster than pattern-matching, many times faster. Better and faster.

Such parsers are now more common. Modern web-servers (nginx, ISS, LightHTTPD, etc.) use them to parse HTTP requests. You can tell if a server does this by sending 1-gigabyte of spaces between "GET" and "/". Apache gives up after 64k of input. State-machines keep going, because while in that state ("between-method-and-uri"), they'll accept any number of spaces -- the only limit is a timeout. Go read the nginx source-code to understand how this works.

I wrote a paper in PoC||GTFO 0x21 that shows the technique to implement the common wc (word-count) program. A simplified version of this wc2o.c. Go read the code -- it's crazy.

The upshot is that when Slammer hit, most IDSs didn't have a signature for it. If they didn't just fall over, what they triggered on were things like "UDP flood", not "SQL buffer overflow". This lead many to believe what was happening was DDoS attack. My product correctly identified the vulnerability being exploited.

The third difference with my product was the event coalescer. Instead of a timestamp, my events had a start-time, end-time, and count of the number of times the event triggered.

Other event systems sometimes have this, with such events as "last event repeated 39003 times", to prevent the system from clogging up with events.

My system was more complex. For one things, an attacker may deliberately intermix events, so it can't simply be 1 event that gets coalesced this way. For another thing, the attacker could sweep targets or spoof sources. Thus, coalescing needed to aggregate events over address ranges as well as time.

Slammer easily filled a gigabit link with 300,000 packets-per-second. Every packet triggered the signature, thus creating 300,000 events-per-second. No system could handle that. To keep up with the load, events had to be reduced somehow.

My event coalescing logic worked. It reduced the load of events from 300,000 down to roughly 500 events-per-second. This was still a little bit higher load than the system could handle, forwarding to the remote management system. Customers reported that at their consoles, they saw the IDS slowly fall behind, spooling events at the sensor and struggling to ship them up to the management system.

The problem is so accurate that it's a big flaw in IDS still to this day. Snort often has signatures that throw away the excess data, but it's still easy to flood them with packets that overload their event logging.

What was exciting for me is that I'd designed all this in theory, tested using artificial cases, unsure how it would stand up to the real world. Watching it stand up to the real world was exciting: big customers saw it successfully work in practice, with the only complaint that at the centralized console, it fell behind a little.

The point is that I made three radical design choices, unprecedented at the time though more normal now, and they worked. And yet, the industry wasn't technical enough to recognize that it worked.

For example, a few months later I had a meeting at the Pentagon where a Gartner analyst gave a presentation claiming that only hardware-based IDS would work, because software-based IDS couldn't keep up. Well, these were my customer. I didn't refute Gartner so much as my customer did, with their techies standing up and pointing out that when Slammer hit, my "software" product did keep up. Gartner doesn't test products themselves. They rightly identified the problem with other software using interrupts, but couldn't conceive there was a third alternative, "poll mode" drivers.

I apologize to you, the reader, for subjecting you to this vain bitching, but I just want to get this off my chest.

Sunday, October 23, 2022

The RISC Deprogrammer

I should write up a larger technical document on this, but in the meanwhile is this short (-ish) blogpost. Everything you know about RISC is wrong. It's some weird nerd cult. Techies frequently mention RISC in conversation, with other techies nodding their head in agreement, but it's all wrong. Somehow everyone has been mind controlled to believe in wrong concepts.

An example is this recent blogpost which starts out saying that "RISC is a set of design principles". No, it wasn't. Let's start from this sort of viewpoint to discuss this odd cult.

What is RISC?

Because of the march of Moore's Law, every year, more and more parts of a computer could be included onto a single chip. When chip densities reached the point where we could almost fit an entire computer on a chip, designers made tradeoffs, discarding unimportant stuff to make the fit happen. They made tradeoffs, deciding what needed to be included, what needed to change, and what needed to be discarded.

RISC is a set of creative tradeoffs, meaningful at the time (early 1980s), but which were meaningless by the late 1990s.

The interesting parts of CPU evolution are the three decades from 1964 with IBM's System/360 mainframe and 2007 with Apple's iPhone. The issue was a 32-bit core with memory-protection allowing isolation among different programs with virtual memory. These were real computers, from the modern perspective: real computers have at least 32-bit and an MMU (memory management unit).

The year 1975 saw the release of Intel 8080 and MOS 6502, but these were 8-bit systems without memory protection. This was at the point of Moore's Law where we could get a useful CPU onto a single chip.

In the year 1977 we saw DEC release it's VAX minicomputer, having a 32-bit CPU w/ MMU. Real computing had moved from insanely expensive mainframes filling entire rooms to less expensive devices that merely filled a rack. But the VAX was way too big to fit onto a chip at this time.

The real interesting evolution of real computing happened in 1980 with Motorola's 68000 (aka. 68k) processor, essentially the first microprocessor that supported real computing.

But this comes with caveats. Making microprocessor required creative work to decide what wasn't included. In the case of the 68k, it had only a 16-bit ALU. This meant adding two 32-bit registers required passing them twice through the ALU, adding each half separately. Because of this, many call the 68k a 16-bit rather than 32-bit microprocessor.

More importantly, only the lower 24-bits of the registers were valid for memory addresses. Since it's memory addressing that makes a real computer "real", this is the more important measure. But 24-bits allows for 16-megabytes of memory, which is all that anybody could afford to include in a computer anyway. It was more than enough to run a real operating system like Unix. In contrast, 16-bit processors could only address 64-kilobytes of memory, and weren't really practical for real computing.

The 68k didn't come with a MMU, but it allowed an extra MMU chip. Thus, the early 1980s saw an explosion of workstations and servers consisting of a 68k and an MMU. The most famous was Sun Microsystems launched in 1982, with their own custom designed MMU chip.

Sun and its competitors transformed the industry running Unix. Many point to IBM's PC from 1982 as the transformative moment in computer history, but these were non-real 16-bit systems that struggled with more than 64k of memory. IBM PC computers wouldn't become real until 1993 with Microsoft's Windows NT, supporting full 32-bits, memory-protection, and pre-emptive multitasking.

But except for Windows itself, the rest of computing is dominated by the Unix heritage. The phone in your hand, whether Android or iPhone, is a Unix computer that inherits almost nothing from the IBM PC.

These 32-bit Unix systems from the early 1980s still lagged behind DEC's VAX in performance. The VAX was considered a mini-supercomputer. The Unix workstations were mere toys in comparison. Too many tradeoffs were made in order to fit everything onto a single chip, too many sacrifices made.

Some people asked "What if we make different tradeoffs?"

Most people thought the VAX was the way of the future, and were all chasing that design. The 68k CPU was essentially a cut down VAX design. But history had anti-VAX designs that worked very differently, notably the CDC 6600 supercomputer from the 1960s and the IBM 801/ROMP processor from the 1970s.

It's not simply one tradeoff, but a bunch of inter-related tradeoffs. They snowball -- each choice you make changes the costs-vs-benefit analysis of other choices, changing them as well.

This is why people can't agree upon a single definition of RISC. It's not one tradeoff made in isolation, but a long list of tradeoffs, each part of a larger scheme.

In 1987, Motorola shipped its 68030 version of the 68k processor, chasing the VAX ideal. By then, we had ARM, SPARC, and MIPS processors that significantly outperformed it. Given a budget of roughly 100,000 transistors allowed by Moore's Law of the time, the RISC tradeoffs were better than VAX-like tradeoffs.

So really, what is RISC?

Let's define things in terms of 1986, comparing the [ARM, SPARC, MIPS] processors called "RISC" to the [68030, 80386] processors that weren't "RISC". They all supported full 32-bit processing, memory-management, and preemptive multitasking operating systems like Unix.

The major ways RISC differed were:

  • fixed-length instructions (32-bits or 4-bytes each)
  • simple instruction decoding
  • horizontal vs. vertical microcode
  • deep pipelines of around 5 stages
  • load/store aka reg-reg
  • simple address modes
  • compilers optimized code
  • more registers

If you are looking for the one thing that defines RISC, it's the thing that nobody talks about: horizontal microcode.

The VAX/68k/x86 architecture decoded external instructions into internal control ops that were pretty complicated, supporting such things as loops. Each external instruction executed an internal microprogram with a variable number of such operations.

The classic RISC worked differently. Each external instruction decoded into exactly 4 internal ops. Moreover, each op had a fixed purpose:

  1. read from two registers into the ALU (arithmetic-logic unit)
  2. execute a math operation in the ALU
  3. access memory (well, the L1 cache)
  4. write results back into one register

(This explanation has been fudged and simplified, btw).

This internal detail was expressed externally in the instruction set, simplifying decoding. The external instructions specified two registers to read, an ALU opcode, and one register to write. All of this was fit into a constant 32-bits. In contrast, the [68k/x86/VAX] model meant a complex decoding of instructions with a large ROM containing microprograms.

Roughly half (50%) of the 68000's transistors contained this complex decoding logic and ROM. In contrast, for RISC processors, it was closer to 1%. All those transistors could be dedicated to other things. See how tradeoffs snowball? Saving so many transistors involved in instruction decoding meant being able to support other features elsewhere. It's not clear this is a benefit, however. This meant that RISC needed multiple instructions to do the same thing as a single [68k/x86/VAX] instruction.

This meant instructions could be deeply pipelined. Instructions could be overlapped. When reading registers for the current instruction, we can simultaneously be fetching the next, and performing the ALU calculation on the previous instruction. The classic RISC pipeline had 5 stages (the 4 mentioned above plus 1 for fetching the next instruction). Each clock cycle would execute part of 5 instructions simultaneous, each at a different stage in the pipeline.

This was called scalar operation, In previous processor, it would take a variable number of clock cycles for an instruction to complete. In RISC, every instruction had 5 clock cycle latency from beginning to end. And since execution was overlapped/pipelined, executing 5 instructions at a time, the throughput was one instruction per clock cycle.

All CPUs are pipelined to some extent, but they need complex interlocks to prevent things from colliding with each other, such as two pipelined instructions trying to read registers at the same time. RISC removed most of those interlocks, by strictly regulation what an instruction could do in each stage of the pipeline. Removing these interlocks reduced transistor count and sped things up. This could be one possible definition of RISC that you never hear of: it got rid of all these interlocks found in other processors.

Some pipeline conflicts were worse. Because pipelining, the results of an instruction won't be available until many clock cycles later. What if one instruction writes its results to register #5 (r5), and the very next instruction attempts to read from register #5 (r5)? It's too soon, it has to wait more clock cycles for the result.

The answer: don't do that. Assembly language programmers need to know this complication, and are told to simply not write code that does this, because then the program won't work.

This was anathema of the time. Throughout history to this point, each new CPU architecture also had a new operating-system written in assembly language, with many applications written in assembly language. Thus, a programmer-friendly assembly language was considered one of the biggest requirements for any new system. Requiring programmers to know such quirks lead to buggy code was simply unacceptable. Everybody knew that programmer-hostile instruction-sets would never work in the market, even if they performed faster and cheaper.

But technology is littered with what everybody knowing being wrong. In this case, by 1980 we had the  C programming language that was essentially a "portable assembly language" and the Unix operating system written in C. The only people who needed to know about a quirky assembly language were the compiler writers. They would take care of all such problems.

That's why the history lesson above talks about Unix and real computing. Without Unix and C, RISC wouldn't have happened. An operating-system written in a high-level language was a prerequisite for RISC. It's as import an innovation as Moore's Law allowing 100,000 transistors to fit on a chip.

Because of the lack of complex decoding logic, the transistor budget was freed up to support such things as more registers. The Intel x86 architecture famously had 8 registers, while the RISC competitors typically had as many as 32. The limitation was decode space. It takes 5 bits to specify one of 32 possibilities. Given that most every instructions specified two registers to read from and one register to write to, that's 15 bits, or half of the instruction space, leaving 17 bits for other purposes.

The creators of the RISC, Hennessy and Patterson, wrote a textbook called a "Computer Architecture: A Quantitative Approach". It's horrible. It imagines a world where people need to be taught tradeoffs and transistor budgets. But there is no other approach than a quantitative one, it's like an economics textbook "Economics: A Supply And Demand Approach". While the textbook has a weird obsession with quantitative theory, it misses non quantitative tradeoffs, like the fact that RISC couldn't happen without C and Unix. 

Among the snowballing tradeoffs is the load/store architecture, while at the same time, having fewer addressing modes. It's here that we need to go back and discuss history -- what the heck is an "addressing mode"????

In the beginning, computers had only a single general purpose register, called the accumulator. All calculations, like adding two numbers together, involved reading the second value from memory and combining with the first value already in the accumulator. All calculations, whether arithmetic (add, subtract) or logical (AND, OR, XOR) involved one value already in the register, and another value from memory.

Addresses have to be calculated. For example, when accessing elements in a table, we have to take the row number, multiply it by the size of the table, add an offset into the row for desired column, then add all that to the address at the start of the table. Then after calculating this address, we often want to increment the index to fetch the next row.

If the table base address and row index are held in registers, we might get a complex instructions like the following. This calculates and address using two registers r10 and r11, fetches that value from memory, then adds it into register r9.

 ADD r9, [r10 + r11*8 + 4]

Such calculations embedded in the instruction-set were necessary for such early computers. While they had only a single general purpose register (the accumulator), they still had multiple special purpose registers used this way for address calculations. 

For complex computers like the VAX, such address modes imbedded in instructions were no longer necessary, but still desirable. Half the work of the computer is in calculating memory addresses. It's very tedious for programmers to do it manually, easier when the instruction-set takes care of common memory access patterns (like accessing cells within a table).

This leads us to the load/store issue.

With many registers, we no longer need to read another value from memory (a reg-mem calculation). We can instead perform the calculation using two registers (reg-reg). The VAX had such reg-reg instructions, but programmers still mostly used the reg-mem instructions with the complex address calculations.

RISC changed this. Calculations were now exclusively reg-reg, where math operations like addition could only operate on registers. To add something from memory, you needed first to load it from memory into a register, using a separate, explicit instruction. Likewise, writing back to memory required an explicit store operation.

This architecture can be called either reg-reg or load/store, with the second name being more popular.

With RISC, addressing modes were still desirable, but now they applied to only the two load and store instructions.

The available addressing modes were constrained by the limited RISC pipeline and limited 32-bit fixed-length instructions. Since the pipeline allowed for the reading of two registers at the start, adding two registers together to form the address was allowed. The example shown above was too complex, though.

What you are supposed to be reading from all of this is that all of these tradeoffs are linked. Each decision that diverges from the ideal VAX-like architecture snowballed into other decisions that drifted further and further from this ideal, until what we had was something that looked nothing like a VAX.

The upshot of these decisions was being able to reduce a 32-bit MMU CPU into roughly a single chip because it needed fewer transistors, while at the same time performing much faster. It required maybe twice as many instructions to perform the same tasks (mostly due to needing more complex address calculations due to lack of addressing modes), but performed them at maybe 5 times faster, for a significant speed up.

At the time, the VAX was the standard benchmark target. When Sun shipped it's first SPARC RISC systems (the Sun-4), they benchmarked about twice as fast as the latest VAX systems, while being considerably cheaper.

The end of RISC

By the late 1980s, everybody knew that RISC was the future. Sure, Intel continued with its x86 and Motorola with it's 68000, but that's because the market wanted backwards compatibility with legacy instruction-sets. Both attempted to build their own RISC alternatives, but failed. When backwards compatibility wasn't required, everybody created RISC processors, because for 32-bit MMU real computing, they were  better. And everybody knew it.

But of course, everybody was eventually wrong. Even as early as the 80486 in 1989, Intel was converting the innards of the processor into something that looked more RISC-like.

The nail in the coffin came in 1995 with Intel's Pentium Pro processor that supported out-of-order (or OoO) processing. Again, it wasn't really a new innovation. Out-of-order instructions first appeared on 1960s era supercomputers from CDC and IBM. This was the first time that transistor budgets allowed it to be practically used on single-chip microprocessors.

Transistor budgets were so high that designers no longer had to make basic painful tradeoffs. The decisions necessary trying to cram everything into 100,000 transistors were longer meaningful when you had more than 1-million transistors to work with. Instruction-set decoding requiring 20k transistors is important with small budgets, but meaningless with large budgets.

With OoO, the microarchitecture inside the chip looks roughly the same, regardless if it's an Intel x86, ARM, SPARC, or whatever.

This was proven in benchmarks. When Intel released its out-of-order Pentium Pro in 1995, it beat all the competing in-order RISC processors on the market.

Everybody was wrong -- RISC wasn't the future, the future was OoO.

One way of describing the Pentium Pro is that it "translates x86 into RISC-like micro-ops". What that really means is that instead of vertical microcode, it translated things into horizontal, pipelined micro-ops. Most of the typical math operations were split into two micro-ops, one a load/store operation, and the other a reg-reg operation. (Some x86 instructions need even more micro-ops: address calculation, then load/store, then ALU op).

Intel still has an "x86 tax" decoding complex instructions. But in terms of pipeline stages, that tax only applies to the first stage. Typical OoO processors have at least 10 more stages after that. Even RISC instruction-set processors like ARM must translate external instructions into internal micro-ops.

The only significant difference left is the fact that Intel's instructions are variable length. The fixed length instructions of RISC means that multiple can be fetched at once, and decoded all in parallel. This is impossible with Intel x86, they must at least partially be decoded serially, one before the next. You don't know where the next instruction starts until you've figured out the length of the current instruction.

Intel and AMD find crafty ways to get around this. For example, AMD has often put hints in its instruction cache (L1I) to so that decoders can know the length of instructions. Intel has played around with "loop caches" (so-called because they are most useful for loops) that track instructions after they've been decoded, so they don't need to be decoded again.

The upshot is that for most code, there's no inherent difference between x86 and RISC, they have essentially the same internal architecture for out-of-order (OoO) processors. No instruction-set has an inherent advantage over the other.

And it's been that way since 1995.

I mention this because bizarrely this cult has persisted for the last 30 years after OoO replaced RISC for high-end real computers. It ceased being a useful technical distinction, so what are techies still discussing it?

They persist in believing dumb things, for which no amount of deprogramming is possible. For example, they look at mobile (battery powered) devices and note that they use ARM chips to conserve power. They make the assumption that there must be some sort of inherent power-efficiency advantage.

This isn't true. These chips consume less power by simply being slower. Fewer transistors mean less power consumption. This meant while desktops/servers used power-hungry OoO processors, mobile phones went back to the transistor budgets of yesteryear, meaning back to in-order RISC.

But as Moore's Law turned, transistors got smaller, to the point where even mobile phones got OoO chips. They use clever tricks to keep that OoO chip powered down most of the time, often including an in-order chip that runs slower on less power for minor tasks.

We've reached the point where mobile and laptops now use the same chips, your MacBook uses (essentially) the same chip as your iPhone, which is the same chip as Apple desktops.

Now Apple's M1 ARM (and hence RISC) processor is much better at power consumption than it's older Intel x86 chip, but this isn't because it's RISC. Apple did a good job at analyzing what people do on mobile devices like laptops and phones and optimized for that. For example, they added a lot of great JavaScript features, cognizant of the ton of online and semi-offline apps that are written in JavaScript. In contrast, Intel attempts to optimize a chip simultaneously for laptops, desktops, and servers, leading poorly optimizations for laptops.

Apple also does crazy things like putting a high end GPU (graphics processor) on the same chip. This has the effect of making their M1 ARM CPU crazy good for desktops for certain applications, those requiring the sorts of memory-bandwidth normally needed by GPUs.

But overall, x86 chips from AMD and Intel are still faster on desktops and servers.

In addition to the fixed-length instructions providing a tiny benefit, ARM has another key advantage, but it has nothing to do with RISC. When they upgraded their instruction-set to support 64-bit instead of just 32-bit, they went back and redesigned it from scratch. This allowed them to optimize the new instruction-set for the OoO pipeline, such as removing some dependencies that slow things down.

This was something that Intel couldn't do. When it came time to support 64-bit, AMD simply extended the existing 32-bit instructions. A long sequence of code often looks identical between the 32-bit and 64-bit versions of the x86 instruction-sets, whereas they look completely different on ARM 32-bit vs. 64-bit.

What about RISC-V and ARM-on-servers?

We've reached the point in tech where the instruction-set doesn't matter. It's not simply that code is written in high-level language. It's mostly that micro-architectural details have converged.

Take byte-order, for example. Back in the 1980s, most of the major CPUs in the world were big-endian, while Intel bucked the trend being little-endian. The reason is that some engineer made a simple optimization back when the 8008 processor was designed for terminals, and because of backwards compatibility, the poor decision continues to plague x86 today.

Except when it annoys programmers debugging memory dumps, byte-order doesn't matter. Therefore, all the RISC processors allowed a simple bit to be set to switch processors from big-endian to little-endian mode.

Over time, that has caused everyone to match Intel's little-endianess, driven primarily by Linux. The kernel itself supports either mode, but a lot of drivers depend upon byte-order, and user-mode programs developed on x86 sometimes have byte-order bugs. As it was ported to architectures like ARM or PowerPC, most of the time it was done in little-endian mode. (You can get PowerPC Linux in big-endian, but the preference is little-endian, because drivers).

The same effect happens even in things that aren't strictly CPU related, like memory and I/O. The tech stack has converged so that processors look more and more alike except for the instruction-set.

The convergence of architecture is demonstrated most powerfully by Apple's M1 transition, where they stopped using Intel's processors in their computers in favor of their custom ARM processor they created for the iPhone.

The MacBook Air M1 looks identical on the outside compared to the immediately preceding x86 MacBook. But more the the point, it performs almost identically running x86 code -- it runs x86 code at native x86 speeds but on an ARM CPU. The processors are so similar architecturally that instruction-sets could be converted on the fly -- it simply reads the x86 program, converts to ARM transparently on the fly, then runs the ARM version. Previous code translation attempts have incurred massive slowdowns to account for architectural differences, but the M1 cheated by removing any differences that weren't instruction-set related, allowing smooth translation of the instructions.

Technically, instruction-sets don't matter, but for business reasons, they still do. Intel and AMD control x86, and prevent others from building compatible processors. ARM lets others build compatible processors (indeed, making no CPUs themselves), but charges them a license fee.

Especially for processors on the low-end, people don't want to pay license fees.

For that reason, RISC V has become popular. For low-end processors (in-order microcontrollers competing against ARM Cortex Ms) in the 100,000 transistor range, it matters that an instruction-set be RISC. The only free alternative is the aging MIPS. It has annoying quirks, like "delay slots", which are fixed by RISC V. Since RISC V is an open standard, free of license fees, those designing their own low end processor have adopted it.

For example, nVidia uses RISC V extensively throughout its technology. GPUs contain tiny embedded CPUs to manage things internally. They have ARM licenses, but they don't want to pay the pennies it would cost for every unit that ARM charges. Likewise, Western Digital (a big hard-drive maker) designed a RISC V core for its drives. 

There are a lot of RISC V fans due to the RISC cult who insist it should go everywhere, but it's not going anywhere for high-end processors. At the high-end, you are going to pay licensing fees for designs anyway. In other words, while big companies have the resources to design small in-order processors, they don't have the resources to design big OoO processors, and would therefore buy designs from others.

Amazon's AWS Graviton is a good example (ARM-based servers). They aren't licensing the instruction-set from ARM so much as the complete OoO CPU design. They include the ARM cores on a chip of Amazon's design, having memory, I/O, security features tailored to AWS use cases. Neither the instruction-set architecture or micro-architecture particularly matter to Amazon compared to all the other features of their chips.

Lots of big companies are getting into the custom CPU game, licensing ARM cores. Big tech companies tend to have their own programming language, their own operating systems, their own computer designs, and nowadays their own CPUs. This includes Microsoft, Google, Apple, Facebook, and so on. The advantage of ARM processors (or in the future, possibly RISC V processors) isn't their RISC nature, or their instruction-sets, but the fact they are big processor designs that others can included with their own chips. There is no inherent power efficiency or speed benefit -- only the business benefit.


This blogpost is in reaction to that blogpost I link above. That writer just recycles old RISC rhetoric of the past 30 years, like claiming it's a "design philosophy". It, it was a set of tradeoffs meaningful to small in-order chips -- the best way of designing a chip with 100,000 transistors.

The term "RISC" has been obsolete for 30 years, and yet this nonsense continues. One reason is the Penisy textbook that indoctrinates the latest college students. Another reason is the political angle, people hating whoever is dominant (in this case, Intel on the desktop). People believe in RISC, people evangelize RISC. But it's just a cult, it's all junk. Any conversation that mentions RISC can be improved by removing the word "RISC".

OoO has replaced RISC as the dominant architecture for CPUs, and it did so in 1995, and ever since then, the terminology "RISC" is obsolete. The only thing you care about when looking at chips is whether it's an in-order design or an out-of-order design. Well, that's if you care about theory. If you care about practice, you care about whether it supports your legacy tooling and code. In the real world, whether you use x86 or ARM or MIPS or PowerPC is simply because of legacy market conditions. We still launch rockets to Mars using PowerPC processors because that's what the market for radiation-hardened CPUs has always used.

Sunday, July 03, 2022

DS620slim tiny home server

In this blogpost, I describe the Synology DS620slim. Mostly these are notes for myself, so when I need to replace something in the future, I can remember how I built the system. It's a "NAS" (network attached storage) server that has six hot-swappable bays for 2.5 inch laptop drives.

That's right, laptop 2.5 inch drives. It makes this a tiny server that you can hold in your hand.

The purpose of a NAS is reliable storage. All disk drives eventually fail. If you stick a USB external drive on your desktop for backups, it'll eventually crash, losing any data on it. A failure is unlikely tomorrow, but a spinning disk will almost certainly fail some time in the next 10 years. If you want to keep things, like photos, for the rest of your life, you need to do something different.

The solution is RAID, an array of redundant disks such that when one fails (or even two), you don't lose any data. You simply buy a new disk to replace the failed one and keep going. With occasional replacements (as failures happen) it can last decades. My older NAS is 10 years old and I've replaced all the disks, one slot replaced twice.

Monday, January 31, 2022

No, a researcher didn't find Olympics app spying on you

For the Beijing 2022 Winter Olympics, the Chinese government requires everyone to download an app onto their phone. It has many security/privacy concerns, as CitizenLab documents. However, another researcher goes further, claiming his analysis proves the app is recording all audio all the time. His analysis is fraudulent. He shows a lot of technical content that looks plausible, but nowhere does he show anything that substantiates his claims.

Tuesday, December 07, 2021

Journalists: stop selling NFTs that you don't understand

The reason you don't really understand NFTs is because the journalists describing them to you don't understand them, either. We can see that when they attempt to sell an NFT as part of their stories (e.g. AP and NYTimes). They get important details wrong.

The latest is magazine selling an NFT. As libertarians, you'd think at least they'd get the technical details right. But they didn't. Instead of selling an NFT of the artwork, it's just an NFT of a URL. The URL points to OpenSea, which is known to remove artwork from its site (such as in response to DMCA takedown requests).

If you buy that NFT, what you'll actually get is a token pointing to:

This is just the metadata, which in turn contains a link to the claimed artwork:

If either OpenSea or Google removes the linked content, then any connection between the NFT and the artwork disappears.

It doesn't have to be this way. The correct way to do NFT artwork is to point to a "hash" instead which uniquely identifies the work regardless of where it's located. That $69 million Beeple piece was done this correct way. It's completely decentralized. If the entire Internet disappeared except for the Ethereum blockchain, that Beeple NFT would still work.

This is an analogy for the entire blockchain, cryptocurrency, and Dapp ecosystem: the hype you hear ignores technical details. They promise an entirely decentralized economy controlled by math and code, rather than any human entities. In practice, almost everything cheats, being tied to humans controlling things. In this case, the " NFT artwork" is under control of OpenSea and not the "owner" of the token.

Journalists have a problem. NFTs selling for millions of dollars are newsworthy, and it's the journalists place to report news rather than making judgements, like whether or not it's a scam. But at the same time, journalists are trying to explain things they don't understand. Instead of standing outside the story, simply quoting sources, they insert themselves into the story, becoming advocates rather than reporters. They can no longer be trusted as an objective observers.

From a fraud perspective, it may not matter that the NFT points to a URL instead of the promised artwork. The entire point of the blockchain is caveat emptor in action. Rules are supposed to be governed by code rather than companies, government, or the courts. There is no undoing of a transaction even if courts were to order it, because it's math.

But from a journalistic point of view,  this is important. They failed at an honest description of what actually the NFT contains. They've involved themselves in the story, creating a conflict of interest. It's now hard for them to point out NFT scams when they themselves have participated in something that, from a certain point of view, could be viewed as a scam.