Saturday, August 02, 2014

C10M: The coming DDR4 revolution

Computer memory has been based on the same DRAM technology since the 1970s. Recent developments have been successive versions of the DDR technology: DDR, DDR2, DDR3, and now DDR4. The capacity and transfer speed have been doubling every couple of years according to Moore's Law, but the latency has been stuck at ~70 nanoseconds for decades. The recent DDR4 standard won't fix this latency, but will give us a lot more tools to mitigate its effects.


Latency is bad. If a thread needs data from main memory, it must stop and wait for around 1000 instructions before the data is returned from memory. CPU caches mitigate most of this latency by keeping a copy of frequently used data in local, high-speed memory. This allows the processor to continue at full speed without having to wait.

The problem with Internet scale is that it can't be cached. If you have 10 million concurrent connections, each requiring 10-kilobytes of data, you'll need 100-gigabytes of memory. However, processors have only 20-megabytes of cache -- 5 thousand times too small to cache everything. That means whenever a packet arrives, the memory associated with that packet will almost certainly not be in cache. The CPU will have to stop and wait while the data is retrieved from memory.
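The arithmetic here is easy to redo for your own workload. The following is just a sketch; the function name and parameters are made up for illustration, and it uses decimal units (1 kilobyte = 1000 bytes) to match the round numbers above.

```c
#include <stdint.h>

/* Hypothetical helper: how many times larger the total connection
 * state is than the CPU cache. Integer division, so the result is
 * rounded down. Returns 0 if cache_bytes is 0. */
uint64_t cache_shortfall(uint64_t connections,
                         uint64_t bytes_per_conn,
                         uint64_t cache_bytes)
{
    if (cache_bytes == 0)
        return 0;
    return (connections * bytes_per_conn) / cache_bytes;
}
```

Calling it with 10 million connections, 10,000 bytes per connection, and a 20,000,000-byte cache gives the ratio for the example above.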

There are some ways around this. Specialty network processors solve this by having 8 threads per CPU core (whereas Intel has only 2 or even 1 thread per core). At any point in time, 7 threads can be blocked waiting for data to arrive from main memory, while the 8th thread continues at full speed with data from the cache.

On Intel processors, we have only 2 threads per core. Instead, our primary strategy for solving this problem is prefetching: telling the CPU ahead of time to read into the cache the memory that we'll need in the future.

For these strategies to work, however, the CPU needs to be able to read memory in parallel. To understand this, we need to look into details about how DRAM works.

As you know, DRAM consists of a bunch of capacitors arranged in large arrays. To read memory, you first select a row, and then read each bit a column at a time. The problem is that it takes a long time to open the row before a read can take place. Also, before reading another row, the current row must be closed, which takes even more time. Most of memory latency is the time that it takes to close the current row and open the next row we want to read.

In order to allow parallel memory access, a chip will split the memory arrays into multiple banks, currently 4 banks. This now allows memory requests in parallel. The CPU issues a command to memory to open a row on bank #1. While it's waiting for the results, it can also issue a command to open a different row on bank #3.

Thus, with 4 banks, and random memory accesses, we can often have 4 memory requests happening in parallel at any point in time. The actual reads must happen sequentially, but most of the time, we'll be reading from one bank while waiting for another bank to open a row.

There is another way to increase parallel access, using multiple sets or ranks of chips. You'll often see that in DIMMs, where sometimes only one side is populated with chips (one rank), but other times both sides are populated (two ranks). In high density server memory, they'll double the size of the DIMMs, putting two ranks on each side.

There is yet another way to increase parallel access, using multiple channels. These are completely separate subsystems: not only can there be multiple commands outstanding to open a row on a given bank/rank, they can be streaming data from the chips simultaneously too. Thus, adding channels adds to both the maximum throughput and the number of outstanding transactions.

A typical low-end system will have two channels, two ranks, and four banks giving a total of eight requests outstanding at any point in time.

Given a single thread, that means a C10M program with a custom TCP/IP stack can do creative things with prefetch. It can pull eight packets at a time from the incoming queue, hash them all, then do a prefetch on each one's TCP connection data. It can then process each packet as normal, being assured that all the data is now going to be in the cache instead of waiting on memory.
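To make the batching idea concrete, here is a minimal sketch in C. It assumes GCC or Clang, whose `__builtin_prefetch` builtin issues a non-blocking prefetch hint. The packet layout, the connection table, the toy hash, and every name below are hypothetical, invented for illustration rather than taken from any real TCP/IP stack, and the sketch ignores hash collisions entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 8      /* assumed: eight packets per batch, as described above */
#define TABLE_SIZE 1024   /* hypothetical table size; must be a power of two */

/* Hypothetical packet header and per-connection state. */
struct packet {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t payload_len;
};

struct tcp_conn {
    uint64_t bytes_seen;
    uint32_t packets_seen;
};

static struct tcp_conn conn_table[TABLE_SIZE];

/* Toy hash of the 4-tuple; a real stack would use something stronger. */
static uint32_t hash_packet(const struct packet *p)
{
    uint32_t h = p->src_ip * 2654435761u;
    h ^= p->dst_ip * 2246822519u;
    h ^= ((uint32_t)p->src_port << 16) | p->dst_port;
    h ^= h >> 15;
    return h & (TABLE_SIZE - 1);
}

/* Handle up to BATCH_SIZE packets; returns how many were handled. */
size_t process_batch(const struct packet *pkts, size_t n)
{
    struct tcp_conn *conns[BATCH_SIZE];
    size_t i;

    if (n > BATCH_SIZE)
        n = BATCH_SIZE;

    /* Pass 1: hash every packet and issue a prefetch for its connection
     * state. The prefetches do not block, so the memory requests can
     * overlap instead of being served one after another. */
    for (i = 0; i < n; i++) {
        conns[i] = &conn_table[hash_packet(&pkts[i])];
        __builtin_prefetch(conns[i], 1, 3);  /* 1 = will write, 3 = keep in cache */
    }

    /* Pass 2: process each packet as normal; the connection state
     * should be in cache, or at least on its way. */
    for (i = 0; i < n; i++) {
        conns[i]->bytes_seen += pkts[i].payload_len;
        conns[i]->packets_seen++;
    }
    return n;
}
```

The important design choice is splitting the work into two passes. If the hash, prefetch, and processing for one packet all happened in a single loop body, each iteration would stall on its own cache miss and the prefetch would buy nothing.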

The problem here is that low-end desktop processors have four cores with two threads each, or eight threads total. Since the memory only allows eight concurrent transactions, we have a budget of only a single outstanding transaction per thread. Prefetching will still help a little here, because parallel access only works when the requests are on different channels/ranks/banks. The more outstanding requests there are, the more the CPU has to choose from to work on in parallel.


Now, here's where DDR4 comes into play: it dramatically increases the number of outstanding requests. It increases the number of banks from the standard 4 to 16. It also increases ranks from 4 to 8. By itself, this is an 8-fold increase in outstanding commands.

But it goes even further. A hot new technology is stacked chips. You see that in devices like the Raspberry Pi, where the 512-megabyte LPDDR2 DRAM chip is stacked right on top of the ARM CPU, looking to the outside world like a single chip.

For DDR4, designers plan on up to eight stacked DRAM chips. They've added chip select bits to select which chip in the stack is being accessed. Thus, this gives us a 256-fold theoretical increase in the number of outstanding transactions.

Intel has announced their Haswell-E processors with 8 hyperthreaded cores (16 threads total). This chip has 4 channels of DDR4 memory. Even a low-end configuration with only 32-gigs of RAM will still give you 16 banks times 2 ranks times 4 channels, or 128 outstanding transactions for 16 threads, or 8 outstanding transactions per thread.
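The budget calculation is simple enough to write down. This is only a sketch of the multiplication used above; the struct and function names are hypothetical, and it treats every bank on every rank and channel as able to hold one outstanding request, which is the same simplification this post makes.

```c
/* Hypothetical description of a memory configuration. */
struct mem_config {
    unsigned channels;
    unsigned ranks_per_channel;
    unsigned banks_per_rank;
};

/* Upper bound on the number of memory requests in flight at once. */
unsigned outstanding_transactions(struct mem_config m)
{
    return m.channels * m.ranks_per_channel * m.banks_per_rank;
}

/* Share of that budget per hardware thread (rounded down).
 * Returns 0 if hw_threads is 0. */
unsigned transactions_per_thread(struct mem_config m, unsigned hw_threads)
{
    if (hw_threads == 0)
        return 0;
    return outstanding_transactions(m) / hw_threads;
}
```

For the Haswell-E example, a configuration of 4 channels, 2 ranks, and 16 banks yields 128 outstanding transactions, and dividing by 16 threads yields 8 per thread.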

But that's only with unstacked, normal memory. Vendors are talking about stacked packages that will increase this even further -- though it may take a couple years for these to come down in price.

This means that whereas in the past, prefetch has made little difference to code that was already limited by the number of outstanding memory transactions, it can make a big difference in future code with DDR4 memory.

Conclusion

This post is about getting Internet scale out of desktop hardware. An important limitation for current systems is the number of outstanding memory transactions possible with existing DRAM technology. New DDR4 memory will dramatically increase the number of outstanding transactions. This means that techniques like prefetch, which had limited utility in the past, may become much more useful in the future.
