Friday, May 25, 2012

DNS vs. large memory pages (technical)

As everyone knows, an important threat against the Internet is that of a coordinated DDoS attack against the root and TLD DNS servers. The way I'd solve it is with a simple inline device that blocks simple attacks from ever reaching the DNS server, but that can also answer simple queries itself, offloading the main server even when it has failed. This can be done for $2000: half for the desktop machine, and the other half for the dual-port 10-gig Ethernet card.

This thing would need to run at high rates, namely 10 gigabits/second and 10 million DNS queries per second. That's a lot of queries, but it's not as hard as you'd think. Custom drivers like PF_RING and DPDK already capture packets at rates of 50 million packets/second by cheating: they bypass the operating system and deliver packets directly to user-mode memory. With very little code, you can parse the UDP packet and simple DNS queries, and either answer the query yourself or forward the packet on to the real server.
At these rates, you hit fundamental hardware limits, like memory bandwidth and memory latency. Something as simple as "virtual-memory TLB misses" can become the primary limiting factor.
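Before getting to those limits, here's a minimal sketch of the query-parsing step mentioned above, assuming the capture driver hands us the raw DNS payload. The struct and function names are mine, not anything from PF_RING or DPDK, and real code would also validate the Ethernet/IP/UDP headers first:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* The fixed 12-byte DNS header at the start of every query. */
struct dns_header {
    uint16_t id;                          /* transaction ID, echoed in the reply */
    uint16_t flags;                       /* QR, opcode, RD, etc. */
    uint16_t qdcount;                     /* number of questions */
    uint16_t ancount, nscount, arcount;
};

/* Copy the query name into 'name' as a dotted string.
 * Returns 0 on success, -1 if the packet is malformed. */
static int parse_query_name(const uint8_t *payload, size_t len,
                            char *name, size_t name_max)
{
    size_t i = sizeof(struct dns_header), n = 0;

    while (i < len && payload[i] != 0) {
        uint8_t label_len = payload[i++];
        if (label_len & 0xC0)              /* compression pointer: not legal in a query name */
            return -1;
        if (i + label_len > len || n + label_len + 1 >= name_max)
            return -1;
        if (n > 0)
            name[n++] = '.';
        memcpy(name + n, payload + i, label_len);
        n += label_len;
        i += label_len;
    }
    if (i >= len)                          /* ran off the end without a terminating zero */
        return -1;
    name[n] = '\0';
    return 0;
}
```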

The .com TLD has 100,000,000 subdomains, and each record takes about 100 bytes. That means the server needs 10-gigabytes of RAM to hold all those records. These days, 16-gigs of RAM costs $160, so that's not really a lot of memory.
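To give a feel for where those 100 bytes go, here's one hypothetical record layout; the fields and sizes are my guess at what a .com delegation needs, not the format any real server uses:

```c
#include <stdint.h>

/* Hypothetical in-memory record for one .com delegation, sized to show
 * how ~100 bytes per name adds up. */
struct dns_record {
    char     name[64];       /* the subdomain, e.g. "example" */
    uint32_t ttl;
    uint32_t ns_count;
    uint32_t ns_index[4];    /* indexes of the NS names in a shared pool */
    uint32_t glue_ipv4[4];   /* glue A records, if any */
};                           /* 64 + 4 + 4 + 16 + 16 = 104 bytes */
```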

But it is a lot of virtual-memory. Every memory request needs to be translated from the "virtual" address to the "physical" address. This is done by dividing up memory into 4096 byte pages. The processor looks up the virtual-address in a "page-table" in order to calculate the physical address.

These page tables are 1/512 the size of the memory you use: each 4096-byte page needs an 8-byte page-table entry, and 8/4096 = 1/512. This sounds small, but when you are using 10 gigabytes of memory, it means you are using 20 megabytes of page tables.

Now let's trace the path of an incoming UDP DNS request. Each one is for a different domain, at random. This randomness is a problem, because the CPU cache only holds recently accessed data, but random domain queries haven't been recently accessed. That means as you walk through your data structures, each memory access will be a cache-miss. This means that instead of taking 1 nanosecond to read the L1 cache, 4 nanoseconds to read the L2 cache, or 10 nanoseconds to read the L3 cache, it'll take 80-nanoseconds to read DRAM.

At 10 million packets/second, that works out to only 100-nanoseconds per packet, so it sounds like we've used up all our time just waiting for that first cache miss. Things aren't as dire as they appear, though. First of all, we can process several packets in batches and issue "prefetches" on the data, so that much of the time the data will be in the cache by the time we need it. Secondly, a standard desktop supports 8 cores/threads, so our budget is more like 10 cache misses per packet rather than a single cache miss.
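Here's a rough sketch of that batching-plus-prefetch idea. The hash table here is a stand-in for whatever data structure the real server uses, but the shape of the loop is the point: compute all the bucket addresses first, issue prefetches, then come back and touch the data once the DRAM fetches have had time to overlap:

```c
#include <stdint.h>
#include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_T0 */

#define BATCH        16
#define NUM_BUCKETS  (1u << 20)

struct record { char name[64]; uint32_t answer; };

/* Stand-in hash table; a real server's structure would differ. */
static struct record *table[NUM_BUCKETS];

/* FNV-1a, just as a placeholder hash function. */
static uint64_t hash_name(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (*s) { h ^= (uint8_t)*s++; h *= 0x100000001b3ULL; }
    return h;
}

/* Look up a whole batch of names: the first pass starts the memory fetches,
 * the second pass consumes the data, which by then is (hopefully) in cache. */
void lookup_batch(const char *names[], uint32_t answers[], int count)
{
    struct record *recs[BATCH];

    if (count > BATCH)
        count = BATCH;

    for (int i = 0; i < count; i++) {
        recs[i] = table[hash_name(names[i]) & (NUM_BUCKETS - 1)];
        _mm_prefetch((const char *)recs[i], _MM_HINT_T0);   /* start the fetch */
    }
    for (int i = 0; i < count; i++)
        answers[i] = recs[i] ? recs[i]->answer : 0;          /* consume it later */
}
```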

But we have a problem: virtual memory. As I pointed out above, the page tables alone for our massive .com DNS database are 20 megabytes in size, but our desktop CPU only has 8 megabytes of L3 cache. That means, on average, our desired memory access will not only cause a cache miss, it'll cause a second cache miss just reading the page-table entry.

That's why a common solution in scenarios like this is to move the code from user-mode into kernel-mode, the advantage being that in kernel-mode you can largely avoid the virtual-memory overhead. However, kernel-mode is Morally Wrong, and I'd rather have the Internet crash due to DDoS than put my code there.

The other solution is to use larger pages. The small 4096-byte memory page isn't the only option; you can also use 2-megabyte pages, and both sizes are supported simultaneously. That's because virtual-to-physical translation is a multi-step process, and you can set a flag to skip the last step for some addresses. If we allocate that large 10-gigabyte chunk as 5000 2-megabyte pages, instead of 2.5 million 4096-byte pages, then we have a lot less to worry about: the 40 kilobytes of upper-level page tables will easily fit within our cache.

Let's look at how virtual-memory translation works. I've drawn the following diagram to illustrate it.
Note that the upper 16 bits are ignored. Today's 64-bit processors are really just 48-bit processors, but 'sshhhh', don't tell anybody, or they might get upset.

The first step in translating an address is to grab the CR3 register. This register is the root of the page-table hierarchy. Each time the operating system context-switches to a different process, it overwrites this register with a pointer to the top of that process's page-table hierarchy.

This register points to a 4096-byte block of memory (called the "Page Map Level 4", or simply "L4" in the diagram) that contains 512 pointers, each 8 bytes in size. Just as 8 bits address 256 combinations, 9 bits address 512 combinations. Thus, we grab the first 9 bits of our 48-bit address, do a lookup in the L4 table, and find a pointer to our L3 table.

We repeat the process three more times, grabbing the next 9 bits from the address, looking up that value in the current table, and getting a pointer to the next one. The process ends at the actual memory page we are looking for. At that point, we simply append the low 12 bits onto the end of the final pointer, giving us the physical address for our virtual address.
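In code, slicing up the 48-bit virtual address looks like the following. This is just the standard x86-64 four-level layout, written out for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* How a 48-bit virtual address is carved up during the page-table walk:
 * four 9-bit table indexes plus a 12-bit offset into the 4096-byte page. */
struct va_parts {
    unsigned l4, l3, l2, l1;   /* indexes into the four 512-entry tables */
    unsigned offset;           /* offset within the final 4 KB page */
};

static struct va_parts split_virtual_address(uint64_t va)
{
    struct va_parts p;
    p.l4     = (unsigned)((va >> 39) & 0x1FF);   /* bits 47..39 */
    p.l3     = (unsigned)((va >> 30) & 0x1FF);   /* bits 38..30 */
    p.l2     = (unsigned)((va >> 21) & 0x1FF);   /* bits 29..21 */
    p.l1     = (unsigned)((va >> 12) & 0x1FF);   /* bits 20..12 */
    p.offset = (unsigned)( va        & 0xFFF);   /* bits 11..0  */
    return p;
}

int main(void)
{
    struct va_parts p = split_virtual_address(0x00007f3a12345678ULL);
    printf("L4=%u L3=%u L2=%u L1=%u offset=0x%x\n",
           p.l4, p.l3, p.l2, p.l1, p.offset);
    return 0;
}
```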

In practice, the L4 table will have only a single entry that we ever use, unless the system uses more than 512 gigabytes of memory. At the L3 level, we'll likewise only ever use a single page of 512 entries. Things start getting big at the L2 level, which, as noted above, requires 40 kilobytes worth of entries to cover our 10-gigabyte .com database. Finally, the L1 level is what destroys us, requiring 20 megabytes to cover that same 10 gigabytes.

Note that processors have a special cache called the "TLB", or "translation lookaside buffer", that caches the results of this page-table walk, so you don't have to do the full walk on every memory access. TLB misses, as they are called, normally take up only a few percent of CPU time. However, the more we tackle Internet-scale problems, the more TLB misses become a problem, and the more we have to look for solutions.

Lucky for us, we can skip the L1 stage. Page-table entries are a full 64 bits in size, but only the high-order bits hold the pointer; the low-order bits are used as status flags. One of those flags, the "page size" (PS) bit, means "this entry maps a large page directly; skip the next level". This is shown in the following diagram, where the L2 lookup leads directly to a 2-megabyte page, rather than to an L1 lookup of a 4096-byte page:
Likewise, the L2 can be skipped when using 1-gigabyte pages. I haven't read the necessary documentation, but I assume that the L3 could also be skipped for 0.5-terabyte pages. Whether or not CPUs support that today, they'll probably support it 10 years from now.
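For reference, here's what checking that flag would look like if you were inspecting a raw x86-64 page-directory entry yourself; on this architecture the "skip the next level" flag is bit 7, named PS for "page size":

```c
#include <stdint.h>
#include <stdbool.h>

/* A few of the x86-64 page-directory entry flag bits. If PS is set, the
 * entry maps a 2 MB page directly and there is no L1 table to walk. */
#define PDE_PRESENT  (1ULL << 0)
#define PDE_WRITABLE (1ULL << 1)
#define PDE_PS       (1ULL << 7)   /* "page size": skip the next level */

bool maps_2mb_page(uint64_t pde)
{
    return (pde & PDE_PRESENT) && (pde & PDE_PS);
}
```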

All the major operating systems support large pages. On Windows, use "VirtualAlloc()" with the flag "MEM_LARGE_PAGES". On Linux, compile the kernel with CONFIG_HUGETLBFS and allocate with "mmap()".
The problem is that you may not be able to allocate 10 gigabytes worth of large pages. That's because as the operating system runs, it's constantly allocating and freeing the smaller 4096-byte pages, scattering them throughout physical memory, so there may not be enough contiguous memory left to satisfy your request. (On Linux, you can set the "hugepages=" boot option to reserve the memory for huge pages before this fragmentation happens.)
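Here's a minimal Linux sketch of the allocation, assuming enough 2-megabyte huge pages have already been reserved (via the hugepages= boot option or /proc/sys/vm/nr_hugepages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define DB_SIZE (10ULL * 1024 * 1024 * 1024)   /* the 10-gigabyte database */

int main(void)
{
    /* Ask for the whole database as huge pages. This fails unless huge
     * pages have been reserved ahead of time. */
    void *db = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (db == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* ... load the .com records into 'db' ... */
    return 0;
}
```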

There is a downside to large memory pages: they can't be swapped out of memory. Also, consider something like Apache that's constantly forking processes, setting "copy-on-write" bits in the page tables. Every fork forces a few 4096-byte pages to be copied; now imagine how slow that would be if a couple of 2-megabyte pages had to be copied on every fork. This means that, in general, small pages are better.

Fortunately, none of these disadvantages apply to our 10-gigabyte in-memory database. Indeed, had we not used large pages, we'd probably have configured the system to not page out that memory anyway.

Conclusion

This post describes an Internet-scale problem that can be serviced by desktop-scale hardware. The limiting factor isn't the hardware, but the software. With better software, we can tackle larger Internet-scale problems.

You can take any software, such as Nginx, BIND, Sendmail, or Snort, grab a tool like Intel's VTune that monitors processor performance counters, create an Internet-scale workload to test it, and then see whether virtual-memory issues like TLB misses are a problem. If they are, switch the memory allocators to large pages in the appropriate places.
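On Linux, a quick way to get at similar counters without VTune is the perf_event_open(2) system call; here's a sketch that counts data-TLB load misses around a workload (error handling kept minimal, and the event has to be supported by your CPU and kernel):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open a counter for data-TLB load misses on the current process,
 * the same event "perf stat -e dTLB-load-misses" reads. */
static int open_dtlb_miss_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd = open_dtlb_miss_counter();
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the Internet-scale workload here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == (ssize_t)sizeof(misses))
        printf("dTLB load misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```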

Update: A further note on the TLB (Translation Lookaside Buffer): the page walk described above is too expensive to do for every memory access, so the TLB acts as a cache of recent translations. Relative to cache misses, it isn't a huge problem, but we should think about being friendly to the TLB as well as the cache.



I drew the above diagrams using JavaScript and HTML5.
