I’m writing up my Shmoocon preso as a series of blogposts. Today I’m going to talk about custom network stacks.
The way network stacks work today is to let the kernel do all the heavy lifting. It starts with kernel drivers for Ethernet cards, which pass packets to the kernel’s TCP/IP stack. Upon reception, the packet must make an arduous climb up the network stack until it finally escapes to user-mode. (User-mode is where applications run.)
This picture from Wikipedia shows this in some detail. You can see that the packet doesn’t have an easy journey.
Over the years, Linux kernel writers have tried to optimize this with something called “zero-copy”. It’s a funny term because there are still multiple copies involved. Every new “zero-copy” advancement has meant only removing an extra copy at one stage, but leaving copies around for other stages.
That’s why “everyone knows” that you should move your code into the kernel. For one thing, it gets rid of the step of copying the packet from the kernel-mode buffers to user-mode memory. For another thing, you can tap into a mid-point in the stack, so the packet doesn’t have to climb all the way to the top.
But there is an alternative. Instead of moving everything into the kernel, we can move everything into user-mode.
This is done first by rewriting the network driver. Instead of a network driver that hands off packets to the kernel, you change the driver so that it doesn’t. Instead, you map the packet buffers into user-mode space.
Remember that the primary difference between “kernel-mode” and “user-mode” is memory protection. Memory addresses in user-mode programs don’t refer to physical memory addresses. Instead, they go through “page tables” to get translated into physical addresses. Memory-mapping means that once the driver allocates physical memory for packet buffers, you configure the application’s page tables to point to that same memory.
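As a rough illustration of the memory-mapping idea, here is a minimal Python sketch. It is an analogy, not a driver: a temporary file stands in for the driver’s packet buffer, and the names `shared_views_demo`, `driver_view` and `app_view` are made up for this example. The point is only that two mappings of the same backing pages see each other’s writes without any read() or recv() call in between.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE  # size of one memory page on this machine


def shared_views_demo(message=b"packet"):
    """Map the same backing pages twice; a write through one view shows up in the other."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(PAGE)  # reserve one page of backing store
        # Hypothetical stand-ins: one mapping plays the driver, the other the application.
        driver_view = mmap.mmap(backing.fileno(), PAGE)
        app_view = mmap.mmap(backing.fileno(), PAGE)
        try:
            driver_view[:len(message)] = message    # the "driver" side deposits a packet
            return bytes(app_view[:len(message)])   # the "application" side sees it directly
        finally:
            driver_view.close()
            app_view.close()


print(shared_views_demo())
```

A real user-mode driver would map the network card’s buffers rather than a file, but the page-table trick is the same.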
The upshot is that it creates ZERO overhead networking. The network card uses a feature called DMA (“direct memory access”) to copy the packet across the PCIe bus into memory, without the CPU being involved at all. The only overhead is that about once a millisecond, the CPU needs to wake up and tell the driver which packet buffers are free. Since it does so in bulk, about 1000 packets at a time, the per-packet overhead is about 50 clock cycles.
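The bulk hand-back can be sketched with a toy model of a receive ring. This is a simplification under stated assumptions: `RxRing`, `nic_receive`, `app_poll` and the `doorbells` counter are hypothetical names for this example, and a real ring lives in DMA memory shared with the hardware rather than in a Python list.

```python
class RxRing:
    """Toy model of a receive ring shared by a network card and an application."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.head = 0       # next slot the "card" fills
        self.tail = 0       # next slot the application reads
        self.filled = 0     # slots holding packets not yet read
        self.free = size    # slots the card is allowed to fill
        self.doorbells = 0  # times the application told the card about free slots

    def nic_receive(self, packet):
        """Simulate the card placing a packet in the ring; no CPU work per packet."""
        if self.free == 0:
            return False    # ring full: the packet is dropped
        self.slots[self.head] = packet
        self.head = (self.head + 1) % self.size
        self.free -= 1
        self.filled += 1
        return True

    def app_poll(self):
        """Drain every filled slot, then hand all of them back in a single step."""
        batch = []
        for _ in range(self.filled):
            batch.append(self.slots[self.tail])
            self.tail = (self.tail + 1) % self.size
        self.filled = 0
        if batch:
            self.free += len(batch)
            self.doorbells += 1  # one notification covers the whole batch
        return batch


ring = RxRing(1024)
for i in range(1000):
    ring.nic_receive(i)
batch = ring.app_poll()
print(len(batch), ring.doorbells)
```

One wake-up returns a thousand buffers, so whatever that wake-up costs is divided across the whole batch.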
In a recent benchmark, Intel demonstrated a system using an 8-core, 2.0-GHz, 1-socket server forwarding packets at a rate of 80 million packets/second. That means receiving the packet, processing it in user-mode (outside the kernel), and retransmitting it. That works out to 200 clock cycles per packet. It doesn’t actually consume that many clock cycles; it’s just that other parts of the system start to become the limiting factor at those speeds. The equivalent path through the kernel costs about 20,000 clock cycles per packet. That’s a 100 to 1 difference.
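The arithmetic behind those figures is simple enough to check directly. The function name `cycles_per_packet` is just for this sketch; the inputs are the numbers quoted above.

```python
def cycles_per_packet(cores, clock_hz, packets_per_second):
    """Total clock cycles available per second, divided by packets handled per second."""
    return cores * clock_hz / packets_per_second


budget = cycles_per_packet(cores=8, clock_hz=2.0e9, packets_per_second=80e6)
kernel_cost = 20_000  # cycles per packet through the kernel, as quoted above

print(budget)                # 200.0
print(kernel_cost / budget)  # 100.0
```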
You should know these benchmarks. For example, I was having a discussion about DNS servers on Twitter. I was throwing around the number of “10 million DNS requests per second”. The other person said that this was impossible, because you’d hit the packets-per-second performance limit of the system. As the Intel benchmarks show, this is actually only about 12% of the packet limit of the system.
If you wanted to build such a DNS server, how would you do it? Well, the first step is to get one of these zero-overhead drivers. Popular choices are the PF_RING project and Intel’s DPDK, both for Linux. FreeBSD has its ‘netmap’ interface. You can also just grab an open-source driver and build your own.
Your biggest problem is getting a user-mode TCP/IP stack. There are lots of these stacks around as university research projects. I don’t know of any open-source user-mode stack that is reliable. There is a closed-source stack called “6windgate” which is used in a lot of appliances.
But for something like a DNS server processing UDP, or an intrusion prevention system, you don’t want a full TCP/IP stack anyway. You want a degenerate stack tuned to your application. It takes only 100 clock cycles to parse a UDP packet without having a full stack.
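To give a feel for how little a “degenerate stack” has to do, here is a minimal sketch of a UDP parser in Python. It assumes plain Ethernet II frames carrying IPv4 with no VLAN tags, drops fragments, and does not verify checksums. The function name `parse_udp` is made up for this example, and a real data plane would do this in C against the mapped packet buffers, so Python says nothing about the cycle count.

```python
import socket
import struct

ETH_HEADER_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17


def parse_udp(frame):
    """Return (src_ip, src_port, dst_ip, dst_port, payload), or None if not plain IPv4/UDP."""
    if len(frame) < ETH_HEADER_LEN + 20 + 8:
        return None  # too short for Ethernet + minimal IPv4 + UDP headers
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None
    ip = ETH_HEADER_LEN
    if frame[ip] >> 4 != 4:
        return None  # not IP version 4
    ihl = (frame[ip] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20:
        return None
    (flags_frag,) = struct.unpack_from("!H", frame, ip + 6)
    if flags_frag & 0x3FFF:
        return None  # more-fragments flag or nonzero offset: skip fragments
    if frame[ip + 9] != IPPROTO_UDP:
        return None
    src_ip = socket.inet_ntoa(frame[ip + 12:ip + 16])
    dst_ip = socket.inet_ntoa(frame[ip + 16:ip + 20])
    udp = ip + ihl
    if len(frame) < udp + 8:
        return None
    src_port, dst_port, udp_len, _checksum = struct.unpack_from("!HHHH", frame, udp)
    if udp_len < 8 or udp + udp_len > len(frame):
        return None  # length field is inconsistent with the frame
    return src_ip, src_port, dst_ip, dst_port, frame[udp + 8:udp + udp_len]
```

Everything else a full stack does (routing tables, socket lookup, reassembly, TCP state) is simply absent, which is where the savings come from.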
The general concept we are working toward is the difference between the “control plane” and the “data plane”. A hundred years ago, telephones were just copper wires. You needed a switchboard operator to connect your copper wire to your destination copper wire. In the digital revolution of the late 1960s and early 1970s, wires became streams of bits, and switchboards became computers. AT&T designed Unix to control how data was transmitted, but not to handle data transmission itself.
Thus, operating system kernels are designed to carry the slow rate of control information, not the fast rate of data. That’s why you get a 100 to 1 performance difference between custom drivers and kernel drivers.
That’s why all network appliances are based on the “control plane” vs. “data plane” concept. Most of the appliances you buy today are based on a traditional operating system like Linux, but they separate the system between the parts running the “control plane” and the parts running the “data plane”. Control information takes the slow path up the networking stack, while the raw data takes the fast path around the stack.
This is actually a fairly straightforward idea. The problem is that most people have no experience with it. Thus, they still think in terms of trying to move stuff into the kernel. They have a hard time believing that you can achieve a 100x performance increase by moving outside the kernel.
For example, back in the year 2000 at DefCon, I brought up the fact that my intrusion detection system (IDS) running on Windows on my tiny 11-inch notebook could handle a full 148,800 packets/second. Back then kernel drivers caused an interrupt for every packet, and hardware was limited to about 15,000 interrupts per second. People had a hard time accepting the 10x performance increase, that a tiny notebook could outperform the fastest big-iron server hardware, and that Windows could outperform their favorite operating system like Solaris or Linux. They couldn’t grasp the idea of simply turning off interrupts, and that if you bypass the operating system, it doesn’t matter whether you are running Windows or Linux. These days, with PF_RING having a similar architecture to the BlackICE drivers, this is more understandable, but back then, it was unbelievable.
This is part of a series of blogposts about bypassing the kernel to achieve scalability. In the following blogposts I’m going to discuss “multi-core synchronization” and “memory allocation”. All of these have the same idea: do the work yourself in user-mode rather than relying upon the kernel to do the heavy lifting for you.