Wednesday, January 27, 2016

Net ring-buffers are essential to an OS

Even by OpenBSD standards, this rejection of 'netmap' is silly and clueless.

BSD is a family of Unix-like operating systems that powers a lot of the Internet, from Netflix servers to your iPhone. One variant, called "OpenBSD", focuses on security. A lot of security-related projects get their start on OpenBSD. In theory, it's for those who care a lot about security. In practice, virtually nobody uses it, because it makes too many sacrifices in the name of security.

"Netmap" is a user-space network ring-buffer. What that means is the hardware delivers network packets directly to an application, bypassing the operating system's network stack. Netmap currently works on FreeBSD and Linux. There are projects similar to this known as "PF_RING" and "Intel DPDK".


The problem with things like netmap is that the network hardware is no longer a shareable resource, but instead must be reserved for a single application. This violates many principles of a "general purpose operating system".

In addition, it ultimately means that the application is going to have to implement its own TCP/IP stack. That means it's going to repeat all the same mistakes of the past, such as the "ping of death", where a packet reassembles to more than 65,535 bytes. This introduces a security problem.
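For those who don't remember the bug: the IPv4 total-length field is 16 bits, so no reassembled datagram may legally exceed 65,535 bytes, yet a final fragment can claim an offset plus length that goes past that limit and overflow a fixed-size buffer. A reimplemented stack has to remember checks like this one (a sketch, not lifted from any particular stack):

    #include <stdint.h>
    #include <stdbool.h>

    #define IP_MAX_DATAGRAM 65535u  /* IPv4 total-length field is 16 bits */

    /* Reject any fragment that would reassemble past the 64KB limit --
     * the missing check that the "ping of death" exploited.
     * frag_offset is the raw 13-bit header field, in units of 8 bytes. */
    static bool fragment_in_bounds(uint16_t frag_offset, uint16_t frag_len)
    {
        uint32_t end = (uint32_t)frag_offset * 8 + frag_len;
        return end <= IP_MAX_DATAGRAM;
    }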

But these criticisms are nonsense.

Take "microkernels" like Hurd or IBM mainframes. These things already put the networking stack in user space, for security reasons. I've crashed the network stack on mainframes -- the crash only affects the networking process and not the kernel or other apps. No matter how bad a user-mode TCP/IP stack is written, any vulnerabilities affect just that process, and not the integrity of the system. User-mode isolation is a security feature. That today's operating-systems don't offer user-mode stacks is a flaw.

Today's computers are no longer multi-purpose, multi-user machines. While such machines do exist, most computers today are dedicated to a single purpose, such as supercomputer computations, a domain controller, memcached, or a firewall. Since single-purpose, single-application computers are the norm, "general purpose" operating systems need to be written to include that concept. There needs to be a system whereby apps can request exclusive access to hardware resources, such as GPUs, FPGAs, hardware crypto accelerators, and of course, network adapters.

These user-space ring-buffer network drivers operate with essentially zero overhead. You have no comprehension of how fast this can be. It means networking can operate 10 or even 100 times faster than trying to move packets through the kernel. I've been writing such apps for over 20 years, and have constantly struggled against disbelief, as people simply cannot accept that machines can run this fast.

In today's terms, it means it's relatively trivial to use a desktop system (quad-core, 3 GHz) to create a 10-gbps firewall that passes 30 million packets/second (bidirectional), at wire speed. I'm assuming 10 million concurrent TCP connections here, with 100,000 rules. This is between 10 and 100 times faster than you can get through the OpenBSD kernel, even if you configured it to do nothing but bridge two adapters with no inspection.
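That 30 million figure isn't arbitrary. A minimum-size Ethernet frame occupies 84 bytes on the wire (64 bytes of frame plus 20 bytes of preamble and inter-frame gap), or 672 bits. At 10 gbps, that is 10,000,000,000 / 672 ≈ 14.88 million packets per second in each direction; doubling for bidirectional traffic gives roughly 30 million packets per second at wire speed.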

There are many reasons for the speed. One is hardware. In modern desktops, the 10-gbps network hardware DMAs the packet directly into the CPU's cache (Intel calls this DDIO) -- actually bypassing main memory. A packet can arrive, be inspected, then be forwarded out the other adapter before the cache writes any changes back to DRAM.

Another reason is the nature of ring-buffers themselves. The kernel's Ethernet drivers also use ring-buffers in the hardware. The problem is that the kernel must remove the packet from the driver's ring-buffer, either by making a copy of it, or by allocating a replacement buffer. This is actually a huge amount of overhead. You think it's insignificant, because you compare this overhead with the rest of kernel packet processing. But I'm comparing it against the zero overhead of netmap. In netmap, the packet stays within the buffer until the app is done with it.
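Here is roughly what that looks like at the ring level, again sketched against the macros in netmap's net/netmap_user.h. The packet is inspected where the NIC wrote it; "consuming" a slot is nothing more than advancing the head pointer, after which the hardware reuses the buffer:

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>

    /* placeholder inspection routine -- reads the frame in place */
    static void inspect(const char *buf, unsigned len)
    {
        (void)buf; (void)len;
    }

    /* Drain receive ring 0 of an already-open descriptor 'd'
     * (see the earlier sketch). No copies, no allocations. */
    static void drain_ring(struct nm_desc *d)
    {
        struct netmap_ring *ring = NETMAP_RXRING(d->nifp, 0);

        while (!nm_ring_empty(ring)) {
            struct netmap_slot *slot = &ring->slot[ring->cur];
            char *buf = NETMAP_BUF(ring, slot->buf_idx);

            inspect(buf, slot->len);

            ring->cur = nm_ring_next(ring, ring->cur);
            ring->head = ring->cur;    /* hand the slot back to the NIC */
        }
    }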

Arriving TCP packets trigger a long "pointer walk", following a chain of pointers to get from the network structures to the file-descriptor structures. At scale (millions of concurrent TCP connections), these structures no longer fit within cache. That means each time you follow a pointer you take a cache miss, and must halt and wait 100 nanoseconds for memory.

In a specialized user-mode stack, this doesn't happen. Instead, you put everything related to a TCP control block into a 1k chunk of memory, then pre-allocate an array of 32 million of them (using 32 gigabytes of RAM). Now there is only a single cache miss per packet. But actually, there are zero, because with a ring buffer, you can pre-parse future packets in the ring and issue "prefetch" instructions, so that the TCP block is already in cache by the time you need it.
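A hypothetical sketch of that layout, with illustrative field names (no real stack is being quoted here):

    #include <stdint.h>
    #include <stdlib.h>

    /* Everything for one connection packed into one 1KB block, so a
     * lookup costs at most one cache miss. Fields are illustrative. */
    struct tcb {
        uint32_t saddr, daddr;           /* connection 4-tuple */
        uint16_t sport, dport;
        uint32_t snd_nxt, rcv_nxt;       /* sequence-number state */
        uint8_t  state;
        char     app_state[1024 - 21];   /* pad the block out to 1KB */
    };

    #define MAX_CONNS (32u * 1024 * 1024)   /* 32M entries = 32GB of RAM */
    static struct tcb *tcb_table;

    static int tcb_init(void)
    {
        /* one flat array, allocated once at startup -- no per-packet
         * allocation ever happens afterward */
        tcb_table = calloc(MAX_CONNS, sizeof(struct tcb));
        return tcb_table != NULL ? 0 : -1;
    }

    /* While handling packet N, warm the cache for a later packet that
     * is already visible in the ring. 'hash' comes from pre-parsing
     * that packet's address/port tuple. */
    static void tcb_prefetch(uint32_t hash)
    {
        __builtin_prefetch(&tcb_table[hash % MAX_CONNS]);
    }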

These performance issues are inherent to the purpose of the kernel. As soon as you think in terms of multiple users of the TCP/IP stack, you inherently accept processing overhead and cache misses. No amount of optimizations will ever solve this problem.

Now let's talk multicore synchronization. On OpenBSD, it rather sucks. Adding more CPUs to the system often makes the system go slower. In user-mode stacks, synchronization often has essentially zero overhead (again, that number zero). Modern network hardware will hash the addresses/ports of incoming packets, giving each CPU/thread its own stream, as sketched below. Thus, our hypothetical firewall would process packets as essentially 8 separate firewalls that only rarely need to exchange information (when doing deep inspection on things like FTP, to open up dynamic ports).
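The hardware feature doing the work here is receive-side scaling (RSS). With netmap you can attach one thread to one hardware ring and get lock-free parallelism; the sketch below assumes a NIC configured for 8 queues and again uses the placeholder interface name "em0":

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NRINGS 8   /* assumes the NIC is configured with 8 RSS queues */

    /* Each thread owns exactly one hardware ring pair. RSS has already
     * partitioned flows among the rings, so the threads share nothing
     * and need no locks on the fast path. */
    static void *worker(void *arg)
    {
        long ring_id = (long)arg;
        char name[64];

        /* "netmap:em0-3" attaches to hardware ring pair 3 only */
        snprintf(name, sizeof(name), "netmap:em0-%ld", ring_id);
        struct nm_desc *d = nm_open(name, NULL, 0, NULL);
        if (d == NULL)
            return NULL;

        /* ... receive loop as in the earlier sketches ... */

        nm_close(d);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NRINGS];
        for (long i = 0; i < NRINGS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < NRINGS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }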


Now let's talk applications. The OpenBSD post presumes that apps needing this level of speed are rare. The opposite is true. They are painfully common, increasingly becoming the norm.

Supercomputers need this for what they call "remote DMA". Supercomputers today are simply thousands of desktop machines, with gaming graphics cards, hooked up to 10-gbps Ethernet, running in parallel. Often one process on one machine needs to send bulk data to another process on another machine. Normal kernel TCP/IP networking is too slow, though some systems now have specialized "RDMA" drivers trying to compensate. Pure user-space networking is just better.

My "masscan" port scanner transmits at a rate of 30 million packets per second. It's so fast it'll often melt firewalls, routers, and switches that fail to keep up.

Web services, either the servers themselves, or frontends like varnish and memcached, are often limited by kernel resources, such as the maximum number of connections. They would be vastly improved with user-mode stacks on top of netmap.

Back in the day, I created the first "intrusion prevention system" or "IPS". It ran on a dual-core 3-GHz machine at maximum gigabit speeds, including 2 million packets per second. We wrote our own user-space ring-buffer driver, since things like netmap and PF_RING didn't exist back then.

Intel has invested a lot in its DPDK system, which is better than netmap for creating arbitrary network-centric devices out of standard desktop/server systems. That we have competing open-source ring-buffer drivers (netmap, PF_RING, DPDK), plus numerous commercial versions, means there is a lot of interest in such things.

Conclusion

Modern network machines, whether web servers or firewalls, have two parts: the control-plane where you SSH into the box and manage it, and the data-plane, which delivers high-throughput data through the box. These things have different needs. Unix was originally designed to be a control-plane system for network switches. Trying to make it into a data-plane system is 30 years out of date. The idea persists because of the clueless thinking as expressed by the OpenBSD engineers above.

User-mode stacks based on ring-buffer drivers are the future for all high-performance network services. Eventually OpenBSD will add netmap or something similar, but as usual, they'll be years behind everyone else.

2 comments:

Matt said...

You wrote:
"Modern network machines, whether web servers or firewalls, have to parts: the control-plane …, and the data-plane…"
Really?

I thought all our troubles in life were because the data and control plane were not separated. We are stuck in the world of the von Neumann architecture where everything is in the same address space.

Paweł Małachowski said...

In today's terms, it means it's relatively trivial to use a desktop system (quad-core, 3 GHz) to create a 10-gbps firewall that passes 30 million packets/second (bidirectional), at wire speed.

We are actually doing it with DPDK using 2 Xeon cores and stateless ACLs. And also rate-limiting. And encapsulating packets. :)

It is worth noting that while PF_RING ZC and Netmap require modified drivers, DPDK and Snabb Switch come with their own. The bad thing is that you can't easily mix kernel and dataplane traffic using different queues. The good thing is that it requires only a thin driver that replaces the NIC driver and exposes the card (via UIO) to the real driver inside DPDK.