A recent blog post compares two open-source "packet logging" programs: simple programs that log network traffic directly to disk. That post finds that the multithreaded program is a lot faster than the single-threaded one, confirming the common prejudice that in the modern world of multicore systems, multithreaded is better.
But the results are suspect. The post finds that the TWO-threaded program is SIX times faster. That doesn't make sense. If the issue were truly just "multithreaded vs. single-threaded", then we'd expect at most a two-fold increase, not a six-fold increase.
Instead, the real problem here is the way the application has to "wait" on either the network or the disk. One way to solve this waiting is to put the network portion on one thread and the disk portion on another thread. That's what Gulp does. It's many times faster than Daemonlogger even on computers with only a single processing core.
But multithreading is only one way to solve this problem. Another would be to use asynchronous APIs and/or larger buffers. That's how single-threaded programs have long dealt with "waiting", like the "C10K" web servers that may be only single-threaded.
The true reason Gulp is faster has nothing to do with its multithreaded nature; it's the way Gulp cleverly uses Linux APIs to get out of the way. Network adapters want to DMA packets directly into a buffer at full speed, bypassing the CPU and the operating-system kernel. Disk adapters likewise want to DMA data directly from memory at full speed, bypassing the CPU and kernel. Today's hardware can easily do this at many times 10-gbps speeds. The problem is that today's operating-system kernels get in the way. The trick to making this work is to figure out just the right operating-system APIs to trick the kernel into getting out of the way. Gulp is faster because it does a better job of getting out of the way than Daemonlogger, not because it's multithreaded.
More to the point, Gulp still fails at being "multicore". Computers have been stuck at around 3-GHz for the past decade; instead of getting faster, we now get multiple cores. Gulp scales to 2 cores, but not to 12: it's no faster on a 12-core system than on a 2-core system. (My laptop has 4 cores; my desktop has 12.)
The problem we face today is that people think “multithreaded” means “multicore”. It doesn’t. Multithreaded means running DIFFERENT tasks on a SINGLE core, like how Gulp runs one thread for capture and one for logging to disk, making it faster than Daemonlogger even on a single core. In contrast, multicore programming means running the SAME tasks on MANY cores, so making something faster simply means adding cores. Gulp fails at this.
Most software that people hail as "multithreaded" fails at being truly "multicore". A good example is the Linux kernel itself, which claims to scale to 64 cores. It does, but only for selected applications and benchmarks. Linux fails at being truly multicore for other tasks, such as packet-sniffing. A great many multithreaded applications fail to scale well on Linux.
Another example is PF_RING. It uses custom drivers to bypass the inefficiencies of the Linux kernel at 10-gbps speeds, but it then uses "spinlocks" instead of "atomics" for synchronization, so it fails at being multicore. After about 4 cores, adding additional cores makes PF_RING go slower, not faster.
If you want a truly scalable system, instead of going "multithreaded", you need to cheat. Today's packet-sniffing adapters (PF_RING, Napatech) can split incoming traffic into separate streams by hashing the TCP/IP socket info. So exploit that. Buy a cheap 8-core system, use one of these adapters to create 8 streams, and buy 8 high-speed disks (like SSDs). Simply run 8 separate instances of Gulp/Daemonlogger, each bound to its own core, stream, and disk. When you want to analyze the traffic logged to the disks, you'll have to recombine the streams into a single one, but that's not too difficult, especially when you are using a $2,000 system in place of one that would otherwise cost you $50,000.
That original blog post confirms your prejudice that multithreaded software is inherently better than single-threaded software, an important lesson for today's multicore computers. But when you look deeper, you find that the results are suspect and that it teaches entirely the wrong lessons about multithreaded software. Gulp fails at being multicore every bit as much as Daemonlogger does.
Historical note: BlackICE, the first IPS, written in 1998, was a two-threaded system, with one thread for packet capture (using a custom driver that looked a lot like the modern PF_RING) and another thread for analysis. It had the same "producer-consumer" relationship that Gulp has. While it was multithreaded, it wasn't truly multicore, and did not scale past two cores. I don't know for sure, but I'm told that IBM (which now sells BlackICE as "Proventia") has converted the software so that it's truly multicore.
Update: One tweeter took exception to my terminology, since nobody else makes the distinction/comparison between "multithreaded" and "multicore" the way I do. But that's entirely the point. Multithreaded programming techniques were developed over decades for either single-core systems or systems with a small number of cores. The techniques your textbooks teach fail when you get to 12 cores, like I have on my desktop. Just because a program (like Gulp) is multithreaded doesn't mean it has solved the problem of running on all 12 cores (it hasn't). Thus, just because something is "multithreaded" doesn't mean it's truly "multicore".
I forget exactly who, but somebody (Azul Systems, I believe) has created a hashtable using atomic operations that scales to 1000 cores doing insertions and lookups simultaneously. Now THAT is true multicore.
Update: The 'cheat' solution I mention above is how people run high-speed Snort, a painfully single-threaded IDS. I think it's a bastard solution to the problem, but it turns out customers are actually quite happy with it. (Which I guess is another lesson: what matters is how much customers like the pork sausage, not how it's made.)
So let's say you test this using something like 'tcpreplay' at 10-gbps. You'll find that the solution doesn't appear to work. That's because with tcpreplay, you take packets captured from slow networks and replay them at much higher speeds. On slow networks, like your home 10-mbps connection, a single TCP connection can use up the entire bandwidth. When you replay at 10-gbps, a single TCP connection captured at 10-mbps is being replayed at 10-gbps, which causes it all to be sent to a single virtualized adapter, which can't handle more than 1.25-gbps.
Thus, when testing your cheated Snort solution, you now have two separate metrics: maximum network speed, and maximum TCP connection speed.
But a truly multithreaded/multicore solution might not do any better. Packets on a TCP connection must still be processed in order, so you can't have one core process one packet of a TCP connection while another core processes the next. Instead, to truly speed up the single-TCP-connection case, you'd need multiple cores working together on a single packet. That's a hard problem, because chances are good that the synchronization overhead (even using lightweight atomics) will cost more than you gain. Thus, a cheating solution may actually perform better on this metric than the proper solution.
Either way, I hope IDS/IPS evaluators like NSS start measuring single-TCP-connection speed along with max-network-speed.
Update: So how can you fix PF_RING to be multicore? Well, a good lesson is how PACKET_RX_RING does it. Both similarly-named solutions do roughly the same thing: create a memory-mapped user-space ring-buffer for incoming packets. PF_RING does this with zero copies, bypassing the kernel at 15-million packets/second; PACKET_RX_RING does it by making copies in the kernel, at 1.5-million packets/second.
Ring-buffers are easily synchronized in a producer/consumer fashion. If there is only one consumer, then no special synchronization is needed. If there is one producer and many consumers, then the consumers need to synchronize among themselves, but not with the producer.
PACKET_RX_RING, while slow because of interference from the kernel, allows wait-free synchronization. The 12 threads trying to read packets simply do a __atomic_compare_and_exchange() on the "current packet" pointer (which on x86 compiles to a lock cmpxchg instruction). If the operation succeeds, the thread owns the packet; if it fails, the thread tries again OR goes to sleep. (This synchronization also allows thread scheduling, so that threads can go to sleep, causing CPU cores to go to sleep and consume less electricity.)
PF_RING, while otherwise fast, does numerous "spinlock()s". When trying to read a packet, threads will furiously spin, consuming vast amounts of resources and causing the system to slow down as you add more threads.
On Linux, the 'spinlock()' wait primitive is thought to be very fast because it has the best "best-case" performance: if there is no contention, it is just as fast as an atomic primitive. However, when there is a lot of contention, because you have a lot of threads, it has one of the worst "worst-case" performances, because the threads will be furiously spinning, using up system resources.
So the upshot is that PF_RING needs to get rid of all "spinlocks" and use "atomics" instead, so that 12 cores are faster than 11 cores, and so that it allows the application to schedule threads to go to sleep instead of furiously spinning. As with PACKET_RX_RING, you shouldn't need more than one atomic compare-and-swap per packet read from the interface.
(Note: These comments are from playing with PF_RING last year. I used one 10-gig adapter to transmit and another to receive. I used the built-in sample apps off of 'dna0' that allow you to specify the number of threads. The more threads, the slower the packet receive: 1 thread did about 12-million packets/second, 12 threads did 1-million packets/second. Looking in the open-source part of the code, I saw evil spinlocks. I didn't disassemble the closed-source part of the code to see why the spinlocks were necessary.)
Update: Ten years ago, the x86 lock prefix forced an uncached memory transaction, which took about 250 clock cycles on a 3-GHz Pentium 4. Today, with integrated memory controllers, it causes an L3 cache operation, which can be as low as 25 clock cycles on a 3-GHz Sandy Bridge processor.
The upshot is that "atomic" operations were expensive in the era of "multithreaded" code, but have become ten times cheaper in the era of "multicore" code.