Monday, January 09, 2012

Multithreaded teaches the wrong lessons about multicore

This blog-post compares two open-source “packet logging” programs: simple programs that log network traffic directly to disk. It finds that the multithreaded program is a lot faster than the single-threaded one, confirming people’s prejudices that in the modern world of multicore systems, multithreaded is better.

But the results are suspect. It finds that the TWO-threaded program is SIX times faster. That doesn’t make sense. If the issue were truly just “multithreaded vs single-threaded”, we’d expect at most a two-fold increase, not a six-fold increase.

Instead, the real problem here is the way that the application has to “wait” on either the network or the disk. One way to solve this waiting is to put the network portion on one thread, and the disk portion on another thread. That’s what Gulp does. It’s many times faster than Daemonlogger even on computers with only a single processing core.

But multithreading is only one way to solve this problem. Another way is to use asynchronous APIs and/or larger buffers. It’s the same way that single-threaded programs have long dealt with “waiting”, like the “C10K” web-servers that may be only single-threaded.
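
To make that concrete, here is a minimal single-threaded sketch (my own illustration, not code from Gulp or Daemonlogger): a non-blocking AF_PACKET capture socket, an epoll loop, and a large user-space buffer that batches writes to disk. The large buffer plays roughly the role that Gulp’s second thread plays, absorbing bursts while the disk catches up.

    /* Sketch: a single-threaded logger that avoids blocking on the network
     * by using a non-blocking capture socket, epoll, and a large buffer
     * that batches writes to disk. Needs root; error handling is minimal,
     * and a real logger would write pcap record headers, not raw bytes. */
    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int cap = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (cap < 0) { perror("socket"); return 1; }
        fcntl(cap, F_SETFL, O_NONBLOCK);

        int log = open("packets.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (log < 0) { perror("open"); return 1; }

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = cap };
        epoll_ctl(ep, EPOLL_CTL_ADD, cap, &ev);

        static char buf[4 * 1024 * 1024];      /* big buffer absorbs bursts */
        size_t used = 0;

        for (;;) {
            struct epoll_event events[1];
            /* Sleep until the network has something; no busy-waiting. */
            if (epoll_wait(ep, events, 1, -1) <= 0)
                continue;

            /* Drain a burst of packets into the buffer... */
            for (;;) {
                ssize_t len = recv(cap, buf + used, sizeof(buf) - used, 0);
                if (len <= 0)                  /* EAGAIN: nothing left */
                    break;
                used += (size_t)len;
                if (used > sizeof(buf) / 2)
                    break;
            }
            /* ...then flush the whole batch in one write (short writes
             * are ignored here for brevity). */
            if (used > 0 && write(log, buf, used) > 0)
                used = 0;
        }
    }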

The true reason Gulp is faster has nothing to do with its multithreaded nature, but the way it cleverly uses Linux APIs in order to get out of the way. Network adapters want to DMA packets directly into a buffer at full speed, bypassing the CPU and operating system kernel. Disk adapters want to DMA data directly from memory at full speed, likewise bypassing the CPU and kernel. Today’s hardware can easily do this at many times 10-gbps speeds. The problem is that today’s operating system kernels get in the way. The trick to making this work is to figure out just the right operating system APIs to trick the kernel into getting out of the way. The reason Gulp is faster is because it does a better job getting out of the way than Daemonlogger, not because it’s multithreaded.

More to the point, Gulp still fails at being “multicore”. Computers have been stuck at 3-GHz for the past decade; instead of getting faster, we now get multiple cores. Gulp scales to 2 cores, but not to 12 cores. It’s no faster on a 12-core system than on a 2-core system. (My laptop has 4 cores, my desktop has 12 cores.)

The problem we face today is that people think “multithreaded” means “multicore”. It doesn’t. Multithreaded means running DIFFERENT tasks on a SINGLE core, like how Gulp runs one thread for capture and one for logging to disk, making it faster than Daemonlogger even on a single core. In contrast, multicore programming means running the SAME tasks on MANY cores, so making something faster simply means adding cores. Gulp fails at this.

Most software that people hail as being “multithreaded” fails at being truly “multicore”. A good example of this is the Linux kernel itself, which claims to scale to 64 cores. It does, but only for selected applications and benchmarks. Linux fails at being truly multicore for other tasks, such as packet-sniffing. A great many multithreaded applications fail to scale well on Linux.

Another example is PF_RING. It uses custom drivers to bypass the inefficiencies of the Linux kernel for 10gbps speeds, but then it uses “spinlocks” instead of “atomics” for synchronization, so it fails at being multicore. After about 4 cores, adding additional cores makes PF_RING go slower, not faster.

If you want a truly scalable system, instead of going “multithreaded”, you need to cheat. Today’s packet-sniffing adapters (PF_RING, Napatech) can split incoming traffic into separate streams by hashing TCP/IP socket info. So exploit that. Buy a cheap 8-core system, use one of these adapters to create 8 streams, and buy 8 high-speed disks (like SSDs). Then simply run 8 separate instances of Gulp/Daemonlogger, each bound to a core, a stream, and a disk. When you want to analyze the traffic logged to the disks, you’ll have to recombine the streams back into a single one, but that’s not too difficult, especially when you are using a $2,000 system that would otherwise cost you $50,000.
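
The “bound to a core” part is just ordinary Linux processor affinity. Here is a rough sketch of the launcher side, my own illustration: it forks 8 workers and pins each one to its own core with sched_setaffinity(). What each worker actually runs (which receive stream it opens, which disk it logs to) depends on your adapter and capture tool, so those parts are left as comments.

    /* Sketch: fork one capture worker per core and pin each with
     * sched_setaffinity(). The per-stream capture and per-disk output
     * are placeholders that depend on the adapter and tool in use. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NWORKERS 8

    int main(void)
    {
        for (int i = 0; i < NWORKERS; i++) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); return 1; }
            if (pid == 0) {
                /* Child: pin this process to core i. */
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(i, &set);
                if (sched_setaffinity(0, sizeof(set), &set) != 0)
                    perror("sched_setaffinity");

                /* Here the worker would open hardware receive stream i
                 * and log it to disk i, e.g. by exec'ing one instance of
                 * the capture tool configured for that stream. */
                printf("worker %d pinned to core %d\n", i, i);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;                       /* reap all workers */
        return 0;
    }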

Conclusion

That original blog-post confirms your prejudices that multithreaded software is inherently better than single-threaded software, an important lesson for today’s multicore computers. But when you look deeper at it, you find that the results are suspect and that it teaches entirely the wrong lessons about multithreaded software. Gulp fails at being multicore every bit as much as Daemonlogger does.


Historical note: BlackICE, the first IPS written in 1998, was a two-threaded system, with one thread for packet-capturing (using a custom driver that looks a lot like modern PF_RING) and another thread for analysis. It had the same "producer-consumer" relationship that Gulp has. While it was multithreaded, it wasn't truly multicore, and did not scale past two cores. I don't know for sure, but I'm told that IBM (which now sells BlackICE as "Proventia") has converted the software so that it's truly multicore.


Update: One tweeter took exception to my terminology, since nobody else makes the distinction/comparison between "multithreaded" and "multicore" the way I do. But that's entirely the point. Multithreaded programming techniques were developed over decades for either single-core systems or systems with a small number of cores. Those techniques your textbooks teach you fail when you get to 12 cores, like I have on my desktop. Just because a program (like Gulp) is multithreaded doesn't mean it has solved the problem of running on all 12 cores (which it hasn't). Thus, just because something is "multithreaded" doesn't mean that it's truly "multicore".

I forget who, but somebody (Azul Systems) has created a hashtable using atomic operations that scales to 1000 cores doing insertions and lookups simultaneously. Now THAT is true multicore.


Update: The 'cheat' solution I mention above is how people run high-speed Snort, a painfully single-threaded IDS. I think it's a bastard solution to the problem, but it turns out, customers are actually quite happy with it. (Which I guess is another lesson: what matters is how much customers like the pork sausage, not how it's made).

So let's say you test this by using something like 'tcpreplay' at 10-gbps. You'll find that the solution doesn't appear to work. That's because with tcpreplay, you take packets captured from slow networks and replay them at much higher speeds. On slow networks, like your home 10-mbps connection, a single TCP connection can use up the entire bandwidth. When you replay at 10-gbps, a single TCP connection captured at 10-mbps is being replayed at 10-gbps, which causes it to be sent to a single virtualized adapter, which can't handle more than 1.25-gbps.

Thus, when testing your cheated Snort solution, you now have two separate metrics: maximum network speed, and maximum TCP connection speed.

But a truly multithreaded/multicore solution might not do any better. Packets on a TCP connection must still be processed in order, so you can't have one core process one packet on the TCP connection while another core processes another. Instead, to truly speed up the single-TCP-connection problem, you'll have to have multiple cores working together on a single packet. That's a hard problem, because chances are good that the synchronization overhead (even using lightweight atomics) will cost more than you gain. Thus, a cheating solution may actually perform better on this metric than the proper solution.

Either way, I hope IDS/IPS evaluators like NSS start measuring single-TCP-connection speed along with max-network-speed.


Update: So how can you fix PF_RING to be multicore? Well, a good lesson is how PACKET_RX_RING does it. Both similarly-named solutions do roughly the same thing: create a memory-mapped user-space ring-buffer for incoming packets. PF_RING does this with zero copies, bypassing the kernel, at 15-million packets/second; PACKET_RX_RING does this by making copies in the kernel, at 1.5-million packets/second.
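
For readers who haven't used it, the PACKET_RX_RING setup looks roughly like this (a single-consumer sketch with arbitrary example sizes and minimal error handling):

    /* Sketch: create a PACKET_RX_RING, map it into user space, and walk
     * the frames. Needs root; sizes are arbitrary example values. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <poll.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        /* Ask the kernel for 256 blocks of 128 KB, split into 2 KB frames. */
        struct tpacket_req req = {
            .tp_block_size = 1 << 17,
            .tp_block_nr   = 256,
            .tp_frame_size = 1 << 11,
            .tp_frame_nr   = ((1 << 17) / (1 << 11)) * 256,
        };
        if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) != 0) {
            perror("PACKET_RX_RING");
            return 1;
        }

        /* Map the ring; the kernel copies packets directly into it. */
        size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
        unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each frame starts with a tpacket_hdr whose tp_status says
         * whether the kernel has filled it yet. */
        unsigned frame = 0;
        for (;;) {
            struct tpacket_hdr *hdr = (struct tpacket_hdr *)
                (ring + (size_t)frame * req.tp_frame_size);
            if (!(hdr->tp_status & TP_STATUS_USER)) {
                struct pollfd pfd = { .fd = fd, .events = POLLIN };
                poll(&pfd, 1, -1);          /* sleep until packets arrive */
                continue;
            }
            printf("got %u-byte packet\n", hdr->tp_len);
            hdr->tp_status = TP_STATUS_KERNEL;  /* hand the frame back */
            frame = (frame + 1) % req.tp_frame_nr;
        }
    }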

Ring-buffers are easily synchronized in a producer/consumer fashion. If there is only one consumer, then no special synchronization is needed. If there is one producer and many consumers, then the consumers need to synchronize among themselves, but not with the producer.

PACKET_RX_RING, while slow because of interference from the kernel, allows wait-free synchronization. The 12 threads trying to read packets simply do a __atomic_compare_exchange() on the "current packet" pointer (which on x86 will be a lock cmpxchg instruction). If the operation succeeds, the current thread owns the packet; if it fails, the current thread tries again OR goes to sleep. (This synchronization also implies thread scheduling, so that threads can go to sleep, causing CPU cores to go to sleep and consume less electricity.)
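
As a concrete sketch of that claim step (generic illustrative code using GCC's real __atomic builtins, not PACKET_RX_RING's or PF_RING's actual internals): a dozen consumer threads share a "current packet" index, and each claims a frame with a single compare-and-swap, retrying only when another consumer wins the race.

    /* Sketch: many consumers claim frames from a shared ring with one
     * compare-and-swap each (a lock cmpxchg on x86). Checking whether a
     * frame has actually been filled is omitted for brevity. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define RING_FRAMES        16384
    #define CLAIMS_PER_THREAD  1000000

    static unsigned long next_frame;    /* shared "current packet" index */

    static unsigned long claim_frame(void)
    {
        unsigned long seen = __atomic_load_n(&next_frame, __ATOMIC_ACQUIRE);
        /* On failure the builtin reloads 'seen' with the current value,
         * so we just retry; a real consumer could instead go to sleep
         * when the ring is empty. */
        while (!__atomic_compare_exchange_n(&next_frame, &seen, seen + 1,
                                            false, __ATOMIC_ACQ_REL,
                                            __ATOMIC_ACQUIRE))
            ;
        return seen % RING_FRAMES;      /* this thread now owns that frame */
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < CLAIMS_PER_THREAD; i++)
            claim_frame();              /* ...then process the owned frame */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[12];
        for (int i = 0; i < 12; i++)
            pthread_create(&t[i], NULL, consumer, NULL);
        for (int i = 0; i < 12; i++)
            pthread_join(t[i], NULL);
        printf("claimed %d frames total\n", 12 * CLAIMS_PER_THREAD);
        return 0;
    }

(Compile with -pthread. The entire synchronization cost is the one CAS per claimed frame.)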

PF_RING, while otherwise fast, does numerous "spinlock()s". When trying to read a packet, threads will furiously spin, consuming vast amounts of resources, causing the system to slow down as you add more threads.

On Linux, the 'spinlock()' wait primitive is thought to be very fast because it has the best "best-case" performance: if there is no contention, it is just as fast as an atomic primitive. However, when there is a lot of contention, because you have a lot of threads, it has one of the worst "worst-case" performances, because the threads will be furiously spinning, using up system resources.
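
For contrast, here is the spin-wait pattern reduced to a toy program (my own illustration, not PF_RING's code): uncontended, acquiring the lock is a single atomic test-and-set; contended, every waiting thread burns a core re-testing the flag.

    /* Sketch: a minimal test-and-set spinlock. With 12 threads fighting
     * over it, the waiters spin at full speed instead of sleeping. */
    #include <pthread.h>
    #include <stdio.h>

    static char lock_flag;              /* 0 = free, 1 = held */
    static unsigned long shared_counter;

    static void spin_lock(void)
    {
        while (__atomic_test_and_set(&lock_flag, __ATOMIC_ACQUIRE))
            ;                           /* furiously spinning */
    }

    static void spin_unlock(void)
    {
        __atomic_clear(&lock_flag, __ATOMIC_RELEASE);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            spin_lock();                /* contention grows with threads */
            shared_counter++;
            spin_unlock();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[12];
        for (int i = 0; i < 12; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 12; i++)
            pthread_join(t[i], NULL);
        printf("counter = %lu\n", shared_counter);
        return 0;
    }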

So the upshot is that PF_RING needs to get rid of all "spinlocks" and use "atomics" instead, so that 12 cores are faster than 11 cores, and so that it allows the application to schedule threads to go to sleep instead of furiously spinning. As with PACKET_RX_RING, you shouldn't need more than one atomic compare-and-swap per packet read from the interface.

(Note: These comments come from playing with PF_RING last year. I used one 10-gig transmit adapter and another as the receive adapter. I used the built-in sample apps, off of 'dna0', that allow you to specify the number of threads. The more threads, the slower the packet receive: 1 thread did about 12-million packets/second, 12 threads did 1-million packets/second. Looking in the open-source part of the code, I saw evil spinlocks. I didn't disassemble the closed-source part of the code to see why the spinlocks were necessary.)


Update: Ten years ago, the x86 lock prefix forced an uncached memory transaction, which took about 250 clock cycles on a 3-GHz Pentium 4. Today, with integrated memory controllers, it causes an L3 cache operation, which can be as low as 25 clock cycles on a 3-GHz Sandy Bridge processor.

The upshot is that "atomic" operations were expensive in the era of "multithreaded" code, but have become ten times cheaper in the era of "multicore" code.
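
If you want to check the cycle counts on your own hardware, a crude measurement is easy. This sketch (x86-only, uncontended case, loop overhead included) times a run of lock-prefixed increments with the timestamp counter:

    /* Rough micro-benchmark: cycles per atomic increment, measured with
     * the TSC. Results vary by CPU and say nothing about contention. */
    #include <stdio.h>
    #include <x86intrin.h>              /* __rdtsc(), x86 only */

    int main(void)
    {
        long counter = 0;
        const long iters = 10 * 1000 * 1000;

        unsigned long long start = __rdtsc();
        for (long i = 0; i < iters; i++)
            __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);  /* lock xadd */
        unsigned long long cycles = __rdtsc() - start;

        printf("%.1f cycles per atomic increment (uncontended)\n",
               (double)cycles / iters);
        return 0;
    }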

6 comments:

Luca Deri said...

I am the author of PF_RING and I would like to add some words about it. The use of spin/rw-locks is in the user-space library, and only if a user decides to delegate the locking to PF_RING because he has multiple threads accessing the same socket/ring. Regardless of all the disquisition about locks you have made in the article, the performance penalty that you see when multiple threads access the same ring is NOT (mostly) due to the locking model you use, but to the fact that it's a bad application design. This is because whenever a thread has to stop on a lock, the CPU cache will be emptied, and thus the performance degradation. Using spin locks is actually a good way to avoid that effect, but again, due to this poor 1:n application design, you will see the loss.
For this reason in PF_RING we have introduced the concept of clustering, where multiple consumers can read packets without any lock. In DNA, if you want to use multiple threads, you need to configure the NIC with multiple RX queues, so that each thread can read packets from a queue without any lock.
In conclusion, do not mix arguments about multicore/threads with poor application design. They are two different topics.

Robert Graham said...

You still don't understand multicore -- the idea of applying multiple cores to the same task.

Imagine a multicore IDS where a badly written PCRE regular expression can consume a lot of CPU time. With your "clustering", a delay in processing a single packet will cause packet loss on whichever receive queue the thread is tied to.

One solution is to have a STAGED PIPELINE design, where "capture" threads remove packets from the receive queue and hand them off to separate "analysis" threads. But that's a traditional multithreaded design, not a modern multicore design, and it introduces a lot of synchronization overhead, reducing multicore scalability.

Now, we can fix your design by allowing the threads to roam across the RX queues, instead of tying a thread to a queue. Or, we could do something like have 4 receive queues but 8 threads, reducing synchronization overhead per queue.

But it would be better to simply fix PF_RING to make it multicore. There is no reason to use spinlocks when atomics are so cheap and easy.

Robert Graham said...

BTW, Luca, I've been working with "NIC DMAs into user-mode ring" for decades (plural). They've all been closed source, though.

Intel was an investor in my company [BlackICE]. We sat down with their engineers and discussed NIC design and how they could make things easier for us, such as by hashing TCP/IP connections into multiple queues, so that multiple instances of an IPS could feed from the queues without requiring multithreaded synchronization. That was back when the 'lock' prefix cost hundreds of clock cycles, instead of the 25 clock cycles it costs today.

I don't know if we were important in Intel's decision to add the feature, but we did talk to them about it years before they added multiple receive queues to their hardware.

Luca Deri said...

As you said you are an expert, would you please share some code fragments/samples I could use in PF_RING that show me how to address the atomic problem you reported? Thanks.

rossjudson said...

I've built a 64 bit int -> 64 bit int lock-free data structure based on Click's NonBlockingHashtable (Cliff Click is also an author of the hotspot VM). These data structures are based on a state model, and use CAS to transition between states. There are a few places where an operation might be repeated once or twice on a CAS failure. These repeats have read barriers between them, and are designed such that there's never more than a few instructions happening before the CAS will succeed again. Packing the keys and values into the same cache line helps a lot too. The upshot is linear scaling with some pretty impressive performance. My old quad core i7 can do over 40 million concurrent updates a second, and it scales linearly to that point. It's time for me to buy new hardware and see how fast it can go. ;)

You can also memory-map these structures and achieve the same kinds of speeds, with full persistence. CAS doesn't stop working when the underlying memory is mapped.

Robert Graham said...

Do you have a link? I'd love to play with it.