Monday, August 20, 2012

Software networks: commodity x86 vs. network processors

“when Alexander saw the breadth of his domain he wept for there were no more worlds to conquer”

The website http://extremetech.com has a great graph showing how commodity Intel x86 processors have overtaken the world, first desktops in the 1980s, then the data center in the 1990s, then supercomputers in the 2000s. So what’s next for Intel to conquer?

There are two answers: mobile (phones, pads) and network appliances. You are probably aware of the first, with the battle raging with Windows8/Android on Intel “Atom” processors competing against ARM processors, but you might not have heard of the second fight.

Today’s network appliances (switches, routers, firewalls, etc.) are quickly transitioning from “hardware” to “software”. In the past, the way to get funding from venture capitalists would be to demonstrate how your competitive advantage would be based on a custom chip. Today, the VCs are throwing money at “software defined networks”. Hardware is dead, the future is software. In much the same way that Gartner once declared IDS was dead because it wasn’t based on hardware, future Gartner will say a thing is dead because it’s not based on software defined networks.

This doesn’t necessarily mean Intel x86 software, though. Today’s network appliance is built from what are known as “network processors”. These are multicore RISC (ARM, PowerPC, MIPS) chips with a few “fixed-function” units for offloading tasks like crypto and compression. They aren’t really any faster than Intel chips, but they consume less electrical power. They can do this because networking is an “embarrassingly parallel” task that can be split across many CPU cores. This allows the chips to have many cores clocked at lower speeds (typically 1 GHz), conserving power. In addition, the fixed-function units use less power than doing the same thing in software.

This poses a problem for Intel. Since network processors are just multicore CPUs running Linux, they can easily grow from network appliances to the rest of the data center and even supercomputers, taking back the markets that Intel has won over the last two decades. In fact, it’s a mystery why companies like Cavium or Broadcom/RMI aren’t already selling servers based on their designs. Because of this competitive pressure, Intel is attacking the network appliance market with a vengeance.

You see this in the latest changes to Intel’s desktop processors. There are oodles of small design tweaks for network appliances. Rather than fixed-function units, Intel prefers to extend its instruction set, such as adding instructions for AES encryption rather than a fixed-function AES unit. Rather than adding features dedicated solely to network processing, Intel adds them in a way that benefits all markets. For example, “hyperthreading” is a critical need for network processing, but also helps other applications to a lesser extent.

Some features really only apply to network appliances. The latest processors support a feature called “direct cache access”, where packets are DMAed from the network adapter into the CPU cache, bypassing memory. This provides little benefit for packets making the tortuous route through the Linux kernel, so it doesn’t help desktops or servers. But it results in a major improvement for network appliances that bypass the Linux network stack.

All these changes mean that software running on commodity Intel desktops can in fact beat expensive “hardware” solutions. For example, the PF_RING project benchmarks forwarding packets at less than 4 microseconds (micro, not milli), which is faster than is possible with most network processors.

You might be skeptical at this point, especially if you’ve benchmarked a commodity Intel x86 system recently. They come nowhere near network wirespeeds. The problem here isn’t the hardware, but the software. The network stack in today’s operating systems is painfully slow, incurring thousands of clock cycles of overhead. This overhead is unnecessary. Custom drivers (like PF_RING or DPDK) incur nearly zero cycles of overhead. Even the simple overhead of 100 cycles for reading the packet from memory into the CPU cache is avoided.

A standard Linux system that struggles at forwarding 500,000 packets-per-second can forward packets at a rate of 50 million packets-per-second using these custom drivers. Intel has the benchmarks to prove this. This is a 100 to 1 performance difference, an unbelievable number until you start testing it yourself.
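
To get a feel for what those rates mean per packet, here is a rough back-of-the-envelope sketch. The 3 GHz clock and the single core are assumptions picked for illustration, not numbers from Intel’s benchmarks; only the two packet rates come from the text above.

```python
def cycles_per_packet(clock_hz, cores, packets_per_second):
    """Average CPU cycles available for each packet, summed over all cores."""
    return clock_hz * cores / packets_per_second

CLOCK_HZ = 3_000_000_000   # assumed 3 GHz clock (illustrative only)
CORES = 1                  # assumed core count, same for both cases

# Standard Linux stack: 500,000 packets per second
linux_budget = cycles_per_packet(CLOCK_HZ, CORES, 500_000)

# Custom zero-overhead driver: 50 million packets per second
bypass_budget = cycles_per_packet(CLOCK_HZ, CORES, 50_000_000)

# The ratio does not depend on the assumed clock or core count
speedup = linux_budget / bypass_budget

print(f"Linux stack:   {linux_budget:.0f} cycles per packet")
print(f"Custom driver: {bypass_budget:.0f} cycles per packet")
print(f"Ratio:         {speedup:.0f} to 1")
```

Under these assumptions the standard stack spends about 6,000 cycles on each packet, in line with the “thousands of clock cycles of overhead” mentioned earlier, while the custom driver has to get by on about 60.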

The problem you have to wrap your mind around isn’t “custom hardware to go faster” but “custom software to go faster”.

Intel helps you with their “DPDK” or “Data Plane Development Kit”. This kit provides not only the zero-overhead driver for the network cards, but also a large software library with things like lock-free data structures for scaling to massively multi-core processors on NUMA systems. With Intel’s DPDK, you can quickly prototype a network application running at 10 Gbps / 15 Mpps wirespeed on a commodity desktop computer. (The term “data plane” distinguishes high-speed packet processing from the traditional “control plane” stack found in an OS like Linux.)
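
The figure of 15 million packets-per-second falls out of standard Ethernet framing: every frame on the wire is preceded by an 8-byte preamble and followed by a 12-byte interframe gap, and the smallest frame is 64 bytes. A small sketch of the arithmetic (the function and constant names are mine, chosen for illustration):

```python
PREAMBLE_BYTES = 8         # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # mandatory idle time between frames
MIN_FRAME_BYTES = 64       # smallest Ethernet frame, including the FCS
MAX_FRAME_BYTES = 1518     # largest standard (non-jumbo) frame

def max_packets_per_second(link_bps, frame_bytes):
    """Upper bound on frames per second for a given link speed and frame size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits

LINK_BPS = 10_000_000_000  # 10 Gbps

pps_small = max_packets_per_second(LINK_BPS, MIN_FRAME_BYTES)
pps_large = max_packets_per_second(LINK_BPS, MAX_FRAME_BYTES)

print(f"64-byte frames:   {pps_small / 1e6:.2f} million packets per second")
print(f"1518-byte frames: {pps_large / 1e6:.2f} million packets per second")
```

Minimum-size frames work out to roughly 14.88 million packets-per-second, usually rounded to 15. That is the worst case an appliance has to survive. With full-size frames the same link carries under a million packets-per-second.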

There is also extensive third-party support, such as 6wind’s multi-core data-plane network stack, which works both on Intel and network processors.

You also have to learn new programming techniques. For example, historically it was believed that you had to put everything in the kernel to go faster, but in today’s world, you want everything in user-mode. These zero-overhead drivers DMA packets into memory that’s mapped into user-space, meaning zero packet copies and zero context switches. Likewise, by using “large pages”, you get all the advantages of user-space memory protection, with none of the speed disadvantages. User-mode is important. It makes software development vastly cheaper, while making the platform more stable. It currently lacks some of the libraries and support that exist for kernel-mode code, but this is rapidly changing.
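
One reason large pages help is that far fewer page-table translations are needed to cover the same packet memory, so the CPU’s translation cache (the TLB) misses less often. A rough sketch, assuming the usual x86-64 page sizes of 4 KiB and 2 MiB and a hypothetical 1 GiB packet buffer pool (the pool size is made up for illustration):

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def pages_needed(buffer_bytes, page_bytes):
    """Number of pages (and so address translations) needed to cover a buffer."""
    return -(-buffer_bytes // page_bytes)  # integer ceiling division

POOL_BYTES = 1 * GIB  # hypothetical packet buffer pool

small_pages = pages_needed(POOL_BYTES, 4 * KIB)
large_pages = pages_needed(POOL_BYTES, 2 * MIB)

print(f"4 KiB pages: {small_pages} translations")
print(f"2 MiB pages: {large_pages} translations")
```

With these assumed sizes the pool needs 262,144 translations using standard pages but only 512 using large pages.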

Conclusion

The benchmark for “software” isn’t what you get out of a Linux (or other OS) network stack – that’s 100 times slower than the theoretical speed of the hardware. Instead, the benchmark is the same hardware running different software, such as the open-source PF_RING driver or Intel’s DPDK.

If you are building network applications or network appliances, stop whatever you are doing and go play with PF_RING or DPDK. Benchmark their sample applications, prototype something similar to your own application and benchmark that. Use this as your baseline. You’ll probably find that it outperforms that “hardware accelerated” product you’ve been building.


Appendix: Older posts on this same topic: here and here.
Appendix: BTW, my old IDS/IPS from back in 1998, "BlackICE", was pure software running on Windows (but with zero-overhead drivers that bypassed the OS), and it outperformed hardware solutions. I point this out because so many people believe it can't work.

8 comments:

Anonymous said...

why bother with this when you could just use BSD + Netgraph and bypass the software layer entirely?

Robert Graham said...

Because "BSD + Netgraph" doesn't handle wirespeed of 15 million packets/second.

Richard branson said...

Hello, Firstly thanks to post such a good article. I found it useful because it cleared my confusions about topic what you written. Regards.

bullion tips

profit.biz said...

Great, this is the best trading fundamentals that you have ever shared here. I appreciate your calls. keep share such profitable calls.

MCX Tips
Commodity Tips

Christopher Clark said...

@intel_chris here. As someone who is working on the dpdk you mention and has also been involved with hardware accelerators inside Intel and thus very biased, let me make a couple of points. First, Intel does do separate accelerators when they make sense. However, we also measure the cost it takes to get the data to the accelerator and back, the "offload cost". When the task can be done more efficiently on a core than the cost of moving the data to the accelerator and getting the results back, the result isn't an acceleration and we do it in software. Not only is the performance and wattage efficient, but it is cheaper for the software developer who doesn't have to rework their code to make it function with an accelerator. There can be other (e.g. security) advantages to keeping the data only on the core and off an external bus.

software companies said...

The blog was absolutely fantastic! Lots of great information and inspiration, both of which we all need! Keep 'em coming... you all do such a great job at such Concepts... can't tell you how much I, for one appreciate all you do!

AimIT Software- Software Development company

Tyson Supasatit said...

We just wrote a post explaining how the ExtraHop architecture takes advantage of multicore processing and OS bypass to achieve superior packet-processing performance. Our co-founders were the technical architects behind TMOS and the BIG-IP v9 at F5, where they learned how to ride the price-performance curve afforded by Moore's Law and beat other load-balancer vendors that depended on custom hardware.

We just announced real-time analysis for 400,000 tps at a sustained 20Gbps: http://www.extrahop.com/post/blog/good-reads/20gbps-realtime-transaction-analysis-2/

In any case, this post was an inspiration!