Thursday, February 04, 2010
Nehalem vs. IDS
I think this is important because, while more expensive systems ("hardware IDS") may be faster or have more features, it's the software IDS on cheap desktop processors that defines the mainstream intrusion-detection industry. Indeed, unless the more expensive hardware vendors continue to innovate, the cheap software systems will overtake them. What's impressive about Intel's latest chip is that it contains more theoretical processing power than the hardware-based IDS of just a few years ago -- as long as the software can be written to take advantage of it.
What is Nehalem?
Intel revamps its processors every few years. While the underlying technology changes, the product names usually remain the same. Thus, the latest processors have names like "Core", "Xeon", "Pentium", or "Centrino" -- the same product names as the previous generation. While the product names may look the same, the new Nehalem technology is very different from the previous generation (Core 2, a.k.a. "Merom", a.k.a. "Conroe").
By "Nehalem", I mean the features Intel has added in this new chip that were lacking in previous processors.
Past generations used tricks to get to 4 cores, such as putting two 2-core chips in the same package. However, each 2-core chip had its own cache, its own system interface, and so on. This pseudo 4-core design performed worse than a native 4-core design.
In Nehalem, all 4 cores share a common cache and memory controller, making multithreaded code much more efficient.
More importantly, Intel has designed the chip to scale to even more cores. In a few weeks, they will be shipping cheap 6-core desktop processors and expensive 8-core server processors.
Nehalem reintroduces "hyperthreading", the ability for each core to run two threads simultaneously. Thus, a standard desktop supports 8 threads (on 4 cores), and the largest server systems will support 128 concurrent threads (for example, eight 8-core processors with two threads per core).
We knew several years ago that the world was going multicore. With Nehalem, that world has arrived. IDS software has to be redesigned to take advantage of this.
Unfortunately, today's IDS software is essentially single-threaded. This means it can only take advantage of roughly 1% of the power of the biggest Nehalem systems (one thread out of 128). Software IDS needs to be rewritten from the ground up to support massive multithreading. It should support 16 threads at minimum, with the capability to scale to 128 threads in the future.
The second difficulty is that synchronization is expensive. Simply splitting a task between two threads will usually cost more in synchronization than it gains from using two processors.
The last generation (Core 2 "Conroe") made synchronization faster, and Nehalem makes it faster still. This makes it more likely that engineers can write software to take advantage of all the threads.
Lower memory latency
In the older Intel systems, it takes roughly 250 clock cycles to fetch data from main memory. Imagine a gigabit intrusion-detection system running on a 3-GHz processor and processing 3 million packets/second. This leaves only 1000 clock cycles per packet, or only 4 memory accesses per packet.
Hardware IDS solves this problem by using expensive low-latency memory. Software IDS solves it by trying to "prefetch" as much as it can into a small high-speed cache. Even with software tricks, latency lags behind hardware solutions.
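The budget above is simple division, and it can be handy to have it as a function when plugging in other clock speeds or packet rates. This is a minimal sketch; the function names are made up for illustration, and it assumes every memory access is a full cache miss with no overlap between fetches.

```python
def cycles_per_packet(clock_hz, packets_per_second):
    """Clock cycles available to process one packet."""
    return clock_hz / packets_per_second


def memory_accesses_per_packet(clock_hz, packets_per_second, miss_latency_cycles):
    """How many back-to-back main-memory fetches fit in one packet's budget.

    Assumes each fetch is a full miss and that fetches do not overlap.
    """
    return cycles_per_packet(clock_hz, packets_per_second) / miss_latency_cycles


# The numbers from the text: 3 GHz, 3 million packets/second, 250-cycle fetch.
budget = cycles_per_packet(3e9, 3e6)
accesses = memory_accesses_per_packet(3e9, 3e6, 250)
print(budget, accesses)
```

Anything that lowers the per-fetch latency raises the number of accesses that fit in the same 1000-cycle budget.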
Nehalem improves this by copying an idea from AMD, and improving upon it. Several years ago, AMD combined its glue logic ("chipset") with the processor. This got rid of the middleman that sat between the processor and memory, lowering latency. As a result, AMD has been the best choice of processor for latency-sensitive applications. Nehalem is Intel's first processor that similarly combines the memory controller with the processor. Their current design is even better than AMD's. It cuts memory latency nearly in half compared to Intel's previous generation.
In addition, the "hyperthreading" feature also hides memory latency. In hyperthreading, a single core runs two threads simultaneously. When one thread has to stop for 100 cycles waiting for data to arrive from main memory, the other thread gets to run at full speed.
The combination of on-board memory controller and hyperthreading means that memory latency is much less of a concern than it was in previous software IDS designs.
More memory bandwidth
Intrusion-detection is a low-bandwidth application. While 10-Gbps might seem fast to us, it's slow compared to the 100-Gbps memory bandwidth that processors have.
Yet, more memory bandwidth can still help intrusion-detection. One way to solve the memory latency problem is heavy prefetching of data that the code might need. The more you do this, the more you saturate the memory bus.
Nehalem provides the extra bandwidth with three separate channels to memory. This is why your Nehalem motherboard has 3 or 6 slots for memory. Multiple channels mean that while one channel is busy fetching memory, it won't interfere with another channel.
The benefit to pattern matching is not as much as you would think. Typical pattern-match DFAs and NFAs have limitations other than just comparing bytes. In my experience, these new instructions do little to improve the "average case" for pattern matching, although they do seem to improve the "worst case" by quite a lot.
These instructions can speed up other things, such as BASE64 decoding, gzip decompression, or AES decryption. Thus, their biggest value may not be in pattern-matching, but in adding more features to IDS.
4-cycle L1 cache
This is a bad thing. Previous generations of processors had a 3-cycle level 1 cache latency. Nehalem slows this down to 4 cycles. This wouldn't be bad if they had bumped up the base frequency from 3 GHz to 4 GHz, but they haven't really increased speed by that much.
This won't matter for most code, but intrusion-detection systems match patterns using something called a DFA. The maximum theoretical speed of a DFA is governed by level 1 cache latency, with the cost of one level 1 cache hit per byte.
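To make the "one cache hit per byte" point concrete, here is a minimal sketch of a table-driven DFA that searches for a single keyword. It is not how any particular IDS builds its automaton: the table is built with the textbook Knuth-Morris-Pratt construction, the function names are made up for illustration, and Python is used only for readability. The part that matters is the inner loop, where each lookup needs the state produced by the previous lookup. The last function turns that into a rough upper bound on speed, under the simplifying assumption that the transition table sits entirely in level 1 cache.

```python
def build_dfa(pattern):
    """Build a dense transition table for finding `pattern` in a byte stream.

    table[state][byte] gives the next state; state len(pattern) means "matched".
    `pattern` must be a non-empty bytes object.
    """
    m = len(pattern)
    table = [[0] * 256 for _ in range(m + 1)]
    table[0][pattern[0]] = 1
    restart = 0  # state to fall back to after a mismatch
    for state in range(1, m + 1):
        # Start from the fallback state's row, then override the matching byte.
        table[state] = table[restart][:]
        if state < m:
            table[state][pattern[state]] = state + 1
            restart = table[restart][pattern[state]]
    return table


def count_matches(table, data):
    """Run the DFA over `data`, counting (possibly overlapping) matches."""
    accept = len(table) - 1
    state = 0
    hits = 0
    for byte in data:
        # One dependent table lookup per input byte: the next lookup cannot
        # start until this one has returned the new state.
        state = table[state][byte]
        if state == accept:
            hits += 1
    return hits


def dfa_peak_bytes_per_second(clock_hz, l1_latency_cycles):
    """Upper bound on one core's DFA speed if each byte costs one L1 hit."""
    return clock_hz / l1_latency_cycles


dfa = build_dfa(b"GET")
print(count_matches(dfa, b"xxGETyyGEGET"))
print(dfa_peak_bytes_per_second(3e9, 3), dfa_peak_bytes_per_second(3e9, 4))
```

Under that assumption, a 3-GHz core tops out at about 1 billion bytes/second with a 3-cycle cache and about 750 million bytes/second with a 4-cycle cache, which is why the extra cycle hurts unless the clock rises to match.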
Numerous other processor improvements mean that overall the processor will be faster, but this dis-improvement disappoints me.
An increasing concern is not speed but electrical power usage. The faster computers get, the more electricity they use. This in turn makes designing cool data centers harder.
Synchronization between threads is often done through something called "spin locks", where a thread goes into a tight loop testing a value in memory waiting for it to change. This consumes a lot of electrical power.
Intel introduced an instruction several generations ago called "mwait". It does the same thing as a spin lock (waiting for a value in memory to change), but instead of spinning and executing instructions as fast as it can, it stops and waits.
In Nehalem, individual processor cores can go into a deep sleep state. Thus, not only does the processor stop spinning through instructions, the entire core gets shut off. The memory controller monitors the value in memory, waiting for it to change, then wakes the core back up again.
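The difference between the two styles of waiting can be sketched in a few lines. This is only an analogy: "mwait" is a hardware instruction, while the sketch below uses Python's `threading.Event`, which blocks through the operating system. The helper names are made up for illustration. What carries over is the contrast between a thread that keeps executing instructions while it waits and one that sits idle until it is signaled.

```python
import threading


def spin_wait(flag):
    """Busy-wait: poll the flag in a tight loop, using CPU the whole time."""
    polls = 0
    while not flag["ready"]:
        polls += 1
    return polls


def blocking_wait(event):
    """Blocking wait: the thread sleeps until another thread signals it."""
    return event.wait(timeout=5.0)


# Phase 1: spin until a timer thread flips the flag 50 ms from now.
flag = {"ready": False}
setter = threading.Timer(0.05, lambda: flag.update(ready=True))
setter.start()
polls = spin_wait(flag)
setter.join()

# Phase 2: sleep until a timer thread sets the event 50 ms from now.
event = threading.Event()
signaler = threading.Timer(0.05, event.set)
signaler.start()
woke = blocking_wait(event)
signaler.join()

print("polls while spinning:", polls, "| woke by signal:", woke)
```

Both phases wait the same 50 milliseconds, but the spinning version executes its loop over and over during that time, while the blocking version does no work at all until it is woken.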
This matters for intrusion-detection. A box has to be configured for peak network traffic where all 4 cores of a multicore chip will be active. Most of the time, though, network traffic will be below that peak. Because of MWAIT, cores can be turned off, consuming a fraction of the peak power.
If you do the math, this only comes out to $200 per year per IDS. This is not a lot of money compared to a high-end IDS that costs $100,000, but it matters to people who put freeware on a $500 box. It also matters to people who design expensive data centers and want the coolest machines in them, so that they don't have to spend as much on air conditioning.
This is not strictly a Nehalem feature, but it's worth mentioning here.
Intel's gigabit Ethernet cards are designed for IDS. They will happily sniff incoming packets and put them in a ring buffer with little interaction by the host CPU. This isn't how the operating system likes to see packets, which means the standard operating-system driver will interfere with this process, slowing things down. A custom driver that bypasses the operating system means that you should be able to sniff packets at 1 Gbps, or even 10 Gbps.
The chips support additional features, such as doing checksum calculations on the packets.
In theory, the chips should also be able to hash incoming TCP packets (IP and port), and send the packet directly to the cache of the processor core that will process that packet. I'm not sure I trust this works right, but it's something to keep in mind.
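The idea of hashing on IP and port to pick a core can be sketched in software. This is an illustration under stated assumptions, not what the Ethernet chip actually does: CRC32 stands in for whatever hash the hardware uses, the function names are made up, and sorting the two endpoints (so that both directions of a connection land on the same core) is a design choice that suits an IDS rather than something the text above specifies.

```python
import zlib


def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Build a direction-independent key for a TCP connection.

    The two endpoints are sorted so that client-to-server and
    server-to-client packets produce the same key.
    """
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return "{}:{}-{}:{}".format(low[0], low[1], high[0], high[1]).encode()


def pick_core(src_ip, src_port, dst_ip, dst_port, num_cores):
    """Map a packet to a core index in the range [0, num_cores)."""
    return zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port)) % num_cores


# Both directions of the same connection go to the same core.
outbound = pick_core("10.0.0.5", 49152, "192.0.2.7", 80, 8)
inbound = pick_core("192.0.2.7", 80, "10.0.0.5", 49152, 8)
print(outbound, inbound)
```

Keeping a whole connection on one core means the per-connection state never has to be shared between threads, which sidesteps much of the synchronization cost.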
I'm playing around with those new instructions to see what I can do with them to speed up pattern-matching and protocol-analysis. It's not going to be a vast improvement, but they should be interesting.
The biggest change is multiprocessor designs. Software has to run on not just 2 or 4 processors, but 16 processors. Massive multithreading is difficult to design. It's something software IDS is going to struggle with for some time.