Thursday, July 10, 2008

Wireshark "TurboCap"

I just noticed that CACEtech is now selling a sniffing adapter "TurboCap for Windows. (CACEtech is the company that funds Wireshark development - and if you are a cybersecurity geek, you should have experience with Wireshark).

This product addresses the problem that operating-systems (Windows, Linux, BSD, et al.) are not optimized for sniffing packets. Thus, if you wanted to sniff a fast network with Wireshark, you'd be lucky sniffing at a rate of 300,000 packets per second. This product claims to allow sniffing at 3-million packets per second, which is the max theoretical speed for full-duplex gigabit Ethernet.

This product addresses the problem with a custom driver. You cannot use this card for normal networking. Although it physically is a network card, it will not appear as one of the network cards under Windows. It is a special "sniffing" card instead. It will only be available for custom sniffing applications, such as Wireshark.

The first product that replaced the network stack with a custom sniffing driver was the Network General "Sniffer"™ back in the 1980s. This is the product that gave us the name "packet-sniffer". It was the first to achieve "wire-speed" sniffing performance.

Many sniffing products have since used this idea. I used to work at Network General. When I founded my own company and created the BlackICE intrusion-detection system in 1998, I likewise used this concept. We wrote a custom sniffing driver for the 3c905 hardware. This happened to be the chip used in Dell notebooks of the time. The upshot of this was that my Dell notebook could do wirespeed 100-mbps intrusion-detection while other products at the time struggled at 10-mbps. This was an unbelievable speed back in the day, although custom drivers are more common now, so most intrusion-detection products now support wirespeed.

When writing a custom driver, or tweaking the existing drivers for better speed, there are a number of issues that you address.


Standard network drivers use tiny buffers, often a mere 64k. You want a lot more for a sniffing application. You might allocate a 100-megabyte buffer within the driver for holding packets.


In order to fit variable sized packets into a tiny buffer, most cards will fragment the packets in to 64 byte, 128 byte, or 256 byte chunks. The network driver must then reassemble the fragments back into whole packets before sending them up the network stack. Note that this is a wholly different sort of fragmentation at the hardware level unrelated to the fragmentation that occurs at the TCP/IP level.

A good choice for fragment size is 2048 bytes. It's large enough to hold standard Ethernet packets without needing reassembly. Only GigE jumbo frames would need to be reassembled.


The operating-system stack is designed so that incoming packets cause a hardware interrupt. This causes the operating system to halt its current task, run the driver code to deal with incoming packet, then resume. Handling an interrupt is efficient if there are few of then (less then 10,000 per second), but extremely inefficient if there are many. Sending 3-million interrupts per second at a typical operating system will cause it to lock up.

The alternative solution is "polling", where the software constantly tries to read the next packet. This means that there is no overhead from interrupts if the packets are arriving very fast, but means that the CPU is pegged at 100% utilization even if there is no traffic at all.

A hybrid method is to poll on a timer interrupt. In this method, you set up a timed interrupt (such as 10,000 per second), then poll the card to see if any packets have arrived since the last interval.


There are two ways of getting packets off the network card into memory. The first is "programmed input-output". In this mode, the CPU reads the bytes from the network chip, and then writes them into main memory. The second method is "direct memory access (DMA)". In this method, the network card writes the packets directly to memory, bypassing the CPU.

The CPU still needs to be involved with DMA. It must tell the adapter where buffers are in memory. The driver must continually refresh the list of free buffers. Thus, as the code processes incoming packets, it will free up those buffers and send them back to the network hardware for reuse in DMA.


In much the same way that handling an interrupt is expensive, there is a lot of overhead in transferring control from driver (which runs in kernel-mode) to the application (which runs in user-mode). You would likewise lock up the system trying to do this for every packet at 3-million packets per second.

The trick to get around this is to map the buffer in both kernel space and user space. In this manner, the user-mode sniffing application can read packets directly from the buffer without a kernel-mode transition.


Consider a 3-GHz CPU trying to sniff packets on a full duplex Ethernet at 3-million packets per second. Simple math shows that you have only 1000 CPU cycles per packet. That is your "budget".

This budget gets used up pretty fast. The problem isn't necessarily the number of instructions that can be executed (CPUs can execute multiple instructions per cycle), but memory access. The CPU can access a register in 1-cycle, first level cache in 3-cycles, second level cache in 25-cycles, and main memory in 250-cycles. In other words, if the software attempts to read memory, and it's not in the cache (a "cache miss"), it must stop and wait 250-cycles for the data to be read.

Thus, a 1000-cycle-per-packet budget equates to a 4-cache-miss-per-packet budget.

The packets are DMAed by the driver into memory, but not the cache. Therefore, reading the first byte of a packet will result in a cache miss. This leave only 750-cycles remaining.

In addition, the header information (packet length, timestamp, etc.) are located in a different place in memory. This will also cause a cache miss, leaving only 500-cycles left. Multiple CPUs must be synchronized with a full memory access, which has the same cost as a cache miss. This leaves 250-cycles left to process the packet. If you do something like a TCP connection table lookup, you've probably got another cache miss. These leaves 0-cycles left to process the packet.

Thus, we've quickly exceeded our CPU budget without actually doing anything.

With most drivers, you can locate the packet headers with the packet data. By combining them into the same location, they can be read together without a separate cache miss. CPU's have a pre-fetch instruction. The packet-read API can be implemented so that whenever the software reads the current packet, it "pre-fetches" the next packet into cache. Thus, the headers and data will be available next time without a cache miss.

If you use a ring buffer and a producer-consumer relationship, you won't need to use a traditional memory lock to synchronize the driver with the application. If you are even more clever with your sniffing API, you pre-fetch several future packets, and you allow the application to peek at the next packet allowing it to pre-fetch its TCP connection entry.

Putting this all together, I've proven that you'll need at least 4-cache misses to process a packet, and thus handling 3-million packets per second is impossible. Then, I've shown tricks to show how you can get around this and process a packet without any cache misses.


There are a long list of other optimizations you can do. For example, you'll want to align your buffers on cache-line boundaries. You'll also want to set the processor affinity flags so that the driver uses one CPU core while the user-mode process uses the remaining cores.


CACEtech claims "wirespeed" performance, which implies 3-million packets-per-second. I don't know if they've implemented all these tricks. Their cards are a little pricey ($900 each), so I'm not willing to buy one just to play with it. However, for anybody running network tools like Wireshark or Snort, they should logically give a huge boost in performance.


Anonymous said...

Rob - I agree that the only analyzers that come close to FDX 1 or even 10 Gig capture rates are the dedicated hardware ones. That is the reason that people have to learn how to focus on the packet flows that they need to review/troubleshoot - Filtered focus. You cannot swallow an iceberg whole, so you chip out what you need. Even the 1 Gig Sniffer could not handle the 3MPPS most of the time and when it could it was for a very short time.
However for the money the TurboCap is quite good and has been tested to handle reasonable rates. Remember that ClearSight, Fluke and Wildpackets analyzer use the Wireshark capture and decode engine. I enjoyed your disclosure on how to optimize capture rates as we did learn alot at Sniffer. Try to come to Sharkfest 2009, at Stanford June 15-18 and get to mingle with the developers and see Harry and Len with some other impressive people like Larry Rogers and Stephen Stuart the main developer behind the new Google M-Lab, who works for Vint Cerf.
A good developer like you might be able to get the Wireshark community excited to use the methods we learned at NGC to make Wireshark better for the ~50 million+ people that have downloaded it. I was at NGC from 88 to 99. We started this industry. TimO

Robert Graham said...

Hmmm, I failed to make it clear. I believe that TurboCap PROBABLY works as advertised. It uses an Intel network chip similar to the one I used in BlackICE/Proventia, which has been able to capture, analyze, retransmit at 3-million-packets-per-second SUSTAINED for many years now.

WinPcap is very good, better than the AVERAGE Unix/BSD/Linux solution.

You do realize that as a libertarian, I am vehemently opposed to net-neutrality, and Vint Cerf's lobbying the government to regulate the Internet on behalf of large monopolies like Google. I have opposed Vint Cerf's leftist politics since the early 1990s.