See [Part 2] of this series.
I was reading this blog post about ASICs. ASICs are like herbs. While scientists cannot find any benefits to having herbs in shampoo, the public widely believes they make a difference. Therefore, it's impossible to find shampoo that doesn't have herbs in it.
There are lots of firewalls, e-mail appliances, and intrusion-prevention products that don't have herbs^H^H^H^H ASICs, yet manage to have good performance. For example, my (former) Proventia product could handle 5 Gbps of real-world traffic with 30 microseconds of latency. And it did this without taking shortcuts. When you look under the hood of ASIC-based systems, you'll find that it's not the ASIC that makes them fast, but some sort of sacrifice they've made (such as not analyzing HTTP responses).
To make code run as fast as ASICs, we have to use special techniques. For example, imagine writing a high-performance DNS server. When a packet arrives in a buffer via DMA, the corresponding cache lines are invalidated. Reading the first bytes of the packet will therefore cause a cache miss, which halts the processor for 300 cycles. Likewise, when resolving a random name, the table entry for that name is unlikely to be in the cache, which is another 300-cycle hit. On a multi-processor system, threading locks require locked bus transactions, which can cost as much as 600 cycles.
When trying to process 3 million requests per second on a 3-GHz x86 processor, you have only 1000 cycles per request (3 billion cycles divided by 3 million requests). The stalls above would seem to demand a minimum of 1200 cycles, but you can use tricks to get under budget. When processing an incoming request, you can execute a cache-prefetch instruction on the next packet. That packet will already be in the cache by the time you get to it, avoiding the cache miss when you start processing it. Likewise, when you get to the DNS name, instead of reading the table entry immediately, you can execute a prefetch on it and continue processing the previous packet. Lastly, instead of using normal synchronization primitives that lock the bus, you can construct the code around producer-consumer queues that don't require bus locking.
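The prefetch-the-next-packet trick can be sketched in a few lines of C. This is a minimal illustration, not code from any real product: the packet buffer, sizes, and `process_packet` body are made up, and `__builtin_prefetch` is the GCC/Clang intrinsic that emits a prefetch hint.

```c
#include <stdint.h>
#include <stddef.h>

#define PKT_SIZE 1500
#define NPKTS    64

/* Hypothetical packet buffers; in a real server these would be filled by DMA. */
static uint8_t packets[NPKTS][PKT_SIZE];

static unsigned process_packet(const uint8_t *pkt)
{
    /* Stand-in for real parsing: touch the first bytes of the packet,
       which is exactly the access that would miss the cache. */
    unsigned sum = 0;
    for (size_t i = 0; i < 64; i++)
        sum += pkt[i];
    return sum;
}

unsigned process_all(void)
{
    unsigned total = 0;
    for (size_t i = 0; i < NPKTS; i++) {
        /* Issue a prefetch for the NEXT packet before parsing this one,
           so its cache lines are streaming in while the CPU is busy. */
        if (i + 1 < NPKTS)
            __builtin_prefetch(packets[i + 1], 0 /* read */, 0 /* low locality */);
        total += process_packet(packets[i]);
    }
    return total;
}
```

The key point is that the prefetch is non-blocking: the memory fetch overlaps with useful work on the current packet instead of stalling in front of the next one.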
Thus, with careful coding, you can get rid of all the processor stalls.
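The producer-consumer queue mentioned above can be sketched as a single-producer/single-consumer ring buffer. This is an illustrative sketch using C11 atomics (not from the original post): because each index is written by exactly one thread, plain acquire/release loads and stores are enough, and no LOCK-prefixed read-modify-write instruction (and hence no bus-locking penalty) is ever issued.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QSIZE 256  /* must be a power of two */

struct spsc {
    void *slots[QSIZE];
    _Atomic size_t head;  /* written only by the consumer */
    _Atomic size_t tail;  /* written only by the producer */
};

/* Producer side: returns 1 on success, 0 if the queue is full. */
int spsc_push(struct spsc *q, void *item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QSIZE)
        return 0;
    q->slots[tail & (QSIZE - 1)] = item;
    /* Release: the slot write above becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 on success, 0 if the queue is empty. */
int spsc_pop(struct spsc *q, void **item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return 0;
    *item = q->slots[head & (QSIZE - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}
```

On x86 the acquire/release operations compile to ordinary loads and stores, which is precisely why this structure sidesteps the 600-cycle locked-bus-transaction hit.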
Network ASICs from Cavium, RMI, Consentry, etc. solve the problem of processor stalls a different way. They are aggressively multi-threaded, so that when one thread stalls, the processor continues executing a different thread. This helps network applications that stall frequently (e.g. Snort), but offers no benefit to code that has been engineered around the stalls (e.g. Proventia). At ISS, we jokingly referred to all the "hardware accelerators" as "decelerators".
These chips do have a power-consumption advantage, but even there Intel has almost caught up with the Core 2 Duo, and will likely surpass them later this year with its 45nm process and hafnium-based dielectric.