Wednesday, January 25, 2023

I'm still bitter about Slammer

Today is the 20th anniversary of the Slammer worm. I'm still angry over it, so I thought I'd write up my anger. This post will be of interest to nobody, it's just me venting my bitterness and get off my lawn!!

Back in the day, I wrote "BlackICE", an intrusion detection and prevention system that ran as both a desktop version and a network appliance. Most cybersec people from that time remember it as the desktop version, but the bulk of our sales came from the network appliance.

The network appliance competed against other IDSs at the time, such as Snort, an open-source product. For much the cybersec industry, IDS was Snort -- they had no knowledge of how intrusion-detection would work other than this product, because it was open-source.

My intrusion-detection technology was radically different. The thing that makes me angry is that I couldn't explain the differences to the community because they weren't technical enough.

When Slammer hit, Snort and Snort-like products failed. Mine succeeded extremely well. Yet, I didn't get the credit for this.

The first difference is that I used a custom poll-mode driver instead of interrupts. This the now the norm in the industry, such as with Linux NAPI drivers. The problem with interrupts is that a computer could handle less than 50,000 interrupts-per-second. If network traffic arrived faster than this, then the computer would hang, spending all it's time in the interrupt handler doing no other useful work. By turning off interrupts and instead polling for packets, this problem is prevented. The cost is that if the computer isn't heavily loaded by network traffic, then polling causes wasted CPU and electrical power. Linux NAPI drivers switch between them, interrupts when traffic is light and polling when traffic is heavy.

The consequence is that a typical machine of the time (dual Pentium IIIs) could handle 2-million packets-per-second running my software, far better than the 50,000 packets-per-second of the competitors.

When Slammer hit, it filled a 1-gbps Ethernet with 300,000 packets-per-second. As a consequence, pretty much all other IDS products fell over. Those that survived were attached to slower links -- 100-mbps was still common at the time.

An industry luminary even gave a presentation at BlackHat saying that my claimed performance (2-million packets-per-second) was impossible, because everyone knew that computers couldn't handle traffic that fast. I couldn't combat that, even by explaining with very small words "but we disable interrupts".

Now this is the norm. All network drivers are written with polling in mind. Specialized drivers like PF_RING and DPDK do even better. Networks appliances are now written using these things. Now you'd expect something like Snort to keep up and not get overloaded with interrupts. What makes me bitter is that back then, this was inexplicable magic.

I wrote an article in PoC||GTFO 0x15 that shows how my portscanner masscan uses this driver, if you want more info.

The second difference with my product was how signatures were written. Everyone else used signatures that triggered on the pattern-matching. Instead, my technology included protocol-analysis, code that parsed more than 100 protocols.

The difference is that when there is an exploit of a buffer-overflow vulnerability, pattern-matching searched for patterns unique to the exploit. In my case, we'd measure the length of the buffer, triggering when it exceeded a certain length, finding any attempt to attack the vulnerability.

The reason we could do this was through the use of state-machine parsers. Such analysis was considered heavy-weight and slow, which is why others avoided it. State-machines are faster than pattern-matching, many times faster. Better and faster.

Such parsers are now more common. Modern web-servers (nginx, ISS, LightHTTPD, etc.) use them to parse HTTP requests. You can tell if a server does this by sending 1-gigabyte of spaces between "GET" and "/". Apache gives up after 64k of input. State-machines keep going, because while in that state ("between-method-and-uri"), they'll accept any number of spaces -- the only limit is a timeout. Go read the nginx source-code to understand how this works.

I wrote a paper in PoC||GTFO 0x21 that shows the technique to implement the common wc (word-count) program. A simplified version of this wc2o.c. Go read the code -- it's crazy.

The upshot is that when Slammer hit, most IDSs didn't have a signature for it. If they didn't just fall over, what they triggered on were things like "UDP flood", not "SQL buffer overflow". This lead many to believe what was happening was DDoS attack. My product correctly identified the vulnerability being exploited.

The third difference with my product was the event coalescer. Instead of a timestamp, my events had a start-time, end-time, and count of the number of times the event triggered.

Other event systems sometimes have this, with such events as "last event repeated 39003 times", to prevent the system from clogging up with events.

My system was more complex. For one things, an attacker may deliberately intermix events, so it can't simply be 1 event that gets coalesced this way. For another thing, the attacker could sweep targets or spoof sources. Thus, coalescing needed to aggregate events over address ranges as well as time.

Slammer easily filled a gigabit link with 300,000 packets-per-second. Every packet triggered the signature, thus creating 300,000 events-per-second. No system could handle that. To keep up with the load, events had to be reduced somehow.

My event coalescing logic worked. It reduced the load of events from 300,000 down to roughly 500 events-per-second. This was still a little bit higher load than the system could handle, forwarding to the remote management system. Customers reported that at their consoles, they saw the IDS slowly fall behind, spooling events at the sensor and struggling to ship them up to the management system.

The problem is so accurate that it's a big flaw in IDS still to this day. Snort often has signatures that throw away the excess data, but it's still easy to flood them with packets that overload their event logging.

What was exciting for me is that I'd designed all this in theory, tested using artificial cases, unsure how it would stand up to the real world. Watching it stand up to the real world was exciting: big customers saw it successfully work in practice, with the only complaint that at the centralized console, it fell behind a little.

The point is that I made three radical design choices, unprecedented at the time though more normal now, and they worked. And yet, the industry wasn't technical enough to recognize that it worked.

For example, a few months later I had a meeting at the Pentagon where a Gartner analyst gave a presentation claiming that only hardware-based IDS would work, because software-based IDS couldn't keep up. Well, these were my customer. I didn't refute Gartner so much as my customer did, with their techies standing up and pointing out that when Slammer hit, my "software" product did keep up. Gartner doesn't test products themselves. They rightly identified the problem with other software using interrupts, but couldn't conceive there was a third alternative, "poll mode" drivers.

I apologize to you, the reader, for subjecting you to this vain bitching, but I just want to get this off my chest.

No comments: