Friday, June 12, 2009

Asynchronocity and Internet Scale

Schools teach you the wrong way to write network code. They teach you the "synchronous" method. You send a request, wait for a response, then process the response. This doesn't scale to large programs that must interacts with thousands of peers at gigabit speeds. These types of programs require "asynchronous" coding.


The problem is that while you are waiting for a response, you can't do anything else useful. You can't simultaneously interact with a second system, for example. Normally, this isn't a problem because computers respond so quickly that you don't notice the wait. You can also hide it by using multiple threads, but if you had 10 threads, then 10 slow systems will noticeably slow your code.

Asynchronous coding solves this problem by never waiting. It sits in a loop processing events, either incoming packets, or timeout events.

Let's use a TCP connection as an example. As everyone knows, the client sends a SYN packet to the server, the server responds with a SYN-ACK, then the client sends an ACK. This SYN-SYNACK-ACK is known as the "three-way-handshake".

In synchronous code, you send a SYN, then stop and wait for a SYN-ACK. When you get a response packet, you first test it to make sure it conforms to the SYN-ACK you were expecting, otherwise you handle some sort of error.

In asynchronous code, the receive thread sits in an "event dispatch loop". It processes incoming packets. If an incoming SYN-ACK is received, it looks it up in a connection table to see if anybody has sent a SYN packet. If so, it dispatches the SYN-ACK as appropriate.

Imagine you are writing a port scanner, like nmap. One way you could write this is to launch many threads, where each one sends out a SYN packet, then stops and waits for the SYN-ACK. This could could generate thousands of packets per second.

Or, you could write your mapping program with two threads: one that does nothing but sends out SYN packets, and a second thread that receives SYN-ACKs in response. This code could generate a million packets per second.

Recently, a hacker released a TCP DoS tool called NKiller2. The tool uses asynchronous network code. It can appear confusing to people accustomed to synchronous programming. A synchronous coder might expect it to launch many threads, where each thread sends out a SYN and waits for responses for that one connection. This would be too slow - it would probably DoS itself creating too man threads before it was able to DoS the victim.

Instead, NKiller2 is written asynchronously. It runs two threads, one thread that spews out SYN packets, and another thread that responds to incoming packets. This may not be obvious, because both steps are part of the same thread of execution. The code has an event dispatch loop that looks like the following:

while () {
. . .…
send_syn_probe(Target, Sniffer);
. . .
state = check_replies(Target, Sniffer, &reply);
switch (state)
{
case S_SYNACK:
send_probe(reply, Target, S_SYNACK);
}
}

If you are used to synchronous programming, you might assume that the "send_syn_probe()" and "check_replies()" function are related, that it first sends a SYN then checks for a reply to that SYN. That's NOT what's going on.

Instead it's really running two threads, one that sits in a loop sending SYNs, and another that sits in a loop processing replies. The code just combines both into the same loop. You could put the "send_syn_probe()" function at the bottom of the loop, AFTER the "check_replies()", and the code would behave the same.

Or, you could create two versions of this program. Create one that sends SYNs, but has the "check_replies()" commented out. Create a second program with "send_syn_probe()" commented out, but which only receives replies. Now run them both at the same time, and you'll get identical results as the original program.

This code also uses the technique of being completely "stateless". One way to write this code would be for it to create a small connection record. However, since it is creating millions of connections, it would need a large table in memory to track what each connection is doing. Instead, it's much simpler. It will reply to a SYN-ACK packet regardless if it sent a matching SYN packet.

That would be one (of many) easy ways to see if somebody is running this tool against you. Whenever you suspect somebody is DoSing you, send them a SYN-ACK packet out of the blue. If it's a normal, stateful system that tracks SYNs it sent, then the suspected attacker will respond with some sort of error. If it is stateless, Internet scale attacker, they will respond with a data packet.


Internet scale programming like this is all around us. When the Internet worms were ravaging the Internet, a common technique was to set up "tarpits". A tarpit would accept an incoming TCP connection, but never respond. The worm on the other end would stop and wait for a response. Since the tarpit would never respond, the worm would wait forever, stopping its spread. Some worms would launch a hundred threads, each thread would eventually find a tarpit and be halted. (Note: I first tried this with the Morris Worm, it effectively slowed it down, but it would eventually timeout connections and move on - the first worm was written better than most following worms).

Another example of this is Internet-wide scanning. Kaminksy used this approach for scanning for DNS servers: have one thread spew out DNS packets, and a second thread receive them. I used the same technique for scanning for SNMP vulnerabilities. I wrote it for the military to scan Class A networks (with 16-million addresses), but it would scale to the entire Internet. My SNMP scanner was also stateless: it would accept any SNMP response regardless if it actually sent the system a request. This was actually pretty interesting seeing how many SNMP responses didn't match correctly with a request I sent (such as multi-homed hosts).

It works the other way around, too. IronPort used this approach to receiving large amounts of e-mail. They called the operating system they built around this idea "AsyncOS". (They also use this for sending spam).

Asynchronicity is why BlackICE/Proventia IPS is faster than application gateways. Fundamentally, they do the same thing: process application layer data and block it. However, BlackICE does this asynchronously, with a single thread. Application-layer gateways tend to be written synchronously, with a limited amount of threads waiting for data.

Conclusion

They teach you synchronous coding in school because it's easy to understand. However, in order to write software to "Internet scale", you have to learn how to write asynchronous code. This applies to worms, DoS tools, port scanners, firewalls, IPS, e-mail gateways, and so on.

5 comments:

c0uchw4rrior said...

Good post Rob, as usual.

One correction, though. Class A networks contain ~16.8 million IPs (2^24), not 24 million. A Class A does, however, have 24 bits of IP space.

Anonymous said...

Synchronous communication *can* scale to internet size. See mailinator as an example.

http://mailinator.blogspot.com/2008/02/kill-myth-please-nio-is-not-faster-than.html

http://mailinator.blogspot.com/2008/08/benchmarking-talkinator.html

"Seriously, if you still believe asynchrounous/NIO beats synchronous/IO (and it did when linux threads killed the system after just a few hundred were started) you might want to do some research. Threaded socket programming is back.. and with a vengeance."

Also, I think that people like synchronous IO programming more than async, possibly because it's more 'procedural'... Things happen in the textual order you write them. That's something that's worth trading for a little performance, if need be.

Async IO handling spreads the state machine all over the place, and I've seen numerous security bugs that have come simply from that. It's easier to see and verify the logic and steps in linear synchronous programming.

Finally, really what's needed is an async IO library that *looks* synchronous. Write your code in the linear synchronous style, and let the library take care of insync/async performance tradeoffs. Of course for that, you need a language with support for continuations.

ry said...

http://tinyclouds.org/node

Anonymous said...

Enough already about how great BlackICE/Proventia is!

ithilgore said...

You are wrong about this part of your post:

"This code also uses the technique of being completely "stateless". One way to write this code would be for it to create a small connection record. However, since it is creating millions of connections, it would need a large table in memory to track what each connection is doing. Instead, it's much simpler. It will reply to a SYN-ACK packet regardless if it sent a matching SYN packet.

That would be one (of many) easy ways to see if somebody is running this tool against you. Whenever you suspect somebody is DoSing you, send them a SYN-ACK packet out of the blue. If it's a normal, stateful system that tracks SYNs it sent, then the suspected attacker will respond with some sort of error. If it is stateless, Internet scale attacker, they will respond with a data packet. "

Nkiller2 uses the technique of client syn-cookies to keep track of the SYNs it sends without any additional memory overhead. Essentially, it encodes the quadruple { src port, src IP address, dst port, dst IP address } along with a secret key into the TCP sequence field and upon receiving a SYN-ACK it can deduce whether or not this belonged to a SYN it previously sent by subtracting 1 from the TCP ACK field, and checking the number against the current packet's reencoded quadruple. The encoding is done using the sha1 hash algorithm. Thus, you cannot detect the use of this tool by the technique you mentioned: sending back a bogus SYN-ACK, because Nkiller2 will just ignore it. You can try it yourself using hping2 or some similar tool.
There are might be other ways to detect Nkiller2 however and the most obvious one is checking whether your service is down :)

-- ithilgore (author of Nkiller2)