Friday, April 10, 2015

Scalability of the Great Cannon

Here is a great paper on China's Great Cannon, which was used to DDoS GitHub. One question is how scalable such a system can be, or how many resources it would take for China to intercept connections and replace content.

The first question is how much bandwidth China needs to monitor. According to this website, in early 2015 that was 1.9-terabits/second (1,899,792-mbps).

The second question is how much hardware China needs to buy in order to intercept network traffic, reassemble TCP streams, and insert responses. The answer is about one $1000 desktop computer per 10-gbps of traffic. At 1.9-terabits/second, that works out to roughly 190 machines. In other words, China can deploy the Great Cannon using about $200,000 worth of hardware.
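The arithmetic behind that estimate can be sketched in a few lines of Python. The bandwidth figure and the $1000-per-10-gbps ratio come from the text above; the function name and the rounding up to whole machines are illustrative assumptions, and the result ignores switches, spares, and power.

```python
import math

def machines_needed(total_gbps: float, gbps_per_machine: float = 10.0) -> int:
    """Whole machines required if each one handles gbps_per_machine."""
    return math.ceil(total_gbps / gbps_per_machine)

total_mbps = 1_899_792          # bandwidth figure quoted in the post
cost_per_machine = 1000         # dollars, the post's rough desktop price

machines = machines_needed(total_mbps / 1000)
cost = machines * cost_per_machine
print(f"{machines} machines, ${cost:,}")
```

The total lands just under the $200,000 quoted above, so the post's number is a round-up of this calculation.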

This answer is a little controversial. Most people think that a mere desktop computer could not handle 10-gbps of throughput, much less do anything complicated with it like reassembling TCP streams. However, they are wrong. Intel has put an enormous amount of functionality into their hardware to solve precisely this problem. Unfortunately, modern software like Linux or Windows is a decade behind hardware advances, and cannot take advantage of this.

The first step is to bypass the operating system. This sounds a bit odd, but it's not hard to do. There are several projects (DPDK, PF_RING, netmap) that disconnect the network card from the kernel and connect it directly to your software. They are fairly straightforward to use. My GitHub account has several examples; I ought to produce more. These applications are able to transmit and receive packets with zero overhead. My apps handle 30-million packets/second, and Intel has prototypes that do 80-million packets/second. This is far greater than the 10-million packets/second you'd need to handle for a Great Cannon.
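For a sense of where packet-rate figures like these come from, here is a rough calculation of packets/second on a 10-gbps Ethernet link. It assumes the standard 20 bytes of per-frame wire overhead (preamble plus inter-frame gap); the function names are made up for illustration. The worst case is a link full of minimum-size 64-byte frames, while the 10-million figure above corresponds to a larger average frame.

```python
WIRE_OVERHEAD_BYTES = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap

def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames/second on a link when every frame is frame_bytes long."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)

def average_frame_bytes(link_bps: float, pps: float) -> float:
    """Average frame size implied by a full link carrying pps frames/second."""
    return link_bps / (pps * 8) - WIRE_OVERHEAD_BYTES

LINK_BPS = 10e9  # one 10-gbps port

worst_case = packets_per_second(LINK_BPS, 64)    # minimum-size frames
best_case = packets_per_second(LINK_BPS, 1518)   # maximum standard frames
implied_avg = average_frame_bytes(LINK_BPS, 10e6)

print(f"64-byte frames:   {worst_case:,.0f} packets/second")
print(f"1518-byte frames: {best_case:,.0f} packets/second")
print(f"10M packets/second implies ~{implied_avg:.0f}-byte average frames")
```

Even the worst case, about 14.88-million packets/second, is well under the 30-million figure quoted above.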

Another trick is how TCP reassembly is handled. Almost everyone does it the wrong way, which involves buffering every packet that comes in. The correct way is to write parsers as state-machines, buffering the state between packets, and not the packets themselves. Out-of-order packets still need to be buffered, but this is a small percentage of the total. With a zero-overhead driver and state-machine parsers, you'll find that you can easily keep up with a 10-gbps stream.
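As a minimal sketch of the state-machine idea (not the code from any particular product), the Python below finds the end of an HTTP header block in a TCP stream while keeping only two small integers per connection. The class and function names are hypothetical, and it assumes segments arrive in order; out-of-order segments would still need the small amount of buffering described above.

```python
from typing import Iterable, Optional

PATTERN = b"\r\n\r\n"  # blank line that ends an HTTP header block

class HeaderEndFinder:
    """Per-connection parser state: two integers, no packet buffering."""

    def __init__(self) -> None:
        self.matched = 0  # bytes of PATTERN matched so far
        self.offset = 0   # bytes of the stream consumed so far

    def feed(self, segment: bytes) -> Optional[int]:
        """Consume one in-order segment. Returns the stream offset just past
        the first header terminator, or None if it has not been seen yet.
        Stops at the first match; the rest of the segment is not examined."""
        for byte in segment:
            self.offset += 1
            if byte == PATTERN[self.matched]:
                self.matched += 1
                if self.matched == len(PATTERN):
                    self.matched = 0
                    return self.offset
            else:
                # A stray '\r' may begin a new match; anything else resets.
                self.matched = 1 if byte == PATTERN[0] else 0
        return None

def find_header_end(segments: Iterable[bytes]) -> Optional[int]:
    """Run a fresh parser over a sequence of segments."""
    finder = HeaderEndFinder()
    for segment in segments:
        end = finder.feed(segment)
        if end is not None:
            return end
    return None

request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\nbody"

# The same stream, delivered whole and split in the middle of the terminator.
whole = find_header_end([request])
split = find_header_end([request[:17], request[17:36], request[36:]])
assert whole == split
print("headers end at offset", whole)
```

The result does not depend on where the segment boundaries fall, which is the point: the parser carries its position in the pattern across packets instead of holding on to the packets themselves.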

Those are the basics, but there are a bunch of other issues you'll need to solve. For example, consider the issue of jitter, when Linux interrupts your thread. That'll cause whichever packet you are currently processing to be delayed several milliseconds. This is death for a network device. Luckily, you can tell Linux to reserve specific CPU cores as off-limits, allowing you to run threads on those cores without ever getting interrupted. This allows microsecond-scale latency/jitter, which is negligible given that traffic to/from China has about a 250-millisecond latency.
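One common way to do this on Linux, offered here as an assumption rather than the exact setup used above, is to boot with the `isolcpus=` kernel parameter so the scheduler keeps ordinary tasks off certain cores, then pin the packet-processing thread to one of them. The sketch below shows only the pinning half, using Python's Linux-only `os.sched_setaffinity`; the helper name and the choice of core are illustrative.

```python
import os

def pin_to_core(core: int) -> set:
    """Restrict the calling process (pid 0 means "self") to a single core
    and return the resulting affinity set. Linux only."""
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)

pinned = None
if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)
    # Illustrative choice: the highest-numbered core currently available.
    # On a real box this would be a core reserved with isolcpus.
    core = max(allowed)
    pinned = pin_to_core(core)
    print("now restricted to cores:", pinned)
else:
    print("sched_setaffinity is not available on this platform")
```

Pinning alone does not stop the kernel from scheduling other work on that core; that is what the boot-time isolation is for.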

I've built such systems. They really do work. Indeed, you could use the product I built (now sold by IBM) and do exactly what China did.

So the deal is this. You'd think with a billion people and terabits of Internet traffic, that such a system would be hard. In fact, in the relative scale of things, it's trivial.

I did a presentation at Shmoocon on this a couple years ago.

So I have a lab with a bunch of computers doing 10-gbps. The components are:

  • XS708E 10-gbps switch $840
  • Supermicro half-length 1U $331
  • 3.2 GHz quad Ivy Bridge $199
  • 32-gigs RAM $295
  • 10-gbps dual-port Intel NIC $190
