Friday, October 03, 2008

TCP Selective ACK considered evil

In my previous post, I pointed out that the claims of a new TCP DoS are probably true. Like the researchers that discovered the issues, I too have been playing around in TCP stacks, and I find weirdness.

One thing that has annoyed me recently is the way that stacks abuse the "selective ack" feature. In the past, the receiver would only acknowledge the contiguous data it had received. If a packet were lost on the network, and a gap appeared, the sender wouldn't know the fate of the packets after the gap. This was solved with "selective acks", where the receiver could say "I received your first 100,000 bytes, and bytes 101,000-108,000, but I'm missing those in between". This increased the speed at which TCP stacks could recover from lost packets and retransmit the necessary data.
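As a rough sketch of what the receiver reports, the following Python computes a cumulative ACK plus the selectively acknowledged ranges from the byte ranges received so far. The function name `ack_state` and the use of stream offsets starting at 0 (rather than real sequence numbers with wraparound) are assumptions made for illustration.

```python
def ack_state(segments):
    """Return (cumulative_ack, sack_blocks) for received byte ranges.

    Each segment is a half-open (start, end) range of stream offsets.
    Sequence-number wraparound is ignored in this sketch.
    """
    # Merge overlapping or touching ranges into disjoint runs.
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    cumulative_ack = 0
    sack_blocks = []
    for start, end in merged:
        if start <= cumulative_ack:
            # Contiguous with what is already acknowledged.
            cumulative_ack = max(cumulative_ack, end)
        else:
            # Data past a gap: reported selectively.
            sack_blocks.append((start, end))
    return cumulative_ack, sack_blocks


# The example from the text: first 100,000 bytes, then 101,000-108,000.
print(ack_state([(0, 100000), (101000, 108000)]))
```

Without selective acks, only the first value would be reported and the sender would learn nothing about the second range.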

I'm seeing something unexpected, though. When clients are downloading large files, the servers aren't immediately retransmitting the lost packets. The file download past the gap continues for a very long time. So far, I've seen the file download continue for 3 megabytes before the server goes back and fills in the gap. This can easily be 20 seconds later.

This annoys me because my network monitoring tools like Ferret have to buffer all that data. My TCP stack has to process data in-order, so I have to buffer 3-megabytes until I can process the retransmitted packet.
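To make the buffering cost concrete, here is a minimal sketch of an in-order reassembler such as a passive monitor might need. The class name `Reassembler` and its methods are hypothetical and are not taken from Ferret; offsets are plain stream offsets with no wraparound handling.

```python
class Reassembler:
    """Hold segments that arrive past a gap until the gap is filled."""

    def __init__(self):
        self.next_offset = 0   # next in-order byte we expect
        self.pending = {}      # start offset -> bytes held back

    def buffered_bytes(self):
        return sum(len(d) for d in self.pending.values())

    def receive(self, start, data):
        """Accept a segment; return the bytes that are now deliverable in order."""
        if start + len(data) <= self.next_offset:
            return b""  # entirely old data, nothing new
        if len(data) > len(self.pending.get(start, b"")):
            self.pending[start] = data
        out = bytearray()
        while True:
            # Look for a held segment that covers the next expected byte.
            hit = next(
                (s for s, d in self.pending.items()
                 if s <= self.next_offset < s + len(d)),
                None,
            )
            if hit is None:
                break
            chunk = self.pending.pop(hit)
            out += chunk[self.next_offset - hit:]
            self.next_offset = hit + len(chunk)
        # Drop anything that the delivered data has made redundant.
        self.pending = {s: d for s, d in self.pending.items()
                        if s + len(d) > self.next_offset}
        return bytes(out)


r = Reassembler()
r.receive(0, b"abc")         # delivered immediately
r.receive(6, b"ghi")         # held: bytes 3-5 are missing
print(r.buffered_bytes())    # grows until the gap is filled
print(r.receive(3, b"def"))  # the retransmission releases everything
```

In the situation described above, `pending` would have to hold about 3 megabytes per connection before the retransmitted packet arrives.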

In the old days, kernels had a fixed amount of buffer space, usually not very big. They wouldn't be able to buffer 3-megabytes like this. This implies that the kernel is allocating more memory on demand, controlled by the other side of the connection.

This suggests a way that I can bluescreen a desktop computer. I would have a user follow a link to download a simple webpage. I then intentionally miss a packet as I send data in response. As the client selectively acknowledges my data, I continue streaming forever, but I never fill in that gap. A well-designed TCP stack would put a limit on how much memory it would allocate. A poorly designed stack would allocate memory until it ran out, at which point the machine would likely crash the next time another kernel process needed more memory.
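The kind of limit a well-designed stack might apply can be sketched as a capped out-of-order queue. This is only an illustration under assumed names (`BoundedGapBuffer`, `offer`, a 64 KB cap); it does not describe how any real kernel accounts for this memory.

```python
class BoundedGapBuffer:
    """Out-of-order segment store that refuses data beyond a fixed cap."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.segments = {}   # start offset -> bytes

    def offer(self, start, data):
        """Store a segment past the gap; return False if the cap would be exceeded."""
        if start in self.segments:
            return True      # duplicate of something already held
        if self.used + len(data) > self.max_bytes:
            return False     # drop it; the sender must retransmit later
        self.segments[start] = data
        self.used += len(data)
        return True


# A sender streams 1460-byte segments but never sends offset 0.
buf = BoundedGapBuffer(max_bytes=64 * 1024)
accepted = sum(buf.offer(1460 * (i + 1), b"x" * 1460) for i in range(3000))
print(accepted, buf.used)
```

With the cap in place, memory use stops growing no matter how long the sender keeps streaming past the gap.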

When an application runs out of memory, it can use "virtual memory" paged to disk. The kernel cannot. If an application runs out of virtual memory, just the application will crash. When the kernel runs out of memory, it bluescreens.

I could do the same thing with a server. I could connect to the web server, send a few bytes, "drop" a packet, then continue to stream data forever after that. An Apache web-server will only accept the first 16-kilobytes, but I don't care, because the kernel hasn't delivered the first 16-kilobytes yet - it is waiting for me to retransmit that missing packet before all the data goes up the stack to Apache.

I can't explain why the server is taking 20 seconds to retransmit data. One idea is that operating systems have specialized "sendfile" functions that hand off to the kernel the responsibility for sending the contents of a file across the network socket. Maybe the reason it takes so long is that the missing packet isn't buffered in memory: the kernel has to re-read the data from the disk in order to retransmit it. On a busy file server, this can take many seconds. If my theories are true, I could DoS a server by forcing it to go back and retransmit lots of 1-byte chunks all over the disk. I could cause the disk heads to grind away for very low bandwidth.

I'm not sure how selective-acks work with other mechanisms. For a TCP stack to acknowledge a "FIN" flag, it acknowledges the next byte after the flag. I can do that with selective-acks, without acknowledging the data right before it. This puts the TCP state machine into a weird place. One part knows that the FIN has been received and behaves accordingly, but another part is still trying to retransmit the data. This conflict wasn't possible with old stacks because acknowledging the FIN also acknowledged all the data up to it.

I find selective-acks annoying and I'm just writing a simple network monitoring application. I'm sure they cause stack designers a lot more headaches, and that if I write an active stack, I could cause a lot of problems. That's why I believe those researchers when they say they have found problems.


decius said...

Interesting observations. I presume that your TCP window size in that case is 3 megs and that's why the transmitter is sending so much data before it goes back and fills in the lost packet -- your kernel has told the transmitting side that you are willing to buffer that much. If the transmitter sends more than you've indicated that you're willing to buffer, you'll just start dropping packets.

Robert Graham said...

Nope. The receiver always reported a window size of 64k. The data that was selectively acked was not counted against it. Thus, with a missing packet of 1.5k, and 3-megs of SACKed data, the transmitter believed it could send (65536-1500) bytes.

I made the same assumption as you. As it turns out, modern TCP stacks lie about their window size and can buffer a lot more data than they claim.

jweyrich said...

Just FYI, SACK can be disabled on IOS by using % no ip tcp selective-ack
Doing so, the client won't be able to use SACK because the sender won't advertise that SACK is supported within the SYN or SYN/ACK.

A pretty interesting thing you probably noticed is the fact that when the server retransmits a packet after a SACK from the client, it stops sending packets for several milliseconds. So, I imagine here's the entry-point for a very low-bandwidth DoS.

The hardest part (at least for me) is to capture ALL the packets and apply post-filtering to detect SACKs, and then verify the intervals between the last acknowledged packet and the resent one.
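For the post-filtering step, a small parser over the raw TCP option bytes is enough to spot SACKs. The sketch below assumes the option bytes have already been extracted from each captured packet; the function name `parse_sack_blocks` is made up for this example. It relies on the standard encoding: option kind 5, a length byte, then pairs of 32-bit left and right edges.

```python
import struct


def parse_sack_blocks(options: bytes):
    """Return a list of (left_edge, right_edge) pairs found in TCP option bytes."""
    blocks = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation padding, single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                # malformed option
        if kind == 5 and (length - 2) % 8 == 0:
            # Each block is two 32-bit sequence numbers in network byte order.
            for off in range(i + 2, i + length, 8):
                blocks.append(struct.unpack_from("!II", options, off))
        i += length
    return blocks


# Two NOPs for alignment, then one SACK block.
opts = bytes([1, 1, 5, 10]) + struct.pack("!II", 101000, 108000)
print(parse_sack_blocks(opts))
```

The edges are absolute sequence numbers, so comparing them against the later retransmission means tracking each connection's sequence space.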

Let me know if it helped in any way.

jweyrich said...

I forgot to mention you can also reconstruct the exact receiver's queue 'grouping' the information received from the client SACKs.
For attacking purposes, it works like a report agent that tells the attacker whether the attack is succeeding or not.
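A minimal sketch of that reconstruction: given the latest cumulative ACK and the SACK blocks seen so far, merge the blocks and list the holes the receiver is still waiting on. The name `receiver_view` is hypothetical, and sequence-number wraparound is ignored.

```python
def receiver_view(cum_ack, sack_blocks):
    """Return (holes, queued_bytes) as implied by a cumulative ACK and SACK blocks."""
    merged = []
    for left, right in sorted(sack_blocks):
        if right <= cum_ack:
            continue                      # already covered by the cumulative ACK
        left = max(left, cum_ack)
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])

    holes = []
    edge = cum_ack
    for left, right in merged:
        if left > edge:
            holes.append((edge, left))    # bytes the receiver is missing
        edge = right

    queued = sum(right - left for left, right in merged)
    return holes, queued


print(receiver_view(100000, [(101000, 104000), (103000, 108000), (110000, 111000)]))
```

The second value is the amount of data the receiver is holding past its gaps, which is the figure an observer would watch grow.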

decius said...

You sure that TCP window scaling wasn't enabled? (RFC 1323)
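For reference, the window scale option works by left-shifting the 16-bit window field by a shift count negotiated in the SYN, with a maximum shift of 14. A quick sketch of the arithmetic, using made-up helper names:

```python
MAX_SHIFT = 14  # largest shift count the window scale option permits


def effective_window(window_field, shift):
    """Window in bytes for a 16-bit window field and a negotiated shift count."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count out of range")
    return window_field << shift


def min_shift_for(target_bytes):
    """Smallest shift at which a full 16-bit field covers target_bytes, or None."""
    for shift in range(MAX_SHIFT + 1):
        if effective_window(0xFFFF, shift) >= target_bytes:
            return shift
    return None


print(effective_window(0xFFFF, 0))       # unscaled window
print(min_shift_for(3 * 1024 * 1024))    # shift needed to cover 3 megabytes
```

Since the shift count appears only in the SYN and SYN/ACK, a capture that starts mid-connection will show the raw 16-bit field without revealing the scale.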

Neil said...

"When the kernel runs out of memory, it bluescreens."

This isn't true on Windows XP (or Server 2003 -- my memory is a little fuzzy on where the changes were made) and later (and it wasn't true, in a precise sense, prior to that).

On Windows 2000, running a box out of non-paged pool (which is what we should be using to hold what's in the TCP receive window) could potentially cause a bluescreen when another driver tried to allocate NPP and the allocation failed. This is no longer true.

I suspect that there's something more going on with your retransmits. Perhaps you could post a sample network trace somewhere so we could all look at it? My own experience with selective ACKs suggests that there is more to the story -- I've always seen the retransmit within a very short period of time.

Adrian said...

Perhaps you're seeing the effects of dynamic TCP (or generic?) socket buffer sizing going on?

fgont said...

Did you ever post the packet traces? - I was interested in having a look at them.

BTW, you may find our TCP security paper interesting: