Sunday, October 21, 2018

TCP/IP, Sockets, and SIGPIPE

There is a spectre haunting the Internet -- the spectre of SIGPIPE errors. It's a bug in the original design of Unix networking from 1981 that is perpetuated by college textbooks, which teach students to ignore it. As a consequence, software sometimes crashes unexpectedly. This is particularly acute on industrial and medical networks, where security professionals can't run port/vulnerability scans for fear of crashing critical devices.

An example of why this bug persists is the well-known college textbook "Unix Network Programming" by W. Richard Stevens. In section 5.13, he correctly describes the problem:
"When a process writes to a socket that has received an RST, the SIGPIPE signal is sent to the process. The default action of this signal is to terminate the process, so the process must catch the signal to avoid being involuntarily terminated."
This description is accurate. The "Sockets" network API was modeled on the "pipes" interprocess-communication mechanism when TCP/IP was first added to the Unix operating system back in 1981. This made it straightforward and comprehensible to the programmers of the time. The SIGPIPE behavior made sense when piping the output of one program into another on the command-line, as is typical under Unix: if the receiver of the data crashes, you want the sender of the data to stop running too. But it's not the behavior you want for networking. Server processes need to continue running even if a client crashes.

But Stevens' description is insufficient. It portrays the problem as optional, one that exists only if the other side of the connection is misbehaving. He never mentions the problem outside this section, and none of his example code handles it. Thus, if you base your code on Stevens' examples, it'll inherit this problem and sometimes crash.

The simplest solution is to configure the program to ignore the signal, such as putting the following line of code in your main() function:

   signal(SIGPIPE, SIG_IGN);   /* requires <signal.h> */

If you search popular projects, you'll find this solution in most of them, such as OpenSSL.

But there is a problem with this approach, as OpenSSL demonstrates: it's both a command-line program and a library. The command-line program handles this error, but the library doesn't. This means that using the SSL_write() function to send encrypted data may encounter this error. Nowhere in the OpenSSL documentation does it mention that the user of the library needs to handle this.

Ideally, library writers would like to deal with the problem internally. There are platform-specific ways to do so. On Linux, the MSG_NOSIGNAL flag can be passed to the send() function. On BSD (including macOS), setsockopt(SO_NOSIGPIPE) can be configured for the socket when it's created (after socket() or after accept()). On Windows and some other operating systems, SIGPIPE isn't even generated, so nothing needs to be done for those platforms.

But it's difficult. Browsing through cross-platform projects like curl, which tries this library technique, I see the following bit:

#ifdef __SYMBIAN32__
/* This isn't actually supported under Symbian OS */
#undef SO_NOSIGPIPE
#endif

Later in the code, it checks whether SO_NOSIGPIPE is defined, and if it is, uses it. That check fails on Symbian because the headers define the constant even though the OS doesn't actually support it, so the macro must first be undefined.

So as you can see, solving this issue is hard. My recommendation for your code is to use all three techniques: signal(SIGPIPE, SIG_IGN), setsockopt(SO_NOSIGPIPE), and send(MSG_NOSIGNAL), surrounded by the appropriate #ifdefs. It's an annoying amount of boilerplate, but it's something you must handle correctly, and the handling must survive later programmers who may not understand the issue.


Now let's talk abstract theory, because it matters for understanding why Stevens' description of SIGPIPE is wrong. The #1 most important theoretical concept in network programming is this:
Hackers control input.
What that means is that if input can go wrong, it will -- because eventually a hacker will discover that you trust input and craft the input needed to cause something bad to happen, such as crashing your program or taking remote control of it.

The way Stevens presents the SIGPIPE problem is as if it's a bug in the other side of the connection. A correctly written program on the other side won't generate the problem, so as long as you have only well-written peers to deal with, you'll never see it. In other words, Stevens trusts that input isn't created by hackers.

And that's indeed the assumption on industrial control networks (factories, power plants, hospitals, etc.). These are tightly controlled networks where the other side of the connection is equipment from the same manufacturer. Nothing else is allowed on the network, so bugs like this never get triggered.

Except that networks are never truly isolated like this. Once a hacker breaks into the network, they'll cause havoc.

Worse yet, other people may have interest in the network. Security professionals, who want to stop hackers, will run port/vuln scanners on the network. These will cause unexpected input, causing these devices to crash.

Thus we see how this #1 principle gets corrupted, from Stevens on down. Stevens' textbook teaches that it's the peer's problem, a bug in the software on the other side of the connection. Industrial networks are then built on that assumption, because it's what their programmers were taught at university. This leads to persistent, intractable vulnerabilities to hackers in these networks. Not only are they vulnerable now, they can't be fixed, because we can't scan for vulnerabilities in order to fix them.


In this day and age of "continuous integration", programmers are interested not only in solving this in their code, but solving this in their unit/regression test suites. In the modern perspective, until you can create a test that exercises this bug, it's not truly fixed.

I'm not sure how to write code that adequately does this. It's not straightforward to generate RSTs from the Sockets API, especially at the exact point you need them. There are also timing issues, where you may need to do something repeatedly, perhaps a million times, just to get the timing right.

For example, I have a sample program that calls send() as fast as it can until it hits the limit on how much this side can buffer, and then closes the socket, causing a reset to be sent. For my simple "echo" server trying to echo back everything it receives, this will cause a SIGPIPE condition.

However, when testing a webserver, this may not work. A typical web server sends a short amount of data, so its send() has already returned before an RST packet can arrive. The web server software you are testing needs to be sending a large enough response that it keeps sending until it hits this condition. You may need to run the client program a ton of times until just the right conditions are met.


Conclusion

I don't know of any network programming textbook I like. They all tend to perpetuate outdated and incomplete information. That SIGPIPE is so completely ignored is a major cause of problems on the Internet.

To summarize: your code must deal with this. The most appropriate solution is signal(SIGPIPE, SIG_IGN) at the top of your program. If that doesn't work for you, then you may be able to use pthread_sigmask() to block the signal in just the particular threads doing network traffic. Otherwise, you need the more platform-specific methods for BSD and Linux that deal with this at the socket or function-call level.

It persists because it's a bug in the original Sockets definition from 1981, it's not properly described by textbooks, it escapes testing, and there is a persistent belief that if a program receives bad input, it's the sender's responsibility to fix, rather than the receiver's.

