Monday, November 20, 2017

Why Linus is right (as usual)

People are debating this email from Linus Torvalds (maintainer of the Linux kernel). It has strong language, like:
Some security people have scoffed at me when I say that security
problems are primarily "just bugs".
Those security people are f*cking morons.
Because honestly, the kind of security person who doesn't accept that
security problems are primarily just bugs, I don't want to work with.
I thought I'd explain why Linus is right.

Linus has an unwritten manifesto of how the Linux kernel should be maintained. It's not written down in one place, instead we are supposed to reverse engineer it from his scathing emails, where he calls people morons for not understanding it. This is one such scathing email. The rules he's expressing here are:
  • Large changes to the kernel should happen in small iterative steps, each one thoroughly debugged.
  • Minor security concerns aren't major emergencies; they don't allow bypassing the rules more than any other bug/feature.
Last year, some security "hardening" code was added to the kernel to prevent a class of buffer-overflow/out-of-bounds issues. This code didn't address any particular 0day vulnerability, but was designed to prevent a class of future potential exploits from being exploited. This is reasonable.

This code had bugs, but that's no sin. All code has bugs.

The sin, from Linus's point of view, is that when an overflow/out-of-bounds access was detected, the code would kill the user-mode process or kernel. Linus thinks it should have only generated warnings, and let the offending code continue to run.

Of course, that would in theory make the change of little benefit, because it would no longer prevent 0days from being exploited.

But warnings would only be temporary, the first step. There's likely to be be bugs in the large code change, and it would probably uncover bugs in other code. While bounds-checking is a security issue, its first implementation will always find existing code having latent bounds bugs. Or, it'll have "false-positives" triggering on things that aren't actually the flaws its looking for. Killing things made these bugs worse, causing catastrophic failures in the latest kernel that didn't exist before. Warnings, however, would have equally highlighted the bugs, but without causing catastrophic failures. My car runs multiple copies of Linux -- such catastrophic failures would risk my life.

Only after a year, when the bugs have been fixed, would the default behavior of the code be changed to kill buggy code, thus preventing exploitation.

In other words, large changes to the kernel should happen in small, manageable steps. This hardening hasn't existed for 25 years of the Linux kernel, so there's no emergency requiring it be added immediately rather than conservatively, no reason to bypass Linus's development processes. There's no reason it couldn't have been warnings for a year while working out problems, followed by killing buggy code later.

Linus was correct here. No vuln has appeared in the last year that this code would've stopped, so the fact that it killed processes/kernels rather than generated warnings was unnecessary. Conversely, because it killed things, bugs in the kernel code were costly, and required emergency patches.

Despite his unreasonable tone, Linus is a hugely reasonable person. He's not trying to stop changes to the kernel. He's not trying to stop security improvements. He's not even trying to stop processes from getting killed That's not why people are moronic. Instead, they are moronic for not understanding that large changes need to made conservatively, and security issues are no more important than any other feature/bug.

Update: Also, since most security people aren't developers, they are also a bit clueless how things actually work. Bounds-checking, which they define as purely a security feature to stop buffer-overflows is actually overwhelmingly a debugging feature. When you turn on bounds-checking for the first time, it'll trigger on a lot of latent bugs in the code -- things that never caused a problem in the past (like reading past ends of buffers) but cause trouble now. Developers know this, security "experts" tend not to. These kernel changes were made by security people who failed to understand this, who failed to realize that their changes would uncover lots of bugs in existing code, and that killing buggy code was hugely inappropriate.

Update: Another flaw developers are intimately familiar with is how "hardening" code can cause false-positives, triggering on non-buggy code. A good example is where the BIND9 code crashed on an improper assert(). This hardening code designed to prevent exploitation made things worse by triggering on valid input/code.

Update: No, it's probably not okay to call people "morons" as Linus does. They may be wrong, but they usually are reasonable people. On the other hand, security people tend to be sanctimonious bastards with rigid thinking, so after he has dealt with that minority, I can see why Linus treats all security people that way.


XenoPhage said...

I was with you right up until you commented that bounds checking is primarily for debugging ... While I'll agree that it's not primarily a security feature, I'm not sure how you can argue it's primarily a debugging feature. Or, am I misunderstanding your use of debugging?

Generally speaking, you should be able to remove debugging code and still have good code that performs the function it was written for. If you remove bounds checking, that code can now misbehave and sometimes crash when an improper value is sent to it.

Does that make sense? Am I on the right track?

Rsh said...

I believe the author wanted to say that bounds checking will primarily initially uncover mostly bugs and hence will be most useful for debugging. There is a lot of buggy software out there that does out of bounds memory access.

Anonymous said...

Bounds checking is for debugging because attempting to make an access outside the bounds is a bug, by definition. The correct fix isn't to crash the program when such an access is detected. The correct fix is to change the logic so the attempt is never made in the first place. If the index is coming from outside the program (user input or input from another program/system), there should be explicit bounds checking on that value, and an appropriate code path should be followed.

Greg Nation said...

> I was with you right up until you commented that bounds checking is primarily for debugging

First, I completely agree with what Avi said. Second, bounds checking is primarily for debugging because bounds checking every array access in the kernel would be a huge hit to performance. One philosophy behind the C programming language is to trust the developer.

Darrell said...

The bounds checking uses hardware traps and, if it is like what I think it is, it marks the unused or boundary memory so that a when the addresses are accessed an interrupt is called. It isn't debugging code in the programs or kernel, just a hardware feature.

ajx said...

its nice having access to millions of machines every time a zero day gets published. no exec stack patch should have been main lined in 1992.

XenoPhage said...

Hrm.. ok, so it sounds like for any internal code, with no external access, bounds checking is more for debugging. And you'd put bounds checking in place for any external accesses. Makes sense.

simbo1905 said...

The JVM has a little used option -ea where additional assert statement in the source code are enabled. Unit testing code can run with the addition checks whereas live code has them stripped out at byte code load time. The JVM has bounds checking built-in but a developer can use assert statements to fail fast on other bugs like null arguments where a value must be provided or negative numbers when a positive number is expected.

What's interesting is that absolutely it's for debugging purposes (as asset statements are disabled by default) and the owner of the process opts into the runtime behaviour of the an AssertionError being thrown. An application programmer typically doesn't expect Error types to be thrown. The normal type to throw is Exception. So "assert( arg != null)" is worse than trying to dereference the null which would throw a NullPointerException. Typically Error causes the thread to exit a loop. That can hang a server process or cause it to shut down. In contrast the Exception would be handled by a catch all at the bottom of the stack and a server process main loop would continue.

To Linus's point a Java application programmer certainly doesn't expect a library or JVM programmer to call System.exit() to crash the process when a null parameter slips in. Likewise an application programmer doesn't want library developers to use normal code to check for unexpected arguments and manually throw Error types. I have seen fury expressed several times when library developers have done such things and taken down mission critical server applications. If you have ever witnessed such fury you would know that Linus is right. It is utterly unacceptable to change library or system code to start crashing or killing a process due to problematic arguments. A global team all pagered out of bed on a crash hunt on a live system will always react to anyone uncovering old bugs by stopping the application in a very hostile manner and for good reason. It can cause serious financial losses when they uncovered bugs didn't.

Unknown said...

Bounds checking is indeed for uncovering bugs in your code. It can not and should not replace input validation. These are two completely different tasks in your code. Once the input is certified to be in a valid format, your code should not pass around improper data structures.

Brox said...

I mean it is a kernel of an OS! It's not any applicational programs. So I think we are expecting everything inside the kernel to be right. I think refusing to inner wrong call doesn't make it right at all.

Brox said...
This comment has been removed by the author.