Thursday, July 30, 2015

A quick review of the BIND9 code

BIND9 is the oldest and most popular DNS server. Today, they announced a DoS vulnerability that would crash the server with a simply crafted query.  I could use my "masscan" tool to blanket the Internet with those packets and crash all publicly facing BIND9 DNS servers in about an hour. A single vuln doesn't mean much, but if you look at the recent BIND9 vulns, you see a pattern forming. BIND9 has lots of problems -- problems that critical infrastructure software should not have.


Its biggest problem is that it has too many feature. It attempts to implement every possible DNS feature known to man, few of which are needed on publicly facing servers. Today's bug was in the rarely used "TKEY" feature, for example. DNS servers exposed to the public should have the minimum number of features -- the server priding itself on having the maximum number of features is automatically disqualified.

Another problem is that DNS itself has some outdated design issues. The control-plane and data-plane need to be separate. This bug is in the control-plane code, but it's exploited from the data-plane. (Data-plane is queries from the Internet looking up names, control-plane is zones updates, key distribution, and configuration). The control-plane should be on a separate network adapter, separate network address, and separate port numbers. These should be hidden from the public, and protected by a firewall.

DNS should have hidden masters, servers with lots of rich functionality, such as automatic DNSSEC zone signing. It should have lightweight exposed slaves, with just enough code to answer queries on the data-plane, and keep synchronized with the master on the control-plane.

But what this post is really about is looking at BIND9's code. It's a nicer than the OpenSSL code and some other open-source projects, but there do appear to be some issues. The bug was in the "dns_message_findname()" function. The function header looks like:

isc_result_t
dns_message_findname(dns_message_t *msg, dns_section_t section,
    dns_name_t *target, dns_rdatatype_t type,
    dns_rdatatype_t covers, dns_name_t **name,
    dns_rdataset_t **rdataset);

The thing you should notice here is that none of the variables are prefixed with const, even though all but one of them should be. A quick grep shows that lack of const correctness is pretty common throughout the BIND9 source code. Every quality guide in the world strongly suggests const correctness -- that's it's lacking here hints at larger problems.

The bug was an assertion failure on the "name" parameter in the code above, as you can see in the picture. An assertion is supposed to double-check internal consistency of data, to catch bugs early. But this case, there was no bug being caught -- it was the assertion itself that was the problem. The programmers are confused by the difference between in, out, and in/out parameters. You assert on the expected values of the in and in/out parameters, but not on write-only out parameters. Since the function doesn't read them, their value is immaterial. If the function wants it to be NULL on input, it can just set it itself -- demanding that the caller do this is just bad.

By the way, assertions are normally enabled only for testing, but not for production code. That's because they can introduce bugs (as in this case), and have performance problems. However, in the long run, aggressive double-checking leads to more reliable code. Thus, I'm a fan of such aggressive checking. However, quickly glancing at the recent BIND9 vulns, it appears many of them are caused by assertions failing. This may be good, meaning that the code was going to crash (or get exploited) anyway, and the assertion caught it early. Or, it may be bad, with the assertion being the bug itself, or at least, that the user would've been happier without the assertion triggering (because of a memory leak, for example). If the later is the case, then it sounds like people should just turn off the assertions when building BIND9 (it's a single command-line switch).

Last year, ISC (the organization that maintains BIND9) finished up their BIND10 project, which was to be a re-write of the code. This was a fiasco, of course. Rewrites of large software project are doomed to failure. The only path forward for BIND is with the current code-base. This means refactoring and cleaning up technical debt on a regular basis, such as fixing the const correctness problem. This means arbitrarily deciding to drop support for 1990s era computers when necessary. If the architecture needs to change (such as separating the data-plane from the control-plane), it can be done within the current code-base -- just create a solid regression test, then go wild on the changes relying upon the regression test to maintain the quality.

Lastly, I want to comment on the speed of BIND9. It's dog slow -- the slowest of all the DNS servers. That's a problem firstly because slow servers should not be exposed to DDoS attacks on the Internet. It's a problem secondly because slow servers should not be written in dangerous languages like C/C++ . These languages should only be used when speed is critical. If your code isn't fast anyway, then you should be using safe languages, like C#, Java, or JavaScript. A DNS server written in these languages is unlikely to be any slower than BIND9.

Conclusion

The point I'm trying to make here is that BIND9 should not be exposed to the public. It has code problems that should be unacceptable in this day and age of cybersecurity. Even if it were written perfectly, it has far too many features to be trustworthy. It's feature-richness makes it a great hidden master, it's just all those feature get in the way of it being a simple authoritative slave server, or a simple resolver. They shouldn't rewrite it from scratch, but if they did, they should choose a safe language and not use C/C++.




Example#2: strcpy()

BIND9 has 245 instances of the horribly unsafe strcpy() function, spread through 94 files. This is unacceptable -- yet another technical debt they need to fix. It needs to be replaced with the strcpy_s() function.

In the file lwresutil.c is an example of flawed thinking around strcpy(). It's not an exploitable bug, at least not yet, but it's still flawed.

lwres_getaddrsbyname(...)
{ unsigned int target_length;

target_length = strlen(name);
if (target_length >= sizeof(target_name))
return (LWRES_R_FAILURE);
strcpy(target_name, name); /* strcpy is safe */
}

The problem here, which I highlighted in bold. The problem is that on a 64-bit machine, an unsigned int is only 32-bits, but string lengths can be longer than a 32-bit value can hold. Thus, a 4-billion byte name would cause the integer to overflow and the length check to fail. I don't think you can get any name longer than 256 bytes through this code path, so it's likely not vulnerable now, but the "4-billion bytes of data" problem is pretty common in other code, and frequently exploitable in practice.

The comment /* strcpy is safe */ is no more accurate than those emails that claim "Checked by anti-virus".

Modern code should never use strcpy(), at all, under any circumstances, not even in the unit-test code where it doesn't matter. It's easy to manage projects by simply grepping for the string "strcpy()" and whether it exists or not, it's hard managing project with some strcpy()s. It's like being some pregnant.






1 comment:

The Ubiquitous Mr. Lovegroove said...

Even if you masscan'ed the internet and crashed every BIND out there, doubt it would make much difference. Root servers are probably patched a while ago - ISC tends to embargo critical bugs and BIND's popularity is <20%.