Wednesday, July 15, 2015

Software and the bogeyman

This post about the July 8 glitches (United, NYSE, WSJ failed) keeps popping up in my Twitter timeline. It's complete nonsense.

What's being argued here is that these glitches were due to some sort of "moral weakness", like laziness, politics, or stupidity. It's a facile and appealing argument, so scoundrels make it often -- to great applause from the audience. But it's not true.


Layers and legacies exist because working systems are precious. More than half of big software projects are abandoned, because getting new things to work is a hard task. We place so much value on legacy, working old systems, because the new replacements usually fail.

An example of this is the failed BIND10 project. BIND, the Berkeley Internet Name Daemon, is the oldest and most popular DNS server. It is the de facto reference standard for how DNS works, more so than the actual RFCs. Version 9 of the project is 15 years old. Therefore, the consortium that maintains it funded development for version 10. They completed the project, then effectively abandoned it, as it was worse in almost every way than the previous version.

The reason legacy works well is the enormous regression testing that goes on. In robust projects, every time there is a bug, engineers create one or more tests to exercise the bug, then add that to the automated testing, so that from now on, that bug (or something similar) can never happen again. You can look at a project like BIND9 and point to the thousands of bugs it's had over the last decade. So many bugs might make you think it's bad, but the opposite is true: it means that it's got an enormous regression test system that stresses the system in peculiar ways. A new replacement will have just as many bugs -- but no robust test that will find them.

A regression test is often more important than the actual code. If you want to build a replacement project, start with the old regression test. If you are a software company and want to steal your competitors intellectual property, ignore their source, steal their regression test instead.

People look at the problems of legacy and believe that we'd be better off without it, if only we had the will (the moral strength) to do the right thing and replace old system. That's rarely true. Legacy is what's reliable and working -- it's new stuff that ultimately is untrustworthy and likely to break. You should worship legacy, not fear it.

Technical debt

Almost all uses of the phrase "technical debt" call it a bad thing. The opposite is true. The word was coined to refer to a good thing.

The analogy is financial debt. That, too, is used incorrectly as a pejorative. People focus on the negatives, the tiny percentage of bankruptcies. They don't focus on the positives, what that debt finances, like factories, roads, education, and so on. Our economic system is "capitalism", where "capital" just means "debt". The dollar bills in your wallet are a form of debt. When you contribute to society, they are indebted to you, so give you a marker, which you can then redeem by giving back to society in exchange something that you want, like a beer at your local bar.

The same is true of technical debt. It's a good thing, a necessary thing. The reason we talk about technical debt isn't so that we can get rid of it, but so that we can keep track of it and exploit it.

The Medium story claims:
A lot of new code is written very very fast, because that’s what the intersection of the current wave of software development (and the angel investor / venture capital model of funding) in Silicon Valley compels people to do.
This is nonsense. Every software project of every type has technical debt. Indeed, it's open-source that overwhelmingly has the most technical debt. Most open-source software starts as somebody wanting to solve a small problem now. If people like the project, then it survives, and more code and features are added. If people don't like it, the project disappears. By sheer evolution, that which survives has technical debt. Sure, some projects are better than others at going back and cleaning up their debt, but it's something intrinsic to all software engineering.

Figuring out what user's want is 90% of the problem, how the code works is only 10%. Most software fails because nobody wants to use it. Focusing on removing technical debt, investing many times more effort in creating the code, just magnifies the cost of failure when your code still doesn't do what users want. The overwhelmingly correct development methodology is to incur lots of technical debt at the start of every project.

Technical debt isn't about bugs. People like to equate the two, as both are seen as symptoms of moral weakness. Instead, technical debt is about the fact that fixing bugs (or adding features) is more expensive the more technical debt you have. If a section of the code is bug-free, and unlikely to be extended to the future, then there will be no payback for cleaning up the technical debt. On the other hand, if you are constantly struggling with a core piece of code, making lots of changes to it, then you should refactor it, cleaning up the technical debt so that you can make changes to it.

In summary, technical debt is not some sort of weakness in code that needs to be fought, but merely an aspect of code that needs to be managed.


More and more, software ends up interacting with other software. This causes unexpected things to happen.

That's true, but the alternative is worse. As a software engineer building a system, you can either link together existing bits of code, or try to "reinvent the wheel" and write those bits yourself. Reinventing is sometimes good, because you get something tailored for your purpose without all the unnecessary complexity. But more often you experience the rewriting problem I describe above: your new code is untested and buggy, as opposed to the well-tested, robust, albeit complex module that you avoided.

The reality of complexity is that we demand it of software. We all want Internet-connected lightbulbs in our homes that we can turn on/off with a smartphone app while vacationing in Mongolia. This demands a certain level of complexity. We like such complexity -- arguing that we should get rid of it and go back to a simpler time of banging rocks together is unacceptable.

When you look at why glitches at United, NYSE, and WSJ happen, it because once they've got a nice robust system working, they can't resist adding more features to it. It's like bridges. Over decades, bridge builders get more creative and less conservative. Then a bridge fails, because builders were to aggressive, and the entire industry moves back into becoming more conservative, overbuilding bridges, and being less creative about new designs. It's been like that for millennia. It's a good thing, have you even seen the new bridges lately? Sure, it has a long term cost, but the thing is, this approach also has benefits that more than make up for the costs. Yes, NYSE will go down for a few hours every couple years because of a bug they've introduced into their system, but the new features are worth it.

By the way, I want to focus on the author's quote:
Getting rid of errors in code (or debugging) is a beast of a job
There are two types of software engineers. One type avoids debugging, finding it an unpleasant aspect of their job. The other kind thinks debugging is their job -- that writing code is just that brief interlude before you start debugging. The first kind often gives up on bugs, finding them to be unsolveable. The second type quickly finds every bug they encountered, even the most finicky kind. Every large organization is split between these two camps: those busy writing code causing bugs, and the other camp fixing them. You can tell which camp the author of this Medium story falls into. As you can tell, I have enormous disrespect for such people.

"Lack of interest in fixing the actual problem"

The NYSE already agrees that uptime and reliability is the problem, above all others, that they have to solve. If they have a failure, it doesn't mean they aren't focused on failures as the problem.

But in truth, it's not as big a problem as they think. The stock market doesn't actually need to be that robust. It's more likely to "fail" for other reasons. For example, every time a former President dies (as in the case of Ford, Nixon, and Reagan), the markets close for a day in mourning. Likewise, wild market swings caused by economic conditions will automatically shut down the market, as they did in China recently.

Insisting that code be perfect is absurd, and impossible. Instead, the only level of perfection the NYSE needs is so that glitches in code shut down the market less often than dead presidents or economic crashes.

The same is true of United Airlines. Sure, a glitch grounded their planes, but weather and strikes are a bigger problem. If you think grounded airplanes is such an unthinkable event, then the first thing you need to do is ban all unions. I'm not sure I disagree with you, since it seems every flight I've had through Charles de Gaulle airport in Paris has been delayed by a strike (seriously, what is wrong with the French?). But that's the sort of thing you are effectively demanding.

The only people who think that reliability and uptime are "the problem" that needs to be fixed are fascists. They made trains "run on time" by imposing huge taxes on the people to overbuild the train system, then putting guns to the heads of the conductors, making them indentured servants. The reality is that "glitches" are not "the problem" -- making systems people want to use is the problem. Nobody likes it when software fails, of course, but that's like saying nobody likes losing money when playing poker. It's a risk vs. reward, we can make software more reliable but at such a huge cost that it would, overall, make software less desirable.


Security and reliability are tradeoffs. Problems happen not because engineers are morally weak (political, stupid, lazy), but because small gains in security/reliability would require enormous sacrifices in other areas, such as features, cost, performance, and usability. But "tradeoffs" are a complex answer, requiring people to thinki. "Moral weakness" is a much easier, and more attractive answer that doesn't require much thought, since everyone is convinced everyone else (but them) is morally weak. This is why so many people in my Twitter timeline keep mentioning that stupid Medium article.

No comments: