Tuesday, July 03, 2012
More leap-seconds, not fewer
Last week, a “leap-second” was added to clocks to account for the changing speed of Earth’s rotation, as is done about once every three years. This caused many servers throughout the Internet to crash. The popular sites Reddit and LinkedIn were down for an hour as a result. There isn’t just one leap-second bug; instead it’s a bug spread throughout all software that assumes a minute is always exactly 60 seconds long, and can never be 61 or 59 seconds long. Because of the pervasive nature of this bug, many have suggested getting rid of leap-seconds.
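To make the assumption concrete, here is a minimal sketch in Python. It is only an illustration of the "a minute never has a 61st second" assumption, not a reproduction of whatever failed on those sites. The helper name `is_representable` is made up for this example.

```python
from datetime import datetime

def is_representable(year, month, day, hour, minute, second):
    """Return True if the standard datetime type can hold this timestamp.

    Hypothetical helper for illustration only.
    """
    try:
        datetime(year, month, day, hour, minute, second)
        return True
    except ValueError:
        # datetime only accepts seconds in the range 0..59
        return False

# The last ordinary second of June 30, 2012 is fine...
print(is_representable(2012, 6, 30, 23, 59, 59))
# ...but the leap second itself, 23:59:60, is rejected.
print(is_representable(2012, 6, 30, 23, 59, 60))
```

Code that builds timestamps this way has the 60-second minute baked in, and nothing exercises that assumption until a leap second actually arrives.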
A far better solution would be to increase the number of leap-seconds. I propose that every 700 hours (roughly one month), we either insert or remove 1 second, depending upon whether the current time is slightly ahead or slightly behind. (700 hours is a good choice because it’s not aliased to a month or day: it works out to 29 days and 4 hours, so each adjustment lands on a different day and at a different hour than the one before.)
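The arithmetic behind the 700-hour choice can be checked with a short sketch. The starting instant below is an arbitrary assumption picked for the example, and `adjustment_times` is a hypothetical helper, not part of any timekeeping standard.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(hours=700)  # 29 days and 4 hours

def adjustment_times(start, count):
    """List the next `count` adjustment instants after `start`."""
    return [start + i * PERIOD for i in range(1, count + 1)]

# Assumed starting point, chosen only for illustration.
start = datetime(2012, 7, 1, 0, 0, 0)
times = adjustment_times(start, 12)

for t in times:
    print(t.strftime("%Y-%m-%d %a %H:%M"))

# Hours of the day that the adjustments land on.
hours_used = sorted({t.hour for t in times})
print(hours_used)
```

Because 700 is 4 more than a multiple of 24, the hour of day advances by 4 with each adjustment and cycles through six different hours, while the calendar date drifts steadily earlier against a 30- or 31-day month.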
The problem with the leap-second bug is that it could go three years without being detected, and then suddenly cause large parts of the Internet to crash. With my proposal, a new bug could go at most about 30 days without being detected. This is short enough to fall within the beta period of most software, meaning it will rarely be seen in shipping software. Or if it is seen in shipping software, it’ll be a few early adopters who see it.
Bug-avoidance is a fallacy among programmers. They use numerous tricks, like “try{}catch(…)”, to prevent software from crashing, even in the face of bugs. This just leads to inherently unstable software. Instead, software needs to be delicate, so that bugs cannot hide but show themselves early, for example by using “assert()” in shipping code. The earlier a bug shows itself, the faster it gets fixed, and the more stable the software becomes.
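A rough sketch of the difference, in Python rather than the C-style syntax quoted above. Both functions are hypothetical: each tallies events into a 60-slot table indexed by the second in which they occurred.

```python
def record_tolerant(counts, second):
    """Swallow the error: the unexpected second is silently dropped."""
    try:
        counts[second] += 1
    except IndexError:
        pass  # the bad assumption stays hidden

def record_delicate(counts, second):
    """Fail loudly the first time the assumption is violated."""
    assert 0 <= second < len(counts), f"unexpected second value: {second}"
    counts[second] += 1

tolerant_counts = [0] * 60
record_tolerant(tolerant_counts, 60)   # leap second: no crash, no record
print(sum(tolerant_counts))            # the event has vanished

delicate_counts = [0] * 60
try:
    record_delicate(delicate_counts, 60)
except AssertionError as err:
    print("caught early:", err)
```

The tolerant version keeps running but quietly loses data, while the delicate version stops at the exact line where the assumption broke. Note that Python strips `assert` statements when run with the `-O` flag, so code that relies on them in production has to avoid that flag or use an explicit check.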
It’s not just programmers; scientists often have this fallacy, too. Bad scientists try to make their experiments extremely hard to reproduce, to prevent other scientists from proving their theory wrong. Good scientists do the opposite, clearly documenting their experiments to make it very easy for other scientists to prove them wrong, if in fact their theory is wrong. Theories that survive the effort to prove them wrong are what we consider “fact”. Likewise, stable, robust, reliable software is software that twitches at the smallest error.
Thus, having more leap-seconds (say, one every 700 hours) is at least as good a solution as getting rid of them, and probably a better one.
eh, the real problem is that these big sites were using Linux. if they had been using a sane kernel this would never have happened. [/semi-troll]