Tuesday, July 03, 2012

More leap-seconds, not fewer

Last week, a “leap-second” was added to clocks to account for the changing speed of Earth’s rotation, something that happens roughly once every three years. It crashed many servers throughout the Internet; the popular sites Reddit and LinkedIn were down for an hour as a result. There isn’t just one leap-second bug. Rather, the bug is spread throughout all software that assumes a minute is always exactly 60 seconds long and can never be 59 or 61 seconds. Because the bug is so pervasive, many have suggested getting rid of leap-seconds.

A far better solution would be to increase the number of leap-seconds instead. I propose that every 700 hours (roughly one month), we either insert or remove one second, depending on whether the current time is slightly ahead of or slightly behind Earth’s rotation. (700 hours is a good choice because it isn’t aliased to a month or a day, meaning the correction will land on a different day and at a different hour each time.)

The problem with the leap-second bug is that it can lie dormant for three years and then suddenly crash large parts of the Internet. Under my proposal, a new bug could last at most about 30 days without being detected. That is short enough for the bug to surface during a product’s beta period, meaning it will rarely be seen in shipping software. And if it is seen in shipping software, only a few early adopters will see it.

Bug-avoidance is a fallacy among programmers. They use numerous tricks, like wrapping code in “try{}catch(…)”, to keep software from crashing even in the face of bugs. This just leads to inherently unstable software. Instead, software needs to be delicate, so that bugs cannot hide but show themselves early, for example by leaving “assert()” checks enabled in shipping code. The earlier a bug shows itself, the faster it gets fixed, and the more stable the software becomes.

It’s not just programmers; scientists often fall for this fallacy, too. Bad scientists make their experiments extremely hard to reproduce, to prevent other scientists from proving their theory wrong. Good scientists do the opposite, documenting their experiments clearly so that it is easy for other scientists to prove them wrong – if in fact the theory is wrong. Theories that survive the effort to disprove them are the ones we treat as “fact”. Likewise, stable, robust, reliable software is software that twitches at the smallest error.

Thus, more leap-seconds (such as one every 700 hours) is just as good a solution as getting rid of leap-seconds, or even better.


Another idea is “leap-milliseconds” instead of “leap-seconds”, happening roughly once a day and going either forward or backward. This would automatically “smear” the change across many small updates rather than one catastrophic update.

In any event, “ntpd” (the service that synchronizes a computer’s clock over the network) could likewise smear time automatically, never changing the clock by more than 1 millisecond per minute after the first synchronization at system startup.

For programmers, the correct solution is to stop using “wall clock” or “real world” time. Operating systems provide “internal time” for precisely this reason, such as GetTickCount64() on Windows or clock_gettime(CLOCK_MONOTONIC) on Linux. These measure the progress of time without regard to the wall clock and are not reset when the computer’s view of wall-clock time changes. (There was a good blog post I saw several months ago describing in detail why wall-clock time was bad, but I can’t find it anymore. Anybody have a link?)

This should be one of the things “static analysis” tools check. Reading the wall clock isn’t itself bad, but analyzers should be able to flag code that computes the difference between two wall-clock times.

1 comment:

Anonymous said...

eh, the real problem is that these big sites were using Linux. if they had been using a sane kernel this would never have happened. [/semi-troll]