Tuesday, April 03, 2007

Why did the ANI patch take so long?

Microsoft made us all laugh with the comment in their blog about the ANI patch:
"I'm sure one question in people's minds is how we're able to release an update for this issue so quickly"

Um, no, the question in everybody's mind was why it took so long.

Actually, there is a reason it took so long. As documented, there was a bug with a RealTek audio control panel. Even if Microsoft makes ZERO code changes, simply rebuilding the product can lead to bugs like this, either within Windows itself or in third-party drivers, applications, or control panels. This bug happened because of something wrong in RealTek's code, not Microsoft's code.

Few people realize this, but when Microsoft tests a patch prior to shipping, they also test it against popular third-party applications. They find conflicts caused by other people's code. When they encounter such an issue, they change their patch until the third-party bug no longer appears. In some cases, they have changed the Windows specification just to work around some weirdness in a popular application. Microsoft doesn't like to talk about this because they don't want to insult other people, but this sort of thing happens a lot. What appears to be "Microsoft's fault" is often Microsoft covering for other vendors.

Everybody complains that Microsoft takes a week to ship a bug fix for an actively exploited bug, but the fact that they can test 30 different versions of Windows (various patch levels, 2k/XP/2003/Vista, Itanium/x86/x64) and thousands of apps on top of that in under a week is simply amazing.

Many in the community don't think such thorough testing is needed. However, there is a good chance that the cost to those running RealTek audio because of this bug will be greater than the cost to those getting exploited. The reason others can create pseudo-patches for such bugs is that they don't have to suffer the consequences when a patch causes more problems for a customer than it solves. Since this QA process is so long, Microsoft tries to schedule when it fixes bugs, so that it can test many fixes at once rather than retesting every time.

On the other hand, I'm not sure whether Microsoft's timeline is shrinking. The time between when a developer checks in a fix and when the patch ships to customers should be a measured part of a "secure development lifecycle", and it should be continuously shrinking. No matter how big the problem, engineers shouldn't be making excuses for why it takes so long. For example, instead of rebuilding the affected DLL, maybe Microsoft should "patch" it: overwrite the affected area with a 'jmp' instruction to some dead padding area in the DLL that contains the fix, then 'jmp' back. Or do something else: there is always a clever way to solve problems. I'm curious (a) whether Microsoft has been tracking this time-to-patch, (b) whether it's been shrinking or growing, and (c) whether they are researching ways to fix bugs other than rebuilding the affected files.
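
To make the 'jmp' idea concrete, here is a rough sketch of that kind of binary patch, done on a plain byte buffer rather than a real DLL. It is only an illustration under stated assumptions, not how Microsoft builds patches: the names apply_detour, site, cave, and fix are made up for this example, the padding is assumed to be 0xCC bytes, and the fix bytes are assumed to already re-implement whatever instructions the jump overwrites. The only hard fact used is the x86 encoding of a near jump: opcode E9 followed by a 32-bit little-endian displacement measured from the end of the jump instruction.

```python
import struct

JMP_REL32 = 0xE9   # x86 near jump, 32-bit relative displacement
NOP = 0x90         # x86 one-byte no-op
PADDING = 0xCC     # assumption: dead padding in the image is filled with 0xCC
JMP_LEN = 5        # one opcode byte plus four displacement bytes


def encode_jmp(src: int, dst: int) -> bytes:
    """Encode a jmp located at offset src that lands on offset dst."""
    # The displacement is relative to the address of the next instruction.
    return bytes([JMP_REL32]) + struct.pack("<i", dst - (src + JMP_LEN))


def apply_detour(image: bytearray, site: int, replaced_len: int,
                 cave: int, fix: bytes) -> None:
    """Redirect execution at site into padding at cave, run fix, jump back.

    replaced_len is the number of whole-instruction bytes overwritten at
    site; the caller must make sure fix covers what those bytes did.
    """
    if replaced_len < JMP_LEN:
        raise ValueError("need at least 5 bytes at the patch site")
    resume = site + replaced_len
    body = fix + encode_jmp(cave + len(fix), resume)  # the fix, then jump back
    cave_end = cave + len(body)
    if cave_end > len(image) or any(b != PADDING for b in image[cave:cave_end]):
        raise ValueError("cave is not free padding")
    image[cave:cave_end] = body
    # Overwrite the buggy bytes with a jump to the cave, padding with NOPs.
    image[site:resume] = encode_jmp(site, cave) + bytes([NOP]) * (replaced_len - JMP_LEN)


# Hypothetical image: 16 bytes of "code" followed by 32 bytes of padding.
image = bytearray(b"\x90" * 16 + b"\xcc" * 32)
apply_detour(image, site=4, replaced_len=6, cave=16, fix=b"\x31\xc0")
```

The file keeps its size and everything outside the two touched regions stays byte-for-byte identical, which is the whole appeal: in principle there is less to retest than after a full rebuild. A real patch would still have to deal with relocations, file signatures, and code that is executing while the bytes change, none of which this sketch attempts.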
