I’d love a blogpost written from the perspective of a chipmaker - Why this issue exists. I’d never question their competency, but it seems like a violation of expectations in hindsight. Based on my very limited understanding of these issues.— SwiftOnSecurity (@SwiftOnSecurity) January 4, 2018
The tl;dr version is this: the CPUs have no bug. The results are correct, it's just that the timing is different. CPU designers will never fix the general problem of undetermined timing.
CPUs are deterministic in the results they produce. If you add 5+6, you always get 11 -- always. On the other hand, the amount of time they take is non-deterministic. Run a benchmark on your computer. Now run it again. The amount of time it took varies, for a lot of reasons.
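A minimal sketch of that benchmark experiment, in Python. The helper names `add` and `time_runs` are made up for illustration; the only library call is the standard `time.perf_counter_ns`.

```python
import time

def add():
    # The result is deterministic: always 11.
    return 5 + 6

def time_runs(fn, runs=5):
    """Call fn repeatedly; return (results, elapsed_ns) as two lists."""
    results, elapsed = [], []
    for _ in range(runs):
        start = time.perf_counter_ns()
        results.append(fn())
        elapsed.append(time.perf_counter_ns() - start)
    return results, elapsed

if __name__ == "__main__":
    results, elapsed = time_runs(add)
    print(results)  # the same value on every run
    print(elapsed)  # nanosecond timings, which typically differ between runs
```

Run it a few times: the first list never changes, while the second usually does.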
That CPUs take an unknown amount of time is an inherent problem in CPU design. Even if you do everything right, "interrupts" from clock timers and network cards will still cause undefined timing problems. Therefore, CPU designers have thrown the concept of "deterministic time" out the window.
The biggest source of non-deterministic behavior is the high-speed memory cache on the chip. When a piece of data is in the cache, the CPU accesses it immediately. When it isn't, the CPU has to stop and wait for slow main memory. Other things happening in the system impact the cache, unexpectedly evicting recently used data for one purpose in favor of data for another purpose.
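The effect can be sketched with a toy model. This is a simulation, not a measurement of real hardware: the `ToyCache` class, its least-recently-used eviction policy, and the cycle counts of 1 and 100 are all assumptions chosen for illustration.

```python
from collections import OrderedDict

HIT_COST, MISS_COST = 1, 100  # made-up cycle counts, for illustration only

class ToyCache:
    """Hypothetical cache model with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return the simulated cost of touching addr."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return HIT_COST
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        return MISS_COST

def demo():
    cache = ToyCache(capacity=4)
    first = cache.access("x")    # not cached yet: slow
    second = cache.access("x")   # cached: fast
    for other in range(4):       # unrelated activity fills the cache
        cache.access(("other", other))
    third = cache.access("x")    # evicted in the meantime: slow again
    return first, second, third

if __name__ == "__main__":
    print(demo())
```

The same read of "x" costs a different amount depending on what else ran in between, which is the whole problem in miniature.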
Hackers love "non-deterministic", because while such things are unknowable in theory, they are often knowable in practice.
That's the case with the granddaddy of all hacker exploits, the "buffer overflow". From the programmer's perspective, the bug will result in just the software crashing for undefinable reasons. From the hacker's perspective, they reverse engineer what's going on underneath, then carefully craft buffer contents so the program doesn't crash, but instead continues to run the code the hacker supplies within the buffer. Buffer overflows are undefined in theory, well-defined in practice.
Hackers have already been exploiting these definable/undefinable timing problems with the cache for a long time. An example is cache timing attacks on AES. AES reads a matrix from memory as it encrypts things. By playing with the cache, evicting things, timing things, you can figure out the pattern of memory accesses, and hence the secret key.
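The idea behind such an attack can be sketched as a simulation. This is a toy model, not an attack on real AES: it assumes one table entry per cache line, a single key byte, a table index of plaintext XOR key, and made-up timings of 1 and 100. All names (`SharedCache`, `victim_lookup`, `recover_key_byte`) are hypothetical.

```python
HIT_COST, MISS_COST = 1, 100  # made-up cycle counts, for illustration only

class SharedCache:
    """Hypothetical cache shared by victim and attacker."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def load(self, index):
        """Load a table entry and return the simulated access time."""
        cost = HIT_COST if index in self.lines else MISS_COST
        self.lines.add(index)
        return cost

def victim_lookup(cache, key_byte, plaintext_byte):
    """The victim's table read; which entry it touches depends on the key."""
    cache.load(plaintext_byte ^ key_byte)

def recover_key_byte(cache, run_victim, plaintext_byte=0x3C):
    """Evict everything, let the victim run, then time every table entry."""
    cache.flush()
    run_victim(plaintext_byte)
    timings = [cache.load(i) for i in range(256)]
    fast_index = timings.index(min(timings))  # the entry the victim touched
    return fast_index ^ plaintext_byte

if __name__ == "__main__":
    cache = SharedCache()
    secret = 0xA7
    guess = recover_key_byte(cache, lambda p: victim_lookup(cache, secret, p))
    print(hex(guess))
```

The attacker never reads the key. It only learns which entry was fast, and because it knows the plaintext, that is enough. Real attacks are noisier and see only cache-line granularity, so they need many measurements.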
Such cache timing attacks have been around since the beginning, really, and it's simply an unsolvable problem. Instead, we have workarounds, such as changing our crypto algorithms to not depend upon cache, or better yet, implementing them directly in the CPU (such as the Intel AES specialized instructions).
What's happened today with Meltdown is that incompletely executed instructions, which discard their results, do affect the cache. We can then recover those partial/temporary/discarded results by measuring the cache timing. This has been known for a while, but we couldn't figure out how to successfully exploit this, as this paper from Anders Fogh reports. Hackers fixed this, making it practically exploitable.
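The mechanism can be modeled in a few lines. This is a toy simulation, not exploit code and not a description of how any real CPU is built: `ToyCPU`, its methods, the 256-entry probe array, and the timings of 1 and 100 are all assumptions made for illustration.

```python
HIT_COST, MISS_COST = 1, 100  # made-up cycle counts, for illustration only

class ToyCPU:
    """Hypothetical CPU whose discarded work still leaves a cache footprint."""

    def __init__(self, kernel_memory):
        self._kernel = kernel_memory      # bytes user code may not read
        self.cached_probe_lines = set()   # which probe[] entries are cached

    def read_kernel_and_touch(self, addr):
        """Model a forbidden read followed by a dependent load of probe[value]."""
        value = self._kernel[addr]               # executed ahead of the check
        self.cached_probe_lines.add(value)       # probe[value] enters the cache
        # The permission check fails: the result is discarded, the cache is not.
        raise PermissionError("access denied")

    def flush_probe(self):
        self.cached_probe_lines.clear()

    def time_probe(self, index):
        return HIT_COST if index in self.cached_probe_lines else MISS_COST

def leak_byte(cpu, addr):
    """Recover one byte purely from which probe entry is fast afterwards."""
    cpu.flush_probe()
    try:
        cpu.read_kernel_and_touch(addr)
    except PermissionError:
        pass  # architecturally, nothing was read
    timings = [cpu.time_probe(i) for i in range(256)]
    return timings.index(min(timings))

def leak(cpu, length):
    return bytes(leak_byte(cpu, addr) for addr in range(length))

if __name__ == "__main__":
    print(leak(ToyCPU(b"secret"), 6))
```

The attacker's code only ever sees the exception and a list of timings, yet the timings alone reconstruct the protected bytes.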
As a CPU designer, Intel has few good options.
Fixing cache timing attacks is an impossibility. They can do some tricks, such as allowing some software to reserve part of the cache for private use, for special crypto operations, but the general problem is unsolvable.
Keeping these "incomplete results" from affecting the cache is also difficult. Intel has the fastest CPUs, and the reason is exactly this kind of speculative execution, where the CPU runs instructions ahead of time and discards the results if they turn out not to be needed. The other CPU designers have the same problem: fixing the three problems identified today would cause massive performance issues. They'll come up with improvements, probably, but not complete solutions.
Instead, the fix is within the operating system. Frankly, it's a needed change that should've been done a decade ago. Operating system makers have just been putting it off because of the performance hit. Now that the change has been forced to happen, CPU designers will probably figure out ways to mitigate the performance cost.
Thus, the Intel CPU you buy a year from now will have some partial fixes for these exact problems without addressing the larger security concerns. It will also have performance enhancements to make the operating system patches faster.
But the underlying theoretical problem will never be solved, and is essentially unsolvable.