Monday, January 29, 2018

The problematic Wannacry North Korea attribution

Last month, the US government officially "attributed" the Wannacry ransomware worm to North Korea. This attribution has three flaws, which are a good lesson for attribution in general.

Monday, January 22, 2018

"Skyfall attack" was attention seeking

After the Meltdown/Spectre attacks, somebody created a website promising related "Skyfall/Solace" attacks. They revealed today that it was a "hoax".

It was a bad hoax. It wasn't a clever troll, parody, or commentary. It was childish behavior seeking attention.

For all you hate naming of security vulnerabilities, Meltdown/Spectre was important enough to deserve a name. Sure, from an infosec perspective, it was minor, we just patch and move on. But from an operating-system and CPU design perspective, these things where huge.

Page table isolation to fix Meltdown is a fundamental redesign of the operating system. What you learned in college about how Solaris, Windows, Linux, and BSD were designed is now out-of-date. It's on the same scale of change as address space randomization.

The same is true of Spectre. It changes what capabilities are given to JavaScript (buffers and high resolution timers). It dramatically increases the paranoia we have of running untrusted code from the Internet. We've been cleansing JavaScript of things like buffer-overflows and type confusion errors, now we have to cleanse it of branch prediction issues.

Moreover, not only do we need to change software, we need to change the CPU. No, we won't get rid of branch-prediction and out-of-order execution, but there things that can easily be done to mitigate these attacks. We won't be recalling the billions of CPUs already shipped, and it will take a year before fixed CPUs appear on the market, but it's still an important change. That we fix security through such a massive hardware change is by itself worthy of "names".

Yes, the "naming" of vulnerabilities is annoying. A bunch of vulns named by their creators have disappeared, and we've stopped talking about them. On the other hand, we still talk about Heartbleed and Shellshock, because they were damn important. A decade from now, we'll still be talking about Meltdown/Spectre. Even if they hadn't been named by their creators, we still would've come up with nicknames to talk about them, because CVE numbers are so inconvenient.

Thus, the hoax's mocking of the naming is invalid. It was largely incoherent rambling from somebody who really doesn't understand the importance of these vulns, who uses the hoax to promote themselves.

Thursday, January 04, 2018

Some notes on Meltdown/Spectre

I thought I'd write up some notes.

You don't have to worry if you patch. If you download the latest update from Microsoft, Apple, or Linux, then the problem is fixed for you and you don't have to worry. If you aren't up to date, then there's a lot of other nasties out there you should probably also be worrying about. I mention this because while this bug is big in the news, it's probably not news the average consumer needs to concern themselves with.

This will force a redesign of CPUs and operating systems. While not a big news item for consumers, it's huge in the geek world. We'll need to redesign operating systems and how CPUs are made.

Don't worry about the performance hit. Some, especially avid gamers, are concerned about the claims of "30%" performance reduction when applying the patch. That's only in some rare cases, so you shouldn't worry too much about it. As far as I can tell, 3D games aren't likely to see less than 1% performance degradation. If you imagine your game is suddenly slower after the patch, then something else broke it.

This wasn't foreseeable. A common cliche is that such bugs happen because people don't take security seriously, or that they are taking "shortcuts". That's not the case here. Speculative execution and timing issues with caches are inherent issues with CPU hardware. "Fixing" this would make CPUs run ten times slower. Thus, while we can tweek hardware going forward, the larger change will be in software.

There's no good way to disclose this. The cybersecurity industry has a process for coordinating the release of such bugs, which appears to have broken down. In truth, it didn't. Once Linus announced a security patch that would degrade performance of the Linux kernel, we knew the coming bug was going to be Big. Looking at the Linux patch, tracking backwards to the bug was only a matter of time. Hence, the release of this information was a bit sooner than some wanted. This is to be expected, and is nothing to be upset about.

It helps to have a name. Many are offended by the crassness of naming vulnerabilities and giving them logos. On the other hand, we are going to be talking about these bugs for the next decade. Having a recognizable name, rather than a hard-to-remember number, is useful.

Should I stop buying Intel? Intel has the worst of the bugs here. On the other hand, ARM and AMD alternatives have their own problems. Many want to deploy ARM servers in their data centers, but these are likely to expose bugs you don't see on x86 servers. The software fix, "page table isolation", seems to work, so there might not be anything to worry about. On the other hand, holding up purchases because of "fear" of this bug is a good way to squeeze price reductions out of your vendor. Conversely, later generation CPUs, "Haswell" and even "Skylake" seem to have the least performance degradation, so it might be time to upgrade older servers to newer processors.

Intel misleads. Intel has a press release that implies they are not impacted any worse than others. This is wrong: the "Meltdown" issue appears to apply only to Intel CPUs. I don't like such marketing crap, so I mention it.

Statements from companies:

Wednesday, January 03, 2018

Why Meltdown exists

So I thought I'd answer this question. I'm not a "chipmaker", but I've been optimizing low-level assembly x86 assembly language for a couple of decades.

The tl;dr version is this: the CPUs have no bug. The results are correct, it's just that the timing is different. CPU designers will never fix the general problem of undetermined timing.

Let's see if I've got Metldown right

I thought I'd write down the proof-of-concept to see if I got it right.

So the Meltdown paper lists the following steps:

 ; flush cache
 ; rcx = kernel address
 ; rbx = probe array
 mov al, byte [rcx]
 shl rax, 0xc
 jz retry
 mov rbx, qword [rbx + rax]
 ; measure which of 256 cachelines were accessed

So the first step is to flush the cache, so that none of the 256 possible cache lines in our "probe array" are in the cache. There are many ways this can be done.

Now pick a byte of secret kernel memory to read. Presumably, we'll just read all of memory, one byte at a time. The address of this byte is in rcx.

Now execute the instruction:
    mov al, byte [rcx]
This line of code will crash (raise an exception). That's because [rcx] points to secret kernel memory which we don't have permission to read. The value of the real al (the low-order byte of rax) will never actually change.

But fear not! Intel is massively out-of-order. That means before the exception happens, it will provisionally and partially execute the following instructions. While Intel has only 16 visible registers, it actually has 100 real registers. It'll stick the result in a pseudo-rax register. Only at the end of the long execution change, if nothing bad happen, will pseudo-rax register become the visible rax register.

But in the meantime, we can continue (with speculative execution) operate on pseudo-rax. Right now it contains a byte, so we need to make it bigger so that instead of referencing which byte it can now reference which cache-line. (This instruction multiplies by 4096 instead of just 64, to prevent the prefetcher from loading multiple adjacent cache-lines).
 shl rax, 0xc

Now we use pseudo-rax to provisionally load the indicated bytes.
 mov rbx, qword [rbx + rax]

Since we already crashed up top on the first instruction, these results will never be committed to rax and rbx. However, the cache will change. Intel will have provisionally loaded that cache-line into memory.

At this point, it's simply a matter of stepping through all 256 cache-lines in order to find the one that's fast (already in the cache) where all the others are slow.