Thursday, February 09, 2012

My notes on Intel's Transactional Memory (it's like a cmpxchg64b)

People are excited that Intel is adding "transactional memory" aka "TSX" features in 2013. Ars Technica has an article here, but it feels like gobbledygook to me. I'm not sure I can provide a less confusing description, but as a programmer who writes multicore code, I think I can provide a different description.


It's not about making synchronization "easier" (as Ars Technica claims), but about making synchronization more "scalable". The problem with multicore code is that, after a point, adding more cores doesn't make the software any faster. That's because, after a point, the synchronization overhead of each additional core exceeds the processing power it adds. This is known as "Amdahl's Law". A lot of multithreaded software can use 2, 3, or maybe 4 cores, but most software fails to scale to the 12 cores you get on a high-end Intel desktop.

Transactional memory reduces the cost of synchronization, and thus, makes software more scalable.

Average programmers won't use transactional memory themselves, but instead, it's the operating system, Java VMs, and assorted libraries that will use it. The programmers will just notice that their code now scales better.

An example of this is the code "atomic {x++; y++;}" in the Ars Technica article. That's precisely the sort of thing you can't do with TSX instructions. TSX transactions work on a 64 byte cache line. Thus, the above code snippet would only work if 'x' and 'y' were on the same cache line. These complicated details mean it's unlikely that high-level language constructs will matter. Instead, programmers will see low-level constructs, such as the current "intrinsics" used for SSE/AVX code that map from C/C++ to x86 assembly instructions. Programmers who write operating systems or libraries will use these intrinsics, but most programmers won't.
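Here's roughly what the low-level version of "atomic {x++; y++;}" would look like with such intrinsics. I'm using the _xbegin/_xend names from Intel's spec (treat this as a sketch; compilers are still catching up), and deliberately packing 'x' and 'y' onto one 64-byte cache line:

    #include <immintrin.h>   /* assumed home of the RTM intrinsics */

    /* pack both counters into a single 64-byte cache line */
    struct counters {
        long x;
        long y;
    } __attribute__((aligned(64)));

    static struct counters c;

    void increment_both(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            c.x++;               /* both writes hit the same cache line */
            c.y++;
            _xend();             /* commit the transaction */
        } else {
            /* transaction aborted; a real program falls back to a lock here */
        }
    }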


At a lower level, what these instructions really do is change how programmers use "spinlocks" and "wait-free algorithms".

A "spinlock" is when an additional number is used to synchronize access to memory. Before messing around with a chunk of data, code first changes the spinlock from '0' to '1', then messes around, then changes it back from '1' to '0'. If another thread comes along during this time, it notices that the spinlock is '1'. The other thread then "spins", repeatedly checking the spinlock value over and over again until it goes back to zero.

Notice that in some cases, the above logic wouldn't work. If two threads hit the start of this code at precisely the same time, both will see that the spinlock variable is '0', both will change it to '1' at the same time, and both will proceed, causing bad things to happen. Thus, this change is done in what's called an "atomic" operation. On the x86, this is done with a "lock cmpxchg" instruction. This locks out other processor cores from changing that value at the same time. In this scenario, one core will succeed in changing the value from '0' to '1', but the other core will fail, and know to start spinning.
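To make that concrete, here's a minimal spinlock sketch in C. GCC's __sync_val_compare_and_swap builtin compiles down to the "lock cmpxchg" described above:

    static volatile int spinlock = 0;

    void spin_lock(void)
    {
        /* atomically change the spinlock from 0 to 1; if another core won
           the race, the compare fails and we spin until the lock is free */
        while (__sync_val_compare_and_swap(&spinlock, 0, 1) != 0)
            ;   /* spin */
    }

    void spin_unlock(void)
    {
        __sync_lock_release(&spinlock);   /* put the 0 back */
    }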

The "wait-free" techniques take a different approach. Instead of using a single atomic instruction to synchronize access, they use a series of atomic instructions. This allows both threads to make forward progress, without forcing one thread to stop and wait for another.

Whether "spinlocks" or "wait-free" is better depends on a lot of factors. Wait-free is better from a theoretical point of view, but since atomic instructions are slow (30 clock cycles, or L3 cache speed), wait-free can be slower in practice.


So here's what TSX does. If you are thinking in terms of "spinlocks", it makes the code more efficient. Change your data so that it fits within a single cache-line, then write your "spinlock" as normal, but with a TSX extension ("xacquire lock cmpxchg", instead of "lock cmpxchg"). What happens is that, on older processors, the "xacquire" is ignored as a NOP (its value is 0xF2, which is meaningless to older processors in this context). On newer processors (2013 or later), the processor keeps track of what's really happening, and makes it more efficient. If two threads protected by that spinlock don't actually write into each other's data, then it's as if the spinlock didn't exist.
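Here's a sketch of what that elided spinlock might look like, hand-encoding the xacquire (0xF2) and xrelease (0xF3) prefix bytes, since assemblers that predate TSX won't know the mnemonics (GCC-style inline asm; illustrative, not production code):

    static volatile int hle_lock = 0;

    void hle_spin_lock(void)
    {
        int prev;
        do {
            prev = 0;
            /* xacquire lock cmpxchg: elided on TSX hardware; the
               0xF2 prefix is silently ignored on older CPUs */
            asm volatile(".byte 0xF2\n\t"
                         "lock cmpxchgl %2, %1"
                         : "+a"(prev), "+m"(hle_lock)
                         : "r"(1)
                         : "memory", "cc");
        } while (prev != 0);   /* someone else holds it: spin */
    }

    void hle_spin_unlock(void)
    {
        /* xrelease mov: ends the elision region (a plain mov on older CPUs) */
        asm volatile(".byte 0xF3\n\t"
                     "movl %1, %0"
                     : "+m"(hle_lock)
                     : "r"(0)
                     : "memory");
    }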

Better yet, if you are thinking in terms of "wait-free" code, you can do even more useful things. Wait-free code is written using something called "CAS2", or "double compare and swap". This compares two values simultaneously, and if both values match, swaps both to new values. This is often done to swap two pointers (for two new pointers), or a pointer plus an integer. On 32-bit x86 this is the 'lock cmpxchg8b' instruction for swapping 8 bytes; on 64-bit x86 this is the 'lock cmpxchg16b' instruction, for swapping 16 bytes.
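For instance, on 64-bit x86, GCC emits "lock cmpxchg16b" for its 16-byte compare-and-swap builtin (build with -mcx16). A sketch of swapping a pointer-plus-counter pair (the struct layout is hypothetical):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* a pointer plus a version counter, 16 contiguous bytes */
    typedef struct {
        void     *ptr;
        uint64_t  version;
    } tagged_ptr;

    static volatile __int128 top;   /* 16-byte aligned by the ABI */

    bool swap_top(tagged_ptr expected, tagged_ptr desired)
    {
        __int128 old_raw, new_raw;
        memcpy(&old_raw, &expected, sizeof old_raw);
        memcpy(&new_raw, &desired, sizeof new_raw);
        /* compiles to lock cmpxchg16b: both 8-byte halves must match */
        return __sync_bool_compare_and_swap(&top, old_raw, new_raw);
    }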

Because TSX operates on a cache line, which is 64 bytes wide, this effectively becomes a "lock cmpxchg64b" instruction, or CAS8. This makes wait-free code dramatically more efficient.

Conclusion


So the upshot is this. The problem is that software can only take advantage of a small number of processor cores before synchronization becomes too expensive. The new "transaction" features reduce the cost of synchronization, and hence allow software to take advantage of more processing cores. There are two high-speed synchronization techniques: spinlocks, which use a single atomic operation and wait, and wait-free code, which uses multiple atomics. TSX makes both cheaper. In particular, it extends the old instruction that allowed 16 byte transactions (cmpxchg16b) to something that allows 64 byte transactions (think of it as "cmpxchg64b", but more efficient).


From the author of the Ars Technica piece:


Peter Bright: Honestly, I read your blog post, and I don't think you've read Intel's docs.


No, I have read the Intel docs (the spec and two Intel blog posts). The difference is that the Ars Technica post just repeats them, going bottom-up, starting with the details. I'm trying to go top-down: starting with how people code today, and how the transactional features change that. For example, if you've ever used the existing transactional feature "cmpxchg16b", you realize how much nicer life will be with these changes.

3 comments:

Unknown said...

I've written plenty of code using cmpxchg. TSX interests me precisely because it isn't cmpxchg. It's closer to LL/SC, but even then, it's different (and better).

First, why isn't it cmpxchg? cmpxchg only detects _changes_. If the value written is the same as the previous value, cmpxchg doesn't notice (hence the ABA problem). TSX is more like LL/SC, insofar as it detects _writes_, even if they don't change the _value_.

cmpxchg16 is not DCAS, since the values must be contiguous in memory. cmpxchg16 compares a single 16-byte value and optionally replaces it with another 16-byte value. DCAS would compare two independent 8-byte values and replace them with two separate 8-byte values.

DCAS can implement cmpxchg16 trivially. CAS can implement DCAS non-trivially (algorithms exist to turn single CAS into arbitrary N-comparison, N-update CAS).

More interestingly, the documentation implies it detects _reads_ too. It isn't very clear about what it does with the read- and write- sets, but the implication is that it will properly handle transactions that contain write-after-read dependencies. If so, this makes the transactional protection very strong indeed.

I don't see where you're getting the 64 byte cache-line restriction from. The strong implication is that it can build a read- and write-set from operations scattered throughout memory.

Anyway. HLE is indeed for eliding locks; to make taking the spinlock a no-op when there are no write conflicts.

But RTM is not. RTM is proper transactional memory, and can be used for a whole lot more: specifically, it can be used for things that aren't structured as locked programs.

Robert Graham said...

It isn't very clear about what it does with the read- and write- sets

I thought it was pretty clear. After you do an XACQUIRE, every time you read from memory, it adds it to the readset. Every write adds it to your writeset. If anything else in the system writes to your read/write sets, the transaction is aborted.

In other words, in the old days, you checked at the START of your transaction whether a conflict was possible, but with TSX, it checks at the END of a transaction to see whether a conflict happened.
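In code, the pattern inverts like this (a sketch, again assuming the _xbegin/_xend intrinsic names from Intel's spec, with an ordinary spinlock as the fallback):

    #include <immintrin.h>

    extern void do_the_work(void);                   /* the critical section */
    extern void spin_lock(void), spin_unlock(void);  /* conventional fallback */

    void run_transaction(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            do_the_work();   /* reads/writes silently tracked in the sets */
            _xend();         /* the END: commits only if no conflict happened */
        } else {
            /* aborted: every write was rolled back, so redo
               the work under an ordinary lock instead */
            spin_lock();
            do_the_work();
            spin_unlock();
        }
    }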

I don't see where you're getting the 64 byte cache-line restriction from

Because things are tracked per-cacheline. Anything that hits the entire cacheline aborts the transaction.

That implies a lot of low-level optimization. It's like in the old days when you had to align 128-bit fields for SSE. You can't think solely in high-level constructs; you have to worry about low-level details.

You are going to have to stick your data in one or two cachelines, because if you have to do something with 10 cachelines lasting hundreds of instructions, almost certainly it's going to end up aborting your transaction, and you've gained nothing over doing a standard spinlock.
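For instance (hypothetical layout), in GCC you'd pin the protected data to a single line like this:

    /* keep the whole protected structure inside one 64-byte cache line,
       padded out so unrelated variables don't share the line */
    struct shared_state {
        long head;
        long tail;
        long count;
    } __attribute__((aligned(64)));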


HLE is indeed for eliding locks; to make taking the spinlock a no-op when there are no write conflicts

That's the biggest question I have. I measured the "lock cmpxchg" as having a 30 clock penalty, roughly equal to the L3 cache latency. This gets removed by HLE.

But, at the end of the transaction, when it commits the changes (by flushing the L1 cache to L3 cache), it'll still take a 30 clock latency.

Since that is (or should be) the dominating performance factor, I don't think it buys you much when there is no contention.

What it does buy you is when you are doing multiple atomic operations with wait-free code: replacing multiple operations, each with its own 30 clock latency, with a single atomic operation (the commit) at the end.

Also, it buys you something under heavy contention, such as in the hash table example Intel shows: much of the time, neither thread has to wait. Neither thread spins, and both threads just incur the 30 clock hit with the commit.

At least, that's my guess.

Unknown said...

I thought it was pretty clear. After you do an XACQUIRE, every time you read from memory, it adds it to the readset. Every write adds it to your writeset. If anything else in the system writes to your read/write sets, the transaction is aborted.

I was wondering if it handled any of the more subtle non-conflicting cases. I'm presuming not. But some conflicts aren't really conflicts; inc/inc doesn't conflict (as long as there's no readback), for example. Nor does inc/dec (again, as long as there's no corresponding read).

Because things are tracked per-cacheline. Anything that hits the entire cacheline aborts the transaction.

It says that, but then section 8.3.8.2 (Runtime Considerations) muddies the water, at least in my view. It says that conflicting accesses to a cacheline "may" prevent the transaction from committing. It also rather implies that not all writes to a cache line necessarily cause conflicts. This is weaker than the statements about the read and write sets, which is why I'm a bit unclear about it.

Anyway. We already get that with pretty much any LL/SC implementation (they'll all spuriously fail), so we already make sure that unrelated values are padded to full cache lines to avoid sharing and spurious conflicts, don't we?

You are going to have to stick your data in one or two cachelines, because if you have to do something with 10 cachelines lasting hundreds of instructions, almost certainly it's going to end up aborting your transaction, and you've gained nothing over doing a standard spinlock.

While Intel is not disclosing the limits, it says that "an implementation of RTM and HLE will typically provide sufficient resources for executing common transactional regions". To me, that implies more than a couple of cache lines--especially as we want to make sure that unrelated variables are on separate cache lines, to avoid spurious conflicts.