Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped out, since the page fault needed to bring it back in can't be serviced. It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in the kernel, you probably don't want this.
As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: an uncontended critical section, or a contended critical section that has a race. It also prevents you from doing effects outside of memory (it's guaranteed to revert to locks in that case, so you pay all of the overhead of HTM plus all of the overhead of locks).
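To make the "pay both" point concrete, here's a minimal sketch of the usual lock-elision pattern with Intel's RTM intrinsics (illustrative names, not anyone's production code; needs TSX hardware and -mrtm to build):

    #include <immintrin.h>   // _xbegin/_xend/_xabort (Intel RTM)
    #include <atomic>

    std::atomic<bool> fallback_lock{false};   // illustrative fallback lock

    template <typename F>
    void with_elision(F&& critical_section) {
        if (_xbegin() == _XBEGIN_STARTED) {   // try the hardware transaction
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0xff);                // someone already fell back; abort
            critical_section();
            _xend();                          // commit
            return;
        }
        // Abort path: any syscall, I/O, or capacity overflow lands here, so
        // you pay for the failed transaction *and* then for the lock.
        while (fallback_lock.exchange(true, std::memory_order_acquire))
            ;                                 // spin (simplified; real code backs off)
        critical_section();
        fallback_lock.store(false, std::memory_order_release);
    }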
Interesting about the difficulty of locking two cache lines. To me the important thing to keep in mind when talking about concurrency alternatives to the status quo is: nobody has proved that contended-but-not-racy critical sections are common. According to my data, they are unicorns. This means that from a performance standpoint, any proposed concurrency protocol that is different from locking must justify how it plans to beat locks on the two things that they are super good at, which are uncontended critical sections (with locks you pay one or two CASes on cache lines you probably already own) and contended-and-racy critical sections (with locks you pay for some CASes, but the lock will make threads contend as quietly as possible to allow the lock owner to proceed quickly). If a proposal is worse than locks on the two most common kinds of critical sections, then that's silly. AFAICT, all of these transactions-or-better-CAS approaches would only help in the case of contended-but-not-racy critical sections, which are too rare to matter. The two cache line thing definitely sounds like it will be more expensive than locking in the uncontended case, and I don't see how it will help in the contended-and-racy case.
EDIT: I said "contended-but-racy" when I meant "contended-but-not-racy" in one place, above. Fixed.
> Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped out.
Presumably the semantics of the "lock these two cachelines" instruction trap when one of the cachelines you're trying to lock isn't present, rather than when you perform an operation on that cacheline subsequent to the locking.
> It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in the kernel, you probably don't want this.
I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There are lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.
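To illustrate the doubly-linked-list case: with a CAS-2 that atomically updates two independent words, unlinking a node becomes a single atomic step. No mainstream ISA actually provides dcas, so this sketch models it with a mutex purely to pin down the semantics:

    #include <atomic>
    #include <mutex>

    struct Node { std::atomic<Node*> prev{nullptr}, next{nullptr}; };

    // Hypothetical CAS-2, modeled with a global mutex just to define the
    // semantics; real hardware would perform this as one atomic operation.
    static std::mutex dcas_model_lock;
    bool dcas(std::atomic<Node*>& a, Node* ea, Node* da,
              std::atomic<Node*>& b, Node* eb, Node* db) {
        std::lock_guard<std::mutex> g(dcas_model_lock);
        if (a.load() != ea || b.load() != eb) return false;
        a.store(da);
        b.store(db);
        return true;
    }

    // Both neighbours' pointers move together, so no other thread ever
    // observes a half-removed node.
    bool unlink(Node* n) {
        Node* p = n->prev.load();
        Node* s = n->next.load();
        return dcas(p->next, /*expected*/ n, /*desired*/ s,
                    s->prev, /*expected*/ n, /*desired*/ p);
    }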
To be clear, I don't think this instruction is a good idea; I just don't think it's obviously bad.
> As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: an uncontended critical section, or a contended critical section that has a race. It also prevents you from doing effects outside of memory (it's guaranteed to revert to locks in that case, so you pay all of the overhead of HTM plus all of the overhead of locks).
HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.
I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging? Certainly that will never be cheap. I think most of the desire for lock-freedom comes less from fast-path cycle reduction than from protection against the scheduler or some very slow process. There are also plenty of situations in which data structures are touched mostly by one thread but periodically need contention management. The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one, even when the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write buffer).
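For anyone who wants to reproduce that kind of number, here's roughly the measurement (illustrative only; the ratio varies a lot by microarchitecture and compiler flags):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr long N = 100'000'000;
        using clk = std::chrono::steady_clock;

        // Always-succeeding atomic CAS (lock cmpxchg on x86).
        std::atomic<long> a{0};
        auto t0 = clk::now();
        for (long i = 0; i < N; i++) {
            long expected = i;
            a.compare_exchange_strong(expected, i + 1);
        }
        auto t1 = clk::now();

        // The same compare-then-store done non-atomically; volatile keeps
        // the compiler from deleting the loop.
        volatile long v = 0;
        auto t2 = clk::now();
        for (long i = 0; i < N; i++) {
            if (v == i) v = i + 1;
        }
        auto t3 = clk::now();

        using ms = std::chrono::milliseconds;
        std::printf("atomic CAS:     %lld ms\nnon-atomic CAS: %lld ms\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t3 - t2).count());
    }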
> Presumably the semantics of the "lock these two cachelines" instruction trap when one of them is paged out, rather than deferring the trap until you write to one of the cachelines you've locked.
Right, I didn't think of that.
> I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.
I was talking about disabling interrupts and not lock-freedom in general or CAS-2. I agree that CAS-2 would make those things simpler if it was also fast enough. Sorry about the confusing context switch.
> HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.
Can you define what "some domains" is? I think it's important to be specific.
> I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging?
"Contended-and-racy" means that if you took the lock away, you'd have a data race. It means that if you used a transaction then the race detection that aborts a transaction would conclude that there had been a race and abort. Contended-and-racy is exactly the case where TM is not a speed-up, pretty much by definition.
If contended-but-not-racy critical sections were common, then it would make sense to pay some base cost for executing in a TM critical section since it would increase concurrency. But most contended critical sections are also racy. TM won't give you a speed-up in a racy section, so you end up paying all of the cost without getting any of the benefit.
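To make the distinction concrete, a tiny made-up example (types and names are just for this comment):

    #include <mutex>

    struct Stats {
        std::mutex m;
        long hits = 0;
        long misses = 0;
    };

    // Contended-AND-racy: two threads calling record_hit write the same
    // word. Take the lock away and it's a data race; run it in a
    // transaction and the conflict detector aborts one side. TM buys
    // nothing here -- the serialization is inherent.
    void record_hit(Stats& s)  { std::lock_guard<std::mutex> g(s.m); s.hits++; }

    // Contended-but-NOT-racy: thread A only touches hits, thread B only
    // touches misses, but the coarse lock serializes them anyway. This is
    // the one case where TM (or finer locks) could add concurrency -- and
    // note that HTM detects conflicts at cache-line granularity, so even
    // this aborts unless the two counters sit on different lines.
    void record_miss(Stats& s) { std::lock_guard<std::mutex> g(s.m); s.misses++; }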
> I think most of the desire for lock-freedom comes less from fast-path cycle reduction than from protection against the scheduler or some very slow process.
Yeah, and also deadlock avoidance. And that lock-free algorithms are often faster in the uncontended and contended-and-racy cases. I'm a big fan of lock-free algorithms based on conventional word CAS.
But that's different than TM or 2-cache-line-CAS. Lock-free algorithms based on conventional word CAS tend to be faster than locks in the uncontended case. TM, and probably 2-CAS, is slower than locks. It's easy to justify handling uncommon scheduling pathologies, or using exotic approaches for avoiding deadlock, if it also gives you a speed-up. Not so much if it's slow like TM.
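The classic example of the kind of word-CAS algorithm I mean is the Treiber stack push: the uncontended cost is a single CAS on a line you probably already own, with no separate lock word to touch (sketch; pop needs ABA protection such as tags or hazard pointers, omitted here):

    #include <atomic>

    template <typename T>
    struct Stack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head{nullptr};

        void push(T v) {
            Node* n = new Node{v, head.load(std::memory_order_relaxed)};
            while (!head.compare_exchange_weak(
                       n->next, n,
                       std::memory_order_release,
                       std::memory_order_relaxed)) {
                // On failure, compare_exchange_weak reloads head into
                // n->next; under contention we just retry -- no thread is
                // ever blocked.
            }
        }
    };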
> The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one, even when the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write buffer).
It's true that the uncontended cost of locks is dominated by the uncontended cost of CAS. But this cost is much smaller than the uncontended cost of HTM, and I suspect that the uncontended cost of CAS is also lower than the uncontended cost of 2-cache-line-CAS. If the uncontended cost of 2-cache-line-CAS is more than 2x the uncontended cost of CAS, then probably using a lock is better.
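For reference, the lock fast path being compared against is tiny. A minimal spinlock (simplified; real locks add backoff and parking) makes the arithmetic visible: an uncontended acquire/release is roughly one CAS plus one store, which is the bar a 2-cache-line-CAS would have to clear.

    #include <atomic>

    struct SpinLock {
        std::atomic<bool> held{false};

        void lock() {
            bool expected = false;
            // Uncontended case: this CAS succeeds on the first try.
            while (!held.compare_exchange_weak(expected, true,
                                               std::memory_order_acquire)) {
                expected = false;   // reset after a failed CAS and retry
            }
        }
        void unlock() { held.store(false, std::memory_order_release); }
    };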
Transactions compose and TM is potentially easier to use.
Also some algorithms necessarily require locking multiple mutexes (think for example of a wait-for-any operation). HTM in principle would allow implementing them more efficiently.
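For example, taking two locks together today requires a deadlock-avoidance protocol, which C++ packages as std::scoped_lock; HTM could in principle just speculate across both structures instead. A sketch with made-up names:

    #include <mutex>

    std::mutex m1, m2;   // guards for two independent data structures

    void touch_both() {
        // All-or-nothing acquisition via an ordering/backoff algorithm
        // under the hood (std::lock), avoiding lock-order deadlock.
        std::scoped_lock lk(m1, m2);
        // ... manipulate both data structures atomically ...
    }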
Transactions don't compose with I/O or any OS side-effects that the TM system isn't in the loop on, while locks compose fine with those things. One module that uses locks is free to call another module that uses locks so long as the relationship is one-way like user->kernel, so locks compose also, just with different constraints. In short, transactions don't compose with I/O while locks don't compose with cyclic relationships. So, the statement that "transactions compose" is false in both directions: it's false because transactions don't compose in general, and it's false because it implies that locks don't, while they in fact do.
The only empirical evidence about transactions being easier to use is actually contrary: even with the availability of HTM, it's not used nearly as widely as locks, because if you enable HTM for every lock then everything runs a lot slower (or doesn't run at all).
Under the proposed C++ TM, synchronized blocks can do I/O. They are not transactional, but they do compose.
If you have two locked data structures and you want an operation to manipulate both atomically, you need to expose the locks. And if you want the operation itself to compose with others... well, it gets hairy.
TM is widely used, I believe, by at least Haskell and Clojure programmers, even in non-hardware-accelerated versions, because of the ease of use.
HTM is another story; its availability is still relatively restricted. Time will tell.
Clojure is indeed TM-based, but that's not an A/B test of ease-of-use. I don't think there exists evidence to suggest that transactions are easier to use.
I think you are confused about the meaning of composability. The synchronized/atomic TM proposal for C++ is roughly single-global-lock. This means that you can still deadlock if you use synchronized to protect IPC with another process, which then does an IPC call back to you, which you service on another thread that also uses synchronized. Whenever these transactional proposals fall back on locking, they immediately bring in the pitfalls of locking.
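The scenario deadlocks even in a toy model: a single global lock standing in for synchronized, a blocking future standing in for the IPC round-trip (all names made up; running this hangs by design, which is the pitfall):

    #include <mutex>
    #include <future>

    std::mutex global_tm_lock;   // single-global-lock fallback of the TM proposal

    void service_peer_callback() {
        // The callback is serviced on another thread that also uses
        // synchronized -- but the lock is already held below.
        std::lock_guard<std::mutex> g(global_tm_lock);   // never acquired
    }

    void do_ipc_call() {
        std::lock_guard<std::mutex> g(global_tm_lock);   // synchronized { ipc(); }
        // The peer process calls back into us before replying:
        auto cb = std::async(std::launch::async, service_peer_callback);
        cb.wait();   // deadlock: we hold the lock the callback thread needs
    }

    int main() { do_ipc_call(); }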
The C++ TM proposal is a joke. There's no good reason to use concurrency if you won't get a speed-up, and the current design of synchronized is sure to make it more difficult to get speed-ups.