Errata prompt Intel to disable TSX in Haswell, early Broadwell CPUs (techreport.com)
206 points by geoffgasior on Aug 12, 2014 | 83 comments


For those wondering, the TSX instruction set is http://en.wikipedia.org/wiki/Transactional_Synchronization_E...

I first heard about transactional memory when Sun had plans to implement it for its UltraSPARC Rock processor. There is a decent overview of the concept at http://en.wikipedia.org/wiki/Transactional_memory


There has been a CPU with transactional memory running big business servers: the Vega boxes from Azul (~800 cores) running their JVMs. I know of two good internet discussions talking about their real-world experience using them:

Gil Tene https://groups.google.com/forum/#!searchin/mechanical-sympat...

Cliff Click http://www.azulsystems.com/blog/cliff/2009-02-25-and-now-som...

It can be enlightening to hear about a real application of these technologies. I am sad to see it disabled; I had it in the back of my head to play around with it some time.


PowerPC A2 (Blue Gene/Q) and POWER8 support hardware transactional memory as well.


It's such a shame. The Intel one is our best chance to get HTM under my desk. Alas :)


> Some of the savviest software developers are likely building TSX-enabled software right about now.

Really a non-statement statement. The only people using these CPU instructions directly in their code will be low-level maintainers, i.e. kernel devs and other OS devs. So this doesn't really affect anyone except those folks (and, of course, the benefits it may have provided up the stack via layers of abstraction).


Yours also sounds like a "non-statement statement" e.g.:

https://www.mikeperham.com/2013/12/31/rubys-gil-and-transact...


This is something implemented in the ruby language/platform -- not something a programmer who is using ruby for a project would worry about/implement.


The finance sector produces a lot of low level code driving business applications directly. But, broadly you are right I think.


As a systems guy who was initially very skeptical of transactional memory, I've been very excited about TSX for a while. Unfortunately, it seems that the initial implementation in the Haswell processors wasn't very efficient[1]. I'm looking forward to future hardware implementations by Intel.

1. http://natsys-lab.blogspot.ru/2013/11/studying-intel-tsx-per...


As someone who has used TSX to optimize synchronization primitives: you can expect to see a ~15-20% performance increase if (big if) your program is heavy on disjoint data access, i.e. a lock is needed for correctness, but conflicts are rare in practice. If you have a lot of threads frequently writing the same cache lines, you are probably going to see worse performance with TSX than with traditional locking. It helps to think about TSX as transparently performing optimistic concurrency control, which is actually pretty much how it is implemented under the hood.

While the API of TSX (_xbegin(), _xabort(), _xend()) gives the appearance of implementing transactional memory, it is really an optimized fast path - the abort rate will determine performance. The technical term for what TSX actually implements is 'Hardware Lock Elision'.
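To make that concrete, writing your own primitive with the RTM intrinsics boils down to the pattern below: try the elided path, fall back to the real lock when the transaction aborts. This is only a minimal sketch (compile with -mrtm; the spinlock and names are illustrative), not how the glibc or TBB implementations mentioned below are actually written.

  #include <immintrin.h>
  #include <atomic>

  std::atomic<bool> locked{false};   // illustrative fallback spinlock

  void spin_lock()   { while (locked.exchange(true, std::memory_order_acquire)) _mm_pause(); }
  void spin_unlock() { locked.store(false, std::memory_order_release); }

  template <typename F>
  void with_elided_lock(F critical_section) {
      unsigned status = _xbegin();
      if (status == _XBEGIN_STARTED) {
          // Read the lock word so a real lock holder aborts us (it joins our read set).
          if (locked.load(std::memory_order_relaxed)) _xabort(0xff);
          critical_section();
          _xend();                  // commit: all writes become visible atomically
          return;
      }
      // Fallback path: the abort rate here determines overall performance.
      spin_lock();
      critical_section();
      spin_unlock();
  }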

If you are going to use TSX, don't use it directly unless you are writing your own primitives (e.g. transactional condition variables [0]); prefer Andi Kleen's TSX-enabled glibc [1] or the speculative_spin_mutex [2] in TBB.

[0] - http://transact09.cs.washington.edu/9_paper.pdf

[1] - https://github.com/andikleen/glibc

[2] - http://www.threadingbuildingblocks.org/docs/doxygen/a00248.h...


Thanks for the links. I'm aware of Andi's work and enjoy reading his posts on TSX. You're absolutely right about HLE vs HTM.

Mostly I wish there was a better fallback than locking when transactions fail, but I suppose this is more in line with thinking of them as optimistic locks (such as what I assume the speculative spin lock does).
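For what it's worth, the speculative_spin_mutex mentioned upthread is a drop-in TBB mutex: it attempts elision internally and falls back to plain spinning on abort. A minimal usage sketch (TBB 4.2+; the counter is just an example):

  #include <tbb/spin_mutex.h>

  tbb::speculative_spin_mutex mtx;
  long counter = 0;

  void increment() {
      tbb::speculative_spin_mutex::scoped_lock lock(mtx);  // may be elided on TSX hardware
      ++counter;
  }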


Do you have any insight into how it compares to lock-free techniques?

A full-blown (say) lock-free hash table is an almost insurmountable undertaking - I recall seeing an efficient reusable implementation a couple of years ago, but most before that had serious performance or correctness issues. However, if you take this into account early enough in the design of the system, you can usually make do with much simpler lock-free data structures (seqlock and friends).

Transactional memory is, of course, a useful abstraction when you can't (or didn't) take this into account upfront.
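To illustrate the "simpler lock-free structures" point, here is roughly what a single-writer seqlock looks like; the two payload fields are just an example, and the memory orderings follow the usual published recipes rather than any particular production implementation:

  #include <atomic>
  #include <cstdint>
  #include <utility>

  class Seqlock {
      std::atomic<uint32_t> seq{0};
      std::atomic<uint64_t> a{0}, b{0};              // payload, kept atomic to avoid data races
  public:
      void write(uint64_t na, uint64_t nb) {         // single writer only
          seq.fetch_add(1, std::memory_order_relaxed);        // odd: write in progress
          std::atomic_thread_fence(std::memory_order_release);
          a.store(na, std::memory_order_relaxed);
          b.store(nb, std::memory_order_relaxed);
          seq.fetch_add(1, std::memory_order_release);        // even: write complete
      }
      std::pair<uint64_t, uint64_t> read() const {
          for (;;) {
              uint32_t s0 = seq.load(std::memory_order_acquire);
              uint64_t ra = a.load(std::memory_order_relaxed);
              uint64_t rb = b.load(std::memory_order_relaxed);
              std::atomic_thread_fence(std::memory_order_acquire);
              uint32_t s1 = seq.load(std::memory_order_relaxed);
              if (s0 == s1 && !(s0 & 1)) return {ra, rb};     // consistent snapshot
          }
      }
  };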


Lock-freedom is all about progress, not performance, i.e. generally the only reason to use a lock-free algorithm is if you have issues with ill-behaved code stalling other processes. (I can think of several simple ways to make a fast concurrent hash table, none of which provide lock-freedom, yet which avoid taking locks except when necessary.)

Transactional memory is an access abstraction that presents memory as something over which transactions can be made. It says nothing about whether the implementation is lock-free (though the better implementations approach, or are, lock-free). There exist software TM algorithms which lock during the transaction, which lock only briefly at the end of the transaction, and which are lock-free, and some which are a combination of these.

TSX (specifically HLE) only takes the lock if the lock-eliding transaction fails, in order to guarantee forward progress in the absence of an ill-behaved transaction (assuming fair locks). Were the software fallback implemented using lock-free techniques (possible with the more general RTM portion of TSX), this guarantee would extend to ill-behaved transactions as well.


> Lock-freedom is all about progress, not performance. i.e. generally the only reason to use a lock-free algorithm is if you have issues with ill-behaved code stalling other processes.

Lock freedom is about progress, but in many practical cases (short time between load-linked/store-conditional pair, for example) it beats locks and everything else in performance -- basically, in the best case you do the same work as locks (i.e. synchronize caches among CPUs), and in the worst case, you spin instead of suspending the process.

Of course, if you do a lot of work between the load and store (or whatever pair of operations need to surround your concurrent operation), and there's a good chance of contention, then ... yes, it will not perform well. But that's not how you should use it (or how it is used most of the time).
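For concreteness, the kind of short paired-operation loop being described (on x86 the LL/SC pair is played by a plain load plus a compare-and-swap) looks roughly like this; the update function is arbitrary and the names are just illustrative:

  #include <atomic>

  std::atomic<long> value{0};

  // Lock-free fetch-and-apply: spins on contention instead of suspending the thread.
  template <typename F>
  long update(F f) {
      long old = value.load(std::memory_order_relaxed);
      while (!value.compare_exchange_weak(old, f(old),
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
          // The failed CAS reloaded `old`; the next iteration recomputes f(old).
      }
      return old;
  }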

And.. transactional memory is an abstraction that leaks differently than locks, but you must still be aware of the failure modes. It's higher level, but does not magically resolve contention.


And it was the only widespread TM implementation in metal.

So N years more till we see software transactional memory with dedicated hardware support.


Strong LL/SC provides the ability to do real software transactional memory. I believe ARM has the strongest LL/SC implementation, with the architecture supporting transaction blocks up to 2KB in size! Although ARM has specified LL/SC such that vendors can choose to only support blocks much smaller than 2KB.

Note that people have totally confused the terms. Originally "software transactional memory" meant using primitives like [strong] LL/SC to get lock-free, wait-free algorithms. Strong LL/SC has been proven to be a universal primitive which can be used to implement most known lock-free, wait-free algorithms.

"Hardware transactional memory" gave you access to lock-free, wait-free algorithms in a much more convenient manner. But really it's a difference of degree because strong LL/SC requires _significant_ hardware support.

These days people (like the PyPy project and most STM libraries) use "software transactional memory" for implementations which just use a mutex under the hood. It's emulated STM; they're just being fanciful.

I believe TSX is more like strong LL/SC. Maybe stronger. OTOH I'm not sure if it's universal enough to implement wait-free algorithms.


More like N more months. They say they've already fixed it.


The techreport.com article says:

  The obvious initial targets for TSX optimization are server-class applications like transactional database servers.
Looking at your wiki links though, I would think this would be awesome for almost any high-performance multi-threaded program. Am I wrong about that?


In theory, yes, though thinking in terms of transactions is subtly different from thinking in terms of threads and critical sections. In particular, they differ in how they handle contention: locks block while transactions abort/fail. Simply retrying a transaction is also not a good idea, which is why it is currently suggested to fall back on locks [1]. So I assume the suggestion of using it for DBs is because it would be more straightforward, as DBs already deal in transactions.

1. https://software.intel.com/en-us/blogs/2013/06/23/tsx-fallba...


It won't make any difference if the load is embarrassingly parallel, like rendering one frame of video per core. In those cases the amount of independent work is huge compared to time spent dealing with locks.


The idea with HLE at least, was that anywhere you had a highly contended lock that protected not-highly-write-contended data, it was a drop-in performance boost. Really sucks that they had to pull the plug on it.
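The "drop-in" part is literal: with the XACQUIRE/XRELEASE hints the very same spinlock code elides the lock on TSX hardware and degrades to an ordinary spinlock everywhere else, since the prefixes are ignored by older CPUs. A sketch using GCC's HLE flags (GCC 4.8+, -mhle; the lock word is illustrative):

  #include <immintrin.h>   // _mm_pause

  static int lockvar = 0;

  static void hle_lock() {
      // XACQUIRE-prefixed exchange: starts an elided transaction where supported.
      while (__atomic_exchange_n(&lockvar, 1,
                                 __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
          _mm_pause();     // also aborts a failed elision attempt while spinning
  }

  static void hle_unlock() {
      // XRELEASE-prefixed store: commits the elided transaction.
      __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
  }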


>Software developers who wish to continue working with TSX will have to avoid updating their systems to newer firmware revisions—and in doing so, they'll retain the risk of TSX-related memory corruption or crashes.

Say I bought a TSX enabled CPU specifically for that feature, I wonder if Intel will give me my money back... (they can have their broken CPU of course too)


Theoretically, yes - I almost returned my laptop with an i7-2630QM CPU, which, as indicated on Intel's website, should support AES instructions. But that laptop's specific BIOS had them disabled, and there was no way to re-enable them. So I contacted the manufacturer, and after a few emails they agreed that yes, I could return the product as "not as described" - but they also said a new BIOS was in the works for that laptop and that if I was patient it would be made available soon. And it was: AES instructions were enabled by a patch issued two months later.


I fell victim to the 2630QM AES bit being disabled back in 2011 too. Though ASUS emailed me a new BIOS image within 12 hours of reporting the issue.


Why would anyone even disable that ever o.O


Export compliance.


Yes, because the government treats encryption as a "munition".


They haven't done that in over a decade...


A long time ago I bought a ThinkPad laptop with a preinstalled Linux distribution that didn't fully support the installed hardware. I should have contacted the seller about it too.


The real question is why this depends on the BIOS at all.


What do you mean? The BIOS is a configuration tool - its entire purpose is to let you configure the CPU. It's not that the instructions depended on the BIOS, but that they couldn't be enabled with it. If a switch is glued in the off position, you can't turn it on until it's unglued.


In the case of AES, I guess US crypto export regulations running amok...


But all BIOSes and motherboards are made in China.


If the guys on the label are American, then US regulations apply too.

Even if not, the US sticks its nose anywhere, even where it doesn't belong...


Presumably so OEMs can sell higher clocked systems without other new features.


I wondered similarly, but I wondered on the scale of companies that buy hundreds of thousands of these chips. Does Intel give a discount?


[deleted]


That was unwarranted.


Slightly warranted. Intel likes to have a good reason when you return their CPUs.

With the old Pentium FDIV bug, you had to submit some form of proof that errors in floating-point calculations would actually cause issues in your use of the computer (since they wouldn't cause issues for most businesses).


Sorry, he sounded like a concern troll, spreading doubt without any proof.


Unsurprising. I don't think the details of TSX have been revealed, but the implementation potentially has complex interactions with the cache subsystem: http://www.realworldtech.com/haswell-tm.

Given that TSX is one of the features that distinguishes some of the more expensive Haswell SKUs, is Intel going to issue a refund to affected customers?


I've been tempted recently by the Devil's Canyon repackaging of Haswell that for the first time has some of the workstation/server features (VT-d, TSX, but no ECC) enabled on an overclockable model. Losing TSX definitely cuts down on that temptation a bit, but they've still got the combination of full virtualization capabilities and much higher single-threaded performance than anything else out there.


Four cores is really not that attractive... also, a lack of ECC with the amount of RAM systems have today is starting to seem irresponsible. The probability of getting bitflips is macroscopic.


I understand that the 4790K isn't really a traditional workstation chip, but it's definitely the right thing for my workload, which involves a simulation that's only partially multithreaded and would benefit a lot more from higher single-core speed than from more than 4 cores. And ECC would be nice, but in my opinion isn't a hard requirement yet for all workstation usage the way it is for servers; I can certainly get by without it, since the base failure rate of the software I'm using is much larger than that caused by hardware memory corruption.


Yeah, this is exactly why high-performance chips have a microarchitecture that lags client chips. Such bugs would devastate the value of those chips. This will affect the Haswell Xeon; I bet some serious discussions about delaying that chip are going on right now.


Here's the current list of errata from June: http://www.intel.com/content/dam/www/public/us/en/documents/...

I see two about TSX:

  HSD87 X No Fix          Intel® TSX Instructions May Cause Unpredictable System behavior
  Problem: Under certain system conditions, Intel TSX (Transactional Synchronization Extensions) instructions may result in unpredictable system behavior.
  Implication: Due to this erratum, use of Intel TSX may result in unpredictable behavior.
  Workaround: It is possible for the BIOS to contain a workaround for this erratum.
  Status: For the steppings affected, see the Summary Table of Changes

  HSD114 X No Fix         Intel® TSX Instructions May Cause Unpredictable System behavior
  Problem: Under a complex set of internal timing conditions and system events, software using the Intel TSX (Transactional Synchronization Extensions) instructions may observe unpredictable system behavior.
  Implication: This erratum may result in unpredictable system behavior. Intel has not observed this erratum with any commercially available system.
  Workaround: It is possible for the BIOS to contain a workaround for this erratum.
  Status: For the steppings affected, see the Summary Table of Changes
HSD114 above seems to be the bug from the techreport article.


So.. will everyone simply accept the microcode update that disables those instructions? Or will there be some kind of compensation for newer laptops etc? Wonder how Apple will deal with this, for example.

Or is this such a non-issue that nobody cares?


Yes, everyone will accept the microcode update. TSX is still pretty new and I doubt a lot of code uses it. Intel tends to phase features like this in over a few generations of CPUs for this reason.

The only real impact is that code that relied on TSX would need to rely on fallback methods of accomplishing the same tasks. Since there's probably not much of that floating around, there's very little impact at this time.


How do CPU vendors issue a microcode update such that no hacker could make their own malicious update? It's obviously a cryptographically signed update, and the CPU checks that the update is signed by the expected key, but does that mean every CPU is hardcoded to check for a specific key? What happens if that key is compromised? Is every CPU at risk forever at that point, or can the CPU be updated via software to check for a newly issued key? Also, were is the private key stored? Some vault somewhere? How would a CPU vendor know if the key had been surreptitiously copied, and some rogue group was issuing malicious updates to a specific target's CPU?

Is it even advantageous for an attacker to have the capability of issuing microcode updates to a target computer? What sort of attacks could you mount via microcode updates?


I think they are mostly undocumented, but here is some research I found online: http://inertiawar.com/microcode/

I'm guessing the private RSA keys are extremely well guarded, probably stored in an HSM that only allows signing, not key extraction, so that the keys cannot even be revealed to Intel.

On the other hand, who knows. There have certainly been cases of code-signing keys on the loose (Adobe, etc.) and even a compromised HSM host (Fedora; someone managed to sign compromised OpenSSH .rpms).


Given that Intel processors largely run the servers that the modern world runs on, I would expect that the NSA/CIA has done a full security audit of their microcode processes. The US government is nothing if not thorough about these types of potential security issues.


And in exchange for that audit they gained permission to get their own microcode updates signed, too?


I'm not sure why they'd want it. Unless there's secret non-volatile storage on Intel chips, there's relatively little of value that you can do using a rogue microcode update. This is because Intel chips reset to factory microcode when rebooted.

It's plausible that a rogue microcode update could be used to bypass TPM static-root-of-trust protections, though, and a microcode update could certainly bypass TXT's dynamic root of trust. This might enable a bootable USB stick that would load malicious microcode and then reboot warmly enough to preserve the microcode and then launch the OS with a TXT bypass.

Even so, I don't really see the point. So far, essentially every BIOS can be freely (or freely using an exploit) reflashed from kernel mode, and a new image could contain malicious SMM code, and SMM code can also bypass both static and dynamic roots of trust.


There are plenty of interesting things you could do in processor firmware. Here are three examples:

Tamper with AES and randomness instructions.

Plant a very obscure privilege escalation exploit. Given the prevalence of Java, ActiveX and Google NaCl, escaping sandboxes is a big thing.

Tamper with the MMU to make certain software invisible.

If they had the opportunity, it's not unlikely they did something like this. Perhaps just in a directed attack. They've done extensive firmware patching in the past, if you believe last year's leaks.

A more interesting question is whether they didn't have to, because they had a say in making the silicon in the first place. The risk is there; it's not crazy to see it. Even FreeBSD, which was the last mainstream OS to use the randomness instructions unadulterated, doesn't do that anymore.


Make RDRAND less random, store AESNI key data in a place you can later exfiltrate it from, provide SMM capabilities to the current execution context with some magic opcode (I grant that the latter is easier done through a malicious firmware update - but that's also easier to detect, since x86 code is a known format)...

Modulo the microcode signature, flash is better locked down these days: flash updates are typically arbitrated by the firmware (though that's also just one signature away) and require a reboot, while microcode updates are still a free-for-all (in ring 0 - Realtek already lost a driver-signing key once, why not again?).


Flash updates arbitrated by firmware? In theory, yes, but in most systems they're not. [1][2]

[1] http://www.syscan.org/index.php/download/get/6e597f6067493dd... [2] http://mjg59.dreamwidth.org/30773.html


I have a board normally running UEFI Secure Boot with no flash lock enabled at all right here on my desk - with UEFI replaced by a sane coreboot implementation (which locks down flash and SMM memory and signals unconditionally on boot).

So yes, I'm quite aware of the immense set of faults in UEFI implementations (some of which are encouraged by UEFI's design, where more layers of UEFI are added to mitigate them).

But as an attacker I wouldn't want to assume that I'll run into any single one of the many UEFI implementation quirks, and have to adapt my attack to every one of them.

And I really hope for Intel that Tianocore won't become an endless stream of portable UEFI security issues - otherwise the IBVs might get second thoughts about standardizing on a single codebase.


Putting my tinfoil hat on for a moment, these are the exact types of attacks that China and the US are alleged to have executed. I would trust Intel to lock down security on microcode updates (which is exactly why you can't find the answers to any of the questions you asked), but I would also trust that state actors (i.e. spies backed by a trillion dollar economy) have the resources to break any security that any one company could dream up.

I would guess that the first layer of protection comes from limiting the scope of microcode updates. Perhaps there's a semiconductor engineer out there who knows more?


Well. Microcode updates are lost on reboot, and they can only be applied from privileged code. Which means you'll have to be able to slip rogue code into the BIOS or the kernel, at which point you already have full control. The "negative ring" levels of code (SMI, etc) are quite powerful already.

But a trojanized microcode update file inside an otherwise regular BIOS would be a nice hiding spot, hard to detect and analyze, at least for anyone outside Intel.


Or you could just use SMM or IPMI.


Use those how? Would you go into detail?


See, for example: http://www.eecs.ucf.edu/%7Eczou/research/SMM-Rootkits-Secure...

If you're an OS, then you need some kind of exploit to update SMM code. But if you're the BIOS, then you have complete control over what happens in SMM mode.


No one outside of a few bleeding-edge performance computing programmers will notice. Transactional memory is a cool feature, don't get me wrong, but it requires specialized infrastructure for even programmers to get use out of it, and most end users won't be harmed at all.


I think this is the latest specification update for at least some of the relevant processors:

http://www.intel.com/content/dam/www/public/us/en/documents/...

It was last updated in June, so I guess it doesn't contain this latest erratum. Can't wait until it's updated... though I don't know if Intel is likely to disclose details.



I read through this and found the TSX errata as #136 on a list of 140 issues. I was pretty surprised; some of these issues are major, showstopper crash bugs (admittedly mostly affecting only operating system developers, and many with workarounds).

Just reading through these issues, I see a handful that deal specifically with external (proprietary?) hardware integrations, like integrated Intel graphics. Why is the CPU loaded up with extra jazz for specific integrations like this? Is this to support a system-on-a-chip kind of optional configuration, or is this Intel trying to give themselves some sort of advantage in the graphics adapter market?

If it is the latter, then it serves them right to have these errata and instability due to adding all that extra competitive advantage nonsense on the chip.


> Under a complex set of internal timing conditions and system events, software using the Intel TSX (Transactional Synchronization Extensions) instructions may observe unpredictable system behavior.

Yeah, that's nice and descriptive. Damn. :/


So, what's the bug exactly?


I get the feeling that few people know, and they're not telling. Maybe to prevent exploits from being written against the existing CPUs.


Dang. I bought an i7-4770 rather than an i7-4770K specifically because of TSX.


I also did this, missing out on 100MHz of base clock (3.4GHz vs. 3.5GHz for the K).

TSX seemed like a once in a decade step forward, though as I understand it the restrictions with the cache size (and thus the amount of memory you can write to before the transaction gets too big) meant it wasn't very practical for much beyond optimistic lock acquisition.

For example, PyPy isn't planning on doing a TSX port even with their enthusiasm for transactional memory.

Also influencing my decision, I bought a 2600k a few years back, but never bothered overclocking it, which I guess was an admission that the excitement I found for hardware when I was a child was dead. I guess you either have the money, or the time, but rarely both.

It's disappointing that this microcode update isn't being done in such a way that you can re-enable it after agreeing to a disclaimer that it's not for production use. I'm not sure what the mechanism for this would look like, but given that Intel sold cards that unlocked Hyperthreading, I'm sure it's possible.

http://www.engadget.com/2010/09/18/intel-wants-to-charge-50-...

Edit: The article has been updated saying that it will be possible to enable TSX for development purposes on Haswell-EP at least.


> For example, PyPy isn't planning on doing a TSX port even with their enthusiasm for transactional memory.

Do you have more information about their reasoning behind this? From my point of view this is the highest profile software project to potentially make use of HTM, and I recall reading that the plan was to eventually introduce hardware acceleration.


Here are a few links:

http://pypy.org/tmdonate.html (Search for "haswell")

http://grokbase.com/t/python/pypy-dev/13bvt3kg70/pluggable-h...

It seems to boil down to:

* The cache size (which determines the amount of memory you can write to in a transaction before having to commit back) is insufficient, causing excessive transaction aborts.

* There is no mechanism to bypass the HTM, i.e. to perform writes within a transaction that are not rolled back on abort. This exacerbates the small cache size, since all memory writes have a cost, not just the ones you want rolled back in the case of a transaction abort.

Interestingly, this does not bode well for HTM on a platform with many smaller cores, say a hypothetical 64 core ARM. Each core will have a tiny amount of L1 cache, severely limiting transaction size.

And many smaller cores is exactly where you'd want the benefits of HTM, since the overhead of synchronization is higher in proportion to the work each core can do.
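The write-set limit is at least visible to software: when a transaction touches more cache lines than the hardware can buffer, _xbegin() returns with the capacity bit set and retrying is pointless, so the only sane response is to fall back. A hedged sketch with the RTM intrinsics (compile with -mrtm; the body callback is just for illustration):

  #include <immintrin.h>

  // Returns true if `body` committed transactionally, false if the caller
  // should take its fallback lock instead.
  bool try_transaction(void (*body)()) {
      unsigned status = _xbegin();
      if (status == _XBEGIN_STARTED) {
          body();
          _xend();
          return true;
      }
      if (status & _XABORT_CAPACITY)
          return false;    // write set exceeded the L1-backed buffering: don't retry
      if (status & _XABORT_RETRY)
          return false;    // transient conflict: the caller may retry a few times first
      return false;        // explicit abort / other causes: fall back
  }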


In reference to the sibling post that gives more details about pypy, I'd like to call to your mind the history of the vector extensions for x86.

First revision: MMX. It reused the same registers as the older x87 floating point coprocessor (even though the x87 transistors lived on the same die). As a result, legacy x87 code and MMX code had to transition using an expensive EMMS instruction.

Second revision: (well, ignoring some small changes to MMX) ... SSE. Finally got its own registers, but lacked a lot of real-world capability.

Third revision: SSE2, finally got to a level of parity with competing vector extensions (see, for example, PowerPC's Altivec).

And so forth.

I guess the take-home lesson for me is that these new TSX instructions are indeed fascinating to play around with, but I wouldn't expect it to blow the doors off. Intel will incrementally refine it.

(The incremental approach also gives Intel a chance to study how it's being used and keeps AMD playing catch-up.)


The other big problem with MMX was that it was integer-only. While that might have been OK for some applications, 3D games and other software that could really use the boost needed floating point, and not only couldn't they benefit (since it was integer-only), it actually interfered (since, as you said, it reused the registers).

AMD's 3DNow had single precision floating point support, so it was actually somewhat useful. SSE followed 3DNow and added single precision support (as well as fixing the register stuff). SSE2 added double precision support.


Right, thanks for those additional details.

Today, no one would use MMX instructions (since SSE is vastly superior). I expect Intel will continue to add TSX capabilities which will eventually produce some nice results for parallel code.


The 4790K (aka "Devil's Canyon") supports TSX. It's one of the only "unlocked" CPUs that Intel produces that does.

Yes, really:

http://ark.intel.com/products/80807/Intel-Core-i7-4790K-Proc...

...which is why I specifically bought it. So yes, I too am a little annoyed by this since I bought it specifically to develop TSX applications.


Just curious if you'll forego updating your BIOS so you don't lose the TSX instructions, despite the errata?

Hopefully you can upgrade to a Broadwell or later once Intel starts shipping fixed silicon. Haswells will be updated to the new microcode once a replacement is available. (At least, that's our plan.)


Since I'm having no problems, I'll likely forego updating the BIOS. However, that might not mean much since I think Microsoft distributes Intel microcode updates as part of OS updates. But usually you can remove the specific kb patch if needed.

But yes, I'll be looking forward to the replacement Broadwell. My previous workstation was a Core 2 Duo E8400, which I just replaced with the 4790K.


Running a recent Ubuntu or Debian?

Install the microcode by running: sudo apt-get install intel-microcode

Alternatively install a BIOS update once available.


sudo pacman -S intel-ucode


That's too bad, I was really looking forward to playing around with TSX on some side projects.


Well, they should consider themselves fortunate, since the Haswell Xeon isn't on the market yet.

Are there any scenarios where transactional memory is useful in a consumer environment?


Mostly for developers to start testing their code.



