Their atomic operations used to be extremely costly from a utilization perspective. They would shut down all other threads in a warp while the thread performing the atomic ran alone. Is that still the case?
Fixed as of Maxwell, to the best of my knowledge. But even before that, I found atomics more efficient for reduction operations in global memory than any other method (using fixed-point math wherever a deterministic sum was required, since integer adds give the same result in any order).
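To make the fixed-point trick concrete, here is a minimal host-side C++ sketch rather than actual GPU code. It assumes a `std::atomic<std::int64_t>` accumulator standing in for a 64-bit integer atomic add in global memory, and the names `kScale`, `to_fixed`, `from_fixed`, and `fixed_point_sum` are made up for illustration. The scale of 2^20 is an arbitrary choice; in practice it would have to be picked so the running sum cannot overflow 64 bits.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Assumed scale: 20 fractional bits. Values are rounded to the nearest 1/2^20.
constexpr double kScale = 1048576.0;

// Convert a float to a 64-bit fixed-point integer (round to nearest).
inline std::int64_t to_fixed(float x) {
    return static_cast<std::int64_t>(std::llround(static_cast<double>(x) * kScale));
}

// Convert the fixed-point accumulator back to a floating-point value.
inline double from_fixed(std::int64_t v) {
    return static_cast<double>(v) / kScale;
}

// Accumulate every element with an atomic integer add. The order of the
// elements plays the role of the arbitrary order in which atomic adds
// would arrive on a GPU. Integer addition is associative, so the final
// accumulator value does not depend on that order.
double fixed_point_sum(const std::vector<float>& data) {
    std::atomic<std::int64_t> acc{0};
    for (float x : data) {
        acc.fetch_add(to_fixed(x), std::memory_order_relaxed);
    }
    return from_fixed(acc.load());
}
```

Calling `fixed_point_sum` on a vector and on the same vector reversed returns bit-identical results, which a plain `float` accumulator does not guarantee because floating-point addition is not associative. The cost is the quantization done once per element in `to_fixed`.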