It actually almost never does. To see that, you'd need to benchmark. It's pretty difficult to get good utilization on a GPU on either the compute or the memory-bandwidth side. A lot of kernels irretrievably fuck up both. You need long, coalesced reads/writes and judicious use of the memory hierarchy, or else everything gets very slow very quickly.
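
To put a rough number on the coalescing point, here is a back-of-the-envelope sketch in plain Python, not a benchmark. It counts how many memory sectors a single warp-wide load touches at a given stride. The 32-thread warp, 32-byte sector, 4-byte element, and aligned base address are all assumptions chosen for illustration, and the function names are made up; real hardware has caches and other effects this ignores, so it's no substitute for profiling.

```python
# Toy model of global-memory coalescing. All sizes below are assumptions
# for illustration (32-thread warp, 32-byte sectors, 4-byte elements,
# base address aligned to a sector boundary). It ignores caching entirely.

def sectors_touched(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Count the distinct sectors hit when thread t reads element t * stride_elems."""
    addrs = [t * stride_elems * elem_bytes for t in range(warp_size)]
    return len({addr // sector_bytes for addr in addrs})


def load_efficiency(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Fraction of the bytes moved that the warp actually asked for."""
    useful_bytes = warp_size * elem_bytes
    moved_bytes = sectors_touched(
        stride_elems, elem_bytes, warp_size, sector_bytes
    ) * sector_bytes
    return useful_bytes / moved_bytes


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 32):
        print(
            f"stride {stride:>2}: "
            f"{sectors_touched(stride):>2} sectors, "
            f"efficiency {load_efficiency(stride):.3f}"
        )
```

Under these assumptions a unit-stride load touches 4 sectors and uses every byte it moves, while any stride of 8 elements or more touches 32 sectors and throws away 7/8 of the traffic. That's the "very slow very quickly" part, before the rest of the memory hierarchy even enters the picture.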