Saying "When an in-order CPU stalls on memory, it's still burning power while waiting, while an OOO processor is still getting work done" seems deceptive, though. The OOO processor is (often) doing speculative work, which may turn out to be unneeded, and a stalled in-order CPU won't draw quite as much power as it would if its execution units were actually switching.
That said, I'm pretty sure the NVIDIA design must speculatively prefetch and hoist memory reads aggressively to be performance-competitive (see the huge cache sizes!), which also burns power.