My understanding was that GPU instruction level parallelism is quite limited compared to CPUs (since multiple "threads" are running on each hardware core) and I wasn't aware that GPUs had any meaningful OOO execution.
This is arguable depending on your definition of ILP. CPUs try to extract parallelism from a single instruction stream so that many instructions from the same stream can execute in parallel; this is very costly in silicon area per instruction executed. GPUs don't need to do this because the programs running on them are "embarrassingly parallel": they have lots of available parallelism and explicitly tell the GPU where it is. So GPUs execute many more instructions in parallel than CPUs, but they don't usually do any work to find implicit parallelism.
If I'm wrong, I'd be happy to learn more.