Depending on which chip is being targeted, there are sometimes "canonical forms" for these operations: as long as the instruction is encoded in the right form, the hardware can automatically break data dependencies and improve hardware-level parallelism.
I don't know that this is necessarily the reason here, but it's one possible explanation.
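To make the idea concrete, here is a minimal sketch using one well-known case, the x86 register-zeroing idiom. This is my own choice of illustration and not necessarily the operation being discussed here. The function names `zero_via_xor` and `zero_via_const` are hypothetical, and the assembly mentioned in the comments is what compilers typically emit, not guaranteed output.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/*
 * Mirrors the machine-level form `xor reg, reg`.
 * Taken literally, XOR reads its operand, so the instruction would have
 * to wait for whatever last wrote that register. Many x86 cores recognize
 * this exact encoding as a zeroing idiom and treat the result as
 * independent of the register's old value, which breaks the dependency.
 */
uint32_t zero_via_xor(uint32_t x)
{
    return x ^ x; /* always 0, whatever x holds */
}

/*
 * Mirrors the form `mov reg, 0`: same result, different encoding.
 * An optimizing compiler will typically fold both functions to the same
 * instructions, so the distinction only shows up in the emitted assembly.
 */
uint32_t zero_via_const(uint32_t x)
{
    (void)x; /* parameter kept only so both helpers share a signature */
    return 0u;
}
```

Both functions return the same value. The point is only that two encodings of the same operation can be treated differently by the hardware, which is why a compiler might prefer one form over the other.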