Among modern ISAs it's basically only x86 that lets you do base + literal + register * literal in a single addressing mode. AArch64 only gives you base + register shifted by a literal, and I believe RISC-V is similar: https://gcc.godbolt.org/z/dz768768z
Out-of-order execution doesn't necessarily save you if you end up with a hard dependency on the value being read (and even if it did, the little cores on ARM SoCs are in-order). This is a pretty obvious candidate for macro-op fusion, but I'm not sure whether it actually happens in practice, or whether it happens on the ARM little cores in particular.
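To make the two points above concrete, here is a minimal sketch in C. The function names `load_with_offset` and `chase` are made up for illustration and are not taken from the godbolt link. The instruction sequences in the comments are what I would typically expect from an optimizing compiler, not verified output for any particular compiler version, so treat them as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: effective address is base + i * 4 + 16.
 *
 * Typical codegen (assumption, varies by compiler and flags):
 *   x86-64:  one load using base + index * 4 + 16 as its address
 *   AArch64: an add with a shifted register, then a load with an
 *            immediate offset (scaled index and offset cannot be
 *            combined in a single load)
 */
int32_t load_with_offset(const int32_t *base, size_t i) {
    return base[i + 4];
}

/* Hypothetical helper: every load address depends on the value returned
 * by the previous load. Any separate address-computation instruction
 * therefore sits on the loop-carried dependency chain, where
 * out-of-order execution cannot hide it behind other work.
 */
size_t chase(const size_t *table, size_t start, size_t steps) {
    size_t idx = start;
    for (size_t s = 0; s < steps; s++) {
        idx = table[idx + 4]; /* address: table + idx * sizeof(size_t) + 4 * sizeof(size_t) */
    }
    return idx;
}
```

In `load_with_offset` the index usually comes from somewhere else, so the extra add can overlap with other work. In `chase` it cannot, which is the "hard dependency on the value of the read" case.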
If the extra instruction is not on the hot path, it is likely free, though that's not guaranteed. If it is on the hot path, it wastes a whole cycle. And of course in heavily ALU-bound code it is one more instruction to issue, so it costs a fraction of a clock.
What do you mean by it wasting a whole cycle? It may indeed perform worse by putting more pressure on the instruction cache, but I don’t see why out-of-order execution would be slower on the hot path. I doubt there are many hot paths without any dependence on memory fetches outside of specific benchmarks, and the memory loads will take significantly more time even if they hit cache.