The cost of bounds checking comes from second-order effects, like making vectorization harder, slightly higher instruction (and possibly data) cache pressure, or requiring higher decode bandwidth. For the vast majority of programs these bottlenecks do not really matter.
I mean, even if the innermost loop is something like 3 assembly instructions, two extra instructions (cmp and jg) do not make any difference, as long as the jg is never taken?
If they are not on the critical path, it doesn't matter. There are no instruction cache issues, as the loop is tiny. For the same reason it will fit in the u-op cache (or even in the loop cache), so decoding is not an issue either. The only problem is a potential lack of vectorization, but a good vector ISA can in principle handle the bounds checking with masked reads and writes. In that case the check is no longer a predictable branch and might end up on the critical path, although that is not necessarily a big cost, or even a measurable one.
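To make the "tiny loop plus cmp and jg" case concrete, here is a minimal Rust sketch. The function names are made up for illustration, and the exact instructions emitted depend on the compiler and optimization level, so treat the comments as a description of what typically happens rather than a guarantee.

```rust
/// Sums the first `n` elements using indexed access.
/// Each `data[i]` carries an implicit bounds check: a compare plus a
/// conditional branch to a panic path. On valid input that branch is
/// never taken, so it is trivially predictable.
pub fn sum_indexed(data: &[u64], n: usize) -> u64 {
    let mut total = 0u64;
    for i in 0..n {
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Same sum through an iterator. There is no index, so there is no
/// per-element bounds check to begin with.
/// Note the behavioral difference: if `n > data.len()` this version
/// silently stops at the end, while `sum_indexed` panics.
pub fn sum_iter(data: &[u64], n: usize) -> u64 {
    data.iter().take(n).fold(0u64, |acc, &x| acc.wrapping_add(x))
}
```

Both functions return the same value whenever `n <= data.len()`; comparing their timings in a benchmark is one way to check whether the extra compare and branch are measurable on a given machine.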
Forget about the second-order effects. The reason the extra instructions do not matter, to a first approximation, is that loops are typically limited by loop-carried dependencies.
Think about this: a machine with infinite execution units and memory bandwidth could potentially execute all iterations of a loop at the same time, in parallel.
Unless each loop iteration depends somehow on the result of the previous iteration. Then only the independent instructions within an iteration can execute in parallel, and the loop is bound by the latency of the dependency chain (especially when it involves memory accesses). This is often the case. Because branch prediction breaks dependencies, a correctly predicted bounds check is not part of any dependency chain, so it is often free or nearly so. For more optimized code, the assumption of infinite resources is of course not warranted, and execution bandwidth and possibly even memory bandwidth need to be taken into consideration.
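A rough sketch of the two situations, in Rust. The names `chase` and `gather` are hypothetical, picked only for this illustration, and the comments describe the expected behavior on a typical out-of-order CPU, not a measured result.

```rust
/// Latency-chain bound: the index for each load comes out of the
/// previous load, so the loads form a serial chain and cannot overlap.
/// The bounds check on `next[i]` also consumes `i`, but it only feeds a
/// predicted branch, so the next load does not have to wait for it.
pub fn chase(next: &[usize], start: usize, steps: usize) -> usize {
    let mut i = start;
    for _ in 0..steps {
        i = next[i]; // loop-carried dependency through memory
    }
    i
}

/// Independent iterations: every load address is known without waiting
/// for an earlier iteration, so many loads can be in flight at once.
/// Here the limit is execution and memory bandwidth, which is the case
/// where extra bounds-check instructions are more likely to show up.
pub fn gather(table: &[u64], idx: &[usize]) -> Vec<u64> {
    idx.iter().map(|&i| table[i]).collect()
}
```

For example, `chase(&[1, 2, 0], 0, 4)` walks 0, 1, 2, 0, 1 and returns 1, one dependent load at a time.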
I am by no means an expert, but I believe what you have in mind would likely fit in i-cache without a problem, so you wouldn’t see a significant difference.
There is an interesting talk titled ‘the death of optimizing compilers’ that argues that for most code these optimizations are almost completely meaningless, and that in the hot loops where they actually matter, compilers do poorly compared to humans (sometimes 100x or more improvements are possible and left on the table). While I don’t completely agree with its points, it is a good talk/slides to read through.
It's not that compilers are stupid, they just don't know what humans know about their data: its ranges, invariants, symmetries, etc. They work on the most general case, which can be horribly inefficient.
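One small illustration of handing the compiler a range it could not otherwise know, sketched in Rust. Both function names are invented for this example, and whether the check is actually removed depends on the optimizer, so this is an assumption to verify in the generated assembly, not a promise.

```rust
/// The index is a `u8`, so its range (0..=255) is encoded in the type.
/// The table has exactly 256 entries, which lets the optimizer prove
/// the access is in bounds and, in principle, drop the check.
pub fn histogram(bytes: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    counts
}

/// Here the values are plain `usize`. A human may know they are always
/// below 256, but the compiler does not. Masking states that range
/// explicitly. ASSUMPTION: the caller guarantees values < 256; larger
/// values would be wrapped into the table instead of panicking.
pub fn histogram_masked(values: &[usize]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &v in values {
        counts[v & 0xFF] += 1;
    }
    counts
}
```

The trade-off in the second version is that an out-of-range value no longer fails loudly, which is exactly the kind of decision only someone who knows the data can make.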