This is true for real software, not microbenchmarks. All of the shootout benchmarks will fit in their entirety in L1i cache -- which makes reducing the executable size pointless.
Incidentally, this is probably the largest reason why so many people still use -O3 -- it wins in exactly the kind of programs that are used as simple and common benchmarks. It solidly loses on almost everything else.
Do you have any data on that? Most CPU bound programs should have pretty good instruction locality, negating the effects of smaller code. But without some numbers this is pointless guesswork.
Incidentally, this is probably the largest reason why so many people still use -O3 -- it wins in exactly the kind of programs that are used as simple and common benchmarks. It solidly loses on almost everything else.