I think I get what you're saying, but I think people would read too much into them. The results depend heavily on the type of program you have. I wouldn't be surprised if benchmark results differed by tens of percentage points from one example to the next.
That's the point. Benchmarks would show which code is already squeezed dry by basic LLVM tools and which code could benefit from superoptimization. As other comments point out, that's the only reasonable basis for deciding to give Souper a try.