I agree that subjective (human) testing is better than comparing using metrics, but the downside is that you can only look at so many images, and how the various codecs perform depends quite a lot on the image content.

When doing a visual comparison, imo the best way is to start with the original versus a codec, to find out what bitrate you consider "good enough" for that image — this can vary wildly between images (on some images "small" will be fine while on others even "large" is not quite good enough). Then compare the various codecs at that bitrate.
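
If you want to automate that "find the bitrate first, then compare codecs at that bitrate" step, here is a rough sketch in Python. It assumes a hypothetical encode(src, quality) wrapper around whichever encoder you are testing (just a placeholder, not a real API), and that higher quality settings produce larger files:

    import os

    def bits_per_pixel(encoded_path, width, height):
        # file size in bits divided by the number of pixels
        return os.path.getsize(encoded_path) * 8 / (width * height)

    def match_bpp(encode, src, width, height, target_bpp, lo=1, hi=100, tol=0.01):
        # Bisect the encoder's quality setting until the output lands close to
        # the target bits-per-pixel. encode(src, quality) is a hypothetical
        # wrapper that runs the encoder and returns the output file path.
        while hi - lo > 1:
            q = (lo + hi) // 2
            out = encode(src, q)
            bpp = bits_per_pixel(out, width, height)
            if abs(bpp - target_bpp) <= tol:
                return q, out
            if bpp < target_bpp:
                lo = q  # file too small: raise the quality setting
            else:
                hi = q  # file too big: lower the quality setting
        return lo, encode(src, lo)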

There's a temptation to compare things at low qualities (e.g. "tiny") because artifacts are of course easier to see there. You cannot extrapolate codec performance at low quality settings to high quality though: just because AVIF looks better than JXL at "tiny" does not mean it also looks better at "large". So if you want to do a meaningful comparison, it's best to compare at the quality you actually want to use.

At Cloudinary we did a large subjective study on 250 different images, at the quality range we consider relevant for web delivery (medium quality to near-visually lossless). We collected 1.4 million opinions via crowdsourcing in order to get accurate mean opinion scores. The results are available at https://cloudinary.com/labs/cid22.
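
(For anyone unfamiliar with the terminology: a mean opinion score is just the average of all raw ratings collected for one stimulus, i.e. one image encoded with one codec at one setting, usually reported with a confidence interval. A minimal sketch with a made-up input format, not the actual CID22 data layout:)

    import math
    from collections import defaultdict

    def mean_opinion_scores(ratings):
        # ratings: iterable of (stimulus_id, score) pairs.
        # Returns {stimulus_id: (mean opinion score, 95% confidence half-width)}.
        per_stimulus = defaultdict(list)
        for stimulus, score in ratings:
            per_stimulus[stimulus].append(score)
        result = {}
        for stimulus, scores in per_stimulus.items():
            n = len(scores)
            mos = sum(scores) / n
            if n > 1:
                var = sum((s - mos) ** 2 for s in scores) / (n - 1)
                ci95 = 1.96 * math.sqrt(var / n)
            else:
                ci95 = float("inf")
            result[stimulus] = (mos, ci95)
        return result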

One important thing to notice is that codec performance depends not only on the codec itself but also on the encoder and the encoder settings that are used. If you spend more time on encoding, you can get better results. A fair comparison is one that uses the best available encoders, at similar (and relevant) bitrates, and at similar (and relevant) encode speeds.
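
Concretely, a fair benchmark harness has to record both axes, bitrate and encode speed, for every encoder/setting combination. A minimal sketch; the actual encoder command line is left as an input because flags differ per encoder:

    import os
    import subprocess
    import time

    def benchmark_encode(cmd, output_path, width, height):
        # cmd: the full encoder invocation as a list of strings, built elsewhere
        # per encoder and per setting (this sketch treats it as opaque).
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        elapsed = time.perf_counter() - start
        pixels = width * height
        return {
            "mp_per_s": (pixels / 1e6) / elapsed,               # encode speed
            "bpp": os.path.getsize(output_path) * 8 / pixels,   # bitrate
        }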

It's almost impossible to do subjective evaluation for all possible encoder settings though, or to redo the evaluations each time a new encoder version is released. This is why objective metrics are useful. There are many metrics, and some are better than others. You can judge how good a metric is by how well it correlates with subjective results. According to our experiments, the best metrics currently are SSIMULACRA 2, Butteraugli 3-norm, and DSSIM. Older metrics like PSNR, SSIM, or even VMAF do not perform as well, probably in part because some encoders are optimizing for them.
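
Measuring that correlation typically boils down to computing Pearson/Spearman/Kendall coefficients between the metric's scores and the mean opinion scores over the same set of encoded images. A minimal sketch with scipy, assuming the two lists are aligned per encoded image:

    from scipy import stats

    def metric_vs_mos(metric_scores, mos_scores):
        # Parallel lists with one entry per encoded image: the metric's score
        # and the crowdsourced mean opinion score for that image.
        return {
            "PLCC": stats.pearsonr(metric_scores, mos_scores)[0],   # linear correlation
            "SRCC": stats.spearmanr(metric_scores, mos_scores)[0],  # rank correlation
            "KRCC": stats.kendalltau(metric_scores, mos_scores)[0], # rank correlation
        }

(For distortion metrics the coefficients come out negative; it's their magnitude that tells you how well the metric tracks human opinion.)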

Here are some aggregated interactive plots that show both compression gains (percentage saved over unoptimized JPEG, at a given metric score) and encode speed (megapixels per second):

SSIMULACRA 2: https://sneyers.info/benchmarks/tradeoff-relative-SSIMULACRA...

Butteraugli: https://sneyers.info/benchmarks/tradeoff-relative-Butteraugl...

DSSIM: https://sneyers.info/benchmarks/tradeoff-relative-DSSIM.html

(Note that Butteraugli and DSSIM are distortion metrics, not quality metrics, so lower is better.)
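
For what it's worth, the "percentage saved over unoptimized JPEG at a given metric score" numbers in plots like these are typically obtained by interpolating each codec's rate-quality curve and comparing bitrates at equal quality. A minimal sketch with numpy, assuming scores where higher is better (a distortion metric would need its sign flipped first):

    import numpy as np

    def bitrate_at_score(bpp, score, target_score):
        # (bpp, score) describe one codec's rate-quality curve, one point per
        # encoder setting; interpolate the bitrate needed to reach target_score.
        order = np.argsort(score)
        return np.interp(target_score,
                         np.asarray(score)[order],
                         np.asarray(bpp)[order])

    def percent_saved_vs_jpeg(codec_curve, jpeg_curve, target_score):
        # Each curve is a (bpp, score) pair of arrays. "Saved" is how much
        # smaller the codec's file is than JPEG's at the same metric score.
        codec_bpp = bitrate_at_score(*codec_curve, target_score)
        jpeg_bpp = bitrate_at_score(*jpeg_curve, target_score)
        return 100.0 * (1.0 - codec_bpp / jpeg_bpp)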


