
One major use of public datasets in the academic community is to serve as a common reference when comparing new techniques against the existing standard. A static baseline is desirable for that task.

You could maybe split the difference by having an "original" or "reference" version, and a separate moving target that incorporates crowdsourced improvements.



This sounds like a problem a versioning scheme would help a lot with. Have a quarterly or annual release cycle, or something similar, so that when you want to compare performance across techniques you just train both of them against the same release (and ideally all the papers coming out at roughly the same time would already be using the same revision anyway).

You'd always work with a versioned release when training models, and you'd typically only work with HEAD when you were specifically looking to correct flaws in the data (as the authors in the linked article are).
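As a rough sketch of that workflow with the Hugging Face datasets library (the dataset name and revision tags here are placeholders, assuming the data is hosted on the Hub with tagged releases):

    from datasets import load_dataset

    # Training / benchmarking: pin a tagged release so results stay comparable across papers.
    train_ds = load_dataset("example-org/example-dataset", split="train", revision="v2024.1")

    # Data cleaning: track the default branch (effectively HEAD) to work on the latest crowdsourced fixes.
    fixup_ds = load_dataset("example-org/example-dataset", split="train", revision="main")

Papers would then cite the revision tag they trained against, and the moving branch could be cut into the next tagged release on whatever cadence the maintainers pick.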



