This is a long-solved problem far predating AI.
You do it by releasing 90% of the benchmark publicly and holding back 10% for yourself or closely trusted partners.
Then whoever holds the 10% can independently check whether a score on the holdback matches the score on the public 90%. A large gap suggests the system was tuned to, or trained on, the public portion.
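
As a rough sketch of what that looks like in practice (the function names, the fixed seed, and the two-proportion z-test are my own assumptions here, not any particular benchmark's procedure), you could split the items once and later compare the two accuracies:

```python
import math
import random


def split_benchmark(items, holdout_fraction=0.10, seed=0):
    """Shuffle deterministically, then return (public, holdout).

    The seed and the 10% default are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy_gap(public_correct, public_total, holdout_correct, holdout_total):
    """Return (gap, z) where gap = public accuracy - holdout accuracy.

    z is a pooled two-proportion z-statistic, one of several
    reasonable ways to ask whether the gap is larger than noise.
    """
    p_pub = public_correct / public_total
    p_hold = holdout_correct / holdout_total
    pooled = (public_correct + holdout_correct) / (public_total + holdout_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / public_total + 1 / holdout_total))
    z = (p_pub - p_hold) / se if se > 0 else 0.0
    return p_pub - p_hold, z


# Hypothetical numbers: 1000 items, 90% on the public set, 70% on the holdback.
public, holdout = split_benchmark(range(1000))
gap, z = accuracy_gap(810, 900, 70, 100)
print(f"public={len(public)} holdout={len(holdout)} gap={gap:.2f} z={z:.2f}")
```

With a holdback this small the comparison is noisy, so a modest gap proves little. Only a gap well outside sampling error is worth treating as a red flag.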