I've worked on commercial systems where N <= 10,000 in the evaluation set, and even there the confidence interval is probably nowhere near as tight as 0.1%. For instance, there is a lot of work on this data set (https://ir-datasets.com/gov2.html), which we used to tune up a search engine, and sometimes it's as bad as N=50 queries with judgements. I don't see papers that are part of TREC, or based on TREC data, dealing with sampling error in any systematic way.
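To make the sampling-error point concrete, here is a minimal sketch (plain Python, with hypothetical per-query average-precision scores, not taken from any real TREC run) of a percentile-bootstrap confidence interval for MAP over N=50 judged queries. The resulting 95% interval comes out orders of magnitude wider than 0.1%:

```python
import random
import statistics

# Hypothetical per-query AP scores for N=50 judged queries
# (illustrative values only; not from any real run).
random.seed(0)
ap_scores = [random.betavariate(2, 5) for _ in range(50)]

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean of per-query scores."""
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(scores) for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

lo, hi = bootstrap_ci(ap_scores)
print(f"MAP = {statistics.fmean(ap_scores):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With 50 queries the interval spans several points of MAP, which is why comparing systems on differences of a fraction of a point is dubious without some such error analysis.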
NIST's TREC workshop series uses Cyril Cleverdon's methodology (the "Cranfield paradigm") from the 1960s, and more could surely be done on the evaluation front:
- systematically addressing sampling error (see the sketch after this list);
- more than 50 queries;
- more/all QRELs;
- full evaluation instead of system pooling;
- studying IR beyond the English language (this has been picked up by CLEF and NTCIR in Europe and Japan, respectively);
- devising metrics that take energy efficiency into account;
- ...
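On the first point, one standard way to address sampling error systematically when comparing two systems is a paired bootstrap over queries. A minimal sketch, again with hypothetical scores rather than real runs:

```python
import random
import statistics

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000):
    """Paired bootstrap over queries: fraction of resamples in which
    system A's mean score exceeds system B's.

    scores_a and scores_b are per-query scores on the SAME query set.
    """
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        diff = statistics.fmean(scores_a[i] - scores_b[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples

# Hypothetical runs: B is a small, noisy improvement over A on 50 queries.
random.seed(1)
a = [random.betavariate(2, 5) for _ in range(50)]
b = [max(0.0, min(1.0, s + random.gauss(0.01, 0.05))) for s in a]
print(f"Fraction of resamples with mean(A) > mean(B): {paired_bootstrap(a, b):.3f}")
```

Reporting something like this alongside the usual MAP/nDCG tables would make it much clearer whether a claimed improvement survives the sampling noise of a 50-query topic set.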
At the same time, we have to be very grateful to NIST/TREC for running an open, international benchmark annually, which has moved the field forward a lot over the last 25 years.