That is an excellent prompt to tuck away in your back pocket and try again futur...

minraws · 2025-04-22T06:19:30 1745302770

If you keep the prompt the same, at some point the data will appear in the training set and the model might simply have the answer.

So even though it might be a good check today, it might not remain such a good benchmark.

I think we need a way to keep updating prompts, without increasing their complexity, in order to properly verify model improvements. ARC Deep Research, anyone?

red_trumpet · 2025-04-22T09:44:54 1745315094

Well, to test research capabilities, one could just adapt the year (2024 -> 2025) in the prompt.

minraws · 2025-04-23T05:25:38 1745385938

I am not sure what happens if some site keeps tracking these metrics and that page finds its way into the training data.

There are some NBA fan sites that do keep track of some of these tournament-level final metrics.

ljsprague · 2025-04-22T09:11:23 1745313083

Wouldn't somebody need to answer the question below? Or do you mean the discussion of its weakness might somehow make it stronger the next time it's trained?

minraws · 2025-04-23T05:27:53 1745386073

I think it can be both. What if discussing the weakness provides more relevant links for the question and somehow helps a model trained on scraped web data learn it?

I am not sure whether the model will need the exact answer, or whether backlinks to the sites where it can find the answer are enough. Maybe just documenting how to do it could do the job as well...