It seems critical to have diverse, inclusive, and equitable data for model train...

appleorchard46 · 2025-03-27T19:30:46 1743103846

I'm calling it now. My prediction is that, 5-10 years from now(ish), once training efficiency has plateaued, and we have a better idea of how to do more with less, curated datasets will be the next big thing.

Investors will throw money at startups claiming to make their own training data by consulting experts; finetuning as it is now will be obsolete; pre-ChatGPT internet scrapes will be worth their weight in gold. Once we hit a wall on what we can do with the data, the data itself is the next target.

0cf8612b2e1e · 2025-03-27T19:24:30 1743103470

Funny you should say that. There was a push to have more officially collected DIET data for exactly this reason. Unfortunately such efforts were recently terminated.

nonethewiser · 2025-03-27T19:21:22 1743103282

Or take more inputs. If there are differences across race and gender and that's not captured as an input, we should expect the accuracy to be lower.

If an x-ray means different things based on race or gender, we should make sure the model knows the race and gender.

red75prime · 2025-03-28T02:31:09 1743129069

And not applying fairness techniques to the resulting model.