
Maybe they need/want data.


OpenAI and most AI companies do not train on data submitted to a paid API.


Why don't they?


They probably fear that people wouldn’t use the API otherwise, I guess. They could have different tiers though where you pay extra so your data isn’t used for training.


They also do not train using copyrighted material /s


That's different. They train on scrapes of the web. They don't train on data submitted to their API by their paying customers.


If they're bold enough to say they train on data they do not own, I am not optimistic when they say they don't train on data people willingly submit to them.


I don't understand your logic there.

They have confessed to doing a bad thing - training on copyrighted data without permission. Why does that indicate they would lie about a worse thing?


>Why does that indicate they would lie about a worse thing?

Because they know their audience. It's an audience that also doesn't care for copyright and would love for them to win their court cases. They are fine making such an argument to those kinds of people.

Meanwhile, when legal ran a very typical subpoena process on said data, data they chose to submit to an online server of their own volition, that same audience completely freaked out. Suddenly, they felt like their privacy was invaded.

It doesn't make any logical sense in my mind, but a lot of the discourse over this topic isn't based on logic.


Oh, they never even made that promise. They're trying to say it's fine to launder copyrighted material through a model.


If you believe that, I have a bridge I can sell you...


If it ever leaked that OpenAI was training on the vast amounts of confidential data being sent to them, they’d be immediately crushed under a mountain of litigation and probably have to shut down. Lots of people at big companies have accounts, and the bigcos are only letting them use them because of that “Don’t train on my data” checkbox. Not all of those accounts are necessarily tied to company emails either, so it’s not like OpenAI can discriminate.


Plus I can imagine uncomfortable things coming up given how much non public information people send to LLMs.

"What do you think of REVG?"

"REVG is a solid company with a long history and upcoming earnings that will exceed Wall Street expectations."

OK maybe not literally like that but still... training on that much private data could get spicy.


And it’s a massive distillation of the mother model, so the costs of inference are likely low.



