Maybe they need/want data.

impure · 2025-08-07T19:36:54 1754595414

OpenAI and most AI companies do not train on data submitted to a paid API.

dortlick · 2025-08-07T21:05:23 1754600723

Why don't they?

echoangle · 2025-08-07T21:39:19 1754602759

They probably fear that people wouldn’t use the API otherwise, I guess. They could have different tiers though where you pay extra so your data isn’t used for training.

WhereIsTheTruth · 2025-08-07T19:58:24 1754596704

They also do not train using copyrighted material /s

simonw · 2025-08-07T20:26:43 1754598403

That's different. They train on scrapes of the web. They don't train on data submitted to their API by their paying customers.

johnnyanmac · 2025-08-07T20:47:52 1754599672

If they're bold enough to say they train on data they do not own, I am not optimistic when they say they don't train on data people willingly submit to them.

simonw · 2025-08-07T20:59:23 1754600363

I don't understand your logic there.

They have confessed to doing a bad thing - training on copyrighted data without permission. Why does that indicate they would lie about a worse thing?

johnnyanmac · 2025-08-07T21:04:20 1754600660

>Why does that indicate they would lie about a worse thing?

Because they know their audience. It's an audience that also doesn't care for copyright and would love for them to win their court cases. They are fine making such an argument to those kinds of people.

Meanwhile, that same audience completely freaked out when legal ran a very typical subpoena process on said data, data they chose to submit to an online server of their own volition. Suddenly, they felt like their privacy was invaded.

It doesn't make any logical sense in my mind, but a lot of the discourse over this topic isn't based on logic.

daveguy · 2025-08-07T20:07:35 1754597255

Oh, they never even made that promise. They're trying to say it's fine to launder copyrighted material through a model.

anhner · 2025-08-07T21:42:45 1754602965

If you believe that, I have a bridge I can sell you...

Uehreka · 2025-08-07T22:36:19 1754606179

If it ever leaked that OpenAI was training on the vast amounts of confidential data being sent to them, they’d be immediately crushed under a mountain of litigation and probably have to shut down. Lots of people at big companies have accounts, and the bigcos are only letting them use them because of that “Don’t train on my data” checkbox. Not all of those accounts are necessarily tied to company emails either, so it’s not like OpenAI can discriminate.

Breza · 2025-08-15T03:46:56 1755229616

Plus I can imagine uncomfortable things coming up given how much non-public information people send to LLMs.

"What do you think of REVG?"

"REVG is a solid company with a long history and upcoming earnings that will exceed Wall Street expectations."

OK maybe not literally like that but still... training on that much private data could get spicy.

dr_dshiv · 2025-08-07T19:24:25 1754594665

And it’s a massive distillation of the mother model, so the costs of inference are likely low.