How do other LLMs like Claude deal with this?

BonoboIO · 2025-01-21T03:50:47 1737431447

You don’t talk about Fight Club …

Everyone uses “pirated” content, but some are better at hiding it and/or not talking about it.

There is no other way to do it.

visarga · 2025-01-21T05:07:48 1737436068

More recently they train on a mix of synthetic and organic text, as with the Phi-4 and o1 / o3 models. Original copyrighted text can be safely replaced with synthetic stand-ins.

BonoboIO · 2025-01-21T14:07:26 1737468446

I think this works only to a certain degree; they will still use as much data as they can to train the models.

Synthetic data will not replace original data like books. Synthetic data works very well for math.