Adding an extra constraint of no verbatim copying from a very large and relevant corpus would be hard to guarantee without enormous databases of copyrighted content (which might not even be legal to hold), and it would add yet another objective to a system that already juggles many, often contradictory, goals. I don't think that's a technically sound solution, or one in anyone's interest. It makes much more sense to license content from as many newspapers as possible, recognize when references are relevant, and either quote them explicitly verbatim when that's the best answer or adapt them (translate, simplify, add context) when appropriate.
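To make the scale problem concrete, here's a minimal sketch of what such a "no verbatim copying" filter could look like: hash every n-gram of the protected corpus and reject any generated text that overlaps it. The function names, n-gram length, and usage are my own illustration, not anything OpenAI has described; the point is that the index itself has to cover all the copyrighted text you want to protect.

```python
def ngrams(text: str, n: int = 8):
    """Yield word-level n-grams from a text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def build_index(corpus_articles, n: int = 8):
    """Hash every n-gram of the protected corpus.
    For millions of articles this index alone becomes enormous,
    which is the practical objection above."""
    index = set()
    for article in corpus_articles:
        index.update(hash(g) for g in ngrams(article, n))
    return index

def violates_verbatim_constraint(generated: str, index, n: int = 8) -> bool:
    """True if any n-gram of the generated text appears in the corpus index."""
    return any(hash(g) in index for g in ngrams(generated, n))

# Hypothetical usage:
# index = build_index(all_articles_to_protect)
# if violates_verbatim_constraint(model_output, index):
#     ...regenerate or refuse...
```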
I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to the NYTimes if you can get just as much through OpenAI, but I am open to paying extra to support their work).
What if OpenAI were to first summarize or transform the content before training on it? Then the LLM would never actually have seen the copyrighted text and couldn't produce an exact copy.
You are assuming the transformation is lossy. The stylistic guidelines and personal habits of beat journalists suggest it might not be, depending on how detailed the transformed text is. The complaint includes many quotes that are long verbatim passages.
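Those verbatim passages are also easy to demonstrate. As a rough sketch (the inputs here are placeholders, not the actual complaint exhibits), one could just measure the longest contiguous run of words shared between an article and a model's output:

```python
from difflib import SequenceMatcher

def longest_verbatim_run(article: str, output: str) -> str:
    """Return the longest contiguous run of words shared by both texts."""
    a, b = article.split(), output.split()
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    return " ".join(a[match.a:match.a + match.size])

# Hypothetical usage: a run of dozens or hundreds of words is the kind of
# near-verbatim reproduction quoted in the complaint.
# print(longest_verbatim_run(nyt_article_text, chatgpt_output))
```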