In my head I like to think of web crawler search engines/search engine databases and LLMs as being somewhat similar. Search engines are OK if they just provide snippets with citations (URLs), and they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.
A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.
I think it's different. LLMs can solve problems. Part of that problem-solving ability comes from training on seemingly unrelated content such as NYT articles. GPT-4 doesn't have to spit out NYT articles verbatim to have benefited from NYT articles. In that sense, it draws on NYT articles for every query.
Let's say I'm an academic. If my research, note-taking, and paper-writing skills lead to work with fair-use, cited quotations where applicable, general knowledge left unattributed, and my own creative aspects and unique conclusions making up the intriguing part, that's copacetic. If I spit out (from memory, mind you) verbatim quotes and light rewordings of NY Times articles, that's not; "I don't remember where I got that material" doesn't cut it. Reading the NY Times every day for years, because I judge it to be more literate and accurate than other sources, has undoubtedly informed my thinking and style, but I don't need to acknowledge that.
If I use ChatGPT as a research tool, as long as it lives within the same parameters that I have to live within, I don't see a problem with its education/learning.
I understand that the NYTimes would like a slice of anything that comes out of the GPT but I'm talking about what seems reasonable. People who share their copyrighted material do not own all of the thinking that comes out of it; they own that expression of it, that is all.
Will AI destroy the economics of "writing" the way the web has killed newspapers? Perhaps. Perhaps we'll all benefit from and need a new model, but killing the new to keep the old on life support is not the way.
You're not replicating yourself millions of times and selling yourself for $20/month. If you are, then NYT might sue you too.
I'm not saying LLMs are illegal by default. All I'm saying is that there is some merit to why NYT and content companies want a piece of the pie and think they deserve it.
The NY Times benefited in the past from technologies that led to widespread distribution of the Times, putting competitors out of business and concentrating talent at the Times. Nobody is stopping them from producing new editions of the newspaper, their core business. People now have technologies that help them "remember" what was salient in back issues of the Times. Such is progress.