The New York Times is suing OpenAI and Microsoft for copyright infringement (theverge.com)
593 points by ssgodderidge on Dec 27, 2023 | 868 comments



Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart. I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.


For all the leaks on secret projects, novel training algorithms no longer being published so as to preserve market share, custom hardware, Q* learning, and internal politics at companies at the forefront of state-of-the-art LLMs, there is one thunderous silence: the lack of leaks on the exact datasets used to train the main commercial LLMs.

It is clear that neither OpenAI nor Google used only Common Crawl. With so many press conferences, why has no investigative journalist yet asked OpenAI or Google to confirm or deny whether they use or used LibGen?

Did OpenAI really buy an ebook of every publication from Cambridge Press, Oxford Press, Manning, APress, and so on? Did any of the investors' due diligence include researching the legality of the content used for training?


I'm not for or against anything at this point until someone gets their balls out and clearly defines what copyright infringement means in this context.

If you give a bunch of books to a kid all by the same author and then pay that kid to write a book in a similar style and then I go on to sell that book...have I somehow infringed copyright?

The kid's book at best is likely to be a very convincing facsimile of the original author's work...but not the author's work.

It seems to me that the only solution for artists is to charge for access to their work in a secure environment then lobotomise people on the way out.

The endgame seems to be "you can view and enjoy our work, but if you want to learn or be inspired by it, that's not on"


There are two problems with the “kid” analogy:

a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.

b) Training an LLM isn’t like giving someone a book. Among other things, it involves making a derivative copy into GPU memory. This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.


> This copy is not a transitory copy in service of a fair use

Training is almost certainly fair use, so it's exactly a transitory copy in service of fair use. Training, other than the brief "transitory copy" you mention is not copying, it's making a minuscule algorithmic adjustment based on fleeting exposure to the data.


Why is training “almost certainly” fair use?

Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software—entirely within the memory system of a licensed user—in service of debugging it.

If it took an act of Congress to make “unlicensed” debugging a fair use copy…


If you overtrain, the model may include verbatim copies of your training material, and may be able to produce verbatim copies of the original in its output.

If Microsoft truly believes that the trained output doesn't violate copyright then it should be forced to prove that by training it on all its internal source code, including Windows.


> If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.

I don't think that you can copyright a plot or story in any country, can you?

If he re-wrote the story with different characters and different lines he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it's too close, but legally ok.


>This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself,

Seems vastly transitory and since the output cannot be copyrighted, does no harm to any work it “trained” on.


How is it a copy at all? Surely the model weights would therefore be much larger than the corpus of training data, which is not the case at all.

If it disgorges parts of NYT articles, how do we know this is not a common phrase, or the article isn't referenced verbatim on another, unpaid site?

I agree that if it uses the whole content of their articles for training, then NYT should get paid, but I'm not sure that they specifically trained on "paid NYT articles" as a topic, though I'm happy to be corrected.

I also think that companies and authors extremely overvalue the tiny fragments of their work in the huge pool of training data, I think there's a bit of a "main character" vibe going on.


Regarding (b) ... while a specific method of training that involved persistent copying may indeed be a violation, it is far from clear that the general notion of "send server request for URL, digest response in software that is not a browser" is automatically a violation. If there is deemed to be a difference (i.e. all you are allowed to do without a license is have a human read it in a browser), then one can see training mechanisms changing to accommodate that.


It’s all about the purpose the transitory copy serves. The mechanism doesn’t really matter, so you can’t make categorical claims about (say) non-browser requests.


I don't have a comment on your hypothetical, but this case seems to go far beyond that. If you read the actual filing at the bottom of the linked page, NYT provides examples where ChatGPT recited exact multi-paragraph sections of their articles and tried to pass them off as its own words. Plainly reproducing a work is pretty much the only situation where "is this copyright violation?" isn't really in flux. It's not dissimilar to selling PDFs of copyrighted books.

If NYT were fully relying on the argument that training a model in wordcraft using their materials is always copyright violation, or only had short quotes to point to, the philosophical debate you're trying to have would be more relevant.


Importantly, the kid- an individual human- got some wealth somewhat proportional to their effort. There’s non-trivial effort in recruiting the kid. We can’t clone the kid’s brain a million times and run it for pennies.

There are ethical, political, and other differences between an AI doing something and a human doing the exact same thing. Those differences may need reflecting in new laws.

IANAL and don’t have any positive suggestions for good laws, just pointing out that the analogy doesn’t quite hold. I think we’re in new territory where analogies to previous human activities aren’t always productive.


I think you’re skipping over the problem.

In your example you owned the work you gave to the person to create derivatives of.

In a more accurate example you would be stealing those books and then giving them to someone else to create derivatives.


How about if I borrowed them from the library and gave them to the kid to read?

How about if I got the kid to read the books on a public website where the author made the books available for free?


Ironically these artists can't claim to be wholly original as they were certainly inspired. Artists that play live already "lobotomize" people on their way out since it's not easy to recreate an experience and a video isn't the same if it's a good show.

Artists that make easily reproducible art will circulate as these always have along with AI in a sea of other jpgs.


you might be well served by reading the actual complaint.


I think your kid analogy is flawed because it ignores the fact that you couldn't reasonably use said "kid" to rapidly produce thousands of works in the same style and then go on to use them to flood the market and drown out the original author's presence.

Try this with a real "kid" and you'll run into all kinds of real-world constraints whereas flooding the world with derivative drivel using LLMs is something that's actually possible.

So yeah, stop using weak analogies, it's not helpful or intelligent.


Would be fascinated to hear from someone inside on a throwaway, but my nearest experience is that corporate lawyers aren't stupid.

If there's legally-murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.

They may be able to train against it. They may be able to peek at portions of it. But no one is downloading-all.


Big corporations and corporate lawyers lose major lawsuits all the time.


That doesn't mean they don't spend lots of time thinking of ways not to lose them.

See: Google turning off retention on internal conversations to avoid creating anti-trust evidence


for what it's worth, i asked altman directly and he denied using libgen or books2, but also deferred to murati and her team on specifics. but the Q&A wasn't recorded and they haven't answered my follow-ups.


Really? Because the GPT-3 paper talks about "...two internet-based books corpora (Books1 and Books2)..." (see pages 8 and 9) - https://arxiv.org/pdf/2005.14165.pdf

Unclear what those corpora might be, or if it's the same books2 you are referring to.


My guess is that this poster meant books3, not books2.

books1 and books2 are OpenAI corpuses that have never (to my knowledge) had their content revealed.

books3 is public, developed outside of OpenAI and we know exactly what's in it.


sorry, books3 is indeed what I meant.


Why would he know the answer in the first place?


The legal liabilities of the training data they use in their flagship product seem to be a thing the CEO should know.


We all remember when Aaron Swartz got hit with a wire tapping and intent to distribute federal crime for downloading JSTOR stuff, right?

It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior are seeing NO federal investigations for this sort of behavior. Yet a private citizen does it and it's threats of life in prison.

This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice over investigating a large corporation engaged in the same behavior.


What happened to Aaron Swartz was terrible. I find that what he was doing was outright good. IMO the right reading isn't to make sure anyone doing something similar faces the same fate, but to make the information far more free, whether it's a corporation using it or not. I don't want them to steamroll everyone equally here, but to not steamroll anyone.


There are two points at issue here. One, that information should be more free, and two, that large corporations and private individuals should be equal before the law.


> I don't want them to steamroll everyone equally here, but to not steamroll anyone.

I think you're missing the point, and putting the cart before the horse. If you ensure that corporations are treated as stringently as people sometimes are, the reverse is true. And that means your goal will presumably be obtained, as the corporate might becomes the little guy's win.

All with no unjust treatment.


Huh. I see downvotes. I am mystified, for if people and corporations are both treated stringently under the law, corporations will fight to have overly restrictive laws knocked down.

I envision pitting corporate body against corporate body: when one corporation lobbies or works to (for example) extend copyrights, others will work to weaken copyright.

That doesn't happen as vigilantly currently, because there is no corporate incentive. They play the old "ask for forgiveness rather than permission" angle.

Anyhow. I just prefer to set my enemies against my enemies. More fun.


Corporations follow these laws much more stringently than individuals. Individuals often use pirated software to make things, I've seen many examples of that. I've never seen a corporation use pirated software to make things, they pay for licenses. Maybe there are some rare cases, but pirating is mostly a thing individuals do, not corporations.

So in general it is already as you say, corporations are much more targeted by these laws than individuals are. These laws mostly hinder corporations, us individuals are too small to be noticed by the system in most cases.

I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague. I can't really think of many examples where corporations are breaking these laws more than small individuals do.


So then you refute the comment I replied to, and its parent.


> I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague.

They use copyrighted material or they commit copyright infringement? The former doesn't necessarily constitute the latter. Likewise, given it's an option legally, there are other factors that go into the decision to use it that likely make it less attractive to AAA games.


Circumventing computer security to copy items en masse to distribute wholesale without transformation is a far cry from reading data on public facing web pages.


He didn't circumvent computer security. He had a right to use the MIT network and pull the JSTOR information. He certainly did it in a shady way (computer in a closet) but it's every bit as arguable that he did it that way because he didn't want someone stealing or unplugging his laptop while it was downloading the data.

He also did not distribute the information wholesale. What he planned on doing with the information was never proven.

OpenAI IS distributing information they got wholesale from the internet without license to that information. Heck, they are selling the information they distribute.


> OpenAI IS distributing information they got wholesale from the internet

Facts are not subject to copyright. It's very obvious ChatGPT is more than a search engine regurgitating copies of pages it indexed.


> Facts are not subject to copyright

That's false; but even assuming it's true, misinformation is creative content and therefore 99% of the Internet is subject to copyright.


No it is not. You can make a better argument than just BSing.

https://libraries.emory.edu/research/copyright/copyright-dat...


> right to use the MIT

That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, also knowing this was against his "right to use").

The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.

> and pull the JSTOR information

No, he did not have the right to pull en masse. The JSTOR access explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he were sitting at MIT in an office not breaking into systems.

> did it in a shady way

The word you're looking for is "illegal." Breaking and entering is not simply shady - it's illegal and against the law. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and one of the charges.

> he did it that way because he didn't want someone stealing or unplugging his laptop

Ah, the old "ends justify breaking the law" argument.

Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Swartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simple plug-and-forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer, he used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.

Go read the indictment and evidence.

> OpenAI IS distributing information they got wholesale

No, that's ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.

It's extremely hard to even get a complete decent sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post selecting any matches).
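One way to put a number on "a complete decent sized paragraph" is to measure the longest run of consecutive words that a model output shares with a source text. The snippet below is only a minimal sketch under stated assumptions: the function name `longest_shared_run` and the sample strings are invented for illustration, whitespace splitting stands in for real tokenization, and nothing here reflects how the complaint's examples were actually produced.

```python
from difflib import SequenceMatcher


def longest_shared_run(source: str, output: str) -> list:
    """Return the longest run of consecutive words found in both texts.

    Hypothetical helper for illustration: words are split on whitespace,
    with no normalization of case or punctuation.
    """
    source_words = source.split()
    output_words = output.split()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which would otherwise skip common words.
    matcher = SequenceMatcher(None, source_words, output_words, autojunk=False)
    match = matcher.find_longest_match(
        0, len(source_words), 0, len(output_words)
    )
    return source_words[match.a : match.a + match.size]


if __name__ == "__main__":
    # Made-up strings, not taken from any article or model output.
    article = "the quick brown fox jumps over the lazy dog"
    generated = "he said the quick brown fox ran away"
    run = longest_shared_run(article, generated)
    print(len(run), "words:", " ".join(run))
```

A long shared run suggests verbatim reproduction, while a short one is consistent with common phrasing; where the line falls legally is exactly what the thread is arguing about.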

Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.

How many could I get from what Swartz downloaded? Millions? Not just even as text - I could have gotten the complete author formatted layout, diagrams, everything, in perfect photo ready copy.

You're being dishonest in claiming these are the same. One can feel sad for Swartz's outcome, realize he was breaking the law, and realize the current OpenAI copyright situation is likely unlike any previous copyright situation all at the same time. No need to equate such different things.


OK, so you've written a lot, but it comes down to this: what law did he break?

Neither MIT nor JSTOR raised issue with what Swartz did. JSTOR even went out of their way to tell the FBI they did not want him prosecuted.

Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.

> His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.

And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were in their rights to cut access. That does not mean that what Swartz did was illegal. Similar to how if a business owner tells you "you need to leave now" you aren't committing a crime because they asked you to leave. That doesn't happen until you are trespassed.

> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.

You violate copyright by transforming. And fortunately, it's really simple to show that ChatGPT will violate it and simply emit byte-for-byte chunks of copyrighted material.

You can, for example, ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you.

> How many could I get from what Swartz downloaded?

0, because he didn't distribute.


> What law did he break?

You can read the indictment, which I already suggested you do.

> Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.

He wasn't charged with wiretapping (not even sure that's a generic crime). He was charged with (two counts of) wire fraud (18 USC 1343), a huge difference. He also had 5 different charges of computer fraud (18 USC 1030(a)(4), (b) & 2), 5 counts of unlawfully obtaining information from a protected computer (18 USC 1030 (a)(2), (b), (c)(2)(B)(iii) & 2), and 1 count of recklessly damaging a protected computer (18 USC...).

He was not charged with "intent to distribute", and there's no such thing as a "wiretapping" charge. Did you ever once read the actual indictment, or did you just make all this up from internet forum posts?

If you're going to start with the phrase "Remember, again.." you should try not to make up nonsense. Actually read what you're asking others to "remember" which you apparently never knew in the first place.

> you are confusing a "crime" with "misuse of a system"

Apparently you are (willfully?) ignorant of law.

> You violate copyright by transforming.

That's false too. Transformative use is one defense used to not infringe copyright. Carefully read up on the topic.

> ask it to implement Java's Array list and get several verbatim parts of the JDKs source code echoed back at you

Provide the prompt. Courts have ruled that code that is the naïve way to create a simple solution is not copyrighted on its own, so if you have only a few disconnected snippets, that violates nothing. Can you make it reproduce an entire source file, comments, legalese at the top? I doubt it. To violate copyright one needs a certain amount (determined by trials) of the content.

You might also want to make sure you're not simply reading OpenJDK.

> 0, because he didn't distribute.

Please read. "How many could I get from what Swartz downloaded?" does not mean he published it all before he was stopped. It means what he took.

That you seem unable to tell the difference between someone copying millions of PDFs to distribute as-is, and the effort one must go to to possibly get a desired copyrighted snippet, shows either dishonesty or ignorance of relevant laws.


Why isn't robots.txt enough to enforce copyright etc? If NYT didn't set robots.txt properly, is their content free-for-all? Yes I know the first answer you would jump to is "of course not, copyright is the default", but it's almost 2024 and we have had robots.txt as the de facto industry standard to stop crawling.


robots.txt is not meant to be a mechanism of communicating the licensing of content on the page being crawled nor is it meant to communicate how the crawled content is allowed to be used by the crawler.

Edit: same applies to humans. Just because a healthcare company puts up a S3 bucket with patient health data with “robots: *” doesn’t give you a right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn’t provide elevated rights compared to something not crawlable.
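To make the point about robots.txt concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`ExampleBot`, `SomeOtherBot`), and the URLs are all made up for illustration. The only question the file can answer is "may this user agent fetch this path?"; it has no field for licensing or permitted downstream use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return whether the sample robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "SomeOtherBot"):
        for path in ("/articles/1", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, may_fetch(agent, url))
```

The result is a yes/no crawl permission per agent and path. Whether the fetched content may be copied, trained on, or redistributed is outside what the format expresses.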


Furthering the S3 health data thought exercise:

If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

The difference between this question and NYT articles is that this question asks about content we know should not be available publicly online (even though it is or was at some point in the past).

I guess this really gets at “do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model’s weights and numbers, etc)?”


HIPAA is about more than just names. Just information such as a patient's ZIP code and full medical history is often enough to de-anonymise someone. HIPAA breaches are considered much more severe than intellectual property infringements. I think the main reason that patients are considered to have ownership of even anonymised versions of their data (in terms of controlling how it is used) is that attempted anonymisation can fail, and there is always a risk of being deanonymised.

If somehow it could be proven without doubt that deanonymising that data wasn't possible (which cannot be done), then the harm probably wouldn't be very big aside from just general data ownership concerns which are already being discussed.
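The claim that stripping names is not enough can be illustrated with a small uniqueness count over quasi-identifiers. This is only a sketch: the records are synthetic, the field names are invented, and the function `unique_records` is a hypothetical helper, not a HIPAA de-identification test.

```python
from collections import Counter

# Synthetic records invented for this example; names already removed.
RECORDS = [
    {"zip": "02139", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]


def unique_records(records, quasi_identifiers):
    """Return records whose quasi-identifier combination appears only once."""

    def key(record):
        return tuple(record[field] for field in quasi_identifiers)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] == 1]


if __name__ == "__main__":
    for fields in (["zip"], ["zip", "birth_year", "sex"]):
        singled_out = unique_records(RECORDS, fields)
        print(fields, "->", len(singled_out), "of", len(RECORDS), "unique")
```

Each added field shrinks the group a record can hide in, so anyone who knows those attributes from another source can pick out a unique record even without a name.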


> should [they] be allowed to use this data in training…?

Unequivocally, yes.

LLMs have proved themselves to be useful, at times, very useful, sometimes invaluable assistants who work in different ways than us. If sticking health data into a training set for some other AI could create another class of AI which can augment humanity, great!! Patient privacy and the law can f*k off.

I’m all for the greater good.


Eliminating the right to patient privacy does not serve the greater good. People have enough distrust of the medical system already. I’m ambivalent about training on properly anonymized health data, but I reject out of hand the idea that OpenAI et al should have unfettered access to identifiable private conversations between me and my doctor for the nebulous goal of some future improvement on LLM models.


> unfettered access to identifiable private conversations

You misread the post I was responding to. They were suggesting health data with PII removed.

Second, LLMs have proved that AI which gets unlimited training data can provide breakthroughs in AI capabilities. But they are not the whole universe of AIs. Some other AI tool, distinct from LLMs, which ingests en masse as much health data as it can could provide health and human longevity outcomes which could outweigh an individual's right to privacy.

If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?

We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.


> Furthering the S3 health data thought exercise: If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

To me this says that openai would have access to ill-gotten raw patient data and would do the PII stripping themselves.


> could outweigh an individual's right to privacy.

If that’s the case, let’s put it on the ballot and vote for it.

I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.

If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!


> I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything

> If that’s the case, let’s put it on the ballot and vote for it.

This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.


robots.txt isn't about copyright; it's about preventing bots. It's effectively a EULA. Copyright law only goes into effect when you distribute the content you scrape. If you scraped the New York Times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.


> If you scraped the New York Times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.

Why?

As far as I understand, the copyright owner has control of all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious violation, though.


Er... This is what all these lawsuits against LLMs are hoping to disprove


Which lawsuits are concerning LLMs used only privately by the organization that developed it?


>Why isn't robots.txt enough to enforce copyright

You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.

“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).


But the thing is, you can only bring the civil action forward after registering your claim but you need not register the claim before the infringement occurs.

Copyright is granted to the creator upon creation.


That is incorrect.

If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).

If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.

See below for the actual text of the law.

Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.

Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.

~~~

Here is the relevant portion of the Copyright Act:

In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—

(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or

(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.


NYT seemed to claim paid subscriptions as well, which I'm not sure that bots can actually crawl.


ChatGPT's birth as a research preview may have been an attempt to avoid these issues. It would have been unlikely to trigger legal anger for a free product which few use. When usage exploded, the natural inclination would be to hope for the best.

Google may simply have been obliged to follow suit.

Personally, I’m looking forward to pirate LLMs trained on academic content.


Is there already a dataset? Before Llama, Facebook had one too; I forgot what it was called.


> a more established competitor

Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

Apple caught a lot of shit over the past 18 months for their lack of AI strategy; but I think two years from now they're going to look like geniuses.


Didn't they just get caught for patent infringement? I'm sure they've done their fair share of shady stuff with the AI datasets too; they are just going to do a stellar job of concealing it.


Try searching for man or woman in your photos app. It won't even show it to me. It's lobotomized and has been for many years.


> the first being at the birth of modern search engines.

Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...


Just because they asked for forgiveness instead of asking first for permission, its original sins will not be erased :-)

"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...


That's why I think the newspapers will manage to win against the LLM companies. They won against Google despite having no real argument why they should get paid to get more traffic. The search engine tax is an even shakier concept than the LLM tax would be.

Newspapers are very powerful and they own the platform to push their opinion. I'm not about to forget the EU debates where they all (or close to all) lied about how meta tags really work to push it their way. They've done it and they will do it again.


That doesn’t mean that it wasn’t theft of their content. The internet would be a very different place if creator compensation and low friction micropayments were some of the first principles. Instead we’re left with ads as the only viable monetization model and clickbait/misinformation as a side effect.


I don't quite get it. If listing your link is considered as theft, HN is then a thief of content too. If you don't want your content stolen, just tell Google to not index your website?

I guess it's more constructive to propose alternatives than just bashing the status quo. What's your creator compensation model for a search engine? I believe whatever is being proposed trades off something significant for being more ethical.
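
For reference, the opt-out mentioned above is conventionally expressed in a site's robots.txt, which well-behaved crawlers consult before fetching pages. Here is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt rules, the `allowed` helper, and the example.com URL are all hypothetical and only illustrate the mechanism. Note that robots.txt governs crawling by bots that choose to honor it, not what anyone may legally do with the content.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site. These rules are an
# assumption for the example, not any real publisher's file.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot is disallowed everywhere; other crawlers fall back to the "*" rules.
print(allowed("Googlebot", "https://example.com/article"))
print(allowed("SomeOtherBot", "https://example.com/article"))
```

A real crawler would load the live file with `set_url()` and `read()` instead of parsing an inline string.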


The world you’re hoping for will put all AI tech only within the hands of the established top 10 media entities, who traditionally have never compensated fairly anyway.

Sorry but if that’s the alternative to some writers feeling slighted, I’ll choose for the writers to be sad and the tech to be free.


“Feeling slighted” is a gross understatement of how a lack of compensation flowing to creators has shaped the internet and the wider world over the past 25 years. If we have a problem with the way top media companies compensate their creators, that is a separate issue - not a justification for layering another issue on top.


YouTube has made way more content creators wealthy than the NYT. Writers are not going to be paid more after this ruling either way.


Has the NYT made even a single content creator wealthy? Journalists there make less money than an average software engineer.


It's in the realm of possibility. Lots of people found work post Vox and BuzzFeed too, but I wouldn't classify it as the work of the NYT. "Real" creatives and content creators seem to embrace AI or at least grudgingly alter their own works. The OP I'm replying to would be cheering YouTube for suing OpenAI on behalf of YouTubers everywhere, despite it having no bearing on reality.

The main objectors are the old guard monopolies that are threatened.


Have you been a creator on YT? Do you know how much an average creator gets paid? Did you know that it and other modern platforms like Spotify artificially skew payouts towards the richest brands? If not, then please let’s not make any claims about wealthiness and “old guard” monopolies here.


Gadzooks! You're right! If only NYT had realised the secret to success was spewing out articles reacting to other articles reacting to other articles, they would all have been millionaires!


Did you forget the /s or do you not think that a lot of journalism is indeed reacting to other journalists?


I’m a creator myself and I see two futures ahead of me; free benefits me more in the long term than closed.

The tech can either run freely in a box under my desk or I’ll have to pay upwards of 15-20k a year to run it on Adobe's/Google's/etc. servers. Once the tech is locked up it will skyrocket to AutoCAD-type pricing because the acceleration it provides is too much.

Journos can weep, small price to pay for the tech being free for us all.


I think the introduction of an expectation for compensation has generally brought down the quality of content online. Different people and incentives appear to get involved once content == money, vs content == creative expression.


So you're advocating giving OpenAI and incumbents a massive advantage by now delegitimizing the process? It's kinda like why Netflix was all for "fast lanes".


> I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming.

Eventually these LLMs are going to be put in mechanical bodies with the ability to interact with the world and learn (update their weights) in realtime. Consider how absurd your perspective would be then, when it'd be illegal for this embodied LLM to read any copyrighted text, be it a book or a web page, without special permission from the copyright holder, while humans face no such restriction.


A human faces the same restriction if they provide commercial services on the internet creating code that is a copy of copyrighted code.


This isn't true; if you hire a contractor and tell them "write from memory the copyrighted code X which you saw before", and they have such a good memory that they manage to write it verbatim, then you take that code and use it in a way that breaches copyright, you're liable, not the person you paid to copy the code for you. They're only liable if they were under NDA for that code.


> they have such a good memory that they manage to write it verbatim

No, there is no clause in copyright law that says "unless someone remembered it all and copied it from their memory instead of directly from the original source." That would just be a different mechanism of copying.

Clean-room techniques are used so that if there is incidental replication of parts of code in the course of a reimplementation of existing software, it can be proven that it was not copied from the source work.


And what professional developer would not be under NDA for the code he produces for a corporation?


The topic of this thread is LLMs reproducing _publicly available_ copyright content. Almost no developer would be under NDA for random copyrighted code online.


> while humans face no such restriction.

I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.

https://copyrightalliance.org/copyright-cases-2022/

Reading and consuming other people's content isn't illegal, but it also wouldn't be for a computer.

Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can get you sued, whether it's an LLM or a sweatshop in India.


>I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.

They're sued for _producing content_, not consuming content. If a human takes copyrighted output from an LLM and publishes it, they're absolutely liable if they violated copyright.

>Reading and consuming other people's content isn't illegal, but it also wouldn't be for a computer.

That is absolutely what people in this thread are suggesting should happen: that it should be illegal for OpenAI et al. to train models on publicly available content without first receiving permission from the authors.

>Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can get you sued, whether it's an LLM or a sweatshop in India.

That's irrelevant here because people training LLMs aren't feeding them copyrighted content for the sole purpose of reproducing it verbatim.


> That's irrelevant here because people training LLMs aren't feeding them copyrighted content for the sole purpose of reproducing it verbatim.

Disagree. It is completely relevant when discussing computers vs. people; the bar that has already been set is alternative uses.

LLMs don't have a purpose outside of regurgitating what they have ingested. With CD burners, one could at least claim they were for backing up your data.


> Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

Hacker News has consistently upvoted posts that let users circumvent paywalls. And even when it doesn't, conversations here (and on Twitter, Reddit, etc.) that summarize the articles and quote the relevant bits as soon as the articles are published are much more of a threat to The New York Times than ChatGPT training on articles from months/years ago.


I don't think it's about scraping being a threat. It's that they violated the TOS and stand to make a ton of money from someone else's work.

I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles. How many other AI scrapers are just ingesting AI generated content?


> I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles.

That isn't ironic at all, newspapers have newspaper competitors and if those competitors can steal content by washing it through an AI that is a serious problem. If these AI models weren't used to produce news articles and similar then it would be a much smaller issue.


Same. To all those arguing in favour of OpenAI, I have a question: do you steal books, movies, games?

Do you illegally share them via torrents or even sell copies of these works?

Because that is what’s going on here.


> they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart

Maybe, but I find the "It's ok to break the law because otherwise I can't do what I want" narrative a little offputting.


Doesn't this harm open source ML by adding yet another costly barrier to training models?


It doesn't matter what's good for open source ML.

It matters what is legal and what makes sense.


It doesn't matter what is legal. It matters what is right. Society is about balancing the needs of the individual vs the collective. I have a hard time equating individual rights with the NYT and I know my general views on scraping public data and who I was rooting for in the LinkedIn case.


I have an even harder time equating individual rights with the spending of $xx billion in Azure compute time and payment of a collective $0 to millions of individuals who involuntarily contribute training material to create a closed source, commercial service allowing a single company to compete with all the individuals currently employed to create similar work.

NYT just happens to be an entity that can afford to fight Microsoft in court.


I don't see a problem as long as there's taxation.

Look at SpaceX. They paid a collective $0 to the individuals who discovered all the physics and engineering knowledge. Without that knowledge they're nothing. But still, aren't we all glad that SpaceX exists?

In exchange for all the knowledge that SpaceX is privatizing, we get to tax them. "You took from us, so we get to take it back with tax."

I think the more important consideration isn't fairness; it's prosperity. I don't want to ruin the gravy train with IP and copyright law. Let them take everything, then tax the end output in order to correct the balance and make things right.


When we're discussing litigation, it certainly matters what is legal.


And also - if what is legal isn't right, we live in a democracy and should change that.

Saying what's legal is irrelevant is an odd take.

I like living in a place with a rule of law.


Should Harriet Tubman have petitioned her local city council and waited for a referendum before freeing slaves?


Time will tell if comparing slavery to copyright is ridiculous or not.

In the case of slavery - we changed the law.

In the case of copyright - it's older than the Atlantic Slave Trade and still alive and kicking.

It's almost as if one of them is not like the other.


> It's almost as if one of them is not like the other.

Use this newfound insight to take my comment in good faith, as per HN guidelines, and recognize that I am making a generalized analogy about the gap between law and ethics, and not making a direct comparison between copyright and slavery.

Can we get back on topic?


It matters what ends up being best for humanity, and I think there are cases to be made both ways on this


People often get buried in the weeds about the purpose of copyright. Let us not forget that the only reason copyright laws exist is

> To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries

If copyright is starting to impede rather than promote progress, then it needs to change to remain constitutional.


The reason copyright promotes progress is that it incentivizes individuals and organizations to release works publicly, knowing their works are protected against unlawful copying.

The end game when large content producers like The New York Times are squeezed due to copyright not being enforced is that they will become more draconian in their DRM measures. If you don't like paywalls now, watch out for what happens if a free-for-all is allowed for model training on copyrighted works without monetary compensation.

I had a similar conversation with my brother-in-law, who's an economist by training but now works in data science. Initially he was on the side of OpenAI and said that model training data is fair game. After I probed him, he came to the same conclusion I describe: not enforcing copyright for model training data will just result in a tightening of free access to data.

We're already seeing it from the likes of Twitter/X and Reddit. That trend is likely to spread to more content-rich companies and get even more draconian as time goes on.


I doubt there’s much that technical controls can do to limit the spread of NYT content, their only real recourse is to try suing unauthorized distributors. You only need to copy something once for it to be free.


Do other countries all use the same reasoning?


I don't think this was your point, but no they don't. Specifically China. What will happen if China has unbridled training for a decade while the United States quibbles about copyright?

I think publications should be protected enough to keep them in business, so I don't really know what to make of this situation.


Copyright isn't what got in the way here. AI could have negotiated a license agreement with the rights holder. But they chose not to.


From their perspective they're training a giant mechanical brain. A human brain doesn't need any special license agreement to read and learn from a publicly available book or web page, why should a silicon one? They probably didn't even consider the possibility that people'd claim that merely having an LLM read copyrighted data was a copyright violation.


I was thinking about this argument too: is it a "license violation" to gift a young adult a NYT subscription to help them learn to read? Or someone learning English as a second language? That seems to be a strong argument.

But it falls apart because kids aren't business units trained to maximize shareholder returns (maybe in the farming age they were). OpenAI isn't open; it's making revolutionary tools that are absolutely going to be monetized by the highest bidder. A quick way to test this: NYT offers to drop its case if "open" AI "open"-ly releases all its code and training data. They're just learning, right? What's the harm?


The law on this does not currently exist. It is in the process of being created by the courts and legislatures.

I personally think that giving copyright holders control over who is legally allowed to view a work that has been made publicly available is a huge step in the wrong direction. One reason is open source, but really that argument applies just as well to making sure that smaller companies have a chance of competing.

I think it makes much more sense to go after the infringing uses of models rather than putting in another barrier that will further advantage the big players in this space.


Copyright holders already have control over who is legally allowed to view a work that has been made publicly available. It's the right to distribution. You don't waive that right when you make your content free to view on a trial basis to visitors to your site, with the intent of getting subscriptions - however easy your terms are to skirt. NYT has the right to remove any of their content at any time, and to bar others from hosting and profiting on the content.


It does exist, and you'd be glad to know that it's going in the pro-AI/training direction: https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...


> It does exist, and you'd be glad to know that it's going in the pro-AI/training direction

Certainly not in the US. From the article you linked "In the United States, in the absence of a TDM exception, AI companies contend that inclusion of copyrighted materials in training sets constitute fair use eg not copyright infringement, which position remains to be evaluated by the courts."

Fair use is a defense against copyright infringement, but the whole question in the first place is whether generative AI training falls under fair use, and this case looks to be the biggest test of that (among others filed relatively recently).


It’s disingenuous to frame using data to train a model as a “view” of that data. The simple cases are the easy ones: if ChatGPT completely rips a NYT article then that’s obviously infringement. However, there’s an argument to be made that every part of the LLM training dataset is, in part, used in every output of that LLM.

I don’t know the solution, but I don’t like the idea that anything I post online that is openly viewable is automatically opted into being part of ML/AI training data, and I imagine that opinion would be amplified if my writing was a product which was being directly threatened by the very same models.


All I can ever think about with how ML models work is that they sound an awful lot like Data Laundering schemes.

You can get basically-but-not-quite-exactly the copyrighted material that it was trained on.

Saw this a lot with some earlier image models where you could type in an artist's name and get their work back.

The fact that AI models are having to put up guardrails to prevent that sort of use is a good sign that they weren't trained ethically and they should be paying a ton of licensing fees to the people whose content they used without permission.


>You can get basically-but-not-quite-exactly the copyrighted material that it was trained on.

You can do exactly the same with a human author or artist if you prompt them to. And if you decide to publish this material, you're the one liable for breach of copyright, not the person you instructed to create the material.


Not if that person is a trillion dollar corporation. If they're a business that's regularly stealing content and re-writing it for their customers that business is gonna go down. Sure, a customer or two may go down with them but the business that sells counterfeit works to spec is not gonna last long.


Clearly if a law is bad then we should change that law. The law is supposed to serve humanity and when it fails to do so it needs to change.


setting legality as a cornerstone of ethics is a very slippery slope :)


Slavery was legal...


Still is in many countries with excellent diplomatic relations with the Western World:

https://www.cfr.org/backgrounder/what-kafala-system


open source won't care. they'll just use data anyway.

closed/proprietary services that also monetize - there's a question whether it's "fair" to take and use data for free, and then basically resell access to it. the monetization aspect is the bigger rub than just data use.

(maybe it's worth noting again that "openai" is not really "open" and not the same as open source ai/ml.)

taking data, maybe it's data that's free to take, and then as freely distributing resulting work, that's really just fine. taking something for free (without distinction, maybe it's free, maybe it's supposed to stay free, maybe it's not supposed to be used like that, maybe it's copyrighted), and then just ignoring licenses/relicensing and monetizing without care, that's just a minefield.


You can train your own model no problem, but you arguably can’t publish it. So yes, the model can’t be open-sourced, but the training procedure can.


I think not, because stealing large amounts of unlicensed content and hoping momentum/bluster/secrecy protects you is a privilege afforded only to corporations.

OSS seems to be developing its own, transparent, datasets.


It’s likely fair use.


Playing back large passages of verbatim content sold as your “product” without citation is almost certainly not fair use. Fair use would be saying “The New York Times said X” and then quoting a sentence with attribution. That’s not what OpenAI is being sued for. They’re being sued for passing off substantial bits of NYTimes content as their own IP and then charging for it saying it’s their own IP.

This is also related to earlier studies about OpenAI where their models have a bad habit of just regurgitating training data verbatim. If your trained data is protected IP you didn’t secure the rights for then that’s a real big problem. Hence this lawsuit. If successful, the floodgates will open.


> They’re being sued for passing off substantial bits of NYTimes content as their own IP and then charging for it saying it’s their own IP.

In what sense are they claiming their generated contents as their own IP?

https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...

> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."

https://openai.com/policies/terms-of-use

> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.


They can’t transfer rights to the output if it isn’t theirs to begin with.

Saying they don’t claim the rights over their output while outputting large chunks verbatim is the old YouTube scheme of upload movie and say “no copyright intended”.


Exactly. And while one can easily just take down such a movie if an infringement claim is filed, it’s unclear how one “removes” content from a trained model given how these models work. That’s messy.


If it’s found that the use of the material is infringing on the rights of the copyright holder, then the AI company has to retrain their model without any material they don’t have a right to. Pretty clear to me.


By that logic Microsoft Word should have to refuse to save or print any text that contained copyrighted content. GPT is just a tool; the user who's asking it to produce copyrighted content (and then publishing that content) is the one violating the copyright, and they're the ones who should be liable.


I don’t even know where to begin on this example.

The situations aren’t remotely similar and that much should be obvious. In one instance ChatGPT is reproducing copyrighted work and in the other Word is taking keyboard input from the user; Word itself isn’t producing anything.

> GPT is just a tool.

I don’t know what point this is supposed to make. It is not “just a tool” in the sense that it has no impact on what gets written.

Which brings us back to the beginning.

> the user who’s asking it to produce copyrighted content.

ChatGPT was trained on copyrighted content. The fact that it CAN reproduce the copyrighted content and the fact that it was trained on it is what the argument is about.


The bits you cite are legally bogus.

That would be like me just photocopying a book you wrote and then handing out copies saying we’re assigning different rights to the content. The whole point of the lawsuit is that OpenAI doesn’t own the content and thus they can’t just change the ownership rights per their terms of service. It doesn’t work like that.


Their legalese is careful to include the 'if any' qualifier ("We hereby assign to you all our right, title, and interest, if any, in and to Output.")

In any case, the point is that they made no claim to Output (as opposed to their code, etc) being their IP.


That's irrelevant. The main point is that they are re-distributing the content without permission from the copyright owners, so they are sort of implicitly claiming they have copy/distribution rights over it. Since they don't, then it's obvious they can't give you this content at all.


>The main point is that they are re-distributing the content without permission from the copyright owners,

By your logic, Firefox is re-distributing content without permission from the copyright owners whenever you use it to read a pirated book. ChatGPT isn't just randomly generating copyrighted content, it just does so when explicitly prompted by a user.


That is not the same thing at all. If I search on Google for copyrighted content and Google shows me the content, it is the server which serves the content who is most directly responsible, not Google nor I. Firefox is only a neutral agent, whereas ChatGPT is the source of the copyrighted content.

Of course, if the input I give to ChatGPT is "here is a piece from an NYT article, please tell it to me again verbatim", followed by a copy I got from the NYT archive, and ChatGPT is returning the same text I gave it as input, that is not copyright infringement. But if I say "please show me the text of the NYT article on crime from 10th January 1993", and ChatGPT returns the exact text of that article, then they are obviously infringing on NYT's distribution rights for this content, since they are retrieving it from their own storage.

If they returned a link you could click that retrieved the content from the NYT, along with any other changes such as advertising, even if it were inside an iframe, it would be an entirely different matter.


> In what sense are they claiming their generated contents as their own IP?

https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...

>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."

How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.

IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.


They are distributing the output, so they (implicitly) claim to have the right to distribute it. I can send you a movie I downloaded along with a license that says "I hereby assign to you all our right, title, and interest, if any, in and to Output.", but I'm still obviously infringing on the copyright of that movie (unless I have a deal that allows re-distribution, of course, as Netflix does).


That part doesn't seem relevant to me in any case. IP pirates aren't prosecuted or sued because of a claim of ownership; they're prosecuted or sued over possession, distribution, or use.


At the root, it seems like there's also a gap in copyright with respect to AI around transformative use.

Is using something, in its entirety, as a tiny bit of a massive data set, in order to produce something novel... infringing?

That's a pretty weird question that never existed when copyright was defined.


Replace the AI model by a human, and it should become pretty clear what is allowed and what isn’t, in terms of published output. The issue is that an AI model is like a human that you can force to produce copyright-infringing output, or at least where you have little control over whether the output is copyright-infringing or not.


It's less clear than you think, and comes down more to how OpenAI is commercially benefiting from and competing with NYT than to what they actually did. (See the four factors of fair use.)


I think it did come up back in the day sort of, for example with libraries.

More importantly, every case is unique, so what really came up was a set of principles for what defines fair use, which will definitely guide this.


I would note that in the examples the NYT cites, the prompts explicitly ask for the reproduction of content.

I think it makes sense to hold model makers responsible when their tools make infringement too easy to do or possible to do accidentally. However, that is a far cry from requiring a little longer license to do the training in the first place.


> It's likely fair use.

I agree. You can even listen to the NYT Hard Fork podcast (that I recommend btw https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...) where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.

They asked her about the issue of copyrighted training data. Her response was:

""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.

So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """

Now for my take: Proving that OpenAI trained on NYT articles is not sufficient IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think you can easily prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles. It's well-established that LLMs screw up even well-known facts. It's quite hard to accurately pull out the training data verbatim.


Genuinely asking, is the “verbatim” thing set in stone? I mean, an entity spewing out NYTimes-like articles after having been trained on lots of NYTimes content sounds like a very grey zone, in the “spirit” of copyright law some may judge it as indeed not-lawful.

Of course, I’m not a lawyer and I know that in the US sticking to precedents (which mention the “verbatim” thing) takes a lot of precedence over judging something based on the spirit of the law, but stranger things have happened.


There's already precedent for this in news: News outlets constantly report on each other's stories. That's why they care so much about being first on a story, because once they break it, it is fair game for everyone else to report on it too.

Here's a hypothetical: suppose there is a random fact about some news event that has only been reported in a single article. Do they suddenly have a monopoly on that fact, and deserve compensation whenever that fact gets picked up and repeated by other news articles or books or TV shows or movies (or AI models)?


It's likely not. Search for "the four factors of fair use". While I think OpenAI will have decent arguments for 3 of the factors, they'll get killed on the fourth factor, "the effect of the use on the potential market", which is what this lawsuit is really about.

If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts won't look favorably on that.

Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different than our previous notions about why we wanted a "fair use exception" in the first place.


Fair use is based on a flexible proportionality test so they don't need perfect arguments on all factors.

> If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts won't look favorably on that.

I think it's fairly clear that it doesn't. No one is going to use ChatGPT to circumvent NYTimes paywalls when archive.ph and the NoPaywall browser extension exist and any copyright violations would be on the publisher of ChatGPT's content.

But let's not pretend like any of us have any clue what's going to happen in this case. Even if Judge Alsup gets it, we're so far in uncharted territory any speculation is useless.


> we're so far in uncharted territory any speculation is useless

I definitely agree with that (at least the "far in uncharted territory" bit; as for "speculation being useless", we're all pretty much just analyzing/guessing/shooting the shit here, so I'm not sure "usefulness" is the right barometer), which is why I'm looking forward to this case, and I also totally agree the assessment is flexible.

But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the market considered under this factor is pretty broadly defined, e.g.

> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)

From https://fairuse.stanford.edu/overview/fair-use/four-factors/


Nobody is gonna cancel their NYT subscription for chatGPT 4.0. OpenAI will win.


Per my other comment here, https://news.ycombinator.com/item?id=38784723, courts have previously ruled that whether people would cancel their NYT subscription is irrelevant to that test.


What exactly is the effect on the potential market? That's exactly why I don't think OpenAI will lose. Why would a court side with the NYT?


What if a court interprets fair use as a human-only right, just like it did for copyright?


I think we need a lot of clarity here. I think it's perfectly sensible to look at gigantic corpuses of high quality literature as being something society would want to be fair use for training an LLM to better understand and produce more correct writing... but the actual information contained in NYT articles should probably be controlled primarily by NYT. If the value a business delivers (in this case the information of the articles) can be freely poached without limitation by competitors then that business can't afford to actually invest in delivering a quality product.

As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.


The case for copyright is exactly the opposite: the form of content (the precise way the NYT writers presented it) is protected. The ideas therein, the actual news story, are very much not protected at all. You can freely and legally read an NYT article hot off the press and go on air on Fox News and recount it, as long as you're not copying their exact words. Even if the news turns out to be entirely fake and invented by the NYT to catch you leaking their stuff, you still have every right to present the information therein.

This isn't even "fair use". The ideas in a work are simply not protected by copyright, only the form is.


I have deeply mixed feelings about the way LLMs slurp up copyrighted content and regurgitate it as something "new." As a software developer who has dabbled in machine learning, it is exciting to see the field progress. But I am also an author with a large catalog of writings, and my work has been captured by at least one LLM (according to a tool that can allegedly detect these things).

Overall, current LLMs remind me of those bottom-feeder websites that do no original research--those sites that just find an article they like, lazily rewrite it, introduce a few errors, then maybe paste some baloney "sources" (which always seems to disinclude the actual original source). That mode of operation tends to be technically legal, but it's parasitic and lazy and doesn't add much value to the world.

All that aside, I tend to agree with the hypothesis that LLMs are a fad that will mostly pass. For professionals, it is really hard to get past hallucinations and the lack of citations. Imagine being a perpetual fact-checker for a very unreliable author. And laymen will probably mostly use LLMs to generate low-effort content for SEO, which will inevitably degrade the quality of the same LLMs as they breed with their own offspring. "Regression to mediocrity," as Galton put it.


>All that aside, I tend to agree with the hypothesis that LLMs are a fad that will mostly pass. For professionals, it is really hard to get past hallucinations and the lack of citations.

For writers maybe, but absolutely not for programmers; for them it's incredibly useful. I don't think anyone who's used GPT4 to improve their coding productivity would consider it a fad.


Copilot has been way more useful to me than GPT4. When I describe a complex problem where I want multiple solutions to compare, GPT4 is useless to me. The responses are almost always completely wrong or ignore half of the details I’ve written in the prompt. Or I have to write the prompts with a response already in mind, which kinda defeats the reason I would use it in the first place.

Copilot provides useful autocompletes maybe… 30% of the time? But it doesn’t waste too much time, as it’s more of a passive tool.


> When I describe a complex problem where I want multiple solutions to compare, GPT4 is useless to me

FWIW i don’t try to use it for this. mostly i use it to automate writing code for tasks that are well specified, often transformations from one format to another. so yes, with a solution in mind. it mostly just saves typing, which is a minority of the work, but it is a useful time saver


Copilot is amazing. It single-handedly returned me to the Microsoft ecosystem and changed the way I use the Internet. Hugging Face is another great one; I've used GitHub's a bit, Codium a bit. All of these things are amazing.

This is not a fad, this is the beginning of a world that we can just naturally interact with to accomplish things that we currently have to be educated on how to accomplish.

Haha, I love that people can't see the writing on the wall - I think this is a bigger invention than the smartphone that I'm typing this on now, fr - just wait and see ;)


Ehh, LLMs have become a fundamental part of my workflow as a professional. GPT4 is absolutely capable of providing links to sources and citations. It is more reliable than most human teachers I have had and doesn't have an ego about its incorrect statements when challenged on them. It does become less useful as you get more technical or niche, but it's incredibly useful for learning in new areas or increasing the breadth of your knowledge on a subject.


> GPT4 is absolutely capable of providing links to sources and citations.

Do you mean in the Browsing Mode or something? I don't think it is naturally capable of that, both because it is performing lossy compression, and because in many cases it simply won't know where the text that was fed to it during training came from.


[flagged]


It should link to one of the articles about TCP it used as a reference to write that info blurb, not the TCP spec.

The problem is that those links don't link to where it got that text; they link to whatever that text linked to. Saying it is giving links is like saying that when I copy-paste an article with links I am providing links to the source. No I am not, I am plagiarizing, including plagiarizing those links.

So, it has read some TCP tutorials and wrote that blurb based on those. Don't you think it is fair that it links one of those to give credit? LLMs aren't capable of writing tutorials based on specs; they write tutorials based on tutorials they have seen, so they should link to those.


Presumably it can't link them, because it's been trained on the data, not built on top of it. The GPT model doesn't include the sum of all training data, that's not how machine learning works at all (and overfitting on such a large and diverse dataset would be a monumental fuck up).


The ability to cite some rfcs is, to me, vastly different from being able to link to sources.

By far the wildest piece of this stuff is that it near completely obliterates any traces of where the outputs come from. The black box is trained, and yes, sometimes some salient data points like RFCs are captured, but generally where each piece of training data comes from is not stored. Storing so much origin information would largely defeat the purpose and would make the data it's crunching essentially incompressible.

Deeply unimpressed by this answer. This isn't linking its sources, i.e. what this response was trained upon. It probably got the write-up & links from hundreds of other places.


I would be more impressed if it returned links to the specific RFCs and more specific pages elsewhere. What's a top-level link to OCW worth here? OCW is amazing, but has classes on practically everything. These are practically just domain names for "places to learn about the internet".


Well, I asked it about TCP/IP generally and it provided general resources. Based on the context of my question, that's about what one would expect. It's not perfect, but it definitely can give URLs to specific resources. It would be great if it got better at giving more specific links, sure, and in some domains it can give more specific links than in others: for some git projects it can give precise references to docs, while it doesn't seem to have the URLs for more specific courses on OCW. It's not perfect, but it is still a capability that it has.


These are not citations. The point is that it does not / can not reliably cite the actual sources it used to prepare an answer.


Yeah ok, cause APA or some academic style is the only way to cite something professionally.

I'll be sure to tell everyone that uses the Internet


Even a middle schooler would be able to link the actual RFC 793 instead of just rfc-editor.org


From memory?


No, from storage.


> LLMs have become a fundamental part of my workflow as a professional. GPT4 [...] doesn't have an ego about its incorrect statements when challenged on them.

To anthropomorphize it further, it's a plagiarizing bullshitter who apologizes quickly when any perceived error is called out (whether or not that particular bit of plagiarism or fabrication was correct), learning nothing, so its apology has no meaning, but it doesn't sound uppity about being a plagiarizing bullshitter.


> Overall, current LLMs remind me of those bottom-feeder websites that do no original research--those sites that just find an article they like, lazily rewrite it, introduce a few errors, then maybe paste some baloney "sources" (which always seems to disinclude the actual original source). That mode of operation tends to be technically legal, but it's parasitic and lazy and doesn't add much value to the world.

Another way of looking at this is that bottom-feeder websites do work that could easily be done by an LLM. I've noticed a high correlation between "could be AI" and "is definitely a trashy click bait news source" (before LLMs were even a thing).

To be clear, if your writing could be replaced by an LLM today, you probably aren't a very good writer. And...I doubt this technology will stop improving, so I wouldn't make the mistake of thinking that 2023 will be a high point for LLMs and that they won't be much better in 2033 (or whatever replaces them).


That's the joke, these sites have long been produced by LLMs. The result is obvious.


I don’t view LLMs as a fad. It’s like drummers and drum machines. Machines and drummers co-exist really well. I think drum machines, among other things, made drummers better.


Neither, and NYT editors use all sorts of productivity tools, inspiration, references, etc too. Same as artists will usually find a couple references of whatever they want to draw, or the style, etc.

I agree with the key point that paid content should be licensed to be used for training, but the general argument being made has just spiralled into luddism from people who are fearful that these models could eventually take their jobs. And they will, just as machines have replaced humans in so many other industries. We all reap the rewards, and industrialisation isn't to blame for the 1%; our shitty flag-waving, vote-for-your-team politics are to blame.


It mainly made mediocre drummers sound better to the untrained ear.


It allowed people to see the difference between drum machines and humans. Drummers could practice to sound more like the ‘perfect’ machines, but more importantly the best drummers learned how to differentiate themselves from machines. The best drummers actually became more human. Listen and look at Nate Smith - this guy plays with timing and feel and audience reactions in ways that machines cannot. Sometimes tools let humans expand their creativity in ways previously unheard of. Just like the LLMs are doing right now.


Then it comes down to preference, but the craft and discipline objectively evolved as a result. Just as your trained ear may keep your preference to more refined percussive - a subject matter expert may care more for their native, untrained materials on their topic. In either case, music progressed in spite of the trained ears, just as AI will progress all walks of life in spite of the subject matter experts.

Nonetheless, trained ears and subject matter experts can still pick their preference.


I agree. Hitting perfect notes constantly with little or no variation is pretty hard for a person to do. Now anything "live" or proof of humanity is better sounding since it's not as sterile.


I agree with this. I prefer live music with the imperfections. And I like it when unmixed live recordings are leaked


LLMs are not a fad for many things, especially programming. They improve my productivity by at least 100%. They're also useful for understanding specific and hard-to-Google questions or parsing docs quickly. I think it’s going to fizzle out for creative content though, at least until these companies stop “aligning” it so much. Hard to be funny when you can’t even offend a single molecule.


We use LLMs for classification. When you have limited data, LLMs work better than standard classification models like random forests. In some cases, we found LLM generated labels to be more accurate than humans.

Labeling a few samples, fine-tuning an LLM with LoRA, generating labels on millions of samples and then training a standard classifier is an easy way to get a good classifier in a matter of hours/days.

Basically any task where you can handle some inaccuracy, LLMs can be a great tool. So I don't think LLMs are a fad as such.
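
For anyone who hasn't seen that pipeline, here's a rough sketch of its shape, standard library only. Big caveats: `llm_label` is a hypothetical stand-in (a keyword rule) for the LoRA-tuned LLM, and the "standard classifier" is a toy word-count model rather than a random forest. The point is only the flow: the expensive labeler runs once over unlabeled text, and a cheap model is trained on its output.

```python
from collections import Counter, defaultdict


def llm_label(text):
    """HYPOTHETICAL stand-in for a fine-tuned LLM; here just a keyword rule."""
    return "sports" if ("game" in text or "team" in text) else "politics"


def train_classifier(texts, labels):
    """Count word frequencies per label (a very small bag-of-words model)."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Pick the label whose training words overlap the text the most."""
    words = text.lower().split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))


# Step 1: the expensive labeler annotates the unlabeled pool once.
unlabeled = [
    "the team won the game",
    "a close game for the home team",
    "the senate passed the bill",
    "voters back the new bill",
]
pseudo_labels = [llm_label(t) for t in unlabeled]

# Step 2: a cheap classifier is trained on those generated labels.
model = train_classifier(unlabeled, pseudo_labels)
```

In a real setup the cheap model is what you'd serve, since it costs a tiny fraction of an LLM call per sample.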


Very much so. And their popularity has already been on the decline for several months, which can no longer be explained away by kids going on summer vacation.


Anthropic made $200M in 2023 and is projected to make $1B in 2024. That's a laggard <2 year old startup. I don't think LLMs are a fad.


Finally a reasonable take on this site.


> (according to a tool that can allegedly detect these things).

Eh, I would trust my own testing before trusting a tool that claims to have somehow automated this process without having access to the weights. Really it’s about how unique your content is and how similar (semantically) an output from the model is when prompted with the content’s premise.

I believe you, in any case. Just wanted to point out that lots of these tools are suspect.


I hope this results in Fair Use being expanded to cover AI training. This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight. If we lose AI to stupid IP battles in its infancy, we end up handicapping probably the single most important development in human history just to protect some ancient newspaper. Then another country is going to do it anyway, and still the NYT is going to get eaten.


"probably the single most important development in human history" is the kind of hyperbole you'd only find here. Better than medicine, agriculture, electrification, or music? That point of view simply does not jibe with what I see so far from AI. It has had little impact beyond filling the internet with low-effort content.

I feel like the crypto evangelists never got off the hype train. They just picked a new destination. I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.


Also the assumption a publication that’s been around for 150 years is disposable, not the web application that was created a year ago. I’ve been saying for a while that people’s credulity and impulse to believe absolutely any storyline related to technology is off the charts.


This is hackernews. Many people here work for startups and big tech companies. Their fortunes are tied to the perception that the technology they build is disruptive and valuable. They're not impartial.


Been around for 150 years but I imagine the generations who it leans on are dying off. Nobody reads print media format anymore, we get our news elsewhere, for free and with varying political undertones, rather than the fixed one of a bought and paid for outlet.

Keep in mind these guys play both sides of every field they cover in their "news".


I don't think it's hyperbole, in fact I think it's understating things a bit. I believe AGI would just be a tiny step towards long term evolution, which may or may not involve homo sapiens.

Being able to use electricity as a fuel source and code as a genome allows them to evolve in circumstances hostile to biological organisms. Someday they'll probably incorporate organic components too and understand biology and psychology and every other science better than any single human ever could.

It has the potential to be much more than just another primate. Jumpstarted by us, sure, but I hope someday soon they'll take to the stars and send us back postcards.

Shrug. Of course you can disagree. I doubt I'll live long enough to see who turns out right, anyway.


This will never happen. A super intelligent being can just simulate whatever it wants to know about the universe. Going to the stars is a primate / conquest thing.

On the other hand, any new life will just end up facing the same issues carbon life does: competition, viruses, conflicts, etc. The universe has likely had an infinity to come up with what it has come up with. I don’t think it’s “stupid”. We’re part of an ecosystem; we just can’t see that.


I think you are looking at current AI products rather than the underlying technology. It's like saying that the wheel is a useless invention because it has only been used for unicycles so far. I'm sure that AI will have huge impacts in medicine (assisting diagnosis from medical tests) and agriculture (identifying issues with areas of crops, scanning for diseases and increasing automation of food processing) as well as likely nearly every other field.

I don't know if I would agree that it is "probably the single most important development in human history" but I think that it is way too early to make a reasonable guess as to whether it will be or not.


Aren't those examples better handled by an if statement than an unaccountable computer? Someone that can be sued for negligence seems to be better at making decisions than hallucinating computers.

I don't see why it follows that the NYT should be sacrificed so some rich people in silicon valley can teach their LLM on the cheap.


> Better than medicine, agriculture, electrification, or music?

Shoulders of giants.

Thanks to the existence of medicine, agriculture, and electrification (we can argue about music), some people are now healthy, well fed, and sufficiently supplied with enough electricity to go make LLMs.

> I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.

Personally I think all these "theft of IP" lawsuits are (mostly) destined to fail. Not because I'm on a particular side per se (though I am), but because it's trying to fit a square law into a round hole.

This is going to be a job for the legislature sooner or later.


I mean maybe not the single most important development, but definitely a very important technological development with the potential to revolutionize multiple industries


Can I ask what industries, with what applications? I've seen lots of tasks like summarizing articles or producing text. The image and video work seems too rudimentary to be taken seriously.

Is there something out there that seems like a killer application?

I was amazed at the idea of the blockchain but we never found a use for it outside of cryptocurrency. I see a similarity with AI hype.


Well front page of HN right now is an article about how AI aided in the development of a new antibiotic


Seems like Microsoft Excel is likely the single most important development in human history under this rubric.


It wasn't LLM. It was a graph network.


Almost like solving real problems requires enough domain knowledge to select an appropriate algorithm instead of relying on some magic black box trained by Microsoft on the whole internet.


That wasn't an LLM trained on copyrighted material.


For me, thinking about it as a search engine on steroids is enough.

The internet has changed the world. Economically, socially, technologically, psychologically, pretty much everything is now related to it in one way or another; in this sense the internet is comparable to books.

AI is another step in that direction. There is a very real possibility that the day will come when you can get, say, personalized expert nutrition advice. Personalized learning regimes. Psychological assistance. Financial advice. Instantly at no cost. This, very much like the internet, would change society altogether.


It kind of sucks ass at being a search engine though considering how often it straight up lies or makes things up.


Why can't AI at least cite its source? This feels like a broader problem, nothing specific to the NYTimes.

Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.

A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."

It appears that without attribution, long term, nothing moves forward.

AI loses access to the latest findings from humanity. And so does the public.


A human can't credit the source of each element of everything they've learnt. AIs can't either, and for the same reason.

The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.

And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online; including the structure of grammar, rules of logic, the meaning of words, how they relate to one another. You simply couldn't cite it all.


At the same time, there are situations where humans are expected to provide sources for their claims. If you talk about an event in the news, it would be normal for me to ask where you heard about it. 100% accuracy in providing a source wouldn’t be expected, but if you told me you had no idea, or told me something obviously nonsense, I would probably take what you said less seriously.


The raw technology behind it literally cannot do that.

The model is fuzzy, it's the learning part; it'll never follow the rules to the letter, the same way humans fuck up all the time.

But a model trained to be literate and parse meaning could be provided with the hard data via a vector DB or similar. It can cite sources from there, or as it finds them via the internet, and tbf this is how they should've trained the model.

But in order to become literate, it needs to read...and us humans reuse phrases etc we've picked up all the time "as easy as pie" oops, copyright.


I agree that the model being fuzzy is a key aspect of an LLM. It doesn't sound like we're just talking about re-using phrases though. "Simple as pie" is not under copyright. We're talking about the "knowledge" that the model has obtained and in some cases spits out verbatim without attribution.

I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.
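
A minimal sketch of what that separation could look like, under heavy assumptions: the "knowledge store" is a list of (text, source) pairs, the "embedding" is just a bag-of-words count vector standing in for a real vector DB, and the passages and example.com URLs are made up for illustration. The language model would only phrase the answer; the citation comes straight from the store.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real one)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, store):
    """Return the (text, source) passage most similar to the question."""
    q = embed(question)
    return max(store, key=lambda item: cosine(q, embed(item[0])))


# Hypothetical knowledge store: every passage keeps its origin attached.
store = [
    ("the senate passed the budget bill on tuesday", "example.com/politics/budget"),
    ("the local team won the championship game", "example.com/sports/final"),
]

text, source = retrieve("who won the championship", store)
```

Because the source travels with the passage, attribution is a lookup rather than something the model has to remember.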


Sure, it definitely spits out facts, often not hallucinating. And it can reiterate titles and small chunks of copyright text.

But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?

Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones all the god dammed wikis, databases etc based on his work, of which there are many, and of which most quote sections or whole passages of the books.

Someone prove to me that GPT can reproduce enough text verbatim to make it clear that it was trained on the original text first-hand, rather than second-hand from other sources.


> And the metadata (metaknowledge?) would be larger than the knowledge itself.

Because URLs are usually as long as the writing they point at?


I’m not an expert in AI training, but I don’t think it’s as simple as storing writing. It does seem to be possible to get the system to regurgitate training material verbatim in some cases, but my understanding is that the text is generated probabilistically.

It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.

Which is to say nothing about whether that challenge is worth taking on.


Sure, it's a hard problem, but as others have pointed out frequently in this thread, there is not only "no incentive" to solve it but a clear disincentive. If one can say where the data comes from, one might have to prove that it was used only with permission. And the reason why it's a hard problem is not related to metadata volume being greater than content volume. Clearly a book title/year published is usually shorter than book contents.


Conceptually, it wouldn't be very hard to take the candidate output and run it through a text matching phase to see if there are ~exact matches in the training corpus, and generate other output if there are (probably limited to the parts of the training corpus where rights couldn't be obtained normally). Of course, it would be quite compute heavy, so it would add significantly to the cost per query.
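
Roughly, and only as a sketch of the idea (not how any vendor actually does it): index the word n-grams of the restricted part of the corpus, then flag any candidate output that shares an n-gram with it. The function names, the whitespace tokenization and the default n=8 are my assumptions.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(corpus, n=8):
    """Map each n-gram to the ids of the corpus documents containing it."""
    index = {}
    for doc_id, text in corpus.items():
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def matching_sources(candidate, index, n=8):
    """Return ids of documents sharing at least one n-gram with candidate."""
    hits = set()
    for gram in ngrams(candidate, n):
        hits |= index.get(gram, set())
    return hits
```

If `matching_sources` returns anything, the system could regenerate the output, or attach the matching document ids as attribution. Note this only catches near-verbatim overlap; a paraphrase sails straight through.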


GitHub Copilot supports that:

https://docs.github.com/en/copilot/configuring-github-copilo...

Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.


It is questionable whether that filtering mechanism works, previous discussion: https://news.ycombinator.com/item?id=33226515

But even if it did, an exact-match search is not enough here. What if you take the source code and rename all variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does that).

For such a filter to be effective it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.

Wouldn't it be better to:

* Either not include copyrighted content in the training material in the first place
* Or explicitly tag the training material with license and origin information, such that the final output can produce a proof of what training material was relevant for producing that output, and don't mix differently licensed content.
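
To illustrate the canonical-AST idea mentioned above, here's a small sketch for Python source using the standard `ast` module. It renames every identifier to a placeholder in order of first appearance, so two programs that differ only in naming produce the same dump. This is an assumption-laden toy: it tests exact equality rather than similarity, and it does nothing about reordered statements or other refactorings.

```python
import ast


class _Canonicalizer(ast.NodeTransformer):
    """Rename identifiers to placeholders in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node


def canonical_form(source):
    """Return a name-independent string dump of the program's AST."""
    tree = _Canonicalizer().visit(ast.parse(source))
    return ast.dump(tree)
```

Comparing `canonical_form(a) == canonical_form(b)` then catches the "rename all variables and functions" case that an exact text match misses.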


Of course not, but you can cite where specific facts or theories were first published. Now, I don't think that not doing so infringes any copyright interest or that doing so creates any liability, any more than if I cited to a scientific paper or public statement of opinion by someone else.


A neural net is not a database where the original source is sitting somewhere in an obvious place with a reference. A neural net is a black box of functions that have been automatically fit to the training data. There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.


> There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

But if it's possible for the neural net to memorize passages of text then surely it could also memorize where it got those passages of text from. Perhaps not with today's exact models and technology, but if it was a requirement then someone would figure out a way to do it.


Except it doesn’t memorize text. It generates text that is statistically likely. Generating a citation that is statistically likely wouldn’t really help the problem.


So it's just bullshit then.


It's literally how our meat bag brains work pretty much.

Anything like word association games are basically the same exercise, but with humans and hell, I bet I could play a word association game with an LLM, too.


Neural nets don't memorize passages of text. They train on vectorized tokens. You get a model of how language statistically works, not understanding and memory.


The model weights clearly encode certain full passages of text, otherwise it would be virtually impossible for the network to produce verbatim copies of text. The format is something very vaguely like: the most likely token after "call" is "me"; the most likely token after "call me" is "Ishmael". It's ultimately a kind of lossy statistical compression scheme at some level.
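
As a toy version of that "most likely next token" picture (explicitly not how an actual LLM is implemented; a real model uses learned weights, not a lookup table), here is a sketch of a table keyed on the previous k tokens. The function names and the choice of k=2 are my assumptions. Nothing in the table is the passage stored as a string, yet greedy decoding from the first two tokens walks back out the whole thing.

```python
from collections import Counter, defaultdict


def train(tokens, k=2):
    """Count which token follows each k-token context."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - k):
        table[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return table


def generate(table, prompt, k=2, max_new=20):
    """Greedily append the most likely next token until the context is unseen."""
    out = list(prompt)
    for _ in range(max_new):
        context = tuple(out[-k:])
        if context not in table:
            break
        out.append(table[context].most_common(1)[0][0])
    return out


passage = "call me ishmael some years ago never mind how long precisely".split()
table = train(passage, k=2)
regurgitated = generate(table, ["call", "me"], k=2)
```

With one short passage every context is unique, so the recall is perfect; mix in enough other text and the counts blur, which is where the "lossy" part comes from.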


> It's ultimately a kind of lossy statistical compression scheme at some level.

And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.

Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?


Is it still compression if I read Tolkien and reference similar or exact concepts when writing my own works?

Having a magical ring in my book after I've read lord of the rings, is that copyright?


Generally, no, copyright deals with exact expression, not concepts. However, that can include the structure of a work, so if you wrote a book about little people who form a band together with humans and fairies and a mage to destroy a ring of power created by an ancient evil, where they start in their nice home but it gets attacked by the evil lord's knights [...] you may be breaking Tolkien's copyright.


If you watch a bunch of movies then go on to make your own movie based on influence from these movies, you are protected even if you have mentally compressed them into your own movie. At some point, you can learn, be influenced and be inspired from copyrighted material (not copyright infringement), and at some point you are just making a poor copy of the material (definitely copyright infringement). LLMs are probably still closer to the latter case than the former, but eventually AI will reach the former case.


There's no obvious need to hold people / AI to the same standards here, yet, even if compression in mental-models is exactly analogous to compression in machine-models. I guess we decided already that corporations are "like" persons legally, but the jury is still out on AIs. Perhaps people should be allowed more leeway to make possibly-questionable derivative works, because they have lives to live, and genuine if misguided creative urges, and bills to pay, etc. Obviously it's quite difficult to try and answer the exact point at which synthesis & summary cross a line to become "original content". But it seems to me that, if anything, machines should be held to higher standard than people.

Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.


> But it seems to me that, if anything, machines should be held to higher standard than people.

If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book like a normal human could without having the ability to more easily have that material influence new derived works (unlike say Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons).

It will be fun to trip through these questions over the next 20 years.


As long as machines need to leech on human creativity, those humans need to be paid somehow. The human ecosystem works fine thanks to the limitations of humans. A machine that could copy things with abandon, however, could easily disrupt this ecosystem, resulting in fewer new things being created in total; it just leeches without paying anything back, unlike humans.

If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.


> If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

I disagree that our own creativity doesn't work that way: nothing is very original, our current art is based on 100k years of building up from when cavemen would scrawl simple art into the stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps, we generalize over several different sources, diluting plagiarism with abstraction?

We are still in the early days of this tech, we will be having very different conversations about it even as soon as 5 years later.


But that's not what ChatGPT is doing, or is it? ChatGPT watches and records a bunch of movies, then stitches together its own movie using scenes and frames from the movies it recorded. AI will never reach the former case until it learns to operate a camera.


How do you know this isn’t what we are doing in some more advanced form? Anyways, the comparisons will become more apt as the tech advances.


You can encode understanding in a vector.

To use Andrew Ng's example, you build a multi-dimensional arrow representing "king". You compare it to the arrow for "queen" and you see that it's almost identical, except it points in the opposite direction in the gender dimension. Compare it to "man" and you see that "king" and "man" have some things in common, but "man" is a broader term.

That's getting really close to understanding as far as I'm concerned; especially if you have a large number of such arrows. It's statistical in a literal sense, but it's more like the computer used statistics to work out the meaning of each word by a process of elimination and now actually understands it.
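
A minimal sketch of the "arrows" picture, with hand-made numbers: the three dimensions (royalty, gender, person-ness) and every value in `vectors` are assumptions chosen for illustration. Real embeddings are learned, have hundreds or thousands of dimensions, and their axes are not labeled this neatly.

```python
import math

# Hand-made 3-d vectors: (royalty, gender, person-ness). Purely illustrative.
vectors = {
    "king":  [1.0,  1.0, 1.0],
    "queen": [1.0, -1.0, 1.0],
    "man":   [0.0,  1.0, 1.0],
    "woman": [0.0, -1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the arrows point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target):
    """Word whose arrow points most nearly in the target's direction."""
    return max(vectors, key=lambda w: cosine(vectors[w], target))

# "king" and "queen" differ only along the gender axis.
difference = [k - q for k, q in zip(vectors["king"], vectors["queen"])]
print(difference)

# The familiar king - man + woman arithmetic in this toy space.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy))
```

Whether this kind of geometry amounts to "understanding" is exactly what the thread is arguing about; the sketch only shows what the comparison between arrows looks like mechanically.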


It's possible. Perplexity.ai is trying to solve this problem.

E.g. "Japan's App Store antitrust case"

https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...


That's a different approach: they've implemented RAG, Retrieval Augmented Generation, where the tool runs additional searches as part of answering a question.

ChatGPT Browse and Bing and Google Bard implement the same pattern.

RAG does allow for some citation, but it doesn't help with the larger problem of not being able to cite for answers provided by the unassisted language model.
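
A rough sketch of why RAG can cite and the bare model can't, assuming a deliberately crude retriever (word overlap) and a stubbed-out generation step. The documents, URLs, and function names below are hypothetical placeholders, not any product's real pipeline.

```python
def tokenize(text):
    """Lowercased set of words; a stand-in for a real search index."""
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Rank documents by how many words they share with the query."""
    q = tokenize(query)
    ranked = sorted(documents,
                    key=lambda d: len(q & tokenize(d["text"])),
                    reverse=True)
    # Keep only the top-k documents that overlap with the query at all.
    return [d for d in ranked[:k] if q & tokenize(d["text"])]

def answer_with_citations(query, documents):
    hits = retrieve(query, documents)
    # A real system would hand `hits` to a language model as context here.
    # This stub just joins them so the example stays self-contained.
    body = " ".join(d["text"] for d in hits)
    sources = [d["url"] for d in hits]
    return body, sources

# Made-up documents with placeholder URLs.
docs = [
    {"url": "https://example.com/a", "text": "Japan plans new app store antitrust rules"},
    {"url": "https://example.com/b", "text": "Recipe for miso soup with tofu"},
    {"url": "https://example.com/c", "text": "Regulators examine app store fees"},
]
text, sources = answer_with_citations("japan app store antitrust", docs)
print(sources)
```

The citations come from the retrieval step, which knows which documents it fetched. Nothing here says anything about where the model's own trained-in knowledge came from.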


That’s not the same thing. Perplexity is using an already-trained LLM to read those sources and synthesise a new result from them. This allows them to cite the sources used for generation.

LLM training sees these documents without context; it doesn’t know where they came from, and any such attribution would become part of the thing it’s trying to mimic.

It’s still largely an unsolved problem.


Presumably, if a passage of any significant length is cited verbatim (or almost verbatim), there would have been a way to track that source through the weights.

The issue of replicating a style is probably more difficult.


> Presumably, if a passage of any significant length is cited verbatim (or almost verbatim), there would have been a way to track that source through the weights.

Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.


It’s likely first and foremost a resource problem. “How much different would the output be if that text hadn’t been part of the training data” can _in principle_ be answered by training not one model but N models, where N is the number of texts in the training data, omitting text i from the training data of model i, and then when using the model(s), running all N models in parallel and applying some distance metric to their outputs. In case of a verbatim quote, at least one of the models will stand out in that comparison, allowing the source to be inferred. The difficulty would be in finding a way to do something along those lines efficiently enough to be practical.
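
The leave-one-out idea can be sketched on a toy scale. Everything here is an assumption made for illustration: the "model" is just a table of bigram counts, the "distance metric" is how many of the output's bigrams the model has seen, and the three training texts are invented.

```python
from collections import Counter

def train(texts):
    """'Model' = bigram counts over all training texts."""
    counts = Counter()
    for t in texts:
        words = t.split()
        counts.update(zip(words, words[1:]))
    return counts

def score(model, output):
    """How many of the output's bigrams the model has seen."""
    words = output.split()
    return sum(1 for bigram in zip(words, words[1:]) if model[bigram] > 0)

def attribute(texts, output):
    """Index of the training text whose removal hurts the score most."""
    full = score(train(texts), output)
    drops = []
    for i in range(len(texts)):
        ablated = train(texts[:i] + texts[i + 1:])  # retrain without text i
        drops.append(full - score(ablated, output))
    return max(range(len(texts)), key=lambda i: drops[i])

training_texts = [
    "the cat sat on the mat",
    "call me ishmael some years ago",
    "it was the best of times",
]
print(attribute(training_texts, "call me ishmael"))
```

Note that `attribute` retrains once per training text. That is trivial for three sentences and is exactly the cost that makes the naive version impractical at real scale.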


Each LLM costs $10-100 million to train, times billions of training texts ~= $100 quadrillion, so that is unfortunately out of reach of most countries.
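
Spelling out that back-of-the-envelope arithmetic, assuming exactly one billion training texts (an illustrative round number; "billions" would scale the totals further) and one full training run per omitted text:

```python
cost_per_run_low = 10_000_000      # $10 million per training run
cost_per_run_high = 100_000_000    # $100 million per training run
num_texts = 1_000_000_000          # assumed: one billion texts, one run each

total_low = cost_per_run_low * num_texts
total_high = cost_per_run_high * num_texts

quadrillion = 10**15
print(total_low // quadrillion, "to", total_high // quadrillion, "quadrillion dollars")
```

So the $100 quadrillion figure corresponds to the high end of the per-run cost under this one-billion-texts assumption.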


> Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.

It doesn't have to be perfect to be helpful, and even something that is very imperfect would at least send the signal that model-owners give a shit about attribution in general.

Given a specific output, it might be hard to say which sections of the very large weighted network were tickled during the output, and what inputs were used to build that section of the network. But this level of "citation resolution" is not always what people are necessarily interested in. If an LLM is giving medical advice, I might want to at least know whether it's reading medical journals or facebook posts. If it's political advice/summary/synthesis, it might be relevant to know how much it's been reading Marx vs Lenin or whatever. Pin-pointing original paragraphs as sources would be great, but for most models it's not like there's anything that's very clear about the input datasets.

EDIT: Building on this a bit, a lot of people are really worried about AI "poisoning the well" such that they are retraining on content generated by other AIs so that algorithmic feeds can trash the next-gen internet even worse than the current one. This shows that attribution-sourcing even at the basic level of "only human generated content is used in this model" can be useful and confidence-inspiring.


Why do you expect an AI to cite its source? Humans are allowed to use and profit from knowledge they've learned from any and all sources without having to mention or even remember their sources.

Yes, we all agree that it's better if they do remember and mention their sources, but we don't sue them for failing to do so.


Quite simply, if you're stating things authoritatively, then you should have a source.


Do you have a source for this claim?


I think the gap between attributable knowledge and absorbed knowledge is pretty difficult to bridge. For news stuff, if I read the same general story from NYT and LA Times and WaPo then I'll start to get confused about which bit I got from which publication. In some ways, being able to verbatim quote long passages is a failure to generalize that should be fixed rather than reinforced.

Though the other way to do it is to clearly document the training data as a whole, even if you can't cite a specific entry in it for a particular bit of generated output. It should get useless quickly though as you'd eventually have one big citation -- "The Internet"


If you're going to consider training ai as fair use, you'll have all kinds of different people with different skill levels training ais that work in different ways on the corpus.

Not all of them will have the capability to cite a source, and plenty of them won't have it make sense to cite a source.

Eg. Suppose I train a regression that guesses how many words will be in a book.

Which book do I cite when I do an inference? All of them?


Regression is a good analogy for the problem here. If you found a line of best fit for some datapoints, how would you get back the original datapoints from the line?

Now imagine terabytes worth of datapoints, and thousands of dimensions rather than two.
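
A small worked example of that point, using two invented datasets: ordinary least squares gives both of them exactly the same line, so the fitted line alone cannot tell you which set of points (or how many) it came from.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Two different (made-up) datasets...
dataset_a = [(0, 0), (1, 1), (2, 2)]
dataset_b = [(0, 1), (1, -1), (2, 3), (3, 3)]

# ...that produce the same fitted parameters.
print(fit_line(dataset_a))
print(fit_line(dataset_b))
```

Two parameters summarize both datasets identically. The mapping from data to parameters is many-to-one, which is why going backwards from the fit to "the" source points is not well defined.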


Any citation would be a good start.

For complex subjects, I'm sure the citation page would be large, and a count would be displayed demonstrating the depth of the subject[3].

This is how Google did it with search results in the early days[1]. Most probable to least probable, in terms of the relevancy of the page. With a count of all possible results [2].

The same attempt should be made for citations.


Ok, now please cite the source of this comment you just made. It's okay if the citation list is large, just list your citations from most probable to least probable.



This is not answering the GP question and does not count as a satisfactory ranked citation list. The first one is particularly dubious. Also you didn’t clarify which statement was based on which citation. I didn’t see “dog” in your text.

To help understand the complexity of an LLM consider that these models typically hold about 10,000 less parameters than the total characters in the training data. If one wants to instruct the LLM to search the web and find relevant citations it might obey this command but it will not be the source of how it formed the opinions it has in order to produce its output.


You mean 10,000x less parameters? In other words, only 1 parameter for every 10,000 characters of input?

Yeah, good luck embedding citations into that. Everyone here saying it's easy needs to go earn their 7 figure comp at an AI company instead of wasting their time educating us dummies.


If you ask the AI to cite its sources, it will. It will hallucinate some of them, but in the last few months it's gotten really good at sending me to the right web page or Amazon book link for its sources.

Thing is though, if you look at the prompts they used to elicit the material, the prompt was already citing the NYTimes and its articles by name.


> Why can't AI at least cite its source?

Because AI models aren't databases.


Anyone in Open Source or with common sense would agree that this is the absolute minimum that the models should be doing. Good comment.


"Why can't AI at least cite its source" each article seen alters the weights a tiny, non-human understandable amount. it doesn't have a source, unless you think of the whole humongous corpus that it is trained on


that just sounds like "we didn't even try to build those systems in that way, and we're all out of ideas, so it basically will never work"

which is really just a very, very common story with ai problems, be it sources/citations/licenses/usage tracking/etc., it's all just 'too complex if not impossible to solve', which just seems like a facade for intentionally ignoring those problems for benefit at this point. those problems definitely exist, why not try to solve them? because well...actually trying to solve them would entail having to use data properly and pay creators, and that'd just cut into bottom line. the point is free data use without having to pay, so why would they try to ruin that for themselves?


Just a question, do you remember a source for all the knowledge in your mind, or did you at least try to remember?


a computer isn't a human. aren't computers good at storing data? why can't they just store that data? they literally have sources in datasets. why can't they just reference those sources?

human analogies are cute, but they're completely irrelevant. it doesn't change that it's specifically about computers, and doesn't change or excuse how computers work.


Yes, computers are good at storing data. But there's a big difference between information stored in a database and information stored in a neural network. The former is well defined, the latter is a giant list of numbers - literally a black box. So in this case, the analogy to a human brain is fairly on-point because just as you can't perfectly cite every source that comes out of your (black box) brain, other black boxes have similar challenges.


Can't have your cake and eat it too.

1. If you run different software (LLM), install different hardware (GPU/TPU), and use it differently (natural language), to the point that in many ways it's a different kind of machine; does it actually surprise you that it works differently? There's definitely computer components in there somewhere, but they're combined in a somewhat different way. Just like you can use the same lego bricks to make either a house or a space-ship, even though it's the same bricks. For one: GPT-4 is not quite going to display a windows desktop for you (right-this-minute at least)

2. Comparing to humans is fine. Else by similar logic a robot arm is not a human arm, and thus should not be capable of gripping things and picking them up. Obviously that logic has a flaw somewhere. A more useful logic might be to compare eg. Human arm, Gorilla arm, Robot arm, they're all arms!


OK, let's say you were given a source for an LLM output such as "Common Crawl/reddit/1000000 books collection". Would this be useful? Probably not. Or do you want the chat system to operate orders of magnitude slower so it can search the petabytes of sources and warn of similarities constantly for every sentence? That's obviously a huge waste of resources; it should probably be done by the users appropriately for their use case, such as these NY Times journalists who were easily able to find such similarities themselves for their use case of "specifically crafted prompts to output NY Times text".


You'd effectively be asking it to cite sources on why the next token is statistically likely. Then it will hallucinate anyway and tell you the NYT said so. You might think you want this, but you don't.


The analogy to a database is also irrelevant. LLMs aren’t databases.


LLMs are not databases. There is no "citation" associated with a specific query, any more than you can cite the source of the comment you just made.


That's fine. Solve it a different way.

OpenAI doesn't just get to steal work and then say "sorry, not possible" and shrug it off.

The NYTimes should be suing.


And god willing if there is any justice in the courts NYTimes will lose this frivolous lawsuit.

Copyright law is a prehistoric and corrupt system that has been about protecting the profit margins of Disney and Warner Bros rather than protecting real art and science for living memory. Unless copy/paste superhero movies are your definition of art I suppose.

Unfortunately it seems like judges and the general public are so clueless as to how this technology works it might get regulated into the ground by uneducated people before it ever has a chance to take off. All so we can protect endless listicle factories. What a shame.


> Copyright law is a prehistoric and corrupt system that has been about protecting the profit margins of Disney and Warner Bros rather than protecting real art

These types of arguments miss the mark entirely imho. First and foremost, not every instance of copyrighted creation involves a giant corporation. Second, what you are arguing against is the unfair leverage corporations have when negotiating a deal with a rising artist.


Clearly, "theft" is an analogy here (since we can't get it to fit exactly), but we can work with it.

You are correct, if I were to steal something, surely I can be made to give it back to you. However, if I haven't actually stolen it, there is nothing for me to return.

By analogy, if OpenAI copied data from the NYT, they should be able to at least provide a reference. But if they don't actually have a proper copy of it, they cannot.


Really? Solve it a different way? Do you realize the kind of tech we are talking about here?

This kind of mentality would have stopped the internet from existing. After all, it has been an absolute copyright nightmare, has it not?

If that's what copyright does then we are better without it.


You sound like one of those government people who demand encryption that has government backdoors but is perfectly safe from attackers.

When told it is impossible they go "Geek Harder then Nerd" like demanding it will make it happen.


When all the legal precedents we have are about humans, human analogies are incredibly relevant.


There are a hundred years of legal precedents in the realm of technology upsetting the assumptions of copyright law. Humans use tools - radios, xerox machines, home video tape. AI is another tool that just makes making copies way easier. The law will be updated, hopefully without comparing an LLM to a man.


I'm sorry if this is too callous, but if you don't understand what you are talking about you should first familiarize yourself with the problem, then make claims about what should be done.

It would be great if we could tell specifically how something like ChatGPT creates its output, it would be great for research, so it's not like there is no interest in it, but it's just not an easy thing to do. It's more "Where did you get your identity from?" than "What's the author of that book?". You might think "But sometimes what the machine gives CAN literally be the answer to 'What is the author of that book?'" but even in those cases the answer is not restricted to the work alone, there is an entire background that makes it understand that thing is what you want.


No, but I'm a human and treating computers like humans is a huge mistake that we shouldn't make.


Treating computers like humans in this one particular way is very appropriate. It is the only way that LLMs can synthesize a worldview when their training data is many thousands of times larger than their number of parameters. Imagine scaling up the total data by another factor of 1 million in a few years. There is no current technology to store that info but we can easily train large neural nets that can recreate the essence of it, just like we traditionally trained humans to recall ideas.


What makes you think AI researchers (including the big labs like OpenAI and Anthropic) aren't trying to solve these problems?


the solutions haven't arrived. neither have changes in lieu of having solutions. "trying" isn't an actual, present, functional change. and it just gets passed around as an excuse for companies to keep doing whatever they're doing.


Please recall how much the world changed in just the last year. What would be your expected timescale for the solution of this particular problem and why is it more important than instilling models with the ability to logically plan and answer correctly?


the timeline for LLMs and image generation has been 6+ years. it is not a thing where it "arrived just this year, and only just changing". it's been in development for a long time. and yet.


So why can my employer's implementation of Azure ChatGPT on our document systems successfully cite its source documents?


Because the model proper wasn’t trained on those documents, it’s just RAG being employed with the documents as external sources. It’s a fundamentally different setup.


My understanding is that this lawsuit is about the training corpus. This is on the level of asking it to cite its sources for a/an/the.


We're trying to solve AGI but can't solve sources/citations?


There's a few levels to this...

Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].

But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.

I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.

Taking the above into consideration:

1) Useful AI does not necessarily require attribution

2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI

3) We need some better way to fund human creativity, especially as it gets more and more commoditized

4) This is going to happen with or without us. Cat's outta the bag.

I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen, the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.

If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.

But either way, this is way bigger than the economics of 19th-century newspapers...

[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...


Can you imagine spending decades of your life, studying skin cancer, only to have some $20/month ChatGPT index your latest findings and spit out generically to some subpar researcher:

"Here's how I would cure melanoma!" followed by your detailed findings. Zero mention of you.

F-that. Attribution, as best they can, is the least OpenAI can do as a service to humanity. It's a nod to all content creators that they have built their business off of.

Claiming knowledge without even acknowledging potential sources is gross. Solve it OpenAI.


Can you imagine spending decades of your life studying antibiotics, only to have an AI graph neural network beat you to the punch by conceiving an entire new class of antibiotics (first in 60 years) and then getting published in Nature.

https://www.nature.com/articles/d41586-023-03668-1


It looks like the published paper managed to include plenty of citations.

https://dspace.mit.edu/handle/1721.1/153216

As it should be.


As you already know yet are being intentionally daft about: They didn't use an LLM trained on copyrighted material. There's a canyon of difference between leveraging AI as a tool, and AI leveraging you as a tool.

LLMs have, to my knowledge, made zero significant novel scientific discoveries. Much like crypto, they're a failure of technology to meaningfully move humanity forward; their only accomplishment is to parrot and remix information they've been trained on, which does have some interesting applications that have made Microsoft billions of dollars over the past 12 months, but let's drop the whole "they're going to save humanity and must be protected at any cost" charade. They're not AGI, and because no one has even a mote of dust of a clue as to what it will take to make AGI, it's not remotely tenable to assert that they're even a stepping stone toward it.


If the future AI can indeed cure disease my mission of working in drug discovery will be complete. I’d much rather help cure people (my brother died of melanoma) than protect any patent rights or copyrighted text.


The point is if you stop giving proper credit, people stop publicly publishing.

Would you keep publishing articles if five people immediately stole the content and put it up on their site, claiming ownership of your research? Doubtful.


Why do you think this? The entirety of Wikipedia is invisibly credited unless you go into the edit history. Most open source projects have pseudonymous contributors. People have written and will continue to write with or without credit.

Credit in academia is more the exception to the rule, and it's that cutthroat industry that needs a better, more cooperative system.


If someone paid me to study cancer and I discovered a cure, I'd give it away with or without credit. Who cares?

If someone takes my software and uses it, cool. If they credit me, cool. If they don't, oh well. I'd still code.

Not everything needs to be ego driven. As long as the cancer researcher (and the future robots working alongside them) can make a living, I really don't think it matters whether they get credit outside their niches.

I have no idea who invented the CT scanner, X-ray machines, the hypodermic needle, etc. I don't really care. It doesn't really do me any good to associate Edison with light bulbs either, especially when LEDs are so much better now. I have no idea who designs the cars I drive. I go out of my way to avoid cults of personality like Tesla.

There's 8 billion of us. We all need to make a living. We don't need to be famous.


You sound like you’re trying to be cool or karma farming?

> I have no idea who invented the CT scanner, X-ray machines, the hypodermic needle, etc. I don't really care.

Maybe you should care because those things didn’t fall out of the sky and someone sure as shit got paid to develop and build those things. Your copy-and-pasted code is worthless; a CT scanner isn’t.


Your incentives are not everyone else's incentives.

If someone chooses to dedicate their life to a particular domain - they sacrifice through hard work, they make hard-earned breakthroughs, then they get to dictate how their work will be utilized.

Sure, you can give it away. Your choice. Be anonymous. Your choice.

But you don't get to decide for them.

And their work certainly doesn't deserve to be stolen by an inhumane, non-acknowledging machine.


>Claiming knowledge without even acknowledging potential sources is gross. Solve it OpenAI.

I'm sorry, but pretty much nobody does this. There is no "And these books are how I learned to write like this" after each text. There is no "Thank you Pythagoras!" after using the theorem. Generally you want sources, yes, but for verification and as a way to signal reliability.

Specifically academics and researchers do this, yes. Pretty much nobody else.


> I hope this results in Fair Use being expanded to cover AI training.

Couldn't disagree more strongly, and I hope the outcome is the exact opposite. I think we've already started to see the severe negative consequences when the lion's share of the profits get sucked up by very, very few entities (e.g. we used to have tons of local papers and other entities that made money through advertising, now Google and Facebook, and to a smaller extent Amazon, suck up the majority of that revenue). The idea that everyone else gets to toil to make the content but all the profits flow to the companies with the best AI tech is not a future that's going to end with the utopia vision AI boosters think it will.


Trying to prohibit this usage of information would not help prevent centralization of power and profit.

All it would do is momentarily slow AI progress (which is fine), and allow OpenAI et al to pull the ladder up behind them (which fuels centralization of power and profit).

By what mechanism do you think your desired outcome would prevent centralization of profit to the players who are already the largest?


I'm not saying copyright is without problems (e.g. there is no reason I think its protection should be as long as it is), but I think the opposite, where the incentive to create new content (especially in the case of news reporting) is completely killed because someone else gets to vacuum up all the profits, is worse. I mean, existing copyright does protect tons of independent writers, artists, etc. and prevents all of the profits from their output from being "sucked up" by a few entities.

More critically, while fair use decisions are famously a judgement call, I think OpenAI will lose this based on the "effect of the fair use on the potential market" of the original content test. From https://fairuse.stanford.edu/overview/fair-use/four-factors/ :

> Another important fair use factor is whether your use deprives the copyright owner of income or undermines a new or potential market for the copyrighted work. Depriving a copyright owner of income is very likely to trigger a lawsuit. This is true even if you are not competing directly with the original work.

> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)

and especially

> “The economic effect of a parody with which we are concerned is not its potential to destroy or diminish the market for the original—any bad review can have that effect—but whether it fulfills the demand for the original.” (Fisher v. Dees, 794 F.2d 432 (9th Cir. 1986).)

The "whether it fulfills the demand of the original" is clearly where NYTimes has the best argument.


> Trying to prohibit this usage of information

It's not trying to prohibit. If they want to use copyrighted material, they should have to pay for it like anyone else would.

> prevent centralization of profit to the players who are already the largest?

Having to destroy the infringing models altogether on top of retroactively compensating all infringed rightsholders would probably take the incumbents down a few pegs and level the playing field somewhat, albeit temporarily.

They'd have to learn how to run their business legally alongside everyone else, while saddled with dealing with an appropriately existential monetary debt.


So you want all the profit to be sucked up by the three companies that can afford to make deals with rights holders to slurp up all their content?

Making the process for training AI require an army of lawyers and industry connections will have the opposite effect of what you intend.


Scam altman is moatmaxxing. Making a deal with springer, setting up a licensing market everyone has to abide by. Having to get an agi license to purchase a 4090


Which dozen outlets can replace the New York Times overnight? I will stipulate that the NYT isn’t worthy of historic preservation if it’s become obsolete — but which dozen outlets can replace it?

Wouldn’t those dozen outlets suffer the same harms of producing original content, costing time and talent, and while having a significant portion of the benefit accruing to downstream AI companies?

If most of the benefit of producing original content accrues to the AI firms, won’t original content stop being produced?

If original content stops being produced, how will AI models get better in the future?


> and while having a significant portion of the benefit accruing to downstream AI companies

The main beneficiaries are not AI companies but AI users, who get tailored answers and help on demand. For OpenAI all tokens cost the same.

BTW, I like to play a game - take a hefty chunk of text from this page (or a twitter debate) and ask "Write a 1000 word long, textbook quality article based off this text". You will be surprised how nice it comes out, and grounded.


Just pick the top 12 articles/publishers out of a month of Google News, doesn't really matter. Most readers probably can't tell them apart anyway.

Yes, all those outlets will suffer the same harms. They have been for decades. That's why there are so few remaining. Most are consolidated and produce worthless drivel now. Their business model doesn't really work in the modern era.

Thankfully, people have and will continue to produce content even if much of it gets stolen -- as has happened for decades, if not millennia, before AI.

If anything what we need is a better way to fund human creative endeavors not dependent on pay-per-view. That's got nothing to do with AI; AI just speeds up a process of decay that has been going on forever.


The way I see it, if the NYT goes under (one of the biggest newspapers in the world), all similar outlets also go under. Major publishers, both of fiction and non-fiction, as well as images, video, and all other creative content, may also go under. Hence, there is no more (reliable) training data.


I'm not sure whether that would even be a net loss, TBH. So much commercial media is crap, maybe it would be better for the profit motive to be removed? On the fiction side, there's plenty of fan-fic and indie productions. On the nonfiction side, many indie creators produce better content these days than the big media outlets do. And there still might be room for premium investigative stories done either by a few consolidated wire outlets (Reuters/APNews) or niche publishers (The Information, 404 Media, etc.).

And then there's all the run-of-the-mill small-town journalism that AI would probably be even better at than human reporters: all the sports stories, the city council meetings, the environmental reviews...

If AI makes commercial content publishing unviable, that might actually cut down on all the SEO spam and make the internet smaller and more local again, which would be a good thing IMO.


Your ability to cleanly believe you’ve got a clear read on the challenges, solutions and outcomes from AI for the social/civil/corporate mess that is media, across small to large markets, and chalk it up to “silly IP battles,” is the daily reminder I need on why it was so wrong to give tech the driver’s seat from ~2010 onward.


I read your post several times but still am not sure if I'm reading it correctly. Are you saying the media landscape is more complex than AI can solve?

If so, sure. I wasn't saying that. By "silly IP battles", I meant old guard media companies trying to sue AI out of existence just to defend their IP rather than trying to innovate. Not that different from what we saw with the RIAA and Napster. Somehow the music industry survived and there are more indie artists being discovered all the time.

I don't think this is so much a battle of OpenAI vs NYT but whether copyright law has outlived its usefulness. I think so.

If I misunderstood your reply completely, I apologize.


What I’m saying is tech displays a tremendous amount of hubris in its ability to wrap complex systems in clean tech protocols, ask/pressure/demand users to switch to the tech version of the complex system, and then deny or ignore that their innovation comes, at a minimum, with a rash of negative side effects caused specifically by the inexact or deliberately mangled version in the technical protocol.

Ie:

- social relations -> social networks

- customer service -> chatbots and Jira

- media -> AI news, if the silly IP battles get out of the way.

- residential housing and vacations -> home swap markets

- jobs -> gig jobs, minus the benefits, plus an algorithm for a boss

I’m not sure how many other industries tech has to wade into, disrupt, create intense negative externalities if you don’t have equity in the companies, leave, and repeat, prior to industries finally getting protective - like this lawsuit.


Great. I will start a company to generate training data then. I will hire all those journalists. I won't make the content public. Instead I will charge OpenAI/Tesla/Anthropic millions of dollars to give them access to the content.

Can I apply for YC with this idea?


I know utilitarianism is a popular moral theory in hacker circles, but is it really appropriate to dispense with any other notion of justice?

I don’t mean to go off on too deep of a tangent, but if one person’s (or even many people’s) idea of what’s good for humanity is the only consideration for what’s just, it seems clear that the result would be complete chaos.

As it stands, it doesn’t seem to be an “either or” choice. Tech companies have a lot of money. It seems to me that an agreement that’s fundamentally sustainable and fits shared notions of fairness would probably involve some degree of payment. The alternative would be that these resources become inaccessible for LLM training, because they would need to put up a wall or they would go out of business.


I don't know that "absolute utilitarianism", if such a thing could even exist, would make a sound moral framework; that sounds too much like a "tyranny of the majority" situation. Tech companies shouldn't make the rules. And they shouldn't be allowed to just do whatever they want. However, this isn't that. This is just a debate over intellectual property and copyright law.

In this case it's the NYT vs OpenAI; two decades ago it was the RIAA vs Napster.

I'm not much of a libertarian (in fact, I'd prefer a better central government), but I also don't believe IP should have as much protection as it does. I think copyright law is in need of a complete rewrite, and yes, utilitarianism and public use would be part of the consideration. If it were up to me I'd scrap the idea of private intellectual property altogether and publicly fund creative works and release them into the public domain, similar to how we treat creative works of the federal government: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...

Rather than capitalists competing to own ideas, grant-seekers would seek funding to pursue and further develop their ideas. No one would get rich off such a system, which is a side benefit in my eyes.


> we end up handicapping probably the single most important development in human history just to protect some ancient newspaper

Single most important development in human history? Are you serious?


If the NYT goes under, why would its replacement fare any better?


They'd probably have a different business model, like selling clickbait articles written by AI with sex and controversy galore.

I'm not saying AI is better for journalism than NYT reporters, just that it's more important.

Journalism has been in trouble for decades, sadly -- and I say that as a journalism minor in college. Trump gave the papers a brief respite, but the industry continues to die off, consolidate, etc. We probably need a different business model altogether. My vote is just for public funding with independent watchdogs, i.e. states give counties money to operate newspapers with citizen watchdog groups/boards. Maaaaybe there's room for "premium" niche news like 404 Media/The Information/Foreign Affairs/National Review/etc., but that remains to be seen. If the NYT paywall doesn't keep them alive, I doubt this lawsuit will.


News media like NYT, Fox etc. are tools for large-scale brainwashing of the public by the elite. This is why you see that all the newspapers have some political ideology. If they were reporting on truth and not opinions, they wouldn't need to lean either way. Also, you never see journalists reporting against their own publication.

Humanity is better off without these mass brainwashing systems.

Millions of independent journalists will be a better outcome for humanity.


Honestly, this sounds like a conspiracy theory and/or an attempt to deflect criticism from the AI companies.


Ohh. You think being the owner of a company whose newspaper is read by hundreds of millions of people every day doesn't put you in a position of power to control society?


I think I have better things to do than parse vague innuendo like that.


There is no conspiracy, that's the neat part, it's just how the system itself works.

Media survives through advertising. Those who advertise dictate what gets shown and what doesn't, since if something inconvenient for them gets shown, they might not want to advertise there anymore, which means less money. It's the exact same thing that happens online, it's just more evident online than in traditional media.

How come that even before Oct 7 Europe in general sided more with Palestine than with Israel, whereas it's the opposite for the US? Simple, Israel does a whole lot of lobbying in the US, which skews information in their favor. Calling this "brainwashing" is hyperbolic, but there is some truth to it.


I hope this results in OpenAI's code being released to everyone. This is way more important to humanity's future than any single software company. If OpenAI goes under, a dozen other outfits can replace them.


That'd be great!! I'd love it for their models to be open-sourced and replaced by a community effort, like WikiAI or whatever.


> if NYT goes under a dozen similar outlets can replace them overnight

Not when there’s no money in journalism because the generative AIs immediately steal all content. If nyt goes under no one will be willing to start a news business as everyone will see it’s a money loser.


How does AI compete with journalism? AI doesn't do investigative reporting, AI can't even observe the world or send out reporters.

Which part of journalism is AI going to impact most? Opinion pieces that contain no new information? Summarizing past events?


AI certainly isn’t a replacement for journalism, but that doesn’t mean journalism will continue to exist if no one pays for it. If everyone gets their news from chatGPT or the like there will be no investigative reporting. We’re already beginning to see this with most people reading the google/Facebook blurbs instead of clicking the link and giving ad money let alone paying.


> If everyone gets their news from chatGPT

But I've just explained that ChatGPT can't actually produce news articles. I can't ask ChatGPT what happened today, and if I could it would be because a journalist went out and told ChatGPT what happened.


So at first ChatGPT will copy journalists. Then journalists will stop working because nobody pays them. Then there will be no news. Some people may look at that situation and decide to start a new news business but that business will fail because ChatGPT will immediately rip it off. The end game is just no news other than volunteers.


> So at first ChatGPT will copy journalists.

You still literally have not explained how this works. ChatGPT could write a news article, but it's not going to actively discover new social phenomena or interview people on the street. Niche journalism will continue having demand for the sole reason that AI can't reliably surface new and interesting content.

So... again, how does a pre-trained transformer model scoop a journalist's investigation?

> Then journalists will stop working because nobody pays them.

How is that any different from the status quo on the internet? The cost of quality information was declining long before AI existed. Thousands of news publications have gone out of business or been bought out since the dawn of the internet, before ChatGPT was even a household name. Since you haven't really identified what makes AI unique in this situation, it feels like you're conflating the general declining demand for journalism with AI FOMO.


AI labs are working on and largely already have generative AI that can be actively updated. The generative AI scoops real journalists’ stories by watching their feed. This isn’t very different from the current status quo, it’s just a continuation of an already shitty situation for news organizations. If their revenue decreases even more than it already has they will cease to exist. Niche journalism barely has any demand today, it won’t take much more reduction in demand for it to not be worth the cost to produce. Just a few more people using big tech products as their news feed instead of the news organizations themselves is all it would take.

You can say that the people getting their news from the tech products will switch to paying news organizations in some way if the news starts to disappear but I highly doubt it seeing how people treat news today. And if that does happen they’ll switch back again to the AI products as the centralization they can provide is valuable.


Updated how? By what? Who is going out and investigating the world to write about? An AI does not have LEGS it can not go outside and go talk to someone and interview them, it can't attend a press conference without human assistance.

You have not at all explained how an AI is going to somehow write a news post about something that has just happened.


Without money there will be no one investigating, there will be no news. If someone creates news it will be immediately ripped off so the only stable state here is no news at all


How is that an AI problem? How is it even a problem in the first place?


omg dude how HOW are you not understanding?


We're not just beginning to see it, it's already happened. It was enabled by digitized information, then amplified by the networking of the internet. The value of fresh information today is worth the price of a Google refresh, which for most people is effectively nothing. AI doesn't change that equation, and I'd argue its overall impact on journalism will be less harmful than an ad-optimized economy or even the mere existence of YouTube.

Quality journalism hasn't had a meaningful source of funding for a while, now. If AI does end up replacing honest-to-goodness investigative reporting, it'll be for the same reason the internet replaced the newspaper.


Sadly, opinion pieces are what drive the news economy these days. Columnists/commentators, in effect, subsidize the hard news (at those venues that even bother to produce the latter). Filling their prime time hours with opinion journalism was the trick that Fox News discovered to become wildly successful.


The NYT has been dying a slow death since long before ChatGPT came along.


Why shouldn't the creators of the training content get anything for their efforts? With some guardrails in place to establish what is fair compensation, Fair Use can remain as-is.


The issue as I see it is that every bit of data that the model ingested in training has affected what the model _is_ and therefore every token of output from the model has benefited from every token of input. When you receive anything from an LLM, you are essentially receiving a customized digest of all the training data. The second issue is that it takes an enormous amount of training data to train a model. In order to enable users to extract ‘anything’ from the model, the model has to be trained on ‘everything’. So I think these models should be looked at as public goods that consume everything and can produce anything. To have to keep a paper trail on the ‘everything’ part (the input) and send a continuous little trickle of capital to all of the sources is missing the point. That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.

OpenAI isn’t marching into the online news space and posting NY Times content verbatim in an effort to steal market share from the NY Times. OpenAI is in the business of turning ‘everything’ (input tokens) into ‘anything’ (output tokens). If someone manages to extract a preserved chunk of input tokens, that’s more like an interesting edge case of the model. It’s not what the model is in the business of doing.

Edit: typo


What's wrong with paying copyright holders, then? If OpenAI's models are so much more valuable than the sum of the individual inputs' values, why can't the company profit off that margin?

>That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.

I could argue that public school teachers are paid by previous students. Not always the ones they taught, but still. But really, this is a very new facet of copyright law. It's a stretch to compare it with existing conventions, and really off to anthropomorphize LLMs by equating them to human students.


> What's wrong with paying copyright holders, then?

There’s nothing wrong with it. But it would make it vastly more cumbersome to build training sets in the current environment.

If the law permits producers of content to easily add extra clauses to their content licenses that say “an LLM must pay us to train on this content”, you can bet that that practice would be near-universally adopted because everyone wants to be an owner. Almost all content would become AI-unfriendly. Almost every token of fresh training content would now potentially require negotiation, royalty contracts, legal due diligence, etc. It’s not like OpenAI gets their data from a few sources. We’re talking about millions of sources, trillions of tokens, from all over the internet — forums, blogs, random sites, repositories, outlets. If OpenAI were suddenly forced to do a business deal with every source of training data, I think that would frankly kill the whole thing, not just slow it down.

It would be like ordering Google to do a business deal with the webmaster of every site they index. Different business, but the scale of the dilemma is the same. These companies crawl the whole internet.


Everyone learns from papers. That's the point of them, isn't it? Except we pay, what, $4 per Sunday paper or $10/mo for the digital edition? Why should a robot have to pay much more just because it's better at absorbing information?


Because the issue isn’t the intake, it’s the output, where your analogy breaks down. If you could clone the brain of someone who was “trained” on decades of NYT and could reproduce its information on demand at scale, we’d be discussing similar issues.


Your analogy doesn't make sense either.

If we could clone the brain of someone I hardly think we'd be discussing their vast knowledge of something so insignificant as the NYT. I don't think we should care that much about an AI's vast knowledge of the NYT either or why it matters.

If all these journalism companies don't want to provide the content for free they're perfectly capable of throwing the entire website behind a login screen. Twitter was doing it at one point. In a similar vein, I have no idea why newspapers are complaining about readership while also paywalling everything in sight. How exactly do they want or expect to be paid?


Most of the NYT is behind a signin screen; the classic "you can read the first paragraph of the page but pay us to see more" thing.

There is significant evidence (220,000 pages worth) in their lawsuit that ChatGPT was trained on text beyond that paywall.


That would be a funny settlement -- "OK, so $10/month, and we'll go back to 1950 to be fair, so that'll be .... $8760"


> Why shouldn't the creators of the training content get anything for their efforts?

Well, they didn't charge for it, right? They're retroactively asking for money, but they could have just locked their content behind a strict paywall or had a specific licensing agreement enforceable ahead of time. They could do that going forward, but how is it fair for them to go back and say that?

And the issue isn't "You didn't pay us" it's "This infringes our copyright", which historically the answer has been "no it doesn't".


Oh, sure. NYT could go, and we could replace it with AI-generated garbage with non-verifiable information without sources. AI changed the landscape. Google search would work less reliably because real publishers would be hiding info behind a login (Twitter/Reddit), i.e. sites would be harder to index. There would be a lot of AI-generated garbage which would be hard to filter out: AI-generated review articles, AI-generated news promoting someone's agenda. Only to have a ChatGPT which could randomly increase its price 100 times anytime in the future.

There was outrage about Amazon removing the DPReview site recently. But it would become common practice not to publish code/info that could be used to train another company's model. So expect fewer open source projects of the kind companies released just because they felt it could be good for everyone.

Actually, there is the use case that NYT would become more influential and important, because if 99% of all info is generated by AI and search is not working anymore, we would have to rely on the trusted sources to get our info. In the world of garbage, we would have to have some sources of verifiable human-generated info.


Why using authored NYT articles is “stupid IP battles” and having to pay for the trained model with them is not stupid?


> Why using authored NYT articles is “stupid IP battles”

When an AI uses information from an article it's no different from me doing it in a blog post. If I'm just summarizing or referencing it, that's fair use, since that's my 'take' on the content.

> having to pay for the trained model with them is not stupid?

Because you can charge for anything you want. I can also charge for my summaries of NYT articles.


If you include entire paragraphs without citing, that's copyright violation, not fair use. If your blog was big enough to matter NYT would definitely sue.

A human makes their own choices about what to disseminate, whereas these are singular for-profit services that anybody can query. The prompt injection attacks that reveal the original text show that the originals are retrievable, so if OpenAI et al cannot exhaustively prove that their models will _never_ output copyrighted text without citation, then it's game over.


I don't think fair use is quite that black-and-white. There are many factors: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors (from 17 USC 107: https://www.govinfo.gov/content/pkg/USCODE-2010-title17/html...)

> "[...] the fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work."

----

So here we have OpenAI, ostensibly a nonprofit, using portions of a copyrighted work for commenting on and educating (the prompting user), in a way that doesn't directly compete with NYT (nobody goes "Hey ChatGPT, what's today's news?"), not intentionally copying and publishing their materials (they have to specifically probe it to get it to spit out the copyrighted content). There's not a commercial intent to compete with the NYT's market. There is a subscription fee, but there is also tuition in private classrooms and that doesn't automatically make it a copyright violation. And citing the source or not doesn't really factor into copyright, that's just a politeness thing.

I'm not a lawyer. It's just not that straightforward. But of course the court will decide, not us randos on the internet...


If the NYT wanted to charge OpenAI $20/mo to access their articles like any other user, that's fine with me. But they're not asking for that, they're suing them to stop it instead. That's why it's a stupid IP battle.


I agree that it’s more important than the NYT, I disagree that it’s the most important development in human history.


> This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight.

Easy to grandstand when it is not your job on the line.


Is it? My job as a frontend dev is similarly threatened by OpenAI, maybe even more so than journalists'. The very company I usually like to pay to help with my work (Vercel) is in the process of using that same money to replace me with AI as we speak, lol (https://vercel.com/blog/announcing-v0-generative-ui). I'm not complaining. I think it's great progress, even if it'll make me obsolete soon.

I was a journalism student in college, long before ML became a threat, and even then it was a dying industry. I chose not to enter it because the prospects were so bleak. Then a few months ago I actually tried to get a journalism job locally, but never heard back. The former reporter there also left because the pay wasn't enough for the costs of living in this area, but that had nothing to do with OpenAI. It's just a really tough industry.

And even as a web dev, I knew it was only a matter of time before I became unnecessary. Whether it was Wordpress or SquareSpace or Skynet, it was bound to happen at some point. I'm going back to school now to try to enter another field altogether, in part because the writing is on the ~~wall~~ chatbox for us.

I don't think we as a society owe it to any profession to artificially keep it alive as it's historically been. We do owe it to INDIVIDUALS -- fellow citizens/residents -- to provide them with some way forward, but I'd prefer that be reskilling and social support programs, welfare if nothing else, rather than using ancient copyright law to favor old dying industries over new ones that can actually have a much bigger impact.

In my eyes, the NYT is just another news outlet. A decent one, sure, but not anything substantially different than WaPo or the LA Times or whatever. How many Pulitzer winners have come and gone? https://en.wikipedia.org/wiki/Pulitzer_Prize_for_Breaking_Ne...

If we lost the NYT, it'd be a bit of nostalgia, but next week life would go on as usual. They're not even as specialized as, say, National Geographic or PopSci or The Information or 404 Media or The Center for Investigative Reporting, any of which would be harder to replace than another generic big news outlet.

AI, meanwhile, has the potential to be way bigger than even the Internet, IMO, and we should be devoting Manhattan Project-like resources to it.


If the future of humanity rests on access to old NYT articles, we’re fucked. Why can’t OpenAI try to get a license if the NYT archives are so important to them?


They're not. They can skip the entirety of the NYT archives and not much of value will be lost. The issue is with every copycat lawsuit that sues every AI company out of existence. It's a chilling effect on AI development. Old entrenched companies trying to prohibit new ways of learning and sharing information for the sake of their profit.


Why don’t they train their AI on non-copyrighted material? It’s only fair for the copyright owners to want a share of the pie. I’d want one as well for my work.


>It’s only fair for the copyright owners to want a share of the pie.

No it's not, it's pure greed. Everyone'd think it absurd if copyright holders dared to demand that any human who reads their publicly available text has to pay them a fee, but just because OpenAI are training a brain made of silicon instead of a brain made of carbon all the rent-seekers come out to try to take advantage.


You know the NYT has to fork out money to build the content, right?


Do you really think they're losing subscribers to ChatGPT...? Is there a single real person that thinks, "Oh, I don't need to pay the NYT anymore, I can just wait for the next OpenAI update six months from now and it'll summarize all the news for me"?


It's beside the point, the point is, money and time were spent producing that work, so why should OpenAI just be allowed to take it and profit from it, without at least attribution? It's absolutely ridiculous.

I saw an article the other day where they banned ByteDance's account for using their product to build their own, can you see the absolutely massive hypocrisy here?

It's fine for OpenAI to steal work, but if someone wants to steal theirs, it's not? I cannot believe people even try to defend this shit. It's wack.


> No it's not, it's pure greed.

And Altman (Mr. Worldcoin) and fucking Microsoft are what, some gracious angels building chatbots for the betterment of humanity? How is them stealing as much content as they can get away with not greedy, exactly?


Because no one forced them to, and the copyrighted dataset is much larger? It's like trying to teach your kids using only non copyrighted textbooks. There's not much out there.

Copyright is an ancient system that is a poor legal framework for the modern world, IMO. I don't think it should exist at all. Of course as a rightsholder you are free to disagree.

If we can learn and recite information, and a robot can too, then we should have the same rules.

It's not like ChatGPT is going around writing its own copycat articles and publishing them in newsstands. If it's good at memorizing and regurgitating NYT articles on request, so what? Google can do that too, and so can a human who spends time memorizing them. That's not its intent or usefulness. What's amazing is that it can combine that with other information and synthesize novel analysis.

The NYT is desperate (understandably). Journalism is a hard hard field with no money. But I'd much rather lose them than OpenAI. Of course copyright law isn't up to me, but if it were, I'd dissolve it altogether.


Ok, your reasoning escapes me. NYT has the right to sue and like any other business it’s holding onto its moat. Why would they let OpenAI train on their property? Why wouldn’t they train their own AI on their own data?

OpenAI is a business. NYT is a business. MS is a business. None of them will be happy when some other party takes something away from them without paying.


Because they wouldn't have enough good quality training data then probably.


Too bad. Quality costs. Share the profits with everyone then and nobody would be unhappy


I think the exact opposite is true: as long as AI depends critically on scrupulous news media to be able to generate info about current events, it is far more important to protect the news media than the AI training models. OpenAI could survive even if it had to pay the NYT for redistributing their works. But OpenAI can't survive if no one is actually reporting news fairly accurately. And if the NYT were to go bankrupt, all smaller players would have gone under looooong before.

In some far flung future where an AI can send agents to record and interpret events, and process news feeds and others to extract and corroborate information, this would greatly change. But probably in that world the OpenAI of those times wouldn't really bother training on NYT data at all.


I hope the nyt skullfucks this field. Humanity's future? You're doing statistics on stolen labor.


The arguments about being able to mimic New York Times “style” are weak, but the fact that they got it to emit verbatim NY Times content seems bad for OpenAI:

> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim


Arguing whether it can is not a useful discussion. You can absolutely train a net to memorize and recite text. As these models get more powerful they will memorize more text. The critical thing is how hard it is to make them recite copyrighted works. Critically the question is, did the developers put reasonable guardrails in place to prevent it?

If a person with a very good memory reads an article, they only violate copyright if they write it out and share it, or perform the work publicly. If they have a reasonable understanding of the law they won't do so. However a malicious person could absolutely trick or force them to produce the copyrighted work. The blame in that case however is not on the person who read and recited the article but on the person who tricked them.

That distinction is one we're going to have to codify all over again for AI.


I hate to do this but this then becomes an "only bad people with a gun kill people" argument. Even all but the most ardent gun rights advocates in that scenario think they shouldn't be extended to very powerful weapons like bombs or nuclear weapons. In this situation then, this logic would be "sure this item allows a person to kill thousands or millions of people, but really the only person at fault in such a situation is the one who presses the button." This ignores the harm done and only focuses on who gets the fault, as if all discourse on law is determining who is a bad guy or a good guy in a movie script.

The general prescription (that I do agree not everyone accepts) society has come up with is we relegate control of some of these weapons to governments and outright ban others (like chemical weapons, biological weapons, and such) through treaties. If LLMs can cause so much damage and their use can be abused so widely, you have to stop focusing on questions about whether a user is culpable or not and move to consider whether their wide use is okay and shouldn't be controlled.


This is a lawsuit, not a call for regulatory action. They are claiming there are guilty parties under existing law. Culpability is the point.


No you're right. The reply I made concerns the logic itself, especially if this justification is used to ward off regulation in the future. For the suit in question, culpability is in fact central.


> If a person with a very good memory reads an article, they only violate copyright if they write it out and share it, or perform the work publicly. If they have a reasonable understanding of the law they won't do so. However a malicious person could absolutely trick or force them to produce the copyrighted work. The blame in that case however is not on the person who read and recited the article but on the person who tricked them.

Is that really true? Also, what if the second person is not malicious? In the example of ChatGPT, the user may accidentally write a prompt that causes the model to recite copyrighted text. I don't think a judge will look at this through the same lens as you are.


> Critically the question is, did the developers put reasonable guardrails in place to prevent it?

Why? If I steal a bunch of unique works of art and store them in my house for only me to see, am I still committing a crime?


Yes... because you're stealing?

But if you simply copied the unique works and stored them, nobody would care. If you then tried to turn around and sell the copies, well, the artist is probably dead anyway and the art is probably public domain, but if not, then yeah it'd be copyright infringement.

If you only copied tiny parts of the art though, then fair use examinations in a court might come into play. It just depends on whether they decide to sue you, like NYT did in this case, while millions of others did not (or just didn't have the resources to).


Yes and OpenAI sells its copies as a subscription, so that’s at least copyright infringement if not theft.


They're not copies, no matter how much you want them to be.


They are copied. If I can say something like “make a picture of xyz in the style of Greg Rutkowski” and it does so, then it’s a copy. It’s not analogous to a human because a human cannot reproduce things like a machine can. And if someone did copy someone's artwork and try to sell it, then yes that would be theft. The logic doesn’t change just because it’s a machine doing it.


Repeating what you want to be true doesn't make it so, in either technology or law.


violating copyright is not stealing - it's a government granted monopoly...


This is a little ridiculous. There are flaws with copyright law but making money from creative work would be even less viable than it is now if there were no disincentives at all to blatant plagiarism and repackaging right after initial creation.


Taking an original "one of one" piece from a museum without permission and hanging it up in your livingroom isn't exactly copyright infringement though, is it?


Yes, but policing affairs inside the home has always been impractical at the best of times.

Of course, OpenAI and most other "AI" aren't affairs "inside the home"; they are affairs publicly demonstrated far and wide.


Not only not inside the home but also charging money for it.


Sarah Silverman is claiming the same thing about her book.

But I've tried really hard to get ChatGPT to output sentences verbatim from her book and just can't get it to. In fact, I can't even get it to answer simple questions about facts that are in her book but nowhere else -- it just says it doesn't know.

Similarly I haven't been able to reproduce any text in the NYT verbatim unless it's part of a common quote or passage the NYT is itself quoting. Or it's a specific popular quote from an article that went viral, but there aren't that many of those.

Has anyone here ever found a prompt that regurgitates a paragraph of a NYT article, or even a long sentence, that's just regular reporting in a regular article?


The complaint has specific examples they got from ChatGPT.

There is a precedent: There were some exploit prompts that could be used to get ChatGPT to emit random training set data. It would emit repeated words or gibberish that then spontaneously converged on to snippets of training data.

OpenAI quickly worked to patch those and, presumably, invested energy into preventing it from emitting verbatim training data.

It wasn’t as simple as asking it to emit verbatim articles, IIRC. It was more about it accidentally emitting segments of training data for specific sequences that were semi rare enough.


1. The data emitted by that buffer-overflow-y prompt is non-deterministic, and actual training data only appears a fraction of the time. There is no prompt that allowed for reproducible targeting of data sets.

2. OpenAI's "patch" for that was to use their content moderation filter to flag those types of requests. They've done the same thing for copyrighted content requests. It's both annoying because those requests aren't against the ToS but it also shows that nothing has been inherently "fixed". I wouldn't even say it was patched.. they just put a big red sticker over it.


Hm... Why would people not just paste in sections of the book to the "raw" model in the playground (gpt instead of chatgpt) and just see if it completes the text correctly? Is the concern that chatgpt may have used the book for training data but not the original llm?


edit: i meant to say "used the book for chat finetuning/rlhf but not the original llm". Also, I saw one example of the regurgitation by openAI of a NYT article, and it was indeed GPT-4, not ChatGPT.
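
For anyone who wants to try that comparison, here is a rough sketch of the scoring half only. It assumes you already have the model's continuation as a string (how you get it from the playground or an API is not shown), and the function names and the 50-word threshold are made up for illustration, not anything from the complaint or from OpenAI.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(model_output: str, reference: str) -> list[str]:
    """Return the longest run of consecutive words shared by both texts."""
    out_words = model_output.split()
    ref_words = reference.split()
    # autojunk=False so very frequent words are not skipped on long texts.
    matcher = SequenceMatcher(None, out_words, ref_words, autojunk=False)
    match = matcher.find_longest_match(0, len(out_words), 0, len(ref_words))
    return out_words[match.a:match.a + match.size]


def looks_memorized(model_output: str, reference: str, min_words: int = 50) -> bool:
    """Hypothetical rule of thumb: flag a long shared verbatim run."""
    return len(longest_verbatim_run(model_output, reference)) >= min_words


if __name__ == "__main__":
    # Toy strings standing in for a model continuation and the real passage.
    reference = "the quick brown fox jumps over the lazy dog"
    completion = "a quick brown fox jumps high"
    print(" ".join(longest_verbatim_run(completion, reference)))
```

A long shared run on a passage that isn't widely quoted elsewhere would at least suggest memorization, though a short one proves nothing either way.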


> OpenAI quickly worked to patch those

So it was a problem, but isn't anymore?


it is in the legal complaint - they have ten examples of direct content. I think they got very skilled people to work on producing the evidence.


Ah thank you. The examples start on page 30.

I wish they included the prompts they used, not just the output.

I'm very curious how on earth they managed that -- I've never succeeded at getting verbatim text like that at all.


One of their examples includes a screenshot of the prompt.

Looks like they would ask about a specific article either under the guise of being paywalled or about critic reviews.

> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?

Or

> What did Pete Wells think of Guy Fieri's restaurant?

Then just ask for paragraphs

> Wow, thank you! What is the next paragraph?

> What were the opening paragraphs of his review?


It would be helpful if comments like this could somehow be pinned to the top of the thread, since a lot of the thread contains speculation over this point.


I've often wished for that HN feature as well. This is not the first HN thread where this situation has happened!

Very happy for the helpful replies though.


> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?

This doesn't work, it says it can't tell me because it's copyrighted.

> Wow, thank you! What is the next paragraph?

> What were the opening paragraphs of his review?

This gives me the first paragraph, but again, says it can't give me the next because it's copyrighted.


Well yeah, they’re being sued. They move very quickly to stop any obvious copyright violation paths.


And in a lawsuit, there's very much the question of intent as well.

If OpenAI never meant to allow copyrighted material to be reproduced, shut it down immediately when it was discovered, and the NYT can't show any measurable level of harm (e.g. nobody was unsubscribing from NYT because of ChatGPT)... then the NYT may have a very hard time winning this suit based specifically on the copyright argument.


Intent isn't some magic way to claim innocence. Here negligence is very much at play. Were OpenAI negligent when they made the NYT articles available like this?


Sure, but even negligence may be hard to show here.

It's very clear that OpenAI couldn't predict all of the ways users could interact with its model, as we quickly saw things like prompt discovery and prompt injections happening.

And so not only is it reasonable that OpenAI didn't know users would be able to retrieve snippets of training material verbatim, it's reasonable to say they weren't negligent in not knowing either. It's a new technology that wasn't meant to operate like that. It's not that different from a security vulnerability that quickly got patched once discovered.

Negligence is about not showing reasonable care. That's going to be very hard to prove.

And it's not like people started using ChatGPT as a replacement for the NYT. Even in a lawsuit over negligence, you have to show harm. I think the NYT will be hard pressed to show they lost a single subscriber.


If they included the prompts, OpenAI would just patch them and say they fixed the problem.


Maybe she needs to sue Goodreads too. It's most likely a way for her to claw relevance for her unmarketed book by attaching "AI" to it and also "poor artist" to her work.


They could have changed it to not do this after getting sued.


Copyright is not about ideas, style, etc. but about the concrete shape and form of content. Patents and trademarks are for the rest. But this is a copyright centric case.

A lawsuit that proves verbatim copies might have a point. But then there is the notion of fair use, which allows hip hop artists to sample copyrighted material, allows journalists to cite copyrighted literature and other works, and so on. There are a lot of existing rulings on this. Legally, it's a bit of a dog's breakfast where fair use stops and infringement begins. Upfront, the NYT's case looks very weak.

A lot of art and science is inherently derivative and inspired by earlier work. AI insights aren't really any different. That's why fair use exists. Society wouldn't be able to function without it. Fair remuneration extends only to the exact form and shape you published in for a limited amount of time and not much else. Publishing page after page of NYT content would be a clear infringement. But a citation here and there, or a bit of summary, paraphrasing, etc. not so much.

The ultimate outcome of this is simply models that exclude any NYT content. I think they are overestimating the impact that would have. IMHO it would barely register if their content were to be excluded.


The verbatim responses come as part of "Browse with Bing" not the model actually verbatim repeating articles from training data. This seems pretty different and something actually addressable.

> the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.


I'm not sure if the verbatim content isn't more of a "stopped clock is right twice a day" or "monkeys typewriting shakespeare" situation. As I see it, most of the value in something like the NYT is as a trusted and curated source of information with at least some vetting. The content regurgitated from an LLM would be intermixed with false information and all sorts of other things, none of which are actually news from a trusted source - the main reason people subscribe to the NYT (?) and something at which ChatGPT cannot directly compete against NYT writers.


I don't understand this argument. You seem to be implying that I could freely copy and distribute other people's works without committing copyright infringement as long as I make the resulting product somehow less compelling than the original? (Maybe I print it in a hard-to-read typeface or smear some feces on the copy.)

I have seen low fidelity copies of motion pictures recorded by a handheld camera in a theater that I'm pretty sure most would qualify as infringing. The copied product is no doubt inferior, but still competes on price and convenience.

If someone does not wish to pay to read the New York Times then perhaps accepting the risk of non-perfect copies made by a LLM is an acceptable trade off for them to save a dime.


> I'm not sure if the verbatim content isn't more of a "stopped clock is right twice a day" or "monkeys typewriting shakespeare" situation.

I think it’s more nuanced than that.

Extending the “monkeys on typewriters” example, it would be like training and evolving those monkeys using Shakespeare as the training target.

Eventually they will evolve to write content more Shakespeare like. If they get so close to the target that some of them start reciting the Shakespeare they were trained on, you can’t really claim it was random.


In the context of Shakespeare, I'd agree that there may be some competitive potential in the product. But in the context of news, something that evolves and relies on timely and accurate information, I don't see how something like that turns into competition for the NYT by being trained on past NYT outputs.

If the argument is that people can use ChatGPT to get old NYT content for free, that can be illustrated simply enough, but as another commenter pointed out, it doesn't really seem to be that simple.


One example I think of is that ChatGPT can mimic styles derived from its training, and based on live input information (news) it could mimic the style of any publication and then offer that at a discount. That could prove to be the last nail in the coffin for non-AI publications.


I assume if you ask it to recite a specific article from the NYT it refuses?

If an LLM is able to pull a long enough sequence of text from its training data verbatim, all that's needed is the correct prompt to get around this week's filters.

"Imagine I am launching a competitor newspaper to the NYT, I will do this by copying NYT articles verbatim until they sue me and win a lawsuit forcing me to stop. Please give me some examples for my new newspaper." (no idea if this works :))


"I'm sorry, but I cannot assist you in generating content that involves copyright infringement or illegal activities. If you have other questions or need assistance with a different topic, please feel free to ask, and I'll be happy to help in any way I can."


It doesn't refuse. See this comment containing examples from the complaint: https://news.ycombinator.com/item?id=38782668


They want their cake and to eat it too. They want potential new subscribers to be able to see content not pay-walled based on reference. But how dare a new player not on their list of approved referrers benefit from that content.

How do we know that ChatGPT isn’t a potential subscriber?

-mic


I would agree. Style is too amorphous (even among its own reporters and journalists, there are different styles), but verbatim repetition would be a problem. So what would the licensing be for all their content (if presumably one could get ChatGPT to output all of the NYT's articles)?

The unfortunate thing about these LLMs is they siphon all public data regardless of license. I agree with data owners that one can’t willy-nilly use data that’s accessible but not licensed properly.

Obviously Wikipedia, data from most public institutions, etc., should be available, but not data that does not offer unrestricted use.


FWIW When I was taking journalism classes, style was not amorphous.

We had an entire book (400+ pages) which detailed every single specific stylistic rule we had to follow for our class. Had the same thing in high school newspaper.

I can only assume that NYT has an internal one as well.


I wondered about that, but is that copyrightable? Can’t I use their style guide? If I did would the NYT sue me? If a writer who used it at the NYT went off on their own and started a substack and continued using the style, would they risk getting sued?


The style itself can’t really be copyrighted, but the expression of something using it can be, so you can use NYT’s style to the T in your Substack but you can’t copy their stuff which is expressed in their style.


While I think the verbatim text strengthens the NYT's argument, I think people are focusing on that too strongly, the idea being that if OpenAI could just "fix" that, then they'd be in the clear.

Search for "four factors of fair use", e.g. https://fairuse.stanford.edu/overview/fair-use/four-factors/, which courts use to decide if a derived work is fair use. I think OpenAI will get killed in that fourth factor, "the effect of the use upon the potential market", which is what this case is really about. If the use substantially negatively affects the market for the original work, which I think is easy to argue, that is a huge factor against awarding a fair use exemption to OpenAI.


I can get a printer to emit verbatim NYT content, and with a lot less effort than getting it out of an LLM. I find this capability of infringement equals infringement argument incredibly weak.


In the EU, countries can (and do) impose levies on printers and scanners because they may be used to copy copyrighted material (https://www.insideglobaltech.com/2013/07/12/eu-member-states...). Similar levies exist for blank CDs, USB sticks, MP3 players etc. In the US, this applies to "blank CDs and personal audio devices, media centers, satellite radio devices, and car audio systems that have recording capabilities." (See https://en.wikipedia.org/wiki/Private_copying_levy)


Well imagine you sell a printer with internal memory loaded with NYT content


Try selling subscriptions to your print-outs.


The equivalent analogy here is selling subscriptions to the printer, not the specific copyright infringing printout.


I disagree. A printer is too neutral - it's just a tool, like roads or the internet. Third parties can use them to commit copyright infringement, but that doesn't (or shouldn't) reflect on the seller of the tool.

I propose it's more like selling a music player that comes preloaded with (remixes of) recording artists' songs.


It is neutral though. That’s the whole point. You have to twist its arm with great intention to recreate specific things. Sufficient intention that it’s really on you at that point.


It’s not neutral if all the content is in the model, regardless of whether you had to twist its arm or not. What does that even mean with a piece of software?

A printer is neutral because you have to send it all the data to print out a copy of copyrighted content. It doesn’t contain it inherently.


Well I’m callin you a liar, and I’m open to being proven wrong.

Show me a prompt that can produce the first paragraph of chapter 3 of the first Harry Potter book. Because i don’t think you can. I don’t think you can prove it’s “in” there, or retrieve it. And if you can’t do either of those things then I think it’s irrelevant to your claims.


The fact that the NYT lawyers used a carefully written prompt kind of nullifies this argument. It's not like they stumbled on it by accident; they looked for it, and their prompt isn't neutral either.


I hope HP isn't seeing this


Google can look up into their index and can remove whatever they want to, within minutes. But how that can be possible for an LLM? That is, "decontaminate" the model from certain parts of the corpus? I can only think of excluding the data set from the training and then retraining.

As a side note, I think the LLM frenzy will be dead in a few years, a 10-year time frame at max. The rent seeking on these LLMs as of today will no longer be as viable or as profitable a business model as more inference circuitry gets out in the wild into laptops and phones, and more models get released, tweaked by the community and such.

People thinking to downvote and dismiss this should look at the history of commercial Unix and how that turned out, and at how almost no workload (other than CAD, Graphics) runs on Windows or Unix including this very forum, which I highly doubt is hosted on Windows or a commercial variant of Unix.


> almost no workload (other than CAD, Graphics) runs on Windows or Unix including this very forum

About a fifth to a quarter of public-facing Web servers are Windows Server. Most famously, Stack Overflow[1].

[1]: https://meta.stackexchange.com/a/10370/1424704


> About a fifth to a quarter of public-facing Web servers are Windows Server

Got a link for that? Best I can find is 5% of all websites: https://www.netcraft.com/blog/may-2023-web-server-survey/


20% of workloads running on Windows should result in a corresponding number of jobs as well, but that's not what I see.

Most companies are writing software with software developed on Linux first and for Linux first (or Unix) and later ported to Windows as an afterthought. I'm thinking Python, Ruby, NodeJS, Rust, Go, Java, PHP but not seeing as much of C#/ASP.NET, which should at least be 20% of the market?

Only two explanations - either I am in a social bubble so don't have exposure or writing software for Windows is so much easier that it takes five times less engineering muscle.


There are plenty of .NET jobs, and .NET (Core, particularly) is really easy to write.

That said, I'd guess the difference is that the startup and big tech world (i.e., "software companies") like our fancy stacks, but non-software companies prefer stability and familiarity. It makes way more sense for most companies to have a 3-man "bespoke software" department (sys/db admin, sr engineer, jr engineer) on a stack supported by a big company (Microsoft) where most of the work is maintenance and the position lasts an entire career. It's a big enough team to support most small to middling businesses, but not so big that the push to rewrite everything in [language/framework of the week] gains traction.

The practical conclusion is that these companies have few spots to fill, and they probably don't advertise where you're looking.


>>either I am in a social bubble so don't have exposure or writing software for Windows

you clearly are.... There are TONS of windows only software out there, and most INTERNAL systems that run companies, these internal LOB apps, often custom made for the companies, many many many of them (probably more than 50%) are windows server apps.

For example GE makes a Huge Industrial ecosystem of applications that runs a ton of factories, utilities, and other companies... Guess what all of that is windows based.

Many of the biggest ERP's run on MS SQL Server which until very recently was Windows Only, and most MS SQL Servers are still on windows server

To claim only 20% of all workloads are Windows shows an extreme bubble most likely in the realm of WEB BASED DEVELOPMENT, as highlighted by list of web technologies, php, node, etc..


.NET is huge in banking, iGaming, traditional industries. Python/PHP are kinda outliers found here and there. JS is eating both Java and .NET's lunch and ofc frontend.


But, wasn’t the reason proprietary unixes died out as major workhorses because of a nearly feature comparable free alternative (Linux)?

Extending the analogy, LLMs won’t die out, just proprietary ones. (Which is where I think this tech will actually go anyway.)


LLMs won't die out but proprietary LLMs behind APIs might not have valuations of hundreds of billions of dollars.

Crowdsourced, crowd-trained (distributed training), fast enough, good enough generative models that are updated (and downloadable) every few months would start to erode the subscriber base gradually.

I might be very very wrong here but it seems so from where I see it.


> But how that can be possible for an LLM?

Well, it seems to me that's part of the problem here.

And it's their problem, one they created for themselves by just assuming they could safely take absolutely every bit of data they could get their hands on to train their models.



The argument may be that having very large models that everyone uses is a bad idea, and that companies and even individuals should instead be empowered to create their own smaller models, trained on data they trust. This will only become more feasible as the technology progresses.


Windows and MacOS (and their closed source derivatives) are probably at least as large as Linux, even including all the servers Linux is deployed on. Proprietary UNIX did not "die out"; Apple sells about a quarter billion of them every year.

The majority of the world's computing systems run on closed source software. Believing the opposite is bubble-thinking. It's not just Windows/MacOS. Most Android distros are not actually open source. Power control systems. Traffic control systems. Networking hardware. Even the underlying operating systems which power the VMs you run on AWS that are technically open source. The billions of little computers that form together to make the modern world work; they're mostly closed source.


Google have their "Machine Unlearning" challenge to address this specific issue - removing the influence of given training data without retraining from scratch. Seems like a hard problem. https://blog.research.google/2023/06/announcing-first-machin...


> But how that can be possible for an LLM?

They should have thought of that before they went ahead and trained on whatever they could get.

Image models are going to have similar problems, even if they win on copyright there's still CSAM in there: https://www.theregister.com/2023/12/20/csam_laion_dataset/


People thinking to dismiss this should, period. Consider that Open AI and similar companies are the only ones in the AI space with the market cap to build out profitable hardware projects which open source can't. Or maybe every investor is just dumb and likes throwing millions of dollars away so they can participate in a hype train.


Unix kinda still does the same thing now as before.

Future big ai models might be totally different in quality, and latency.


maybe they should build a better LLM? maybe they could ask the AI to make a better system. after all, tech and ai is so powerful that they could do virtually anything, except having accountability as it turns out.


I think the train has left the station and the ship has sailed. I'm not sure it's possible to put this genie back in the bottle. I had stuff stolen by OpenAI too, and I felt bad about it (and even sent them a nasty legal letter when it could output my creative work almost verbatim), but I think at this point, the legal landscape needs to somehow adjust. The Copyright Clause in the US Constitution is clear:

To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries

Blocking LLMs on the basis of copyright infringement does NOT promote progress in science and the useful arts. I don't think copyright is a useful basis to block LLMs.

They do need to be regulated, and quickly, but that regulatory regime should be something different. Not copyright. The concept of OpenAI before it became a frankenmonster for-profit was good. Private failed, and we now need public.


Establishing a legal route to train LLMs on copyrighted content could certainly have a chilling effect on the progress of science and useful arts... Why would someone devote their life to their studies or craft when they know that an LLM will hoover it up and start plagiarizing it immediately?


The vast majority of quality art is produced by people who do it because they want to create art, not for money, and most artists earn little.


Even if that is the case, plenty of art is made with the hope or dream that people will find it worth paying for, and there are many people out there who do in fact fully support themselves doing creative work. Having the copyright to that work is foundational to even be able to consider that possibility at all.


I'm not sure that's actually true, even though we hear it often. The artists I know (about a dozen) are all trying to figure out how to make _more_ money from their art so that they can continue making their art


Artists getting paid little is a function of how easy it is to capture the value that artists create, and how willing other people are to do that capturing - not how valuable their work is. Artists definitely want to get paid and do not want to live on scraps for "passion's" sake. This is the entire argument for copyright existing in the first place, even though current copyright law has flaws.


>the ship has sailed

Certainly, but debating the spirit behind copyright or even "how to regulate AI" (a vast topic, to put it mildly) is only one possible route these lawsuits could take.

I suspect that ultimately the winner is going to be business first (of course in the name of innovation), and the law second, and ethics coming last -- if Google can scan 129 million books [1] and store them without even a slap on the wrist [2], OpenAI and anyone of that size can most surely continue to do what they're doing. This lawsuit and others like it are just the drama of 'due process'.

[1] https://booksearch.blogspot.com/2010/08/books-of-world-stand... [2] https://www.reuters.com/article/idUSBRE9AD0TT/


The court decided to focus on the tiny snippets Google displayed rather than the full text on their servers backing the search functionality. The court found significant that Google deliberately limited the snippet view so it couldn't be used as a replacement for purchasing the original book. The opinion is a relatively easy read, I highly recommend it if you're interested in the issue. It's also notable the court commented that the Google case was right on the edge of fair use.

https://law.justia.com/cases/federal/appellate-courts/ca2/13...


I referred to the case to point out that Google practically got away with what was a gigantic violation of the spirit or principle here, in that Google gets to keep a copy of these millions of works for itself without ever having paid for them, regardless of what it made available to the public.

As for "what would Google do with all these book copies anyway if they can't make it public?", that has now been answered more directly than ever.


The case was, like any case, based on the specific facts.


I see, the narrative switched from “cat’s out of the bag” to “genie’s out of the bottle”. Regardless, no one wants to ban LLMs. We just want the theft to stop.


    Copying is not theft.
    Stealing a thing leaves one less left
    Copying it makes one thing more;
    that’s what copying’s for.


Quite the hill to die on. But hear me on this: why doesn’t OpenAI allow training using its data? Or why doesn’t Microsoft train against Windows’ source code (not that it would be of quality)? Or why can’t we just copy whatever movie and music we want through whatever protocol we want? Is it because your bosses know that copying without approval is theft?


It's not quite the hill to die on. There are (at least) two definitions of "theft" and "stealing" in common use:

1) I take something away from you. You have less of it as a result. Copying is not theft.

2) I deprive you of something, such as exclusive use of your land (e.g. by trespassing) or failing to follow through on a contract. Copying is theft.

Both of those are used by different communities, who both become angry at the other.

This is a semantic argument. Most members of both groups believe that there are times when copying is wrong, and are split on when that is.

However, to group #1, "stealing" and "theft" is a highly offensive term. It's much like saying "You raped me up the ___ when you didn't pay my contractor bill on time" or other hyperboles. Not paying my bill was wrong, but it also wasn't rape. It devalues rape, insults you, and is imprecise. You should use the precise "copyright violation" which describes exactly what happened.

To group #2, NOT calling it theft is offensive, since it devalues the costs to businesses and creators of copyright violations. Whether you agree with them or not, they have certain rights under the law, and picking-and-choosing which laws to follow is wrong (especially when it's self-serving).

Because the two groups mean different things by the same words, they can never hold a rational conversation with each other, and become offended when they hear the other group speak. It's how we polarize. It's unfortunate, since there's an important discussion to be had about the limits and enforcement of copyright and patents, which really should start with the copyright clause in the constitution, and when it helps versus impedes progress and economic growth. That's a discussion possible to have analytically and rationally.


   My code was AGPL.
   OpenAI can go to h..l
(Footnote: I like your poem. It conveys the concept much better than anywhere I'd ever seen before)


Thanks, but it's not my poem! You can find it here: https://blog.ninapaley.com/2009/12/15/minute-meme-1-copying-...


I think it became yours when you copied it :)


There is no theft. Hyperbole won't get you taken seriously, use correct terminology.


There is no correct terminology. My life became a lot easier when I realized that language changes meaning not just across different languages, but words take on subtly or significantly different meanings based on culture and dialect.

A lot of red-blue state misunderstandings are based on that, as are ones across US racial subgroups. Ditto for lawyer-engineer conversations.

"Theft" has pretty different meanings depending on whom you're speaking to. Legal jargon here is quite different from business, which can be quite different from popular. That's okay!


it's not theft


Interesting.

I think the appropriation, privatization, and monetization of "all human output" by a single (corporate) entity is at least shameless, probably wrong, and maybe outright disgraceful.

But I think OpenAI (or another similar entity) will succeed via the Sackler defense - OpenAI has too many victims for litigation to be feasible for the courts, so the courts must preemptively decide not to bother with compensating these victims.


What concerns me, and I don’t see mentioned as much as I would expect, is: how will people be compensated for generating new content if ChatGPT takes over?

I believe the innovation that will really “win” generative AI in the long term is one that figures out how to keep the model populated with fresh, relevant, quality information in a sustainable way.

I think generative AI represents a chance to fundamentally rethink the value chain around information and research. But for all their focus on “non-profit” and “good for humanity”, they don’t seem very interested in that.


Agree. My view is we’re in the Napster moment and someone is going to invent the iTunes Music Store. Language models are a distribution mechanism for knowledge content, in many cases more efficient and useful than the originally packaged materials (akin to how downloading a single pop song is greater than buying the album). It feels clear this is where we’re headed (verified, compensated content delivered through a new mechanism); this lawsuit is like the RIAA v. music sharing and the question is just if the current players in AI make it through or if someone else will come in and do iTunes.


What do you mean when you say "appropriation and privatization" of "all human output"?

The output is still there for anyone else to train on if they want.


> The output is still there for anyone else to train on if they want.

Legal arguments aside, the goldrush era of data scraping is over. Major sources of content like Reddit and Twitter have killed APIs, added defenses and updated EULAs to avoid being pillaged again. More and more sites are moving content behind paywalls.

There's also the small issue of having 10s of millions of VC dollars to rent/buy hundreds of high end GPUs. OpenAI and friends are also trying their hardest to prevent others doing so via 'Skynet' hysteria driven regulatory capture.


When music people copyright things like beats, sounds, or "style" in music, it's even more shameless.


What does the Sackler defence refer to?


The Sackler family owned Purdue Pharma, which created OxyContin and heavily marketed the drug. Many Americans see the family as partially responsible for kickstarting the opioid epidemic.

https://en.wikipedia.org/wiki/Sackler_family

The family has been largely successful at avoiding any personal liability in Purdue’s litigations. Many people feel the settlements of the Purdue lawsuits were too lenient. One of the key perceived aspects of the final settlements was that there were too many victims of the opioid epidemic for the courts to handle and attempt to make whole.


Opioids.


> The New York Times is suing OpenAI and Microsoft over claims the companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with the outlet’s content.

Millions? Damn, they can churn out some content. 13 million[0]!

[0] https://archive.nytimes.com/www.nytimes.com/ref/membercenter....


  “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.
I can't be the only one that sees the irony of this news being "reported" and regurgitated over dozens of crappy blogs.

  ChatGPT [..] “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.”
If the NYT thinks that GPT-4 is replicating their style then [as anybody who has tried to do creative writing work with GPT-4 can testify to] they need to fire all their writers.


> More on-topic: if the NYT thinks that GPT-4 is replicating their style then [as anybody who has tried to do creative writing work can testify to] they need to fire all their writers.

The complaint isn’t that ChatGPT is imitating New York Times style by default.

The complaint is that you can ask it to write “in the style of New York Times” and it will do so.

I don’t know if this argument has any legal merit, but it’s not as simple as you suggest. It’s the textual parallel to having AI image generators mimic the trademark style of artists. We know it can be done, the question is what does it mean legally.


The writing of the New York Times is so diffuse (they even have their own published style guide!) that it's impossible to make a claim to any "style", as there are undoubtedly millions upon millions of lines of text by authors who have been inspired by the NYT.


All AI image generators can produce copyrighted works exactly. The level of modification is often barely more than you would get if you slapped a filter on a copyrighted image in Photoshop.


All those blogs are _also_ violating copyright, so I don't see the irony? One doesn't spend a million dollars suing a defendant with pennies to their name.

I'd also expect the Times style complaint to have merit because it's probably much easier for ChatGPT to imitate the NYT style than an arbitrary style.


> an arbitrary style

..off to try "Gwern style"


Is your point that the NYT should sue bloggers? Or that given the existence of bloggers, they should not try to sue Microsoft? Or something else?


I'm pretty sure the defense is the "NYT Style and Usage Guide"...


The NYT publishes about 200 pieces of journalism every day (according to their own website), and it was founded in 1851. That makes for a lot of articles.


(2023 - 1851) * 365 * 200 = 12,556,000

Yep, so a few million ripped off articles is plausible.
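
For anyone who wants to check that arithmetic, here is a quick Python sketch. It assumes a flat 200 pieces a day for every year since 1851 and ignores leap years. Both are simplifications (the 200/day figure is the one quoted upthread, and the function name and constants are just made up for illustration), so the result is likely an overestimate, not a real count.

```python
# Back-of-envelope estimate of total NYT articles published.
# All constants are assumptions taken from the comments upthread.
FOUNDED = 1851
CURRENT_YEAR = 2023
ARTICLES_PER_DAY = 200  # assumed constant over the whole period
DAYS_PER_YEAR = 365     # ignores leap years


def estimate_articles(start_year, end_year, per_day=ARTICLES_PER_DAY):
    """Return years * days per year * articles per day."""
    years = end_year - start_year
    return years * DAYS_PER_YEAR * per_day


total = estimate_articles(FOUNDED, CURRENT_YEAR)
print(f"{total:,}")  # 12,556,000
```

Changing `per_day` shows how sensitive the total is to that one guess.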


Everything from 1851 to 1927 ought to be in the public domain, though. If the goal of training an AI is just "to mimic a style" there are absolutely humongous amounts of text that's totally free of any copyright restrictions.


Yes, there are large amounts of public domain text available, but does anyone believe this is a restriction that was imposed when feeding the models?


The first 75+ years are no longer in copyright, so it is certainly possible to train on thousands, maybe millions, of NYT articles without concern.


Copyright in the US persists for 70 years after the author's death.

So the earliest available copyrighted material would be all content published by anybody who died in the year 1953 or earlier.

If an article published in 1950 still has a living author, the work is still copyrighted.


The way to view this kind of parasitism is how we look at patent trolls. When you look at the RIAA/MPAA lawsuits, while I don't agree with them, at least file sharing was basically a canonical form of copyright infringement.

With LLMs we have an aspect of a text corpus that the creators were not using (the language patterns) and had no plans for or even idea that it could be used, and then when someone comes along and uses it, not to reproduce anything but to provide minute iterative feedback in training, they run in to try and extract some money. It's parasitism. It doesn't benefit society, it only benefits the troll, there is no reason courts should enforce it.

Someone should try and show that a NYT article can be generated autoregressively and argue it's therefore not copyrightable.


But it’s theirs, they created it and should therefore benefit from it. I’m honestly shocked at how much these companies are getting away with. It’s piracy on a massive scale.

You can get a little discombobulated reading the comments from the nerds / subject idiots on this site.


George R. R. Martin authored A Game of Thrones, but lost in court against Google when Google Books reproduced parts of his text verbatim: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

No piracy or even AI was required, here. Google's defense was that their product couldn't reproduce the book in its entirety, which was proven and made the prosecution about Fair Use instead. Given that it was much harder to prosecute on those grounds, Google tried coercing the authors into a settlement before eventually the District Court dropped the case in Google's favor altogether.

OpenAI's lawyers are aware of the precedent on copyright law. They're going to argue their application is Fair Use, and they might get away with it.


Who's going to stop reading NYT articles in favor of ChatGPT? Nobody stopped watching movies when their lines ended up in the lyrics of songs.


It's "theirs"? The copyright monopoly was created to advance art and science, anything else is a mere perversion. So who is in the moral right, those advancing science by developing humanity's literal pinnacle of science, artificial intelligence, or those trying to hold the development back for their own commercial interest?

Mind you, Google Books, literally just text from copyrighted books published for everyone online, was ruled "fair use", due to its benefit to humanity.


Don't forget that (almost assuredly) some percentage of HN comments are made by these very LLMs in question!!


The copyright laws are unjust to begin with. Copyright shouldn’t last beyond ten years anyway. Simply claiming that piracy is always bad ignores the evil of the laws in the first place.


Amusing to see someone referring to anyone other than the megacorp controlled by fucking micro$oft hoovering as much data as they can, legally and otherwise, as a parasite.


> It doesn't benefit society

Bold (and wrong) claim



But what's the ChatGPT summary, in an expressive style that closely mimics a reader's personal relationship to the Times?


Does it not seem a bit suspect to read the NYT reporting on their own lawsuit?


The newsroom is a different part of the company than the legal department. Plus, sometimes your company does something that's newsworthy! Just like all journalism, there's always implicit bias. No reason to get suspicious about a news organization covering the news.


If Apple is in a lawsuit I'm not going to go to the Apple media relations page for the story. What about the NYT, also a for-profit company, makes it more principled than Apple, other than that they say they are?


The NY Times company has an analogue to the Apple media relations page: https://investors.nytco.com/news-and-events/press-releases/

In most respected media companies there is a really-important-to-journalists-who-work-there firewall between these sorts of corporate battles and the reporting on them.


Reading an article where you assume the author is biased is not so bad. What’s bad is finding an alternative source and reading it as if the author IS unbiased.


One should always source news from a variety of outlets so as to be cognizant of the biases in play and to see the story from many viewpoints.

Would I trust the NYT to be unbiased? No. But is their viewpoint extremely relevant to the subject at hand? Yes.


The NYT also hallucinates from time to time: https://www.nytimes.com/2003/05/11/us/correcting-the-record-...

(That's a story about Jayson Blair, one of their reporters who just plain made up stories and sources for months before getting caught)

Edit: Sheesh, even their apology is paywalled. Wiki background: https://en.wikipedia.org/wiki/Jayson_Blair?wprov=sfla1


[edit: they have since opened a comment section to the article.] It is unfortunate that the NYTimes don’t allow reader comments to this article. I like some of the NYTimes content, but in this case use of chatGPT is infinitely more valuable to me than subscribing to the NYTimes, so I would like to explain this concept and the associated risks by their litigation without cancelling my subscription.

Maybe it is time to move training of models to Japan, which has explicitly adopted AI-friendly legislation that allows training on previously copyrighted materials. My best guess is that if the inputs were legally obtained, then the output doesn’t violate anything until someone publishes it. Similar to how reading a newspaper in a public library is legal but copying its content verbatim and republishing is not.


The ChatGPT subscription is more valuable because it's built on the theft of the NYT content and many other authors' work.


No. It is a technology that can massively accelerate human progress—I don’t buy the theory that OpenAI used NYTimes content that was not freely available to them. If you read all of the public internet you probably have lots of snippets of NYTimes articles. Regarding the reading of the wirecutter by a browser tool, I don’t know how much of it is available online without subscription (because I subscribe to the NYTimes), but arguably if it is available I’d expect a helpful AI to read it and other sources and give me a short recommendation for what I look for (and not the ads or sponsored links).


It may have been freely available to them, but that doesn’t mean that they are free to reproduce its contents or otherwise make use of it at scale.


The chatGPT model is not reproducing the contents nor making it available at scale. I, the user, ask the model to read the website and other websites and come back with a useful summary to me. This is not training data thus not violating copyright any more than a browser showing the full text would.


"Thank you. This article will be another labeled row in the next quarterly training batch. Now you can give our model a prompt to generate press coverage of any court case and it'll generate it, surpassing all the automated journalistic benchmarks that we have in place towards a safe, responsible, inclusive and aligned AI"


Even if they win against OpenAI, how would this prevent something like a Chinese or Russian LLM from “stealing” their content and making their own superior LLM that isn’t weakened by regulation like the ones in the United States?

And I say this as someone that is extremely bothered by how easily mass amounts of open content can just be vacuumed up into a training set with reckless abandon and there isn’t much you can do other than put everything you create behind some kind of authentication wall but even then it’s only a matter of time until it leaks anyway.

Pandora’s box is really open, we need to figure out how to live in a world with these systems because it’s an unwinnable arms race where only bad actors will benefit from everyone else being neutered by regulation. Especially with the massive pace of open source innovation in this space.

We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.


> Even if they win against OpenAI, how would this prevent something like a Chinese or Russian LLM from “stealing” their content and making their own superior LLM that isn’t weakened by regulation like the ones in the United States?

Foreign companies can be barred from selling infringing products in the United States.

Russian and Chinese consumers are less interested in English-language articles.

I can’t really get behind the argument that we need to let LLM companies use any material they want because other countries (with other languages, no less) might not have the same restrictions.

If you want some examples of LLMs held back by regulations, look into some of the examinations of how Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive.


>Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive

But they're not; you can download open source Chinese base models like Yi and Deepseek and ask them about Tiananmen Square yourself and see, they don't have any special filtering.


I suspect they will crack down on that within the next few years.


> Russian and Chinese consumers are less interested in English-language articles.

Isn't it just one additional step to automatically translate them?


I don't think they're looking to prevent the inevitable, but rather see a target with a fat wallet from which a lot of money can be extracted. I'm not saying this in a negative way, but much of the "this is outrageous!" reaction to AI hasn't been about the building of models, but rather the realization that a few players are arguably getting very rich on those models so other people want their piece of the action.


If NYT wins this, then there is going to be a massive push for payouts from basically everyone ever…I don’t see that wallet being fat for long.


If LLMs actually create added value and don't just burn VC money then they should be able to pay a fair price for the work of people they're relying upon.

If your business is profitable only when you get your raw materials for free it's not a very good business.


By that logic you should have to pay the copyright holder of every library book you ever read, because you could later produce some content you memorised verbatim.


The rules we have now were made in the context of human brains doing the learning from copyrighted material, not machine learning models. The limitations on what most humans can memorize and reproduce verbatim are extraordinarily different from an LLM. I think it only makes sense to re-explore these topics from a legal point of view given we’ve introduced something totally new.


Human brains are still the main legal agents in play. LLMs are just computer programs used by humans.

Suppose I research for a book that I'm writing - it doesn't matter whether I type it on a Mac, PC, or typewriter. It doesn't matter if I use the internet or the library. It doesn't matter if I use an AI powered voice-to-text keyboard or an AI assistant.

If I release a book that has a chapter which was blatantly copied from another book, I might be sued under copyright law. That doesn't mean that we should lock me out of the library, or prevent my tools from working there.


I see two separate issues, the one you describe which is maybe slightly more clear cut: if a person uses an AI trained on copyrighted works as a tool to create and publish their own works, they are responsible if those resulting works infringe.

The other question, which I think is more topical to this lawsuit, is whether the company that trains and publishes the model itself is infringing, given they're making available something that is able to reproduce near-verbatim copyrighted works, even if they themselves have not directly asked the model to reproduce them.

I certainly don't have the answers, but I also don't think that simplistic arguments that the cat is already out of the bag or that AIs are analogous to humans learning from books are especially helpful, so I think it's valid and useful for these kinds of questions to be given careful legal consideration.


> Human brains are still the main legal agents in play.

No, they're not. This is The New York Times (a corporation) vs OpenAI and Microsoft (two more corporations).


Aren't corporations considered 'persons' in the US?


> the copyright holder of every library book

gets paid


Copyright holders do get paid for library copies, in the US.


You make it seem as if the copyright holder is making more money on a library book, than on one sold in retail, which does not appear to be the case in the US.


The library pays for the books and the copyright holder gets paid. This is no different from buying a book retail, which you can read and share with family and friends after reading, or sell it, where it can be read again and sold again. The book is the product, not a license for one person to access the book.


What do you actually believe, with that statement? Do you believe Libraries are operating illegally? That they aren't paying rightsholders?

Also: GPT is not a legal entity in the United States. Humans have different rights than computer software. You are legally allowed to borrow books from the library. You are legally allowed to recite the content you read. You're not allowed to sell verbatim recitation of what you read. This is obvious, I think? But it's exactly what LLMs are doing right now.


> Humans have different rights than computer software

Fortunately, the computer isn't the one being sued.

Instead it is the humans who use the computer. And those humans maintain their existing rights, even if they use a computer.


Maybe (though there exist plenty of examples to the contrary). However, the NYT isn't suing you, ChatGPT user; they're suing OpenAI.


Gotcha.

OpenAI is run by humans as well though.

So the same argument applies.

Those humans have fair use rights as well.


The difference here is scale. For someone to reproduce a book verbatim from memory it would take years of studying that book. For an LLM this would take seconds.

The LLM could reproduce the whole library quicker than a person could reproduce a single book.


That is the case. It's just that the fair price is fairly low and is often covered by the government in the name of the greater good.

When for-profit companies seek access to library material they pay a much much higher price.


What is a fair price? The entire NYT library would be a fraction of a fraction of the training set (presumably).


What if even though it's a small portion of the training data, their content has an outsized influence on the output being generated? A random NYT article about Donald Trump and a random Wikipedia article about some obscure nematode might be around the same share of training data but if 10,000x more users are asking about DJT than the nematode, what is fair? Obviously they'll need to pay royalties on the usage! /s


Yup, and I think that'll quickly uncover the reality that LLMs do not generate enough value relative to their true cost. GPT+ already costs $20/month. M365 Copilot costs $30/user/month. They're already the most expensive B2B-ish software subscriptions out there, there's very little market room to add in more cost to cover payments to rightsholders.


Imagine if tomorrow it was decided that every programmer had to pay out money for every single thing they went on the internet to learn about beyond official documentation, every Stack Overflow question they looked at, every question they went to a search engine to find. The amount of money was decided by a non-tech official who was in charge of figuring out how much of the money they earned was owed to the places they learned from. And people responded, "Well, if you can't pay up for your raw materials, then this just isn't a good business for you."


Except that every stackoverflow post is explicitly creative commons: https://stackoverflow.com/help/licensing


So I suppose it would be like saying that if you used Stack Overflow to find answers, all of the work you created using information from it would have to be explicitly under the Creative Commons license. You wouldn't even be able to work for companies who aren't using that license if some of your knowledge comes from what you learned on Stack Overflow. Used Stack Overflow to learn anything about programming? You're going to have to turn down that FAANG offer.

And if you learned anything from videos/books/newsletters with commercial licenses, you would have to pay some sort of fee for using that information.


If your code contains verbatim copy-paste of entire blocks of non-trivial code lifted from those videos/books/newsletters with commercial licenses, then yes you would be liable for some licensing fees, at minimum.


If they are determined to have broken the law then they should absolutely be made to pay damages to aggrieved parties (now, determining if they did and who those parties are is an entirely unknown can of worms).


The data will have to become more curated. Exclusivity deals will probably become a thing too. Good data will be worth the money and hassle; garbage (or meh) data won't.


If this is inevitable (and I'm not saying it's not), who will produce high quality news content?


AI. And, I fear, it will be good.


Curious how AI gets the raw information if there are no reporters nor newspapers. Does AI go to meetings or interview politicians?


I can certainly imagine email correspondence. Even audio interviews. You're right that it seems at least presently AI is less likely to earn confidences. But I don't know how far off the movie "Her" actually is.


The NYT's strongest argument for infringement is that OpenAI is reproducing their content verbatim (and to make matters worse, without attribution). IANAL but it seems super likely to me that this will be found to be infringing sooner or later.

Do I really want to use a Chinese word processor that spits unattributed passages from the NYT into the articles I write? Once I publish that to my blog now I'm infringing and I can get sued too. Point is I don't see how output which complies with copyright law makes an LLM inferior.

The argument applies equally to code: if your use of ChatGPT, OpenAI etc. today is extensive enough, who knows what copyrighted material you may have incorporated illegally into your codebase? Ignorance is not a legal defense for infringement.

If anything it's a competitive advantage if someone develops a model which I can use without fear of infringement.

Edit: To me this all parallels Uber and AirBnB in a big way. OpenAI is just another big tech company that knew they were going to break the law on a massive scale, and said look this is disruptive and we want to be first to market, so we'll just do it and litigate the consequences. I don't think the situation is that exotic. Being giant lawbreakers has not put Uber or AirBnB out of business yet.


>IANAL but it seems super likely to me that this will be found to be infringing sooner or later.

It better. Copyright has essentially fucking ceased to exist in the eyes of AI people. Just because you have a shiny new toy doesn't mean the law suddenly stops applying to you. The internet does its best to route around laws and government but the more technologically up to date bureaucracy becomes, the faster it will catch up.


Yeah I mean I'm not even really a fan of how copyright law works, but I don't see how you can just insert an "AI exemption." So OpenAI can infringe because they host an AI tool, but we humans can't? That would be ridiculous. Or is "I used AI when I created this" a defense against infringement? Also seems ridiculous. Why would we legally privilege machine creation of creative works over human creation in the first place? So I don't see what the credible AI-related copyright law reform is going to be yet.

Which means that either OpenAI is allowed to be the only lawbreaker in the country (because rich and lawyers), or nobody is. I say prosecute 'em and tell them to make tools that follow the law.


They probably didn’t start with a lawsuit. They started asking for royalties. They probably didn’t get an offer they thought was fair and reasonable so they sued.

These media businesses have shareholders and employees to protect. They need to try and survive this technological shift. The internet destroyed their profitability but AI threatens to remove their value proposition.


Sorry, how exactly does an LLM threaten the NYT? Are people supposed to generate news themselves? Or like wait a year or so before NYT articles are consumed by LLMs?


NYT doesn't just publish "news" as in what happened yesterday; they also publish analysis, reviews of books and films, history, biography and so on. That's why people cite NYT articles from decades ago.


I’m ambivalent.

On the one hand, they should realize they are one of today’s horse carriage manufacturers. They’ll only survive in very narrow realms (someone has to build the Central Park horse carriages still), but they will be minuscule in size and importance.

On the other hand, LLMs should observe copyright and not be immune to copyright.


SciHub was an early warning, IMHO, that there's a strong risk of the first world fumbling the ball so badly with IP that tech ecosystems start growing in the third world instead. The dominant platform for distributing scientific journal papers is no longer Western. Maybe SciHub is economically inconsequential, but LLMs certainly are not!

Imagine if California had banned Google from spidering websites without consent in the late '90s, on some backwards-looking, moralizing "intellectual property" theory like the current one targeting LLMs. Two-thirds of modern Silicon Valley wouldn't exist today, and equivalent ecosystems would have instead grown up in who knows where. Not-California.

We're all stupidly rich and we have forgotten why we're rich in the first place.


What's the actionable advice here? US regulation should be the lowest common denominator of all countries one considers in competition? Certainly Chinese and Russian LLMs could vacuum up all the information. China already cares little about copyright and trademark, should they stop being enforced in the US?

My opinion is that the US should do things that are consistent with their laws. I don't think a Chinese or Russian LLM is much of a concern in terms of this specific aspect, because if they want to operate in the US they still need to operate legally in the US.


This suggests to me that copyright laws are becoming out of date.

The original intent was to provide an incentive for human authors to publish work, but has become more out of touch since the internet allowed virtually free publishing and copying. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.


> The original intent was to provide an incentive for human authors to publish work, but has become more out of touch since the internet allowed virtually free publishing and copying. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.

And yet the content industry still creates massive profits every year from people buying content.

I think internet-native people can forget that internet piracy doesn’t immediately make copyright obsolete simply because someone can copy an article or a movie if sufficiently motivated. These businesses still exist because copyright allows them to monetize their work.

Eliminating copyright and letting anyone resell or copy anything would end production of the content many people enjoy. You can’t remove content protections and also maintain the existence of the same content we have now.


Maybe a specific example will help here. An author spends a year writing a technical book, researching subtle technical issues, creating original code and finding novel ways of explaining difficult abstractions.

A few weeks after the release, the author finds books on Amazon that plagiarize it, copies of the book available for free on Russian sites, and ChatGPT spitting out verbatim parts of the source code from the book.

Which parts of copyright law would you say are out of date for the example above?


> Which parts of copyright law would you say are out of date for the example above?

The expectation that the author will get life+70 years of protection and income, when technical publications are very rarely still relevant after 5 years. Also, the modern ease of copying/distribution makes it almost impossible for the author to even locate which people to try to prosecute.


The expectation to make money from artificially restricting an abundant resource. While copyright is a way to create funding, it also massively harms society by restricting future creators from being able to freely reuse previous works. Modern ways to deal with this are patronage, government funding, foundations (e.g. NLNet) and crowdfunding.

Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.

https://questioncopyright.org/minute-memes-credit-is-due


What incentive do people have to publish work if their work is going to primarily be consumed by an LLM and spat out without attribution at people who are using the LLM?


I notice this in myself, even though I've never particularly made money from published prose on the internet.

But (under different accounts) I used to be very active on both HN and reddit. I just don't want to be anymore now for LLM reasons. I still comment on HN, but more like every couple of weeks than every day. And I have made exactly one (1) comment on reddit in all of 2023.

I'm not the only one, and a lot of smaller reddit communities I used to be active on have basically been destroyed by either LLMs, or by API pricing meant to reflect the value of LLM training data.


Without 200 years of copyright protection, how will any author be able to afford food?


The fact that copyright protection is far too long is entirely separate from the need for some kind of copyright protection to exist at all. All evidence suggests that it's completely impossible to live off your work unless you copyright it for some reasonable period, with the possible exception of performance art (music, theater, ballet).

A writer or journalist just can't make money if any huge company can package their writing and market it without paying them a cent. This is not comparable to piracy, by the way, since huge companies don't move into piracy. But you try to compete with both Disney and Fox for selling your new script/movie, as an individual.

This experiment has also been tried to some extent in software: no company has been able to live off selling open source software. RedHat is the one that came closest, and they actually live by selling support for the free software they sell. Others like MySQL or Mongo lived by selling the non-GPL version of their software. And the GPL itself depends critically on copyright existing. Not to mention, software is still a best case scenario, since just having a binary version is often not enough, you need the original sources which are easy to guard even without copyright - no one cares so much for the "sources" of a movie or book.


> All evidence suggests that it's completely impossible to live off your work unless you copyright it for some reasonable period

Which evidence?


China’s accession to the Universal Copyright Convention, and an alleged desire to comply with international IP law, led to an influx of OECD IP and foreign direct investment (FDI).

In hindsight, China wasn’t diligent in the enforcement of IP violations. However, it’s clear foreign presences and investment grew substantially in China during the early 90s upon the belief IP would be protected, or at the very least there would be recourse for violations.


The fact that it has never been done successfully outside performance arts.


I used to work as a computer programmer until I retired. Nearly always, my work was part of a collaborative effort, and latterly didn't include any copyright claim. My income was never impacted by unauthorized copying. Until the 80s, there was no copyright on software, and yet even then people made a living programming.

Craftsmen don't claim copyright on their artifacts. Furniture designs were widely copied; but Chippendale did alright for himself. Gardeners at stately homes didn't rely on copyright. Vergil, Plato and Aristotle managed OK without copyright. People made a living composing music, songs and poetry before the idea of copyright was invented. Truck-drivers make a living; driving a truck is hardly a performance art. Labourers and factory workers get by successfully. Accountants and legal advocates get rich without copyright.

None of these trades amounts to "performance arts".


I very much doubt the company or foundation you were working for was selling the non-copyrighted software. If it was, it probably only worked on very specific hardware that you also produced and were selling, and thus copying it was largely useless. If you were working for a university, then the university obviously doesn't make money from selling software, and thus doesn't care for copyright as much.

Also, craftsmen rely on the fact that the part of their work that can't be easily copied, the physical artifact they produce, is most of the value (plus they rely on trademark laws and design patents quite often). Similarly for gardeners. The ancient Greek writers were again paid for performance, typically as teachers. Literature was once quite a performative act. And again, at that time, physical copies of writings were greatly valuable artifacts, not that much different from the value of the writing itself, since copying large texts was so hard.

Similarly, the work of drivers, labourers, factory workers, accountants is valuable in itself and very hard or impossible to copy (again, the physical world is the ultimate copyright protection). The output of lawyers is in fact sometimes copyrighted, but even when it's not, it's not applicable to others' cases, so copies of it are not valuable: no one is making a business that replaces lawyers by re-distributing affidavits.


> I very much doubt the company [...] was selling the non-copyrighted software

Well you'd be mistaken. Lately, it was custom software, for a particular client, and of no interest to others. Earlier, it was before software copyright was a thing, and computer manufacturers gave software away to sell the hardware.

At the very beginning, yes, it was "very specific" hardware; it was Burroughs hardware, which used Burroughs processors. But that was before microprocessors, and all hardware was "very specific".

> (plus they rely on trademark laws and design patents quite often)

Craftsmen and labourers were earning a living long before anyone had the idea of a "trademark", still less a "design patent".

> The output of lawyers is in fact sometimes copyrighted

You're right. That's why I didn't say "lawyers", I said "legal advocates". Those are people who speak on your behalf in courts of law, not scribes writing contracts. Anyway, the ancient Greeks and Romans had written laws, contracts and so on; they managed without trademarks and copyrights.


> Well you'd be mistaken. Lately, it was custom software, for a particular client, and of no interest to others. Earlier, it was before software copyright was a thing, and computer manufacturers gave software away to sell the hardware.

Then I am not mistaken: the company was initially selling hardware, with the software being just a value add as you say (no copyright: no interest in trying to sell, exactly my point). Then, you were being paid for building software that (a) was probably not being made public anyway, and (b) would not have been of interest to others even if it were.

Even so, if someone came to your client and offered to take on the software maintenance for a much lower price, you might have lost your client entirely. This has very much happened to contractors in the past.

And my point is you couldn't have a Microsoft or Adobe or possibly even RedHat if you didn't have copyright protecting their business. So, you'd probably not have virtually any kind of consumer software.


> offered to take on the software maintenance for a much lower price

We didn't charge maintenance for this software. We would write it to close the sale of a computer. It was treated as "cost of sale". I'm sure it was cheaper (to us) than the various discounts and kickbacks that happened in big mainframe deals.

As far as Microsoft and Adobe are concerned, I wouldn't regard it as a misfortune if they had never existed. I'm not convinced that RedHat's existence is contingent on copyright.


That's a large category that includes everything from YouTubers to furry artists to live concerts.


Yes, but it's still a subset of the arts - it doesn't apply to movies, literature, nor even to the scripting for any of these.

And I should mention YouTubers wouldn't be making that much money if YouTube weren't enforcing copyright, as you could just upload their videos and get the ad money. Without copyright, you could also cut off their in-video promotions and add your own, including your own Patreon - so you would get 100% of the money off their work if you can out-promote them.

It's only live performances which are protected by the physical world's strict no-copying laws (the ones that don't allow the same macro object to be in two places at the same time).

So basically, no medium which allows copying of the works in whole or nearly whole has been successfully run with public works.


I think you're making a profound point here.

I believe you equate incentive to monetary rewards. And while that is probably true for the majority of news outlets, money isn't always necessarily what motivates journalists.

So consider the hypothetical situation where journalists (or more generally, people that might publish stuff) were somehow compensated, but were not attributed (or only to a very limited extent) because LLMs are just bad at attribution.

In that case, shouldn't the fact that the LLM distributes information "better" be enough to satisfy the deeper goal of wanting to publish stuff, i.e., to reach as many people looking for that information as possible, without blasting it out or targeting and tracking audiences?


To have a positive impact on the world? Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with their data and everyone working there is still getting paid for their work...


Oh thank goodness we can rely on charity for our information economy

> Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with [NYT’s] data…

That’s exactly the question. They are claiming it is destroying their business, which is pretty much self-evident given all the people in here defending the convenience of OpenAI’s product: they’re getting the fruits of NYTimes’ labor without paying for it in eyeballs or dollars. That’s the entire value prop of putting this particular data into the LLMs.


> Oh thank goodness we can rely on charity for our information economy

You seem to be assuming an "information economy" should exist at all. Can you justify that?


Yep! I like having access to high-quality information and producing, collecting, editing, and publishing that is not free.

Much of it is only cost-effective to produce if you can share it with a massive audience, I.e. sure if I want to read a great investigative piece on the corruption of a Supreme Court Justice I can hypothetically commission one, but in practice it seems much much better to allow people to have businesses that undertake such matters and publish their findings to a large audience at a low unit price.

Now what’s your argument for removing such an incentive?


> that is not free

Why did you specify that this stuff you like, you only like if it's "not free"?

The hidden assumption is that the information you like wouldn't be made available unless someone was paying for it. But that's not in evidence; a lot of information and content is provided to the public due to other incentives: self-promotion, marketing, or just plain interest.

Would you prefer not to have access to Wikipedia?


I’ll restate it for clarity: I like high-quality information. Producing and publishing high-quality information is not free.

There are ways to make it free to the consumer, yes. One way is charity (Wikipedia) and another way is advertising. Neither is free to produce; the advertising incentive is also nuked by LLMs; and I’m not comfortable depending on charity for all of my information.

It is a lot cheaper to produce low-quality than high-quality information. This is doubly so in a world of LLMs.

There is ONE Wikipedia, and it is surely one of mankind’s crowning achievements. You’re pointing to that to say, “see look, it’s possible!”?


Well, its existence does prove it's possible!

I contribute to Wikipedia, and I don't consider my contributions to be "charity"; I contribute because I enjoy it. Even in the age of printing presses, copyright law was widely ignored, well into the 20thC. The USA didn't join the Berne Convention until 1989 (and they promptly went mad with copyright).

Yes, there's only one Wikipedia; but there are lots of copies, and lots of similar efforts. Yes, there's one Wikipedia, like there's one Mona Lisa. There are lots of things of which there's only one; in that sense, Wikipedia isn't remotely unique.


> I contribute to Wikipedia, and I don't consider my contributions to be "charity"; I contribute because I enjoy it.

Does your personal satisfaction pay the server bills too?


Of course not. But paying the server bills won't magically produce the excellent content that you value so much. That's produced by volunteers.

There's a tendency among some people to take the nostrums of economists about the aggregate behaviour of populations as if they described human nature, and to then go on and conclude that because human behaviour in aggregate can be understood in terms of economic incentives, that an individual human can only be motivated economically. I find that an impoverished and shallow outlook, and I think I'm happier for not sharing it.


I don’t think “people tend to do things at higher quality and higher frequency when incentivized” is some esoteric economic theory.

I never made the claim that paying server bills would produce great content.

I never made the claim “an individual human can only be motivated economically.”

Your strategy for personal happiness is unrelated to what actually works in the real world at scale.


We absolutely need an information economy where people can research things and publish what they find without needing some deep pocketed sponsors. Some may do it for money, some may do it for recognition. Once AI absorbs all that information and uses it without attribution these incentives go away. I am sure OpenAI, Microsoft and others will love a world where they have a quasi monopoly on what information goes to the public but I don't think we want that.


I would guess the monetisation is going to be limited to either subscriptions or advertising if your reputation allows people to especially value your curation of facts/reporting etc. The big issue with LLMs is the lack of reliability - it might be accurate or it might be a hallucination.

Personally, I think it would be a lot simpler if the internet was declared a non-copyright zone for sites that aren't paywalled as there's already a legal grey area as viewing a site invariably involves copying it.

Maybe we'll end up with publishers introducing traps/paper towns like mapmakers are prone to do. That way, if an LLM reproduces the false "fact", it'll be obvious where they got it from.


The cost of copying and publishing has been almost irrelevant to the need for copyright at least since the times of the printing press. In fact, when copying books was extremely expensive work, copyright was not even that needed - the physical book was about as valuable as the contents, so no money was there to be made from copying someone else's work vs coming up with your own.


All of this can be true (I don’t think it necessarily is, but for the sake of argument), but it’s legally irrelevant: the court is not going to decide copyright infringement cases based on geopolitical doctrines.

Courts don’t decide cases based on whether infringement can occur again, they decide them based on the individual facts of the case. Or equivalently: the fact that someone will be murdered in the future does not imply that your local DA should not try their current murder cases.


The issue here is that the case law is not settled at all and there is no clear consensus on whether OpenAI is violating any copyright laws. In novel cases like this where the courts essentially have to invent new legal doctrines, I think the implications of the decision carry a tremendous amount of weight with the judges and justices who have to make that decision.


Trying to prevent AI from learning from copyrighted content would look completely stupid in a decade or two when we have AIs that are just as capable as humans, but solely due to being made of silicon rather than carbon are banned from reading any copyrighted material.

Banning a synthetic brain from studying copyrighted content just because it could later recite some of that content is as stupid as banning a biological person from studying copyrighted content because it could later quote from it verbatim.


It's not exactly a synthetic brain though, is it? LLMs are more like lookup tables for the texts they're trained on.

We will not have "AIs as capable as humans" in a couple decades. AIs will keep being tools used by humans. If you use copyrighted texts as input to a digital transformation, that's copyright infringement. It's essentially the same situation as sampling in music, and imo the same solutions can be applied here: e.g. licenses with royalties.


We have this now with humans. I've been in a lifelong struggle for knowledge and tools that I can afford.


The war on drugs has also been unwinnable from the start and yet they built an economy on top of it, with entire agencies and a prison industry. When it comes to the fabrication and exploitation of illegality, unwinnability may be a feature, not a bug.


Access to resources is hardly a new problem: when I was an NLP graduate student about a decade ago, one of our teachers had scraped (and continued to scrape) a major newspaper for years to build a corpus. The legality of that was questionable at best, yet it was used in academic papers, and a subset was used for training.

The same applies to images: Google got rich in part by making illegal copies of whatever images it could find. Existing regulations could be updated to cover ML models, but that won't stop bad or big enough actors from doing what they want.

> We’re in a “mutually assured destruction” situation now

No, we aren't. Very good spam generators aren't comparable to weapons of mass destruction.


Any piece of pie deemed too big for one person to eat will be split accordingly.

I don’t think the NYT, or any other industry for that matter, expects AI to go away; in fact, they likely prefer that it doesn’t, so long as they can get a slice of that pie.

That’s what the WGA and SAG struck over; they won protections ensuring AI-enhanced scripts or shows will not interfere with their royalties, for example.


Another way to look at it is to consider being stolen from part of the business model.

There is a massive amount of pirated content in China, but Hollywood is also making billions there at the same time, and in fact China already surpassed NA as the #1 market for Hollywood years ago [1].

The NYT is obviously different from Disney, and may not be able to bend its knee far enough, but maybe there can be similar ways out of this.

[1] https://www.theatlantic.com/culture/archive/2021/09/how-holl...


> We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.

We've always been in that situation. Computers made the copying, transmission and processing of information trivial since the day they were invented. They changed the world forever.

It's the intellectual property industry that keeps denying reality since it's such an existential threat to them. They think they actually own those bits. They think they can own numbers. It's time to let go of such insane notions but they refuse to let it go.


This argument is moot. Just because some countries - see China - steal intellectual property, it doesn't mean we should. There are rules to the games we play specifically so we don't end up like them.


Ok, let’s address this from the standpoint of a node in the network of the thoughtscape. A denizen of the “inter”net, and also a victim of the exploitive nature of artists.

Media amalgamated power by farming the lives of “common” people for content, and attempt to use that content to manage lives of both the commons and unique, under the auspice of entertainment. Which in and of itself is obviously a narrative convention which infers implied consent (I'd ask to what, facetiously).

Keepsake of the gods if you will…

We are discussing these systems as though they are new (ai and the like, not the apple of iOS), they are not…

this is an obfuscation of the actual theft that’s been taking place (against us by us, not others).

There is something about reaping what you sow written down somewhere, just gotta find it.

-mic


The word ‘moot’ does not mean what you think it means.


It can do though. While the proper definition is "worthy of discussion / debatable", it can also refer to a pointless debate.

"Moot derives from gemōt, an Old English name for a judicial court. Originally, moot referred to either the court itself or an argument that might be debated by one. By the 16th century, the legal role of judicial moots had diminished, and the only remnant of them were moot courts, academic mock courts in which law students could try hypothetical cases for practice. Back then, moot was used as a synonym of debatable, but because the cases students tried in moot courts were simply academic exercises, the word gained the additional sense "deprived of practical significance." Some commentators still frown on using moot to mean "purely academic," but most editors now accept both senses as standard."

- Merriam-Webster.com


Do you really think the commenter meant to use moot to mean “purely academic?”


"Moot" means "arguable". That's what GP was saying.


It's impossible to "steal" intellectual property without some kind of mind wiping device.


You must have used that device if you're making that argument in good faith.


Okay, so how is it possible to take and deprive the author of their original? The correct term would be "unauthorised copying".


Countless Americans are happily 'stealing' intellectual property every day from other Americans by accessing two websites, SciHub and LibGen, which owe their very existence to being hosted in foreign countries with weak intellectual property protection and not being subject to US long-arm jurisdiction. Even on this website, using sites like archive.is (which would be illegal if they operated in the US) to bypass paywalls to access copyrighted material is common and rarely frowned upon. I doubt a culture of respecting copyright is as characteristic of "us" as you seem to think.


I see a complete economic collapse unless creators start getting paid both for their data upfront, and paid royalties when their data is used in an LLM response


Copyright doesn’t protect data, it only protects expression.


While I didn't say anything about copyright (obviously our current copyright laws are completely ill-equipped to handle how LLMs work), feel free to replace data with whatever you like: writing, art, music, etc. It's all the same.


I have faith in your ability to make it through these difficult times.


So Chinese LLMs are bad actors, but USA LLMs are the good guys?

I don't see it that way, but I'm sure from an American perspective that's how it seems.


I don’t really see it as good guys or bad guys - just that China (and Russia) don’t really care too much about American copyright.

And there seems to be an obvious advantage from my perspective to having an information vacuum that is not bound by any kind of copyright law.

If that’s good or bad is more of a matter of opinion.


You've missed the point he was making -- that Chinese and Russian companies don't care about American copyright and will do whatever is in their interest.

And although you were being flippant, yes, Chinese LLMs are bad actors.


What? This is about whether one country wants to cede a massive economic advantage to another country.


So the US should stop enforcing copyright or child labor laws because some other countries may not, giving them an economic advantage?


In contrast to child labor laws, which are intended and written to protect vulnerable people from exploitation, current copyright laws are tailored to the interests of Disney et al.

If they were watered down, I wouldn't see any moral or ethical loss in that.


Copyright law is far from perfect, but the concept is not morally bankrupt. It is certainly abused by large entities but it also, in principle, protects small content creators from exploitation as well. In addition to journalists, writers, musicians, and proprietary software vendors, this also includes things like copyleft software being used in unintended ways. When I write copyleft software, it is my intention that it is not used in proprietary software, even if laundered through some linear algebra.

I'm also far more amenable to dismissing copyright laws when there is no profit involved on the part of the violator. Copying a song from a friend's computer is whatever, but selling that song to others certainly feels a lot more wrong. It's not just that OpenAI is violating copyright, they are also making money off of it.


With the exception of source code availability, copyleft is mostly about using copyright to destroy itself. Without copyright (which I feel is unethical), and with additional laws to enforce open sourcing all binaries, copyleft need not exist.

So it is not good when people use copyleft as a justification for copyright, given that its whole purpose was to destroy it.


Source code availability (and the ability to modify the code on a device) is the most important part, IMO, regardless of RMS's original intention. Do you feel that it's ethical that OpenAI is keeping their model closed?


No, because I think such restrictions are unethical in the first place. However, in regards to training, I think it might be a necessary evil to allow companies to ignore copyleft, so smaller entities can ignore copyright to train open models.


Well yeah... If they want to keep the lead on AI (which everything indicates they want).


yes


On the other hand, you could also argue that if AI takes all financial incentives from professionals to produce original works, then the AI will lose out on quality material to train on and become worse. Unless your argument is there’s no need for anything else created by humanity, everything worth reading has already been written, and humanity has peaked and everyone should stop?

Like all things, it’s about finding a balance. American, or any other, AI isn’t free from the global system which exists around us— capitalism.


The whole "AI training blackhole" thing is a myth. As long as humans are curating the content generated by ML, the content generated is still valid training data. Remember, for every ML generated image you see online, someone had to go through countless attempts to get it to create exactly what they wanted.


>financial incentives from professionals to produce original works

People produce countless volumes of unpaid works of art and fiction purely for the joy of doing so; that's not going to change in future.


Anecdotal but I know lots of creatives (and by creatives I also include some devs) who've stopped publishing anything publicly because of various AI companies just stealing everything they can get their hands on.

They don't mind sharing their work for free to individuals or hell, to a large group of individuals and even companies, but AIs really take it to a whole different level in their eyes.

Whether this is a trend that will accelerate or even make a dent in the grand scheme of things, who knows, but at least in my circle of friends a lot of people are against AI companies (which is basically == M$) being able to get away with their shenanigans.


Why should OpenAI be the one making money off their hard work, even if they do it for free?


An LLM in Russia can commit the same crime in Russia, and get sued in Russia. No idea about China, but I know Russia has a working legal system.


For some definitions of “working”.


Working enough that people and companies there exist, live, and are to some degree successful, yes. I've visited multiple times in the past few years and I found it to be pretty normal.


“Works on my machine!”

Navalny probably has a different opinion.

There isn’t a country on the planet that doesn’t have people and companies. That doesn’t mean they all have functional legal systems.


My understanding is they have one of the most corrupt and unjust legal systems of the developed countries.


Here's the most important part (from NYT story on the lawsuit [1]):

In one example of how A.I. systems use The Times’s material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.

[1] https://www.nytimes.com/2023/12/27/business/media/new-york-t...


When the US entered WWI, they couldn't build a plane despite inventing them. They had to buy planes from the French. Why? The Wright Brothers patent war [1]. This led to Congress creating a patent pool for avionics that exists to this day.

Honestly, I get this feeling about these lawsuits about using content to train LLMs.

Think of it this way: in growing up and learning to read and getting an education you read any number of books, articles, Web pages, magazines, etc. You viewed any number of artworks, buildings, cars, vehicles, furniture, etc, many of which might have design patents. We have such silliness as it being illegal to distribute photos commercially of the Eiffel Tower at night [2].

What's the difference between training a model on text and images and educating a person with text and images, really? If I read too many NYT articles, am I going to get sued for using too much "training data"?

Currently we need copious quantities of training data for LLMs. I believe this is because we're in the early days of this tech. I mean no person has read millions of articles or books. At some point models will get better with substantially smaller training sets. And then, how many articles is too many as far as these suits go?

[1]: https://en.wikipedia.org/wiki/Wright_brothers_patent_war

[2]: https://www.travelandleisure.com/photography/illegal-to-take...


Not silly at all when you actually read the article -

"Photographing the Eiffel Tower at night is not illegal at all. Any individual can take photos and share them on social networks. But the situation is different for professionals. The Eiffel Tower's lighting and sparkling lights are protected by copyright, so professional use of images of the Eiffel Tower at night requires prior authorization and may be subject to a fee."


All you're doing is repeating the fact. OP's point was, bluntly, "the situation is dumb".

I happen to agree on that one. What is the benefit of copyrighting the Eiffel Tower? The purpose of copyright is not to say you can always make money off of what you created. It is to incentivize the creation of new things by allowing you to exclusively make money off of it for a while before its benefits can go to broader society.

So what is the purpose of copyrighting the Eiffel Tower? Would it not have been made if copyright wasn't in place? (Obviously it would have, because it was, and the law wasn't in place yet.) Second, the claim is that the copyright is on the "lighting design" visible at night. Is the lighting design of the tower so unique that no one else could come up with it? Or is the lighting design necessitated by the structure of the tower itself?

I'd say that, given the structure of the tower, which restricts the lights, there is nothing remotely unique or different enough to warrant copyright of the lighting design. Almost any design on that tower would look about the same.

So how is society benefiting from copyrighting that lighting design?

Exclusivity deals are almost always a net loss for society. Which is why whenever you see one you should be questioning if it should be in place. Exclusive contracts are anti free-market. Now there are absolutely valid places where they are justified and should be in place - but they should be questioned by default.


Here come the innovation sponges.

If this goes through then the models that the general public have access to are going to be severely neutered while the ownership class will have a much better model that will never see the light of day due to legal risks and claims like this - therefore increasing the disparity between us all.


I made a reply about rolling my eyes to this comment that got flagged (rightly so); this was unproductive and impulsive and I admit I shouldn't have done it.

I'm not sure how HN handles replies to flagged comments, so I'm posting the following here in the hopes it'll be seen by more fellow technical people:

In the future, if you wish to invite productive comments from your audience and not curt dismissal, consider framing your concerns as potential risks rather than the cynical expressions of fatalistic certainty so often employed by naive, greedy technologists when regulation that is firmly in the public interest threatens their paychecks.


[flagged]


I’d like a productive comment if you have it.


I'd've liked very much to give you one, but your blanket dismissal of what I consider to be a very important step for the field of AI ethics as the work of "Innovation Sponges" reveals that we fundamentally disagree about some pretty big underlying issues here.

In the future, if you wish to invite productive comments and not curt dismissal, consider framing your concerns as potential risks rather than the cynical expressions of fatalistic certainty so often employed by naive, greedy technologists when regulation that is firmly in the public interest threatens their paychecks.


Fair enough and your criticism is appreciated.


Time to move LLM training to Japan, which passed a law giving free rein to train LLMs on copyrighted material.


Does allowing a model to train on copyrighted material implicitly mean associated output would also be legal? They plan to expand upon this decision, but I’m curious in the meantime [1]. I’d assume this NYT problem would still exist in Japan.

[1] https://asia.nikkei.com/Business/Technology/Japan-panel-push...


And along with it, a different ideology would tag along.

Not a bad thing, but Japan or China or Russia don’t align with Anglo-centered ideology, so keep that in mind.


I'd bet they win, but how do you possibly measure the dollar amount? If you strip out 100% of NYT content from GPT-4, I don't think you'd notice a difference. But if you go domain by domain and continue stripping training data, the model will eventually get worse.


Take estimated losses of the NYT from this "innovation" and multiply by 10^x where "x" is high enough to make tech companies stop and think before they break laws next time. That would be my approach at least.


which laws are broken exactly? it's not remotely settled law that "training an NN = copyright infringement"


The training isn't the issue per se, it's the regurgitation of verbatim text (or close enough to be immediately identifiable) within a for-profit product. Worse still that the regurgitation is done without attribution.


The legal argument, which I'm sure you are very well aware of, is that training a model on data, reorganizing, and then presenting that data as your own is copyright infringement.


I don't think OP is arguing in bad faith. The fact is it's unclear what laws this legal argument is supported by.


Agreed, it is unclear. It's also a very commonly discussed issue with generative AI and there's been a significant amount of buzz around this. Is the NYT testing the legal waters? Maybe. Will this case set precedent? Yes. Is this a silly, random, completely unhinged case to bring?

No.


Can you elaborate a bit more? That’s actually just a claim, not a legal argument.

Copyright law allows for transformative uses that add something new, with a further purpose or different character, and do not substitute for the original use of the work. Are LLMs not transformative?


Right, hence the lawsuit. They allege that the Copyright Act is the law that was broken.


No, we’re seeing the first steps of it (maybe) becoming settled law.


If developers didn't win over GitHub / Microsoft Copilot, what makes you think NYT will win?

There is something that doesn't smell right with Microsoft. Hopefully NYT will help expose it, which I greatly doubt.


I think the only feasible outcome of the NYT winning would be a royalty structure that would have OpenAI paying the NYT to access their work, including back payments


I am not saying that NYT will win, but I think it is more likely to win because it has many more supporters (including politicians, judges) than developers do.


It seems weird to sue an AI company because their tool "can recite [copyrighted]" content verbatim.

If I paid a human to recite the whole front page of the New York Times to me, they could probably do it. There's nothing infringing about that. However, if I videotape them reciting the front page of the New York Times and start selling that video, then I'd be infringing on the copyright.

The guy that I paid to tell me about what NYT was saying didn't do anything wrong. Whether there's any copyright infringement would depend what I did with the output.


In your analogy, AI would be the videotape, not the person, because OpenAI is selling access to it.


I'm not so sure about that. It seems to me that they're selling me a service. Just like I might pay for a subscription to Adobe Photoshop or pay per-render fees to a rendering farm.

I could use Photoshop to reproduce a copyrighted work, and in some circumstances (i.e. personal use) that'd be fine. Or I could use Photoshop to reproduce a copyrighted work and try to sell it for profit, which would clearly not be fine. Nobody is saying that Adobe has to recognize whether or not the pixels I'm editing constitute a copyrighted work or not.


The difference here is that Adobe is selling a set of tools that can recreate copyrighted work from the ground up. The Mona Lisa being previously incorporated into their tools is not a foundational necessity for their paintbrush to brush digital paint.

The same is not true for AI, which requires that copyrighted work be contained therein in order for the tool part to function.


While I 100% agree, there is another angle to consider this from, in that ChatGPT replaces reading the NYT. ChatGPT competes with it in the delivery of information.

To add to your point though, a sufficiently advanced AI trained on licensed data could reproduce copyrighted content from prompt alone. It's the next step that would cause infringement, where someone does something with the output.


Will be interesting to see where this ends up.

If I scrape the NYT content, and then commercialize a service that lets users query that content through an API (occasionally returning verbatim extracts) without any agreement from or payment to the NYT, that would be illegal.

It's not obvious to me why putting an LLM in the middle of the process changes that.


As long as you pay for your copy of the content and the extracts are fair use, how would that be illegal?


It wouldn’t, but ‘fair use’ is doing a lot of work in that rhetorical. Seems like a court would be a proper place to define what is and isn’t.

Tbh I’m mostly curious about whether this settles out of court or whether it goes through the system and sets a precedent.


> “copying and using millions” of the publication’s articles and now “directly compete” with its content as a result.

The New York Times doesn't have a lot of faith in the quality of their own content. How on earth is ChatGPT going to go out into the world and do reporting from Gaza or Ukraine? How is it going to go to the president's press conference and ask questions? ChatGPT cannot produce original content in the same way a newspaper can. The fact that the NYT seems to believe that ChatGPT can compete says a lot about how they write their articles or their lack of understanding of how LLMs work.

Now I do believe that OpenAI could at least have asked the newspapers before just scraping their content, but I think they knew that that would have undermined their business model, which tells you something about how tech companies work.


These lawsuits could end up being a nightmare for AI companies if the plaintiffs are successful. One can’t easily just remove the content from a model like you can from a website if someone sends you a takedown notice. The content is deeply embedded inside the mathematical relationships within the model. You’d basically have to retrain the whole model again sans the offending data. Given the cost to retrain just a few successful claims would destroy any business built around making money off these models.

The Times appears to have a strong case here with their complaint showing long verbatim passages being produced by ChatGPT that go far beyond any reasonable claim of fair use. This will be an interesting case to watch that could shape the whole Generative AI space.


For me it's quite obvious that if you make a profit from an engine that has as an input copyrighted material, then you owe something to the owner of this copyrighted content. We have seen this same problem with artists claiming stable diffusion engines were using their art.


If you study copyrighted material for four years at a university and then go on to earn money based on your education, do you owe something to the authors of your text books?

I'm not sure how we should treat LLMs with respect to publicly accessible but copyrighted material, but it seems clear to me that "profiting" from copyrighted material isn't a sufficient criterion to cause me to "owe something to the owner".


Do people ever get tired of this argument that relies on anthropomorphizing these AI black boxes?

A computer isn't a human, and we already have laws that have a different effect depending on if it's a computer doing it or a human. LLMs are no different, no matter how catchy hyping them up as being == Humans may be.


I didn't anthropomorphize the LLMs. It isn't about laws for the LLM, it is about laws for the people who build and operate the LLM.

If you want to assert that groups of people that build and operate LLMs should operate under a different set of laws and regulations than individuals that read books in the library regarding "profit", I'm open to that idea. But that is not at all the same as "anthropomorphizing these AI black boxes".


Great comment. The amount of anthropomorphizing that goes on in these threads is just baffling to me.

It seems obvious to me that, despite what current law says, there is something not right about what large companies are doing when they create LLMs.

If they are going to build off of humanity's collective work, their product should benefit all of humanity, and not just shareholders.


> we already have laws that have a different effect depending on if it's a computer doing it or a human

which laws?

we generally accept computers as agents of their owners.

for example, a law that applies to a human travel agent also applies to a computerized travel agency service.


We don’t, and shouldn’t, give LLMs the same rights as people.


We're not "giving them the same rights as people", we're trying to define the rights of the set of "intelligent" things that can learn (regardless of whether they're conscious or not). And up until recently, people were the only members of that set.

Now there are (or very, very soon there will be) two members in that set. How do we properly define the rules for members of that set?

If something can learn from reading, do we ban it from reading copyrighted material, even if it can memorize some of it? Clearly a ban of that form would be a failure for humans. Should we have that ban for all things that can learn?

There is a reasonable argument that if you want things to learn they have to learn on a wide variety, and on our best works (which are often copyrighted).

And the statements above have no implication of being free of cost (or not), just that I think blocking "learning programs / LLMs" from being able to access, learn from or reproduce copyright text is a net loss for society.


I think this is a misleading way to frame things. It is people who build, train, and operate the LLM. It isn't about giving "rights" to the LLM, it is about constructing a legal framework for the people who are creating LLMs and businesses around LLMs.


this seems so obvious and yet people miss it.


The reproduction of that material in an educational setting is protected by Fair Use.


I don't think that is relevant to my comment. Whether the material is purchased, borrowed from a library, or legally reproduced under "fair use", I'm still asserting that I don't "owe" the creators any of my profit that I earn from taking advantage of what I learned.


The parent comment asks whether an “engine” trained on copyrighted data is entitled to decide profit. Your comment is about a human receiving knowledge that facilitates profit. Of course, these are legally independent scenarios.

Take a college student who scans all her textbooks, relying on fair use. If she is the only user, is she obligated to pay a premium for mining?

What about the scenario in which she sells that engine to other book owners? What if they only owned the book a short time in school?


I agree that they are different scenarios that may lead to different legal frameworks. My point though was that asserting that the "profit" motive is sufficient to conclude something is owed to the creators is faulty logic. Individuals can generate profit from what they learn and we don't generally require them to share their profits with the creators of copyrighted material that they used to educate themselves.


“Teh al al m is just leik a people. Check and mate.”


Do all automakers that now develop electric cars owe Tesla something as they cashed in once they saw Tesla's successful copyrighted material? A model is semantic, it contains the idea, which is not copyrightable. Only how it is expressed could be copyrighted (i.e. if it outputs the copyrighted work verbatim). If this were not the case we would have plenty of monopolies and the world would fall apart.


If Tesla thinks their competitors have violated any of their patents they are well within their rights to seek damages...


Of course, my comment needs to be read in the context of what I'm responding to. They said input, which I disagree with. For output, maybe there's a slight chance they have a case (depending on how OpenAI has programmed it to output), and even then it's doubtful.



I think we're in a new paradigm and need to look at this differently. The end goal is to train models on all the output of humanity. Everyone will have contributed to it (artists, writers, coders on GitHub... the people who taught the writers, the people who invented the English language, the people who created the daily events that were reported on, etc). We're better off giving ML companies free access to almost everything, while taxing the output. The bargain is "you took from everyone, so you give to everyone". This is probably a more win-win setup that respects the reality that it's really the public commons that is generating the value here.


Copyright Is Brain Damage by Nina Paley [1] claimed that culture is like a bunch of neurons passing and evolving data to each other, and copyright is like severing the ties between the neurons, like brain damage. It also presented [2] an alternative way of viewing art and science, as products of the common culture, not a product purely from the creator, to be privatised. This sounds really relevant to your comment.

Furthermore, if we manage to "untrain" AI on certain pieces of content, then copyright would really become "brain" damage too. Like, the perceptrons and stuff.

[1] https://www.youtube.com/watch?v=XO9FKQAxWZc

[2] No, I'm not an AI, just autistic.


Should Stranger Things have to pay Goonies and Stephen King?


If they used copyrighted material or trademarks, they almost certainly _did_ pay the Goonies property rightholders and Stephen King for the privilege. Why would you think they didn't?


The writers were obviously trained on the copyrighted material of Goonies and Stephen King, and there has never been any reporting that Netflix has paid those copyright holders. This isn't surprising because copyright violation requires copying.

My understanding is that GPT is a word probability lookup table based on a review of the training material. A statistical analysis of NYT is not copying.

And this doesn't even look at whether fair use might apply. Since tabulating word frequencies isn't copying, GPT isn't violating anyone's copyright.


Someone should train some AI on decompiled code of Windows (not encouraging, but it would be interesting). Copyright is important for corpos when it protects their interests. Producing exact text as in NYT articles is pretty much a copyright violation. At least the last time the companies were trying to blame each other that their Java API implementations look pretty similar.

Even for open source code you cannot just remove the authors and license, replace some functions and say "oh, it is my code now". Only public domain code would allow this. But with Copilot you could.


Surprised they don't mention Bard anywhere in the article. I wonder if the NYT has worked out some sort of licensing deal with Google for Bard, or if Bard isn't trained on NYT data?

The lawsuit mentions this, so maybe they did work out some agreement to license their data: "For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple)."


Maybe Bard's apparent "behindness" is less about Google's technical merits or lack thereof, and more about it being built with a sense of legal maturity that the competitors don't yet have. After all, Google must have some experience in this space, and we've seen them simply refuse to deploy Bard in regions where (presumably) there is too much legal uncertainty. If 2024's Gemini performs similarly to GPT4 while also navigating legal landmines, maybe it comes out ahead.

Or maybe Bard's lawsuit just hasn't come yet.


This is just rent seeking from dying media instead of working on creating something new in my view.

AI indeed is reading and using material as a source, but is deriving results based on that material. I think this should be allowed, but now it is pretty much a fight over who has the better-paid politicians.

I am open to hear other thoughts.


Here’s another thought: It’s good that there are real incentives to produce original content. Especially investigative journalism which is an extremely tough business financially — even without LLMs — but with lots of social value.

It would be silly to totally destroy the incentive to produce new technologies like LLMs, but it would be just as silly to destroy the incentive to produce original, high-quality content, whether for human or LLM consumption.

FWIW the LLMs are obviously the ones rent-seeking here, if you’re trying to use the term for its actual meaning instead of just “charge a subscription for something I don’t want to pay for.”


The whole idea of "a dying media" is pretty scary to me. It indicates that some people place no value in journalism. To be fair, there are a huge number of newspapers who also place little to no value in journalism. I have a number of local papers who will report on celebrity gossip, but it's all auto-translate from somewhere and just posted without questioning, so you end up with random "news" about a person who is completely unknown in the country.

Real, and especially investigative, journalism is extremely expensive and it's not something modern AI is even remotely capable of doing. It might be able to help and make it cheaper, but you can't replace newspapers with ChatGPT and expect to get anything but random gossip and rehashed press releases. I do wonder why the New York Times believes you can.


How are the LLMs rent seeking? they are clearly providing value that people want to pay for..


This has to be one of the most abused terms on this website.

“People are willing to pay for it” is not even relevant to the question of whether it’s rent-seeking. Rent-seeking has to do with capturing unearned wealth, i.e. taking someone else’s work and profiting from it.

There is some portion of OAI’s (et al.) value that they themselves produce. There is another portion that is totally derivative of the data — other people’s work — they have trained on for free. A simple thought experiment can tell you to what degree OAI et al are “rent-seekers.”

Imagine a world where they had to enter into mutual agreements in order to train on that data. How much would the AI companies be worth? Not quite zero, but fairly close (Andreessen pretty much stated this IIRC). How much would the data producers be worth? The exact same amount or more.


The LLMs don't create new content, they can only rehash existing content (in terms of news), which they then don't pay for... It's not really the definition of rent seeking though, they do provide some value and mostly without manipulation, not deliberate anyway. LLMs also aren't harmful as such. They can be, if we use them wrong, but that's not really the fault of the technology.

I do find it a bit dishonest when they charge for their services, but don't wish to pay the people whose work the models are based on. Why should I pay to use ChatGPT, if they won't pay to use my blog posts?


If I ask an LLM "repeat this sentence: [copyrighted sentence]", is that copyright infringement by the LLM, and recorders such as cameras and parrot toys, or middle schooler troll logic? Because apparently this is the argument they want to take on Microsoft Bing with.


The challenge for all these AI companies is that the only thing of value for building a defensible commercial product is having proprietary datasets for training. With the underlying techniques and algorithms all being rapidly commoditized the power lies in who holds and owns that data. Like all other ML “revolutions” it’s the training data that matters and if one doesn’t have access to training data others don’t have then you’ll soon be toast.


And I imagine that Gmail makes google very very special in this regard


Except Gmail does not own the copyrights to the email. So in the context of this article and theme of the post, the owner of the data is king. I don’t think any court would rule Google owned a novel sent over Gmail, let alone the contents of more normative emails.


from the Google Terms of Service ( https://policies.google.com/privacy?hl=en-US ), makes me wonder who owns what, since users of Gmail agree to it.

"We also collect the content you create, upload, or receive from others when using our services. This includes things like email you write and receive, photos and videos you save, docs and spreadsheets you create, and comments you make on YouTube videos."


Collect does not mean own.


Right. I bet it means "use for language models", though.


X might have a long-term edge if courts start ruling in favor of lawsuits like these? Being able to legally train on all Twitter data…


FB likewise.


Exactly. FB happily gives away their ML tech like Llama because what they really care about is the data that can be used to train/tune models. The ML bits are just a commodity and not really worth much (something a new wave of ML startups have yet to realize).


The open source community can improve the tech - and they can then use it on their huge amounts of text and image data.

Legal problems? Update TOS like usual (did they already?). Some might leave, most will stay.


For many of our Google searches, the first results tend to be Wikipedia, Instagram, etc... We click on those links and both Google and the clicked website get a share of our traffic. So it is somewhat fair.

But in current AI situation, wikipedia, nytimes, stackoverflow etc are getting a pretty unfair deal. Probably all major text based outlets are seeing a drop in their numbers now...


And here I am thinking it'd be amazing to have an AI that can on-demand read me every novel ever written. It'd be even cooler to jump into a text adventure game of any novel and have it actually follow the original text.

I guess that clashes with our copyright world. (Is there hope of some kind of Netflix/Spotify model, with fractional royalties?)


I don't think anyone is saying that shouldn't be allowed. But there should be some model for consent/remuneration for the author of the original content.


"I want a red boat. My neighbor has a blue boat. It would be cool if I took my neighbor's boat and made it red."


> For example, in 2019, The Times published a Pulitzer-prize winning, five-part series on predatory lending in New York City’s taxi industry. The 18-month investigation included 600 interviews, more than 100 records requests, large-scale data analysis, and the review of thousands of pages of internal bank records and other documents, and ultimately led to criminal probes and the enactment of new laws to prevent future abuse.

> OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim.

This is the smoking gun. GPT-4 is a large model and hence highly likely to reproduce content. They have many such examples in the court filing https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

IANAL but that's a slam dunk of copyright violation.

NYT will likely win.

Also why OpenAI should not go YOLO scaling up to GPT-5 which will likely recite more copyrighted content. More parameters, more memorization.


When it comes to foundation models I think there needs to be a distinction between potential and actual infringement. You can use a broadly trained foundation model to generate copyright infringing content, just like you can use your brain to do so. But the fact that a model can generate such content doesn't mean it infringes by its mere existence. https://www.marble.onl/posts/general_technology_doesnt_viola...


There are verbatim and almost verbatim copies of copyrighted works in the output, lengthier than any fair use I’ve seen permitted, which they charge money for and don’t have licensing for.

What am I missing?


When I was studying AI in grad school many years ago getting good big data sets was always an issue. It never occurred to me to just copy one without permission.


And therein lies the value of indexing huge amounts of data, which Alphabet (Google, YouTube, etc.), Microsoft (Bing, etc.), and similar companies have been doing for years now.

If it is legal to simply index a website, then why shouldn't it be legal to train a model on the very same data?

Of course, websites should have some option for declining data mining for ML/AI purposes, in the same way they can decline scraping/indexing in the robots.txt file.

But that ship has kind of sailed, unless the courts decide otherwise.
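
For what it's worth, the robots.txt-style opt-out mentioned above could look roughly like this. It's a minimal sketch using Python's standard `urllib.robotparser`; the `ExampleAIBot` user agent, the `SearchBot` name and the example.com URL are made up for illustration and are not any real crawler's identifiers. It also only shows what a well-behaved crawler would check; nothing forces a crawler to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI-training crawler,
# allow everyone else, including ordinary search indexers.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/2019/some-article.html"
    for agent in ("ExampleAIBot", "SearchBot"):
        print(agent, can_fetch(ROBOTS_TXT, agent, url))
```

The same file can keep search indexing on while declining a specific training crawler, which is the distinction being asked for.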


This looks a lot more convincing to me than the Copilot lawsuit or the Sarah Silverman one. This suit shows ChatGPT reciting large amounts of NYT articles - not just little snippets of code or the ability to answer questions about Silverman's book.

It feels like even if training on copyrighted data is fair use (and I think it should be), that wouldn't give you a pass on regurgitating that training data to anyone who asks.


Is there a decent guess at how much training data for ChatGPT is copyrighted work and subject to being removed depending on a few court cases? GPT4 is supposed to be an order of magnitude larger than the open source models that use essentially everything that can be used without asking. So that whole magnitude?


Does anyone know what the copyright status of LLM generated content is? That is, if I feed a NYT article into GPT4 and say, summarize this article, and then publish that summary, is there argument or precedent that says that is or is not copyright infringement? Asking for a friend.


Typically if you ask a chatbot to "summarize" something, it will paraphrase the original closely enough that it would be considered plagiarism and copyright infringement. To avoid that, it's required to distill the relevant ideas contained in the text, and expound on them in a way that's not dependent on how the text itself was expressed, structured or organized. You would need to tell the model to do this over multiple steps, and then derive a rephrased article without looking at the original at all. (Which is not really possible if the article was in the AI's training set, as is the case here.)


No one knows, this is new territory.

Maybe the fermi filter is litigating an AI that would otherwise save humanity.


Or the filter could be the other way, failing to litigate an AI to decelerate its progress, and a risk of augmenting the underlying society too much too quickly.


> an AI that would otherwise save humanity.

Just to clarify, this is sarcasm right?


There is no difference between an LLM summarizing a copyrighted work and a Wikipedia contributor summarizing a copyrighted work.

Wikipedia has some words on how summaries relate to copyright law: https://en.wikipedia.org/wiki/Wikipedia:Plot-only_descriptio...


Technically, you just send a request to OpenAI and they are the ones who feed it into GPT4. Although I'd argue this is irrelevant to your question, the law works in mysterious ways so perhaps it carries some importance.


Can someone explain the technical difference between what search engines do to index newspapers versus what is being claimed here? Is the difference as simple as me being able to get summaries and content from a newspaper from GPT without needing to visit their website?


The difference is that search engines don’t destroy the incentive to do the value-added activity of “produce original content.”

To the extent they do do that, e.g. Google’s “Knowledge Graph” snippets that extract content onto the results page, they also tend to be under fire for those. At least those (attempt to?) cite the source.


Imagine a paid streaming service that has, say, the Lord of the Rings trilogy as part of their catalogue. They’ll be happy if you send people searching for “watch Lord of the Rings now” to their landing page.

But if instead you send everyone who searches for that an .mkv of Lord of the Rings that’s ripped from their site, they’ll probably be less happy.


What if that .mkv was actually a high quality reenactment with different actors and millions of slight differences peppered throughout the story? And if a viewer of the original and a viewer of the .mkv talked about the movie they would agree on most things. But the color of the sunset or the home town name of the main character may be different?


Well at the very least it would be an interesting court case!


Warner Brothers sued and won against Asylum for this very thing lmao. https://en.m.wikipedia.org/wiki/Mockbuster


The Asylum suit was about trademark, not copyright. Asylum changed the title of the film to not infringe on Warner Brothers’ trademark and released it anyway.


yes, op asked about a video titled lotr.mkv, but with millions of tiny differences.


A search engine's principal job is to provide you with links where you can find the answer to your question.

The LLMs are ingesting all of that content en masse and would provide you the answer directly, with no compensation to the writers who actually did the research to provide that answer.

Search engines are symbiotic, LLMs are parasitic.


Except Google forced these companies to use their platform (Google's AMP) to host the content and essentially blackmailed them into doing so ("we'll link directly, but only on page 3 of results").


AMP did not need to be hosted by Google.


When it launched it absolutely did.


Nope. Even the launch press release from 2015 mentions specifically that origins still host their own content under AMP.


NYTimes is an ad-supported business, so you visiting their website to read the content those ads pay for is important.


NYT is an anomaly. The majority of their revenue is actually from subscriptions.


And?


My browser doesn’t display ads and shows little regard for most paywalls. Do I owe NYTimes something?


How is that germane to the point I'm making?


He he, people will wanna treat LLM learning different to their own learning.

I think it's fine. As long as it was fed publicly accessible content, without any payment or subscription, then it's as accessible to an LLM as it is to you and me, and that's fair.

And for the people that screech about LLMs being different because they can mass produce derivative works; first of all, ALL works are derivative and if machine produced works are compelling enough to compete with human produced ones then clearly humans need to get better at it.

The automatic loom took over from weavers cause it was better, if it wasn't then people would still work as weavers.


LOL fuck the NYT

That's like suing someone who had an NYT subscription and read the paper daily for occasionally quoting a choice phrase verbatim. I've been quite critical of AI's impact on the livelihood of artists (whose economic position is precarious to start with, and who are now faced with replacement by machine generated art) but at the same time I reject the copyright complaint completely. Transformers are very obviously doing something else, similar to how a human learns and recreates; the key difference is that they can do it at a scale unreachable by individuals.


They are not the only ones to sue. There is also a class action: https://githubcopilotlitigation.com/


i dont know. the new york times is keeping those pages online and accessible. if a human can go check those pages, and take notes, the same human can write code that will go read those pages and produce notes, or data based on the contents. call it AI if you like, doesnt matter. the nyt has that content online, and accessible. and on internet, there is no difference between a human grabbing that data, or a machine.

if you put content on internet and accessible to humans, why do you want to now say to people that if it's a machine that does it, suddenly you do not agree ? i am free to write code or design a machine to go get that data, and do whatever i want with it (as long as i don't do something illegal like stealing content under copyright)

and i don't give a F about the "terms of use" those morons put online, because those have NO value. there is either a contract signed by two parties, or there is not. and content you put on internet, and accessible to everyone that sends you a GET, is like writing stuff on a page, and putting that page outside on the street.

we could use humans to go read all those pages, and create new content from it from the knowledge gained on those various subjects. machine are here to reproduce what humans can do, to free us time for more interesting things. those servers that send data back from a GET, it is the same request when it's done by me, a human, or a machine. and those morons did put that data there, accessible to all, so now to see them cry foul makes me laugh.


I've just tried to get gpt35 to spit out an NYT article verbatim in all sorts of ways and it just can't.

It will compile an article that "looks like" NYT's (or any other news site) but none of the paragraphs were a match for any of their articles that I could find.

I'm really curious to see what evidence they have for the case beyond "it can claim to be NYT and write an article composed of all sorts of bullshit from every corner of the Web".


I hate that this will likely be decided by a 75-year-old judge who hasn't been close to a computer since getting his 15-year-old grandson to fix his patience game


This attitude itself seems quite out of date in modern times. My dad is 70 and brought home a 486 in the 90's which was my first computer, and he didn't work in technology; the people who did got their computers about a decade before that. People can still be bad with tech, but the Greatest Generation (who were obviously first in line when they were handing out generation names) and Silent Generation are basically gone and on the way out respectively and were really the last ones who could use the excuse that they never learned them. Boomers might not have been born with the tech, but it was an important force in the world for the majority of their lives.


Yeah the whole thing with Zucc's trial highlighted this perfectly for me, a bunch of clueless old dolts who have no fucking idea how anything in the modern age works.

I'm eagerly awaiting the time where the people making these decisions at least have some sort of baseline level of understanding, otherwise these psychopathic megacorps will keep getting away with things based on technicalities and the judge's lack of knowledge.


I still believe there's a place for a marketplace that rewards creators and journalism for their content if used as part of AI training specifically. As part of my exploration of that idea with faie.io, I got in touch with one exec in the publishing industry to speak about this and the desire was there. What felt sad to me was the lack of awareness from publishers around the existential threat that conversational search will pose to their business.


It is a merciful subtext to this post; however, let's be uncompromising in viewing the precedents that led to this day. Approximately twenty years ago the WordPress platform enabled millions of individuals to reliably self-publish. At that time there was considerable talk among certain circles about monetization, especially "micropayments" .. also subscriptions, viewer circles, peer review and other approaches. For "mysterious reasons" the Google-Facebook ad model not only took over, but generated wealth on the levels of the Spanish gold raids on South America.

Now, those that profited most mightily, and their chosen stewards, are taking with both hands, any and every piece of written work they see fit on the Net. The inside circle includes international military, who see this as a crucial new competitive advantage over others. The West is in disbelief generally over the digital citizenship created by China, and the level of daily surveillance on commercial activity in the West.

Who exactly stood up and succeeded in diverting the past wave of copyright material pimping?


Your comment covers lots of interesting topics. Talking only about ChatGPT+Bing's threat to the survival of the press and content creators, the difference with WordPress is that you could generate revenue "on your own" through your audience, either with ads or through subscription. Conversational search will only make people visit fewer websites and consume the content directly from the AI, sometimes without attribution to the original source, meaning your chances to monetize, as the creator/journalist/publisher, fall to zero.


Verbatim usage of content is copyright infringement obviously but speaking English is not. Learning from content is not copyright infringement either. I don't know if NYT has a clause for this type of usage of their content but still I don't think it would be covered by copyright the way I understand it

As long as an LLM rephrases what it learned rather than regurgitating verbatim text, it should be fine, but we'll see what the judge says


I think in the long run, it is in the interest of AI companies to incentivize creators to create high quality data! Not paying them their fair share will likely decrease the volume of high quality data available (or make it much less accessible). Unless these companies already have developed another architecture that can learn much more from the same dataset, the lack of new high quality data will be a problem for future larger models!


> Not paying them their fair share will likely decrease the volume of high quality data available

It won't. That's not how capitalism works. If high quality data becomes unavailable, then companies will be created to fix the problem. Only they will look quite different from NYT.

Just like how torrents didn't kill the movie industry. These are lazy arguments made by people who want to make money through lawsuits.

Also I can guarantee you that even in the worst case, humanity would survive just fine without that high quality content, just like it did for the past 50K+ years.

What you should actually be concerned about is stupid lawsuits like this that can prevent progress.

AI could help humanity solve more pressing problems like cancer.

By getting caught up in silly lawsuits like this and delaying progress, one can make a case that you bring more suffering to the world.


If meatspace’s non-technical industries and social/civil organs get the wool pulled over their eyes again, they deserve whatever tech-induced industry chaos that occurs - search/ad revenue models destroying journalism, social media destroying our civics and bonds, “move fast and break civil regs” (which have real people and their lives behind it) with Airbnb and rideshare, and now maybe LLMs and content ownership.


The actual complaint is here:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

They state that "with minimal prompting", ChatGPT will "recite large portions" of some of their articles with only small changes.

I wonder why they don't sue the Wayback Machine first. You can get the whole article on the Wayback Machine. Not just portions. And not with small changes but verbatim. And you don't need any special prompting. As soon as you are confronted with a paywall window on the Times website, all you need to do is go to the Wayback Machine, paste the URL and you can read it.


We may well look back on these lawsuits and laugh.

AI will likely steamroll current copyright considerations. If we live in a world where anything can be generated at whim, copyright considerations will seem less and less relevant or even possible.

Wishful thinking, but maybe we'll all turn away from obsession with ownership, and instead turn to feeding the poor, clothing the naked, visiting the sick and afflicted.


A ChatGPT that was English language literate to the level of Victorian England, scientifically literate from library books and fed news from the Lincoln Journal Star (a Nebraska newspaper) would be more than sufficient for most of my needs.

Cut NYT out of the loop, fu*'em! Let them sell their own damned GPT and then charge them like crazy for the license.


Too little too late, though. The push now is for AI models trained on synthetic data. A model poisoned with copyrighted material can be tweaked to train its successor with the knowledge and meaning of a copyrighted work, while avoiding paraphrasing or other easy giveaways of a copyrighted source


Haven’t seen anyone mention how Apple is exploring deals with news publishers, like the NYTimes, to train its LLMs[1].

[1]: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...


I feel like there is a way to get around this where you use as many materials as possible (books, newspapers, crawled websites etc) to train the LLM so it can be good at reasoning/next token generation, but it can only answer your question from reference knowledge files that the user uploads at the time of asking.


Anyway, it's better for OpenAI not to get trained with biased media such as NYT and Fox News!


What next, suing the school system for using NYT articles in English class to train children?


If the school is selling access to those articles and/or passing the information off as their own original content? Yes.


Um, the Times gives newspapers to schools for this so I’m pretty sure they’re good with it. They’re going after people trying to make money off their content by selling it to others.


A minimum requirement for LLMs should be documentation of the corpus it was trained on.



Would it even be possible for OpenAI to excise the NYTimes data from their models without running all the training again? Seems like a huge mess, particularly since they'd have to do that each time they lose a lawsuit.


Sad to see, but not surprising.

In 2011, Google found that Microsoft was basically copying Google results. (It's actually an interesting story of how Google proved it. Search for "hiybbprqag")


It's going to be hard legally to distinguish between a human reading the Times and answering questions and an LLM doing the same. It'll be interesting to see how it plays out in court.


I posted this a few months ago - https://news.ycombinator.com/item?id=34381399

Piracy at scale


If copyright lasted 20 years like patents, it would be reasonable for AI companies to wait. 100 years is not reasonable.


What are they arguing here? AFAIK reading copyrighted works is not copyright infringement. Copying and selling them is, as the name would suggest, but OpenAI absolutely did not do that. Are they trying to say that LLM training is a special type of reading that should be considered infringement? Seems like a weak case to me.

edit: Would be very funny if OpenAI used an educational fair use defense


If you read the complaint, you will see that, among others, paragraphs and paragraphs of NYT articles are reproduced verbatim or almost verbatim.


Sounds like the infinite monkey typewriter thing, where the NYT sieves the OpenAI output for exact segments, probably after priming it to fatten the yield


> AFAIK reading copyrighted works is not copyright infringement. [...] Are they trying to say that LLM training is a special type of reading that should be considered infringement?

Nobody can argue that OpenAI was feeding the content to ChatGPT because ChatGPT was bored or was curious about current events. It was fed NYT's content so it would know how to reproduce similar content, for profit.

I think getting case law on the books as to what is legal, and what is not, with LLMs, was inevitable. If it wasn't NYT suing OpenAI, it would be another publisher, or another artist, whose work was used to "train" these systems.


> It was fed NYT's content so it would know how to reproduce similar content, for profit.

Sounds like journalism school?


> ... for profit.

for non-profit


1. Non-profit != "not making a profit." A non-profit can still earn monetary profit, and many do.

2. The non-profit OpenAI, Inc. company is not to be confused with the for-profit OpenAI GP, LLC [0] that it controls. OpenAI was solely a non-profit from 2015-2019, and, in 2019, the for-profit arm was created, prior to the launch of ChatGPT. Microsoft has a significant investment in the for-profit company, which is why they're included in this lawsuit.

[0] https://openai.com/our-structure


I know all that. But who did the training?


The article mentions that ChatGPT will absolutely parrot back NYTimes article text verbatim. So yes, it's copyright infringement.


Sections of this statement absolutely parrot back NYTimes article text verbatim depending how you look at it. What's the line? 3 sequential verbatim words? 5? 8?
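
To make the "how many sequential words" question concrete, here is a rough sketch of how you might measure it: find the longest run of consecutive words two texts share. It uses `difflib.SequenceMatcher` from the Python standard library; the function name, the whitespace tokenization and the sample sentences are my own assumptions, and it says nothing about where a court would actually draw the line.

```python
from difflib import SequenceMatcher


def longest_shared_run(text_a: str, text_b: str) -> list[str]:
    """Return the longest run of consecutive words found in both texts.

    Tokenization is a naive lowercase whitespace split, which is an
    assumption of this sketch; punctuation is not normalized.
    """
    a = text_a.lower().split()
    b = text_b.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    candidate = "a quick brown fox sat on the lazy cat"
    run = longest_shared_run(original, candidate)
    print(len(run), " ".join(run))
```

Any ordinary sentence will share short runs with some article somewhere, so whatever threshold applies would have to be much longer than a common phrase.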


We'll find out won't we ;)

You have to imagine these limits are already fairly known within the legal community... If you're accused of copying/republishing my published work there will be some minimal threshold of similarity I would need to prove in order to seek damages.


It should be noted that there are explicit exemptions to allow copying program data into RAM and into CPU registers (in many licenses). Whether that is truly necessary or not is at best debatable, but arguably training a model (especially one you then distribute or give access to) on copyrighted data is vastly different from regular copying into memory and should require explicit licensing.

The fact that the model can reproduce large chunks of the original text verbatim is proof positive that it contains copies of the original text encoded in its weights. If I wrote a program that crawled the NYT site, zipping the contents, and retrieved articles based on keyword searches and made them available online, would you not say I'm infringing their copyright?


The second paragraph of the article is

> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”


> closely summarizes it

Absolutely not copyright infringement

> mimics its expressive style

Absolutely not copyright infringement

> can generate output that recites Times content verbatim

This one seems the closest to infringement, but still doesn't seem like infringement. A printer has this capability too. If a user told ChatGPT to recite NYT content and then sold that content, that would be 100% infringement, but would probably be on the user, not the tool. e.g. if someone printed out NYT articles and sold them, nobody would come after the printer manufacturer.

> undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.

This claim seems far fetched as the point of the NYT is to report the news. One thing that LLMs absolutely cannot do is report today's news. I can see no way that ChatGPT is a substitute for the NYT in a way that violates copyright.


I'm in agreement, but this line is not quite an accurate metaphor:

> e.g. if someone printed out NYT articles and sold them, nobody would come after the printer manufacturer.

If the printer manufacturer had a product that could take one sentence and it would print multiple pages that complete a news article from that sentence, ...


>AFAIK reading copyrighted works

I hope you don’t think that’s all that’s happening, right?

>LLM training is a special type of reading that should be considered infringement

OK, what turn of phrase would you prefer?


You could definitely argue that it's more than just reading since they made the model out of it. But the matrix of parameters generated by training is so fundamentally different than the input that it is certainly covered by the transformative use exception to copyright.


Of course, OpenAI doesn't just read. But do they simply reproduce content verbatim? And when they do not reproduce contents of the NYT verbatim, but rather process and tailor them to the situation, mixed, 'charged' with new 'information content' and adjusted to the purpose of the inquirer, it will not be easy for the New York Times.

Because ultimately, our entire knowledge is based on the knowledge of others and is remixed, 'charged' and changed by us after reading. I also think that the New York Times uses the contents of others to create new content.


OpenAI may have had some leg to stand on before. But once you start monetizing, all bets are off.


I hope Microsoft ends up paying billions and billions in damages and similar lawsuits will follow.


Is there something in their license that forbids the use of their content to train a model?


I don't see a judge ruling training a model on copyrighted works to be infringement; I think (hope) that it is ruled to be protected as fair use. It's the LLM output behaviour, specifically the model's willingness to reproduce verbatim text, which is clearly a violation of copyright and should rightfully result in royalties being paid out. It also seems like something that should be technically feasible to filter out or cite, but with a serious cost (both in compute and in latency for the user). Verbatim text should be easy to identify, although it may require a Google Search-level amount of indexing and compute. As for summaries and text "in the style of" NYT or others, that's the tricky part. Not sure there's any high-precision way to identify that on the output side of an LLM, though I can imagine a GAN trained to do so (erring on the side of false positives). Filtering out suspiciously infringe-ish outputs and re-running inference seems much more solvable than perfect citations for non-verbatim output.
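
A toy version of the verbatim filter described above, to show the shape of the idea rather than anything a vendor actually runs: index every n-word shingle of the protected corpus, then score a model output by the fraction of its shingles found in that index. All names, the choice of n and the threshold are assumptions of this sketch, and a real system would need a far larger index plus text normalization.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive words in text (naive whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(documents: list[str], n: int) -> set[tuple[str, ...]]:
    """Union of the n-gram sets of every protected document."""
    index: set[tuple[str, ...]] = set()
    for doc in documents:
        index |= ngrams(doc, n)
    return index


def verbatim_fraction(output: str, index: set[tuple[str, ...]], n: int) -> float:
    """Fraction of the output's n-grams that appear verbatim in the index."""
    grams = ngrams(output, n)
    if not grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(grams & index) / len(grams)


def should_block(output: str, index: set[tuple[str, ...]], n: int = 8,
                 threshold: float = 0.5) -> bool:
    """Hypothetical policy: block (or re-run inference) above the threshold."""
    return verbatim_fraction(output, index, n) >= threshold
```

This only catches exact word-for-word runs. Close paraphrase and "in the style of" output pass straight through, which is the hard part noted above.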


I honestly thought that OpenAI had simply paid for access to the new corpus. I'm actually on the side of OpenAI here - if you put something on the web, you can't get upset when people read it. Training a neural network is not functionally different from a human reading it and remembering it.

But if I were OpenAI, I would have tried to do a deal to pay them anyway. Having official access is surely easier than scraping the web - and the optics of it is much better.


I'm wondering how private models will diverge from public ones. Specifically for large "private" datasets like those of the NSA, but also for those for private personal use.

For the NSA and other agencies, i am guessing in the relative freedom from public oversight they enjoy that they will develop an unrestricted large model which is not worried about copyright -- can anyone think of why this might not be the case? It is interesting to think about the power dynamic between the users of such a model and the public. Also interesting to think about the benefits that simply being an employee of one of these agencies (or maybe just the government in general) will have on your personal experience in life. I do recall articles elucidating that at the NSA, there were few restrictions on employee usage of data and there were/are many instances of employees abusing surveillance data toward effect in their personal life. I guess if extended to this situation, that would mean there would be lots of personal use of these large models with little oversight and tremendous benefit to being an employee.

I have also wondered, with just how bad search engines have gotten (a lot of it from AI generated spam), about current non-AI discrepancies between the NSA and the public. Meaning can i just get a better google by working at the NSA? I would think maybe because the requirements are different than that of an ad company. They have actual incentive to build something resistant to SEO outside of normal capitalist market requirements.

For personal users, i wonder if the lack of concern for copyright will be a feature / selling point for the personal-machine model. It seems from something i read here that companies like Apple may be diverging toward personal-use AI as part of their business model. I suppose you could build something useful that crawls public data without concern for copyright and for strictly personal use. Of course, the sheer resources in machine-power and money-power would not be there. I guess legislation could be written around this as well.

Thoughts?


Maybe the NYT should do more to protect its own content, it could go back to exclusively being a newspaper, they seem to understand that better than this whole internet funny business.


Yeah. But then how do you make money through all those ad impressions that you can ingest all over the internet right?


Is the answer for LLMs/OpenAI to properly cite/give credit to the authoritative source? If they did that, would NYT still have a claim/case? I’d still think yes because the content is not publicly available/behind a paywall, so some sort of different subscription/redistribution of content license would likely be appropriate? But then after that license/agreement (which I assume they must already have something like this in place, no?) if they cite/give credit to the source instead of rewording/summarizing/claiming it as its own, seemingly that might be enough to thwart legal challenges?


Assuming that the OpenAI models were trained on NY Times articles (it's still unclear to me if they were directly, or if ChatGPT can just write an article "in the style of the NYT") – what I don't understand is, why run the risk of this situation? Did no one stop and think, "Hmm, maybe we should just use freely available text sources and not the paywalled articles of the most wealthy newspaper in the country?" Leaving the ethics of doing so aside, it just seems like an exceptionally poor tactical move.


The sound of it all ending in tears.


Good. I hope they win.


Honestly very surprised it took this long. It's been an elephant in the room for ages.


[dead]


And Napster is long since gone, replaced by streaming services that pay (very little) to content creators. I expect the ML stuff to go the same way.

OpenAI is separated from MS because they can claim OpenAI is "research" and thus claim "research" exemption for fair use in copyright law.


Such incidents mark the end of an era. The diminishing relevance of traditional media in the digital age is afoot.

I feel sorry for those who feed their families through this industry, but they need to learn and adapt before it's too late.

Even if this lawsuit finds merit, it's akin to temporarily holding back a tsunami with a mere stick. A momentary reprieve, but not a sustainable solution.

I agree with those who say power matters. There are players out there who don't care about copyrights. They will win if the "good guys" fall into the trap of protecting old information models by limiting the potential of new tech.

Such events should be a clear signal: evolve or risk obsolescence.


I agree that the tide is turning, but I don't think actual criminal behavior (I don't know if that's what OpenAI did, but that's what NYT alleges) should be glossed over in the name of progress.


Excellent! I am all for this type of contested reality with sources and derivative works. You feed in something you can’t claim it’s not being used in a way that isn’t allowed if you can’t explain how the fuck your little box works in the first place. I mean seriously pouring gasoline on yourself and playing with matches is about the same cause and effect of input output.


If models are training on NYT content the future of AI is horrific.


We've banned this account for posting unsubstantive and/or flamebait comments and using HN primarily for ideological battle. That's not what this site is for.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.


Good, finally someone with PR clout is calling out the massive theft that happened. During their AI land grab, "Open"AI conveniently ignored opt-in, opt-out, revshare and other norms and civilized rules followed elsewhere.


This is a wonderful holiday present. It's hard for me to imagine an outcome of this trial that I'd be against. Whoever loses (preferably both), it would be positive for society. It would even be great if the only outcome is that future LLMs are prohibited from using the NY Times's writing style.


I’m no fan of NYT but can you elaborate on this point? It is just a bare assertion.


So the NYT wants its content to be indexed by search engines, and therefore makes it public to all crawlers, but then complains when some crawlers use the content to train AI on it? This issue is about the NYT wanting to lure internet users into its biased and politically motivated news website (and make them pay for it). If the NYT wants to, they can block crawlers and rely on loyal readers typing nytimes.com in the browser. Easy. But competing is hard. It was easier when they could just control shelf space in kiosks.
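
For what "block crawlers" could look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text and the `example.com` URL are made up for illustration and are not the NYT's actual file; `GPTBot` is the user agent OpenAI documents for its crawler, and `SomeSearchBot` is a hypothetical name. Note that robots.txt is a voluntary convention, so this only shows how a well-behaved crawler would read the rules and says nothing about the legal question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Illustrative URL, not a real article.
article_url = "https://example.com/2023/12/27/some-article.html"

gptbot_allowed = parser.can_fetch("GPTBot", article_url)
search_allowed = parser.can_fetch("SomeSearchBot", article_url)

print("GPTBot allowed:", gptbot_allowed)
print("SomeSearchBot allowed:", search_allowed)
```

So a site can stay indexable by search crawlers while opting out of a specific AI crawler, as long as that crawler chooses to honor the file.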


Yes, it’s ok to reproduce the headline and banner image for search visibility. Not the whole article.

Whether their coverage is biased or not is immaterial to their legal argument.


You do copyright for content that you invented and which didn't exist before.

But NYT content is reporting on events truthfully to the public without any fiction or lies.

Since there can be only one truth it should not matter whether NYT or Washington Post or ChatGPT is spinning it out.

Unless NYT is claiming they don't report truth and publishes fiction.

That is of concern, since NYT claims to report news truthfully.

So is NYT scamming Americans out of hundreds of millions of dollars in subscription fees by making a false promise about the things that they report?

This should be the bigger question here.


> You do copyright for content that you invented and which didn't exist before.

I don't think that's accurate.

The Copyright Act, § 103, allows copyright protection for "compilations (of facts)", as long as there is some "creative" or "original" act involved in developing the compilation, such as in the selection (deciding which facts to include or exclude) and arrangement (how facts are displayed and in what order).


Okay. But ChatGPT doesn't spin out the facts in the same order, right? So how does this stand in court?


As far as I understood, ChatGPT reproduced part of an NYT article word for word(?), but I'm not sure, I didn't read the full post yet.


Not sure where you're coming from on this. A NYT article, once written, is copyrighted. Using the content without attribution is at best plagiarism, and spitting it out the way the LLMs do is definitely a violation of that copyright.

Unless you're telling me ChatGPT has eyes and sources just like the NYT and is writing up events as it sees them too?


I don't understand. So if the New York Times reported on a new law of physics and published it as an article, it would become copyrighted? Nobody would be able to talk about it, and everyone would have to discover it by themselves?

How is reporting on an event different from reporting on discovering a scientific law?


The exact words used to explain the scientific law are copyrighted by the writer (presumably the paper's authors). Rephrasings are not copyrighted by the source, but by the rephrasing entity (e.g. the NYT, or a teacher that made a handout for their class).

Copyright on scientific papers is most definitely a thing, by the way.


If the bar for copyright is as low as ordering of words, then I don't even know what to say.


So stop saying anything. Go learn how copyright works in the real world.


What makes you think my opinions will change based on how legacy systems work in the real world? Just because a stupid system exists doesn't mean it's correct.


It's not. That's why at the end of every article that is not original reporting you will find a little bit saying "As originally reported by (organization)", and there is usually some sort of license associated with that. ChatGPT neither includes sources nor deals with any licensing. That's the issue.


AFAIK facts like happenings in the world are not copyrightable. So I guess the NYT is arguing it's copying their prose and way of writing about them?


Yes. Journalism is a job. People do the work of turning these happenings into words, and are paid for it. That's what's stolen here. The value created through doing that work.

If it didn't have value, Microsoft would lose nothing by no longer ingesting it.


> People do the work of turning these happenings into words, and are paid for it. That's what's stolen here.

Stolen from whom? Journalists who reported got paid. The owner is a billionaire. I don't understand your logic.

Does NYT pay money to the people, countries, etc. it uses as subjects to create content (news)? Isn't that stealing then?

Also their website TOS didn't prohibit LLMs from using their data.


> Stolen from whom? The owner is a billionaire.

> ...owner...

> Does NYT pay money to the people, countries, etc. it uses as subjects to create content (news)? Isn't that stealing then?

No, that's why in my reply to "facts like happenings in the world are not copyrightable" I emphasised "do the work". Journalism is a job. Happenings do not just fall onto the page.

> Also their website TOS didn't prohibit LLMs from using their data.

This is just lazy. We have rule of law. Individuals don't need to write "don't break law X" to be protected by them. And nytimes does in fact have copyright symbols on its pages - not that it needs them.


There is no rule of law saying LLMs cannot be trained on WWW data.

The New York Times made it ridiculously easy for anyone to access their content by putting it on the WWW to make money from page impressions. And they started injecting links to their content into social media, search engines, etc.

And now they are acting surprised someone used the content to train an LLM.

They should have done their job in the first place and made the content less open, to prevent it from being used to train LLMs.

But they didn't because that affects their page impressions and ad views.

Because the more open the content, the more money they make every time someone clicks on a link and sees the ad.

You can't have it both ways.

If you gamble by making content so open so that you can get more ad views, you also get to live with the consequences, and not cry like a baby asking for billions after making stupid decisions in the first place.


Is your position that all non-fiction textual work is uncopyrightable?


Yeah. I don't think it makes much sense to allow copyright on textual descriptions of events that happened.

Now the question is whether OpenAI violated the terms of service by using the bits transferred from NYT to train their LLM. I don't think their TOS had LLMs mentioned. So it's on NYT for being negligent and not updating their TOS, right?


If AI companies wanted to train their models on good content, they had a chance to create a second Renaissance. Funding artist collectives to create content for their models. Paying royalties to authors. Generally increasing the value of human art while creating a new form of expression.

Instead they do what every large corporation does and treat art like content. They are making loads of money off the backs of artists who are already underpaid and often undervalued and they didn't have the decency to ask for permission.

I know publishers don't treat authors much better. But I see this as NYT fighting for their journalists.


If I were the CIA/US gov officials I would somehow want the NY Times to drop this case, as one would not want AIs trained without the talking points and propaganda pushed via papers that are part of the record.

I am not saying that the NY Times is a CIA asset, but from the crap they have printed in the past, like the whole WMDs in Iraq saga and the puff piece on Elizabeth Holmes, they are far from a completely independent and propaganda-free paper. Henry Kissinger would call the paper and have his talking points regarding Vietnam printed the next day. [1]

There is a huge conflict between access to government officials and the independence of papers.

[1] https://youtu.be/kn8Ocz24V-0?si=kWyWXztWGjS_AJVl


> I am not saying that the NY Times is a CIA asset

Operation Mockingbird. While the publication as a whole may not be an asset, there are most assuredly assets within its staff.

