Hacker News

Fair: if AI companies are allowed to download pirated content for "learning", why can't ordinary people?


There is so much damning evidence that AI companies have committed absolutely shocking amounts of piracy, yet nothing is being done.

It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.

Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg


There's actually a lot of court activity on this topic, but the law moves slowly and is reluctant to issue injunctions where harm is not obvious.

It's more that the law about "one guy decides to pirate twelve movies to watch them at home and share with his buddies" is already well-settled, but the law about "a company pirates 10,000,000 pieces to use as training data for an AI model (a practice that the law already says is legal in an academic setting, i.e. universities do this all the time and nobody bats an eye)" is more complicated and requires additional trials to resolve. And no, even though the right answer may be self-evident to you or me, it's not settled law, and if the force of law is applied poorly suddenly what the universities are doing runs afoul of it and basically nobody wants that outcome.


What’s ironic to me is that had these companies pirated only a single work, wouldn’t that be a chargeable crime?

Clearly Bonnie and Clyde shouldn’t have been prosecuted. Imagine they were just robbing banks for literary research purposes. They could have then used the learnings to write a book and sell it commercially…

Or imagine one cracks 10,000 copyrighted DVDs and then sells 30-second clips… (a derivative work).

To me, for-profit companies and universities have a huge difference: the latter is not seeking to directly commercially profit from copyrighted data.


There is a distinction that very few people make, but that the courts thankfully seem to grasp:

Training on copyrighted material is a separate claim from skirting payment for it.

Which pretty much boils down to: "If they put it out there for everyone to see, it's probably OK to train on it, if they put it behind a paywall and you don't pay, the training part doesn't matter, it's a violation."


Whether it’s legal slash fair use to train on copyrighted material is only one of the questions currently being asked though. There’s a separate issue at play where these companies are pirating the material for the training process.

By comparison, someone here brought up that it might be transformative fair use to write a play heavily based on Blood Meridian, but you still need to buy a copy of the book. It would still be infringement to pirate the e-book for your writing process, even if the end result was legal.


If they bought material at a large scale, the seller might require them to sign a contract imposing royalties if the material is used to train an AI. So buying legally is a way to put yourself into a trap.


They can buy individual works like anyone else.

Or they can negotiate a deal at scale with whatever price / restrictions make sense to both parties.

I don’t see a way they could be “trapped”. Worst case they pay retail price.


What is the precedent on that kind of agreement?

The only thing I've been able to find is the note that since copyright is federal law, state contract law actually can't supersede it, to wit: if you try to put a clause in the contract that says the contract is void if I use your work to make transformative fair-use works (or I owe you a fee), that clause is functionally unenforceable (for the same reason that I don't owe you a fee if I make transformative fair-use works of your creations in general).


So if I download copyrighted material like the new disney movie with fansubs and watch it for training purposes instead of enjoyment purposes it's fine? In that case I've just been training myself, your honor. No, no, I'm not enjoying these TV shows.

Because it's important to grasp the scale of these copyright violations:

* They downloaded, and admitted to using, Anna's Archive: Millions of books and papers, most of which are paywalled but they pirated it instead

* They acquired Movies and TV shows and used unofficial subtitles distributed by websites such as OpenSubtitles, which are typically used for pirated media. Official releases such as DVDs tend to have official subtitles that don't sign off with "For study/research purpose only. Please delete after 48 hours" or "Subtitles by %some_username%"


I don't know what is confusing here, perhaps my comment isn't clear.

If you skirt payment, it's a violation. If it's free, but still copyrighted, it's likely not a violation.


They've done both, so my confusion is about why you are bringing this up?


OpenSubtitles has nothing to do with pirated media. Transcripts/translations are fair use. Their own use case is fair use as well.


OpenSubtitles is almost exclusively used with pirated media. Official copies come with official subtitles. OpenSubtitles itself is legal, but that's not the point at all.


If you owe the bank $1,000 you have a problem.

If you owe the bank $100,000,000 the bank has a problem.

We live in an era where the president of the United States uses his position to pump crypto scams purely for personal profit.


10% for the big don


The dead corpses of filmmakers and authors and actors are buried in unmarked graves out behind those companies' corporate headquarters. Unimaginable horror, that piracy. Why has no one intervened?

>If you're just a normal person you get to spend years in jail or worse.

Not that I'm a big fan of the criminalization of copyright infringement in the United States, but who has ever spent years in jail for this?

Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?


> who has ever spent years in jail for this?

Aaron Swartz?

EDIT: apparently he wasn't in jail, he was on bail while the case was ongoing - but the shortest plea deal would still have had him in jail for 6 months, and the penalty was 35 to 50 years.


Nope, he didn't go to jail.


> Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?

What a weirdly condescending way to interpret my post. My point boils down to: Either prosecute copyright infringement or don't. The current status quo of individuals getting their lives ruined while companies get to make billions is disgusting.


> Either prosecute copyright infringement or don't

This is the absolute core of the issue. Technical people see law as code, where context can be disregarded and all that matters is specifying the outputs for a given set of inputs.

But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.

If you go down the road of “the law is the law and billion dollar companies working on product should be treated the same as individual consumers”, it follows that individuals should do SEC filings (“either require 10q’s or don’t!”), and surgeons should be jailed (“either prosecute cutting people with knives or don’t!”).

There is a lot to dislike about AI companies, and while I believe that training models is transformative, I don’t believe that maintaining libraries of pirated content is OK just because it’s an ingredient to training.

But insisting that individual piracy to enjoy entertainment without paying must be treated exactly the same as datasets for model training is the absolute weakest possible argument here. The law is not that reductive.


> But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.

As Anatole France famously quipped:

"The law, in its majestic equality, forbids the rich and poor alike to sleep under bridges, to beg in the streets, and to steal bread."


Pretty funny that your argument boils down to: It's okay to break the law if you do it as a company.

Copyright laws target everyone. SEC laws don't.


Not sure if I was unclear or you’re disingenuous. But that is not at all what I said.


It doesn't matter whether it's transformative. Copyright covers derivative works.


No one (in the US) has been jailed for downloading copyrighted material.


https://en.wikipedia.org/wiki/Aaron_Swartz

And the US is not the only jurisdiction


That's not the same as piracy though. He wasn't downloading millions of scientific papers from libgen or sci-hub, he was downloading them directly from jstor. Indeed, none of his charges were for copyright infringement. They were for things like "breaking and entering" and "unauthorized access to a computer network".


The exact same charges could apply to the AI scrapers illegitimately accessing random websites.


No, they couldn't, since the then-novel and untested strained interpretation of the CFAA that the prosecutor was relying on has since been tested in the courts and soundly rejected.


I haven’t seen any accusations that they’ve done that, though. Usually people get pirated material from sources that intentionally share pirated material.


They're not just training on pirated content, they've also scraped literally the entire internet and used that too.


Scraping the public internet is also not a CFAA violation


CFAA bans accessing a protected computer without authorization. Hitting URLs denied by robots.txt has been argued to be just that.
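Worth noting that robots.txt is purely advisory: it's a convention a crawler chooses to honor, not an access control, which is part of why the CFAA argument is a stretch. A minimal sketch (with hypothetical rules and URLs) of how a compliant crawler consults it, using Python's standard library:

```python
# Sketch: how a well-behaved crawler checks robots.txt before fetching.
# The rules string and URLs below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The crawler simply asks; nothing stops it from fetching anyway.
print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # -> True
print(rp.can_fetch("MyBot", "https://example.com/private/data"))  # -> False
```

Whether ignoring that answer amounts to access "without authorization" is exactly the contested question.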


> Hitting URLs denied by robots.txt has been argued to be just that.

"Has been argued" -- sure, but never successfully. In fact, in hiQ v. LinkedIn, the 9th Circuit ruled (twice: both before and on remand after applying the Supreme Court's ruling in Van Buren v. US) that accessing data on a public website did not constitute access "without authorization" under the CFAA, even in the face of a cease and desist on top of robots.txt.


Now do every other jurisdiction


CFAA was mentioned specifically, which means only US jurisdiction is relevant here.


Part of the accusation comes from the fact that Swartz accessed the downloads through an MIT network closet, which AI companies weren't doing. The equivalent would be if openai broke into a wiring closet at Disneyland to download Disney movies.


The CFAA is vague enough to punish unauthorized access to a computer system. I don't have an example case in mind, but people have gotten in trouble for scraping websites before while ignoring e.g. robots.txt


The CFAA might be vague, but the case law on scraping has pretty much been resolved to "legal except in very limited circumstances". It's regrettable that less-resourced defendants were harassed before large corporations were able to secure such rulings, but those rulings predate the AI companies' scraping, so it's unclear why AI companies in particular should be getting flak here.


Aaron Swartz was not jailed or even charged for copyright infringement. The discussion and the comment I replied to is centered around US companies and jurisdiction.


The thread is centered around US companies, but not US jurisdiction.


There could be a moral question. For example a researcher might not want to download a pirated paper and cause a loss to a fellow researcher. But it becomes pretty stupid to pay when everyone, including large reputable companies endorsed by the government, is just downloading the content for free. Maybe his research will help develop faster chips to win against China, so why should he pay?

Would it be a "fair use" to download pirated papers for research instead of buying?

Also I was gradually migrating from obtaining software from questionable sources to open source software, thinking that this is going out of trend and nobody torrents apps anymore, but it seems I was wrong?

Or another example: if someone wants to make contributions to Wine but needs a Windows for developing the patch, what would be the right choice, buy it or download a free copy from questionable source?


Researchers don't get paid when their papers are downloaded, though. They pay to have their papers downloaded, and the middleman makes money on both sides. Piracy is the only moral option for them. There is a reason every single competent professor in the western world will email you a free copy of their papers if you ask nicely.


What about people filming movies in the cinema (for learning of course)? [1]

[1] https://www.thefederalcriminalattorneys.com/unauthorized-rec...


No, if you revolutionize both the practice and philosophy of computing and advance mankind to the next stage of its own intellectual evolution, you get to do whatever the fuck you want.

Seems fair.


Hm. Not a given that it's an advance.


I get the common cynical response to new tech, and the reasons for it.

We wish we lived in a world where change was reliably positive for our lives. Often changes are sold that way, but they rarely are.

But when something new introduces dramatic capabilities that nothing before it could match (compare every chatbot before LLMs), it is as clear an objective technological advance as has ever happened.

--

Not every technical advance reliably or immediately makes society better.

But whether or when technology improves the human condition is far more likely to be a function of human choices than the bare technology. Outcomes are strongly dependent on the trajectories of who has a technology, when they do, and how they use it. And what would be the realistic (not wished for) outcome of not having or using it.

For instance, even something as corrosive as social media, as it is today, could have existed in strongly constructive forms instead. If society viewed private surveillance, unpermissioned collation across third parties, and weaponizing of dossiers via personalized manipulation of media, increased ad impact and addictive-type responses, as ALL being violations of human rights to privacy and freedom from coercion or manipulation. And worth legally banning.

Ergo, if we want tech to more reliably improve lives, we need to ban obviously perverse human/corporate behaviors and conflicts of interest.

(Not just shade tech. Which despite being a pervasive response, doesn't seem to improve anything.)


At the risk of stepping on a well-known land mine around here, how'd you do on the IMO problem set this year?


I didn't participate. I probably wouldn't have done well. I disagree with your framing.


Well, wait, if somebody writes a computer program that answers 5 of 6 IMO questions/proofs correctly, and you don't consider it an "advance," what would qualify?

Either both AI teams cheated, in which case there's nothing to worry about, or they didn't, in which case you've set a pretty high bar. Where is that bar, exactly? What exactly does it take to justify blowing off copyright law in the larger interest of progress? (I have my own answers to that question, including equitable access to the resulting models regardless of how impressive their performance might be, but am curious to hear yours.)


The technology is capable in a way that never existed before. We haven't yet begun to see the impacts of that. I don't think it will be a good for humanity.

Social networks as they exist today represent technology that didn't exist decades ago. I wouldn't call it an "advancement" though. I think social media is terrible for humans in aggregate.


I notice you've motte-and-baileyed from "revolutionize both the practice and philosophy of computing and advance mankind to the next stage of its own intellectual evolution" to simply "is considered an 'advance'".


You may have meant to reply to someone else. recursive is the one who questioned whether an advance had really been made, and I just asked for clarification (which they provided).

I'm pretty bullish on ML progress in general, but I'm finding it harder every day to disagree with recursive's take on social media.


Except that the jury's (at best) still out on whether the influence of LLMs and similar tech on knowledge workers is actually a net good, since it might stunt our ability to think critically and solve problems while confidently spewing hallucinations, all while model alignment remains unregulated, haphazard, and (again at best) more of an art than a science.


Well, if it's no big deal, you and the other copyright maximalists who have popped out of the woodwork lately have nothing to worry about, at least in the long run. Right?


It's not about copyright _maximalism,_ it's about having _literally any regard for copyright_ and enforcing the law in a proportionate way regardless of who's breaking the laws.

Everyone I know has stories about their ISP sending nastygrams threatening legal action over torrenting, but now that corporations (whose US legal personhood appears to matter only when it benefits them) are doing it as part of the development of a commercial product that they expect to charge people for, that's fine?

And in any case, my argument had nothing to do with copyright (though I do hate the hypocrisy of the situation), and whether or not it's "nothing to worry about" in the long run, it seems like it'll cause a lot of harm before the benefits are felt in society at large. Whatever purported benefits actually come of this, we'll have to deal with:

- Even more mass layoffs that use LLMs as justification (not just in software, either). These are people's livelihoods; we're coming off of several nearly-consecutive "once-in-a-generation" financial crises, a growing affordability crisis in much of the developed world, and stagnating wages. Many people will be hit very hard by layoffs.

- A seniority crisis as companies increasingly try to replace entry-level jobs with LLMs, meaning that people in a crucial learning stage of their jobs will have to either replace much of the learning curve for their domain with the learning curve of using LLMs (which is dubiously a good thing), or face unemployment, and leaving industries to deal with the aging-out of their talent pools

- We've already been heading towards something of an information apocalypse, but now it seems more real than ever, and the industry's response seems to broadly be "let's make the lying machines lie even more convincingly"

- The financial viability of these products seems... questionable right now, at best, and given that the people running the show are opening up data centres in some of the most expensive energy markets around (and in the US's case, one that uniquely disincentivizes the development of affordable clean energy), I'm not sure that anyone's really interested in a path to financial sustainability for this tech

- The environmental impact of these projects is getting to be significant. It's not as bad as Bitcoin mining yet, AFAIK, but if we keep on, it'll get there.

- Recent reports show that the LLM industry is starting to take up a significant slice of the US economy, and that's never a good sign for an industry that seems to be backed by so much speculation rather than real-world profitability. This is how market crashes happen.


>why ordinary people cannot

They can. I don't think anyone got prosecuted for using an illegal streaming site or downloading from sci-hub, for instance. What people do get sued for is seeding, which counts as distribution. If anything AI companies are getting prosecuted more aggressively than "ordinary people", presumably because of their scale. In a recent lawsuit Anthropic won on the part about AI training on books, but lost on the part where they used pirated books.


People got in trouble for filming in the cinema as I understand, there is a separate law for that.


But in that case even though filming isn't technically distribution, it's clearly a step to distributing copies? To take this to the extreme, suppose you ripped a blu-ray, made a thousand copies, but haven't packaged or sold them yet. If the FBI busted in, you'd probably be prosecuted for "conspiracy to commit copyright infringement" at the very least.


It's just "training"


You seem to equate "training" (with scare quotes) with someone actually pirating a blu-ray, but they really aren't equivalent. Courts so far have ruled that training is fair use and it's not hard to see why. Unlike copying a movie almost verbatim (as with ripping a blu-ray), AI companies are actually producing something transformative in the form of AI models. You don't have to like AI models, or the AI companies' business models, but it strains credulity to pretend ripping a blu-ray is somehow equivalent to training an AI model.


Who's to say why I downloaded and am now watching a movie? Is it for my enjoyment? Is it because I'm training my brain? How is me training my brain any different from companies training their LLMs?

Same goes for recording: I'm just training my skills of recording. Or maybe I'm just recording it so I can rewatch it later, for training purposes, of course.


>Who's to say why I downloaded and am now watching a movie? Is it for my enjoyment? Is it because I'm training my brain? How is me training my brain any different from companies training their LLMs?

None of this is relevant because Anthropic was only let off the hook for training, not for pirating the books itself. So far as the court cases are playing out, there doesn't appear to be a special piracy exemption for AI companies.

>Same goes for recording: I'm just training my skills of recording. Or maybe I'm just recording it so I can rewatch it later, for training purposes, of course.

You can certainly use that as a defense. That's why we have judges, otherwise there's going to be some smartass caught with 1KG of coke and claiming it's for "personal consumption" rather than distribution.

None of this matters in reality, though. If you're caught with AV gear in a movie theater once, you'd likely be ejected and banned from the establishment/chain, not have the FBI/MPAA go after you for piracy. If you come again, you'd likely be prosecuted for trespassing. In the cases where they're going after someone in particular for making these rips, they usually have a dossier of evidence, like surveillance/transaction history showing that the same individual has been repeatedly recording movies, and watermarks correlating the screenings that the person has been in to files showing up on torrent sites.


> If you're caught with AV gear in a movie theater once, you'd likely be ejected and banned from the establishment/chain, not have the FBI/MPAA go after you for piracy

Good example, because this is exactly what websites are doing with LLM companies, who are doing their damnest to evade the blocks. Which brings us back around to "trespassing" or the CFAA or whatever.


>Which brings us back around to "trespassing" or the CFAA or whatever.

That argument is pretty much dead after https://en.wikipedia.org/wiki/Van_Buren_v._United_States and https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


https://www.rvo.nl/onderwerpen/octrooien-ofwel-patenten/vorm...

I'll leave all other jurisdictions up to you.


IANAL, but reading a bit on this topic: the relevant part of the copyright law for AI isn't academia, it's transformative work. The AI created by training on copyrighted material transforms the material so much that it is no longer the original protected work (collage and sampling are the analogous transformations in the visual-arts and music industries).

As for actually gathering the copyrighted material: I believe the jury hasn't even been empaneled for that yet (in the OpenAI case), but the latest ruling from the court is that copyright may have been violated in the creation of their training corpus.


AFAIK, downloading or watching pirated stuff isn't something you'll get in trouble for. Hosting and distributing it is what will get you.


Well, it just shows that they've downloaded subtitles.



