
I think the prevailing narrative ATM is that DeepSeek innovated in isolation and surpassed OpenAI, even though in the paper they give a lot of credit to Llama for their techniques. The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.

All of this should have been clear anyway from the start, but that's the Internet for you.


> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.

Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.

As far as I know, DeepSeek adds only a little to the transformers model while o1/o3 added a special "reasoning component" - if DeepSeek is as good as o1/o3, even taking data from it, then it seems the reasoning component isn't needed.


> I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model

Distillation is a term of art in AI and it is fundamentally incorrect to talk about distilling human-created data. Only an AI model can be distilled.

https://en.m.wikipedia.org/wiki/Knowledge_distillation#Metho...
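
For concreteness, a minimal sketch of the term-of-art sense (my illustration, in PyTorch; not from the linked article): the student is trained to match the teacher's full output distribution, which requires the teacher's logits and, in practice, its weights.

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, T=2.0):
      # Soften both distributions with temperature T, then match them.
      soft_targets = F.softmax(teacher_logits / T, dim=-1)
      log_probs = F.log_softmax(student_logits / T, dim=-1)
      # KL divergence between teacher and student, scaled by T^2.
      return F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T

Fine-tuning on sampled o1 text gives the student only hard tokens, never these soft targets, which is why training on someone's outputs isn't distillation in this strict sense.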


Meh,

It seems clear that the term can be used informally to denote the boiling down of human knowledge; indeed, it was used that way before AI appeared in the popular imagination.


In the context in which you said it, it matters a lot.

>> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.

> Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.

If DeepSeek was produced through the distillation (term of art) of o1, then the cost of producing DeepSeek is strictly higher than the cost of producing o1, and that cost can't be avoided.

Continuing this argument, if the premise is true then deepseek can't be significantly improved without first producing a very expensive hypothetical o1-next model from which to distill better knowledge.

That is the argument that is being made. Please avoid shallow dismissals.

Edit: just to be clear, I doubt that DeepSeek was produced via distillation (term of art) of o1, since that would require access to o1's weights. It may have used some of o1's outputs to fine-tune the model, which would still mean that the cost of training DeepSeek is strictly higher than the cost of training o1.


> just to be clear, I doubt that deepseek was produced via distillation

Yeah, your technical point is kind of ridiculous here: in all my uses of distillation (and in the comment I quoted), distillation is used in the informal sense, and there's no allegation that DeepSeek could have been in possession of OpenAI's model weights, which is what's needed for your "distillation (term of art)".


I'm not sure why folks don't speculate that China is able to obtain copies of OpenAI's weights.

It seems reasonable that they would be investing heavily in placing state assets within OpenAI so they can copy the models.


Because it feeds conspiracy theories and because there's no evidence for it? Also, let's talk DeepSeek in particular, not "China".

Looking back on the article, it is indeed using "distillation" in the special/"term of art" sense, but not using it correctly. I.e., it's not actually speculating that DeepSeek obtained OpenAI's weights and distilled them down, but rather that it used OpenAI's answers/output as a starting point (for which there is a different method/"term of art").


Some info that may be missing:

- v2/v3 (not r1) seem to be cloned from o1/4o output, and perform worse (this cost the oft-repeated ~$5M USD)

- r1 is specifically a reasoning step (using RL) _on top of_ v2/v3 and performs similarly to o1 (the cost of this is _not reported anywhere_)

- In the o1 blog post, they specifically say they use RL to add reasoning to LLMs: https://openai.com/index/learning-to-reason-with-llms/


The R1-Zero paper shows how many training steps the RL took, and it's not many. The cost of the RL is likely a small fraction of the cost of the foundational model.
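
For reference, the R1-Zero paper describes simple rule-based rewards (answer correctness plus an output-format check) rather than a learned reward model. A toy sketch of that idea (my simplification, not their actual code):

  import re

  def format_reward(completion: str) -> float:
      # Reward outputs that wrap their reasoning in <think>...</think>.
      return 1.0 if re.search(r"<think>.+?</think>", completion, re.S) else 0.0

  def accuracy_reward(completion: str, gold: str) -> float:
      # For verifiable tasks (math, code), compare the final answer,
      # assumed here to be given in \boxed{...}, to a known-correct one.
      m = re.search(r"\\boxed\{(.+?)\}", completion)
      return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

  def reward(completion: str, gold: str) -> float:
      return accuracy_reward(completion, gold) + format_reward(completion)

Rewards this cheap are part of why the RL stage adds so little cost on top of the foundational model.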

> the prevailing narrative ATM is that DeepSeek's own innovation was done in isolation and they surpassed OpenAI

I did not think this, nor did I think this was what others assumed. The narrative, I thought, was that there is little point in paying OpenAI for LLM usage when a much cheaper, similar or better version can be made and used for a fraction of the cost (whether it's on the back of existing LLM research doesn't factor in).


Yes, well, the narrative that rocked the stock market is different. It's looking at what DeepSeek did and assuming they may have a competitive advantage in this space and could outperform OpenAI at their own game.

If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly, even if the initial cost is huge. It also means OpenAI probably needs a better moat to protect its interests.

I'm not sure where the reality is exactly, but market reactions so far have basically followed that initial narrative and now the rebuttal.


The idea that someone can easily replicate an OpenAI model based simply on OpenAI outputs is, I’d argue, immeasurably worse for OpenAI’s valuation than the idea that someone happened to come up with a few innovations that leapfrogged OpenAI.

The latter could be a one-time thing, and/or OpenAI could still use their financial might to leverage those innovations and get even better with them.

However, the former destroys their business model and no amount of intelligence and innovation from OpenAI protects them from being copied at a fraction of the cost.


> Yes, well the narrative that rocked the stock market is different.

How do you know this?

> If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly

Why? If every innovation OpenAI is trying to keep as secret sauce becomes commoditized quickly and cheaply, then why would markets care about any innovations they have? They will be unable to monetize them.


Couldn't OpenAI just put in their license that training off OpenAI output is not allowed? With shibboleths or API logs, this could be verifiable.
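
One way the shibboleth idea could work (purely my illustration, hypothetical function names): plant rare marker strings in a small fraction of API responses, then probe a suspect model to see whether it completes them.

  def suspect_trained_on_outputs(generate, canaries) -> bool:
      # generate: fn(prompt) -> completion for the model under suspicion.
      # A model that memorized the planted API outputs will tend to
      # complete the second half of a canary given the first half.
      hits = 0
      for c in canaries:
          head, tail = c[: len(c) // 2], c[len(c) // 2:]
          if tail in generate(head):
              hits += 1
      return hits >= max(1, len(canaries) // 2)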

Why would it matter, when the Chinese DeepSeek is not going to abide by such rules, or be forced to, and will release their model's weights openly so anyone anywhere can host it?

Also, scraping most of the websites they scrape is not allowed; they do it anyway.


If they can make the US and Europe block the use of Deepseek and derivatives, they would be able to protect most of their market.

There were different narratives for different people. When I heard about r1, my first response was to dig into their paper and its references to figure out how they did it.

> I did not think this, nor did I think this was what others assumed.

That's what I thought and assumed. This is the narrative that's been running through all the major news outlets.

It didn't even occur to me that DeepSeek could have been training their models using the output of other models until reading this article.


Fwiw I assumed they were using o1 to train. But it doesn't matter: the big story here is that massive compute resources are unlikely to be as important in the future as we thought. It cuts the legs off Stargate etc. just as it's announced. The CCP must be highly entertained by the timeline.

That's only the case if you don't need to use the output of a much more expensive model.

>shows that models like o1 are necessary.

But HOW they are necessary is the change. They went from building blocks to stepping stones. From a business standpoint that's very damaging to OAI and other players.


Lmfao at "patched immediately".

My wife played Toontown as a kid, and in-game code sharing was by far her favorite part of the entire game. It also really doesn't sound like it was patched "immediately".


This happened before I started there, but from my understanding it was patched as soon as it was discovered. It was a tricky thing to catch since it's not something that readily shows up unless you're spying on every one of the tens of thousands of client sessions, and this was before YouTube when someone would've posted a video about it for clout.

Large????

I'm assuming large here because that's what the tickets will cost for supersonic flight.

You have no reliable way to project the cost of supersonic flight 50 years in the future. It will come down. That's all I'm certain of.

I'm sure people told the Wright brothers similar things. You're simply being extremely short-sighted. Which is very ironic since you're also talking about climate change concerns around this post.


The original comment was comparing this service to current first-class airline service.

Are you unfamiliar with the work NASA is doing on sonic boom reductions? Or are you too cynical to care?

Very familiar, but even regular planes are loud as hell already since I happen to live between the airport and a holdover route.

Idk what altitude planes are at in a location like yours, but there is certainly going to be an altitude limit on supersonic flight over land.

Thank you for providing this context and sourcing. I've been trying to find the root of and details around the $5 million claim.


Good luck; whenever an eye-popping number gains traction in the media, finding the source of the claim becomes impossible. See: finding the original paper, "The Big Payout", that was the origin of the claim that college graduates will on average earn $1M more than those who don't go to college.

In this case it's actually in the DeepSeek v3 paper on page 5

https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...


You sound extremely satisfied by that. I'm glad you found a way to validate your preconceived notions on this beautiful day. I hope your joy is enduring.


Idk if it's just me, but I voted blue in the last 3 presidential elections and I'm way more pissed about the Democrats than the Republicans right now.

They failed the country, so hard, by making poor decisions that made them lose. They did this repeatedly. Which decisions were the wrong ones is up for debate, but the surest one imo is Joe Biden running again instead of stepping aside and allowing a real primary.

Anyway, all that is to say that I feel like I understand choosing now as a time to talk about what things you despise most about the left, because a lot of people feel like they failed America and the entire world by losing so decisively for reasons that feel stupid.


That makes you "way more pissed" than DJT literally trying to steal the 2020 election? The Dems are bad, but let's keep things in perspective.


It's all about expectations. I expected very little of Trump and expected, well, something of the Dems. When my expectations are far out of whack, I get pissed.


Very much agree, but as I've found myself saying lately, "As much as I think the Democrats deserved to lose, I can't fathom thinking Trump deserved to win."


> Idk if it's just me

This is a widely held view among a lot of socialists and left-of-centre folks within America and globally. You could try the "Chapo Trap House" podcast if you want to hear this point of view.


I can see why those groups would also be mad at the Dems, although I very much do not consider myself a socialist... Also, I'm not exactly hungering for people to agree with me.


It seemed obvious to me, long before modern LLM training, that any training of machine intelligence would have to rely on pirated content. There's just no other viable way to efficiently acquire large quantities of text data. Buying millions of ebooks online would take a lot of effort, and downloading data from publishers isn't something that can be done efficiently (even assuming tech companies negotiated and threw money at them); the only efficient way to access large volumes of media is piracy. The media ecosystem doesn't allow anything else.


I don’t follow the “millions of ebooks are hard” line of thinking.

If Meta (or anyone) had approached publishers with a "we want to buy one copy of every book you publish" offer, that doesn't seem technically or commercially difficult.

Certainly Amazon would find that extremely easy.


Buying a book to read and incorporating its text into a product are two different things. Even if they bought the book, imo it would be illegal.


There are situations where you are allowed to incorporate the text into your product (fair use).

The million-dollar question is whether this counts.


Perhaps, but by not even buying the book they’ve conceded the point.

IMO copyright law does not control what you can do with a book once you’ve bought a license, except for reproduction. It’s arguable that LLMs engage in illegal distribution, but that’s a totally different question from whether simple training is illegal even if the model is never made available to anyone.


Maybe it is, maybe it isn't. The courts will decide.


> Maybe it is, maybe it isn't. The courts will decide.

This offhandedly seems to dismiss the cost of achieving legal clarity for using a book - a cost that will far eclipse the cost of the book itself.

In that light, it seems like an underweighted statement.


What they will decide is that it is simultaneously not piracy, because it is not read by a human, and not copyright infringement, because it's just like a human learning by reading a book.


Those are both copyright infringement, since we already have MAI Systems Corp. v. Peak Computer, Inc.

I'd like to see them try to argue Cartoon Network, LP v. CSC Holdings, Inc. applies to their corpus.


I really hope you'll be right -

But the first one is a human using things. It's big guy vs little guy.

The precedent is there: Google already "reads" every page on the internet and ingests it into its systems, has done for decades, and has survived lawsuits over doing so.


How does MAI Systems Corp. v. Peak Computer, Inc. apply here at all?

Peak was using MAI's operating system directly, by live-booting it without their permission.

Antivirus and security companies don't need licenses to scan copyrighted materials to look for threats or vulnerabilities.

AI similarly is not executing, deploying, reselling, or redistributing the copyrighted material. It's using the data to build a model. Security software distills the data down more, but it's still the same principle.


I think they'd ask why they'd want those millions of books. The publishers don't have to sell, and would be unlikely to, if they thought something like copyright violation was the goal.


Which would be fair. It’s not up to the tech oligopoly to dictate who gets to follow which laws.


We're talking about 19th century laws. I feel bad for the judge. Normally it would be for Congress to figure this shit out but yeah they haven't been doing their job for years.


> There's just no other viable alternative for efficiently acquiring large quantities of text data. [...] take a lot of effort [...] isn't a thing that can be done efficiently [...] only efficient way to access large volumes of media is piracy

Hypothetical: If the only way we could build AGI would be to somehow read everyone's brain at least once, would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?


It’s a fun hypothetical and not an obvious answer, to me at least.

But it’s not at all a similar dilemma to “should we allow the IP empire-building of the 1900’s to claim ownership over the concept of learning from copyrighted material”.


> It’s a fun hypothetical and not an obvious answer, to me at least.

As I wrote it out, I didn't know what I thought either.

But now some sleep later, I feel like the answer is pretty clearly "No, not worth it", at least from myself.

Our exclusive control over access to our minds is our essential form of self-determination, and what it means to be an individual in society. Cross that boundary (forcefully, no less) and it's probably one of the worst ways you could violate a human.

Besides, I'm personally not hugely into the whole "aggregate benefits could outweigh individual harms" mindset utilitarians tend to employ; it feels like it misses thinking about the humans involved.

Anyways, sorry if the question upset some people, it wasn't meant to paint any specific picture but a thought experiment more or less, as we inch closer to scarier stuff being possible.


Even most morally inclined people tend to overestimate the value of immediate benefits, and underestimate the eventual (especially delayed, unknown) harms.


> would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?

Sure, if we all get a stake of ownership in it.

If some private company is going to be the main beneficiary, no, and hell no.


> Sure, if we all get a stake of ownership in it.

But we do, in the sense that the benefits flow to the prompter, not the AI developers. The person comes with a problem, the AI generates responses, and they stand to benefit because it was their problem; the AI provider makes cents per million tokens.

AI benefits follow the owners of problems. That person might have started a project or taken a turn in their life as a result; the benefit is unquantifiable.

LLMs are like Linux, they empower everyone, and benefits are tied to usage not development.


We've seen this kind of system before. It was called sharecropping, and it was terrible.

The price will be ratcheted up, such that the majority of the economic surplus will go to the owner of AGI - with pricing tiered to the query you're asking it. The more economic utility the user will derive from making the query, the more the AGI's owner will charge them.


Are you claiming that the right to use a product or service implies a sort of ownership of it? If it's free to use, I suppose that makes some sense. If you're saying that the right to purchase use of it implies a level of ownership, that's just prima facie absurd.


Wouldn't it be a bad thing, even if it didn't require any privacy invasion?

If it matched human intellectual productivity, that would ensure that human intelligence no longer gets you more money than it takes to run some GPUs, so human intelligence would presumably become optional.


Could this AGI cure cancer, and would it be in the hands of the public? Then sure; otherwise, nah.


> in the hands of the public

Would you trust a businessman on that?


Nope, they haven’t earned an ounce.


How about a politician?


Eh, a transparent organization of elected officials with short term limits and strong public oversight, a little bit. A very smart AI would represent power, and so every mechanism we've used as humans to guard against the misuse of power would be the ones I'd want to see here.


At least I can fire my politicians.


If you don't, your geopolitical adversary might be the first to build AGI.

So in this scenario I could see it become necessary from a military perspective.


no


Ah geeze, I come to this site to see the horrors of the sociopaths at the root of the terrible technologies that are destroying the planet I live on.

The fact that this is an active question is depressing.

The suspicion that, if it were possible, some tech bro would absolutely do it (and smugly justify it to themselves using Roko's Basilisk or something) makes me actually angry.

I get that you're just asking a hypothetical. If I asked "Hypothetical: what if we just killed all the technologists" you'd rightly see me as a horrible person.

Damn. This site and its people. What an experience.


Would the average person even be against it? I am the most passionately pro-privacy person that I know, but I think it is a good question, because society at large seems to not value privacy in the slightest. I think your outrage is probably unusual at the population level.


They don't value it because they think companies are not abusing this power too much. Little do they know…


When I talk to people, it seems like they know, but they just don't care. They even think their phones are listening to their conversations to target ads.


It is well known that people change how they act when they know they are being watched. Even if they can't see it, just the threat of surveillance is enough to make people change their behavior.

I'd say it is no different for the people who claim they don't care. They absolutely do care, but at this point, saying "no" makes you the odd one with obviously something to hide, so they go along with it from a place of duress.

Unfortunately, I feel we are not too far from people finally snapping and going off the deep end because it's so pervasive and in-your-face that there is seemingly no escape left.





I agree that was the intent of the analogy, but it's not a great one. The idea that Disney, who has perverted IP laws globally for almost a century, should have ownership over their over-extracted copyrighted works equivalent to the privacy I have for the thoughts in my own head? Really?


What's with the unnecessary straw man? Who said any of that?


Fuck no


Given how much copyrighted content I can remember? To the extent that what AIs do is *inherently* piracy (and not just *also* piracy as an unforced error, as this case apparently is), a brain scan would also be piracy.


Kind of too close to reality, more than anyone knows :)

Tbh, human rights are all an illusion, especially if you are at the bottom of society like me. There's no way I will survive, so if a part of me survives as training data, I guess that's better than nothing?

Imo the only way this could happen is a global collaboration without telling anyone. The AGI would know everything about all humans, but its existence would have to be kept a secret, at least for the first n generations, so it would lead to life being gamified without anyone knowing. It would be eugenics, but on a global scale.

So many would be culled, but the AGI would know how to make it look normal to prevent resistance from forming: a war here, a war there, a law passed here, etc. So copyright being ignored kind of makes sense.


Jesus Christ


Sadly, he supports the AGI, eugenics, and human sacrifice lol. My pastor told me he gave him 6 real estate holdings.


IMO, if AIs were more sample-efficient (a long-standing problem that predates LLMs), they would be able to learn from purely open-licensed content, of which I think Wikipedia (CC-BY-SA) would be an example? I think they'd even pass the share-alike requirements, given that Meta is giving away the model weights.

https://en.wikipedia.org/wiki/Wikipedia:Copyrights


Alternatively, if they trained the model on synthetic data, filtered to avoid duplication, then no copyrighted material would be seen by the model. For example: turn an article into QA pairs, or summarize across multiple sources of text.
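
A hedged sketch of the QA-pair idea (llm() is a stand-in for any text-generation call, not a real API):

  def article_to_qa_pairs(article: str, llm, n: int = 5) -> list[tuple[str, str]]:
      prompt = (f"Write {n} question-answer pairs, in your own words, "
                f"covering the facts in this article. Format each as "
                f"'Q: ...' then 'A: ...'.\n\n{article}")
      pairs = []
      for block in llm(prompt).split("\n\n"):
          if "Q:" in block and "A:" in block:
              q = block.split("Q:", 1)[1].split("A:", 1)[0].strip()
              a = block.split("A:", 1)[1].strip()
              pairs.append((q, a))
      return pairs

The model then trains on the (q, a) text, never on the article itself.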


> trained the model on synthetic data

You get knowledge collapse [1] this way.

[1] https://arxiv.org/abs/2404.03502


You can get knowledge collapse.

But even though there are counter-examples (e.g. learning Go from self-play based only on the rules), IMO most people don't think about this in the right way and will therefore treat these things as magical oracles when they shouldn't, judging by how often people were already doing this with buggy human-written software[0][1].

No silver bullets.

[0] https://en.wikipedia.org/wiki/British_Post_Office_scandal

[1] https://en.wikipedia.org/wiki/Computer_says_no


Since this is Wikipedia, it could even satisfy the attribution requirements (though most CC-licensed corpora require attributing the individual authors).


> Buying millions of ebooks online would take a lot of effort

I don't understand.

Facebook and Google spend billions on training LLMs. Buying 1M ebooks at $50 each would only cost $50M.

They also have >100k engineers. If they shard the ebook buying across their workforce, everyone has to buy 10 ebooks, which will be done in 10 minutes.


Google also operates a book store, like Amazon. Both could process a one-off payment to their authors and then draw from their own backend.


For my thesis I trained a classifier on text from internal messaging systems and forums from a large consultancy company.

Most universities have had their own corpora to work with, for example: the Brown Corpus, the British National Corpus, and the Penn Treebank.

Similar corpora exist for images and video, usually created in association with national broadcasting services. News video is particularly interesting because it usually contains closed captions, which allows for multi-modal training.


Google has scans from Google Books, as well as all the ebooks it sells on the Play Store.


Wouldn't that still be piracy? They own the rights of distribution, but do they (or Amazon) have the rights to use said books for LLM training? And what rights would those even be?


Literally no rights agreement covers LLMs. They cover reproduction of the work, but LLMs don't obviously do this; i.e., the fact that the model transiently runs an algorithm over the text is superficially no different from the use of any other classifier or scoring system, like those already used by law firms looking to sue people for sharing torrents.


> They cover reproduction of the work, but LLMs don't obviously do this

LLMs are much smaller than their training sets; there is no space to memorize the training data. They might memorize small snippets, but never full books. They are the worst infringement tools ever made: why replicate Harry Potter via an LLM, which is slow, expensive, and lossy, when you could download the book so much more easily?

A second argument is that using the LLM blends a new intent into the process: that of the prompter. This can render the outputs transformative. And most LLM interactions are one-time use, like a scratch pad, not a finished work.


The lossy compression argument is interesting.

How many bits of entropy in Harry Potter?

How many bits of entropy in a lossy-compressed abridgement that is nevertheless enough, when reconstituted, to constitute a copyright infringement of Harry Potter?
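
A rough back-of-envelope (all figures are my own assumptions):

  chars = 77_000 * 6     # first Harry Potter book: ~77k words, ~6 chars/word
  bits_per_char = 1.0    # strong text compressors approach ~1 bit per char
  kilobytes = chars * bits_per_char / 8 / 1000
  print(kilobytes)       # ~58 KB: a rounding error next to billions of params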

The latter is absolutely small enough to fit in an LLM, although how close it would get to the original work is debatable. The question is whether copyright is violated:

1) inherently by the model operator, during the training.

2) by the model/model owner, as part of the generation.

3) by the user, in making the model do so and then reproducing the result.


My personal perspectives:

1) Straight-up copying. Downloading a bunch of copyrighted stuff means making a copy. No way out of this one.

2) A derivative work can be/is being generated here. Very grey area: what counts as a "derivative" work? Read about the Robin Thicke "Blurred Lines" court case for a rollercoaster of a time about derivative musical works.

3) Making the model do so? Do you mean getting an output and the user copying the result? That's copying the derivative work, which depends on whatever copyright agreement is settled once a derivative-work claim is sorted out.

That's based on my 5 years of music copyright experience, although it was about ten years ago now, so there might be some stuff I've got wrong.


You can ensure a model trains on transformative rather than derivative synthetic texts: for example, by asking for a summary, turning the text into QA pairs, or doing contrastive synthesis across multiple copyrighted works. This ensures the resulting model will never regurgitate the training set, because it has never seen it. This approach takes only abstract ideas from copyrighted sources, protecting their specific expression.

If abstract ideas were protectable, what would stop an LLM from learning not from the original source but from social commentary and follow-up works? We can't ask people not to reproduce ideas they read about. But on the other hand, protecting abstractions would kneecap creativity in both humans and AI.


That's an interesting argument, which makes the case for "it's what you make it do, not what it can do, which constitutes a violation" a little stronger IMO.


1) It's definitely copying, but that doesn't necessarily mean the end product is itself a copyright violation. (And that remains true even where some of the steps to make it were themselves violations).

2) Agreed! Where this becomes interesting with LLMs is that, as with people, they can have the capacity to produce a derivative work even without having seen the original.

For example, an LLM that had "read" enough reviews of Harry Potter might be able to produce a reasonable stab at the book (at least enough so for the law to consider it a derivative) without ever having consumed the work itself or direct derivatives.

3) It's more of a tool-use and intent argument. One might make the argument that an LLM is a machine, not a set of content/data, and that the liability for what it does sits firmly with the user/operator, not those who made it. If I use a typewriter to copy Harry Potter - or a weapon to hurt or kill someone - in neither case does the machine or its maker have any liability there.


Do those classifiers read copyrighted material? I thought they simply joined the swarm and seeded (reproduction with permission).

YouTube etc. classifiers definitely do read others' material, though.


> but do they (or Amazon) have the rights to use said books for LLM training?

The real question is: does copyright grant authors the right to control whether their work is used for LLM training?

It's not obvious what the answer is.

If authors don't have that right to begin with then there is no way amazon could buy it off them.


It’s a good question. Textbook companies especially would be pretty enthusiastic about a new “right to learn” monetization strategy. And imagine how lucrative it would be if you could prove some major artist didn’t copy your work, but learned from your work. The whole chain of scientific and artistic development could be monetized in perpetuity.

I think this is a dangerous road with little upside for anyone outside of IP aggregators.


It means they have existing relationships/contacts to reach out to for negotiating the rights to other uses of that content. I think it negates (for Google/Apple/Amazon, who all sell ebooks) the claim that efficiently acquiring the digital texts wouldn't be possible.


Leveraging their position in one market to get a leg up on another market? No idea if it would stick, but that would be one fun antitrust lawsuit right there.


Fun fact: it’s only illegal to leverage a monopoly in one market to advance another. It’s perfectly legal for Coke to leverage their large but not monopolistic soft drink empire to advance their bottled water entries.


Sure. The whole thing hinges on whether Google has a monopoly on whatever Google Books' market is (hence why I doubted it would stick). But given that some people seem to define "market" broadly enough to conclude that Apple has a monopoly on iPhones...


> Buying millions of ebooks online would take a lot of effort

Let me put that into perspective:

- Googling "how many books exist" gives me ~150 million, no idea how accurate but let's use that. - Meta had a net profit of ~40 billion USD in 2023. - That could be an potential investment of ~250 USD per book acquisition.

That sounds like a ludicrously high budget to me. So yeah, Meta could very well pay. It would still not be ethical to slurp all that content into their slop machine, but there is zero justification for pirating it all with these kinds of sums involved.
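
Sanity-checking that arithmetic (figures are the parent comment's, rounded):

  books = 150e6          # ~150 million books in existence
  profit_2023 = 40e9     # Meta's ~$40B net profit in 2023
  print(profit_2023 / books)   # ~ $267 per book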


The number 150,000,000 is laughably small.

Anyway, the problem is not money; it's technical feasibility and timelines.

You clearly don't think LLMs have any value though, so whatever.


> In the most recent fiscal year, Alphabet's net income amounted to 73.7 billion U.S. dollars

Absolutely no way. Yup.

> Buying millions of ebooks online would take a lot of effort, downloading data from publishers isn't a thing that can be done efficiently

Oh no, it takes effort and can't be done efficiently, poor Google!

How can this possibly be an excuse? This is such a detached SV Zuckerberg "move fast and break things"-like take.

There's just no way for a lot of people to efficiently get out of poverty without kidnapping and ransoming someone, it would take a lot of effort.


Copyright piracy isn't theft; try proving damages for a better argument.


Not my point; I never said it is. Substitute that example with another criminal act.

Edit: changed it just for you.


Copyright infringement is a civil charge, silly guy. No offense, but there aren't many ways to defend its existence in its current form without resorting to hyperbolic nonsense and looking silly in the process. So it's not a "crime", and you have to prove damages for a civil offense, so you'd need to prove that AI caused damages, or show how it is materially different from other algorithms, like Google scanning documents to provide the core utility of their service.


> The media ecosystem doesn't allow anything else.

Uh, pardon? For a mere $10MM, you can get almost all of Taylor & Francis' catalogue. They'll even pressure their authors to finish their books early, for free [0].

I think you can obtain all the training material for a mere rounding error in your books, if you're Meta, or Microsoft, or similar.

Well, the authors will not be notified or compensated, and their opinion on the matter won't be asked anyway, but this is "all for capit^H^H^H^H^H research".

[0]: https://mathstodon.xyz/@johncarlosbaez/113221679747517432


There are some efforts to do fully open LLMs including their training data. Allen AI released their model (OLMo) and the data used for training the model under permissive licenses. https://allenai.org/


How do you feel about a business saying, "Paying people is hard. You should work for free."?


AI mega corporations are not entitled to easy and cheap access to data they don't own. If it's a hard problem, too bad. If the stakes are as high as they're all claiming then it should be no problem for them to do this right.


> not entitled to easy and cheap access to data they don't own

This is not copyright as we know it. Copyright protects against copying, not against accessing data. You can still compile statistics from data you don't own. The models are like a compressed version of the originals, so compressed that you can't retrieve more than a few snippets of the original text. Newer models train on filtered synthetic text, which is one step removed from the protected expression in the copyrighted works. Should abstractions be protected by copyright?


However, in order to get to the compressed state, the original data has to be processed in some way as a whole. This requires a copy of the material to be available. If that copy was obtained illegally, what are the implications?


Why would machine intelligence need an entire humanity's worth of data to be machine intelligence? It seems like only a really poor training method would need that much data.


What about something decentralized? Each person trains something on their own piece of data, and somehow that gets aggregated into one giant model.


This approach is used in Federated Learning where participants want to collaboratively train a model without sharing raw training data.
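
The canonical aggregation step is FedAvg: clients train locally, and a server averages their weight updates. A minimal sketch (assuming each client ships a dict mapping layer name to a weight array):

  def federated_average(client_weights: list[dict]) -> dict:
      # Only these locally computed weights leave each device;
      # the raw training data never does.
      n = len(client_weights)
      return {k: sum(w[k] for w in client_weights) / n
              for k in client_weights[0]}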


Are there any companies working on it?

I was thinking: if I train my model on my private docs, for instance financial ones, how does one prevent the model from sharing that data verbatim?


Sam Altman bought some of GPT's training data from a Chinese army cyber group.

1. Sam Altman was removed from OpenAI due to his ties to a Chinese cyber army group.

2. OpenAI had been using data from D2 to train its AI models.

3. The Chinese government raised concerns about this arrangement with the Biden administration.

4. The NSA launched an investigation, which confirmed OpenAI's use of D2 data.

5. Satya Nadella ordered Altman's removal after being informed of the findings.

6. Altman refused to disclose this information to the OpenAI board.

Source: https://www.teamblind.com/post/I-know-why-Sam-Altman-was-fir...

I guess Sam then hired a top NSA guy to buy favor with the natsec community.

I wonder who protects Sam up top and why aren't they protecting Zuck? Is Sam just better at bribes and manipulation?


I find it highly implausible that Meta doesn't have the resources to obtain these legally. They could have reached out to a publisher and asked to purchase ebooks in bulk, and if that publisher says no, tough shit. The media ecosystem doesn't exist for Big Tech to extract value from!

"It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.


> I find it highly implausible that Meta doesn't have the resources to obtain these legally. They could have reached out to a publisher and ask to purchase ebooks in bulk - and if that publisher says no, tough shit

They could also simply buy controlling stakes in publishers. For scale comparison, Meta is spending upwards $30B per year on AI, and the recent sale of Simon & Schuster that didn't go through was for a mere $2.2B.


I don't think it would actually be that simple.

Surely the author only licenses the copyright to the publisher for hardback, paperback and ebook, with an agreed-upon royalty rate?

And if someone wants the rights for some other purpose, like translation or making a film or producing merchandise, they have to go to the author and negotiate additional rights?

Meta giving a few billion to authors would probably mend a lot of hearts, though.


Explain why release-group tags get generated in some videos, then.


They are not saying Meta didn't use pirated content, just that they had the resources not to if they chose.


> if that publisher says no, tough shit

> "It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.

I totally agree. But since when has that stopped companies like Meta? These big companies are built on breaking/skirting the rules.


Perhaps they did ask, got told no, and decided to take it anyway?

Defending themselves with technicalities and expensive lawyers may be financially viable.

Zero ethics but what would we expect from them?


Who is "them"? Like, who in the Meta business reporting line made this decision, then how did they communicate it to the engineers who would've been necessary to implement it, particularly at scale?

While it's plausible someone downloaded a bunch of torrents and tossed them in the training directory... again, under whose authority? If this happened, it would potentially be one overzealous data scientist. Hardly "them".

People lean on collective pronouns to avoid actually thinking about the mechanics of human enterprise and you get extremely absurd conclusions.

(it is not outside the bounds of thinkable that an org could in fact have a very bad culture like this, but I know people who work for Meta, who also have law degrees - they're well aware of the potential problems).


Come on... it's fine that you haven't followed the story, there's a lot going on, but the snotty condescension is very frustrating:

> These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because "torrenting from a [Meta-owned] corporate laptop doesn't feel right". They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as "MZ" in the memo handed over during discovery) and that Meta's AI team was "approved to use" the pirated material.

https://www.wired.com/story/new-documents-unredacted-meta-co...


Code is speech. By saying you can't distribute a particular app in the United States you're restricting speech.


"Code is speech" is absurdly reductionist in most cases.

Yes, the government censoring Tiktok's source code on Github would be a freedom of speech violation, but that's not what this is about, is it? See also: Tornado Cash. Publishing code facilitating money laundering is fine (you'll find the code still on Github!); running said code to facilitate money laundering isn't.

Or to go with an even more extreme example: Writing code for a self-aiming and firing gun is speech [1], running said code on a gun in your driveway isn't.

The fact that we are still debating such basics of the First Amendment here is baffling. This is almost as trivial as the other well-known limitations in my view (shouting "Fire!" in a crowded theater etc.)

[1] At least at the moment, and as far as I know; I think we might see this type of speech being restricted in the same way that some facts about the construction of nuclear weapons are "innate state secrets".


I think it is largely about this.

American companies (Google and Apple primarily) have been told by the government that they cannot distribute binaries running certain code to Americans. That seems like the real First Amendment issue to me, and I was quite surprised to learn that ByteDance only claimed that their own First Amendment rights were being infringed (which personally I find to be the flimsier claim).

EDIT: Tornado Cash was taken down from GitHub, though, so you don't have a point here.


The code isn't the main issue here; it's the online platform. The apps were only banned as a means of access to the platform, not for the code they contain. The code would be largely useless without the platform infrastructure and data storage behind it.


Huh? It's up as a public archive on tornadocash/tornado-core as we speak.

> American companies (Google and Apple primarily) have been told by the government that they cannot distribute binaries running certain code to Americans.

Yes, in the same way that American companies and individuals are routinely prohibited by the government from distributing other binaries to Americans, most notably anything that circumvents DRMs as regulated by the DMCA.

I really don't think the people who drafted the First Amendment had apps in mind when they thought of "speech"; they would probably consider them something more like machinery (a printing press, a radio (not a radio station!), etc.). Interpreting TikTok as a type of newspaper (newspapers are widely protected even in democracies without an equivalent of the First Amendment) is much less of a leap than considering an iOS executable speech.


Interesting; I didn't follow the Tornado Cash case super closely, but I do recall it being taken off GitHub for a short time.

So I would also argue that restricting DRM bypassing software is a violation of the 1st amendment and, more importantly, that it's a bad thing to restrict.

We'll never know what they would have thought, but I'll add that actual plans for machinery are definitely speech. We certainly do restrict such plans, most notably with ITAR, and I think it's reasonable to draw that line somewhere.

Note that I never said banning TikTok was a bad idea, just that it restricts speech by way of limiting distribution (which oddly looks unconsidered in the Supreme Court case), which it absolutely does. I'm uncomfortable with this level of power being granted to the government, but given that TikTok is obviously a spying/malware-delivery tool of a borderline-hostile foreign government, I think it's probably warranted.

I think not being at least somewhat disturbed by the United States government restricting distribution of an application is a bit weird, TBH. That's a huge power to have, and it can definitely be abused, especially if it's made easier to exercise in the future.


Does this apply for malware? Trojans? Websites that host child pornography?

Or does it just apply to the brainrotting addiction machine that shoves 800 videos a minute at teenagers?


Note that I didn't say I thought the ban was unwarranted


That kinda means the quality isn't great or amazing. Good TTS should be nearly indistinguishable from a human speaker, and should include emoting, natural pauses, etc.

