Everyone is responding to the intellectual property issue, but isn't that the less interesting point?
If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
Even if everything about the training is true, the bigger cost is inference, and DeepSeek is 100x cheaper. That destroys OpenAI/Anthropic's value proposition of having a unique secret sauce, so users are quickly fleeing to cheaper alternatives.
Google Deepmind's recent Gemini 2.0 Flash Thinking is also priced at the new Deepseek level. It's pretty good (unlike previous Gemini models).
More like OpenAI is currently charging more. Since R1 is open source / open weight we can actually run it on our own hardware and see what kinda compute it requires.
What is definitely true is that there are already other providers offering DeepSeek R1 (e.g. on OpenRouter[1]) for $7/m-in and $7/m-out. Meanwhile OpenAI is charging $15/m-in and $60/m-out. So already you're seeing at least 5x cheaper inference with R1 vs O1, with a bunch of confounding factors. But it is hard to say anything truly concrete about efficiency, since OpenAI does not disclose the actual compute required to run inference for O1.
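For a rough sense of scale, a back-of-the-envelope at those list prices, assuming an illustrative workload of 1M input and 1M output tokens (real input/output mixes vary):

    # Toy cost comparison at the list prices quoted above (illustrative workload only)
    r1_cost = 7 + 7        # $7/m-in + $7/m-out via an OpenRouter provider
    o1_cost = 15 + 60      # $15/m-in + $60/m-out from OpenAI
    print(o1_cost / r1_cost)   # ~5.4x, i.e. "at least 5x cheaper" at this mix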
There are even much cheaper services that host it for only slightly more than deepseek itself [1]. I'm now very certain that deepseek is not offering the API at a loss, so either OpenAI has absurd margins or their model is much more expensive.
[1] the cheapest I've found, which also happens to run in the EU, is https://studio.nebius.ai/ at $0.8/million input.
Edit: I just saw that openrouter also now has nebius
Well, in the first years of AI, no, it wasn't, because nobody was using it.
But at some point if you want to make money you have to provide a service to users, ideally hundreds of millions of users.
So you can think of training as CI+TEST_ENV and inference as the cost of running your PROD deployments.
Generally, in traditional IT infra, PROD >> CI+TEST_ENV (10-100 to 1).
The ratio might be quite different for LLMs, but still any SUCCESSFUL model will have inference > training at some point in time.
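As a toy illustration of that crossover (all numbers are made up for the sketch, not anyone's real costs):

    # Hypothetical break-even: when does cumulative inference spend pass the one-off training spend?
    training_cost = 5_000_000          # $ one-off "CI+TEST_ENV" (made-up figure)
    cost_per_1k_requests = 2.0         # $ "PROD" inference (made-up figure)
    daily_requests = 10_000_000        # a model with real traffic
    daily_inference = daily_requests / 1_000 * cost_per_1k_requests   # $20k/day
    print(training_cost / daily_inference)   # ~250 days until inference > training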
> The ratio might be quite different for LLMs, but still any SUCCESSFUL model will have inference > training at some point in time.
I think you're making assumptions here that don't necessarily have to be universally true for all successful models. Even without getting into particularly pathological cases, some models can be successful and profitable while only having a few customers. If you build a model that is very valuable to investment banks, to professional basketball teams, or some other much more limited group than consumers writ large, you might get paid handsomely for a limited amount of inference but still spend a lot on training.
If there is so much value for a small group, those are likely not simple inferences but the new expensive kind with very long CoT chains and reasoning. So not cheap, and it is exactly this trend towards inference-time compute that makes inference > training from a total-resources-needed POV.
That's not correct. First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually). But secondly, and more to your point, even if you were to use training data from another model, YOU STILL NEED TO DO ALL THE TRAINING.
Using data from another model won't save you any training time.
> training off of data generated by another AI is generally a bad idea
It's...not, and it's repeatedly been proven in practice that this is an invalid generalization because it is missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model, and it's definitely a bad idea (ISTR this is the actual research result from which the false generalization developed) to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is then curated and used to train other models, which can be much more efficient than trying to create new material without AI once you've already hoovered up all the readily accessible, low-hanging fruit of premade content relevant to your training goal.
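A minimal sketch of the generate-then-curate loop being described; the teacher model, the check, and the resulting SFT set are hypothetical placeholders, not anyone's actual pipeline:

    # Hypothetical generate -> curate -> fine-tune data pipeline (names are illustrative only)
    def generate_candidates(teacher_model, prompts, n_samples=4):
        # sample several completions per prompt from the "parent" model
        return [(p, teacher_model(p)) for p in prompts for _ in range(n_samples)]

    def curate(candidates, passes_check):
        # keep only pairs that survive an automatic or human check
        return [(p, out) for (p, out) in candidates if passes_check(p, out)]

    def build_sft_set(teacher_model, prompts, passes_check):
        # the curated pairs become supervised fine-tuning data for the "child" model
        return curate(generate_candidates(teacher_model, prompts), passes_check)

The interesting work is all in passes_check: with a weak check you get a distillation, with a strong one the training set can be better than the teacher's raw output.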
It is, of course not going to produce a “child” model that more accurately predicts the underlying true distribution that the “parent” model was trying to. That is, it will not add anything new.
This is immediately obvious if you look at it through a statistical learning lens and not the mysticism crystal ball that many view NNs through.
This is not obvious to me! For example, if you locked me in a room with no information inputs, over time I may still become more intelligent by your measures. Through play and reflection I can prune, reconcile and generate. I need compute to do this, but not necessarily more knowledge.
Again, this isn't how distillation works. Your task as the distillation model is to copy mistakes, and you will be penalized for pruning, reconciling and generating.
"Play and reflection" is something else, which isn't distillation.
The initial claim was that distillation can never be used to create a model B that's smarter than model A, because B only has access to A's knowledge. The argument you're responding to was that play and reflection can result in improvements without any additional knowledge, so it is possible for distillation to work as a starting point to create a model B that is smarter than model A, with no new data except model A's outputs and then model B's outputs. This refutes the initial claim. It is not important for distillation alone to be enough, if it can be made to be enough with a few extra steps afterward.
You’ve subtly confused “less accurate” and “smarter” in your argument. In other words you’ve replaced the benchmark of representing the base data with the benchmark of reasoning score.
Then, you’ve asserted that was the original claim.
Sneaky! But that’s how “arguments” on HN are “won”.
LLMs are no longer trying to just reproduce the distribution of online text as a whole to push the state of the art, they are focused on a different distribution of “high quality” - whatever that means in your domain. So it is possible that this process matches a “better” distribution for some tasks by removing erroneous information or sampling “better” outputs more frequently.
While that is theoretically true, it misses everything interesting (kind of like the No Free Lunch Theorem, or the VC dimension for neural nets). The key is that the parent model may have been trained on a dubious objective like predicting the next word of randomly sampled internet text - not because this is the objective we want, but because this is the only way to get a trillion training points.
Given this, there's no reason why it shouldn't be possible to produce a child model from (filtered) parent output that exceeds the parent model on a different, more meaningful objective like being a useful chatbot. There's no reason why this would have to be limited to domains with verifiable answers either.
The latest models create new information from base models by randomly sampling candidate responses, then pruning the bad ones using an evaluation function. The good responses improve the model.
It is not distillation. It's like how you can arrive at new knowledge by reflecting on existing knowledge.
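Roughly the shape of that loop, as a sketch; the scorer is a toy stand-in for whatever real evaluation function (unit tests, a math checker, a reward model) gets used:

    import random

    def evaluate(response):
        # toy stand-in for the real evaluation function; "longer is better"
        # purely so the example runs end to end
        return len(response)

    def best_of_n(model, prompt, n=8):
        candidates = [model(prompt) for _ in range(n)]   # randomly created candidates
        return max(candidates, key=evaluate)             # prune the bad ones, keep the best

    # toy "model": random strings; in practice the kept responses are fed back in as training data
    toy_model = lambda p: p + " " + "".join(random.choices("abcde", k=random.randint(1, 10)))
    print(best_of_n(toy_model, "2+2="))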
Fine-tuning an LLM on the output of another LLM is exactly how DeepSeek made its progress. The way they got around the problem you describe is by doing this in a domain that can be relatively easily checked for correctness, so suggested training data for fine-tuning could be automatically filtered out if it was wrong.
> It is, of course not going to produce a “child” model that more accurately predicts the underlying true distribution that the “parent” model was trying to. That is, it will not add anything new.
Unfiltered? Sure. With human curation of the generated data it certainly can. (Even automated curation can do this, though it's more obvious that human curation can.)
I mean, I can randomly generate fact claims about addition, and if I curate which ones go into a training set, train a model that reflects addition of integers much more accurately than the random process which generated the pre-curation input data.
Without curation, as I already said, the best you get is a distillation of the source model, which is highly improbable to be more accurate.
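A toy version of that addition example, with a made-up error rate, just to make the point concrete:

    import random

    def noisy_addition_claim():
        a, b = random.randint(0, 99), random.randint(0, 99)
        s = a + b + random.choice([0, 0, 0, 5])   # the generating process is wrong ~25% of the time
        return (a, b, s)

    raw = [noisy_addition_claim() for _ in range(1000)]
    curated = [(a, b, s) for (a, b, s) in raw if s == a + b]   # the curation step

    accuracy = lambda data: sum(s == a + b for (a, b, s) in data) / len(data)
    print(accuracy(raw), accuracy(curated))   # ~0.75 vs 1.0: the curated set beats its own generator

A model trained on the curated set inherits the curator's signal, not just the generator's.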
No no no, you don't understand, the models will magically overcome these issues and somehow become 100x better and do real AGI! Any day now! It'll work because LLMs are basically magic!
Also, can I have some money to build more data centres pls?
I think you're missing the point being made here, IMHO: using an advanced model to build high quality training data (whatever that means for a given training paradigm) absolutely would increase the efficiency of the process. Remember that they're not fighting over sounding human, they're fighting over deliberative reasoning capabilities, something that's relatively rare in online discourse.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
It's trivial to build synthetic reasoning datasets, likely even in natural languages. This is a well established technique that works (e.g. see Microsoft Phi, among others).
I said generally because there are things like adversarial training that use a ruleset to help generate correct datasets that work well. Outside of techniques like that, it's not just a rule of thumb; it's always true that training on the output of another model will result in a worse model.
> it's always true that training on the output of another model will result in a worse model.
Not convincing.
You can imagine a model doing some primitive thinking and coming to a conclusion. Then you can train another model on the summaries. If everything goes well, it will come to conclusions quicker, at the least. Or it may be able to solve more complex problems with the same amount of 'thinking'. It would be self-propelled evolution.
Another option is to use one model to produce the 'thinking' part from known outputs, then train another to reproduce that thinking to arrive at the right output, which is initially unknown to it. Using humans to create such a dataset would be slow and very expensive.
PS: if it were impossible, humans would still be living in the trees.
Humans don't improve by "thinking." They improve by natural selection against a fitness function. If that fitness function is "doing better at math" then over a long time perhaps humans will get better at math.
These models don't evolve like that; there is no random process of architectural evolution. Nor is there a fitness function anything like "get better at math."
A system like AlphaZero works because it has rules to use as an oracle: the game rules. The game rules provide the new training information needed to drive the process. Each game played produces new correct training data.
These LLMs have no such oracle. Their fitness function is and remains: predict the next word, followed by: produce text that makes a human happy. Note that it's not "produce text that makes ChatGPT happy."
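For intuition, a toy version of that oracle loop (tic-tac-toe instead of Go, random play instead of MCTS), just to show where the free, always-correct labels come from:

    import random

    LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def winner(board):
        # the game rules are the oracle: they label every finished game for free
        for a, b, c in LINES:
            if board[a] != " " and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def random_self_play():
        board, player, states = [" "] * 9, "X", []
        while winner(board) is None and " " in board:
            move = random.choice([i for i, cell in enumerate(board) if cell == " "])
            board[move] = player
            states.append("".join(board))
            player = "O" if player == "X" else "X"
        return states, winner(board) or "draw"

    # every game yields correct (position, outcome) pairs with no human labeling
    states, result = random_self_play()
    print(len(states), result)

There is no equivalent rulebook that mechanically labels "good text", which is the gap being pointed at here.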
It's more complicated than this. What you get is defined by what you put in. At first it was random or selected internet garbage + books + docs, i.e. not designed for training. Then came tuning. Now we can use a trained model to generate data designed for training, with specific qualities, in this case reasoning, and train the next model on it. Intuitively, that model can be smaller and better at what we trained it for. I showed two options for how the data can be generated; there are others, of course.
As for humans, assuming they genetically have the same intellectual abilities, you can see the difference in the development of different groups. It's mostly defined by how well the next generation is trained. Schools exist exactly for this.
You're the sociopath who just confidently tried to blame the air crash on DEI and spewed baseless conspiracy theories about who was to blame, while bodies are still being recovered from the frozen river. What a disgusting ignorant cruel comment.
There is absolutely no thought going on in your close-minded bigoted head, and I'm not at all interested in listening to you fill the internet with garbage by mansplaining your worthless uneducated opinions about AI (like your ridiculous eugenics theory that AI can estimate IQ from a picture of a face), since you've failed to achieve even natural intelligence, empathy, or humanity.
Set showdead=true to see the evidence: your comments were rightfully flagged dead and hidden because they were such blatant violations of the rules, with no redeeming value:
> training off of data generated by another AI is generally a bad idea
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
That's already happened. It's well established now that the internet is tainted. After essentially ChatGPT's public release, a not-insignificant amount of internet content is not written by humans.
I think the point is that if R1 isn't possible without access to OpenAI (at low, subsidized costs) then this isn't really a breakthrough as much as a hack to clone an existing model.
The training techniques are a breakthrough no matter what data is used. It's not up for debate, it's an empirical question with a concrete answer. They can and did train orders of magnitude faster.
Not arguing with your point about training efficiency, but the degree to which R1 is a technical breakthrough changes if they were calling an outside API to get the answers, no?
It seems like the difference between someone doing a better writeup of (say) Wiles's proof vs. proving Fermat's Last Theorem independently.
R1 is--as far as we know from good ol' ClosedAI--far more efficient. Even if it were a "clone", A) that would be a terribly impressive achievement on its own that Anthropic and Google would be mighty jealous of, and B) it's at the very least a distillation of O1's reasoning capabilities into a more svelte form.
It's just like how humans have been genetically stable for a long time, yet the quality & structure of information available to a child today vs that of 2000 years ago makes them more skilled at certain tasks. Math being a good example.
> First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually).
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
But it does mean the moat is even less defensible for companies whose fortunes are tied to their foundation models having some performance edge, and it points to a shift in the kinds of hardware used for inference (smaller, closer to the edge).
That may be true. But an even more interesting point may be that you don't have to train a huge model ever again? Or at least not just to get a slightly improved model, because now we have open weights of an excellent large model and a way to train smaller ones.
> If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar"
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Conversely:
If you say ChatGPT was trained on "whatever data was available", and you say DeepSeek was trained on "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled that, and the results are naturally going to be tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The significance of DeepSeek isn't so much that it promises future achievements, but that it shows OpenAI's string of announcements to be incremental progress that isn't going to reach the AGI that Altman now often harps on.
I'm not an OpenAI apologist and don't like what they've done with other people's intellectual property, but I think that's kind of a false equivalency. OpenAI's GPT 3.5/4 was a big leap forward in the technology in terms of functionality. DeepSeek-r1 isn't really a huge step forward in output; it's mostly comparable to existing models. One thing that is really cool about it is being able to be trained from scratch quickly and cheaply, and that is completely undercut if it was trained off of OpenAI's data. I don't care about adjudicating which one is a bigger thief, but it's notable if one of the biggest breakthroughs about DeepSeek-r1 is pretty much a lie. And it's still really cool that it's open source and can be run locally; it'll have that over OpenAI whether or not the training claims are a lie/misleading.
How is it a “lie” for DeepSeek to train their model on ChatGPT output but not if they train it on all of Twitter and Reddit? Either way the training is 100x cheaper.
In the last few days? No, that would be impossible; no one has the resources to train a base model that quickly. But there are definitely a lot of people working on it.
Have people on HN never heard of public ChatGPT conversations data sets? They've been mentioned multiple times in past HN conversations and I thought it'd be common knowledge here by now. Pretty much all open source models have been training on them for the past 2 years, it's common practice by now. And haven't people been having conversations about "synthetic data" for a pretty long time by now? Why is all of this suddenly an issue in the context of DeepSeek? Nobody made a fuss about this before.
And just because a model trains on some ChatGPT data, doesn't mean that that data is the majority. It's just another dataset.
Funny how the first principles people now want to claim the opposite of what they’ve been crowing about for decades since techbros climbed their way out of their billion dollar one hit wonders. Boo fucking hoo.