
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...

There’s a domestic AI price war in China, plus pricing in mainland China benefits from lower cost structures and very substantial government support, e.g. local compute-power vouchers and subsidies designed to make AI infrastructure cheaper for domestic businesses and to encourage widespread adoption. https://www.notebookcheck.net/China-expands-AI-subsidies-wit...

All of this is true and credit assignment is hard, but the brutal competition between Chinese firms, especially in manufacturing, differentiates them from and gives them an edge over economies in the west. It makes investment hard as profits are competed away, which is blasphemy in Thiel's worldview, but it is excellent for consumers both local and global.

I guess they want to partially subsidize local developers?

Maybe that's a requirement from whoever funds them, probably public money.


Seriously? Does Netflix or Spotify cost the same everywhere around the world? They earn less in those markets, and local buying power is lower.

The main cost for Netflix and Spotify is licensing. Offering the subscription at half price to additional users is non-cannibalizing and a way to get more revenue from the same content.

The cost of LLMs is infrastructure. Unless someone can buy/power/run compute cheaper (Google with TPUs, locales with cheap electricity, etc.), there won't be a meaningful difference in costs.


That assumes inference efficiency is static, which isn't really the case. Between aggressive quantization, speculative decoding, and better batching strategies, the cost per token can vary wildly on the exact same hardware. I suspect the margins right now come from architecture choices as much as raw power costs.
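
To make that concrete, here is a toy cost model; every number in it is a hypothetical placeholder, not a measured figure:

    # Rough, illustrative cost-per-token model; all numbers are made up.
    GPU_COST_PER_HOUR = 2.0    # assumed all-in hourly cost of one GPU, $
    BASE_TOKENS_PER_SEC = 50   # assumed decode speed at batch size 1, fp16

    def usd_per_million_tokens(batch_size, quant_speedup):
        # Throughput grows sublinearly with batch size (memory-bandwidth
        # bound); crudely modeled here with a 0.8 scaling exponent.
        tokens_per_sec = BASE_TOKENS_PER_SEC * batch_size**0.8 * quant_speedup
        return GPU_COST_PER_HOUR / (tokens_per_sec * 3600) * 1_000_000

    print(usd_per_million_tokens(1, 1.0))    # ~$11/M: naive serving
    print(usd_per_million_tokens(64, 2.0))   # ~$0.20/M: batched + quantized

Same hardware and power bill; the ~50x spread comes entirely from serving strategy.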

Sure, and so do professional tools like Microsoft Teams, or compute in different parts of the world.

It could be that energy is a lot cheaper in China, but it could be other reasons, too.

Slightly off-topic: "surveillance pricing" is a term being used more often, whereby even hotel room prices vary based on where you're booking from, what terms you searched for, etc.

Here's a short video on the subject:

https://youtube.com/shorts/vfIqzUrk40k?si=JQsFBtyKTQz5mYYC


I think we are just very close to the peak of a typical Gartner hype cycle around LLMs. They are useful but overhyped. There will be more posts about fuckups that happen because people run things on autopilot and cannot keep up with reviewing AI generated code.

Do not get me wrong. I use AI all day to speed things up. But I believe that there is only a small group, maybe 5 percent or less, that actually knows how to use AI properly (I'd count myself not yet in that 5%), which I see as potentially dangerous. The other issue I see is inexperienced software engineers writing software. Although I see this as a great value add and productivity boost for prototyping, I am afraid of the “I do not know much about coding but can also make PRs to our codebase” mentality.

For those of you that run things on autopilot, how do you keep code quality under control? And how do you handle refactoring? I am really curious, because one option now is also to just YOLO your LLMs to write code based on the maturity of the product. You can refactor an app or parts of it pretty fast again with LLMs. While tech debt accumulates faster, we also have the opportunity to rebuild faster.


Are there any benchmarks? I didn’t find any. It would be the first model update without proof that it’s better.


Very proud as a Swiss that Soumith has a .ch domain!


Probably because his first name is Chintala


That'd be his last name


true haha


Is the price here correct? https://openrouter.ai/moonshotai/kimi-k2-thinking That would be $0.60 for input and $2.50 per 1 million output tokens. If the model is really that good, it's 4x cheaper than comparable models. Is it hosted at a loss, or do the others have a huge margin? I might be missing something here. Would love some expert opinion :)

FYI: the non-thinking variant has the same price.


In short, the others have a huge margin if you ignore training costs. See https://martinalderson.com/posts/are-openai-and-anthropic-re... for details.


Somehow that article totally ignores the insane pricing of cached input tokens set by Anthropic and OpenAI. For agentic coding, typically 90-95% of the inference cost is attributed to cached input tokens, and a scrappy Chinese company can do it almost for free: https://api-docs.deepseek.com/news/news0802
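
As a back-of-the-envelope illustration (all prices below are hypothetical placeholders, not any provider's actual rates), the cache-read rate is what drives the bill of a long agent session:

    # Hypothetical prices; the point is the sensitivity to cache-read rates.
    def session_cost(million_tokens, cache_hit_rate, fresh_price, cached_price):
        fresh = million_tokens * (1 - cache_hit_rate) * fresh_price
        cached = million_tokens * cache_hit_rate * cached_price
        return fresh + cached

    # 10M input tokens, ~95% served from cache, $3/M for fresh input tokens
    print(session_cost(10, 0.95, 3.00, 0.30))  # $4.35 with a 10x cache discount
    print(session_cost(10, 0.95, 3.00, 0.03))  # ~$1.79 with a 100x discount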


It uses linear attention for 75% of its layers, so it is inherently cheaper to run. And it is MoE, so the active parameter count is far lower.


Yes, you can assume that open-source models hosted via OpenRouter are priced at roughly bare hardware cost; in practice some providers there may even run on subsidized hardware, so there is still money to be made.


I can only agree with your experience in Europe. I do not get how they do that, but Tesla Superchargers are more reliable. The occupancy information works better, they are easier to use, and they almost always offer a more competitive price. I often see other chargers that are 50 to 100 percent more expensive and only very rarely see offers that are within 10 to 50 percent.

What strikes me is that this difference can make EVs more expensive per kilometer if you only compare energy cost with fuel cost.

Here is the math with numbers. Tesla chargers in Switzerland and Germany are usually at most CHF 0.50 or EUR 0.60 per kilowatt hour at the more expensive locations, along highways for example. They offer fast charging of 150 kW or more. Alternative providers often start at around CHF 0.75 for 50 kW or CHF 1.00 for more than 250 kW fast charging. If your electric car consumes 20 kWh (Model 3 is at around 15 I think) per 100 km you end up with costs of CHF 10.00, CHF 15.00, or CHF 20.00 per 100 km at CHF 0.50, CHF 0.75, or CHF 1.00 per kilowatt hour. If you drive a petrol car that uses 8 l per 100 km and the cost per liter is CHF 1.70 you pay CHF 13.60 per 100 km.
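
The same numbers as a small calculator, for anyone who wants to plug in their own rates:

    def ev_cost_per_100km(kwh_per_100km, price_per_kwh):
        return kwh_per_100km * price_per_kwh

    def petrol_cost_per_100km(liters_per_100km, price_per_liter):
        return liters_per_100km * price_per_liter

    for price in (0.50, 0.75, 1.00):  # CHF/kWh: Tesla vs. pricier alternatives
        print(f"EV at CHF {price:.2f}/kWh: "
              f"CHF {ev_cost_per_100km(20, price):.2f} per 100 km")
    print(f"Petrol at CHF 1.70/l: "
          f"CHF {petrol_cost_per_100km(8, 1.70):.2f} per 100 km")
    # EV: 10.00 / 15.00 / 20.00; petrol: 13.60 -- above roughly CHF 0.68/kWh
    # the EV loses on energy cost alone.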


The Tesla 12-stall 250 kW Supercharger in my town in the south of France (Albi) is priced among the lowest in the area, at 0.23 EUR/kWh (all taxes included).

It is under a solar-covered parking lot.


In Slovakia superchargers cost around 0.30-0.37 €/kWh, while the competitors are priced around 0.45-0.60, so yes there is a major price difference as well.

To be fair the others offer subscription plans which lower the price, but such plans don't suit me, so I pay the full prices.


Ze devil iz in ze detailz!

Try this one https://www.tesla.com/de_DE/findus/location/supercharger/Ham...

While the site says they are available 24/7, they are in the gated parking lot of a shopping mall, which is closed between 8 PM and 9 AM and on Sundays. Fun to see Danish tourists desperately trying to reach them while very low on juice.

Extreme Fremdschäming/facepalming!


Is there any news about power consumption? I didn't even see a TDP or the like mentioned.


One of the first things I looked at too...


From my comment elsewhere in this thread, https://news.ycombinator.com/item?id=45048078: "up to 170W" was the quote from March.


Demand > Supply?


I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.


No, the model has nothing to do with Llama. We are using our own architecture and training from scratch. Llama also does not have open training data and is non-compliant, in contrast to this model.

Source: I'm part of the training team


If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)


Absolutely! I sent you a LinkedIn message last week, but here seems to work much better. Thanks a lot!


Oh sorry I might have missed it! I think you or your colleague emailed me (I think?) My email is daniel @ unsloth.ai if that helps :)


Hey, really cool project; I'm excited to see the outcome. Is there a blog/paper summarizing how you are doing it? Also, which research group at ETH is currently working on it?


L3 has open pretraining data; it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb


Wait, the whole (English-speaking) web content dataset is ~50 TB?


Yes, if we take the filtered and deduplicated HTMLs of CommonCrawl. I've made a video on this topic recently: https://www.youtube.com/watch?v=8yH3rY1fZEA


Fun presentation, thanks! 72 min ingestion time for ~81 TB of data is ~1 TB/min, or ~19 GB/s. Distributed or single-node? Shards? I see 50 jobs are used for parallel ingestion, and I wonder how ~19 GB/s was achieved, since ingestion rates were far below that figure last time I played around with ClickHouse performance. Granted, that was some years ago.


Distributed across 20 replicas.
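
That makes the per-node figure much less surprising; quick illustrative arithmetic:

    data_tb, minutes, replicas = 81, 72, 20
    aggregate_gb_s = data_tb * 1000 / (minutes * 60)
    print(f"aggregate: {aggregate_gb_s:.1f} GB/s")               # ~18.8 GB/s
    print(f"per replica: {aggregate_gb_s / replicas:.2f} GB/s")  # ~0.94 GB/s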


So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to LLaMA and other well-known models; it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!


Not sure about Swiss law, but the EU AI Act, and Directive 2019/790 on copyright in the digital single market that it piggybacks on for this topic, does allow training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained by respecting those mechanisms (and, as linked elsewhere, they didn't find any practical difference in performance; note that there is an exception allowing the opt-out mechanisms to be ignored for research purposes, which is how they could make that comparison).


That is not correct. The EU AI Act has no such provision, and the text-and-data-mining exemption does not apply, as the EU has made clear. As for Switzerland, copyrighted material cannot be used unless licensed.


Thanks for clarifying! I wish you all the best luck!


Are you using dbpedia?


No. The main source is FineWeb2, but with additional filtering for compliance, toxicity removal, and quality filters such as FineWeb2-HQ.


Thx for engaging here.

Can you comment on how the filtering impacted language coverage? E.g. FineWeb2 has 1800+ languages, but some with very little actual representation, while FineWeb2-HQ has just 20, each with a substantial data set.

(I'm personally most interested in covering the 24 official EU languages.)


We kept all 1800+ (script, language) pairs, not only the quality-filtered ones. Whether a mix of quality-filtered and unfiltered languages impacts the mixing is still an open question. Preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361) indicates that quality filtering can mitigate the curse of multilinguality to some degree, and so facilitate cross-lingual generalization, but it remains to be seen how strong this effect is at larger scale.


Imo, a lot of the magic is also dataset-driven, specifically the SFT and other fine-tuning/RLHF data they have. That's what has separated the models people actually use from the also-rans.

I agree with everything you say about getting the experience, the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model will be useful.


When I read "from scratch", I assume they are doing pre-training, not just finetuning; do you have a different take? Or do you mean they're using the standard Llama architecture? I'm curious about the benchmarks!


The infra to get a SOTA LLM trained does become pretty complex. People assume it's as simple as loading up the architecture and a dataset and using something like Ray. There's a lot that goes into designing the dataset, the eval pipelines, the training approach, maximizing the use of your hardware, dealing with cross-node latency, recovering from errors, etc.

But it's good to have more and more players in this space.
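
As one concrete example of the error-recovery point, here's a minimal sketch of a checkpoint-and-resume loop (a hypothetical PyTorch-style loop; `model`, `optimizer`, and `data_loader` are assumed to exist, and real large-scale runs layer sharded, asynchronous checkpointing on top of this idea):

    import os
    import torch

    CKPT_PATH = "checkpoint.pt"

    def save_checkpoint(step, model, optimizer):
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict()}, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        if not os.path.exists(CKPT_PATH):
            return 0                      # fresh run
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["step"]              # resume where we left off

    def train(model, optimizer, data_loader, total_steps, save_every=1000):
        step = load_checkpoint(model, optimizer)
        for batch in data_loader:
            if step >= total_steps:
                break
            loss = model(batch).mean()    # placeholder loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step % save_every == 0:
                save_checkpoint(step, model, optimizer)  # survive node crashes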


I'd be more concerned about the 70B size (DeepSeek R1 has 671B), which makes catching up with SOTA harder to begin with.


SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3) then it could be quite useful. Not everyone has the VRAM to run the full fat Deepseek R1.


Also, isn't DeepSeek's a Mixture of Experts model, meaning not all params ever get activated on one forward pass?

70B feels like the best balance between usable locally and decent for regular use.

maybe not SOTA, but a great first step.
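
Right: DeepSeek R1 activates roughly 37B of its 671B parameters per token, but all the weights still have to be resident to serve it. Rough memory math (assuming fp16/bf16 weights, ignoring KV cache and activations):

    def weight_gb(params_billions, bytes_per_param=2):
        return params_billions * bytes_per_param  # 1B params * 2 bytes = 2 GB

    print(weight_gb(70))    # ~140 GB: dense 70B, every param used per token
    print(weight_gb(671))   # ~1342 GB: R1's full weights must fit in memory
    print(weight_gb(37))    # ~74 GB of compute-active params per token (MoE)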


Is there anything like this also supporting other GPUs? Thinking of Apple Silicon or embedded ones in phones etc.

