Hacker News | johndough's comments

> So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but the number of such chips that you can chain together and still hit the 1000 tokens per second target.

Chaining chips does not decrease token throughput. In theory, you could run models of any size on Cerebras chips. See for example Groq's (not to be confused with Grok) chips, which only have 230 MB SRAM, yet manage to run Kimi K2.


Only if chip-to-chip communication is as fast as on-chip communication. Which it isn’t.

Only if chip-to-chip communication was a bottleneck. Which it isn't.

If a layer completely fits in SRAM (as is probably the case for Cerebras), you only have to communicate the hidden states between chips for each token. The hidden states are very small (7168 floats for DeepSeek-V3.2 https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/c... ), which won't be a bottleneck.

Things get more complicated if a layer does not fit in SRAM, but it still works out fine in the end.


It doesn't need to be: during inference there's little data exchange between one chip and the next (just a single hidden-state vector per token).

It's completely different during training, because the backward pass and weight updates put a lot of strain on inter-chip communication. But during inference even an x4 PCIe 4.0 link is enough to connect GPUs together without losing speed.
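To put a rough number on that claim, a sketch assuming ~7.9 GB/s for an x4 PCIe 4.0 link and the 7168-float bf16 hidden state from upthread:

```python
# Rough ceiling on token throughput if the only inter-GPU traffic is one
# hidden-state vector per token (pipeline-parallel inference).
link_bandwidth = 7.9e9            # bytes/s, approx. usable x4 PCIe 4.0
bytes_per_token = 7168 * 2        # bf16 hidden state, as discussed above

max_tokens_per_second = link_bandwidth / bytes_per_token
print(f"~{max_tokens_per_second:,.0f} tokens/s before the link saturates")
```

Hundreds of thousands of tokens per second before the link becomes the bottleneck, so the x4 link has enormous headroom for inference.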


I am using piper-tts ( https://github.com/OHF-Voice/piper1-gpl ) with these voice files for GLaDOS: https://huggingface.co/rokeya71/VITS-Piper-GlaDOS-en-onnx/tr...

It is not perfect, but quite sufficient for simple system messages.


DeepSeek had a theoretical profit margin of 545% [1] with far inferior GPUs at 1/60th the API price.

Anthropic's Opus 4.6 is a bit bigger, but they'd have to be insanely incompetent to not make a profit on inference.

[1] https://github.com/deepseek-ai/open-infra-index/blob/main/20...
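For reference, the headline figures from the linked write-up (as I recall them — ~$87k/day in H800 rental costs against ~$562k/day in theoretical API revenue, not independently audited) reproduce the 545% number:

```python
# Reproducing DeepSeek's theoretical profit margin from their disclosure [1].
# The inputs below are my recollection of that write-up, not audited figures.
gpu_hour_cost = 2.0        # USD per H800 per hour (rental price they assumed)
avg_nodes = 226.75         # average occupied nodes, 8 GPUs each
daily_cost = avg_nodes * 8 * gpu_hour_cost * 24      # USD per day
theoretical_revenue = 562_027                        # USD/day at R1 pricing

margin = (theoretical_revenue - daily_cost) / daily_cost
print(f"cost ${daily_cost:,.0f}/day, margin {margin:.0%}")
```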


American labs trained in a different way than the Chinese labs did. They might be making a profit on inference, but they are burning money otherwise.


> they'd have to be insanely incompetent to not make a profit on inference.

Are you aware of how many years Amazon didn’t turn a profit?

Not agreeing with the tactic - just…are you aware of it?


Amazon was founded in 1994, went public in 1997 and became profitable in 2001. So Anthropic is two years behind with the IPO but who knows, maybe they'll be profitable by 2028? OpenAI is even more behind schedule.


How much loss did they accumulate until 2001? Pretty sure it wasn't the $44 billion OpenAI has. And Amazon didn't have many direct competitors offering the same services.


Did Amazon really not turn a profit, or did they apply a bunch of tricks to make it appear like they didn't in order to avoid taxes? Given their history, I'd assume the latter: https://en.wikipedia.org/wiki/Amazon_tax_avoidance

Anyway, this has nothing to do with whether inference is profitable.


It has everything to do with whether they make a profit on paper, vs. giving away the farm via free-tier accounts and free trials, and, last but not least, subsidized compute to hook entire organizations.


Deepseek lies about costs systematically. This is just another fantasy.


What do you base your accusations on? Is there a specific number from the link above that you claim is a lie?

And how are 7 other providers able to offer DeepSeek API access at roughly the same price as DeepSeek?

https://openrouter.ai/deepseek/deepseek-v3.2


Their price is not a signal of their costs; it is the result of competitive pressure. This shouldn't be so hard to understand. Companies have been burning investor money for market share for quite some time.

This is expected, this is normal. Why are you so defensive?


> why are you so defensive?

Because you made stuff up, did not show any proof, and ignored my proof to the contrary.

You made the claim:

    > Deepseek lies about costs systematically.

DeepSeek broke down their costs in great detail, yet you simply called it "lies" without even mentioning which specific number of theirs you claim is a lie, so your statement is difficult to falsify. You also ignored my request for clarification.

You’re citing DeepSeek's unaudited numbers. That is not even close to proof; unless proven otherwise, it is propaganda. Meanwhile we have several industry experts pointing not only to DeepSeek's ridiculous claims of efficiency, but also to the lies from other labs.

Again, your claims are impossible to verify or falsify, because they are too unspecific.

> Meanwhile we have several industry experts pointing not only to DeepSeek's ridiculous claims of efficiency, but also to the lies from other labs.

What are those "industry experts" saying that is made up and what is their basis for that?

> You're citing DeepSeek's unaudited numbers.

Which specific number are you claiming to be fake?

I could just guess blindly and find alternative sources for random numbers from DeepSeek's article.

For example, the tokens-per-second efficiency can also be calculated based on the 30k tps from this NVIDIA article: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-...

But looking for other sources is a waste of my time, when you could just be more precise.


Not sure if this is what you are looking for, but here is Python compiled to WASM: https://pyodide.org/en/stable/

Web demo: https://pyodide.org/en/stable/console.html


No, it's not. It's an "interpreter": the whole interpreter binary (compiled to WASM) as well as the Python source is transferred to the client to be executed.


Oh, so you are looking for a real compiler. I do not think it is possible to fully compile Python, since the language is just too dynamic.

You'd have to compile every function for every possible combination of types, since the types of the function arguments cannot be known at compile time without solving the halting problem. Even worse, new types can be created at runtime.

You can either type everything (like Cython, which arguably is not really Python anymore) or include a compiler to handle types that were not known at compile time, but that is just a JIT compiler with extra steps.
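A minimal illustration of the problem, using nothing but ordinary Python: the same function body means different machine code depending on the argument types, and new types can show up at runtime:

```python
def add(a, b):
    # One body, many meanings: int addition, string concat, list concat...
    # A static compiler cannot pick machine code until the call site's
    # argument types are known.
    return a + b

print(add(1, 2))          # integer addition
print(add("ab", "cd"))    # string concatenation

# Worse: a brand-new type, created at runtime, can redefine "+" itself.
Weird = type("Weird", (), {"__add__": lambda self, other: "surprise"})
print(add(Weird(), Weird()))
```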


But Python compilers do exist, Nuitka being one of the more famous ones: https://en.wikipedia.org/wiki/Nuitka


Nuitka uses CPython as a fallback. From the Limitations section:

    > Standalone binaries built using the --standalone command line option include an embedded CPython interpreter to handle aspects of the language that are not determined when the program is compiled and must be interpreted at runtime, such as duck typing, exception handling, and dynamic code execution (the eval function and exec function or statement), along with those Python and native libraries that are needed for execution, leading to rather large file sizes.


    > Edit: Python has no JIT

There are quite a few JITs:

JIT-compiler for Python https://pypy.org/

Python enhancement proposal for JIT in CPython https://peps.python.org/pep-0744/

And there are several JIT-compilers for various subsets of Python, usually with focus on numerical code and often with GPU support, for example

Numba https://numba.pydata.org/numba-doc/dev/user/jit.html

Taichi Lang https://github.com/taichi-dev/taichi


Per PEP 744, CPython shipped an experimental JIT (disabled by default) in 3.13. It remains experimental in 3.14.

See https://docs.python.org/3/whatsnew/3.13.html#an-experimental...


I was wondering whether multiple GPUs make it go appreciably faster when limited by VRAM. Do you have some tokens/sec numbers for text generation?


You can run it on consumer grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so it will give you one token roughly every three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bit each. If you want to run it faster, you have to spend more money.
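The arithmetic behind that estimate, as a sketch (32 billion active parameters at 4 bits each, streamed from disk once per token):

```python
# Time per token if every active parameter must be read from the SSD.
active_params = 32e9          # active parameters per token (MoE)
bytes_per_param = 0.5         # native 4-bit quantization
read_speed = 7e9              # bytes/s, a typical PCIe 4.0 NVMe SSD

bytes_per_token = active_params * bytes_per_param    # 16 GB per token
seconds_per_token = bytes_per_token / read_speed
print(f"~{seconds_per_token:.1f} s per token")
```

Roughly one token every two to three seconds, matching the estimate above; a faster disk (or more of them) scales this linearly.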

Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/


High-end consumer SSDs can do closer to 15 GB/s, though only with PCIe gen 5. On a motherboard with two M.2 slots, that's potentially around 30 GB/s from disk. Edit: How fast everything runs depends on how much data needs to be loaded from disk, which is not always everything on MoE models.


Would RAID 0 help here?


Yes, RAID 0 or 1 could both work in this case to combine the disks. You would want to check the bus topology for the specific motherboard to make sure the slots aren't on the other side of a hub or something like that.


Maybe not as far-fetched as one might think.

Linus about the Tux mascot:

    > But this wasn't to be just any penguin. Above all, Linus wanted one that looked happy, as if it had just polished off a pitcher of beer and then had the best sex of its life.

Linus about free software:

    > Software is like sex; it's better when it's free.


> (it doesn't hallucinate on this)

But how do we know that you did not hallucinate the claim that ChatGPT does not hallucinate its version number?

We could try to exfiltrate the system prompt which probably contains the model name, but all extraction attempts could of course be hallucinations as well.

(I think there was an interview with Sam Altman or someone else at OpenAI where it was mentioned that they hardcoded the model name into the prompt, because people did not understand that models don't work like that, so they made it work. I might be hallucinating, though.)


Confabulating*. If you were hallucinating, we would be more amused :)


The entire "Supreme Skills!" series is amazing. Highly recommend!

