
Why does he need to manually do the tracing or reference counting of all these nodes?

Instead, he could just keep references to the nodes he needs in the new tree, delete/override the old tree's root node, and expect the JavaScript GC to discard all the nodes that are no longer referenced.


It's explained in the post:

> Then, my plan was to construct a ProseMirror transaction that would turn the old tree into the new one. To do that, it’s helpful to know which nodes appeared in the old document, but not the new one.

So, it's not actually about reclaiming the memory. It's about taking some action on the nodes that will be reclaimed. It's akin to a destructor/finalizer, but I need that to happen synchronously at a time that I control. JavaScript does now support finalization (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...) but it can't be relied on to actually happen, which makes it useless in this scenario.


> What is the basis of these revisions?

Ideology


Money.


Stupidity


Meanwhile, according to the IEA, renewables investments have exceeded those in fossil fuels since roughly 2022 [1], and renewables are expected to be the top source of electricity generation by 2026 [2].

The Trump administration is basically following Kodak's strategy from the early 00s.

--

[1] https://www.iea.org/reports/world-energy-investment-2025/exe...

[2] https://www.eco-business.com/news/iea-renewables-will-be-wor...


Kodak in the early 2010s, maybe?

Eastman Kodak (the part that kept the film business) has been more or less stable in the 2020s. They even brought a discontinued film (E100) back. Production and pricing are now in line with the limited demand from film studios and hobbyists.


Made a typo, I meant 00s. I edited it, thanks :)


That can be fixed with absurd tariffs on PVs


At least they don't follow Kodak's strategy from the late 10s, with KodakCoin.

Oh, wait.


I've been trying to use other LLM providers than OpenAI over the past few weeks: Claude, Deepseek, Mistral, local Ollama ...

While Mistral might not have the best LLM performance, their UX is IMO the best, or at least a tie with OpenAI's:

- I never had any UI bugs, while these were common with Claude or OpenAI (e.g. a discussion disappearing, the LLM crashing mid-answer, long-context errors on Claude ...);

- They support most of the features I liked from OpenAI, such as libraries and projects;

- Their app is by far the fastest, thanks to their fast reply feature;

- They allow you to disable web-search.


It is painful, but I have done the same thing: dropping any paid use of OpenAI. For years, basically since I retired from managing a deep learning team at Capital One, I have spent a ton of time experimenting with all LLM options.

Enough! I just paid for a year of Gemini Pro. I use gemini-cli for free for small sessions, switch to my API key for longer sessions to avoid timeouts, and most importantly: for API use I mostly just use Gemini 2.5 Flash, sometimes Pro, and Moonshot's Kimi K2. I also use local models on Ollama when they are sufficient (which is surprisingly often.)

I simply decided that I no longer wanted the hobby of always trying everything. I did look again at Mistral a few weeks ago, a good option, but Google was a good option for me.


Couldn't the battery just do, for example, 1-minute-long charge/discharge cycles?

For example, if the electricity price is -28€/MWh (like today in Germany), and your battery efficiency is 80%, you could get paid 28€/MWh while charging, then only pay back about 22€ when discharging, generating a roughly 6€/MWh profit.
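
To make that arithmetic concrete, here is a rough sketch (my own illustration; it assumes the price stays at -28€/MWh for both legs and ignores grid fees, taxes and degradation):

    # Negative-price arbitrage, rough numbers only.
    # Assumptions: price stays at -28 EUR/MWh for both the charge and the
    # discharge leg, 80% round-trip efficiency, no fees or degradation costs.
    price_eur_per_mwh = -28.0      # negative price: you are paid to consume
    round_trip_efficiency = 0.80

    charged_mwh = 1.0
    discharged_mwh = charged_mwh * round_trip_efficiency        # 0.8 MWh

    paid_for_charging = -price_eur_per_mwh * charged_mwh          # +28.0 EUR received
    paid_back_discharging = -price_eur_per_mwh * discharged_mwh   # 22.4 EUR given back

    profit = paid_for_charging - paid_back_discharging
    print(f"Profit per MWh charged: {profit:.1f} EUR")            # ~5.6 EUR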


The wholesale energy markets don't have sub-5-minute granularity anywhere that I'm aware of. In the US, 1 hour is standard in the day-ahead markets and 5 minutes is standard for the spot markets.

There is also the problem that your battery would likely degrade fast depending on the technology.


There might not even be any need for V2G or V2H.

Just charging your car when the demand is low is probably enough to drastically reduce the overall cost of the system. And this has basically no impact on the battery lifespan.


A trial in the UK resulted in customers earning up to £725/year [1]. With increased renewables on the grid leading to increased fluctuations in the wholesale price of electricity, providing V2G/V2H will further reduce a customer's electricity bill on top of the savings offered by smart charging, e.g. the Charge Anytime tariff is 7p per kWh for EV charging [2] vs a 27p/kWh average for Apr - Jun 2025 [3] (see the rough calculation below).

1. https://www.kaluza.com/case-studies/case-study-kaluza-enable...

2. https://www.ovoenergy.com/electric-cars/charge-anytime

3. https://www.nimblefins.co.uk/average-cost-electricity-kwh-uk
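
To put those tariff numbers in rough perspective, a back-of-the-envelope calculation (the annual charging figure is my own assumption, not from the linked sources):

    # Rough, illustrative smart-charging saving. The ~2,500 kWh/year figure is
    # an assumed ballpark for ~10,000 miles of driving, not from the trial.
    annual_charging_kwh = 2500
    smart_tariff_p_per_kwh = 7    # Charge Anytime EV rate cited above
    standard_p_per_kwh = 27       # average Apr-Jun 2025 rate cited above

    saving_gbp = annual_charging_kwh * (standard_p_per_kwh - smart_tariff_p_per_kwh) / 100
    print(f"Smart-charging saving: ~GBP {saving_gbp:.0f}/year")   # ~GBP 500/year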


High demand is not the sole reason for outages.


This is definitively possible. Are you thinking about YouTube or social media links?


Those come to mind, but also Internet Archive item links; this enables potentially dating any video or audio content in their archive (assuming it contains the necessary audio data from the mains). I admit this is lazy, as it is easy to retrieve the content and then upload it, but still worth mentioning in case it is valuable product feedback.


I've an EcoFlow plug & play inverter, and it automatically shuts off if the grid goes down. That's a requirement for all these devices.


Do we know which changes made DeepSeek V3 so much faster and better at training than other models? DeepSeek R1's performance seems to be highly related to V3 being a very good model to start with.

I went through the paper and I understood they made these improvements compared to "regular" MoE models:

1. Latent Multi-head Attention. If I understand correctly, they were able to do some caching on the attention computation. This one is still a little bit confusing to me;

2. New MoE architecture with one shared expert and a large number of small routed experts (256 total, but only 8 active for any given token). This was already used in DeepSeek V2;

3. Better load balancing of the training of experts. During training, they add a bias or "bonus" value to experts that are less used, to make them more likely to be selected in future training steps (see the sketch at the end of this comment);

4. They added a few smaller transformer layers to predict not only the first next token, but a few additional tokens. Their training error/loss function then uses all these predicted tokens as input, not only the first one. This is supposed to improve the transformer's ability to predict sequences of tokens;

5. They are using FP8 instead of FP16 when it does not impact accuracy.

It's not clear to me which changes are the most important, but my guess would be that 4) is a critical improvement.

1), 2), 3) and 5) could explain why their model trains faster by some small factor (+/- 2x), but neither the advertised 10x boost nor how it performs so much better than models with way more activated parameters (e.g. Llama 3).
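
A minimal sketch of the bias-based load balancing from point 3, as I understand it from the paper (names, shapes and the update rule are illustrative, not DeepSeek's actual code):

    import numpy as np

    # Illustrative auxiliary-loss-free load balancing for MoE routing: a
    # per-expert bias is added to the affinity scores only for top-k expert
    # selection, and is nudged up/down after each step depending on whether
    # the expert was under- or over-used. Everything here is simplified.
    num_experts, top_k, gamma = 256, 8, 0.001   # gamma = bias update speed
    bias = np.zeros(num_experts)                # adjusted outside of SGD

    def route(scores: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """scores: (batch, num_experts) affinity scores."""
        # Selection uses the biased scores, so under-used experts are picked
        # more often...
        selected = np.argsort(scores + bias, axis=-1)[:, -top_k:]
        # ...but the mixing weights still use the original, unbiased scores.
        weights = np.take_along_axis(scores, selected, axis=-1)
        weights /= weights.sum(axis=-1, keepdims=True)
        return selected, weights

    def update_bias(selected: np.ndarray) -> None:
        """Raise the bias of under-used experts, lower it for over-used ones."""
        global bias
        counts = np.bincount(selected.ravel(), minlength=num_experts)
        bias += gamma * np.sign(counts.mean() - counts)

    # Toy usage: random affinity scores for a batch of 16 tokens.
    scores = np.random.default_rng(0).uniform(size=(16, num_experts))
    selected, weights = route(scores)
    update_bias(selected)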


The key idea of Latent MHA is that "regular" multi-headed attention needs you to keep a bunch of giant key-value (KV) matrices around in memory to do inference. The "Latent" part just means that DeepSeek takes the `n` KV matrices in a given n-headed attention block and replaces them with a lower-rank approximation (think of this as compressing the matrices), so that they take up less VRAM in a GPU at the cost of a little extra compute and a little lost accuracy. So not caching, strictly speaking, but weight compression to trade compute off for better memory usage, which is good because the KV matrices are one of the more expensive parts of this transformer architecture. MoE addresses the other expensive part (the fully-connected layers) by making it so only a subset of the fully-connected layers are active at any given forward pass.
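
A toy NumPy sketch of that low-rank idea (shapes and names are made up for clarity; real MLA also carries a separate decoupled RoPE key component, and the projections are learned):

    import numpy as np

    # Toy illustration of low-rank KV compression: cache one small latent
    # vector per token and reconstruct per-head K and V from it on the fly.
    # Dimensions and weights are arbitrary, for illustration only.
    d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64

    rng = np.random.default_rng(0)
    W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
    W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)
    W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)

    def step(h_t: np.ndarray, latent_cache: list[np.ndarray]):
        """Process one token's hidden state h_t of shape (d_model,)."""
        # Cache only the compressed latent instead of full per-head K and V.
        latent_cache.append(h_t @ W_down)            # (d_latent,)
        c = np.stack(latent_cache)                   # (seq_len, d_latent)
        # Reconstruct approximate keys and values for all heads.
        k = (c @ W_up_k).reshape(len(latent_cache), n_heads, d_head)
        v = (c @ W_up_v).reshape(len(latent_cache), n_heads, d_head)
        return k, v

    # Per token, the cache grows by d_latent = 64 floats instead of
    # 2 * n_heads * d_head = 2048 floats for uncompressed K and V.
    cache: list[np.ndarray] = []
    k, v = step(rng.normal(size=d_model), cache)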


https://planetbanatt.net/articles/mla.html this is a great overview of how MLA works.


They also did bandwidth scaling to work around the nerfed H800 interconnects.

> efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths

> The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component.

(I know some of those words)

https://arxiv.org/html/2412.19437v1


I think the fact that they used synthetic/distilled high-quality data from GPT-4o output for training, in the style of the Phi models, is significant as well.


Do we know which change(s) made DeepSeek V3 so much more efficient than other models?

I went through the paper and I understood they made these improvements compared to "regular" MoE models:

1. Latent Multi-head Attention. If I understand correctly, they were able to do some caching on the attention computation. This one is still a little bit confusing to me;

2. New MoE architecture with one shared expert and a large number of small routed experts (256 total, but only 8 active for any given token). This was already used in DeepSeek V2;

3. Better load balancing of the training of experts. During training, they add a bias or "bonus" value to experts that are less used, to make them more likely to be selected in future training steps;

4. They added a few smaller transformer layers to predict not only the first next token, but a few additional tokens. Their training error/loss function then uses all these predicted tokens, not only the first one. This is supposed to improve the transformer's ability to predict sequences of tokens. Note that they don't use this for inference, except for some latency optimisation by doing speculative decoding on the 2nd token.

5. They are using FP8 instead of FP16 when it does not impact accuracy.

My guess would be that 4) is the most impactful improvement. 1), 2), 3) and 5) could explain why their model trains faster, but not how it performs so much better than models with way more activated parameters (e.g. Llama 3).

