timlarshanson's comments | Hacker News

This was very surprising to me, so I just fact-checked the statement (using Kimi K2 Thinking, natch), and it's presently off by a factor of 2 - 4. In 2024 China installed 277 GW of solar, so 0.25 GW / 8 hours. In the first half of 2025 they installed 210 GW, so 0.39 GW / 8 hours.
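The division, for anyone who wants to redo it (installation totals as quoted above; an 8-hour block is just a third of a day):

    installed_2024_gw = 277          # full-year 2024 installs
    installed_h1_2025_gw = 210       # first-half 2025 installs

    per_8h_2024 = installed_2024_gw / 365 / 3      # 365 days, three 8-hour blocks per day
    per_8h_2025 = installed_h1_2025_gw / 181 / 3   # ~181 days in H1 2025

    print(f"2024:    {per_8h_2024:.2f} GW per 8 hours")   # ~0.25
    print(f"H1 2025: {per_8h_2025:.2f} GW per 8 hours")   # ~0.39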

Not quite at 1 GW / 8 hrs, but approaching that figure rapidly!

(I'm not sure where the coal plant comes in - really, those numbers should be derated relative to a coal plant, which can run 24/7)


> (I'm not sure where the coal plant comes in - really, those numbers should be derated relative to a coal plant, which can run 24/7)

It works both ways: you have to derate the coal plant somewhat due to the transmission losses, whereas with a lot of solar power being generated and consumed on/in the same building the losses are practically nil.

Also, pricing for new solar with battery storage is below the price of building a new coal plant and still dropping; it's approaching the point where it's economical to demolish existing coal plants and replace them with solar.


Re the Apportionment Act of 1929 -- care to elaborate? Are there figures for "the worst representation in the free world"?

My impression is that there are many reasons for the dysfunction of Congress; the media feedback control system (in a literal and metaphorical sense) plays an important role, as do the filibuster, lobbyists, and other corruption.

(Aside: in aging, an organism's feedback and homeostatic systems tend to degrade / become simpler with time, which leads to decreased function / cancer, etc. While some degree of refactoring & dead-code cruft-removal is necessary - and hopefully is happening now, as I think most Americans desire - the outright decline in operational structure is bad. (Not that you'd want a systems biologist to run the country.))


Not the parent, but broadly agree that a change to apportionment would heavily change the US for the better. I don't think it would be a single fix for the country, but I think it would greatly help quite a few of the issues.

Originally there were about 35k constituents/rep. Today it's an average of ~750k constituents/rep, with some districts at over a million.

This is because the Apportionment Act of 1929 capped the number of reps. If we had kept the original constituent/rep ratio, we'd have ~10k reps total.

If we instead went back to the constituent/rep ratio that existed originally, a lot of our structural problems would go away, via a mechanism that's accessible through US code rather than a change to the constitution.

For instance, the electoral college is based on federal representation. If you expand the house by ~50x, the house dominates the electoral college by nearly two orders of magnitude, and you get something very close to a popular election.
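To make the arithmetic concrete (my own back-of-the-envelope numbers, ignoring how DC's electors would scale):

    house_now, senate, dc = 435, 100, 3
    ec_now = house_now + senate + dc                 # 538 electors today

    house_big = house_now * 50                       # ~50x expansion -> 21,750 reps
    ec_big = house_big + senate + dc

    print(f"Senate+DC share today:     {(senate + dc) / ec_now:.1%}")   # ~19%
    print(f"Senate+DC share after 50x: {(senate + dc) / ec_big:.1%}")   # ~0.5%

    # and the ~10k reps figure from the original ratio:
    print(f"330M people / 35k per rep ~= {330_000_000 / 35_000:,.0f} reps")   # ~9,400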

It's also much much harder to gerrymander on that scale.

That scale would also mean a return to a more personal form of politics, where people actually have a real chance to meet their reps (and the candidates) face to face.

It also feels like a much larger, more diffuse legislative body would better approximate truly democratic processes within a representative democratic model.


> If you expand the house by ~50x

Wow, that's a lot! I recall reading a piece, I believe in the Washington Post, sometime within the past few years, on this topic. They didn't run the numbers for such a dramatic increase, but I think they discussed a House of around 1,000 representatives. I was surprised to find that this didn't shift the balance of power as much as I expected it would.

But regardless, as much as I would like for it to be easier for Democrats to win elections (in what would be an entirely fair way for them to do so!), that just puts one party in power more frequently. It doesn't fix the underlying dysfunction.


Biology is a bad example when applied to a government.

Almost all change in biology happens to populations, not individuals. For that to apply to governments, we would need massive churn and rapid experimentation with government policies and structures. These are not conducive to voter feedback (e.g., democracy) and would be so disruptive to business and life as to make governments useless until they reached some steady state.

I remember hearing that Italy had 52 governments in 50 years. It's suffering from all the same problems as the rest of the Western world, perhaps somewhat worse than average.


I doubt it. This does not seem to be a particularly well written or well thought-out paper -- e.g. equations 6 and 7 contradict their descriptions in the sentence below; the 'theorem' is an assertion.

After reading it a few times, I gather that, rather than kernelizing or linearizing attention (which has been thoroughly explored in the literature), they are using an MLP to do run-time modelling of the attention operation. If that's the case (?), which is interesting, sure: 1 -- Why didn't they say this plainly? 2 -- Why does eq. 12 show the memory MLP being indexed by the key, whereas eq. 15 shows it indexed by the query? 3 -- What's with all the extra LSTM-esque forget and remember gates? Meh. I wouldn't trust it without ablations.

I guess if an MLP can model a radiance field (NeRF) well, it stands to reason it can approximate attention too. The Q, K, V projection matrices will still need to be learned beforehand using standard training.

While the memory & compute savings are clear, it's uncertain whether this helps with reasoning or the generalization thereof. I doubt that too.


Eq. 12 is a loss function used to associate a given key and value in the memory MLP via test-time training with gradient descent.

Eq. 15 is simply the operation that queries a value which was inserted into the memory by previous tokens via eq. 12.

Basically, for each autoregressively processed segment you do:

1) Test-time inference: query values from memory with eq. 15.

2) Test-time training: associate new keys and values into the memory with the loss from eq. 12.

The forget and remember gates are there because... well, the architecture in general is very similar to an LSTM, but it uses test-time gradient descent to decide what to insert into the long-term memory.
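In code, the per-segment loop looks roughly like this -- my own sketch, not the authors' code; I'm assuming a single linear layer for M and plain SGD, and leaving out the forget/momentum gates and data-dependent learning rates:

    import torch

    d = 64
    memory = torch.nn.Linear(d, d)              # M: a 1-layer memory, per the fig. 7 ablation
    opt = torch.optim.SGD(memory.parameters(), lr=1e-2)

    def process_segment(q, k, v):               # q, k, v: [segment_len, d]
        # 1) Test-time inference (eq. 15): read values for the current queries.
        with torch.no_grad():
            retrieved = memory(q)
        # 2) Test-time training (eq. 12): write the new (k, v) pairs into M by
        #    a gradient step on the association loss ||M(k) - v||^2.
        loss = ((memory(k) - v) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return retrieved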


Ok, thanks for the clarification.

Seems the implicit assumption then is that M(q) -> v 'looks like' or 'is smooth like' the dot product, otherwise 'train on keys, inference on queries' wouldn't work? (A safe assumption imo with that l2 norm & in general; unsafe if q and k are drawn from different distributions.)

Correct me if I'm wrong, but typically k and v are generated via affine projections K, V of the tokens; if M is matrix-valued and there are no forget and remember gates (to somehow approximate the softmax?), then M = V K^-1.
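A quick numerical sanity check of that (my own, with random square projections):

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    K = rng.normal(size=(d, d))              # key projection (assumed invertible)
    V = rng.normal(size=(d, d))              # value projection
    M = V @ np.linalg.inv(K)                 # candidate linear memory

    x = rng.normal(size=(d, 5))              # a few token embeddings
    print(np.allclose(M @ (K @ x), V @ x))   # True: M maps every key to its value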


It's actually implied in the paper that the neural memory module M can be anything, and there's probably a lot of room to test different kinds of architectures for M. But in this paper M is an MLP with 1 layer (fig. 7 is an ablation study using different numbers of layers for the MLP).

> using a matrix-valued memory M [...] is an online linear regression objective and so the optimal solution assumes the underlying dependency of historical data is linear. On the other hand, we argue that deep memory modules (i.e., M ≥ 2) [...]. Aligning with the theoretical results that MLPs with at least two layers are strictly more expressive than linear models (Hornik, Stinchcombe, and White 1989), in Section 5.5, we show that deep memory modules are more effective in practice


The paper has ablations


Yes, this bothered me as well - the Department of Government Efficiency, like all government agencies, is working for the public good in the public interest. This means everything must default to being open, unless there is a good reason not to be (military, CIA, etc.).

I don't trust Elon, and don't see why DOGE should (or could) be secret - unless it's a cover to acquire more power, which seems to be his true objective (recently, at least).


Yep. From what I've seen, if the head wants to do nothing, it can attend to itself = no inter-token communication.

Still, differential attention is pretty interesting & the benchmarking good, seems worth a try! It's in the same vein as linear or non-softmax attention, which also can work.

Note that there is an error below Eq. 1: W^V should have shape [d_model x d_model], not [d_model x 2*d_model] as for the Q, K matrices.

Idea: why not replace the lambda parameterization between the softmax operations with something more general, like a matrix or an MLP? E.g.: attention becomes an affine combination of N softmax attention operations (say, across heads). If the transformer learns an identity matrix here, then you know the original formulation was right for the data; if it learns something sparse, these guys were right; if it's something else entirely, then who knows...
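Something like this is what I have in mind -- a rough sketch of the idea, not anything from the paper; the mixing matrix w_mix is mine:

    import torch

    def mixed_attention(q, k, v, w_mix):
        # q, k, v: [batch, heads, seq, dim]; w_mix: [heads, heads], learnable.
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)                       # per-head softmax maps
        # Combine the N softmax maps across heads. An identity w_mix recovers
        # standard multi-head attention; a sparse mix with a -lambda
        # off-diagonal entry is in the spirit of differential attention.
        mixed = torch.einsum('gh,bhst->bgst', w_mix, attn)
        return mixed @ v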


Agreed.

Also, how does it get started? Seems like the only force pushing the piston against the axial cam is the fuel explosion. (Perhaps extra springs to retract the piston for intake?)


> Also, how does it get started? Seems like the only force pushing the piston against the axial cam is the fuel explosion. (Perhaps extra springs to retract the piston for intake?)

In lieu of combustion you still have the air spring rebound from the compression... (at least until the exhaust opens)


Presumably an electric starter, much like in a traditional engine - only it'll turn the drive shaft instead of a crankshaft.


Yep. This is a very interesting material, and of course it's a research prototype -- but it's not very strong. They list the modulus as 12.7 GPa and the yield strength (= ultimate tensile strength, since the film tears) as 488 MPa.

In comparison, polyimide (PMDA-PPD), which is also easily solvent-processable, has a modulus of 8.9 GPa and a yield strength of 350 MPa.

Less equal comparisons involve polymers that are molecularly aligned by drawing, spinning, or chemical processes. Dyneema UHMWPE has a modulus of 110 GPa and an ultimate tensile strength of 3.5 GPa. Kevlar is similar; it relies on interlocking hydrogen bonds for its strength. Even stronger are glass fibers (>4 GPa tensile strength) and PAN carbon fiber (>6 GPa tensile strength).

You of course lose some strength when you make composites out of fibers -- but regardless, this polymer is many times weaker and softer.


I don't follow how punctuated equilibrium fits in here, but I do agree with your general intuition. Evolution 'likes' spaces that are navigable. Protein evolution is, to my mind, the paragon of this: even though the space of possible amino acid sequences is tremendously huge, relatively few new folds have been discovered since 2010, and it seems that there are only ~100k of them in nature. See https://ebrary.net/44216/health/limits_fold_space

Proteins get the substrate right: a handful of folds is sufficient for all the interactions an organism could need, so evolution can find new solutions quickly. (It only took hundreds of millions of years for LUCA's parents to figure /that/ out.)

It seems that being able to parameterize the problem space such that solutions are plentiful and accessible via random search is nearly equivalent to solving the problem... In this case, using an ANN to stand in for ('parameterize') organismal development is entirely reasonable (and would hence 'solve' the problem); I look forward to seeing the results of that. But, as with the OP, I'm cautious about the efficiency of backprop vs. evolution.


But if your realistically spiking, stateful, noisy biological neural network is non-differentiable (which, so far as I know, is true), then how are you going to propagate gradients back through it to update your ANN-approximated learning rule?

I suspect that, given the small size of synapses, the algorithmic complexity of learning rules (and there are several) is small. Hence you can productively use evolutionary or genetic algorithms to perform this search/optimization -- which I think you'd have to anyway, due to the lack of gradients, or simply due to computational cost. There's plenty of research going on in this field. (Heck, while you're at it, you might as well perform a similar search over wiring topologies & recapitulate our own evolution without having to deal with signaling cascades, transport of mRNA & protein along dendrites, metabolic limits, etc.)
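By "evolutionary or genetic algorithms" I mean something even as dumb as this loop (purely illustrative -- simulate_network and the 4-coefficient Hebbian-style rule are hypothetical placeholders, not anything from the literature):

    import numpy as np

    # Evolve the coefficients of a simple local learning rule
    #   dw = a*pre*post + b*pre + c*post + d
    # scored by task fitness. simulate_network() is a hypothetical placeholder
    # for the (non-differentiable) spiking-network simulation.
    def fitness(rule_params):
        return simulate_network(rule_params)

    rng = np.random.default_rng(0)
    pop = rng.normal(size=(64, 4))               # 64 candidate rules, 4 coefficients each
    for gen in range(200):
        scores = np.array([fitness(p) for p in pop])
        parents = pop[np.argsort(scores)[-16:]]  # keep the fittest quarter
        children = parents[rng.integers(0, 16, size=48)] + 0.1 * rng.normal(size=(48, 4))
        pop = np.vstack([parents, children])     # elitism + mutated offspring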

Anyway, coming from a biological perspective: evolution is still more general than backprop, even if in some domains it's slower.


This is a good question. I think many "biologically plausible" neural models are willing to make some approximations in exchange for computational tractability (e.g. rate coding instead of spike coding, point neurons and synapses instead of a cable model). As for non-differentiable operations, one strategy might be to formulate it as a multi-agent communication problem (e.g. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFil...), where gradients are obtained via a differentiable relaxation or a score-function gradient estimator (e.g. REINFORCE).
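For the score-function route, the trick is to make the spike stochastic, so the expectation is differentiable even though the sample isn't. A minimal sketch, assuming Bernoulli spiking with p = sigmoid(membrane potential) -- not anything from the linked paper:

    import torch

    def reinforce_spike(potential, reward):
        # spike ~ Bernoulli(sigmoid(u)): the hard sample is non-differentiable,
        # but grad E[reward] = E[reward * d/du log p(spike | u)]  (REINFORCE).
        p = torch.sigmoid(potential)
        spikes = torch.bernoulli(p)
        log_prob = spikes * torch.log(p + 1e-8) + (1 - spikes) * torch.log(1 - p + 1e-8)
        (reward.detach() * log_prob).sum().backward()   # grads flow into `potential`
        return spikes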


You can actually calculate exact gradients for spiking neurons using the adjoint method: https://arxiv.org/abs/2009.08378 (I'm the second author). In my PhD thesis I show how this can be extended to larger problems and more complicated and biologically plausible neuron models. I agree with the gist of your post though: Retrofitting back propagation (or the adjoint method for that matter) is the wrong approach. One should rather use these methods to optimise biologically plausible learning rules. The group of Wolfgang Maass has done exciting work in that direction (e.g. https://arxiv.org/abs/1803.09574, https://www.frontiersin.org/articles/10.3389/fnins.2019.0048..., https://igi-web.tugraz.at/PDF/256.pdf).


I was aware of Neftci's work, but not your result -- I stand corrected! From that perspective: given that LIF networks are causal systems, of course you can reverse them with sufficient memory. I understand the memory in this case is the input synaptic currents at the time of every spike (i.e. which synapses contributed to the spike). This is suspiciously similar to spine and dendritic calcium concentrations. Those variables are usually only stored for a short time -- but, that said, the hippocampus (at least) is adept at reverse replay, so there is no reason calcium could not be a proxy for the 'adjoint'. Hm.

Interesting Maass references too. Cheers


I agree that calcium seems like a natural candidate, and I suggest as much in my thesis. Coming from physics, I didn't know about reverse replay in the hippocampus for a long time, but I also have this association now. I would be glad to talk more; is there a way to reach you?


Agreed. As I've grown older, I spend less time running tight verbal loops in my mind, and more time examining things visually. It seems more externally-oriented, and allows for better sleep.


