> I'm saying none of them have yet shown enough promise to be called another bas...

og_kalu · 2025-08-08T03:19:02 1754623142

>Can you clarify what this threshold is? I know that's one sentence, but I think it is the most important one in my reply. It is really what everything else comes down to. There's a lot of room between even academic scale and industry scale. There's very few things with papers in the middle.

Something like GPT-2. Something that even before being actually useful or particularly coherent, was interesting enough to spark articles like these. https://slatestarcodex.com/2019/02/19/gpt-2-as-step-toward-g... So far, only LLM/LLM adjacent stuff fulfils this criteria.

To be clear, I'm not saying general R&D must meet this requirement. Not at all. But if you're arguing about diverting millions/billions in funds from x that is working to y then it has to at least clear that bar.

> My argument is simply "there should be no threshold, it should be continuous".

I don't think this is feasible for large investments. I may be wrong, but i also don't think other avenues aren't being funded. They just don't compare in scale because....well they haven't really done anything to justify such scale yet.

godelski · 2025-08-08T06:20:57 1754634057

  > Something like GPT-2

I got 2 things to say here

1) There's plenty of things that can achieve similar performance to GPT-2 these days. We mentioned Mamba, they compared to GPT-3 in their first paper[0]. They compare with the open sourced version and you'll also see some other architectures referenced there like Hyena and H3. It's the GPT-Neo and GPT-J models. Remember GPT-3 is pretty much just a scaled up GPT-2.

2) I think you are underestimating the costs to train some of these things. I know Karpathy said you can now train GPT-2 for like $1k[1] but a single training run is a small portion of the total costs. I'll reference StyleGAN3 here just because the paper has good documentation on the very last page[2]. Check out the breakdown but there's a few things I want to specifically point out. The whole project cost 92 V100 years but the results of the paper only accounted for 5 of those. That's 53 of the 1876 training runs. Your $1k doesn't get you nearly as far as you'd think. If we simplify things and say everything in that 5 V100 years cost $1k then that means they spent $85k before that. They spent $18k before they even went ahead with that project. If you want realistic numbers, multiply that by 5 because that's roughly what a V100 will run you (discounted for scale). ~$110k ain't too bad, but that is outside the budget of most small labs (including most of academia). And remember, that's just the cost of the GPUs, that doesn't pay for any of the people running that stuff.

I don't expect you to know any of this stuff if you're not a researcher. Why would you? It's hard enough to keep up with the general AI trends, let alone niche topics lol. It's not an intelligence problem, it's a logistics problem, right? A researcher's day job is being in those weeds. You just get a lot more hours in the space. I mean I'm pretty out of touch of plenty of domains just because time constraints.

  > I don't think this is feasible for large investments. I may be wrong, but i also don't think other avenues aren't being funded.

So I'm trying to say, I think your bar has been met.

And I think if we are actually looking at the numbers, yeah, I do not think these avenues are being funded. But don't take it from me, take it from FeiFei Li[3]

  | Not a single university today can train a ChatGPT model

I'm not sure if you're a researcher or not, you haven't answered that question. But I think if you were you'd be aware of this issue because you'd be living with it. If you were a PhD student you would see the massive imbalance of GPU resources given to those working closely with big tech vs those trying to do things on their own. If you were a researcher you'd also know that even inside those companies that there aren't much resources given to people to do these things. You get them on occasion like the StarFlow and TarFlow I pointed out before, but these tend to be pretty sporadic. Even a big reason we talk about Mamba is because of how much they spent on it.

But if you aren't a researcher I'd ask why you have such confidence that these things are being funded and that these things cannot be scaled or improved[4]. History is riddled with examples of inferior tech winning mostly due to marketing. I know we get hyped around new tech, hell, that's why I'm a researcher. But isn't that hype a reason we should try to address this fundamental problem? Because the hype is about the advance of technology, right? I really don't think it is about the advancement of a specific team, so if we have the opportunity for greater and faster advancement, isn't that something we should encourage? Because I don't understand why you're arguing against that. An exciting thing of working at the bleeding edge is seeing all the possibilities. But a disheartening thing about working at the bleeding edge is seeing many promising avenues be passed by for things like funding and publicity. Do we want meritocracy to win out or the dollar?

I guess you'll have to ask yourself: what's driving your excitement?

[0] I mean the first Mamba paper, not the first SSM paper btw: https://arxiv.org/abs/2312.00752

[1] https://github.com/karpathy/llm.c/discussions/677

[2] https://arxiv.org/abs/2106.12423

[3] https://www.ft.com/content/d5f91c27-3be8-454a-bea5-bb8ff2a85...

[4] I'm not saying any of this stuff is straight de fact better. But there definitely is an attention imbalance and you have to compare like to like. If you get to x in 1000 man hours and someone else gets there in 100, it may be worth taking a look deeper. That's all.

og_kalu · 2025-08-08T18:00:18 1754676018

I'm not a researcher.

I acknowledge Mamba, RWKV, Hyena and the rest but like I said, they fall under the LLM bucket. All these architectures have 7B+ models trained too. That's not no investment. They're not "winning" over transformers because they're not slam dunks, not because no-one is investing in them. They bring improvements in some areas but with detractions that make switching not a straightforward "this is better", which is what you're going to need to divert significant funds from an industry leading approach that is still working.

What happens when you throw away state information vital for a future query ? Humans can just re-attend (re-read that book, re-watch that video etc), Transformers are always re-attending, but SSMs, RWKV ? Too bad. A lossy state is a big deal when you can not re-attend.

Plus some of those improvements are just theoretical. Improved inference-time batching and efficient attention (flash, windowed, hybrid, etc.) have allowed transformers to remain performant over some of these alternatives rendering even the speed advantage moot, or at least not worth switching over. It's not enough to simply match transformers.

>Because I don't understand why you're arguing against that.

I'm not arguing anything. You asked why the disproportionate funding. Non-transformer LLMs aren't actually better than transformers and non-LLM options are non-existent.

godelski · 2025-08-08T22:54:24 1754693664

So fair, they fall under the LLM bucket but I think most things can. Still, my point is about that there's a very narrow exploration of techniques. Call it what you want, that's the problem.

And I'm not arguing there's zero investment, but it is incredibly disproportionate and there's a big push for it to be more disproportionate. It's not about all or none, it is about the distribution of those "investments" (including government grants and academic funding).

With the other architectures I think you're being too harsh. Don't let perfection get in the way of good enough. We're talking about research. More specifically, about what warrants more research. Where would transformers be today if we made similar critiques? Hell, we have a real life example with diffusion models. Sohl-Dickstein's paper came out at a year after Goodfellow's GAN paper and yet it took 5 years for DDPM to come out. The reason this happened is because at the time GANs were better performing and so the vast majority of effort was over there. At least 100x more effort if not 1000x. So the gap just widened. The difference in the two models really came down to scale and the parameterization of the diffusion process, which is something mentioned in the Sohl-Dickstein paper (specifically as something that should be further studied). 5 years really because very few people were looking. Even at that time it was known that the potential of diffusion models was greater than GANs but the concentration went to what worked better at that moment[0]. You can even see a similar thing with ViTs if you want to go look up Cordonnier's paper. The time gap is smaller but so is the innovation. ViT barely changes in architecture.

There's lots of problems with SSM and other architectures. I'm not going to deny that (I already stated as much above). The ask is to be given a chance to resolve those problems. An important part of that decision is understanding the theoretical limits of these different technologies. The question is "can these problems be overcome?" It's hard to answer, but so far the answer isn't "no". That's why I'm talking about diffusion and ViTs above. I could even bring in Normalizing Flows and Flow Matching which are currently undergoing this change.

  > It's not enough to simply match transformers.

I think you're both right and wrong. And I think you agree unless you are changing your previous argument.

Where I think you're right is that the new thing needs to show capabilities that the current thing can't. Then you have to provide evidence that its own limitations can be overcome in such a way that overall it is better. I don't say strictly because there is no global optima. I want to make this clear because there will always be limitations or flaws. Perfection doesn't exist.

Where I think you're wrong is a matter of context. If you want the new thing to match or be better than SOTA transformer LLMs then I'll refer you back to the self-fulfilling prophecy problem from my earlier comment. You never give anything a chance to become better because it isn't better from the get go.

I know I've made that argument before, but let me put it a different way. Suppose you want to learn the guitar. Do you give up after you first pick it up and find out that you're terrible at it? No, that would be ridiculous! You keep at it because you know you have the capacity to do more. You continue doing it because you see progress. The logic is the exact same here. It would be idiotic of me to claim that because you can only play Mary Had A Little Lamb that you'll never be able to play a song that people actually want to listen to. That you'll never amount to anything and should just give up playing now.

My argument here is don't give up. To look how far you've come. Sure, you can only play Mary Had A Little Lamb, but not long ago you couldn't play a single cord. You couldn't even hold the guitar the right way up! Being bad at things is not a reason to give up on them. Being bad at things is the first step to being good at them. The reason to give up on things is because they have no potential. Don't confuse lack of success with lack of potential.

  > I'm not arguing anything. You asked why the disproportionate funding.

I guess you don't realize it, but you are making an argument. You were trying to answer my question, right? That is an argument. I don't think we're "arguing" in the bitter or upset way. I'm not upset with you and I hope you aren't upset with me. We're learning from each other, right? And there's not a clear answer to my original question either[1]. But I'm making my case for why we should have a bit more of what we currently use so that we get more in the future. It sounds scary, but we know that by sacrificing some of our food that we can use it to make even more food next year. I know it's in the future, but we can't completely sacrifice the future for the present. There needs to be balance. And research funding is just like crop planning. You have to plan with excess in mind. If you're lucky, you have a very good year. But if you're unlucky, at least everyone doesn't starve. Given that we're living in those fruitful lucky years, I think it is even more important to continue the trend. We have the opportunity to have so many more fruitful years ahead. This is how we avoid crashes and those cycles that tech so frequently goes through. It's all there written in history. All you have to do is ask what led to these fruitful times. You cannot ignore that a big part was that lower level research.

[0] Some of this also has to do with the publish or perish paradigm but this gets convoluted and itself is related to funding because we similarly provide more far more funding to what works now compared to what has higher potential. This is logical of course, but the complexity of the conversation is that it has to deal with the distribution.

[1] I should clarify, my original question was a bit rhetorical. You'll notice that after asking it I provided an argument that this was a poor strategy. That's framing of the problem. I mean I live in this world, I am used to people making the case from the other side.