
Here's a list of tools for scaling up transformer context that have GitHub repos:

* FlashAttention: In my experience, the current best solution for n² attention, but it's very hard to scale up beyond the low tens of thousands of tokens. Memory use is O(n) but compute is O(n²); see the sketch after this list. Code: https://github.com/HazyResearch/flash-attention

* Heinsen Routing: In my experience, the current best solution for n×m attention, i.e., mapping n tokens to m tokens. It's like a souped-up version of attention. I've used it to pull up more than a million tokens as context. Memory use and compute are O(nm). It works, but in my (limited) experience, it doesn't work out-of-the-box as well as FlashAttention for n² attention. Code: https://github.com/glassroom/heinsen_routing

* RWKV: A sort-of-recurrent model which claims to have performance comparable to n² attention in transformers. In my (limited) experience, it doesn't. Others seem to agree: https://twitter.com/arankomatsuzaki/status/16390003799784038... . Code: https://github.com/BlinkDL/RWKV-LM

* RMT (this method): I'm skeptical that the recurrent connections will work as well as n² attention or n×m routing in practice, but I'm going to give it a try. Code: https://github.com/booydar/t5-experiments/tree/scaling-repor...
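
To make the memory/compute trade-off concrete, here's a minimal sketch using PyTorch 2's scaled_dot_product_attention, which can dispatch to a FlashAttention-style fused kernel. To be clear, this is the PyTorch API rather than the flash-attention repo's own interface, and the sizes are just illustrative:

    # A minimal sketch (not the flash-attention repo's own API): PyTorch 2's
    # scaled_dot_product_attention can dispatch to a FlashAttention-style fused
    # kernel, so the full n-by-n score matrix is never materialized (O(n) memory),
    # while compute remains O(n^2) in sequence length.
    import torch
    import torch.nn.functional as F

    b, h, n, d = 1, 8, 16384, 64  # illustrative sizes; n is the sequence length
    q = torch.randn(b, h, n, d, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # Restrict dispatch to the flash kernel (requires CUDA and fp16/bf16 inputs).
    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                        enable_mem_efficient=False):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([1, 8, 16384, 64])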

In addition, the group that developed FlashAttention is working on state-space models (SSMs) that look promising to me. The idea is to approximate n² attention dynamically using only O(n log n) compute. There's no code available, but here's a blog post about it: https://hazyresearch.stanford.edu/blog/2023-03-27-long-learn... [CORRECTION: Code is available. See comment by lucidrains below. I'm hopeful this will go to the top of my list.]
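
For intuition about where the O(n log n) comes from: the core trick in this line of work is replacing the dense n×n attention matrix with a long convolution computed via FFT. Here's a toy sketch of that idea only; it's not the safari repo's API, and causal_fft_conv is just an illustrative helper:

    # Toy illustration of O(n log n) token mixing via a causal FFT convolution,
    # the trick behind S4/H3/Hyena-style long-convolution models.
    import torch

    def causal_fft_conv(u, k):
        """Causally convolve a length-n signal u with a length-n filter k
        in O(n log n), instead of forming an n-by-n matrix."""
        n = u.shape[-1]
        u_f = torch.fft.rfft(u, n=2 * n)  # zero-pad to 2n to avoid circular wrap-around
        k_f = torch.fft.rfft(k, n=2 * n)
        return torch.fft.irfft(u_f * k_f, n=2 * n)[..., :n]  # first n outputs are the causal ones

    n = 1_000_000                       # million-token sequences are feasible at n log n
    u = torch.randn(1, n)               # (batch, length) input channel
    k = 0.001 * torch.randn(n)          # stand-in for a learned/implicitly parameterized long filter
    print(causal_fft_conv(u, k).shape)  # torch.Size([1, 1000000])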

If anyone here has other suggestions for working with long sequences (hundreds of thousands to millions of tokens), I'd love to learn about them.




Very nice list. I didn't know about Heinsen Routing; it looks very interesting.

From my tests, SSMs are a very promising line of research. In my (small) tests with S4, it really does have better characteristics than transformers: it learned faster, handled a larger context, and needed a smaller dataset.


Agree on SSMs: they look promising. They're on my list of "things to explore more thoroughly." I've done very little with them so far. I'm still making my way through the related papers, trying to get a superficial but accurate/intuitive understanding of these models.


The code is here: https://github.com/hazyresearch/safari. You should try it and let us know your verdict.


Thank you. Somehow I missed that. I'm still making my way through the related papers, trying to get a superficial but accurate/intuitive understanding of these models. Embarrassingly, the work hasn't quite 'clicked' for me yet. Looking forward to tinkering with the code!


cs702, fantastic comment. I'm sorta poking around this area too. I'd be curious what benchmark you're using to evaluate performance among these repos. If you're up for it, shoot me an email -- my email is in my profile.


Thank you!

Working on proprietary stuff. Not allowed to share details.

But I'll ask about connecting online :-)



