Unlimiformer: Long-Range Transformers with Unlimited Length Input (arxiv.org)
335 points by shishy on May 5, 2023 | 101 comments



1. This is not exact attention, but an approximation of it. Specifically, they use k-nearest neighbors to retrieve the top-k most similar tokens out of an "unlimited-length input", say of size N, where k << N. (A rough sketch of this idea follows the list.)

2. This idea is quite similar to retrieval transformers and Hopfield networks which have been known and published for several years now. It's not really that novel.

3. Due to the preceding points, the title can easily mislead people. It's not really a conventional transformer, and it's not a breakthrough.

4. This paper is a preprint and not peer-reviewed.
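
To make point 1 concrete, here's a toy NumPy sketch of the general idea (my own illustration, not the authors' code): retrieve the k keys most similar to the query and run ordinary softmax attention over just those. In the real thing, the brute-force similarity scan below would be an approximate nearest-neighbor index.

    import numpy as np

    def topk_knn_attention(q, keys, values, k):
        # Instead of attending over all N keys, keep only the k most
        # similar ones and softmax over those.
        scores = keys @ q                          # (N,) dot-product similarities
        topk = np.argpartition(scores, -k)[-k:]    # indices of the k best keys
        s = scores[topk] / np.sqrt(q.shape[0])     # usual 1/sqrt(d) scaling
        w = np.exp(s - s.max()); w /= w.sum()      # softmax over k items only
        return w @ values[topk]                    # weighted sum of k values

    N, d, k = 100_000, 64, 32                      # k << N
    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(N, d)), rng.normal(size=(N, d))
    q = rng.normal(size=d)
    out = topk_knn_attention(q, keys, values, k)   # shape (d,)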

"I generally don't enjoy seeing preprints like this going to the top of Hacker News. This would be a higher quality submission if the paper was peer-reviewed or put into a greater context, like a blog post discussion or something like that."

Let me retract this and say something a bit nicer :) I personally think this specific preprint making it to the top of HN is potentially harmful, because of the hype around LLMs, the diverse audience of readers here, and the specific title that implies a claim of "transformer with unlimited context length", when this is misleading. I don't have anything against preprints in general - a lot of work outside of the peer-review process ends up being very impactful.


After a very quick read, that's my understanding too: It's just KNN search with some bells and whistles. So I agree on points 1-3.

When something works well, I don't care much about point 4.

Personally, I've had only mixed success with KNN search on long sequences. Maybe I haven't done it right? I don't know. In my experience, nothing seems to work quite as well as explicit token-token interactions by some form of attention, which as we all know is too costly for long sequences (O(n²)). Lately I've been playing with https://github.com/hazyresearch/safari , which uses a lot less compute and seems promising, though it reminds me of things like FNet. Otherwise, for long sequences I've yet to find something better than https://github.com/HazyResearch/flash-attention for n×n interactions and https://github.com/glassroom/heinsen_routing for n×m interactions. If anyone has other suggestions, I'd love to hear about them.
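
For a sense of scale on the O(n²) point, here's a quick back-of-the-envelope (my numbers, not tied to any of the repos above): the memory needed just to hold one full n x n attention score matrix in fp16, per head, per layer.

    # 2 bytes per fp16 score, ignoring batch size, heads and layers.
    for n in (2_048, 32_768, 262_144):
        print(f"n = {n:>7,}: {(n * n * 2) / 2**30:10.3f} GiB")

That works out to roughly 8 MiB at n = 2,048, 2 GiB at n = 32,768, and 128 GiB at n = 262,144. FlashAttention avoids ever materializing that matrix, which is part of why it helps so much, but the compute is still quadratic in n.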


> I generally don't enjoy seeing preprints like this going to the top of Hacker News. This would be a higher quality submission if the paper was peer-reviewed or put into a greater context, like a blog post discussion or something like that.

This opinion seems totally backwards to me. I'm not sure what you think peer-reviewed means? Also, I prefer full preprints to blog posts. But then again, I have no idea why things like the daily blog posts of Seth Godin (to pick on one randomly, sorry, it's not personal) so often go to the top of Hacker News. Maybe opinions like yours explain it?


> This opinion seems totally backwards to me.

I agree.

> I'm not sure what you think peer-reviewed means?

Posting to HN is a form of peer-review, typically far better than the form of "peer-review" coopted by journal publishers.


> Posting to HN is a form of peer-review, typically far better than the form of "peer-review" coopted by journal publishers.

This is a rather self-aggrandizing view, and I think it speaks to the level of ego that underpins a lot of the discussion on here.


There's a lot of junk comments on HN but there's also a lot of junk comments at top conferences like CVPR, ICCV, and NIPS. The system is just noisy. I've had plenty of inane reviews that clearly break reviewer guidelines (ACs do nothing)[0,1].

Also, I want to remind everyone that ML uses conferences as the main publishing mechanism, not journals. While things like JMLR exist, that's not where papers are targeting.

Maybe we just need to let researchers evaluate works based on their merits and not concern ourselves with things like popularity, prestige, and armchair experts' opinions. The latter seems antiscientific to me. We need to recognize that the system is noisy and Goodhart shows us we aren't optimizing merit.

[0] An example: I had a strong reject with two lines of text, one stating that it wasn't novel (no further justification) and the other noting a broken citation link to the appendix. No comments about the actual content.

[1] As another example, I've had reviewers all complain because I didn't compare one class of model to another and wasn't beating their performance. I beat the performance of my peers, but different models do different things. Image quality is only one metric. You wouldn't compare PixelCNN to StyleGAN.


> Maybe we just need to let researchers evaluate works based on their merits and not concern ourselves with things like popularity, prestige, and armchair experts' opinions.

Ok, but how would the researchers communicate their evaluation to non-experts? (Or other experts who didn't have the time to validate the paper)

Isn't that exactly what a review is?

My impression is the armchair experts are more likely to be found on HN.


> Ok, but how would the researchers communicate their evaluation to non-experts?

Conferences, journals, and papers are not for non-experts. They are explicitly for experts to communicate with experts. The truth is that papers have never been validated and likely never will. Code often isn't uploaded alongside papers and when it is I know only a handful of people that look at it (including myself) and only one that executes it (and not often). Validation only happens with reproduction (i.e. grad students learning) and funding doesn't encourage that. Despite open source code, lots of ML is still difficult to reproduce, if it can be done at all.

We also use normal forms of communication like Twitter, HN, Reddit, email, etc but there's a lot of noise (as you note). We speak a different language though, so you can often tell.

Frankly, a lot of us are not concerned with explaining our work to laymen. It's a lot of work, especially the more complex a subject is, and we're already under high pressure to continue researching. It's never good enough. There's no clear "done with work" time in jobs like this. You're always working, so you have to allocate your energy (I'm venting and mentally fatigued right now). I used to be passionate about teaching laymen but I'm tired of arguing with armchair experts. Still happy and passionate about teaching my students and performing research, so that's where I'll spend most of my energy: in the classroom or blogs. The more popular a subject is, the more likely this is to happen too, ironically.

Communication should come from news, university departments, and specialty science communicators, but that's broken down. Honestly, I just think it's a tough time for laymen to get accurate information. There's a lot of good information out there for you all (us researchers learn from publicly available materials) but expertise is being able to distinguish signal from noise, and the greater the popularity, the greater the noise. This isn't just true for ML, we see this in things like climate, nuclear, covid, gender/sexuality, and other hot topics. Only thing you can do is actually use a common strategy from researchers: have high doubt and look for patterns from research groups.


Personally I relish many of the third-string papers that people post on arXiv about run-of-the-mill text analysis projects they do because they give me more insight into the results I'll get and the challenges I'll face when I do my own text analysis projects.

If you go to a computer science conference you might talk about the headliners later but you actually learn a lot from talking to less famous people at the back of the room, scanning large numbers of poster papers, sharing a bottle of wine at dinner with four people and having one of them get way too drunk and talk trash about academics you distantly know, etc.

Lower-quality papers on arXiv give me a bit of that feel.


>This is a rather self-aggrandizing view, and I think it speaks to the level of ego that underpins a lot of the discussion on here.

I'm not so sure about that. I've read a lot of things that should have never left peer review or editing stages, while some of the most important papers for my field never left preprint.

Overall I think the most important step of peer review is you as the reader in the field. Peer review should catch the worst offenders out, saving us all some time, but it should never be viewed as a seal of approval. Everything you read should be critically evaluated as if it were a preprint anyway.


I realize some people have taken my comment to be speaking on the efficacy of the peer review process but that was not my intent. I have no experience reading or reviewing papers, or with the journal publication process. My point was more to the fact that HN is a public forum in which anyone can participate and so elevating it above (what I hope are) subject matter experts seemed rather arrogant. To be fair, the OP has since expanded with a more complete comment and it seems to be a similar sentiment to the things you and a couple others have shared.


> is a public forum in which anyone can participate

I don't think "participate" and "leave a comment" are the same thing. A random person most likely wouldn't be able to follow or contribute to the conversation. They could only leave a comment.

It's a bit pedantic, but noise usually sinks to the bottom.


Having been on a paper review board, the selection process is essentially credentialism for credentialism’s sake. Anyone who’s done a paper or two is deemed to be qualified, and as it’s unpaid, uncredited bonus work on top of your day job, the slots aren’t competed for very hard.

I would say the primary difference between a conference peer review board and HN is that the author is obliged to respond to the reviewers on the board. I would not say there’s any particular difference in qualifications.


> Anyone who’s done a paper or two

That already narrows it down greatly compared to the general public you find on the internet.


but not necessarily for the better. :)


Do you think it's factually incorrect that the HN comment section is more likely to find problems which invalidate the conclusions of the paper than the journal-driven peer review process?


Yes?


On reflection, I probably agree that the answer is "yes" to the question as I phrased it. I think that if you take a random paper, the peer reviewers probably do have much more useful feedback than HN would.

However, if you limit the question to "papers which make bold conclusions of the type that generates lots of discussion on HN", I think HN will be more likely to find methodological flaws in those papers than the peer review process would. I think that's mostly because papers are optimized pretty hard not to have any problems which would cause them to be rejected by the peer review process, but not optimized very hard to not have other problems.

Which means, on average, I expect the HN comment section to have more interesting feedback about a paper, given that it's the sort of paper that gets lots of HN discussion, and also given that the author put a lot of effort into anticipating and avoiding the concerns that would come up in the peer review process.

Which, to a reader of HN, looks like "a lot of peer-reviewed papers have obvious flaws that are pointed out by the HN commentariat".

I do think, on the object level, a pre-print which the author intends to publish in a reputable journal will be improved more by fixing any problems pointed out by HN commenters than by fixing any problems pointed out by peer reviewers, and as such I think "post the pre-print on HN and collect feedback before peer review" is still a good step if the goal is to publish the best paper possible.


This is a considerably more thoughtful comment and I appreciate your reflection. I also can see how my initial response was a little broad and over-generalizing. I do think there is an interesting conversation in there about whether a group of technically minded people outside the "in group" of the peer reviewer circle (of whatever paper in question) could offer different and potentially important feedback.

Although I should add I have no background in academia and don't feel prepared to have that discussion.


> "post the pre-print on HN and collect feedback before peer review" is still a good step

It'll cause some journals to not publish your work.


I think that it depends on what journal we are talking about. Most of them have some biases in their processes, just as HN commenters also do.


It would be more charitable and accurate to read it as a statement of the sad state of review at many journals. Plenty are rubber stamps where the most you might expect from reviewers is an insistence to add citations to their own barely relevant papers.


There's no need to attack the entire HN community over one person's opinion. Preprints and discussions here both have value, and different forms of review suit different needs.


This was not an attack against the community or the paper in question. I am only speaking from my experience as (primarily) a lurker.


My apologies, I misinterpreted your comment. You make a fair point that HN discussions are not equivalent to formal peer review.


That's redefining what "peer-review" is. And I'll take credentialism over some board of anonymous internet people, I'm sorry.

I mean, hypothetically, this whole thread could be stuffed with sock puppet accounts of the author. How would you know?


You can check the commenters post history?


I got let in on the secret of what "peer review" actually looks like for microbiology papers, and... it's just three PhD students in a trenchcoat, posing as the PI who's too busy to review papers.

Except of course for Nature-level papers, but most people never get to review papers like that


PhD students are typically more knowledgeable than the PI, so this is probably optimal. Perhaps a good postdoc is better. PIs are often way outdated in their knowledge and don't have the time or energy to stay abreast of developments in the field. That's what PhD students are for - they have to prove themselves by developing what should be the most important work of their lifetime before switching to what is essentially management, rather than science, unfortunately.


There's nothing really wrong with a preprint making it to the top - there can be genuinely good work that stays in preprint for quite some time. I believe the original ELMo work that spurred the Sesame street gang is still in preprint despite its importance in NLP (:shocked Pikachu face: not a transformer?!).

But yes, you're correct in this instance that it's not necessarily 'huge news' since it is highly similar to a long list of prior work: the Reformer (LSH-based), Performer (FAVOR**), FNet (Fourier-based), Routing Transformer, Sparse Transformer, Longformer (task-specific sparse), BlockBERT, XLNet/xfmr-xl (slide + relative PE), BP-Transformer (binary partition), BigBird (global and random attention), RWKV which is..., etc.

** FAVOR actually is innovative and different in this space, but towards similar ends anyway


How come you know the efficient-transformers family? When I ask questions about transformers in ML interviews, nobody has heard of them. I can't figure out why it's not common knowledge. For years, all the transformer papers were about reducing O(N^2).


Two reasons: They were only recently implemented in large production models and are not part of ye-standard-ML-coursera. And I mean, for half the papers claiming that their particular efficiency variant reduces O(n^2) to whatever without performance loss, we found that in practice it ain't quite so shiny.

Anyone who has been for whatever reason reading the papers since 2017 has invariably read dozens of these papers.

Anyone who has heard of GPT-x in 202x and started from there probably didn't.

This will likely change with the implementation of memory retrieval, some form of linear attention, etc. in many production models, and the democratization of some decoder models... although I have been thinking this for a while.

Don't get me wrong, you want to hire the people who know these papers, especially if they started after 2017 :-)


I have a team of bioinformaticians with very little ML knowledge, but they know of these basic papers... So yeah, if someone claims to be in ML and isn't aware of this foundational knowledge... they're not to be taken seriously.


To be fair, ML is (used to be?) pretty broad, so unless someone is actively keeping up with the SOTA in the high-data sequence modeling area, it's quite possible to miss. I know ML teams which were entirely made up of OSML practitioners, because that was what was most commonly useful until recently.


The reason they don't know them is they're not serious researchers or practitioners of ML - it's as simple as that. Anyone in this area should have this exceedingly basic common knowledge.


Why learn something no one is using?


Oh I didn't test their in-depth knowledge, just checking whether they were aware that people were writing hundreds of papers on this topic, and what impact a sub-quadratic attention that works as well as the regular one would have. Fishing to see if they remembered at least one or two approaches.


Many people are using it, but people that are relying on OpenAI to do all of the work for them aren't looking deep enough to realize it.


Can you point to a single major project that does? AFAIK it always leads to significant regressions in real-world use cases, which is why no major repo uses them.


I generally don't enjoy something being diminished on account of being "not really that novel".

Your comment essentially says - this is not a high quality submission because readers might not actually read it, which is no fault of the work, or submitter.


> Your comment essentially says - this is not a high quality submission because readers might not actually read it

I'd argue that on average, most readers won't have a good enough understanding, or read the paper far enough, to understand that the reality is closer to "it's not a breakthrough" rather than "Transformers with Unlimited Length Input".

So, I wholeheartedly welcome this type of hype-breaking leading comment.


Agreed 100%. Not only do I appreciate "well actually" comments, I think they are the single most useful aspect of forum discussions.

The headline will always be "BATTERY BREAKTHROUGH PROMISES TO ROCKET ELON MUSK TESLA TO THE MOON!!!" and while it's easy to know that some amount of cold water is necessary you need to spend a nontrivial amount of attention and have a nontrivial amount of knowledge to figure out just how much cold water. It's a useful thing to outsource. Did a research group see outperformance in an experiment with 1% probability of translating into production? Or is CATL scaling up a production process? The "well actually" comment will contextualize for you. If there's a "well actually" reply to the "well actually" comment, that tells you something too. Upvotes/downvotes dial in the distributed consensus.

It's far from perfect, but I'd challenge detractors to point to a more effective method for large-scale democratic truth seeking.


well actually (sorry, couldn't resist) the refrain of "correlation is not causation" despite not having read beyond the headline, or "...in mice" when that's mentioned in the abstract, is pretty frustrating when that's the entire substance of the comment.

it seems some technical and social countermeasures could be deployed so there was at least token visiting of a link before commenting was allowed, which could raise discourse in this forum, at least. It's a false dichotomy to only consider the two extremes of Peer Reviewed and HN reviewed. In particular, as mentioned up thread, the incentive to do peer review (or replicate an experiment, for that matter) isn't as high as working on your own research, attempting to do some novel work for a shot at a Nobel prize.

as every coder who's been the reviewer on a code review knows, it's difficult work, and often not well prioritized among the other priorities a senior IC might have, leading to poor quality reviewing or other work slipping (only so many hours in a week). Thus, one could imagine a system that gives direct compensation for review work to grad and smart undergrad students who work in the area being discussed, vetting and/or contextualizing claims like "this work is/is not novel", rather than hoping someone who does work in that area is procrastinating on HN at just the right time to make the claim and the rebuttal and the rebuttal-rebuttal. If the rebuttal is posted after the thread falls off the front page, is anyone not in that thread even going to know that what they read was wrong?


It's possible to approve of the "hype-breaking" (aka TLDRing / ELI5ing so that HN comment readers can understand the degree to which it's interesting for those of us not close enough to the field to understand that for ourselves) without agreeing that that same comment should also complain that preprints shouldn't be submitted to / upvoted on HN.

That's how I feel, anyway. I'd rather have seen a comment that has the same explanations in it but just generally less grumpy! Saying stuff like "It's not really that novel." doesn't really contribute much, when it could either be explained why it isn't novel by explaining how similar it is to something earlier that can be referenced, or thinking about what if anything is novel in this research - assuming it isn't being accused of just replicating something already done.


It doesn't have to be someone's fault for it not to be a well-suited submission.


> This idea is quite similar to retrieval transformers and Hopfield networks which have been known and published for several years now. It's not really that novel.

Is it? I had thought retrieval transformers "merely" used retrieval as a backend of sorts rather than a substitute for the attention itself?


Yeah, RETRO [0] embeds an entire question/prompt, searches for similar text passages with k-NN, then does further processing. This can kind of be understood as attention on paragraphs. This preprint instead does k-NN and calls it attention on single tokens. So not the same. But similar.

[0] https://jalammar.github.io/illustrated-retrieval-transformer...


Ah, I see - thanks for the clarification.


RETRO doesn't attend itself, which is a big difference


Honestly, these complaints (other than 4) apply to the vast majority of papers. #4 is just false: it has already been viewed by other lab members (peers), and open publication is itself a form of peer review. The "peer review system" (publishing to conferences/journals) is relatively new and I think ML demonstrates all the problems with the system (yay hype).

Novelty is especially a joke. ViTs are "just" NLP encoding transformers. T2I models are "just" NLP models connected to generative models. Diffusion models are "just" whitening models. GPT3 is just GPT2 with more layers and more data which is just GPT with more layers and more data. We can go even deeper if we pull from math and physics works. But that doesn't mean these works haven't been highly fruitful and useful. I'm happy all of these have been published.

> because of the hype around LLMs

I too hate the hype, but it is often bimodal. There are people who are far too critical and people who are far too accepting. The harm is not preprints or people reading papers, the harm is people who have no business/qualifications evaluating works confidently spouting out critiques. It is people not understanding that researchers are just critical of one another's work by default and that doesn't mean it shouldn't have been published.

It is well known that reviewers are good at identifying bad papers but not good at identifying good papers [0,1]. Which, let's be honest, means reviewers just have high reject rates in a noisy system, making publication a highly noisy metric for merit at best.

As for the paper:

Many LLMs and large models are using attention approximations, nor is the kNN technique particularly new. My main complaint is the lack of comparisons for Figures 3 and 4, but I'm not an NLP person, so I don't even know if there's other good work that would compare better (BART is a common baseline). And generative models are (a fact that unfortunately isn't widely appreciated) extremely difficult to evaluate. The paper seems fine to me; it is useful to the community. I don't like the name either, but their input is limited by computer memory, not the model; I would want to see more on this. Not being an NLP person, all I can say is that this looks neither like a strong reject nor a strong accept. I'll leave it to the community to determine whether they want more experiments for the conference publication, but the work seems useful.

[0] https://inverseprobability.com/talks/notes/the-neurips-exper...

[1] https://arxiv.org/abs/2109.09774


I've read the paper quickly; the main idea is simple and interesting, but maybe a little dubious (it's kind of an accuracy-for-memory trade-off).

In the transformer architecture one has to compute QK^T.

QK^T = (h_d * W_q * W_k^T) * h_e^T (equation (2), page 3 in the paper),

where h_d is the hidden state of the decoder, h_e is the hidden state of the encoder, W_q and W_k are parameter matrices, and ^T denotes the transposition operation.

By grouping the calculation this way, in a transformer encoder-decoder architecture, they can build and use only a single index (you index the h_e vectors using a vector database) for all the decoder layers' queries, instead of having to build 2 * L * H indices (with L the number of layers of the decoder and H the number of heads in the decoder).
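
If it helps, here's a toy version of that regrouping as I understand it (my own sketch with made-up shapes, not the authors' code):

    import numpy as np
    rng = np.random.default_rng(0)

    d_model, d_head, n_enc = 512, 64, 10_000
    h_e = rng.normal(size=(n_enc, d_model))    # encoder hidden states, indexed once

    # One head's projections (normally there are 2 * L * H such matrices).
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))

    h_d = rng.normal(size=d_model)             # one decoder hidden state

    # Naive grouping: project every encoder state with this head's W_k,
    # which would force a separate index per head and per layer.
    scores_naive = (h_d @ W_q) @ (h_e @ W_k).T

    # Regrouped: fold W_q W_k^T into the query and search the shared h_e
    # index directly. Same scores, one index for all heads and layers.
    q_tilde = h_d @ W_q @ W_k.T                # query now lives in d_model space
    scores_regrouped = q_tilde @ h_e.T

    assert np.allclose(scores_naive, scores_regrouped)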

But what makes it a little dubious is that this transformation means you make your nearest-neighbor queries in a space of dimension "dimension of the hidden state" instead of "dimension of a head", which is H times smaller.

So if you had to build 2 * L * H indices each index would be H times smaller.

So you only gain a factor of 2 * L. But the trade-off is that you are doing a nearest-neighbor search in a higher dimension, where you are then subject to the curse of dimensionality (the higher the dimension, the more similar all points are to each other), whereas the whole point of the projections in a transformer is to lower the dimension so that the kNN search makes more sense. So to get the same accuracy, your nearest-neighbor search engine will have to work a lot harder.
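
That concentration effect is easy to check numerically; here's a quick toy demo with uniform random points (nothing to do with the paper, just the textbook phenomenon):

    import numpy as np
    rng = np.random.default_rng(0)

    for d in (8, 64, 512):
        x = rng.random((2_000, d))     # 2000 random points in [0,1]^d
        q = rng.random(d)              # a random query point
        dist = np.linalg.norm(x - q, axis=1)
        # The contrast between the nearest and farthest neighbor shrinks with d.
        print(f"d = {d:3d}   farthest/nearest distance ratio: {dist.max() / dist.min():.2f}")

Absolute distances grow with d, but the relative contrast that a kNN search relies on shrinks, which is the sense in which "all points look similar".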

Also, as an approximation of the transformer that relies on kNN search, it comes with the problems associated with that (for example, harder to train because it's sparser, and a tendency to hyperfocus), but it can be complemented with a low-rank linearization of the attention so the neural net also acts on the gist rather than only the closest neighbors.


This is a nitpick, and it's been a few years since I took academic ML and linear algebra courses, but regarding this part:

> So you only gain a factor 2 * L. But the trade-off is that you are doing a near neighbor search in higher dimension where you are then subjected to the curse of dimensionality (the higher the dimension the more similar all points are to each other).

I thought that the curse of dimensionality meant that in higher dimension, points got farther apart



This technique can be added on to any encoder–decoder Transformer model post-training, so the added training difficulties you mention don't apply. It honestly is a very interesting approach to me – the main issue I see (which they discuss in the paper) is in pure latency. If you're using a large enough vector database, it will be on the CPU, and transferring hidden states from GPU to CPU and then the embeddings back from CPU to GPU is going to eat up a ton of time.


As I understand it, the approach here is to use an approximate nearest-neighbor database to retrieve highly relevant tokens from across large documents using the existing attention heads. So each attention head retrieves context from the entire document. They say this can work without fine-tuning, but performance improves with it. This is apparently extending this piece of prior work, but they've managed to rearrange the linear algebra of attention so they only need one database for all attention heads across all layers of the model. I'm a bit confused how attention would work here for layers below the top, and a bit confused about how position is encoded for tokens across a long document like this.


I don't understand how this could work. Like if you select a small fixed number of tokens from a large document won't you necessarily lose a lot of important data?


When you read a book do you remember every word? Even every chapter?

You only need the important concepts, not individual words


I feel that's a bit different. When skimming or reading a book, even though we don't remember all the information, typically a lot of it depends on prior information, so we form a sparse knowledge graph from the entire text.

If I'm understanding the article, this approach would not use the skipped words to influence the output, so I think it's necessarily different.


I think infiniformer would've sounded better. The bench scores seem pretty marginal.


Pretty marginal score gains once a week is all you need.


Only so long as a) the gains are real, and not overfitting the test dataset, and b) you don't balloon in complexity, so that stacking approaches becomes impossible to manage.

Point (a) is extremely hard to discern, especially when people are chasing third-significant-digit gains on common benchmarks; it's essentially multiple-testing false discovery in action. I've seen whole families of methods fail to transfer to new domains...

Point (b) is also a real issue. As you increase the number of bells and whistles, each with their own hyperparameters with non-linear impacts on model quality, it becomes impossible to say what's working or not.

In practice, I think we see some cycles of baroque incremental improvements, followed by someone spending a year stripping away the bullshit and getting something simple that outperforms the pack, essentially because it's easier to do hyperparam search over simpler models once you figure out the bits that actually matter.


What does it mean for ChatGPT and the like? Can they employ this method to virtually get rid of the context token limit?


Yes it looks like it can use this method. This method is a preprocessor and post-processor that can be used on an existing GPT model to augment it to handle unlimited tokens.


And that makes it pretty notable compared to all the linear attention/retrieval schemes that didn't pan out. Not saying this will pan out, but we'll know more without waiting six months for the model to train.


In the age of transformers, let's ask a transformer to summarize this paper:

The Unlimiformer paper is about a new way to make computer programs that can summarize really long pieces of text. Normally, when you ask a computer program to summarize something, it can only handle a certain amount of text at once. But with Unlimiformer, the program can handle as much text as you want!

The way Unlimiformer works is by using a special technique called a "k-nearest-neighbor index" to help the program pay attention to the most important parts of the text. This makes it possible for the program to summarize even really long documents without losing important information.

Overall, Unlimiformer is an exciting new development in natural language processing that could make it easier for computers to understand and summarize large amounts of text.


Said transformer as it handled the article's length anyway: sensible chuckle


Is this how Kagi's "universal summarizer" works? They wrote a lot of copy about how it's able to summarize websites and documents of arbitrary length, while not revealing how on Earth this actually works. It does seem to work, though.


Could that not just be some kind of langchain like system?


An alternative which I've used with some success is structured state space models: https://srush.github.io/annotated-s4/. A very different approach that works well for quite a few types of problems.


This seems like a definite attention optimization but I think the fundamental problem with attention is that it doesn’t handle state in a way that scales well.

Personally I think the RNN/LSTM state handling approach is going to be something we revisit when trying to advance past transformers. It handles state in a way that generalizes and scales better (it should in theory learn an attention-like mechanism anyway, and state is independent of input size).

It may be harder to train, and require further improvements, but it really seems more like an engineering or cost problem than a theoretical one. But I’m only an amateur and not an expert. Maybe continued improvement on attention will approach generalized state handling in a way that efficiently trains better than improvements on more generalized stateful approaches improve training.


Btw, why do transformers have a limited input size in the first place? I'm pretty sure the self-attention mechanism scales (although with bad complexity) to arbitrary sizes.


>(although with bad complexity)

Because of exactly that.

Also the attention mechanism is baked in during pretraining. So whatever max context length you want increases the compute cost of training by at least a function of said "bad complexity." Even just 4096 tokens of max context is much more expensive to train than 2048. So if we want models with 8k, 32k, or more context then the training costs get out of hand quickly.
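
Back-of-the-envelope, counting only the attention-score work and holding everything else fixed (a crude sketch with made-up model sizes, just to show the shape of the scaling):

    # Rough per-token FLOPs for just the QK^T and attention*V matmuls,
    # ignoring feed-forward blocks and projections.
    def attn_flops_per_token(n_ctx, d_model=4_096, n_layers=32):
        return 4 * n_ctx * d_model * n_layers   # ~2 matmuls, 2 FLOPs per MAC

    base = attn_flops_per_token(2_048)
    for n_ctx in (2_048, 4_096, 32_768):
        print(f"{n_ctx:>6} ctx: {attn_flops_per_token(n_ctx) / base:.0f}x attention cost per token")

And that's per training token; per sequence the score matrix itself also grows quadratically, which is what hits memory.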


> Also the attention mechanism is baked in during pretraining

IIUC, this is no longer necessarily true with positional encodings like ALiBi: https://github.com/ofirpress/attention_with_linear_biases
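
For the curious, the core trick in ALiBi is small enough to sketch (my paraphrase, not the linked repo's actual code): no position embeddings at all, just a fixed linear penalty on attention scores that grows with token distance, with a different slope per head.

    import numpy as np

    def alibi_bias(n_heads, n_ctx):
        # Head-specific slopes: a geometric sequence, e.g. 1/2, 1/4, ..., 1/256
        # for 8 heads (as in the ALiBi paper).
        slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        rel = np.arange(n_ctx)[None, :] - np.arange(n_ctx)[:, None]   # j - i
        rel = np.minimum(rel, 0)             # future positions get 0 here;
                                             # the usual causal mask handles them
        return slopes[:, None, None] * rel   # (n_heads, n_ctx, n_ctx), all <= 0

    bias = alibi_bias(n_heads=8, n_ctx=16)
    # scores = q @ k.T / np.sqrt(d_head) + bias[h]   # then softmax as usual

Because nothing positional is learned, you can in principle evaluate at longer contexts than you trained on, which is the sense in which the max length isn't "baked in".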


Given that model performance is thus affected by k-nearest-neighbor search, and those algorithms are proving not great for baseline vector search, how well will this actually work?

It seems mostly like a vertically integrated vector DB + existing LLM call, but correct me if I'm wrong. There are of course some performance gains with that, but the holy grail of "understanding" at unlimited length still seems unsolved.


Isn't the performance (as in the capacity of retrieval, not performance as compute/memory usage) of kNN mostly given by the quality of the vectors/embeddings themselves?

Most vector DBs use (at least) some kind of KNN anyways.



While I appreciate your intent and effort - I don't think it's actually useful to link to other submissions unless either they have comments (ideally only if there's at least one interesting comment, but at least more than no comments at all), or if it's a submission of the same subject but to a different source link - in which case it's probably more useful to just link the alternative source, if it's worth reading, rather than potentially split the discussion into separate comment threads if the other is empty.

Linking to a different submission of the same link with 0 comments doesn't add anything.


I must have submitted it at the wrong time of day.


Sure or just random luck, maybe this submission just happened to take place when the only few people who care about this subject happened to come online, or vice versa for bad luck before etc.

But unlike sites like Reddit, with the exception of self / ask HN / etc posts, nobody really pays attention to who the submitter is, so enjoy the conversation finally breaking out on it as consolation for not getting karma points, but skip linking to dead submissions :)

FYI, if you ever submit something that fails to get any traction / upvotes, then I've seen mods say before (@dang will hopefully correct me if I'm wrong) that a) it's OK to try submitting a second time maybe after a day or so (but not keep submitting over and over) or b) send the mods an email with a brief reason why it's a link that should interest HN readers for it to be potentially added to a "second chance pool". Though in the case of this link, between three of you it was posted two days ago, one day ago, and today which has finally got a bit more notice, so worked out alright in the end :)


Hang on, how unlimited is unlimited here? Surely the immediate thing you'd do with this is just never delete any prior inputs, so it becomes de facto long-term memory for the model?


Last paragraph touches on that:

The length of inputs is theoretically bounded by the memory limitations of the computer used. More practically, using a CPU datastore is many times slower than a GPU datastore because of slower search and the need to transfer retrieved embeddings to the GPU... (continues)


The limit is RAM but GPU RAM is much faster than computer RAM.


Is that really the limit? There is no real restriction that everything is in memory at the same time, right? You could maybe stream from SSD?


Create a swapfile and you essentially trade disk space for memory space.


The ML community keeps rediscovering the work of Steve Grossberg. This is very similar to his decades-old ART model.


Could you explain in simple terms (if possible) what is the similarity? For context, I worked for a brief time with ART before Y2K in BU CNS, and took a few courses there, but had to leave it for 'reasons'.


Some ART concepts

1) Normalize input (batch norm, 2015)

2) Competitive dynamics / lateral inhibition (softmax in attention layers, 2017)

3) Cluster best matching activation vectors (top-k keys, 2023)


Input should be the Internet then.


Pricing: $0.1 per nano token.


Neubig is the real deal. I’d take this paper seriously.


Appendix C of the PDF has an example book summary of The Brothers Karamazov, with noted minor factual errors

https://arxiv.org/pdf/2305.01625.pdf

> Unlimiformer summary:

> The first part of the novel focuses on the question of whether or not the Russian nobleman, Dmitri Fyodorovitch, has killed his father. In the town of Ivanovna, the lieutenant-colonel of the Mushenkhanovitch is accused of the murder of his brother Ivanovitch. The lieutenant-incommand, Vasilyevitch, takes the form of a dog, and the two men–the two men and the woman who are questioned by the court-martial–murphy. The two men cry out to the God of Russia for help in their quest to save the town. The man, afraid of the wrath of the God, hands the dog a bunch of letters that are supposed to be proof of his love for his brother. The old man–the one who had killed his mother, and then found the letter–arrives. He reads it–asked the old man to forgive him for the murder and then takes the dog away. The other men, all of whom are prisoners, demand that the man confess his crime to the court. The first and most important thing they tell the court is that they love the man. The court acquits the man and sentences the man to death. The second man–an old officer of the town, Alekandrovitch–askes to tell them the same thing. The third man–in the process of confessing his crime–is Vashenka, a drunk man who has been sent to the town to kill his father, for reasons which are not entirely clear to the people. The woman’s servant, Evgenyevna, is also the one who has told the court the story of the Medvedevitch’s murder, for the good old man’s and the young man’s love. The three men, who are separated for the first time, are laughing at the man’s attempt to seduce Mitya. The young man, in the meantime, is conscripted into the town-side. He tells the court that he loves her, but he has yet to tell her the true story. The men, in this room, demand a man to kill her, and she will not betray them. The women, in their own country, are rebelling against the man who had sent them three thousand roubles, and they will not allow the man of the people to see them. They will not let the man in the town be allowed to see the man–or Dmitriovitch; he will have her husband killed him. He will not tell the people who love him. The next man, named Vashenovitch, arrives, and takes the man away. They all begin to laugh at the fact that he has succeeded in seducing and entrusting his brother Dmitri. He is then taken away to the old woman’s house, where the governor-side-of-the-world, and his sister, Arkadin, is being punished. The priestesses and the baron are shocked, for they have been so virtuous and well-suited. The only thing they will be able to do is kill the priest. They threaten to burn the priestess to death, for she has been so wicked and libidinous that she has not yet seen the priest, for her husband. The priests–ostensibly convinced that she is a woman who loves the priest and has been punished for her love and for allowing the priest to marry her. The last man, Yakivitch, arrives at the house, and, after a long day of drinking and then some of the men–is killed. He and the priest are ordered to leave the town so that the priest can finally be reunited with the people of the old lady. The final man, the commander of the St. Petersburg town of Arkadina, is sentenced to death for the crime of having killed and then the lieutenant of the governor, for taking the money. The commander, the former lieutenant-delegation of the People’s Army, is summarily executed, and all the men, except for the commander, have been summarily punished for their crime. 
The entire town is shocked and, in a very dramatic way, the priestesses plead for the forgiveness of the man, for allowing them to kill and imprison Ivan. They plead for their brother to be restored as well, for all the people they have loved, and for the priestor to tell the story


That summary hardly inspires confidence, it's awful.


That's because their model sucks (very old, not SOTA) not because the idea in this paper doesn't work.


Just like the book, that summary was too long; didn’t read.


Sounds like your context window is too short.


because internet?


The attention mechanism corresponds to the Hopf algebraic convolution, a generalization of the commonly known convolution.

I'm in the process of implementing a framework based on this idea.

I have written a paper on this recently, https://arxiv.org/abs/2302.01834

I have a discord channel https://discord.cofunctional.ai.


You never just work on something until it's ready to be shared, and then share it once? It has to be shared before it's even a little bit usable, with just some vague words about what it might be?


I'm gauging interest and looking for potential users. Steve Blank and all that.


The first step to crossing the chasm is finding those innovators and learning if you are solving a problem!


I have and I am. Next.



