The problem with those numbers is that you hit the model's internal limit before you ever use all those tokens. There's a limit to how many rules or factors its conditional probability model can keep track of, and once you hit that, having a bigger context window doesn't matter.
That's insane. The highest I've personally seen in the open-source space is RWKV, which was trained on (IIRC) a 4k context but can handle much longer contexts in practice because it's an RNN (you can simply keep feeding it tokens forever and ever). It doesn't generalize infinitely by any means, but it can be stretched for sure, sometimes up to 16k.
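Not RWKV's actual recurrence, but a toy sketch of why an RNN has no hard architectural context limit (the GRU cell and dimensions here are placeholders I picked): the model's memory is a fixed-size state vector updated once per token, so you can keep stepping it indefinitely, and nothing grows with sequence length except whatever the state manages to retain.

```python
import torch
import torch.nn as nn

# Toy recurrent reader: a fixed-size hidden state updated one token at a time.
# (Illustrative only; RWKV's real recurrence is different, but the point stands:
# nothing in the architecture caps how many tokens you can feed it.)
embed = nn.Embedding(num_embeddings=50_000, embedding_dim=256)
cell = nn.GRUCell(input_size=256, hidden_size=256)

state = torch.zeros(1, 256)                              # constant-size memory, whatever the length
for token_id in torch.randint(0, 50_000, (20_000,)):     # 20k tokens: far past any "training length"
    state = cell(embed(token_id).unsqueeze(0), state)
# `state` now summarizes the whole stream; early tokens survive only to the
# extent the recurrence chose to keep them around, which is the fading described next.
```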
It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it. But it's quite interesting nonetheless.
> It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it.
That's a well-known limitation. But if you actually know that a "context" comprises multiple sentences (or other syntactic units) and that any ordering among them is completely arbitrary, the principled approach is to RNN-parse them all in parallel and sum the resulting activation vectors, like in a bag-of-words model, essentially enforcing commutativity on the network: that's pretty much how attention-based models work under the hood. The really basic intuition is that a commutative and associative function can be expressed (hence "learned") as a vector sum, modulo some arbitrary transformation of the inputs and outputs.
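A rough sketch of that parallel-parse-and-sum idea, under my own toy assumptions (the GRU encoder and dimensions are placeholders, not any specific published model): each sentence is encoded by the same RNN independently, and the per-sentence vectors are summed, so the result is invariant to the order in which the sentences arrive, which is exactly the commutativity being enforced.

```python
import torch
import torch.nn as nn

class BagOfSentences(nn.Module):
    """Order-invariant encoder: encode each sentence separately, sum the results."""
    def __init__(self, vocab=50_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, sentences):                 # list of 1-D LongTensors
        finals = []
        for toks in sentences:                    # each sentence parsed independently
            _, h = self.rnn(self.embed(toks).unsqueeze(0))
            finals.append(h[-1, 0])               # final hidden state as the sentence vector
        return torch.stack(finals).sum(dim=0)     # summation => commutative and associative

enc = BagOfSentences()
a = torch.randint(0, 50_000, (12,))
b = torch.randint(0, 50_000, (7,))
# Same pooled vector whichever order the sentences are fed in:
assert torch.allclose(enc([a, b]), enc([b, a]))
```

The sum is what buys the order-invariance; the "arbitrary transformation of the inputs and outputs" corresponds to the encoder before the sum and whatever network consumes the pooled vector afterwards.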
The numbers are high, but whether 8k is low depends on your use case. Do you want to process whole book chapters, or feed lots of related documents at the same time? If not, and you're just doing a normal question/answer session with some priming prompt, 8k is already a lot.
To be fair, I think the ability of these models to actually use context beyond the standard 8k / 16k tokens is pretty weak. RAG-based methods are probably a better option for these ultra-long contexts.
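For what it's worth, the bare-bones version of that RAG idea looks something like this (the embedding model name, chunking, and top-k are just illustrative placeholders, not a recommendation): embed the chunks once, embed the question, pull the few most similar chunks, and put only those into the prompt instead of the whole corpus.

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # any embedding model works here

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder choice

documents = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]   # pre-split chunks
doc_vecs = model.encode(documents, normalize_embeddings=True)

def retrieve(question, k=2):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                                 # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

context = "\n\n".join(retrieve("What does the corpus say about context length?"))
prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: ..."
# `prompt` now fits comfortably in a 4k/8k window, however large the corpus is.
```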
Most 4K models can be extended to 8K reasonably well with context window extension, but you're starting to see 16K, 32K, and 128K tunes (see YaRN, for example) become more common, and there's even a 200K version of Yi-34B.
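Roughly what these extension tunes do, shown in its simplest form (this is plain linear position interpolation; YaRN is a more refined, per-frequency variant of the same idea, so treat this as a sketch rather than YaRN itself): the rotary position indices get rescaled so that positions beyond the original training length map back into the range the model already saw.

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    """Rotary embedding angles; `scale` > 1 squeezes long positions back into
    the range seen during training (linear position interpolation; YaRN and
    NTK-aware methods rescale per frequency band instead of uniformly)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions / scale, inv_freq)          # shape: (len(positions), dim/2)

train_ctx, target_ctx = 4096, 16384
angles = rope_angles(np.arange(target_ctx), scale=target_ctx / train_ctx)
# Position 16383 now gets roughly the angles position 4095 had originally,
# so the attention math stays inside ranges the model was trained on.
```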
YaRN is to blame for making llama.cpp misbehave if you accidentally zero-initialize the llama_context_params structure rather than calling llama_context_default_params :)
Is this supposed to be low? All the chat models I've used top out at 4096.