The problem with those numbers is that you hit the model's internal limit before you ever use all those tokens. There's a limit to how many rules or factors its conditional probability model can keep track of, and once you hit that, having a bigger context window doesn't matter.
That's insane. The highest I've personally seen in the open-source space is RWKV, which was trained on (IIRC) a 4k context but can handle much longer contexts in practice because it's an RNN (you can simply keep feeding it tokens forever and ever). It doesn't generalize infinitely by any means, but it can be stretched for sure, sometimes up to 16k.
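Not RWKV's actual recurrence, but a toy sketch of why an RNN has no hard architectural context limit (the GRU cell and dimensions here are placeholders I picked): the model's memory is a fixed-size state vector updated once per token, so you can keep stepping it indefinitely, and nothing grows with sequence length except whatever the state manages to retain.

```python
import torch
import torch.nn as nn

# Toy recurrent reader: a fixed-size hidden state updated one token at a time.
# (Illustrative only; RWKV's real recurrence is different, but the point stands:
# nothing in the architecture caps how many tokens you can feed it.)
embed = nn.Embedding(num_embeddings=50_000, embedding_dim=256)
cell = nn.GRUCell(input_size=256, hidden_size=256)

state = torch.zeros(1, 256)                              # constant-size memory, whatever the length
for token_id in torch.randint(0, 50_000, (20_000,)):     # 20k tokens: far past any "training length"
    state = cell(embed(token_id).unsqueeze(0), state)
# `state` now summarizes the whole stream; early tokens survive only to the
# extent the recurrence chose to keep them around, which is the fading described next.
```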
It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it. But it's quite interesting nonetheless.
> It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it.
That's a well-known limitation. But if you actually know that a "context" comprises multiple sentences (or other syntactic units) and that any ordering among them is completely arbitrary, the principled approach is to RNN-parse them all in parallel and sum the resulting activation vectors, like in a bag-of-words model, essentially enforcing commutativity on the network: that's pretty much how attention-based models work under the hood. The really basic intuition is that a commutative and associative function can be expressed (hence "learned") as a vector sum, modulo some arbitrary transformation of the inputs and outputs.
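A rough sketch of that parallel-parse-and-sum idea, under my own toy assumptions (the GRU encoder and dimensions are placeholders, not any specific published model): each sentence is encoded by the same RNN independently, and the per-sentence vectors are summed, so the result is invariant to the order in which the sentences arrive, which is exactly the commutativity being enforced.

```python
import torch
import torch.nn as nn

class BagOfSentences(nn.Module):
    """Order-invariant encoder: encode each sentence separately, sum the results."""
    def __init__(self, vocab=50_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, sentences):                 # list of 1-D LongTensors
        finals = []
        for toks in sentences:                    # each sentence parsed independently
            _, h = self.rnn(self.embed(toks).unsqueeze(0))
            finals.append(h[-1, 0])               # final hidden state as the sentence vector
        return torch.stack(finals).sum(dim=0)     # summation => commutative and associative

enc = BagOfSentences()
a = torch.randint(0, 50_000, (12,))
b = torch.randint(0, 50_000, (7,))
# Same pooled vector whichever order the sentences are fed in:
assert torch.allclose(enc([a, b]), enc([b, a]))
```

The sum is what buys the order-invariance; the "arbitrary transformation of the inputs and outputs" corresponds to the encoder before the sum and whatever network consumes the pooled vector afterwards.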
The numbers are high, but whether 8k is low depends on your use case. Do you want to process whole book chapters, or feed lots of related documents at the same time? If not, and you're just doing a normal question/answer session with some priming prompt, 8k is already a lot.
To be fair, I think the ability of these models to actually use context beyond the standard 8k / 16k tokens is pretty weak. RAG-based methods are probably a better option for these ultra-long contexts.
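For what it's worth, the bare-bones version of that RAG idea looks something like this (the embedding model name, chunking, and top-k are just illustrative placeholders, not a recommendation): embed the chunks once, embed the question, pull the few most similar chunks, and put only those into the prompt instead of the whole corpus.

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # any embedding model works here

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder choice

documents = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]   # pre-split chunks
doc_vecs = model.encode(documents, normalize_embeddings=True)

def retrieve(question, k=2):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                                 # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

context = "\n\n".join(retrieve("What does the corpus say about context length?"))
prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: ..."
# `prompt` now fits comfortably in a 4k/8k window, however large the corpus is.
```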
Most 4K models can be extended to 8K reasonably well with context window extension, but you're starting to see 16K, 32K, and 128K tunes (see YaRN, for example) become more common, and there's even a 200K version of Yi-34B.
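Roughly what these extension tunes do, shown in its simplest form (this is plain linear position interpolation; YaRN is a more refined, per-frequency variant of the same idea, so treat this as a sketch rather than YaRN itself): the rotary position indices get rescaled so that positions beyond the original training length map back into the range the model already saw.

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    """Rotary embedding angles; `scale` > 1 squeezes long positions back into
    the range seen during training (linear position interpolation; YaRN and
    NTK-aware methods rescale per frequency band instead of uniformly)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions / scale, inv_freq)          # shape: (len(positions), dim/2)

train_ctx, target_ctx = 4096, 16384
angles = rope_angles(np.arange(target_ctx), scale=target_ctx / train_ctx)
# Position 16383 now gets roughly the angles position 4095 had originally,
# so the attention math stays inside ranges the model was trained on.
```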
YaRN is to blame for making llama.cpp misbehave if you accidentally zero-initialize the llama_context_params structure rather than calling llama_context_default_params :)
Is this supposed to be low? All the chat models I've used top out at 4096.