
> Only 8k context window

Is this supposed to be low? All the chat models I've used top out at 4096.




GPT-4-turbo is at 128k. Claude 2.1 is 200k. But yes, among open source models 8k is roughly middle to top of the pack.


The problem with those numbers is that they hit an internal limit before you use all those tokens. There's a limit to how many rules or factors their conditional probability model can keep track of; once you hit that, having a bigger context window doesn't matter.


That's insane. The highest I've personally seen in the open-source space is RWKV being trained on (IIRC) 4k but being able to handle much longer context lengths in practice due to being an RNN (you can simply keep feeding it tokens forever and ever). It doesn't generalize infinitely by any means but it can be stretched for sure, sometimes up to 16k.

It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it. But it's quite interesting nonetheless.
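
For intuition, here's a toy sketch (not RWKV's actual architecture, just the generic RNN shape, with made-up numbers) of why an RNN can keep accepting tokens: the state is a fixed-size vector updated in place, so memory use doesn't grow with context length, even though older tokens' influence decays as new ones are mixed in.

    #include <stdio.h>
    #include <math.h>

    #define STATE_DIM 4  /* toy size; real models use thousands of dims */

    /* Toy recurrence: mix the new token into the fixed-size state.
     * A real RNN uses learned weights, but the key point is the same:
     * the state never grows, no matter how many tokens you feed in,
     * while older tokens' influence fades over time. */
    static void rnn_step(float state[STATE_DIM], int token) {
        for (int d = 0; d < STATE_DIM; d++) {
            state[d] = tanhf(0.9f * state[d] + 0.1f * (float)((token + d) % 7));
        }
    }

    int main(void) {
        float state[STATE_DIM] = {0};
        /* Feed an arbitrarily long token stream; memory use stays constant. */
        for (int t = 0; t < 100000; t++) {
            rnn_step(state, t % 50000);
        }
        printf("state[0] after 100k tokens: %f\n", state[0]);
        return 0;
    }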


> It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it.

That's a well known limitation. But if you actually know that a "context" comprises multiple sentences (or other elements of syntax) and that any ordering among them is completely arbitrary, the principled approach is to RNN-parse them all in parallel and sum the activations you end up with as vectors - like in a bag-of-words model, essentially enforcing commutativity on the network: that's pretty much how attention-based models work under the hood. The really basic intuition is just that a commutative and associative function can be expressed (hence "learned") in terms of a vector sum, modulo some arbitrary conversion of the inputs and outputs.
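
A minimal sketch of that idea, with a made-up stand-in encoder (a real setup would take each sentence's final RNN state): encode the sentences independently, then sum the resulting vectors. Because vector addition is commutative and associative, the pooled result doesn't depend on sentence order.

    #include <stdio.h>
    #include <string.h>

    #define DIM 4  /* toy embedding size */

    /* Stand-in "encoder": folds a sentence into a small state vector.
     * (Hypothetical placeholder so the sketch runs; a real setup would
     * run each sentence through the RNN and take its final state.) */
    static void encode_sentence(const char *sentence, float state[DIM]) {
        memset(state, 0, sizeof(float) * DIM);
        for (size_t i = 0; sentence[i] != '\0'; i++) {
            state[i % DIM] += (float)sentence[i] / 128.0f;
        }
    }

    /* Order-invariant pooling: encode each sentence independently, then
     * sum the state vectors. Since addition is commutative and associative,
     * any permutation of the sentences gives the same pooled vector. */
    static void encode_unordered(const char **sentences, size_t n, float pooled[DIM]) {
        memset(pooled, 0, sizeof(float) * DIM);
        for (size_t i = 0; i < n; i++) {
            float state[DIM];
            encode_sentence(sentences[i], state);
            for (int d = 0; d < DIM; d++) pooled[d] += state[d];
        }
    }

    int main(void) {
        const char *order_a[] = {"the cat sat on the mat", "it was raining"};
        const char *order_b[] = {"it was raining", "the cat sat on the mat"};
        float pa[DIM], pb[DIM];
        encode_unordered(order_a, 2, pa);
        encode_unordered(order_b, 2, pb);
        for (int d = 0; d < DIM; d++)
            printf("%.4f  %.4f\n", pa[d], pb[d]);  /* columns match */
        return 0;
    }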


> That's a well known limitation.

I know. I did a lot of work on state handling in rwkv.cpp


The numbers are high, but whether 8k is low depends on your use case. Do you want to process whole book chapters, or feed lots of related documents at the same time? If not, and you're just doing a normal question/answer session with some priming prompt, 8k is already a lot.


8k is very little if you want to add almost any additional data in context, or have a more complicated prompt.

Otherwise your knowledge retrieval needs to be almost spot-on for the LLM to provide a proper reply.

Ditto for any multi-shot prompts.


To be fair, I think the ability of these models to actually use these contexts beyond the standard 8k / 16k tokens is pretty weak. RAG-based methods are probably a better option for these ultra-long contexts.


Needle-in-a-haystack testing on GPT-4's 128K context suggests otherwise: https://twitter.com/SteveMoraco/status/1727370446788530236


> I think the ability of these models to actually use these contexts beyond the standard 8k / 16k tokens is pretty weak.

For 32k GPT-4 contexts, that's not accurate. GPT-4 Turbo is a bit weaker than GPT-4-32k, but not to the extent that you claim.


Are you talking about Claude, or GPT-4 as well? Any specific examples where GPT-4 fails on long contexts?


Most 4K models can use context window extension to get to 8K reasonably, but you're starting to see 16K, 32K, 128K (see YaRN for example) tunes become more common, or even a 200K version of Yi-34B.
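
For anyone wondering what "context window extension" means mechanically: the simplest variant (linear position interpolation) scales the position index fed into the rotary embedding so a longer sequence maps back into the position range the model was trained on; YaRN is a more refined take on the same idea, scaling different frequency bands differently. A rough sketch with made-up sizes:

    #include <math.h>
    #include <stdio.h>

    /* Angle for one rotary-embedding frequency pair at position `pos`.
     * With linear position interpolation, positions are scaled by
     * train_ctx / target_ctx so that a 16K sequence maps into the 0..4K
     * position range the model was trained on. (YaRN refines this by
     * treating different frequency bands differently.) */
    static double rope_angle(int pos, int pair, int dim, double scale) {
        double theta = pow(10000.0, -2.0 * (double)pair / (double)dim);
        return ((double)pos * scale) * theta;
    }

    int main(void) {
        const int    train_ctx  = 4096;   /* e.g. a 4K-trained model   */
        const int    target_ctx = 16384;  /* stretched to 16K           */
        const double scale      = (double)train_ctx / (double)target_ctx; /* 0.25 */

        /* Position 12000 is out of range for the original model, but after
         * scaling it looks like position 3000, which the model has seen. */
        printf("unscaled angle: %f\n", rope_angle(12000, 0, 128, 1.0));
        printf("scaled angle:   %f\n", rope_angle(12000, 0, 128, scale));
        return 0;
    }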


> see YaRN for example

YaRN is to blame for making llama.cpp misbehave if you accidentally zero-initialize the llama_context_params structure rather than calling llama_context_default_params :)

(guess how I know...)
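
For anyone who hits the same thing, the difference looks roughly like this (a sketch against the late-2023 llama.cpp C API, only setting n_ctx; model loading is elided):

    #include "llama.h"

    int main(void) {
        /* Wrong: zero-initializing means every field, including the
         * RoPE/YaRN scaling parameters, comes out as 0 -- which llama.cpp
         * does not treat the same as "use the model's defaults". */
        struct llama_context_params zeroed = {0};
        (void)zeroed;  /* don't use this one */

        /* Right: start from the library defaults and override only what
         * you actually need. */
        struct llama_context_params params = llama_context_default_params();
        params.n_ctx = 8192;  /* e.g. ask for an 8K context */

        /* ... llama_load_model_from_file(...) and
         *     llama_new_context_with_model(model, params) go here ... */
        return 0;
    }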



