Interesting architecture. For these "large" models, I'm interested in synthesis, fluidity, conceptual flexibility.
A sample prompt: "Tell me a love story about two otters, rendered in the FORTH language".
Or: "Here's a whitepaper, write me a simulator in python that lets me see the state of these variables, step by step".
Or: "Here's a tarball of a program. Write a module that does X, in a unified diff."
These are super hard tasks for any LLM I have access to, BTW. Good for testing current edges of capacity.
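For concreteness, the second prompt is essentially asking for something like this minimal sketch: a toy system with a couple of state variables (the variable names here are invented, not from any particular whitepaper) advanced one step at a time so you can watch the state evolve.

```python
# Toy "step-through simulator" sketch: advance a small state dict one
# step at a time and print it, so each intermediate state is visible.

def step(state):
    """Advance the toy system by one step (illustrative dynamics only)."""
    return {
        "t": state["t"] + 1,
        "x": state["x"] + state["v"],  # position advances by velocity
        "v": state["v"] * 0.9,         # velocity decays each step
    }

state = {"t": 0, "x": 0.0, "v": 1.0}
for _ in range(3):
    state = step(state)
    print(state)
```

A real answer would of course have to extract the state variables and update rules from the paper itself; the hard part is that translation, not the stepping loop.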
Arctic does not do great on these, unfortunately. It's not willing to make 'the leap' to be creative in FORTH where creativity = storytelling, and tries to redirect me to either getting a story about otters, or telling me things about FORTH.
Google made a big deal about emergent sophistication as models grew in parameter count in the original PaLM paper, and I wonder if these horizontally-scaled MoE architectures built from many small experts are somehow architecturally limited. The model weights here, 480B, are close in size to the original PaLM model (540B, if I recall).
Anyway, more and varied architectures are always welcome! I'd be interested to hear from the Snowflake folks if they think the architecture has additional capacity with more training, or if they think it could improve on recall tasks, but not 'sophistication' type tasks.
What you're evaluating is not what you think it is. You're evaluating the model's ability to execute multiple complex steps (think about all of the steps your second example requires), not so much whether it is capable of doing those things at all. If you broke it down into 2-3 different prompts, it could do all of those things easily.
BTW, I wouldn't rate that output very highly, in that it puts out syntactically valid FORTH but doesn't define words (or other constructs) which themselves tell the story.
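For illustration, here's roughly what "words which themselves tell the story" could look like, as a sketch (gforth-style; the word names are invented):

```forth
\ The story lives in the word definitions, not in comments around them.
: otter   ( -- ) ." otter " ;
: meets   ( -- ) ." meets " ;
: splash  ( -- ) ." *splash* " ;
: love-story ( -- ) otter meets otter splash ." and they swim off together" cr ;
love-story
```

The point is that a creative answer would use FORTH's defining mechanism as the narrative device, rather than printing a story from inside one monolithic word.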