We're nearing a point where we'll just need a prompt router in front of several specialised models (code, chat, math, SQL, health, etc.)... and we'll have a local Mixture of Experts kind of thing.
1. Send the request to a router running a generic model.
2. The prompt/question is deconstructed, classified, and proxied to the relevant expert(s).
3. Responses come back and are assembled by the generic model.
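A minimal sketch of that loop, where the expert registry and the classify/assemble steps are hypothetical placeholders for real models:

```python
import re

# Hypothetical expert registry: each entry stands in for a specialised model.
EXPERTS = {
    "code": lambda q: f"[code expert answers: {q}]",
    "math": lambda q: f"[math expert answers: {q}]",
    "chat": lambda q: f"[chat expert answers: {q}]",
}

def classify(question: str) -> list[str]:
    # Step 2: the generic model would classify here; keyword matching is a placeholder.
    hits = [name for name in EXPERTS if re.search(name, question, re.I)]
    return hits or ["chat"]  # fall back to the general-chat expert

def route(question: str) -> str:
    # Step 2 (cont.): proxy the question to each matching expert.
    answers = [EXPERTS[name](question) for name in classify(question)]
    # Step 3: the generic model would assemble the responses; join is a placeholder.
    return "\n".join(answers)

print(route("Can you write code to check my math on this puzzle?"))
```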
Is any project working on something similar to this?
I also think this is the direction we're heading: a few 1-7B or 14B param models that are very good at their tasks, stitched together with a model that's very good at delegating. Huggingface has Transformers Agents, which "provides a natural language API on top of transformers: we define a set of curated tools and design an agent to interpret natural language and to use these tools"
Some of the tools it already has are:
Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)
Text question answering: given a long text and a question, answer the question in the text (Flan-T5)
Unconditional image captioning: Caption the image! (BLIP)
Image question answering: given an image, answer a question on this image (VILT)
Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)
Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)
Text to speech: convert text to speech (SpeechT5)
Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)
Text summarization: summarize a long text in one or a few sentences (BART)
Translation: translate the text into a given language (NLLB)
Text downloader: to download a text from a web URL
Text to image: generate an image according to a prompt, leveraging stable diffusion
Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion
Text to video: generate a small video according to a prompt, leveraging damo-vilab
It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.
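For a concrete feel, here's roughly what that looks like, based on the Transformers Agents examples (the StarCoder endpoint is the one from their docs; the toy custom tool is my own illustration, and the exact API may have changed since):

```python
from transformers import HfAgent, Tool

# Agent backed by a hosted LLM (endpoint from the Transformers Agents examples).
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

# The agent interprets the request, picks tools from the curated set
# (here: text downloader + summarizer), and chains them.
agent.run("Download the text at `url` and summarize it.",
          url="https://example.com/article")

# Custom tools subclass Tool; this word counter is a toy example.
class WordCountTool(Tool):
    name = "word_counter"
    description = "Counts the number of words in a piece of text."
    inputs = ["text"]
    outputs = ["text"]

    def __call__(self, text: str) -> str:
        return str(len(text.split()))

# Swap in your own tools (or models) via additional_tools.
agent = HfAgent(
    "https://api-inference.huggingface.co/models/bigcode/starcoder",
    additional_tools=[WordCountTool()],
)
```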
I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)
There's also another, related sense in which we want routing across models for efficiency reasons in the local setting, even for tasks with the same input modalities:
First, attempt prediction on small(er) models, and if the constrained output isn't sufficiently high-probability (under a well-calibrated confidence estimate), route to progressively larger models. If the cascade is exhausted, kick it to a human for further adjudication/checking.
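A minimal sketch of that cascade, assuming each model exposes a hypothetical generate function that returns an answer plus a calibrated confidence score:

```python
from typing import Callable, Optional

# Hypothetical interface: prompt -> (answer, calibrated confidence in [0, 1]).
# In practice the confidence would come from calibrated token probabilities.
Model = Callable[[str], tuple[str, float]]

def cascade(prompt: str,
            tiers: list[tuple[str, Model]],
            threshold: float = 0.9) -> Optional[str]:
    """Try models smallest-first; escalate while confidence stays below threshold."""
    for name, generate in tiers:  # tiers ordered smallest model first
        answer, confidence = generate(prompt)
        if confidence >= threshold:
            return answer  # good enough, stop escalating
    return None  # cascade exhausted: hand off to a human for adjudication
```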
The first layer could be a mix of classic NLP and zero-shot classification to clarify the nature of the request.
Then an LLM deconstructs the request into several specific parts that are sent to specialized LLMs.
Then it's stitched back together at the end, again with an LLM as the summarization machine.
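As a sketch of that first layer, here's a zero-shot classifier from transformers routing a request to candidate experts (the labels and threshold are illustrative, not from any real system):

```python
from transformers import pipeline

# First layer: zero-shot classification to clarify the nature of the request.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

request = "Write a SQL query joining orders to customers, then explain the join."
labels = ["code", "chat", "math", "sql", "health"]
result = classifier(request, candidate_labels=labels, multi_label=True)

# Route to every expert whose score clears a (made-up) threshold; an LLM
# would then deconstruct the request and summarize the experts' answers.
experts = [l for l, s in zip(result["labels"], result["scores"]) if s > 0.5]
print(experts)  # e.g. ['sql', 'code']
```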
The problem is that running so many LLMs in parallel means you need quite a lot of resources.
Yeah, it shouldn't be too difficult to build this with Python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.
> The problem is that running so many LLMs in parallel means you need quite a lot of resources.
Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)
Yeah, that would save disk space! In terms of inference, though, you'd still need to hold multiple models in memory, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would probably be pretty slow.
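A toy sketch of that unload/reload idea as an LRU cache, where load_model is a hypothetical stand-in for whatever loader you use:

```python
from collections import OrderedDict

class ModelCache:
    """Hold at most max_loaded models in memory, evicting the least
    recently used; reload latency is the price of each swap."""

    def __init__(self, load_model, max_loaded: int = 2):
        self.load_model = load_model  # hypothetical loader: name -> model
        self.max_loaded = max_loaded
        self._loaded = OrderedDict()

    def get(self, name: str):
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as recently used
        else:
            if len(self._loaded) >= self.max_loaded:
                self._loaded.popitem(last=False)  # evict the LRU model
            self._loaded[name] = self.load_model(name)  # slow reload happens here
        return self._loaded[name]
```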
It was rumored a few months ago that this is how GPT-4 works: a controller model routing data to expert models, perhaps also by running all the experts and comparing probabilities. As far as I know that's just speculation based on a few details leaked on Xitter, though.