
We're nearing a point where we'll just need a prompt router in front of several specialised models (code, chat, math, SQL, health, etc.)... and we'll have a local Mixture of Experts kind of thing.

  1. Send request to router running a generic model.
  2. Prompt/question is deconstructed, classified, and proxied to expert(s) xyz.
  3. Responses come back and are assembled by generic model.
Is any project working on something similar to this?
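
For concreteness, here's a minimal sketch of those three steps, assuming an OpenAI-compatible local server (e.g. llama.cpp or Ollama); the model names and the classify-then-assemble prompts are purely illustrative:

  # Minimal router sketch: a generic model classifies, an expert answers,
  # and the generic model assembles the final reply. Model names are made up.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  GENERIC = "mistral-7b"
  EXPERTS = {"code": "codellama-7b", "math": "wizardmath-7b",
             "sql": "sqlcoder-7b", "chat": "mistral-7b"}

  def ask(model, prompt):
      resp = client.chat.completions.create(
          model=model, messages=[{"role": "user", "content": prompt}])
      return resp.choices[0].message.content

  def route(prompt):
      # 1. The generic model classifies the request into one expert label.
      label = ask(GENERIC, f"Classify this request as one of {list(EXPERTS)}. "
                           f"Reply with the label only.\n\n{prompt}").strip().lower()
      # 2. Proxy the prompt to the matching expert (fall back to chat).
      answer = ask(EXPERTS.get(label, EXPERTS["chat"]), prompt)
      # 3. The generic model assembles/polishes the final response.
      return ask(GENERIC, f"User asked: {prompt}\nExpert ({label}) answered: "
                          f"{answer}\nWrite the final reply to the user.")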



I also think this is the route we're heading toward: a few 1-7B or 14B parameter models that are very good at their tasks, stitched together with a model that's very good at delegating. Hugging Face has Transformers Agents, which "provides a natural language API on top of transformers: we define a set of curated tools and design an agent to interpret natural language and to use these tools".

Some of the tools it already has are:

Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)

Text question answering: given a long text and a question, answer the question in the text (Flan-T5)

Unconditional image captioning: Caption the image! (BLIP)

Image question answering: given an image, answer a question on this image (VILT)

Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)

Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)

Text to speech: convert text to speech (SpeechT5)

Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)

Text summarization: summarize a long text in one or a few sentences (BART)

Translation: translate the text into a given language (NLLB)

Text downloader: to download a text from a web URL

Text to image: generate an image according to a prompt, leveraging stable diffusion

Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion

Text to video: generate a small video according to a prompt, leveraging damo-vilab

It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.

https://huggingface.co/docs/transformers/transformers_agents
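
Rough usage sketch, going by those docs (the Starcoder inference endpoint is just one of the suggested backends, and the input text here is a placeholder):

  from transformers import HfAgent

  # One of the remote LLM backends suggested in the docs.
  agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

  long_text = "LLM routers send each request to the model best suited for it. " * 40

  # The agent interprets the request and chains the curated tools itself,
  # e.g. summarization (BART) followed by text-to-speech (SpeechT5).
  audio = agent.run("Summarize `text` and then read the summary out loud.",
                    text=long_text)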


I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)

There's also another, related sense in which we want routing across models for efficiency reasons in the local setting, even for tasks with the same input modalities:

First, attempt prediction on small(er) models, and if the constrained output does not have sufficiently high probability (with reliable calibration), route to progressively larger models. If the process is exhausted, kick it to a human for further adjudication/checking.
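
A hedged sketch of that cascade, where "confidence" is just the classifier's top score (a properly calibrated score would be preferable) and the checkpoints and thresholds are hypothetical:

  from transformers import pipeline

  # Hypothetical checkpoints fine-tuned for the same task, ordered small -> large,
  # each paired with the confidence threshold it must clear.
  CASCADE = [("my-org/task-model-small", 0.95),
             ("my-org/task-model-large", 0.80)]

  def predict_with_escalation(text):
      for model_name, threshold in CASCADE:
          clf = pipeline("text-classification", model=model_name)  # cache these in practice
          result = clf(text)[0]               # {"label": ..., "score": ...}
          if result["score"] >= threshold:
              return result["label"]
      return "NEEDS_HUMAN_REVIEW"             # cascade exhausted -> human adjudication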


It's kind of trivial today.

The first layer could be a mix of NLP and zero-shot classification to clarify the nature of the request. Then use an LLM to deconstruct the request into several specific parts that would be sent to specialized LLMs. Then stitch it back together at the end, again with an LLM as the summarization machine.
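
That first layer is a few lines with the stock zero-shot pipeline (facebook/bart-large-mnli); the deconstruction and summarization steps with the LLMs are left abstract here:

  from transformers import pipeline

  classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

  def classify_request(prompt):
      labels = ["code", "math", "sql", "health", "general chat"]
      out = classifier(prompt, candidate_labels=labels)
      return out["labels"][0]  # top label decides which specialized LLM gets the prompt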

The problem is that running so many LLMs in parallel means you need quite a lot of resources.


Yeah, it shouldn't be too difficult to build this with Python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.

> Problem is running so many LLMs in parallel means you need quite a bunch of resources.

Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)


Could LoRA fine-tunes be used instead of completely different models? I wonder if that would save space.


Yeah, that would save disk space! In terms of inference, you'd still need to hold multiple models in memory though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would probably be pretty slow.


https://github.com/predibase/lorax does this; it's not that slow, since LoRAs aren't usually very big.
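
The general idea (sketched here with PEFT rather than LoRAX's own API, and with made-up adapter repos) is to keep one base model resident and hot-swap the small adapters:

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
  tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

  # Attach one adapter, register another; the adapter repo names are hypothetical.
  model = PeftModel.from_pretrained(base, "my-org/mistral-7b-lora-sql", adapter_name="sql")
  model.load_adapter("my-org/mistral-7b-lora-math", adapter_name="math")

  model.set_adapter("math")  # switching adapters is cheap; the 7B base stays in memory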


With a fast NVMe drive, loading a model takes only 2-3 seconds.


I'm the LiteLLM maintainer; can you elaborate on what you're looking for us to do here?


Idk, a paper literally just came out showing that improved prompting of bigger general models is generally superior to specialized models.

https://arxiv.org/pdf/2311.16452.pdf


It was rumored a few months ago that this is how GPT-4 works: a controller model routing data to expert models, perhaps also by running all the experts and comparing probabilities. So far as I know that's just speculation based on a few details leaked on Xitter, though.


It does not explain why it's so expensive to run.


Yeah, check out LLaVA-Plus (what you're calling experts, they call "tools"): https://github.com/LLaVA-VL/LLaVA-Plus-Codebase


Yeah, I thought that's how GPT-4 works (I remember reading it somewhere): some 10-11 expert models in an ensemble.


Is that what an MoE is? I thought LLMs in an MoE talk to each other and come up with the response together.


Semantic Kernel is something like that.


For self-hosted setups, it seems likely that swapping out the fine-tuning LoRA on the fly is a better option.


This is how image generation works for DALL-E 3 via ChatGPT.


What's the best model for health right now?



