
We're nearing a point where we'll just need a prompt router in front of several specialised models (code, chat, math, SQL, health, etc.)... and we'll have a local Mixture of Experts kind of thing.

  1. Send request to router running a generic model.
  2. Prompt/question is deconstructed, classified, and proxied to expert(s) xyz.
  3. Responses come back and are assembled by generic model.
Is any project working on something similar to this?
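
For concreteness, here's a minimal sketch of those three steps, assuming an OpenAI-compatible local server (e.g. llama.cpp or Ollama); the model names and the classify-then-assemble prompts are purely illustrative:

  # Minimal router sketch: a generic model classifies, an expert answers,
  # and the generic model assembles the final reply. Model names are made up.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  GENERIC = "mistral-7b"
  EXPERTS = {"code": "codellama-7b", "math": "wizardmath-7b",
             "sql": "sqlcoder-7b", "chat": "mistral-7b"}

  def ask(model, prompt):
      resp = client.chat.completions.create(
          model=model, messages=[{"role": "user", "content": prompt}])
      return resp.choices[0].message.content

  def route(prompt):
      # 1. The generic model classifies the request into one expert label.
      label = ask(GENERIC, f"Classify this request as one of {list(EXPERTS)}. "
                           f"Reply with the label only.\n\n{prompt}").strip().lower()
      # 2. Proxy the prompt to the matching expert (fall back to chat).
      answer = ask(EXPERTS.get(label, EXPERTS["chat"]), prompt)
      # 3. The generic model assembles/polishes the final response.
      return ask(GENERIC, f"User asked: {prompt}\nExpert ({label}) answered: "
                          f"{answer}\nWrite the final reply to the user.")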



I also think this is the route we're heading toward: a few 1-7B or 14B parameter models that are very good at their tasks, stitched together with a model that's very good at delegating. Hugging Face has Transformers Agents, which "provides a natural language API on top of transformers: we define a set of curated tools and design an agent to interpret natural language and to use these tools".

Some of the tools it already has are:

Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)

Text question answering: given a long text and a question, answer the question in the text (Flan-T5)

Unconditional image captioning: Caption the image! (BLIP)

Image question answering: given an image, answer a question on this image (VILT)

Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)

Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)

Text to speech: convert text to speech (SpeechT5)

Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)

Text summarization: summarize a long text in one or a few sentences (BART)

Translation: translate the text into a given language (NLLB)

Text downloader: to download a text from a web URL

Text to image: generate an image according to a prompt, leveraging stable diffusion

Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion

Text to video: generate a small video according to a prompt, leveraging damo-vilab

It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.

https://huggingface.co/docs/transformers/transformers_agents
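
Rough usage sketch, going by those docs (the Starcoder inference endpoint is just one of the suggested backends, and the input text here is a placeholder):

  from transformers import HfAgent

  # One of the remote LLM backends suggested in the docs.
  agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

  long_text = "LLM routers send each request to the model best suited for it. " * 40

  # The agent interprets the request and chains the curated tools itself,
  # e.g. summarization (BART) followed by text-to-speech (SpeechT5).
  audio = agent.run("Summarize `text` and then read the summary out loud.",
                    text=long_text)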


I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)

There's also another, related sense in which we want routing across models for efficiency reasons in the local setting, even for tasks with the same input modalities:

First, attempt prediction on small(er) models, and if the constrained output does not have sufficiently high probability (with reliable calibration), route to progressively larger models. If the process is exhausted, kick it to a human for further adjudication/checking.
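
A hedged sketch of that cascade, where "confidence" is just the classifier's top score (a properly calibrated score would be preferable) and the checkpoints and thresholds are hypothetical:

  from transformers import pipeline

  # Hypothetical checkpoints fine-tuned for the same task, ordered small -> large,
  # each paired with the confidence threshold it must clear.
  CASCADE = [("my-org/task-model-small", 0.95),
             ("my-org/task-model-large", 0.80)]

  def predict_with_escalation(text):
      for model_name, threshold in CASCADE:
          clf = pipeline("text-classification", model=model_name)  # cache these in practice
          result = clf(text)[0]               # {"label": ..., "score": ...}
          if result["score"] >= threshold:
              return result["label"]
      return "NEEDS_HUMAN_REVIEW"             # cascade exhausted -> human adjudication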


It's kind of trivial today.

The first layer could be a mix of NLP and zero-shot classification to clarify the nature of the request. Then use an LLM to deconstruct the request into several specific parts that would be sent to specialized LLMs. Then stitch it back together at the end, again with an LLM as the summarization machine.
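
That first layer is a few lines with the stock zero-shot pipeline (facebook/bart-large-mnli); the deconstruction and summarization steps with the LLMs are left abstract here:

  from transformers import pipeline

  classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

  def classify_request(prompt):
      labels = ["code", "math", "sql", "health", "general chat"]
      out = classifier(prompt, candidate_labels=labels)
      return out["labels"][0]  # top label decides which specialized LLM gets the prompt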

The problem is that running so many LLMs in parallel means you need quite a lot of resources.


Yeah, it shouldn't be too difficult to build this with Python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.

> Problem is running so many LLMs in parallel means you need quite a bunch of resources.

Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)


Could LoRA fine-tunes be used instead of completely different models? I wonder if that would save space.


Yeah, that would save disk space! In terms of inference, you'd still need to hold multiple models in memory though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would probably be pretty slow.


https://github.com/predibase/lorax does this; it's not that slow, since LoRAs aren't usually very big.
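
The general idea (sketched here with PEFT rather than LoRAX's own API, and with made-up adapter repos) is to keep one base model resident and hot-swap the small adapters:

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
  tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

  # Attach one adapter, register another; the adapter repo names are hypothetical.
  model = PeftModel.from_pretrained(base, "my-org/mistral-7b-lora-sql", adapter_name="sql")
  model.load_adapter("my-org/mistral-7b-lora-math", adapter_name="math")

  model.set_adapter("math")  # switching adapters is cheap; the 7B base stays in memory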


With a fast NVMe drive, loading a model takes only 2-3 seconds.


I'm the LiteLLM maintainer; can you elaborate on what you're looking for us to do here?


Idk, a paper literally just came out showing that improved prompting of bigger general models is generally superior to specialized models.

https://arxiv.org/pdf/2311.16452.pdf


It was rumored a few months ago that this is how GPT-4 works: a controller model routing data to expert models, perhaps also by running all the experts and comparing probabilities. So far as I know that's just speculation based on a few details leaked on Xitter, though.


It does not explain why it's so expensive to run.


Yeah, check out LLaVA-Plus (what you're calling experts, they call "tools"): https://github.com/LLaVA-VL/LLaVA-Plus-Codebase


Yeah, I thought that's how GPT-4 works (I remember reading it somewhere): some 10-11 expert models in an ensemble.


Is that what an MoE is? I thought LLMs in an MoE talk to each other and come up with the response together.


Semantic Kernel is something like that.


For self-hosted setups, it seems likely that swapping out the fine-tuning LoRA on the fly is a better option.


This is how image generation works for DALL-E 3 via ChatGPT.


What's the best model for health right now?



