Maybe others' experiences are different, but I find smaller models to work just as well for "reductive" tasks.
Dolly sucks for generating long-form content (not very creative), but if I need a summary or classification, it's quicker and easier to spin up dolly-3b than vicuna-13b.
I suspect OpenAI is routing prompts to select models based on similar logic.
There are various posts online on how to set it up, either for Linux or Windows. There was an older post here on how to install opt-65b on a Mac Studio Ultra, and smaller models on Mac Pros. There was also a post, if I remember correctly, about running vicuna-7b on an iPhone.
Side note: you need bonkers hardware to run it efficiently. I'm currently using a 16-core CPU, 128 GB of RAM, a PCIe 4.0 NVMe drive, and an RTX 3090. There are ways to run it on less powerful hardware (say 8 cores, 64 GB of RAM, a plain SSD, and an RTX 3080 or 3070), but I happen to have a large corpus of data to process, so I went all in.
I think the previous commenter is more interested in your experience with your large dataset: what are you doing with it?
I have similar hardware at home, so I wonder how reliably you can process simple queries that need domain knowledge + logic and that work on mlc-llm, something like: "if you can choose the word food, or the word laptop, or the word deodorant, which one do you choose for describing "macbook air"? answer precisely with just the word you chose".
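Roughly what I have in mind, in code (just a sketch: it assumes an OpenAI-compatible local endpoint like the one mlc_llm serve exposes on localhost:8000, and the model id is a placeholder):

    # Rough sketch: send a single-word classification prompt to a local
    # OpenAI-compatible endpoint (e.g. what "mlc_llm serve" exposes).
    # The URL and model id below are placeholders.
    import requests

    prompt = (
        'If you can choose the word "food", the word "laptop", or the word '
        '"deodorant", which one do you choose to describe "macbook air"? '
        'Answer precisely with just the word you chose.'
    )

    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "vicuna-7b",  # placeholder model id
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,    # deterministic; we want a single word back
            "max_tokens": 5,
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"].strip())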
If it works, can you upload the weights somewhere? IIRC, vicuna is open source.
If these problems are all very similar in structure, then you may not need an LLM. Simple GloVe or word2vec embeddings with a dot product may suffice. Then you can plow through a few terabytes in the time it takes the LLM to get through a fraction of that.
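For example, something like this (a rough sketch; it assumes you've downloaded the GloVe text vectors, and the file path is a placeholder):

    # Sketch: pick the closest label for a term with GloVe vectors and
    # cosine similarity. Assumes the GloVe text file has been downloaded;
    # the path is a placeholder.
    import numpy as np

    def load_glove(path, vocab):
        vecs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                if word in vocab:
                    vecs[word] = np.array(values, dtype=np.float32)
        return vecs

    labels = ["food", "laptop", "deodorant"]
    query = "macbook"  # GloVe is word-level, so use the head word of the phrase

    vecs = load_glove("glove.6B.300d.txt", set(labels + [query]))

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best = max(labels, key=lambda w: cosine(vecs[query], vecs[w]))
    print(best)  # likely "laptop"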
It's also good for math lessons.