Both are methods to reduce the overall size of your embeddings, but from what I understand, quantization is generally better than dimensionality reduction, especially if training is quantization-aware.
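To make that concrete, here's a minimal int8 scalar-quantization sketch in numpy (a toy illustration of my own, not any particular library's method; the min/max calibration and array sizes are assumptions):

```python
import numpy as np

# Toy float32 embeddings standing in for any encoder's output.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 768)).astype(np.float32)

# Scalar quantization to int8: calibrate a linear mapping from the
# observed value range, then round. 4x smaller than float32 at rest.
lo, hi = emb.min(), emb.max()
scale = (hi - lo) / 255.0
q = np.round((emb - lo) / scale - 128).astype(np.int8)

# Dequantize and check how well cosine similarity survives.
deq = (q.astype(np.float32) + 128) * scale + lo
num = np.sum(emb * deq, axis=1)
den = np.linalg.norm(emb, axis=1) * np.linalg.norm(deq, axis=1)
print("mean cosine(original, dequantized):", (num / den).mean())
```

Quantization-aware training goes further by letting the model adapt to this rounding during training instead of paying the full error after the fact.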
"AGI" will never be achieved without building a model that a) _continually_ learns, and b) learns from not just text, but from combined auditory and visual (multimodal) sensory information as well.
The reason a 16-year-old can learn to drive much more quickly than existing self-driving models is that the 16-year-old has already built up 16 years' worth of prior knowledge about the physical world.
Don't discount the millions of years of evolution that provided the "blank slate" human learner with perceptual systems, physics-based reasoning, and motor systems ready to be fine-tuned for this slightly different variant of goal-forming, planning, and locomotion.
d) thinks, unprompted and unstimulated. It decides for itself what's important to think about, makes new connections through that process alone, and understands the implications of those new connections and how to use them.
Here's also where I see it ending. It will need energy, likely a LOT, paid by someone, to do this. Who is going to pay that bill for it to maybe, maybe not, come up with something useful, likely mixed with mostly noise and distraction, over undefined timescales, of largely non-measurable value, when there's far greater value, lower cost, and less risk in simply training it deterministically?
That would be like a bird saying humans aren't a Natural General Intelligence because they can't fly. How much vision and audio is required to be intelligent? There's a lot of electromagnetic radiation we can't see and audio bands we can't hear. Would you say that Helen Keller wasn't generally intelligent?
I'm speculating they're attempting to avoid controversy about their data sources. That, and a possible competitive edge depending on what specific sets/filtering they're using.
It’s common for third-party model testers not to disclose what they mean by the “Refusal” parameter as well, for obvious reasons. The world is full of witch-hunting maniacs now and will stay that way for an indefinite amount of time. Just wait until the whole thing becomes more widely known and people catch on. All AI companies have to hurry up before the doors shut.
IMHO much of the key training data can't simply be downloaded/scraped/labeled, no matter what code you had - it's not like it's freely accessible to everyone and just needs some code to get it and process it. You can't scrape all of Google Books archive or all of Twitter, and quite a few things that could be scraped at one point may actively prevent you from scraping them now.
I don't mind having ready-to-use datasets instead of the code for downloading/scraping and labeling. It saves a lot of time. Writing some code to gather the data isn't complicated, but it can sometimes be impossible to replicate the datasets if parts of the data you'd have to scrape are already gone (removed for various reasons).
You need different indexing algorithms for different use cases - brute-force indexing, for example, is "SOTA" when it comes to recall (100%). If you have multiple use cases or if you might have domain shift, you'll want a vector database that supports multiple indexes.
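For reference, here's what brute-force (exact) search looks like in numpy; because it scores every vector, recall is 100% by construction (function and variable names are mine):

```python
import numpy as np

def brute_force_search(query: np.ndarray, corpus: np.ndarray, k: int = 10):
    """Exact nearest-neighbor search over the full corpus.
    Recall is 100% by definition; cost grows linearly with corpus size."""
    # Cosine similarity via normalized dot products.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]
```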
Here's my 2¢:
- If you're just playing around with vector search locally and have a very small dataset, use brute-force search. Don't worry about indexes until later.
- If you have plenty of RAM and CPU cores and would like to squeeze out the most performance, use ScaNN or HNSW plus some form of quantization (product quantization or scalar quantization).
- If you have limited RAM, use IVF plus PQ or SQ (see the sketch after this list).
- If you want to maintain reasonable latency but aren't very concerned about throughput, use a disk-based index such as DiskANN or Starling. https://arxiv.org/pdf/2401.02116.pdf
- If you have a GPU, use GPU-specific indexes. CAGRA (supported in Milvus!) seems to be one of the best. https://arxiv.org/abs/2308.15136
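To make the limited-RAM bullet concrete, here's a minimal IVF+PQ sketch using Faiss (my choice of library for illustration; the nlist/m/nprobe values are arbitrary, not recommendations):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000
xb = np.random.random((n, d)).astype(np.float32)

# IVF partitions the space into nlist clusters; PQ compresses each
# vector into m 8-bit codes, cutting RAM from d*4 bytes per vector
# to roughly m bytes per vector.
nlist, m = 1024, 16
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

index.train(xb)       # learn cluster centroids and PQ codebooks
index.add(xb)
index.nprobe = 16     # clusters scanned per query (recall/speed knob)
D, I = index.search(xb[:5], 10)
```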
All of these indexes are supported in Milvus (https://milvus.io/docs/index.md), so you can pick and choose the right one for your application. Tree-based indexes such as Annoy don't seem to have a sweet spot just yet, but I think there's room for improvement in this subvertical.
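As a sketch of what picking one of these looks like via pymilvus (the collection and field names below are made up, and the parameters are illustrative):

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("my_docs")  # hypothetical existing collection

# Swap index_type per the guidance above: HNSW for RAM + performance,
# IVF_PQ or IVF_SQ8 for limited RAM, DISKANN for disk-based indexes,
# GPU_CAGRA if you have a GPU.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 200},
    },
)
```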
Regarding "future-proof-ness": we've been building production-grade vector search since 2018 and have a number of organizations running it at billion+ scale in production environments. It's all open source too.
I was at Yahoo almost a decade ago, when vector search within Vespa was first being rolled out in production use cases. It was already serving similarity search requests for Flickr back then.
Even though I'm with Zilliz/Milvus now, I wholeheartedly support and recommend folks check out and try Vespa. Congrats to the Vespa team!
EDIT: For folks on Twitter, you should follow Jo (https://twitter.com/jobergum) from Vespa if you aren't already. Great combo of technical content, hot takes, and vector database memes!
Huge congratulations to JKB, Frode, Kim, and the rest of the Vespa team! We are infinitely grateful for all of your help and advice. We are lucky enough to have Vespa as the foundation for our developer-focused enterprise search product.
Having worked with both Solr and Elastic in past search companies, it’s incredible to be able to deploy into Fortune 50 enterprises without any doubts about stability and with all the benefits of a cutting-edge hybrid engine.
Can’t wait to watch you on the next leg of your journey!
Off topic... looking forward to more engineers moving to Mastodon. I have Twitter/X blocked at the DNS level and still fairly frequently encounter interesting accounts that I can't check out.
Nobody's saying it's anybody else's problem - they're just looking forward to more content being on Mastodon (or elsewhere) as people move away from X.
It's not that weird; you're reading too much into the subtext. The web is in a transitory state, platforms change, people move. Wishing for more content to be available on a specific platform without blaming the author is an acceptable comment in my view.
Is there a particular Mastodon server or set of servers that the engineering community is favoring? I know it technically doesn’t matter in a federated network, but curious anyway.
I'm from Pinecone. We use proprietary indexes. We could've used HNSW but decided the high memory consumption (i.e., costly at scale) and slow index updates (i.e., data gets stale) wouldn't cut it for production use cases.
Oh, I read a lot of HNSW material on your Pinecone blog series. (Great learning resource btw, well done!) So I assumed you were already using HNSW. It's news to me that you don't.
Gonna add some information here since this isn't very descriptive.
milvus-lite is a bit like sqlite in that it runs in-process. Here are some scenarios where you'd want to use it:
- You want to use Milvus directly without installing it via Milvus Operator, Helm, Docker Compose, etc.
- You do not want to launch any virtual machines or containers while you are using Milvus.
- You want to embed Milvus features in your Python applications.
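A minimal sketch of that in-process usage (the local file name and collection details are just examples):

```python
from pymilvus import MilvusClient  # pip install pymilvus

# Pointing MilvusClient at a local file runs Milvus Lite in-process,
# much like opening a sqlite database file: no server, VM, or container.
client = MilvusClient("./milvus_demo.db")

client.create_collection(collection_name="demo", dimension=768)
client.insert(collection_name="demo", data=[
    {"id": 0, "vector": [0.1] * 768},
    {"id": 1, "vector": [0.2] * 768},
])
results = client.search(collection_name="demo", data=[[0.1] * 768], limit=1)
```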