Hacker News | fzliu's comments

Both are methods to reduce the overall size of your embeddings, but from what I understand, quantization is generally better than dimensionality reduction, especially if training is quantization-aware.
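As a toy illustration (numpy-only, not quantization-aware; real pipelines would calibrate the quantizer and learn the projection rather than use a plain SVD), both approaches below shrink a float32 embedding matrix 4x, one by narrowing the dtype and one by dropping dimensions:

```python
# Toy comparison of two ways to shrink embeddings 4x. Illustrative only:
# real systems calibrate the quantizer (ideally with quantization-aware
# training) and learn the projection instead of using a plain SVD.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 768)).astype(np.float32)  # toy float32 embeddings

# Scalar quantization: float32 -> int8, same dimensionality, 4x smaller.
scale = np.abs(emb).max() / 127.0
q = np.round(emb / scale).astype(np.int8)
deq = q.astype(np.float32) * scale           # dequantize to measure error

# Dimensionality reduction: keep 768/4 = 192 principal directions, 4x smaller.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = (centered @ vt[:192].T).astype(np.float32)

print(emb.nbytes, q.nbytes, reduced.nbytes)  # 3072000 768000 768000
```

Same memory budget either way; the question is which one loses less of the geometry your retrieval task cares about.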


"AGI" will never be achieved without building a model that a) _continually_ learns, and b) learns from not just text, but from combined auditory and visual (multimodal) sensory information as well.

The reason a 16-year-old can learn how to drive much quicker than existing self-driving models is because the 16-year-old already has built up 16 years worth of prior knowledge about the physical world.


Don't discount the millions of years of evolution to provide the "blank slate" human learner with perceptual systems, physics-based reasoning, and motor systems ready to be fine-tuned for this slightly different variant of goal-forming, planning, and locomotion.


Though it does seem like robots will reach this baseline level soon.


(imo) c) is made to be aware of its own death

The 16-year-old has plenty of motivation to learn how to drive, including the pursuit of reproduction (a cope for mortality).


d) thinks, unprompted and unstimulated. Decides for itself what's important to think about, makes new connections by that process alone, and understands the implications of those new connections and how to use them.

Here's also where I see it ending. It will need energy--likely a LOT, paid by someone, to do this. Who is going to pay that bill for it to maybe, maybe not, come up with something useful, likely mixed with mostly noise and distraction, over undefined timescales, of largely non-measurable value, when there's far greater value, less cost, less risk, in simply training it deterministically?


Various religious types think they'll live on in heaven. I'm not sure that stuff correlates much with learning to drive.


That would be like a bird saying humans aren't a Natural General Intelligence because they can't fly. How much vision and audio is required to be intelligent? There's a lot of electromagnetic radiation we can't see and audio bands we can't hear. Would you say that Helen Keller wasn't generally intelligent?


Okay, but if that's the case, we are no more than a decade away from integrating those into a newer and bigger model.


I think you are underestimating just how many challenges there are in self-driving: https://www.youtube.com/watch?v=kcKchbfn1VY


Nobody will accept self-driving cars that are as dangerous as a teenage driver.


In my mind, what's more crucial here is the code for downloading/scraping and labeling the data, not the model architecture or training script.

As much as I appreciate Mis(x)tral, I would've loved it even more if they'd released the code for gathering the data.


I'm speculating they're attempting to avoid controversy about their data sources. That, and a possible competitive edge, depending on what specific sets/filtering they're using.


To avoid controversy AND potential lawsuits.


Yup.

I think many countries will allow copyrighted (IP) material to be used as training data (Japan already does).

They just need to buy time until then.


It’s common for third-party model testers not to disclose what they mean by the “Refusal” parameter as well, for obvious reasons. The world is full of witch-hunting maniacs now and will stay so for an indefinite amount of time. Just wait until the whole thing becomes more widely known and people realize. All AI companies have to hurry up before the doors shut.


IMHO much of the key training data can't simply be downloaded/scraped/labeled, no matter what code you had - it's not like it's freely accessible to everyone and just needs some code to get it and process it. You can't scrape all of Google Books archive or all of Twitter, and quite a few things that could be scraped at one point may actively prevent you from scraping them now.


I don't mind having ready-to-use datasets instead of the code for downloading/scraping and labeling. It saves a lot of time. It's not complicated to write some code for gathering the data, but it can be impossible to replicate the datasets after the fact if some parts of the data you'd have to scrape are already gone (removed for various reasons).


You need different indexing algorithms for different use cases - brute-force indexing, for example, is "SOTA" when it comes to recall (100%). If you have multiple use cases or if you might have domain shift, you'll want a vector database that supports multiple indexes.

Here's my 2¢:

- If you're just playing around with vector search locally and have a very small dataset, use brute-force search. Don't worry about indexes until later.

- If you have plenty of RAM and CPU cores and would like to squeeze out the most performance, use ScaNN or HNSW plus some form of quantization (product quantization or scalar quantization).

- If you have limited RAM, use IVF plus PQ or SQ.

- If you want to maintain reasonable latency but aren't very concerned about throughput, use a disk-based index such as DiskANN or Starling. https://arxiv.org/pdf/2401.02116.pdf

- If you have a GPU, use GPU-specific indexes. CAGRA (supported in Milvus!) seems to be one of the best. https://arxiv.org/abs/2308.15136

All of these indexes are supported in Milvus (https://milvus.io/docs/index.md), so you can pick and choose the right one for your application. Tree-based indexes such as Annoy don't seem to have a sweet spot just yet, but I think there's room for improvement in this subvertical.
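For reference, the brute-force baseline from the first bullet is only a few lines with numpy (illustrative sketch, using cosine similarity):

```python
# Brute-force ("flat") search: exact top-k by cosine similarity, 100% recall.
# numpy-only sketch; fine as-is for small local datasets.
import numpy as np

def brute_force_search(query, vectors, k=5):
    """Indices of the k most cosine-similar rows of `vectors`."""
    qn = query / np.linalg.norm(query)
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = vn @ qn                    # similarity to every stored vector
    return np.argsort(-sims)[:k]      # exact top-k, no index needed

rng = np.random.default_rng(42)
data = rng.normal(size=(10_000, 128)).astype(np.float32)
top = brute_force_search(data[0], data, k=3)
print(top[0])  # 0 -- a vector is always its own nearest neighbor
```

Once the matrix multiply stops fitting your latency budget, that's the signal to reach for one of the indexes above.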


Throwing a few more on here (mix of beginner and advanced):

- Wikipedia article: https://en.wikipedia.org/wiki/Vector_database

- Vector Database 101: https://zilliz.com/learn/introduction-to-unstructured-data

- ANN & Similarity search: https://vinija.ai/concepts/ann-similarity-search/

- Distributed database: https://15445.courses.cs.cmu.edu/fall2021/notes/21-distribut...



Like IVF, Annoy partitions the entire embedding space into high-dimensional polygons. The difference is how the two algorithms do it - IVF (https://zilliz.com/learn/vector-index) uses centroids, while Annoy (https://zilliz.com/learn/approximate-nearest-neighbor-oh-yea...) is basically just one big binary tree.
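A minimal, hypothetical IVF-style sketch (crude k-means centroids plus inverted lists; real implementations like the ones linked above train centroids far more carefully and tune nprobe):

```python
# Toy IVF: k-means-ish centroids partition the space into inverted lists;
# a query scans only the nprobe partitions with the closest centroids.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 64)).astype(np.float32)

k = 16
centroids = data[rng.choice(len(data), k, replace=False)].copy()
for _ in range(5):  # a few crude k-means iterations
    assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Inverted lists: centroid id -> ids of the vectors assigned to it.
lists = {c: np.where(assign == c)[0] for c in range(k)}

def ivf_search(query, nprobe=2):
    """Return the nearest vector id, scanning only nprobe partitions."""
    order = np.argsort(((centroids - query) ** 2).sum(-1))
    cands = np.concatenate([lists[c] for c in order[:nprobe]])
    return cands[np.argmin(((data[cands] - query) ** 2).sum(-1))]

print(ivf_search(data[123]))  # 123 -- the vector finds itself in its own list
```

Annoy replaces the centroid step with recursive random splits, so the partitions come from a binary tree instead of a flat codebook.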


Most breakthroughs are discovered accidentally and retroactively, so I'd think that having multiple breakthrough papers is fairly uncommon.


Regarding "future-proof-ness": we've been building production-grade vector search since 2018 and have a number of organizations running it at billion+ scale in production environments. It's all open source too.

https://milvus.io


I was at Yahoo almost a decade ago, when vector search within Vespa was first being rolled out in production use cases. It was already serving similarity search requests for Flickr back then.

Even though I'm with Zilliz/Milvus now, I wholeheartedly support and recommend folks check out and try Vespa. Congrats to the Vespa team!

EDIT: For folks on Twitter, you should follow Jo (https://twitter.com/jobergum) from Vespa if you aren't already. Great combo of technical content, hot takes, and vector database memes!


Huge congratulations to JKB, Frode, Kim, and the rest of the Vespa team! We are infinitely grateful for all of your help and advice. We are lucky enough to have Vespa as the foundation for our developer-focused enterprise search product.

Having worked with both Solr and Elastic in past search companies, it’s incredible to be able to deploy into Fortune 50 enterprises without any doubts about stability and with all the benefits of a cutting-edge hybrid engine.

Can’t wait to watch you on the next leg of your journey!

The Atolio team


Thank you for the shout-out Frank!


Off topic... looking forward to more engineers moving to Mastodon. I have Twitter/X blocked at DNS level and still fairly frequently encounter interesting accounts that I can't check out.


Every time I see Twitter/X here, I want to say this.


How is it other people’s problem that you block a site that they’re on?

(Fwiw I have Twitter blocked too, though not for moral reasons)


Nobody's saying it's anybody else's problem - they're just looking forward to more content being on Mastodon (or elsewhere) as people move away from X.


It was said in response to a thread sharing the Twitter profile of a key person mentioned in the article, who also comments here.

The subtext is clearly “jkb, please move to Mastodon cause I blocked Twitter”. Not spelling that out doesn’t make it a lot less weird IMO.


It's not that weird, you are reading too much into the subtext. The web is in a transitory state, platforms change, people move. Wishing for more content to be available on a specific platform without blaming the author is an acceptable comment in my view.


Is there a particular mastodon server or set of servers that the engineering community is favoring? I know it technically doesn’t matter in a federated network, but curious anyway.


I'm on fosstodon. floss.social is also big.


any resource for one to learn on vector search? any textbook or whitepaper recommendations?

I am learning lsh right now and find it fascinating
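The core of random-hyperplane LSH (SimHash) fits in a few lines - a toy sketch, assuming cosine similarity is the metric (sizes here are made up):

```python
# Toy random-hyperplane LSH (SimHash) for cosine similarity: each random
# hyperplane contributes one signature bit, so vectors pointing in similar
# directions tend to collide in the same bucket.
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 64, 16
planes = rng.normal(size=(n_bits, dim))  # one random hyperplane per bit

def signature(v):
    """n_bits-long 0/1 signature: which side of each hyperplane v falls on."""
    return ((planes @ v) > 0).astype(np.uint8)

v = rng.normal(size=dim)
near = v + 0.01 * rng.normal(size=dim)   # tiny perturbation of v
opposite = -v                            # maximally dissimilar direction

print((signature(v) == signature(near)).sum())      # almost always all 16 bits
print((signature(v) == signature(opposite)).sum())  # 0 -- every bit flips
```

The probability two vectors agree on a given bit is 1 - θ/π for angle θ between them, which is what makes the signature a usable proxy for cosine similarity.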


We have a "Vector Database 101" series that covers vector search and vector indexes as well: https://zilliz.com/learn/what-is-vector-database

I've been meaning to dive into ScaNN and DiskANN as well, but haven't gotten around to it yet.


> I've been meaning to dive into ScaNN and DiskANN as well, but haven't gotten around to it yet.

quick TLDR on both vs HNSW?


Any idea what cloud providers like Pinecone use underneath?


I'm from Pinecone. We use proprietary indexes. We could've used HNSW but decided the high memory consumption (i.e., costly at scale) and slow index updates (i.e., data gets stale) wouldn't cut it for production use cases.


Oh, I read a lot of HNSW stuff in your/Pinecone's blog series. (Great learning resource btw, well done!) So I assumed you were using HNSW already. It's news to me that you don't.


Shameless self-plug for milvus-lite:

   $ pip install milvus
   $ python
   >>> import milvus
   >>> milvus.start()


Gonna add some information here since this isn't very descriptive.

milvus-lite is a bit like sqlite where it runs in-process. Here are some scenarios you'd want to use it in:

- You want to use Milvus directly without having it installed using Milvus Operator, Helm, or Docker Compose, etc.

- You do not want to launch any virtual machines or containers while you are using Milvus.

- You want to embed Milvus features in your Python applications.

