raphaelty · on March 21, 2024

I developed this when I was a PhD student in NLP. I think building a personal search system is a fine project if you want to learn.

raphaelty · on March 21, 2024

The Ask button is powered by the ChatGPT API here.

raphaelty · on Feb 2, 2024

The library is clean and well documented

raphaelty · on Nov 18, 2023

You could recommend content based on a user query, tag content produced by the user, or use ColBERT as part of a chatbot to show supporting evidence for the user's questions.

raphaelty · on Nov 18, 2023

It's because of the model's loss. I ask the model to produce a higher similarity between the query and the positive document than between the query and the negative document. I'll add more losses soon so there are more choices.
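
A minimal sketch of that kind of objective, assuming a hinge-style triplet loss over dot-product similarities. The function names and the `margin` parameter are illustrative assumptions, not neural-cherche's actual implementation:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_ranking_loss(query, positive, negative, margin=1.0):
    """Hinge-style triplet loss (illustrative, not the library's code).

    The loss is zero once the query is more similar to the positive
    document than to the negative one by at least `margin`.
    """
    pos_score = dot(query, positive)
    neg_score = dot(query, negative)
    return max(0.0, margin - pos_score + neg_score)


query = [1.0, 0.0]
relevant = [1.0, 0.0]
irrelevant = [0.0, 1.0]

# Correct ordering: the positive already wins by the margin, so no penalty.
good = pairwise_ranking_loss(query, relevant, irrelevant)
# Swapped ordering: the "positive" scores lower, so the loss is positive.
bad = pairwise_ranking_loss(query, irrelevant, relevant)
```

Minimizing this quantity over many (query, positive, negative) triples pushes positives above negatives, which is the behaviour described above.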

alexmolas · on Nov 18, 2023

Is the loss the usual LambdaRank?

raphaelty · on Nov 18, 2023

Nice, it might already be compatible with BGE. I'll try it and add it to the documentation soon.

raphaelty · on Nov 18, 2023

Yes exactly

vorticalbox · on Nov 18, 2023

Does that help much in terms of training?

nerdponx · on Nov 18, 2023

It's a well-established technique for learning a similarity function: https://en.m.wikipedia.org/wiki/Triplet_loss

rolisz · on Nov 18, 2023

Yes, this is called triplet loss and has made embeddings much better.

raphaelty · on Nov 18, 2023

The documentation has an evaluation module with detailed information. The idea is to gather relevant query-document pairs that are not part of the training set, then measure, using various metrics, how well your model retrieves the relevant documents.
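
As a rough illustration of that procedure, here is a sketch that computes two common retrieval metrics, recall@k and mean reciprocal rank, over held-out queries. The function names and data layout are assumptions for the example and are not the library's evaluation API:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(rankings, qrels, k=10):
    """Average both metrics over every held-out query in `qrels`.

    rankings: query id -> ranked list of document ids returned by the model
    qrels:    query id -> set of relevant document ids (not seen in training)
    """
    if not qrels:
        raise ValueError("qrels must contain at least one query")
    n = len(qrels)
    recall = sum(recall_at_k(rankings.get(q, []), rel, k) for q, rel in qrels.items()) / n
    mrr = sum(reciprocal_rank(rankings.get(q, []), rel) for q, rel in qrels.items()) / n
    return {"recall": recall, "mrr": mrr}


rankings = {"q1": ["d2", "d1", "d3"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d4", "d9"}}
scores = evaluate(rankings, qrels, k=2)
```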

raphaelty · on Nov 18, 2023

Hi, there is a single loss right now, but I plan to add some Sentence Transformers losses. ColBERT is slow as a retriever, but it is quite efficient as a ranker on GPU (way faster than a cross-encoder). I plan to release pre-trained checkpoints on HuggingFace with benchmarks using BEIR and inference speed info.

aashu_dwivedi · on Nov 18, 2023

Do you mean it's faster when the embeddings are pre-computed, or is it also faster when the embeddings are computed on the fly? Also, what's the recommended way to store ColBERT embeddings? Because of their 2D nature, it's not practical to store them in a vector database.

raphaelty · on Nov 18, 2023

Yes, ColBERT is fast because you can pre-compute most embeddings. It's important to compute document embeddings only once. neural-cherche does not compute embeddings on the fly, and the retrieve method asks for query and document embeddings rather than query and document texts.

Document and query embeddings can be obtained using the .encode_documents and .encode_queries methods.
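
To show why pre-computed embeddings are enough for ranking, here is a small pure-Python sketch of ColBERT-style late interaction (MaxSim): each query token takes its best match among the document's token embeddings, and the matches are summed. It is an illustration under assumptions and not neural-cherche code; the toy vectors stand in for real token embeddings and `maxsim_score` and `rank_documents` are hypothetical names:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_emb, doc_emb):
    """Sum, over query tokens, of the best dot product with any document token.

    Both arguments are lists of token vectors, which is the "2D" shape of
    ColBERT embeddings (one vector per token).
    """
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)


def rank_documents(query_emb, doc_embs):
    """Rank document ids by MaxSim score using pre-computed embeddings only."""
    scores = {doc_id: maxsim_score(query_emb, emb) for doc_id, emb in doc_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy pre-computed embeddings: two query tokens, documents of varying length.
query_emb = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "b": [[1.0, 0.0]],              # matches only the first query token
    "c": [[0.5, 0.25]],             # weak match on both
}
ranking = rank_documents(query_emb, doc_embs)
```

No text is encoded at ranking time here, so the only per-query cost is the token-level dot products.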

I save most of my embeddings (a Python dictionary with document IDs as keys and embeddings as values) using joblib in a bucket in the cloud. I don't really know if it's good practice, but it scales fine to a few million documents for offline (non-real-time) applications.
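
A dependency-free sketch of that storage pattern, assuming the dictionary layout described above. It swaps in the standard-library `pickle` module for joblib (`joblib.dump` and `joblib.load` fill the same role), and it writes to a local temporary directory rather than a cloud bucket; the helper names are hypothetical:

```python
import os
import pickle
import tempfile


def save_embeddings(embeddings, path):
    """Serialize a {document_id: embedding} dictionary to a single file."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_embeddings(path):
    """Load the dictionary back. Only unpickle files you wrote yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy data: each document maps to a list of token vectors.
embeddings = {
    "doc-1": [[0.1, 0.2], [0.3, 0.4]],
    "doc-2": [[0.5, 0.6]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.pkl")
    save_embeddings(embeddings, path)
    restored = load_embeddings(path)
```

The resulting file can then be uploaded to whatever object store is in use and loaded once at the start of an offline job.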

raphaelty · on May 8, 2023

Cherche 2.0 is now available, and it's been optimized for batch-computing, along with other new features. Whether you're a practitioner, researcher, or hacker interested in semantic search, Cherche might be a good fit for your needs.