I built my personal search engine, which records things I like on Twitter, blog posts, etc. It automatically calls those APIs using GitHub Actions and stores them in an open source database (a JSON file).
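The update step can be sketched roughly like this: a scheduled job fetches new bookmarks and appends them to the JSON file. This is a minimal illustration, not the actual workflow; `fetch_bookmarks` and the file name are placeholders for the real API calls and repository layout.

```python
import json
from pathlib import Path

def fetch_bookmarks():
    # Placeholder: the real job would call the Twitter / blog APIs here.
    return [{"url": "https://example.com", "title": "Example post"}]

def update_database(path="database.json"):
    """Append newly fetched bookmarks to the JSON 'database' file."""
    db_path = Path(path)
    documents = json.loads(db_path.read_text()) if db_path.exists() else []
    known = {doc["url"] for doc in documents}
    # Only keep documents we have not stored before.
    new_docs = [doc for doc in fetch_bookmarks() if doc["url"] not in known]
    documents.extend(new_docs)
    db_path.write_text(json.dumps(documents, indent=2))
    return documents
```

A GitHub Actions workflow with a `schedule` trigger would then run this script and commit the updated file back to the repository.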
I actually use it at least twice a week to retrieve content I bookmarked, so I'm happy to have created such a tool.
I think 10 million documents is a large corpus. A retriever like scikit-learn's TF-IDF will have a hard time handling it in a reasonable time. The main goal of Cherche is to prototype a neural search engine quickly, with a large choice of retrievers and rankers, for corpora of fewer than 1 million documents, which is a common use case in industry.
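For scale, a bare TF-IDF retriever of the kind mentioned above looks like this in scikit-learn (the documents and `retrieve` helper are made up for illustration); every query is a similarity computation against the whole corpus, which is why it stops being practical at tens of millions of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for a real document collection.
documents = [
    "Cherche is a neural search pipeline.",
    "Knowledge graphs store structured facts.",
    "GitHub Actions can run scheduled workflows.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    """Return indices of the k documents most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    return scores.argsort()[::-1][:k].tolist()
```

Here `retrieve("neural search")` ranks the first document highest, since it is the only one sharing the query terms.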
Cherche implements a wrapper around the Python Elasticsearch client that is scalable and dedicated to corpora composed of tens of millions of documents.
1) The dependency on the Python Elasticsearch client allows Elasticsearch to be used as a retriever. The same goes for Lunr. It might be interesting to separate the different dependencies.
Knowledge graphs are structured resources, in the form of graphs, that contain knowledge. These resources are used in a large number of applications linked to machine learning.
I just published a library dedicated to knowledge graph embeddings. The Mkb API is inspired by scikit-learn. It provides modular tools for building latent graph representations.
The app: https://raphaelsty.github.io/knowledge/?query=bayesian
The GitHub repo: https://github.com/raphaelsty/knowledge