Hacker News

I found that grep actually outperformed vector search for many queries. The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).

Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.

I made some attempts at that (finding neighbors by their co-occurrence in text), but I ended up with a lot of noise (words that often appear together without sharing a meaning). So I'd probably have to use actual embeddings instead.
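A minimal sketch of the embeddings variant, assuming you already have real word vectors to plug in (the words and tiny 3-d vectors below are purely made-up stand-ins for, say, fastText vectors):

```python
import numpy as np

# Toy "embeddings" -- hypothetical stand-ins for real model vectors.
embeddings = {
    "car":   np.array([0.9, 0.1, 0.0]),
    "auto":  np.array([0.85, 0.15, 0.05]),
    "truck": np.array([0.7, 0.3, 0.1]),
    "apple": np.array([0.0, 0.2, 0.9]),
}

def neighbors(word, k=2):
    """Return the k keywords closest to `word` by cosine similarity."""
    q = embeddings[word]

    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

    scored = [(w, cos(v)) for w, v in embeddings.items() if w != word]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [w for w, _ in scored[:k]]
```

Precompute `neighbors()` for every indexed keyword and you get a static expansion table you can apply at query time without running a model.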

More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)
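For scale, a basic tf-idf ranker really is only a couple dozen lines of plain Python. A stripped-down sketch (the toy corpus is made up, and a real version would at least lowercase and strip punctuation):

```python
import math
from collections import Counter

docs = {
    "a": "grep is fast for exact keyword search",
    "b": "vector search handles semantic queries",
    "c": "keyword search with tf idf is easy in python",
}

# Term frequencies per document, and document frequencies per term.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)
N = len(docs)

def score(query, doc_id):
    """Sum of tf-idf weights of the query terms in one document."""
    counts = tf[doc_id]
    return sum(
        counts[t] * math.log(N / df[t])
        for t in query.split() if t in df
    )

def search(query):
    """Rank all document ids by descending tf-idf score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)
```

Note the idf term: a word that appears in every document gets log(N/N) = 0 weight, so common words drop out of the ranking automatically.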



> The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).

I think that's the only thing GUI (or TUI) directories have over the CLI. I remember having Wikipedia locally (English text, back in 2010), and the portals were surprisingly useful. They act like a semantic space in case you can't find an article for your exact word. So Literature > Fiction > Fantasy > Epic Fantasy will probably land you somewhere close to "The Lord of the Rings".


Do you know of any way to build a fast index you can run grep against? I would love to have something as instantaneous as "Everything" on Windows, but for full text on Linux, so I can just dump everything in a directory.


Have you tried the more modern solutions like ripgrep, ack, etc.?

Or for something more comprehensive (to also search PDF, docx, etc.) there is ripgrep-all:

https://github.com/phiresky/ripgrep-all


As others have said, ripgrep et al. are faster than regular grep. You would probably also get much faster results with an alias that excludes directories you don't expect results in (e.g. I don't normally grep in /var at all).

I have seen some recommendations for recoll, but I haven't used it, so I can't comment. Anecdotally, I normally just use ripgrep in my home directory (it's almost always in ~ if I don't remember where it is). It's fast enough as long as my home directory is local (i.e. not on NFS).
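For example, a small wrapper function in ~/.bashrc does the excluding for you. The name `g` and the exclude list here are just placeholders to adjust to taste:

```shell
# Hypothetical wrapper: recursive grep that skips binary files (-I)
# and directories you rarely want hits from.
g() {
  grep -rIn \
    --exclude-dir=var \
    --exclude-dir=proc \
    --exclude-dir=.git \
    "$@"
}
```

ripgrep gets you most of this by default (it honors .gitignore and skips binaries), so the wrapper mainly helps if you're stuck with plain grep.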


Tracker is an open source project for that. It has been around for some 10+ years now. https://tracker.gnome.org/overview/


Try ripgrep.


The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.


A combination of both could help!
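One simple way to combine them is to run both searches and merge the ranked lists, e.g. with reciprocal rank fusion. A sketch, where the document ids and the two input rankings are hypothetical:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: blend several ranked result lists.

    Each document scores 1/(k + rank) in every list it appears in;
    k=60 is the commonly used damping constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc1", "doc3", "doc2"]   # e.g. from grep / tf-idf
vector_hits  = ["doc3", "doc2", "doc4"]   # e.g. from embeddings

fused = rrf([keyword_hits, vector_hits])
```

The appeal is that it only needs ranks, not scores, so you never have to make tf-idf scores and cosine similarities commensurable.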


Most developers are going to outperform vector search. We “get” how computers do lookups so we build our queries appropriately.

Vector search is amazing for using layman concepts.



