Wow. This is nice. I can see putting it to use with CSV and JSON via gron or something. I don’t have that much markdown to search, but having a tool be just there in the shell might change that.
I see they have enabled Snowball stemmers. I wonder if using other Lucene analyzers, such as Voikko for Finnish, is feasible. Snowball wasn’t particularly good when the text got complex. I used to deal with Lucene and Solr way back when. Based on the OP, I see GraalVM requires changes.
I realize some relatively obscure Finnish stemmer and Lucene with GraalVM aren't exactly a common use case. I did some testing and described my use case. I certainly have a lot of English-language content to search with lucene-grep. So, thank you for making it!
One use case where Lucene’s tokenizing approach tends to work less well than something like grep is querying for a substring of a token, e.g. if the text is “I walked through the town” and I want to search for “oug”.
Does lmgrep offer a performant solution for this kind of case, or is it a situation where it’s better to stick with regular grep?
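To make the difference concrete, here's a toy Python sketch (not lmgrep's actual implementation, just an illustration) of why a token-level match misses that fragment while a byte-level scan finds it:

    # Toy illustration: a tokenizer-based matcher indexes whole terms,
    # so a query for a fragment like "oug" matches no term, while a
    # grep-style substring scan over the raw text does find it.
    text = "I walked through the town"

    tokens = text.lower().split()               # crude whitespace tokenizer
    print("oug" in tokens)                      # False: "oug" is not a whole token
    print(any("oug" in tok for tok in tokens))  # True: grep-style substring scan

Lucene's query syntax does have wildcard queries (e.g. *oug*), but queries with a leading wildcard are notoriously slow, so for pure substring hunting plain grep is probably still the better fit.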
I’ve been looking for something exactly like this. I keep the complete text of articles I’ve enjoyed but, to date, effective searching has meant spinning up an ES instance, which is painful. This is a specific use case that isn’t necessarily well served by something like grep or ripgrep. I’ll definitely try this, thanks - looks very elegant.
Is it automated in some way during web browsing? Do you remember to copy articles to a folder when you’ve enjoyed them enough, or do you use a reading app/e-reader to read them, so they’re already downloaded?
Sure, I just hacked something together as I wanted it to fit around my existing workflow. I’ve been using Instapaper since forever, and I wanted something built around that instead of “every URL I visit”, as most shit you read has a low signal:noise ratio.
I wrote some Python to drive Selenium to get the URLs (not the full text) from Instapaper, then pass those URLs to newspaper3k, where a lot of the downloading and parsing work is done. I then save the output to SQLite. From there I was previously having ES build indexes, but recently switched to hosted Algolia, which seems to be basically free for my use case and has some nice libraries for building real-time search front ends too. I’ll be trying lmgrep as a substitute, though.
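Roughly, the middle of that pipeline (URLs in, SQLite out) looks like this; the table name and columns here are simplified for illustration, and it assumes the URLs were already pulled out of Instapaper by the Selenium step:

    # Sketch of the "URLs -> newspaper3k -> SQLite" step.
    import sqlite3
    from newspaper import Article  # pip install newspaper3k

    urls = ["https://example.com/some-article"]  # from the Selenium/Instapaper step

    conn = sqlite3.connect("articles.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
    )

    for url in urls:
        article = Article(url)
        article.download()   # fetch the HTML
        article.parse()      # extract title and body text
        conn.execute(
            "INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
            (url, article.title, article.text),
        )

    conn.commit()
    conn.close()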
The key thing about searching the text of articles you’ve read is that you want an intelligent ranking of all articles that bear on a subject, in order of relevance. That’s not something you can get with grep/ripgrep. ES is pretty good at it out of the box. But it’s also a pain to set up and run - you’ll probably end up needing something like Docker.
There are a thousand different ways you could do something like this - this is just the way I do it.
Not OP so I can't speak for them. There's a bunch of ways to do this, ranging from turnkey solutions to collections of scripts and extensions. On the turnkey side, there are programs like ArchiveBox [1] which take links and store them as WARC files. You can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to set something up yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and manually wget those URLs. For a reference to the more "bootstrapped" version, I'll link to Gwern's post on their archiving setup [2]. It's fairly long, so I advise skipping to the parts you're interested in first.
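A rough sketch of the DIY route (Firefox's history really does live in a table called moz_places, but the paths and filtering here are illustrative, and you should copy places.sqlite out of your profile first since Firefox keeps it locked):

    # Sketch: pull URLs from a copy of Firefox's history database
    # and fetch each one with wget (assumes wget is installed).
    import sqlite3
    import subprocess

    # Work on a copy of places.sqlite from your Firefox profile.
    conn = sqlite3.connect("places.sqlite")
    rows = conn.execute("SELECT url FROM moz_places WHERE url LIKE 'http%'")

    for (url,) in rows:
        # wget each page; wget's --warc-file flag can also produce WARCs
        subprocess.run(["wget", "--quiet", "--directory-prefix=archive", url])

    conn.close()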
I'm the author of `lmgrep`.
Happy to hear that you liked it.
I have a similar use case: searching for blog posts that are in markdown source files.
I think there's a company attempting to implement their own version of something similar. A very important part of search is also the understanding of language semantics. Something that is really cool for this is Kythe [0].
"Then the most complicated part was to prepare executable binaries for different operating systems. Plenty of CPU, RAM, VirtualBox with Windows and macOS virtual machines, and here we go."