Wow. This is nice. I can see putting it to use with CSV and JSON via gron or something. I don’t have that much markdown to search, but having a tool be just there in the shell might change that.
I see they have enabled Snowball stemmers. I wonder if using other Lucene analyzers, such as Voikko for Finnish, is feasible. Snowball wasn’t particularly good when the text got complex. I used to deal with Lucene and Solr way back when. Based on the OP, I see GraalVM requires changes.
I realize some relatively obscure Finnish stemmer and Lucene with GraalVM aren't exactly a common use case. I did some testing and described my use case. I certainly have a lot of English-language content to search with lucene-grep. So, thank you for making it!
One use case where Lucene’s tokenizing approach tends to work less well than something like grep is querying for a substring of a token, e.g. if the text is “I walked through the town” and I want to search for “oug”.
Does lmgrep offer a performant solution for this kind of case, or is it a situation where it’s better to stick with regular grep?
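To make the difference concrete, here's a toy Python sketch (not lmgrep's actual implementation, just an illustration) of why a token-level match misses that fragment while a byte-level scan finds it:

    # Toy illustration: a tokenizer-based matcher indexes whole terms,
    # so a query for a fragment like "oug" matches no term, while a
    # grep-style substring scan over the raw text does find it.
    text = "I walked through the town"

    tokens = text.lower().split()               # crude whitespace tokenizer
    print("oug" in tokens)                      # False: "oug" is not a whole token
    print(any("oug" in tok for tok in tokens))  # True: grep-style substring scan

Lucene's query syntax does have wildcard queries (e.g. *oug*), but queries with a leading wildcard are notoriously slow, so for pure substring hunting plain grep is probably still the better fit.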
I’ve been looking for something exactly like this. I keep the complete text of articles I’ve enjoyed but, to date, effective searching has meant spinning up an ES instance, which is painful. This is a specific use case that isn’t necessarily well served by something like grep or ripgrep. I’ll definitely try this, thanks - looks very elegant.
Is it automated in some way during web browsing? Do you remember to copy articles to a folder when you’ve enjoyed them enough, or do you use a reading app/e-reader to read them, so they’re already downloaded?
Sure, I just hacked something together as I wanted it to fit around my existing workflow. I’ve been using Instapaper since forever, and I wanted something built around that instead of “every URL I visit”, as most shit you read has a low signal:noise ratio.
I wrote some Python to drive Selenium to get the URLs (not the full text) from Instapaper, then pass those URLs to newspaper3k, where a lot of the downloading and parsing work is done. I then save the output to SQLite. From there I was previously having ES build indexes, but recently switched to hosted Algolia, which seems to be basically free for my use case and has some nice libraries for building real-time search front ends too. I’ll be trying lmgrep as a substitute, though.
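Roughly, the middle of that pipeline (URLs in, SQLite out) looks like this; the table name and columns here are simplified for illustration, and it assumes the URLs were already pulled out of Instapaper by the Selenium step:

    # Sketch of the "URLs -> newspaper3k -> SQLite" step.
    import sqlite3
    from newspaper import Article  # pip install newspaper3k

    urls = ["https://example.com/some-article"]  # from the Selenium/Instapaper step

    conn = sqlite3.connect("articles.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
    )

    for url in urls:
        article = Article(url)
        article.download()   # fetch the HTML
        article.parse()      # extract title and body text
        conn.execute(
            "INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
            (url, article.title, article.text),
        )

    conn.commit()
    conn.close()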
The key thing about searching the text of articles you’ve read is that you want an intelligent ranking of all articles that bear on a subject, in order of relevance. That’s not something you can get with grep/ripgrep. ES is pretty good at it out of the box. But it’s also a pain to set up and run - you’ll probably end up needing something like Docker.
There are a thousand different ways you could do something like this - this is just the way I do it.
Not OP so I can't speak for them. There's a bunch of ways to do this, ranging from turnkey solutions to collections of scripts and extensions. On the turnkey side, there are programs like ArchiveBox [1] which take links and store them as WARC files. You can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to set something up yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and manually wget those URLs. For a reference to the more "bootstrapped" version, I'll link to Gwern's post on their archiving setup [2]. It's fairly long, so I advise skipping to the parts you're interested in first.
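A rough sketch of the DIY route (Firefox's history really does live in a table called moz_places, but the paths and filtering here are illustrative, and you should copy places.sqlite out of your profile first since Firefox keeps it locked):

    # Sketch: pull URLs from a copy of Firefox's history database
    # and fetch each one with wget (assumes wget is installed).
    import sqlite3
    import subprocess

    # Work on a copy of places.sqlite from your Firefox profile.
    conn = sqlite3.connect("places.sqlite")
    rows = conn.execute("SELECT url FROM moz_places WHERE url LIKE 'http%'")

    for (url,) in rows:
        # wget each page; wget's --warc-file flag can also produce WARCs
        subprocess.run(["wget", "--quiet", "--directory-prefix=archive", url])

    conn.close()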
I'm the author of `lmgrep`.
Happy to hear that you liked it.
I have a similar use case: searching for blog posts that are in markdown source files.
I think there's a company attempting to implement their own version of something similar. A very important part of search is also the understanding of language semantics. Something that is really cool for this is Kythe [0].
"Then the most complicated part was to prepare executable binaries for different operating systems. Plenty of CPU, RAM, VirtualBox with Windows and macOS virtual machines, and here we go."