
I’ve been looking for something exactly like this. I keep the complete text of articles I’ve enjoyed but, to date, effective searching has meant spinning up an ES instance, which is painful. This is a specific use case that isn’t necessarily well served by something like grep or ripgrep. I’ll definitely try this, thanks - looks very elegant.


Can you say more? I'm curious.

Is it automated in some way during web browsing, do you remember to copy articles to a folder when you've enjoyed them enough, or do you use a reading app/e-reader to read them so they're already downloaded?


Sure, I just hacked something together as I wanted it to fit around my existing workflow. I’ve been using Instapaper since forever, and I wanted something built around that instead of “every URL I visit”, as most shit you read has a low signal:noise ratio.

I wrote some Python to drive Selenium to get the URLs (not the full text) from Instapaper, then pass those URLs to newspaper3k, which does most of the downloading and parsing work. I then save the output to SQLite. From there I was previously having ES build the indexes, but I recently switched to hosted Algolia, which seems to be basically free for my use case and has some nice libraries for building real-time search front ends too. I’ll be trying lmgrep as a substitute, though.
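
If it helps, here's roughly what the newspaper3k-to-SQLite step looks like. This is just a simplified sketch - the table name, schema, and function are made up for illustration, and the Selenium/Instapaper part is omitted:

    # Simplified sketch of the fetch-and-store step. Assumes newspaper3k
    # (pip install newspaper3k) and the standard library's sqlite3.
    import sqlite3
    from newspaper import Article

    def save_article(url, db_path="articles.db"):
        # Download the page and let newspaper3k extract the title and body text.
        article = Article(url)
        article.download()
        article.parse()

        # Store the result; ES/Algolia/lmgrep can index this table later.
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO articles (url, title, text) VALUES (?, ?, ?)",
            (url, article.title, article.text),
        )
        conn.commit()
        conn.close()

    # URLs would come from the Selenium step; this one is just a placeholder.
    save_article("https://example.com/some-article")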

The key thing about searching the text of articles you’ve read is that you want an intelligent ranking of all articles that bear on a subject, in order of relevance. That’s not something you can get with grep/ripgrep. ES is pretty good at it out of the box. But it’s also a pain to set up and run - you’ll probably end up needing something like Docker.

There are a thousand different ways you could do something like this - this is just the way I do it.


Not OP, so I can't speak for them. There's a bunch of ways to do this, ranging from turnkey solutions to collections of scripts and extensions. On the turnkey side, there are programs like ArchiveBox[1] which take links and store them as WARC files; you can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to build something yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and wget those URLs - there's a rough sketch of that after the links below. For the more "bootstrapped" version, see Gwern's post on their archiving setup [2]. It's fairly long, so I'd advise skipping to the parts you're interested in first.

1: https://github.com/ArchiveBox/ArchiveBox

2: https://www.gwern.net/Archiving-URLs
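
Here's the rough sketch of the DIY version. The profile path and limit are placeholders, and Firefox locks places.sqlite while it's running, so you may need to copy the file somewhere first:

    # Pull recent URLs out of Firefox's history DB and hand them to wget.
    # The profile path is a placeholder - it varies per machine/profile.
    import sqlite3
    import subprocess

    PLACES_DB = "/path/to/firefox/profile/places.sqlite"

    def archive_history(limit=100):
        conn = sqlite3.connect(PLACES_DB)
        rows = conn.execute(
            "SELECT url FROM moz_places ORDER BY last_visit_date DESC LIMIT ?",
            (limit,),
        ).fetchall()
        conn.close()

        for (url,) in rows:
            # --page-requisites/--convert-links give a browsable local copy;
            # wget's --warc-file option (or ArchiveBox) gives proper WARC output.
            subprocess.run(["wget", "--page-requisites", "--convert-links", url])

    archive_history(limit=10)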


I'm the author of `lmgrep`. Happy to hear that you liked it. I have a similar use case: searching blog posts that live in Markdown source files.


Have you considered using DocFetcher or Recoll?


Or Zotero?



