
I’ve been looking for something exactly like this. I keep the complete text of articles I’ve enjoyed but, to date, effective searching has meant spinning up an ES instance, which is painful. This is a specific use case that isn’t necessarily well served by something like grep or ripgrep. I’ll definitely try this, thanks - looks very elegant.


Can you say more? I'm curious.

Is it automated in some way during web browsing, do you remember to copy articles to a folder when you've enjoyed them enough, or do you use a reading app/e-reader to read them so they're already downloaded?


Sure, I just hacked something together as I wanted it to fit around my existing workflow. I’ve been using Instapaper since forever, and I wanted something built around that instead of “every URL I visit”, as most shit you read has a low signal:noise ratio.

I wrote some Python to drive Selenium to get the URLs (not the full text) from Instapaper, then pass those URLs to newspaper3k, which does most of the downloading and parsing work. I then save the output to SQLite. From there I was previously having ES build the indexes, but I recently switched to hosted Algolia, which seems to be basically free for my use case and has some nice libraries for building real-time search front ends too. I’ll be trying lmgrep as a substitute, though.
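
If it helps, here's roughly what the newspaper3k-to-SQLite step looks like. This is just a simplified sketch - the table name, schema, and function are made up for illustration, and the Selenium/Instapaper part is omitted:

    # Simplified sketch of the fetch-and-store step. Assumes newspaper3k
    # (pip install newspaper3k) and the standard library's sqlite3.
    import sqlite3
    from newspaper import Article

    def save_article(url, db_path="articles.db"):
        # Download the page and let newspaper3k extract the title and body text.
        article = Article(url)
        article.download()
        article.parse()

        # Store the result; ES/Algolia/lmgrep can index this table later.
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO articles (url, title, text) VALUES (?, ?, ?)",
            (url, article.title, article.text),
        )
        conn.commit()
        conn.close()

    # URLs would come from the Selenium step; this one is just a placeholder.
    save_article("https://example.com/some-article")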

The key thing about searching the text of articles you’ve read is that you want an intelligent ranking of all articles that bear on a subject, in order of relevance. That’s not something you can get with grep/ripgrep. ES is pretty good at it out of the box. But it’s also a pain to set up and run - you’ll probably end up needing something like Docker.

There are a thousand different ways you could do something like this - this is just the way I do it.


Not OP, so I can't speak for them. There's a bunch of ways to do this, ranging from turnkey solutions to collections of scripts and extensions. On the turnkey side, there are programs like ArchiveBox[1] which take links and store them as WARC files; you can import your browsing history into ArchiveBox and set up a script to do it automatically. If you'd like to build something yourself, you can extract your browsing history (e.g., Firefox stores its history in a SQLite database) and wget those URLs - there's a rough sketch of that after the links below. For the more "bootstrapped" version, see Gwern's post on their archiving setup [2]. It's fairly long, so I'd advise skipping to the parts you're interested in first.

1: https://github.com/ArchiveBox/ArchiveBox

2: https://www.gwern.net/Archiving-URLs
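
Here's the rough sketch of the DIY version. The profile path and limit are placeholders, and Firefox locks places.sqlite while it's running, so you may need to copy the file somewhere first:

    # Pull recent URLs out of Firefox's history DB and hand them to wget.
    # The profile path is a placeholder - it varies per machine/profile.
    import sqlite3
    import subprocess

    PLACES_DB = "/path/to/firefox/profile/places.sqlite"

    def archive_history(limit=100):
        conn = sqlite3.connect(PLACES_DB)
        rows = conn.execute(
            "SELECT url FROM moz_places ORDER BY last_visit_date DESC LIMIT ?",
            (limit,),
        ).fetchall()
        conn.close()

        for (url,) in rows:
            # --page-requisites/--convert-links give a browsable local copy;
            # wget's --warc-file option (or ArchiveBox) gives proper WARC output.
            subprocess.run(["wget", "--page-requisites", "--convert-links", url])

    archive_history(limit=10)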


I'm the author of `lmgrep`. Happy to hear that you liked it. I have a similar use case: searching blog posts that live in Markdown source files.


Have you considered using DocFetcher or Recoll?


Or Zotero?



