I wish someone would make a good open-source library for this complicated and annoying but valuable task.
We did some work in 2014 on an archive of all stories posted to HN, with the goals of (a) having a lightweight, readable version of everything quickly available and (b) doing analytics on the content. But we got bogged down on getting the actual content programmatically across the full spectrum of cases. This is one of those problems where not merely one devil is in the details but a whole legion of them, and not the glamorous kind. Getting it right would have sucked up all our resources, and the APIs out there (e.g. Readability) came with problems too, so we dropped the project.
But for a programmer who enjoys the snake-pit-of-corner-cases type of challenge, this would make a fine project, one with real public-service potential. We can't work on it ourselves, but we'd consider funding it.
Completely agree. A friend and I tried to do something like this as a fun project at a hackathon; getting to 80% wasn't difficult, just a lot of parsing the DOM for articles. Dealing with things like adverts, photo captions, comments, and other text that shouldn't be in the actual article was the real pain -- especially when we wanted to detect paragraph/subheader breaks, since we were parsing articles for text-to-speech.
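Not their code, just an illustration of that "easy 80%": a minimal sketch assuming BeautifulSoup that strips tags which are usually boilerplate and keeps whichever container holds the most paragraph text. The tag list and scoring rule are my assumptions; the ads, captions, and comment threads that still slip through are exactly the corner cases described above.

    from bs4 import BeautifulSoup

    # Tags assumed (not guaranteed) to be non-article boilerplate.
    STRIP = ["script", "style", "nav", "aside", "header", "footer",
             "figcaption", "form"]

    def guess_article(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(STRIP):
            tag.decompose()  # drop obvious boilerplate wholesale
        best, best_score = None, 0
        # Score candidate containers by the prose length of their direct <p> children.
        for container in soup.find_all(["article", "main", "section", "div", "body"]):
            score = sum(len(p.get_text(strip=True))
                        for p in container.find_all("p", recursive=False))
            if score > best_score:
                best, best_score = container, score
        if best is None:
            return soup.get_text(" ", strip=True)
        return "\n\n".join(p.get_text(" ", strip=True)
                           for p in best.find_all("p", recursive=False))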
Good point: the constant (constant!) maintenance aspect means there would need to be a sustainable plan. On the other hand, if lots of projects started depending on the library, you'd at least get a steady supply of notifications about breakage, and perhaps fixes as well.
The challenge is that there are often several open-source variants of content extraction, but without continual commercial support, they often languish after serving the needs of their original creators.
Goose[0], for example, was a project I last explored for article summarization.
No, but if someone from the project contacts us (hn@ycombinator.com) we'll invite them to do a Show HN. Failing that, I'll send a repost invite to the user who first posted the project here (https://news.ycombinator.com/from?site=wallabag.org).
Agreed. It would be nice if they had an API for Instaparser. First thing I did was CTRL+F api and check their API docs for any updates. I tried using Readability on all of PG's essays and I was surprised by how poor the results were.
I would imagine the key to solving this problem effectively is less about parsing the DOM and more about recognizing what content on a page is valuable and relevant.
I could see an algorithm merely trying to find the longest prose on a page, then analyzing it for content and then trying to find other content within other elements that are likely relevant. Armed with this and multiple articles from the same source, you should be able to correlate the two to get a smarter parser that can adapt to changing DOM on each site.
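A hypothetical sketch of that idea, assuming BeautifulSoup: for each sample page from a site, find where the longest prose lives, express that location as a tag/class path, and let a majority vote across pages become the site's learned selector. The path format and voting rule are assumptions for illustration, not any real parser's internals.

    from collections import Counter
    from bs4 import BeautifulSoup

    def dom_path(tag):
        # Describe a node's position as a chain of tag.class names.
        parts = []
        while tag is not None and tag.name != "[document]":
            cls = ".".join(sorted(tag.get("class", [])))
            parts.append(tag.name + ("." + cls if cls else ""))
            tag = tag.parent
        return " > ".join(reversed(parts))

    def longest_prose_path(html):
        soup = BeautifulSoup(html, "html.parser")
        paras = soup.find_all("p")
        if not paras:
            return ""
        longest = max(paras, key=lambda p: len(p.get_text(strip=True)))
        return dom_path(longest.parent)

    def learn_site_selector(html_pages):
        # The path that wins on most sample pages is our guess for this site.
        votes = Counter(longest_prose_path(h) for h in html_pages if h)
        return votes.most_common(1)[0][0] if votes else ""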
I realize there are business reasons for not sharing the full details of the parser (or open sourcing it entirely), but it would still be interesting to hear more details about the actual components/architecture/techniques being used.
I would assume this is built on some sort of headless browser implementation but who knows, maybe not. Hopefully Instapaper does a followup with more technical details.
A very simple thing that I've been reporting to Instapaper multiple times over the past year is fixing the GitHub support. I often save pages from GitHub to Instapaper (such as https://github.com/docker/docker/blob/master/docs/userguide/...), but it only saves the repository root link, so I usually end up losing the original link entirely, which is pretty annoying.
I'm not sure how this article is supposed to make a paying user happy; it doesn't show any measurable metrics (not that any user should actually care about that -- it should just work). I still wonder how hard adding a simple "if github then" check would be.
MarkItDown is a "toy" implementation of a rich-text-to-Markdown converter that I wrote 3 years ago and that still enjoys significant usage. Maybe it could be used as a start for a more full-featured parser.
I am not talking about a content parsing problem. Because of some stupid canonical URL header GitHub serves, Instapaper keeps saving the URL with the path redacted down to the repo level, and they haven't fixed it in all this time.
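I don't know Instapaper's pipeline, but the "if github then" check being asked for is roughly this shape (a guess, not their code): when the saved page's host is github.com, keep the URL the user actually submitted instead of trusting the page's canonical URL, which (per the parent comment) points at the repo root.

    from urllib.parse import urlparse

    def resolve_saved_url(requested_url, canonical_url=None):
        # Hypothetical workaround: ignore the advertised canonical URL for
        # github.com pages so deep links to blobs aren't collapsed to the repo root.
        host = urlparse(requested_url).netloc.lower()
        if canonical_url and (host == "github.com" or host.endswith(".github.com")):
            return requested_url
        return canonical_url or requested_url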