I wish someone would make a good open-source library for this complicated and annoying but valuable task.
We did some work in 2014 on an archive of all stories posted to HN, with the goals of (a) having a lightweight, readable version of everything quickly available and (b) doing analytics on the content. But we got bogged down on getting the actual content programmatically across the full spectrum of cases. This is one of those problems where not merely one devil is in the details but a whole legion of them, and not the glamorous kind. Getting it right would have sucked up all our resources, and the APIs out there (e.g. Readability) came with problems too, so we dropped the project.
But for a programmer who enjoys the snake-pit-of-corner-cases type of challenge, this would make a fine project, one with real public-service potential. We can't work on it ourselves, but we'd consider funding it.
Completely agree. A friend and I tried to do something like this as a fun project at a hackathon; getting to 80% wasn't difficult, just a lot of parsing the DOM for articles. Dealing with things like adverts, photo captions, comments, and other text that shouldn't be in the actual article was the real pain -- especially when we wanted to detect paragraph/subheader breaks, since we were parsing articles for text-to-speech.
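Not their code, just an illustration of that "easy 80%": a minimal sketch assuming BeautifulSoup that strips tags which are usually boilerplate and keeps whichever container holds the most paragraph text. The tag list and scoring rule are my assumptions; the ads, captions, and comment threads that still slip through are exactly the corner cases described above.

    from bs4 import BeautifulSoup

    # Tags assumed (not guaranteed) to be non-article boilerplate.
    STRIP = ["script", "style", "nav", "aside", "header", "footer",
             "figcaption", "form"]

    def guess_article(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(STRIP):
            tag.decompose()  # drop obvious boilerplate wholesale
        best, best_score = None, 0
        # Score candidate containers by the prose length of their direct <p> children.
        for container in soup.find_all(["article", "main", "section", "div", "body"]):
            score = sum(len(p.get_text(strip=True))
                        for p in container.find_all("p", recursive=False))
            if score > best_score:
                best, best_score = container, score
        if best is None:
            return soup.get_text(" ", strip=True)
        return "\n\n".join(p.get_text(" ", strip=True)
                           for p in best.find_all("p", recursive=False))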
Good point: the constant (constant!) maintenance aspect means there would need to be a sustainable plan. On the other hand, if lots of projects started depending on the library, you'd at least get a steady supply of notifications about breakage, and perhaps fixes as well.
The challenge is that there are often several open-source variants of content extraction, but without continual commercial support, they often languish after serving the needs of their original creators.
Goose[0], for example, was a project I last explored for article summarization.
No, but if someone from the project contacts us (hn@ycombinator.com) we'll invite them to do a Show HN. Failing that, I'll send a repost invite to the user who first posted the project here (https://news.ycombinator.com/from?site=wallabag.org).
Agreed. It would be nice if they had an API for Instaparser. First thing I did was CTRL+F api and check their API docs for any updates. I tried using Readability on all of PG's essays and I was surprised by how poor the results were.
I would imagine the key to solving this problem effectively is less about parsing the DOM and more about recognizing what content on a page is valuable and relevant.
I could see an algorithm merely trying to find the longest prose on a page, then analyzing it for content and then trying to find other content within other elements that are likely relevant. Armed with this and multiple articles from the same source, you should be able to correlate the two to get a smarter parser that can adapt to changing DOM on each site.
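A hypothetical sketch of that idea, assuming BeautifulSoup: for each sample page from a site, find where the longest prose lives, express that location as a tag/class path, and let a majority vote across pages become the site's learned selector. The path format and voting rule are assumptions for illustration, not any real parser's internals.

    from collections import Counter
    from bs4 import BeautifulSoup

    def dom_path(tag):
        # Describe a node's position as a chain of tag.class names.
        parts = []
        while tag is not None and tag.name != "[document]":
            cls = ".".join(sorted(tag.get("class", [])))
            parts.append(tag.name + ("." + cls if cls else ""))
            tag = tag.parent
        return " > ".join(reversed(parts))

    def longest_prose_path(html):
        soup = BeautifulSoup(html, "html.parser")
        paras = soup.find_all("p")
        if not paras:
            return ""
        longest = max(paras, key=lambda p: len(p.get_text(strip=True)))
        return dom_path(longest.parent)

    def learn_site_selector(html_pages):
        # The path that wins on most sample pages is our guess for this site.
        votes = Counter(longest_prose_path(h) for h in html_pages if h)
        return votes.most_common(1)[0][0] if votes else ""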
I realize there are business reasons for not sharing the full details of the parser (or open sourcing it entirely), but it would still be interesting to hear more details about the actual components/architecture/techniques being used.
I would assume this is built on some sort of headless browser implementation but who knows, maybe not. Hopefully Instapaper does a followup with more technical details.
A very simple thing that I've been reporting to Instapaper multiple times over the past year is fixing the GitHub support. I often save pages from GitHub to Instapaper (such as https://github.com/docker/docker/blob/master/docs/userguide/...), but it only saves the repository root link, so I usually end up losing the original link entirely, which is pretty annoying.
I'm not sure how this article is supposed to make a paying user happy; it doesn't show any measurable metrics (not that any user should actually care about that -- it should just work). I still wonder how hard adding a simple "if github then" check would be.
MarkItDown is a "toy" implementation of a rich-text-to-Markdown converter that I wrote 3 years ago and that still enjoys significant usage. Maybe it could be used as a start for a more full-featured parser.
I am not talking about a content parsing problem. Because of some stupid canonical URL header GitHub serves, Instapaper keeps saving the URL with the path redacted down to the repo level, and they haven't fixed it in all this time.
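I don't know Instapaper's pipeline, but the "if github then" check being asked for is roughly this shape (a guess, not their code): when the saved page's host is github.com, keep the URL the user actually submitted instead of trusting the page's canonical URL, which (per the parent comment) points at the repo root.

    from urllib.parse import urlparse

    def resolve_saved_url(requested_url, canonical_url=None):
        # Hypothetical workaround: ignore the advertised canonical URL for
        # github.com pages so deep links to blobs aren't collapsed to the repo root.
        host = urlparse(requested_url).netloc.lower()
        if canonical_url and (host == "github.com" or host.endswith(".github.com")):
            return requested_url
        return canonical_url or requested_url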