I’ve been looking for a Python library to replace boilerpipe (Java) and stumbled across this one the other week. Does anyone have experience comparing the results of the two?
Newspaper for the most part "just works". When it doesn't, it does little beyond handing you the page source so you can munge it and try again. (It also seems to work well for non-English languages, though I've only tried it against Spanish.)
Boilerpipe provides a lot more knobs for tweaking the output, and many more extractors than the basics that Newspaper provides.
Newspaper will give you the basics: article content, titles, some metadata.
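For what it's worth, here is a minimal sketch of pulling out those basics. It assumes the newspaper3k package (imported as `newspaper`) and its `Article` class; `extract_basics` and `summarize` are just names I made up for illustration, and the 200-character preview is arbitrary.

```python
def summarize(article):
    """Collect the basic fields from a parsed article-like object."""
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text_preview": article.text[:200],  # arbitrary preview length
    }


def extract_basics(url, language="en"):
    """Download and parse one URL with Newspaper (needs network access)."""
    # Imported lazily so summarize() works without the package installed.
    # Assumption: newspaper3k is installed and exposes Article as documented.
    from newspaper import Article

    article = Article(url, language=language)
    article.download()  # fetch the HTML
    article.parse()     # extract title, text, authors, publish date
    return summarize(article)
```

For a Spanish-language page you would call something like `extract_basics(url, language="es")`, where `url` is whatever article you want to try.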
Boilerpipe lets you make a turn-key solution for particular sets of sites, which can be more or less helpful depending on where you want to use it.
Lastly... Newspaper isn't the fastest thing in the world. Its article extraction tends to be accurate, but it also tends to be slow.
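If speed matters to you, it's probably worth measuring on your own pages rather than taking my word for it. A rough sketch using only the standard library: `time_extractor` and its parameters are hypothetical, and `extract` is whatever callable you write to wrap the library under test. Feeding it HTML you've already downloaded keeps network time out of the numbers.

```python
import time
from statistics import mean


def time_extractor(extract, pages, repeats=3):
    """Return the mean seconds per call of `extract` over HTML strings."""
    samples = []
    for html in pages:
        for _ in range(repeats):
            start = time.perf_counter()
            extract(html)  # result is discarded; only the timing matters here
            samples.append(time.perf_counter() - start)
    return mean(samples)
```

Running the same list of pages through a wrapper for each library gives you comparable per-page figures, at least for the kind of sites you actually care about.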
Thanks for posting this. I spent most of today kicking around the idea of an Instapaper-style webpage parser, and came to the conclusion that I probably didn't have time to do it well.