
> BeautifulSoup

> Features: Excellent HTML/XML parser, easy web scraping interface, flexible navigation and search.

It does not feature any parser. It’s basically a wrapper over lxml.

> lxml

> Features: Very fast XML and HTML parser.

It’s fast, but there are alternatives that are literally 5x faster.

This article is just another rewrite of a basic introduction. It’s not a guide, since it does not describe any of the issues that you face in practice.



Parsing HTML super-fast is very low on the list of priorities when web-scraping things. Yes, in practice.

Most of the time it won't even register on the scale, compared to the time spent sending/receiving requests and data.
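If you want to check this for your own workload, here is a rough sketch that times parsing alone, using only the standard library. The `TagCounter` class, the `time_parse` helper, and the synthetic page are made up for illustration; compare the number you get against the request round-trip times you measure yourself.

```python
import time
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags; a hypothetical stand-in for real extraction work."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def time_parse(html_text):
    """Return (number of start tags, seconds spent parsing)."""
    parser = TagCounter()
    start = time.perf_counter()
    parser.feed(html_text)
    parser.close()
    return parser.count, time.perf_counter() - start


# Synthetic page (an assumption, not a real site): 5000 repeated rows.
page = "<html><body>" + "<div><a href='/x'>item</a></div>" * 5000 + "</body></html>"
tags, seconds = time_parse(page)
print(f"{tags} start tags parsed in {seconds * 1000:.1f} ms")
```

The absolute figure depends on your machine and on the parser you pick, so treat it as a ballpark to set against network latency, not as a benchmark.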


Beautiful Soup comes with "html.parser" support, and by default it does not use or even install lxml.
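For context, "html.parser" here is the name of Python's standard-library parsing module, which is available without lxml installed. A minimal sketch of using that module directly, with no third-party packages (the `LinkCollector` class and `extract_links` helper are hypothetical names, not part of Beautiful Soup):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (illustrative helper only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = (
    '<p>See <a href="https://lxml.de/elementsoup.html">docs</a> '
    'and <A HREF="/local">here</A>.</p>'
)
print(extract_links(sample))
```

Beautiful Soup adds its tree navigation and search API on top of whichever parser it is given; this sketch only shows the parsing layer underneath.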


lxml is written in Cython and is very efficient in my tests. Much faster than BeautifulSoup, which is pure Python.

What alternatives are 5x faster?


I'm sorry but BeautifulSoup is not just a wrapper over lxml.

lxml even has a module for using BeautifulSoup's parser.

> lxml can make use of BeautifulSoup as a parser backend

https://lxml.de/elementsoup.html

> A very nice feature of BeautifulSoup is its excellent support for encoding detection which can provide better results for real-world HTML pages that do not (correctly) declare their encoding.



