you want something better than Scrapy? maybe I have Stockholm syndrome but I find it to be very well structured and testable, and it has solved every problem I've had running scrapers
In that both are trying to be crawling frameworks; and for sure Scrapy allows headed crawling via Splash, it's just not something I've needed or advocate for
Scrapy also has a long lineage of extensions, which Crawlee may gain as it grows in popularity, but I didn't see any obvious way of decoupling things in Crawlee (for example, plugging in a new storage engine: https://crawlee.dev/docs/guides/result-storage), whereas that delineation is very strong in Scrapy for all its moving parts
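To make the decoupling point concrete, here's a minimal sketch of a Scrapy item pipeline: storage is a plain class wired up via settings, so swapping backends never touches spider code. The class name, output path, and settings entry are made up for illustration.

```python
import json

class JsonLinesPipeline:
    """Write each scraped item as one JSON line; swap this class out
    in settings to change storage without touching any spider."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

# settings.py (hypothetical project name):
# ITEM_PIPELINES = {"myproject.pipelines.JsonLinesPipeline": 300}
```

Because the pipeline only sees items, you can also exercise it directly in a test without running a crawl.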
Also, Parsel (the selector library powering Scrapy) is A++, in that it allows expressing one's intent via XPath, CSS selectors, and regex matches in a fluent API; I'm sure this nodejs framework allows doing something similar because it seems to be all-in on the DOM, but it for sure will not be `response.xpath("//whatever").css("#some-id").re("firstName: (.+)")`
Further, as I mentioned -- and as someone pointed out elsewhere in this submission -- Scrapy is prepared to store requests to disk and makes testing spider methods super easy since they're very well defined callback methods. If you have the HTML from a prior run and need to reproduce a bad outcome, testing just the "def parse_details_page" is painless. It certainly may be possible to test Crawlee code, too, but I didn't see anything mentioned about it
I don't know anything about this other than the announcement here and on reddit, so you'll likely want to post your question as a top comment so Jan can see it, or open a GH issue so they can help you evaluate
I would really like this, but running in Python.