
Jan, thanks for the open approach to running the tech behind Apify!

The libraries look useful. One question that wasn't obvious from the docs: how do you manage, or suggest approaching, per-domain rate limiting? Ideally respecting Crawl-delay in robots.txt, or at least defaulting to some sane value. Most naive queue implementations make this challenging, and a queue per domain feels annoying to manage.



The ideal approach depends on your architecture. It's really easy and cheap to create new queues on the Apify platform (we create ~500k every day), so we usually run one crawler per domain. It performs best and is the easiest to set up.

At the Crawlee level, you can open a new queue with one line of code and name it after the hostname. The most straightforward solution is to run multiple crawler instances, each with its own queue, rate limit them using the options explained at https://crawlee.dev/docs/guides/scaling-crawlers, and push newly discovered URLs to the queue matching their hostname (rough sketch below).
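A minimal sketch of that setup in TypeScript. The crawlerForHost helper is illustrative, not a Crawlee API; the concurrency and rate-limit numbers are placeholder assumptions you'd tune per domain:

    import { CheerioCrawler, RequestQueue } from 'crawlee';

    // One crawler per hostname, each reading from its own named queue.
    async function crawlerForHost(hostname: string): Promise<CheerioCrawler> {
      // Opening a named queue is one line; dots are swapped for dashes
      // since queue names are restricted to alphanumerics and dashes.
      const queue = await RequestQueue.open(hostname.replace(/\./g, '-'));

      return new CheerioCrawler({
        requestQueue: queue,
        // Per-domain throttling via the scaling options from the guide above.
        maxConcurrency: 2,
        maxRequestsPerMinute: 60,
        async requestHandler({ $, request }) {
          for (const el of $('a[href]').toArray()) {
            const href = $(el).attr('href');
            if (!href) continue;
            let url: URL;
            try {
              url = new URL(href, request.loadedUrl ?? request.url);
            } catch {
              continue; // skip malformed links
            }
            // Route each discovered URL to the queue for its own hostname,
            // so every domain keeps its independent rate limit.
            const target = await RequestQueue.open(url.hostname.replace(/\./g, '-'));
            await target.addRequest({ url: url.href });
          }
        },
      });
    }

    // Usage: seed a domain's queue, then run one crawler instance per domain.
    const crawler = await crawlerForHost('example.com');
    await crawler.run(['https://example.com/']);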

If you'd like to discuss this in more depth, you can join our Discord or ask in GitHub Discussions. Both are linked from the Crawlee homepage.


Most mentions of Crawl-delay in robots.txt set a limit so slow that the website can't be fully crawled before the heat death of the Universe. That's why Google, Bing, and the other major crawlers ignore Crawl-delay.
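For a rough, illustrative sense of scale: Crawl-delay: 30 allows at most 86,400 / 30 = 2,880 requests per day, so a hypothetical site with 10 million pages would take about 3,500 days, nearly a decade, to crawl even once.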



