
Jan, thanks for the open approach to running the tech behind Apify!

The libraries look useful. One question that wasn't obvious from the docs: how do you manage, or suggest approaching, per-domain rate limiting? Ideally respecting Crawl-delay in robots.txt, or at least defaulting to some sane value. Most naive queue implementations make this challenging, and a queue per domain feels annoying to manage.



The ideal approach depends on your architecture. It's really easy and cheap to create new queues on the Apify platform (we create ~500k every day), so we usually run one crawler per domain. It performs best and is the easiest to set up.

At the Crawlee level, you can open a new queue with one line of code and name it after the hostname. The most straightforward solution is to run multiple crawler instances, each with its own queue, rate limit them using the options explained at https://crawlee.dev/docs/guides/scaling-crawlers, and push newly discovered URLs to the queue matching their hostname (rough sketch below).
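A minimal sketch of that setup in TypeScript. The crawlerForHost helper is illustrative, not a Crawlee API; the concurrency and rate-limit numbers are placeholder assumptions you'd tune per domain:

    import { CheerioCrawler, RequestQueue } from 'crawlee';

    // One crawler per hostname, each reading from its own named queue.
    async function crawlerForHost(hostname: string): Promise<CheerioCrawler> {
      // Opening a named queue is one line; dots are swapped for dashes
      // since queue names are restricted to alphanumerics and dashes.
      const queue = await RequestQueue.open(hostname.replace(/\./g, '-'));

      return new CheerioCrawler({
        requestQueue: queue,
        // Per-domain throttling via the scaling options from the guide above.
        maxConcurrency: 2,
        maxRequestsPerMinute: 60,
        async requestHandler({ $, request }) {
          for (const el of $('a[href]').toArray()) {
            const href = $(el).attr('href');
            if (!href) continue;
            let url: URL;
            try {
              url = new URL(href, request.loadedUrl ?? request.url);
            } catch {
              continue; // skip malformed links
            }
            // Route each discovered URL to the queue for its own hostname,
            // so every domain keeps its independent rate limit.
            const target = await RequestQueue.open(url.hostname.replace(/\./g, '-'));
            await target.addRequest({ url: url.href });
          }
        },
      });
    }

    // Usage: seed a domain's queue, then run one crawler instance per domain.
    const crawler = await crawlerForHost('example.com');
    await crawler.run(['https://example.com/']);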

If you'd like to discuss this in more depth, you can join our Discord or ask in GitHub Discussions. Both are linked from the Crawlee homepage.


Most mentions of Crawl-delay in robots.txt set a limit so slow that the website can't be fully crawled before the heat death of the Universe. That's why Google, Bing, and the other major crawlers ignore Crawl-delay.
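For a rough, illustrative sense of scale: Crawl-delay: 30 allows at most 86,400 / 30 = 2,880 requests per day, so a hypothetical site with 10 million pages would take about 3,500 days, nearly a decade, to crawl even once.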



