
This looks cool at first glance. I'll dig into it more.

One note that may be helpful: if all you care about is the HTML, it's better to take a "snapshot" of the page by streaming the response directly to blob storage such as S3. That way, if something fails and you need to retry, you can reference the saved raw data from storage instead of making another request and potentially getting blocked. Node's stream pipelines make it really easy to chain this together with other logic.
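A rough sketch of the snapshot idea, assuming Node's built-in `pipeline` from `node:stream/promises`. The helper names `saveSnapshot` and `loadSnapshot` are made up for illustration, and a local file stands in for blob storage so the example runs without any cloud credentials:

```javascript
const { pipeline } = require("node:stream/promises");
const fs = require("node:fs");

// Hypothetical helper: stream a response body straight to a snapshot
// location without buffering the whole page in memory.
// Assumption: a local file stands in for blob storage such as S3.
async function saveSnapshot(body, snapshotPath) {
  // pipeline() wires the streams together, propagates errors, and
  // resolves once the destination has finished writing.
  await pipeline(body, fs.createWriteStream(snapshotPath));
  return snapshotPath;
}

// Hypothetical helper: on a retry, read the saved raw HTML back
// instead of making another request to the site.
async function loadSnapshot(snapshotPath) {
  return fs.promises.readFile(snapshotPath, "utf8");
}
```

Here `body` is any Node readable stream. With a real blob store, the file write stream would be swapped for the streaming upload of whichever SDK is in use, and extra transform steps (compression, hashing) could be passed to `pipeline` between the source and the destination.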

For reference, I run a company that does large scale scraping / data aggregation.



Yeah I agree, keeping the source HTML is great for debugging or retro-fixing issues. We also like to take screenshots on important errors, when running headless.



