Status codes: I display the list because on a JavaScript-driven application you mostly don't want codes other than 200 (besides media).
I thought about robots.txt, but since this is software you are supposed to run against your own website, I didn't consider it worthwhile. You have a point on speed requirements and prohibited resources (but it's not like skipping over them adds any security).
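If someone did want to respect robots.txt, a minimal sketch using Python's standard urllib.robotparser could look like the following (the is_allowed helper and its caching are purely illustrative, not something the crawler currently does):

    from urllib import robotparser
    from urllib.parse import urlparse

    _parsers = {}  # one cached parser per site root

    def is_allowed(url, user_agent="*"):
        """True if the site's robots.txt permits fetching this URL."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if root not in _parsers:
            rp = robotparser.RobotFileParser(root + "/robots.txt")
            rp.read()  # downloads and parses robots.txt
            _parsers[root] = rp
        return _parsers[root].can_fetch(user_agent, url)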
I haven't put much time/effort into an update step. Currently it resumes via checkpoints if the process exits (it saves the current state every 250 URLs; if any URLs are still missing it can continue from there, otherwise it is done).
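Roughly, the checkpoint idea can be sketched like this (the file name, constants, and function names below are illustrative, not the actual code):

    import json

    CHECKPOINT_EVERY = 250          # save state every 250 processed URLs
    STATE_FILE = "checkpoint.json"  # illustrative file name

    def save_checkpoint(done, pending):
        with open(STATE_FILE, "w") as f:
            json.dump({"done": sorted(done), "pending": sorted(pending)}, f)

    def load_checkpoint():
        try:
            with open(STATE_FILE) as f:
                state = json.load(f)
            return set(state["done"]), set(state["pending"])
        except FileNotFoundError:
            return set(), set()

    def crawl(start_urls, fetch):
        done, pending = load_checkpoint()
        if not done and not pending:
            pending = set(start_urls)
        while pending:                      # URLs still missing -> keep going
            url = pending.pop()
            done.add(url)
            for link in fetch(url):         # fetch() returns newly found links
                if link not in done:
                    pending.add(link)
            if len(done) % CHECKPOINT_EVERY == 0:
                save_checkpoint(done, pending)
        save_checkpoint(done, pending)      # nothing pending -> we are done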
Currently I use only one thread for scraping; I don't need more, and it gets the job done. Also, I know too little to play around more with Python's Celery.
My project can be used for various things, depending on your needs. Recently I have been playing with using it as a 'search engine': I am scraping the Internet to find cool stuff. The results are in https://github.com/rumca-js/Internet-Places-Database. Not all domains are interesting, though.
> Status codes: I display the list because on a JavaScript-driven application you mostly don't want codes other than 200 (besides media).
What? Why? Regardless of the programming language used to generate the content, the standard, well-known HTTP status codes should be returned as expected. If your JS-served site gives me a 200 code when it should be a 404, you're wrong.
I think you are misunderstanding: your application is expected to return mostly 200 codes. If you get a 404, then a link is broken or a page is misbehaving, which is exactly why that page's URL is displayed on the console with a warning.
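As a sketch of what I mean (illustrative code assuming a requests-based fetcher, not the actual implementation):

    import logging
    import requests

    log = logging.getLogger("crawler")
    logging.basicConfig(level=logging.INFO)

    def check_page(url):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:   # broken link or misbehaving page
            log.warning("%s returned %s", url, resp.status_code)
        return resp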
Thanks, btw what's your project!? Share!