binux's comments

binux · on Nov 18, 2014

https://gist.github.com/binux/67b276c51e988f8e2c31

binux · on Nov 18, 2014

I'm working on a benchmarking suite https://gist.github.com/binux/67b276c51e988f8e2c31 and have run into some problems...

pyspider comes from a vertical search engine project. We have two issues:

- 100+ websites: they may change their templates or go down at any time. We need a dashboard to monitor the changes and the failures.

- Updates within 5 minutes: when a website is updated, we need to pick that up within 5 minutes. We use the update time from the index (list) page to tell which pages have changed. Pages should also be re-crawled after about 30 days in case we missed something. A powerful scheduler is needed.
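The re-crawl rule in the second point can be sketched roughly like this. It is only an illustration of the idea, not pyspider's actual scheduler code; `needs_refetch`, its parameters, and `MAX_AGE` are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumption: pages are re-crawled after about 30 days even when the index
# page reports no change, in case an update was missed.
MAX_AGE = timedelta(days=30)


def needs_refetch(last_crawled, index_update_time, now, max_age=MAX_AGE):
    """Return True if a page should be fetched again.

    last_crawled: when the page was last fetched (None if never fetched).
    index_update_time: update time shown for the page on the index (list) page.
    now: the current time.
    """
    if last_crawled is None:
        return True  # never fetched before
    if index_update_time is not None and index_update_time > last_crawled:
        return True  # the index page reports a change since our last crawl
    return now - last_crawled >= max_age  # periodic safety re-crawl


# Example: a page crawled yesterday whose index entry changed an hour ago.
now = datetime(2014, 11, 18, 12, 0)
changed = needs_refetch(now - timedelta(days=1), now - timedelta(hours=1), now)
```

A real scheduler would also have to poll the index pages often enough to meet the 5-minute target; that part is not shown here.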

Obviously, I haven't found the right way to do this with scrapy. I'm not very familiar with scrapy, so I can't name something that pyspider can do but scrapy can't.

binux · on Nov 17, 2014

sorry :(

binux · on Nov 17, 2014

To make it more flexible and easy to reuse? I have implemented most features I need now.

_bitliner · on Nov 17, 2014

Because I already have a powerful distributed architecture. I was curious about the architecture of pyspider.

For example, how is the queue handled? Is it centralized? Is there a server managing it?

binux · on Nov 17, 2014

the architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes, the queue is centralized; it lives in the scheduler. It's designed to handle about 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected with rabbitmq (as one option). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
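As a rough picture of that layout, here is a minimal in-process sketch: one scheduler feeding several fetchers, which feed several processors. It is not pyspider code. Python's `queue.Queue` stands in for the message queue, the "fetch" is faked, and `run_pipeline` and its parameters are hypothetical names.

```python
import queue
import threading


def run_pipeline(urls, num_fetchers=2, num_processors=2):
    """One scheduler, many fetchers, many processors, joined by queues."""
    scheduler_to_fetcher = queue.Queue()
    fetcher_to_processor = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def fetcher():
        while True:
            url = scheduler_to_fetcher.get()
            if url is None:  # sentinel: no more work
                break
            # A real fetcher would download the page here; we fake a body.
            body = "<html>fake body for %s</html>" % url
            fetcher_to_processor.put((url, body))

    def processor():
        while True:
            item = fetcher_to_processor.get()
            if item is None:  # sentinel: no more work
                break
            url, body = item
            with results_lock:
                results.append((url, len(body)))

    fetchers = [threading.Thread(target=fetcher) for _ in range(num_fetchers)]
    processors = [threading.Thread(target=processor) for _ in range(num_processors)]
    for worker in fetchers + processors:
        worker.start()

    # The single scheduler: the only place that decides what gets fetched.
    for url in urls:
        scheduler_to_fetcher.put(url)

    # Shut down: one sentinel per fetcher, then one per processor.
    for _ in fetchers:
        scheduler_to_fetcher.put(None)
    for worker in fetchers:
        worker.join()
    for _ in processors:
        fetcher_to_processor.put(None)
    for worker in processors:
        worker.join()
    return sorted(results)
```

Calling `run_pipeline(["http://a.example/", "http://b.example/"])` returns one `(url, body_length)` pair per URL. The point of the single scheduler is that the decision about what to fetch is made in one place, while the fetching and processing can be spread over as many workers as needed.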

maratc · on Nov 17, 2014

Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages, that is done later.

binux · on Nov 17, 2014

The fetcher already fits your needs...

maratc · on Nov 17, 2014

You are running

   phantomjs phantomjs_fetcher.js

and using it as proxy? The setup instructions are a bit unclear on this.

binux · on Nov 17, 2014

I wanted to make it an HTTP proxy in the beginning, but I found that hard to do. Now I POST everything to it, but I haven't changed the name.

But it works like a proxy: any request with `fetch_type == 'js'` is fetched through phantomjs, and the response is passed back to tornado_fetcher.
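A minimal sketch of that routing, assuming a task is a plain dict with an optional `fetch_type` key. The function names and the task shape here are hypothetical, not pyspider's real API.

```python
def dispatch_fetch(task, http_fetch, phantomjs_fetch):
    """Send 'js' tasks through the phantomjs path, everything else over HTTP.

    http_fetch and phantomjs_fetch are callables supplied by the caller;
    both take the task and return a response.
    """
    if task.get("fetch_type") == "js":
        return phantomjs_fetch(task)
    return http_fetch(task)


# Stand-in fetchers for illustration only; they do no network I/O.
def fake_http_fetch(task):
    return {"url": task["url"], "via": "http"}


def fake_phantomjs_fetch(task):
    return {"url": task["url"], "via": "phantomjs"}


response = dispatch_fetch(
    {"url": "http://example.com/", "fetch_type": "js"},
    fake_http_fetch,
    fake_phantomjs_fetch,
)
```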

binux · on Nov 17, 2014

http://demo.pyspider.org/debug/js_test_sciencedirect is a sample for this.

There is a phantomjs fetcher that can render the page the way WebKit does. Furthermore, you can run some JavaScript before or after the page loads, for example to simulate a mouse click.

pknerd · on Nov 17, 2014

But will it not be slow? Assuming downloading css/images etc?

binux · on Nov 17, 2014

Images are not downloaded by default. Both the fetcher and the phantomjs proxy are fully async.

binux · on Nov 17, 2014

Yes, the scheduler, fetcher, and processor are standalone here; they run in different processes. But they share some common libs. I haven't decided how to put them into a single package and run them together.

Any advice or project that I can refer to?

binux · on Nov 17, 2014

agree

redacted · on Nov 17, 2014

What is the recommended way? (Serious question, I have larger projects that I would someday like to refactor into proper packages)

iamtew · on Nov 17, 2014

There are also these guides that provide plenty of information on how packages work and on best practices:

https://packaging.python.org/en/latest/distributing.html

https://github.com/pypa/sampleproject

binux · on Nov 17, 2014

I have "organize the code using a single top-level package".

ngoldbaum · on Nov 17, 2014

Because the name "libs" is now installed into the global module namespace. It's better to use a less generic name.

binux · on Nov 17, 2014

Currently, yes and no.

pyspider runs the original Python code, while something like portia is a code generator (apologies if I'm wrong, I have not used it). So it could be built as another WebUI module.

But as for flexibility, I currently have no idea how to get it right. So we have a CSS selector helper, but no plan for a complete tool.

prht · on Nov 17, 2014

I am not trying to offend you, but I really don't understand when someone says "yes and no". I hear it more and more these days. Is this becoming a cliche? It can be "yes" or "no", not both together. "yes and no" is "no" for me.

smoe · on Nov 17, 2014

Don't know about other languages, but in German this phrase is pretty common when there is no clear yes or no answer. Like "yes to some extent, but not completely".