Hi everyone. Library author here. I worked on this a bunch back in maybe 2009-2010. It's inspired by the work I did with tectonic on selectorgadget.
My current thinking on the idea is reflected in https://github.com/fizx/pquery. PQuery addresses some weaknesses of Parsley by embedding the ideas in JavaScript.
(1) Parsley isn't Turing-complete, and many web pages are ugly, so you often have to resort to pre/post-processing in some scripting language. I was never able to get sufficient power out of a purely declarative language.
(2) JavaScript environments are readily available (even in embedded form), and are more accessible than C.
(3) If your crawler already executes JavaScript to render dynamic pages, then running more JavaScript in that environment is pretty easy.
I guess I'm a little late to the thread (yay weekends) but I'll answer any questions people may have.
I was interested in "kangaroo"; Wikipedia suggests a kangaroo word should, in addition to containing the letters of its smaller word in the same order, also have the same meaning (e.g. masculine -> male), so "parsley" isn't quite a kangaroo for "parse", apparently. It'd be a moderately fun little challenge to write a program to find some kangaroos, though Wiktionary (predictably) has a nice list already here: http://en.wiktionary.org/wiki/Appendix:Kangaroo_words
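For what it's worth, the letter-order half of that check is only a few lines. Here's a rough Python sketch; the function names and the tiny word list are made up for illustration, and it does nothing about the "same meaning" requirement, which would need a thesaurus or a human.

```python
def is_subsequence(small: str, big: str) -> bool:
    """Return True if the letters of `small` appear in `big` in the same order."""
    letters = iter(big.lower())
    # `ch in letters` consumes the iterator up to the match,
    # so each later letter must be found after the previous one.
    return all(ch in letters for ch in small.lower())


def candidate_joeys(kangaroo: str, words: list[str]) -> list[str]:
    """Return shorter words whose letters appear, in order, inside `kangaroo`.

    These are only candidates: the meaning check is not attempted here.
    """
    return [
        w for w in words
        if len(w) < len(kangaroo) and is_subsequence(w, kangaroo)
    ]


if __name__ == "__main__":
    # Hypothetical mini word list; a real run would load a dictionary file.
    print(candidate_joeys("parsley", ["parse", "pale", "play", "yes", "parsley"]))
```

Feeding it a real dictionary would give you the raw candidates, and the synonym filtering is where the actual work would be.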
"Parsley" is indeed a great name. I'll contribute another Parsley - a Flex framework of yore also bore that name.
I'm afraid there's https://github.com/darius/parson already too. (I went through the same process to name it -- there's not much left in that well, is there?)
First commit on fizx/parsley was in 2008. The library was open-sourced in 2009. This predates all of the name conflicts so far mentioned, though no doubt someone used the name before in some capacity.
Looks pretty awesome, esp the clean DSL for your page model, but it seems like most of the documentation might be missing. How sophisticated is the crawler portion? Does it support Nutch-style generators that crawl more frequently updated pages more frequently? Or is it more designed for focused, one-off crawls a la Scrapy?
Looks pretty neat, +1 for concept & implementation. When I get time I'll be trying it out.
I'd also like to give some sort of -1 for the recycled library name, though it's not a technical nitpick, just a personal one. The name of the library is mostly dominating the discussion here at the moment, and that's a shame.
Btw, I had this idea of a web query language, but it never went anywhere:
https://gist.github.com/keyle/10951106
Have a look and let me know your thoughts if any.