Parsley: a simple language for extracting structured data from web pages (github.com/fizx)
149 points by pmoriarty on Jan 11, 2015 | 24 comments



Hi everyone. Library author here. I worked on this a bunch back in maybe 2009-2010. It's inspired by the work I did with tectonic on selectorgadget.

So my current thinking on the idea is reflected in https://github.com/fizx/pquery. PQuery addresses some weaknesses of Parsley by embedding the ideas in JavaScript.

(1) Parsley isn't Turing-complete, and many web pages are ugly, so you often have to resort to pre/post-processing in some scripting language. I was never able to get sufficient power out of a purely declarative language.

(2) JavaScript environments are readily available (even in embedded form), and are more accessible than C.

(3) If your crawler already executes JavaScript to render dynamic pages, then running more JavaScript in that environment is pretty easy.

I guess I'm a little late to the thread (yay weekends) but I'll answer any questions people may have.


Those were fun times :)

Ruby bindings: https://github.com/fizx/parsley-ruby

Python bindings: https://github.com/fizx/pyparsley


Not to be confused with Christophe Grand's Clojure parser library:

https://github.com/cgrand/parsley


(also replying to siblings)

Parsley is just such a good name. It has the word parse as a kangaroo, and it evokes the image of something fresh, green, and edible.

I just ran

    grep /usr/share/dict/words -e "^pars[^']*[^s]$"
to find the following list of words beginning with pars

    parse
    parsec
    parsed
    parser
    parsimony
    parsing
    parsley
    parsnip
    parson
    parsonage
I think I might call my next parsing-oriented tool either parson or parsnip. :)
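For anyone without grep handy, here is a minimal Python sketch of the same filter. It reuses the regular expression from the command above; the `sample` list and the `pars_words` helper are made up for illustration, and on a real system you would read the lines of /usr/share/dict/words instead.

```python
import re

# Same pattern as the grep command above: starts with "pars", contains no
# apostrophe, and does not end in "s".
PATTERN = re.compile(r"^pars[^']*[^s]$")


def pars_words(words):
    """Return the words that match PATTERN, in input order."""
    return [w for w in words if PATTERN.search(w)]


# Hypothetical sample standing in for the system dictionary.
sample = ["parse", "parser", "parses", "parsley", "parson's", "sparse", "pars"]
print(pars_words(sample))
```

With this sample it prints `['parse', 'parser', 'parsley']`: plurals, possessives, the bare prefix "pars", and words that merely contain "pars" are all dropped.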


I was interested in "kangaroo"; Wikipedia suggests a kangaroo word should, in addition to containing the letters of its joey in the same order, also have the same meaning (e.g. masculine -> male), so "parse" isn't quite a kangaroo for "parsley", apparently. It'd be a moderately fun little challenge to write a little program to find some kangaroos - though Wiktionary (predictably) has a nice list already here: http://en.wiktionary.org/wiki/Appendix:Kangaroo_words

"Parsley" is indeed a great name. I'll contribute another Parsley - a Flex framework of yore also bore that name.
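A rough sketch of that little program, in Python. It only checks the spelling half of the definition (the shorter word's letters appear inside the parent in the same order); whether the two words are synonyms still has to be judged by hand or with a thesaurus. The function names and the word list are invented for the example.

```python
def is_subsequence(small, big):
    """True if the letters of `small` appear in `big` in the same order."""
    remaining = iter(big)
    # `ch in remaining` consumes the iterator up to the first match, so each
    # letter must be found after the previous one.
    return all(ch in remaining for ch in small)


def kangaroo_candidates(parent, words):
    """Words shorter than `parent` whose letters appear in order inside it.

    Spelling check only: a true kangaroo pair must also share a meaning.
    """
    return [w for w in words if len(w) < len(parent) and is_subsequence(w, parent)]


print(kangaroo_candidates("masculine", ["male", "mule", "lame", "masculine"]))
```

This prints `['male', 'mule']`, which shows why the meaning check matters: "mule" passes the spelling test but is no synonym for "masculine". By the same spelling test, "parse" does sit inside "parsley".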


The reason I thought it was a kangaroo was specifically because it was being used as a name for a parsing library.

After a library is named "Parsley", the list of meanings for Parsley now includes "A parsing library", and so I see it as a kangaroo for Parse.


> grep /usr/share/dict/words -e "^pars[^']<star>[^s]$"

(I changed a literal asterisk to <star> to avoid formatting.) You get a few more options if you remove the anchor at the beginning:

    grep /usr/share/dict/words -e "pars[^']<star>[^s]$"
For example, 'sparse' seems like a good name for a lightweight parser library.


I'm afraid there's https://github.com/darius/parson already too. (I went through the same process to name it -- there's not much left in that well, is there?)


Oops, looks like I was beaten by cgrand himself. https://github.com/cgrand/parsnip



Or Parsley: a pattern-matching language based on OMeta and Python

https://github.com/python-parsley/parsley


Or a PEG grammar parser written in Python:

https://pypi.python.org/pypi/Parsley


First commit on fizx/parsley was in 2008. The library was open-sourced in 2009. This predates all of the name conflicts so far mentioned, though no doubt someone used the name before in some capacity.


Why not start with XPath and CSS selectors and pre-/post-process in JS as needed?
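To make the suggestion concrete, here is a minimal sketch of "selectors plus post-processing", written in Python rather than JS and using only the standard library. The page fragment, the `SPEC` layout, and the `extract` helper are all hypothetical; this is not Parsley's or PQuery's API, and ElementTree supports only a small subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real-world HTML usually needs a
# forgiving parser before selectors like these can be applied.
PAGE = """
<html><body>
  <h1> Parsley </h1>
  <ul>
    <li class="score">149 points</li>
    <li class="count">24 comments</li>
  </ul>
</body></html>
"""

# Declarative part: field name -> (XPath expression, post-processing function).
SPEC = {
    "title": (".//h1", str.strip),
    "points": (".//li[@class='score']", lambda s: int(s.split()[0])),
    "comments": (".//li[@class='count']", lambda s: int(s.split()[0])),
}


def extract(root, spec):
    """Apply each selector, then clean the matched text with its function."""
    result = {}
    for field, (xpath, post) in spec.items():
        node = root.find(xpath)
        # Fields with no matching node (or no text) come back as None.
        result[field] = post(node.text) if node is not None and node.text else None
    return result


root = ET.fromstring(PAGE)
print(extract(root, SPEC))
```

The selectors stay declarative while anything messy (trimming, number parsing) lives in ordinary functions, which is roughly the split being proposed.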


Looks pretty awesome, esp the clean DSL for your page model, but it seems like most of the documentation might be missing. How sophisticated is the crawler portion? Does it support Nutch-style generators that crawl more frequently updated pages more frequently? Or is it more designed for focused, one-off crawls a la Scrapy?


The crawler portion is about as sophisticated as `wget -r`.


Shameless plug... https://github.com/keyle/json-anything did this a while back


Looks pretty neat, +1 for concept & implementation. When I get time I'll be trying it out.

I'd also like to give some sort of -1 for the recycled library name, though it's not a technical nitpick, just a personal one. The name of the library is mostly dominating the discussion here at the moment, and that's a shame.


Btw I had this idea of a web query language, but it never went anywhere: https://gist.github.com/keyle/10951106 Have a look and let me know your thoughts, if any.


Looks a lot like using the LINQ to XML provider to query an HTML page.


Using https://github.com/MindTouch/SGMLReader you can use LINQ to XML or anything else that accepts an XmlReader in .NET.


Interesting straight duh idea (as in duh why didn't I think of that). Would use it. Wonder how it handles looping.

Note that 3/4 of the links on the main page point to wiki pages that don't exist yet. Looking forward to them, or to just writing it for myself in Go :)


Last commit was in 2013. Is it done, or is it just no longer maintained?


I don't know. Perhaps a little of each. See my other comments here.



