Parsley: a simple language for extracting structured data from web pages

fizx · on Jan 12, 2015

Hi everyone. Library author here. I worked on this a bunch back in maybe 2009-2010. It's inspired by the work I did with tectonic on selectorgadget.

So my current thinking on the idea is reflected in https://github.com/fizx/pquery. PQuery addresses some weaknesses of Parsley by embedding the ideas in JavaScript.

(1) Parsley isn't Turing-complete, and many web pages are ugly, so you often have to resort to pre/post-processing in some scripting language. I never was able to get sufficient power out of a purely declarative language.

(2) JavaScript environments are readily available (even in embedded form), and are more accessible than C.

(3) If your crawler already executes JavaScript to render dynamic pages, then running more JavaScript in that environment is pretty easy.

I guess I'm a little late to the thread (yay weekends) but I'll answer any questions people may have.

tectonic · on Jan 12, 2015

Those were fun times :)

Ruby bindings: https://github.com/fizx/parsley-ruby

Python bindings: https://github.com/fizx/pyparsley

tbatchelli · on Jan 12, 2015

Not to be confused with Christophe Grand's Clojure parser library:

https://github.com/cgrand/parsley

nvader · on Jan 12, 2015

(also replying to siblings)

Parsley is just such a good name. It has the word parse as a kangaroo, and it evokes the image of something fresh, green, and edible.

I just ran

    grep /usr/share/dict/words -e "^pars[^']*[^s]$"

to find the following list of words beginning with "pars":

    parse
    parsec
    parsed
    parser
    parsimony
    parsing
    parsley
    parsnip
    parson
    parsonage

I think I might call my next parsing-oriented tool either parson or parsnip. :)

bshimmin · on Jan 12, 2015

I was interested in "kangaroo"; Wikipedia suggests a kangaroo word should, in addition to having the same letters in the same order as its parent, also have the same meaning (e.g. masculine -> male), so "parse" isn't quite a kangaroo for "parsley", apparently. It'd be a moderately fun little challenge to write a program to find some kangaroos, though Wiktionary (predictably) has a nice list already here: http://en.wiktionary.org/wiki/Appendix:Kangaroo_words
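For what it's worth, the letter-order half of that challenge is only a few lines. Here is a rough Python sketch; the names `is_subsequence` and `find_joeys` are made up for illustration, and it only checks that the letters appear in order. It cannot check the "same meaning" condition, so its output is a list of candidates at best.

```python
def is_subsequence(child: str, parent: str) -> bool:
    """Return True if every letter of child appears in parent, in order."""
    remaining = iter(parent)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later letters can only match later positions in parent.
    return all(ch in remaining for ch in child)


def find_joeys(parent: str, words: list[str]) -> list[str]:
    """Return candidate kangaroo words hidden inside parent.

    Only the letter-order condition is checked; whether a candidate
    shares the parent's meaning still needs a human (or a thesaurus).
    """
    return [w for w in words if w != parent and is_subsequence(w, parent)]


if __name__ == "__main__":
    print(is_subsequence("male", "masculine"))
    print(find_joeys("parsley", ["parse", "parsnip", "pay", "parsley"]))
```

Feeding it a word list such as /usr/share/dict/words would be the obvious next step, though the result would need filtering by hand for meaning.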

"Parsley" is indeed a great name. I'll contribute another Parsley - a Flex framework of yore also bore that name.

nvader · on Jan 12, 2015

The reason I thought it was a kangaroo was specifically because it was being used as a name for a parsing library.

After a library is named "Parsley", the list of meanings for Parsley now includes "A parsing library", and so I see it as a kangaroo for Parse.

JadeNB · on Jan 12, 2015

> grep /usr/share/dict/words -e "^pars[^']<star>[^s]$"

(I changed a literal asterisk to <star> to avoid formatting.) You get a few more options if you remove the anchor at the beginning:

    grep /usr/share/dict/words -e "pars[^']<star>[^s]$"

For example, 'sparse' seems like a good name for a lightweight parser library.
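A small Python sketch of the same comparison, in case anyone wants to play with it without grep. The word list here is a hand-picked sample rather than the real dictionary file, and the names `ANCHORED`, `UNANCHORED`, and `matching` are just for illustration.

```python
import re

# Same patterns as the grep commands above, with the literal asterisk restored.
ANCHORED = re.compile(r"^pars[^']*[^s]$")
UNANCHORED = re.compile(r"pars[^']*[^s]$")


def matching(pattern: re.Pattern, words: list[str]) -> list[str]:
    """Return the words in which the pattern is found, preserving order."""
    return [w for w in words if pattern.search(w)]


if __name__ == "__main__":
    # A tiny illustrative sample, not /usr/share/dict/words.
    sample = ["parse", "parsers", "parse's", "parsley", "sparse", "sparsely"]
    print(matching(ANCHORED, sample))
    print(matching(UNANCHORED, sample))
```

Dropping the `^` lets "pars" start anywhere in the word, which is how 'sparse' gets in; plurals and possessives are still filtered out by the `[^']` and `[^s]$` parts.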

abecedarius · on Jan 12, 2015

I'm afraid there's https://github.com/darius/parson already too. (I went through the same process to name it -- there's not much left in that well, is there?)

nvader · on Jan 12, 2015

Oops, looks like I was beaten by cgrand himself. https://github.com/cgrand/parsnip

allr · on Jan 12, 2015

Or http://parsleyjs.org/

jshprentz · on Jan 12, 2015

Or Parsley: a pattern-matching language based on OMeta and Python

https://github.com/python-parsley/parsley

zo1 · on Jan 12, 2015

Or a PEG grammar parser written in Python:

https://pypi.python.org/pypi/Parsley

fizx · on Jan 12, 2015

First commit on fizx/parsley was in 2008. The library was open-sourced in 2009. This predates all of the name conflicts so far mentioned, though no doubt someone used the name before in some capacity.

xchaotic · on Jan 12, 2015

Why not start with XPath and CSS selectors and pre-/post-process in JS as needed?

divideby0 · on Jan 12, 2015

Looks pretty awesome, esp the clean DSL for your page model, but it seems like most of the documentation might be missing. How sophisticated is the crawler portion? Does it support Nutch-style generators that crawl more frequently updated pages more frequently? Or is it more designed for focused, one-off crawls a la Scrapy?

fizx · on Jan 12, 2015

The crawler portion is about as sophisticated as `wget -r`.

keyle · on Jan 12, 2015

Shameless plug... https://github.com/keyle/json-anything did this a while back

zo1 · on Jan 12, 2015

Looks pretty neat, +1 for concept & implementation. When I get time I'll be trying it out.

I'd also like to give some sort of -1 for the recycled library name, though it's not a technical nitpick, just a personal one. The name of the library is mostly dominating the discussion here at the moment, and that's a shame.

keyle · on Jan 12, 2015

Btw I had this idea of a web query language, but it never went anywhere: https://gist.github.com/keyle/10951106 Have a look and let me know your thoughts, if any.

dominotw · on Jan 12, 2015

Looks a lot like using the LINQ to XML provider to query an HTML page.

SigmundA · on Jan 12, 2015

Using https://github.com/MindTouch/SGMLReader you can use LINQ to XML or anything else that accepts an XmlReader interface in .NET.

grogenaut · on Jan 12, 2015

Interesting straight duh idea (as in duh why didn't I think of that). Would use it. Wonder how it handles looping.

Note that 3/4 of the links on the main page point to wiki pages that haven't been created yet. Looking forward to it, or just writing it for myself in Go :)

mkoryak · on Jan 12, 2015

Last commit was in 2013. Is it done, or is it just no longer maintained?

fizx · on Jan 12, 2015

I don't know. Perhaps a little of each. See my other comments here.