Parsing Techniques - A Practical Guide

andreasvc · on Feb 19, 2012

I found this reference interesting:

"Snelling, Michel. General Context-Free Parsing in Time $n^2$, in International Computing Symposium 1973, pages 19-24, Amsterdam, 1974. North-Holland. It is extremely unlikely that the claim in the title holds, but no refutation has been published to date; in fact, the paper does not seem to be referenced anywhere. The algorithm is described in reasonable detail, but not completely. Filling in the last details allows some examples in the paper to be reproduced, but not all at the same time, it seems. Attempts at implementations can be found in this zip file. Further clarification would be most welcome."

bo1024 · on Feb 19, 2012

link to the link: http://dickgrune.com/Books/PTAPG_2nd_Edition/LowAvailability...

sb · on Feb 20, 2012

There is also a second edition, which updates some chapters with much more recent resulst (AFAIR, the book is from 1992.) In addtion, the author (Dick Grune) also co-authored a book on compilers ("Modern Compiler Design"), which I like a lot as it has a sound treatment of non-imperative programming language concepts, too. (Plus I think the treatment of BURS plus usage for instruction selection is best described therein as well.)

drostie · on Feb 19, 2012

I'm going to upvote this on the general principle that I've seen a lot of programmers who don't quite know what parsing is, but reading a 310-page book is perhaps not the best way to familiarize them with it.

The basic thing that a programmer needs to know about parsing is that regular expressions usually aren't. The big problem with regular expressions is that they have real trouble handling nested modes. Just to give a modestly complicated example of what can happen from a simple grammar with nesting, imagine a grammar with just single-letter variables and parentheses, parsed into Python with tuples:

    a(bc)def
    => ('a', ('b', 'c'), 'd', 'e', 'f')

You might be able to get a regular expression engine to deal with that for you, but it will tend to fail on something slightly more complicated like this:

    a(b((cde)f)(g)h)ij

...because it will usually either be too conservative and start matching:

    (b((cde)

or else it will sometimes be too aggressive and start matching:

    ((cde)f)(g)

Perl is a big exception; Perl actually has a syntax to make regular expressions properly recursive.

A good practical example for you to think about is CSV parsing. In CSV, a field can sometimes be quoted so that it may contain commas, and two quoting-symbols inside that field is interpreted as a single quoting-symbol too. Thus to create a table containing a chat, we might have to write:

    Reginald,I think life needs more scare-quotes.
    Beartato,"That's nice, Reginald."
    Reginald,"Don't you mean ""nice""?"
    Beartato,Aah! Your scare-quotes scared me!

With a straightforward character-by-character parser this is pretty easy to parse; you check for the «,» character and append another record onto the row, check for the «\n» character to append this row to the table, and when you see «"» you enter a special string-parsing mode which does not exit until it sees an odd number of quote marks pass by, bordered by two non-quote marks. (The fact that you might have to "peek ahead" leads to interesting parse combinators which have to be able to succeed or fail without moving the cursor forward.)

At one of my jobs I actually spent some time outside of work making CSV parsing faster by using a dangerous hack: you can split on the quote character first, to have the native string code handle the basic division of the table; then the even fields contain CSV data without quoted text (or occasionally empty strings, which must be interpreted as commands to add the quote character and the following text to the preceding field), and the odd fields contain raw data to be entered into a new record. (I call it a dangerous hack because I remember the thing working at first, but eventually I believe three bugs came out on more complicated CSV files -- it was very hard to "see that it worked right" due to the extra overhead needed to move all of this character-by-character logic into the split() calls of my programming language.

silentbicycle · on Feb 19, 2012

Regular expressions are for lexing (breaking up input into tokens, AKA "tokenizing"), parsers are for assembling a stream of tokens into a parse tree (according to a grammar). You can kinda-sorta get away with parsing via regular expressions, but "surprisingly complex" regular expressions (http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html) should be a sign that you're using the wrong tool. De-coupling lexing from parsing also means that your grammar won't be cluttered by issues like whitespace handling.

Perl's RE extensions can be used to do some parsing, albeit at the cost of potentially exponential runtime (http://swtch.com/~rsc/regexp/regexp1.html * ), since they're NP-complete (http://perl.plover.com/NPC/NPC-3SAT.html). (Congrats, your parser is now a denial-of-service attack vector!)

* If you find this article interesting, you'll probably love _Parsing Techniques_.

> I'm going to upvote this on the general principle that I've seen a lot of programmers who don't quite know what parsing is, but reading a 310-page book is perhaps not the best way to familiarize them with it.

This book is a _reference_. For someone who wants to learn basic parsing techniques, I'd first direct them to something like Niklaus Wirth's _Compiler Construction_ (PDF: http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf), which has a quick and clear introduction to recursive descent parsing, but only briefly touches on other methods.

The Grune book is for when you need to understand the trade-offs between (say) different kinds of chart parsers, how parser generators work, classic performance optimizations in Earley parsers, or error handling strategies. It has quite a bit more depth to it, as well as an excellent bibliography. The second edition was updated in 2008, and adds a lot of new content.

spacemanaki · on Feb 19, 2012

Since you brought up lexing I'm going to ask a really noob-ish question: does this book (by Grune) deal with lexing at all? If it doesn't, is it because parsing is the more interesting/complicated of the two?

I've recently started getting into a few "traditional" compiler resources after reading a lot of Lisp centric compiler books (Lisp in Small Pieces... SICP, PAIP) where lexing and parsing are kind of handled by (read) so this is a whole new world for me.

silentbicycle · on Feb 19, 2012

Lexing is strictly simpler than parsing - lexing is done with state machines* , while many parsers can be viewed as state machines with stacks, AKA "pushdown automata" (http://en.wikipedia.org/wiki/Pushdown_automata). Lexers don't have stacks, which is why basic regular expressions can't do things like match arbitrarily nested parenthesis.

* Basically: make a byte-by-byte flowchart ("an int is a digit, followed by more digits, stop when it reaches a non-digit character and return the accumulated string as an int"), then convert that to code. Pretty simple. Graphviz (http://graphviz.org) makes pretty state machine diagrams.

Just knowing a little bit about automata / state machines will be enough to write a decent lexer. The Wirth PDF above explains it just a few pages (chapter 3). People often use tools like lex or ragel because writing them by hand can be rather tedious (being able to say e.g. "[+-]?[1-9][0-9]* { return LEXEME_INT(atoi(str)); }" rather than writing out the tests byte-by-byte), but they're not hard.

And yeah, the Lispers are able to sidestep parsing entirely.

barrkel · on Feb 20, 2012

Writing a state machine such that state transitions map to instruction pointer changes can be a viable technique. For example, using Pascal - handy because of sets - your example could be recognized thusly, with cp: PChar, or ^Char:

    start := cp;
    if cp^ in ['+', '-'] then Inc(cp);
    if cp^ in ['1'..'9'] then Inc(cp) else goto no_match;
    while cp^ in ['0'..'9'] do Inc(cp);
    SetString(str, start, cp - start);

I don't think this is tedious at all. The various components of a regular expression have analogues in code: * and + become loops, ? and | if statements, character classes become set membership tests (which is nice in Pascal). For most token definitions used in practical programming languages, writing lexers for them by hand is fairly trivial; and when things get more complicated, the freedom of having hand-coded the lexer usually gives you more tools to solve those complications.

Having a lexer coded as a loop driven by state transition tables can be useful in other scenarios, however; for example, if you only receive a little bit of input at a time, and you want to continuously lex it until a whole token has formed. There, you want to be able to suspend tokenizing at any point, and resume it later. Without coroutines, continuations or threads, this is more awkward when hand-coding.

teaspoon · on Feb 19, 2012

That rule doesn't hold for some languages. For example, a Python lexer needs to remember a stack of indentation levels to know whether a given line's indentation should be tokenized as an INDENT, a DEDENT, or no token at all.

silentbicycle · on Feb 20, 2012

True. Strictly speaking, that means it isn't context-free in the usual sense (right?), but it's a practical extension.

Matt Might uses Python's indentation as a lexing/parsing example in his ongoing "Scripting Language Design and Implementation" course (http://matt.might.net/teaching/scripting-languages/spring-20...), which is worth following if you're reading this thread.

spacemanaki · on Feb 20, 2012

That first paragraph was eye opening all on its own. I actually think I've collected masses of links to compiler resources from one or more of your comments in the past, so thanks for this one too.

Kaizyn · on Feb 20, 2012

Yes, it deals with everything involved in "parsing" which covers all aspects of starting from a text document and converting it into a tree structure that can be used either for code generation or for manipulating the code.

Lexing isn't all that different than parsing anyway. Both involve the recognition of "tokens" based on a stream of data. In the case of a lexer, the "tokens" are usually the characters taken character by character. For parsing, the "tokens" are identifiers, values, operators, etc.

draegtun · on Feb 20, 2012

From Perl 5.10 you can extend the RE engine.

One example of this is Regexp::Grammars (https://metacpan.org/module/Regexp::Grammars) which implements a complete recursive descent parser.

riffraff · on Feb 20, 2012

FWIW, I think ruby 1.9 actually has syntax for recursive expressions via callable backreferences, but while I can whip up something that will match balanced parenthesis I am not clear on how it could be used to build a tree-like structure.

maw · on Feb 19, 2012

Grune is most famous, by the way, for being CVS' original implementor.

geoffhill · on Feb 20, 2012

I have this book. This book is good, and very complete from a traditional formals grammars point-of-view (LL, LR, SLR, LALR), and has very good sections on top-down and bottom-up parsing, but it is rather outdated in terms of what people are using now.

Nowadays many people use parser combinator libraries or packrat-style PEG parsers. This book provides no information about either.

barrkel · on Feb 20, 2012

Parser combinator libraries are usually just embedded DSLs for constructing data structures (implicitly or explicitly) that are then interpreted during traversal, mostly in a way which has equivalent power to LL(1) or LL with backtracking - and frequently without the advantages of a "proper" analysis, such as detection of grammar ambiguities that you need first and follow sets to find. They're very cute, IMO; an appealing piece of hackery. I wouldn't try and push them much beyond fairly small ad-hoc parsing problems though. At core, they are a form of metaprogramming, where the text of your combinator expressions is a program that writes a program, and the extra distance your code needs to go through to get to the hardware is liable to introduce problems: either unforseen issues with the library, unexpected overuse of CPU or memory, or hard to diagnose errors.

PEGs, meanwhile, are all-out backtracking exponential matchers that try and get back the performance by spending memory on memoizing. A PEG parser would be my last choice in writing a parser, unless I had to recognize a complex language with no tools and almost no time, but plentiful CPU and memory. With tools, I'd prefer GLR parsing if the problem is hard enough; with time, I'd prefer hand-written LL(k) recursive descent, which gives substantial flexibility. Most commercial compilers use hand-written recursive descent for a reason.

In summary, if you have good knowledge of even just LL(1) alone, you should be able to understand how both parser combinator libraries and PEG parsers work very easily, and also see why they can be problematic. So I wouldn't be too fast to dismiss the book as dated.

timb · on Feb 20, 2012

There is a little bit of information in the 2nd edition. Parser combinators get a few pages in "17.4.2 Functional Programming". PEGs and packrat parsing are briefly covered in "15.7 Recognition Systems".

ThaddeusQuay2 · on Feb 20, 2012

Just to point out, for those who have never heard of it, you can learn a lot about parsing, pattern matching, grammars, and term rewriting from TXL, which is definitely a Perlis language.

http://txl.ca

http://en.wikipedia.org/wiki/TXL_(programming_language)

3dGrabber · on Feb 19, 2012

Very accessible, yet complete book on parsing