What algorithm are you using with the "obvious code"? Recursive descent? That definitely works well in many cases, but not all.
Here are some things that I think are difficult about parsing:
1. The algorithm to use depends heavily on the language, and it takes some experience to know which algorithm to use for which language. Some languages are hard to write predictive recursive descent parsers for -- they can require a lot of backtracking, which is exponential in general. C would be a good example -- it is more efficiently parsed with a bottom-up algorithm than top down (recursive descent), because of the lookahead issue. In contrast, languages like Python or Go are designed to be LL(1) -- you just have to look at a single token -- "def" or "func" -- to know what kind of thing you need to parse next.
Recursive descent was also considered too slow for expression parsing, where you have left recursion. If you have 13 levels of precedence, then you require 13 nested function calls to parse an atom. So you have basically three choices: Pratt parsing, the Shunting Yard algorithm, or precedence climbing (I've implemented #1 and #3 and like #3 the most). So you have to change the algorithm for the sublanguage of expressions, i.e. compose two different algorithms.
(I think these days nobody cares about 13 levels of function calls, unless you are parsing gigabytes of C++ code. Of course, this happens literally all the time, because of header file explosion ...)
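To make the LL(1) point concrete, here's a minimal sketch of predictive dispatch on a single token. The two-keyword grammar is made up for illustration -- it is not Go's or Python's actual grammar:

```javascript
"use strict";
// Hypothetical LL(1)-style mini-grammar: a declaration is either
// 'func NAME' or 'var NAME'. A single token of lookahead decides
// which production to parse -- no backtracking needed.
function parseFunc(tokens) { tokens.shift(); return {kind: 'func', name: tokens.shift()}; }
function parseVar(tokens)  { tokens.shift(); return {kind: 'var',  name: tokens.shift()}; }

function parseDecl(tokens) {
  switch (tokens[0]) {  // peek one token; dispatch immediately
    case 'func': return parseFunc(tokens);
    case 'var':  return parseVar(tokens);
    default:     throw new Error('unexpected token: ' + tokens[0]);
  }
}

console.log(parseDecl(['func', 'main']));  // { kind: 'func', name: 'main' }
```

The point is that each parse function can commit as soon as it peeks one token, which is what makes these grammars pleasant for hand-written recursive descent.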
2. If you know exactly what language you are going to parse, writing a hand-written parser is straightforward (although laborious). If you don't know the language, then refactoring parsers can be quite difficult. You need good test cases not to break corner cases.
A Lua paper talked about how until Lua 4.0 they used yacc while they were designing the language, and then when the language became more fixed in Lua 5.0, they switched to a hand-written parser. This seems to be very common. I think gcc also switched from yacc to a hand-written parser.
3. It's not straightforward to combine two parsers for two sublanguages and get a parser for the combined language (or get errors if the languages can't be composed). Many real languages are really several different sublanguages combined. C++ has a normal mathematical expression language, and a template expression language (there was that notorious >> vs foo<bar<T>> ambiguity in C++). Unix sh is about 5 different sublanguages mashed together (not counting external utilities). Parsing HTML requires parsing JS and CSS.
http://tratt.net/laurie/blog/entries/parsing_the_solved_prob...
4. Writing secure parsers. This is an issue for languages like JavaScript, where the input is untrusted. If you write a parser in C, I pretty much guarantee I will find a crash or buffer overflow in it (through fuzzing or otherwise). Basically ALL parsers in production have crash bugs uncovered after YEARS of widespread use, e.g. Python, Clang, etc.
See http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-e... . Note how mature sqlite is and how extensively it is tested -- you can still shake obscure bugs out quite easily.
5. Writing reusable parsers.
5a. Try writing a parser that allows you to reformat your source code, with whitespace and comments preserved. Most languages have multiple parsers these days -- i.e. the Java parser in your compiler is not the same parser that's used by the IDE.
5b. Try writing a parser with the hooks necessary to support code completion and auto-correct. There was a good video by Anders Hejlsberg here that mentioned that these types of parsers have to parse MORE THAN the language, so that they can continue to provide the required functionality when the code in the IDE is malformed.
FWIW I recall that Steve Wozniak also disclaimed any background in parsing algorithms, and that he said he invented his own table driven parsing algorithm for BASIC. That is totally plausible, but I also think he would have a hard time writing certain parsers for certain languages, supporting advanced functionality, without some study.
EDIT: #6: The line between the lexer and parser is not clear -- it depends on the language being parsed, and sometimes even on the implementation language.
In theory, you have a pipeline from the lexer to the parser. In practice, the lexer and the parser often need to communicate, resulting in circularity. This communication can be hard to reason about and debug.
Moreover, you can often solve parsing problems (i.e. resolve ambiguity) in either the lexer or the parser. (I think Go's and JavaScript's semicolon insertion are done in the lexer.) If you make the wrong choice, then you can have bugs or a lot of extra complexity. It takes experience and thought to get these things right.
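As a sketch of the lexer-side approach, here's a simplified, made-up version of a Go-style rule (the real rule enumerates specific token kinds much more carefully): at each newline, the lexer emits a semicolon if the last token on the line could end a statement.

```javascript
"use strict";
// Simplified sketch of lexer-level semicolon insertion (loosely modeled
// on Go's rule; the token lists here are illustrative, not Go's actual ones).
function canEndStatement(tok) {
  if ([')', ']', '}', '++', '--', 'return', 'break', 'continue'].indexOf(tok) !== -1) return true;
  if (['(', '{', ',', '=', '+', '-', '*', '/', 'func', 'var', 'if', 'for'].indexOf(tok) !== -1) return false;
  return /^\w/.test(tok);  // identifiers and literals can end a statement
}

// Input: one array of tokens per source line. Output: a single token stream.
function insertSemicolons(lines) {
  var out = [];
  lines.forEach(function (line) {
    out = out.concat(line);
    var last = line[line.length - 1];
    if (last !== undefined && canEndStatement(last)) out.push(';');
  });
  return out;
}

console.log(insertSemicolons([['x', '=', '1'], ['f', '(']]));
// [ 'x', '=', '1', ';', 'f', '(' ]
```

The parser downstream then sees explicit semicolons and never has to think about newlines -- which is exactly the kind of decision (lexer vs. parser) that's easy to get wrong.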
(Side note: depending on the language, there can be more stages than just lexing and parsing. Unix shell has at least 3 stages.)
> C would be a good example -- it is more efficiently parsed with a bottom-up algorithm than top down (recursive descent), because of the lookahead issue.
Which C compilers use a bottom-up algorithm? I'm fairly certain both GCC and Clang use top-down recursive descent parsers.
It would be great to write your grammar once and get security, code-formatting, autocomplete, etc. for free, but in practice, you'll need a lot of control over things like error reporting. It's hard to match the control you have with plain recursive-descent code.
That's a good question, and you're of course right about GCC and Clang using recursive descent.
Well, I said that bottom-up is more efficient for such languages (with C style declaration syntax), not that it's used in production C compilers :)
My feeling is that they use recursive descent for reasons of readability and maintainability, error handling, debugging, etc. (although I have seen a couple hand-written bottom up parsers for other languages)
I suspect that they have ad hoc tricks for the specific lookahead cases, or perhaps they just live with it because it doesn't come up often enough in real C code (there are plenty of other reasons C and C++ parsers are slow; this is an algorithmic reason, which may not be the bottleneck)
In practice I think there is a sliding scale from strict "recursive descent" to "ad hoc bunch of recursive procedures". Depending on how you organize the output data structures of the parser, I guess you can just delay knowing what production you are in. It doesn't have to be organized strictly according to the grammar. A lot of C parsers I've looked at use untyped (homogeneous) tree output, which may help with this. Every function returns Node* rather than an object of a specific type.
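A sketch of that homogeneous style (hypothetical, and in JavaScript rather than C for brevity): every parse procedure returns the same generic node shape, so it can record structure without committing to a production yet.

```javascript
"use strict";
// Homogeneous (untyped) tree output: every parse function returns the same
// generic shape -- the JS analogue of every C function returning Node*.
function node(kind, children) {
  return {kind: kind, children: children || [], text: null};
}

// In C, '(a)' might open a cast or a parenthesized expression. With a
// homogeneous tree you can parse it noncommittally and classify it later.
function parseParenGroup(tokens) {
  tokens.shift();                      // consume '('
  var inner = node('name');
  inner.text = tokens.shift();
  tokens.shift();                      // consume ')'
  return node('paren-group', [inner]); // kind is deliberately vague for now
}

var n = parseParenGroup(['(', 'a', ')']);
console.log(n.kind, n.children[0].text);  // paren-group a
```

A later pass (with the symbol table in hand) can rewrite the 'paren-group' kind into cast-vs-expression, which is the "delay knowing what production you are in" trick.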
But I'm not an expert on GCC or Clang, so I would be interested to hear anything to the contrary. (I'm still sort of shocked by how huge the Clang front end is; the lib/Lex/ dir is 21K lines of code; lib/Parse/ is 33K lines; and that's not even close to the full front end -- it doesn't even count the 60K lines of AST headers that I mentioned before...)
Of course, the danger with a more ad hoc structure is that it's harder to reason about what language you're actually parsing. The only solution seems to be rather exhaustive test cases (e.g. as mentioned on https://gcc.gnu.org/wiki/New_C_Parser). It's almost mind-boggling to try to understand what language 33K lines of code are recognizing, without some higher level structure.
This question addresses the same issue, but there's no real answer.
"When you dive into the parens you don't know if you're parsing a type, or an array of type, or an expression (and whether it's l-value or r-value), so you have to somehow postpone that decision"
1. You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom. Also, you don't have to "compose two different algorithms" per se; there's no impedance mismatch.
2. I've found refactoring hand-rolled parsers to be just fine, I believe the things I needed to do would not be so convenient with parsing tools, though it depends on what direction I'd have gone down at a certain point in time.
3. The problem there is one of proper language design. You need to combine your sub-languages in a way such that there aren't parsing problems. Your language needs to be parsed by humans, too, after all.
4. This is actually a solved problem, it's just the solution isn't evenly distributed. I think that parsing code is much more low-risk than other C code I've written, but there are people who do things a lot more bozotically...
5a. Does the parser have special obligations beyond recording comments and source locations here?
4a^H^H5b. Yeah, now, the question is, how did they solve that problem?
Your comments betray a lot of ignorance, so I will just point out the glaring inaccuracies in #1. Here are two different expression parsers that are likely installed on machines you use, and they absolutely make a function call for each level of precedence (eval1, eval2, ..., exp1, exp2, ...)
My point is not that this is BAD -- it probably doesn't matter on modern machines. I'm just saying there are choices to make. Parsing requires some knowledge and choices -- that's all. You don't just "type out the code".
#3 is not a refutation, because all the languages I mentioned are extraordinarily common. The answer to "parsing real languages is hard" is not "don't parse real languages".
I can also tell by #4 that you don't have any background in security. It's not about YOUR code per se. It's about code that is deployed and that we all rely on.
1. Did I not say there might be some living strawmen out there?
3. Did I say we shouldn't parse real languages? No, I said the problem was that it's not straightforward to combine two languages into a new one. The problem's not at the level of parsers.
"You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom."
So, how do you handle expression precedence? (I haven't hand written a parser since I made peace and developed a working relationship with bison.)
A Pratt-style parser makes expression precedence easy in a recursive descent parser:
"use strict";
var tokens = [2, '+', 5, '+', 9, '+', 4];
var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};
function expr(prec) {
var value = tokens.shift();
while (true) {
var nextToken = tokens[0];
var nextPrec = PREC[nextToken];
if (!nextPrec) break; // not an operator
if (nextPrec <= prec) break;
tokens.shift();
var rest = expr(nextPrec);
value = [nextToken, value, rest];
}
return value;
}
console.log(expr(0));
I find this technique useful because I can implement it in any language without being dependent on a parser generator.
Precedence is actually not hard. The way I do it is:
(1) Parse everything as though it were left-to-right.
(2) After each node is parsed, look at its immediate descendants and rearrange links as necessary. (Nodes in parentheses are flagged so you don't rearrange them.)
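That approach might look something like the following sketch. This is my reading of the description, with a made-up operator table; right-associative operators and the parenthesis flag are omitted for brevity:

```javascript
"use strict";
var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};

// Step (2): after building [op, left, right], rotate if the left child's
// operator binds less tightly than op. Children are already fixed up, so
// one rotation pass per newly built node is enough.
function fixup(node) {
  if (!Array.isArray(node)) return node;  // atom
  var op = node[0], left = node[1], right = node[2];
  if (Array.isArray(left) && PREC[left[0]] < PREC[op]) {
    // [op, [lop, a, b], c]  ->  [lop, a, fixup([op, b, c])]
    return [left[0], left[1], fixup([op, left[2], right])];
  }
  return node;
}

// Step (1): parse strictly left-to-right, fixing up after each node.
function parse(tokens) {
  var value = tokens.shift();
  while (tokens.length) {
    var op = tokens.shift();
    var rhs = tokens.shift();
    value = fixup([op, value, rhs]);
  }
  return value;
}

console.log(JSON.stringify(parse([2, '+', 5, '*', 9])));  // ["+",2,["*",5,9]]
```

The left-to-right pass produces a left-leaning tree, and the rotation restores precedence; equal-precedence operators are never rotated, which preserves left associativity.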
I can tell that the person above who is listing off a bunch of reasons not to use "recursive descent" hasn't written a compiler by hand ever (or not well). Most of the things he is talking about are easier to do by hand than in some complicated and relatively inflexible system.
Note that 'prediction' is mostly a red herring since you can look as many tokens ahead as you want before calling the appropriate function to handle the input. You would need to have a pathologically ambiguous language in order to make that part hard, and if your language is that ambiguous, it is going to confuse programmers!
In general, parsing is easy (if you know how to program well in the first place) and is only made more difficult/inflexible/user-unfriendly by using parsing tools. That doesn't mean that academic theories about parsing are bad -- it's good that we understand grammars deeply -- but that does not mean you should use those systems to generate your source code. (I do think it's a good idea to use a system like that to spot ambiguities in your grammar and decide how to handle them, because otherwise it's easy to be ignorant... But I would not use them to generate code!)
Precedence isn't hard -- but it's often not done with recursive descent. In other words, it's not just "typing out the code".
I think you're misreading my reply. A hand-written recursive descent parser works a lot of the time, and is almost certainly the right "default". But my point is that it doesn't work all the time. If it works for you, great! That doesn't mean all parsing is easy. Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser).
In particular, I'm NOT advocating that all parsers should be written with parsing tools. I think you must have read that somewhere else.
In fact, I used EXACTLY the same strategy as you mention on my recent shell parser. I machine-checked the grammar with ANTLR, and wrote some test cases for it. Then I transcribed that grammar to recursive descent by hand, along with several other ad hoc strategies that you need to parse an ad hoc language like shell.
(ANTLR is a great front end, but not necessarily a good back end. The generated Java code is probably OK, but the C++ code it generates is preposterous. Yacc is not that much better, i.e. with globals all over the place.)
"Just type out the code" is a horrible strategy because you will end up not knowing what language you've designed. If your post advocated "prototype your grammar with a tool, and then manually transcribe the grammar to code", then I'd be inclined to agree.
But that's not what you said. It seems like you are more interested in telling everyone what a great programmer you are, and how things are so easy for you, rather than spreading any interesting knowledge about parsing.
"Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser)."
It would not change at all, and I have no idea why you think it would, except to guess that the model you have in your head of a hand-written parser kind of sucks. They don't have to suck.
"...not knowing what language you've designed." I have no idea what you're on about here either.
Look, I think you are making things a lot harder than they are. I am not bragging ... I used to build lexers and parsers by hand 23+ years ago when I was a student in college and had almost no programming experience compared to what I have now. It is not hard. If you think it's hard, something is missing in your knowledge set.
(I also built stuff using parser tools 23+ years ago, and was able to very clearly contrast the two methods. Parser tools have gotten slightly better since then, but not much.)
I doubt your parser is even close to doing what he is talking about. I think you're an advocate of Microsoft's debugger over GDB, so hopefully what they are saying carries some weight (i.e. a little computer science doesn't hurt). They aren't just a bunch of eggheads; they are actually building language tools that people use.
And again, I'm not claiming that ALL parsing is hard. Sometimes you can just write a hand written lexer and recursive descent parser and be done with it. But that's not true in all cases, particularly for "real world" languages, or when you are designing a brand new language.
Saying parsing is easy is like saying "I wrote a game in BASIC in 8th grade, so writing games is easy". (And yes, I've actually heard statements like that ...)
I was researching Jai awhile back, and was definitely curious about it. I don't think you have released the code, so I can't tell which of these statements is true:
1) Parsing is easy for you
2) You think parsing is easy, but your language and your parser are full of bugs
I'm perfectly willing to believe #1 (honestly). But that doesn't mean all parsing is easy, for say a competent programmer in some other domain. It doesn't help anyone to say "just type out the code". Your comment about how you do precedence is a little more helpful.
I watched that talk when it appeared on the HN front page, and I actually think the whole methodology he is talking about is misguided. I don't find any of the "incremental program understanding" stuff in Visual Studio to be useful at all. I wish it were not there because it only causes problems and distractions.
It's a case where some people are choosing to do something that is a lot harder than a straightforward parse ... but as a user, a straightforward parse is actually what I want.
That said, even if you thought this was the right way to go, I am not sure that the internals of their code would look anything like the kinds of parsing tools you are talking about, so I am not sure it supports your point in any way.
> And again, I'm not claiming that ALL parsing is hard.
Parsing is easy. The video you link above is harder, but that's not really parsing any more, it's more like "make sense of this text that is sort of like a working program", which is more like an AI problem.
But anyway. It's pretty clear you haven't written many parsers (or any) so I am going to stop arguing. If I were to "win" this argument I wouldn't get anything out of it. I am trying to help by disabusing people of the notion that certain things are harder than they've been indoctrinated to think. If you don't want that help, fine ... just keep doing what you do and the world will keep going onward.
> I doubt your parser is even close to doing what he is talking about.
You act like if someone hasn't architecture-astronauted all their software to meet any hypothetical future requirement, you've won. The truth is, making a parser that can restart incrementally and reuse above-and-below data structures is just another requirement. It shapes the software design and would probably result in a more iterator-oriented approach -- see, we can already imagine how it might be built. It's like you're saying, "Writing a piddly game in BASIC isn't all that, let me show you how hard it is to write this Tetris program." Maybe a better analogy is those people that thought OOP was some deep advanced stuff.
And note again parsing is the easiest part of that software, a neat and well-contained problem.
Step 2. Notice that writing functions parsing "expressions," "expressions without equals signs," "expressions without equals signs or plus/minus signs," ... is very repetitive.
Step 3. Being a competent programmer, find a way to solve that problem. (Instead of N functions, make one function that takes a parameter specifying the precedence level. Edit: Durr, I'm a moron, this doesn't really describe how to solve the problem of making N recursive calls. But still, you just figure it out, see my other post for how I actually tend to do it. Edit: Or judofyr's post describes it too.)
And it's a natural match for Packrat, they work together really well.
> writing a hand-written parser is straightforward (although laborious)
Not that much more laborious than writing a BNF. You can still use a parser generator, just add some annotations for the error recovery, error reporting, pretty printing and indentation.
> It's not straightforward to combine two parsers for two sublanguages
With PEG or other lexerless parsing approaches, it's totally trivial. You can mix any languages you like.
> Writing reusable parsers.
Again, trivial with PEG, where parsers can be extensible and generic, and can be inherited in full or in part by subsequent parsers.
> Try writing a parser that allows you to reformat your source code, with whitespace and comments preserved.
And again, trivial with the lexerless PEG-based parsers.
> Try writing a parser with the hooks necessary to support code completion and auto-correct.
You're getting this for free when using PEG.
> The line between the lexer and parser is not clear
Just forget about the lexers, once and for all. It's 21st century already. Burn your Dragon Book.
> With PEG or other lexerless parsing approaches, it's totally trivial. You can mix any languages you like.
Not strictly true; combining PEG parsers will always work, but it might not give you the answers you want. If you have some outer layer that has 'value=' and you want to follow it by an expression in various sub-languages, you have to try each language in turn - if language A has some weird corner-case where it happens to accept something that's an important and common case in language B, language A will always win unless you swap the two languages around, in which case B might recognise and mask some important syntax from A.
Worse, your combined language parser cannot tell you that the combination of A and B cause a problem, because PEG parsers don't really support that kind of consistency-checking. It's just a bug that'll crop up at run-time.
You can get around this by manipulating the outer syntax to have special "language A" and "language B" prefixes to mark what syntax to expect, or by manually merging them to create "language AB" which has the syntax priorities you want. But in both cases, that's (potentially delicate and thoughtful) human intervention, not "straightforward combining of two parsers".
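For instance (with two made-up one-rule mini-languages), ordered choice makes A silently mask B on any shared prefix:

```javascript
"use strict";
// Two hypothetical sub-languages: A accepts integers, B accepts decimals.
function langA(s) { var m = /^\d+/.exec(s);      return m && m[0]; }
function langB(s) { var m = /^\d+\.\d+/.exec(s); return m && m[0]; }

// PEG-style ordered choice: try A first, fall back to B.
function choiceAB(s) {
  var a = langA(s);
  return a !== null ? a : langB(s);
}

// A matches the prefix "3" of "3.14", so B's fuller parse never runs;
// swapping the order would instead mask A on inputs B accepts.
console.log(choiceAB("3.14"));  // "3"
```

No error or warning is produced at combination time -- the masking only shows up when a "3.14"-shaped input arrives at run-time, which is exactly the consistency-checking gap described above.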
> because PEG parsers don't really support that kind of consistency-checking
Not true at all. You can easily check if a new parser is breaking anything in the old one.
And in practice you never mix languages at the same level. A more typical example of such mixing would be, say, regexp syntax in JavaScript.
EDIT: if you want more details on a theory of this static PEG verification, they will be available some time later when I polish my next bunch of commits.
I like PEGs a lot and even wrote my own PEG-like parsing language. The main problem I found was that, in practice, mixing lexing and parsing is a bad idea, so I have separate lexing in my system. In principle it depends on the language, but I would say it's true for all programming languages.
It's just obvious that programming languages have separate lexical and grammatical structure. If you want to disprove that, show me some languages like C, Java, Python, etc. expressed as PEGs.
PEGs have been around for 12 years now; I don't see them being deployed widely. There are probably multiple reasons for that, but I believe usability in terms of expressing real languages is a big one. (People always harp on ordered choice; I don't think it's that big a deal because when you write a recursive descent parser, you're mostly using ordered choice too.)
You want to do as much work in the less powerful computational paradigm as possible -- that is, lex with regular languages. And then use the more powerful paradigm (PEGs or CFG algorithms) on top of that token stream.
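As a sketch of that layering (with a made-up token set), the lexer is just a regular expression applied repeatedly, and the parser -- whatever paradigm it uses -- only ever sees whole tokens:

```javascript
"use strict";
// The weaker formalism first: a regular language cuts the input into tokens.
// (Token set is illustrative: integers, identifiers, single-char operators.)
var TOKEN = /\s*(\d+|[A-Za-z_]\w*|[-+*\/()=])/y;  // sticky regex

function lex(src) {
  src = src.trim();
  var tokens = [];
  var pos = 0;
  while (pos < src.length) {
    TOKEN.lastIndex = pos;
    var m = TOKEN.exec(src);
    if (!m) throw new Error('bad input at offset ' + pos);
    tokens.push(m[1]);
    pos = TOKEN.lastIndex;
  }
  return tokens;  // the parser (PEG, LL, LR, ...) consumes this stream
}

console.log(lex("x = 2 + 41"));  // [ 'x', '=', '2', '+', '41' ]
```

Keeping this stage separate also gives you one obvious place to handle whitespace and comments, instead of threading them through every grammar rule.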
I believe that lexing and parsing were combined in PEGs for reasons of academic presentation and bootstrapping, not for usability or practicality for recognizing real languages.
Several of your other points are wrong, but I'll leave it at that.