What algorithm are you using with the "obvious code"? Recursive descent? That definitely works well in many cases, but not all.
Here are some things that I think are difficult about parsing:
1. The algorithm to use depends heavily on the language, and it takes some experience to know which algorithm to use for which language. Some languages are hard to write predictive recursive descent parsers for -- they can require a lot of backtracking, which is exponential in general. C would be a good example -- it is more efficiently parsed with a bottom-up algorithm than top down (recursive descent), because of the lookahead issue. In contrast, languages like Python or Go are designed to be LL(1) -- you just have to look at a single token -- "def" or "func" -- to know what kind of thing you need to parse next.
Recursive descent was also considered too slow for expression parsing, where you have left recursion. If you have 13 levels of precedence, then you require 13 nested function calls to parse an atom. So you have basically three choices: Pratt parsing, the Shunting Yard algorithm, or precedence climbing (I've implemented #1 and #3 and like #3 the most). So you have to change the algorithm for the sublanguage of expressions, i.e. compose two different algorithms.
(I think these days nobody cares about 13 levels of function calls, unless you are parsing gigabytes of C++ code. Of course, this happens literally all the time, because of header file explosion ...)
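To make the LL(1) point concrete, here's a minimal sketch of predictive dispatch on a single token. The two-keyword grammar is made up for illustration -- it is not Go's or Python's actual grammar:

```javascript
"use strict";
// Hypothetical LL(1)-style mini-grammar: a declaration is either
// 'func NAME' or 'var NAME'. A single token of lookahead decides
// which production to parse -- no backtracking needed.
function parseFunc(tokens) { tokens.shift(); return {kind: 'func', name: tokens.shift()}; }
function parseVar(tokens)  { tokens.shift(); return {kind: 'var',  name: tokens.shift()}; }

function parseDecl(tokens) {
  switch (tokens[0]) {  // peek one token; dispatch immediately
    case 'func': return parseFunc(tokens);
    case 'var':  return parseVar(tokens);
    default:     throw new Error('unexpected token: ' + tokens[0]);
  }
}

console.log(parseDecl(['func', 'main']));  // { kind: 'func', name: 'main' }
```

The point is that each parse function can commit as soon as it peeks one token, which is what makes these grammars pleasant for hand-written recursive descent.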
2. If you know exactly what language you are going to parse, writing a hand-written parser is straightforward (although laborious). If you don't know the language, then refactoring parsers can be quite difficult. You need good test cases not to break corner cases.
A Lua paper talked about how until Lua 4.0 they used yacc while they were designing the language, and then when the language became more fixed in Lua 5.0, they switched to a hand-written parser. This seems to be very common. I think gcc also switched from yacc to a hand-written parser.
3. It's not straightforward to combine two parsers for two sublanguages and get a parser for the combined language (or get errors if the languages can't be composed). Many real languages are really several different sublanguages combined. C++ has a normal mathematical expression language, and a template expression language (there was that notorious >> vs foo<bar<T>> ambiguity in C++). Unix sh is about 5 different sublanguages mashed together (not counting external utilities). Parsing HTML requires parsing JS and CSS.
http://tratt.net/laurie/blog/entries/parsing_the_solved_prob...
4. Writing secure parsers. This is an issue for languages like JavaScript, where the input is untrusted. If you write a parser in C, I pretty much guarantee I will find a crash or buffer overflow in it (through fuzzing or otherwise). Basically ALL parsers in production have crash bugs uncovered after YEARS of widespread use, e.g. Python, Clang, etc.
See http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-e... . Note how mature sqlite is and how extensively it is tested -- you can still shake obscure bugs out quite easily.
5. Writing reusable parsers.
5a. Try writing a parser that allows you to reformat your source code, with whitespace and comments preserved. Most languages have multiple parsers these days -- i.e. the Java parser in your compiler is not the same parser that's used by the IDE.
5b. Try writing a parser with the hooks necessary to support code completion and auto-correct. There was a good video by Anders Hejlsberg here that mentioned that these types of parsers have to parse MORE THAN the language, so that they can continue to provide the required functionality when the code in the IDE is malformed.
FWIW I recall that Steve Wozniak also disclaimed any background in parsing algorithms, and that he said he invented his own table driven parsing algorithm for BASIC. That is totally plausible, but I also think he would have a hard time writing certain parsers for certain languages, supporting advanced functionality, without some study.
EDIT: #6: The line between the lexer and parser is not clear -- it depends on the language being parsed, and sometimes even on the implementation language.
In theory, you have a pipeline from the lexer to the parser. In practice, the lexer and the parser often need to communicate, resulting in circularity. This communication can be hard to reason about and debug.
Moreover, you can often solve parsing problems (i.e. resolve ambiguity) in either the lexer or the parser. (I think Go's and JavaScript's semicolon insertion are done in the lexer.) If you make the wrong choice, then you can have bugs or a lot of extra complexity. It takes experience and thought to get these things right.
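As a sketch of the lexer-side approach, here's a simplified, made-up version of a Go-style rule (the real rule enumerates specific token kinds much more carefully): at each newline, the lexer emits a semicolon if the last token on the line could end a statement.

```javascript
"use strict";
// Simplified sketch of lexer-level semicolon insertion (loosely modeled
// on Go's rule; the token lists here are illustrative, not Go's actual ones).
function canEndStatement(tok) {
  if ([')', ']', '}', '++', '--', 'return', 'break', 'continue'].indexOf(tok) !== -1) return true;
  if (['(', '{', ',', '=', '+', '-', '*', '/', 'func', 'var', 'if', 'for'].indexOf(tok) !== -1) return false;
  return /^\w/.test(tok);  // identifiers and literals can end a statement
}

// Input: one array of tokens per source line. Output: a single token stream.
function insertSemicolons(lines) {
  var out = [];
  lines.forEach(function (line) {
    out = out.concat(line);
    var last = line[line.length - 1];
    if (last !== undefined && canEndStatement(last)) out.push(';');
  });
  return out;
}

console.log(insertSemicolons([['x', '=', '1'], ['f', '(']]));
// [ 'x', '=', '1', ';', 'f', '(' ]
```

The parser downstream then sees explicit semicolons and never has to think about newlines -- which is exactly the kind of decision (lexer vs. parser) that's easy to get wrong.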
(Side note: depending on the language, there can be more stages than just lexing and parsing. Unix shell has at least 3 stages.)
> C would be a good example -- it is more efficiently parsed with a bottom-up algorithm than top down (recursive descent), because of the lookahead issue.
Which C compilers use a bottom-up algorithm? I'm fairly certain both GCC and Clang use top-down recursive descent parsers.
It would be great to write your grammar once and get security, code-formatting, autocomplete, etc. for free, but in practice, you'll need a lot of control over things like error reporting. It's hard to match the control you have with plain recursive-descent code.
That's a good question, and you're of course right about GCC and Clang using recursive descent.
Well, I said that bottom-up is more efficient for such languages (with C style declaration syntax), not that it's used in production C compilers :)
My feeling is that they use recursive descent for reasons of readability and maintainability, error handling, debugging, etc. (although I have seen a couple hand-written bottom up parsers for other languages)
I suspect that they have ad hoc tricks for the specific lookahead cases, or perhaps they just live with it because it doesn't come up often enough in real C code (there are plenty of other reasons C and C++ parsers are slow; this is an algorithmic reason, which may not be the bottleneck)
In practice I think there is a sliding scale from strict "recursive descent" to "ad hoc bunch of recursive procedures". Depending on how you organize the output data structures of the parser, I guess you can just delay knowing what production you are in. It doesn't have to be organized strictly according to the grammar. A lot of C parsers I've looked at use untyped (homogeneous) tree output, which may help with this. Every function returns Node* rather than an object of a specific type.
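A sketch of that homogeneous style (hypothetical, and in JavaScript rather than C for brevity): every parse procedure returns the same generic node shape, so it can record structure without committing to a production yet.

```javascript
"use strict";
// Homogeneous (untyped) tree output: every parse function returns the same
// generic shape -- the JS analogue of every C function returning Node*.
function node(kind, children) {
  return {kind: kind, children: children || [], text: null};
}

// In C, '(a)' might open a cast or a parenthesized expression. With a
// homogeneous tree you can parse it noncommittally and classify it later.
function parseParenGroup(tokens) {
  tokens.shift();                      // consume '('
  var inner = node('name');
  inner.text = tokens.shift();
  tokens.shift();                      // consume ')'
  return node('paren-group', [inner]); // kind is deliberately vague for now
}

var n = parseParenGroup(['(', 'a', ')']);
console.log(n.kind, n.children[0].text);  // paren-group a
```

A later pass (with the symbol table in hand) can rewrite the 'paren-group' kind into cast-vs-expression, which is the "delay knowing what production you are in" trick.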
But I'm not an expert on GCC or Clang, so I would be interested to hear anything to the contrary. (I'm still sort of shocked by how huge the Clang front end is; the lib/Lex/ dir is 21K lines of code; lib/Parse/ is 33K lines; and that's not even close to the full front end -- it doesn't even count the 60K lines of AST headers that I mentioned before...)
Of course, the danger with a more ad hoc structure is that it's harder to reason about what language you're actually parsing. The only solution seems to be rather exhaustive test cases (e.g. as mentioned on https://gcc.gnu.org/wiki/New_C_Parser). It's almost mind-boggling to try to understand what language 33K lines of code are recognizing, without some higher level structure.
This question addresses the same issue, but there's no real answer.
"When you dive into the parens you don't know if you're parsing a type, or an array of type, or an expression (and whether it's l-value or r-value), so you have to somehow postpone that decision"
1. You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom. Also, you don't have to "compose two different algorithms" per se; there's no impedance mismatch.
2. I've found refactoring hand-rolled parsers to be just fine, I believe the things I needed to do would not be so convenient with parsing tools, though it depends on what direction I'd have gone down at a certain point in time.
3. The problem there is one of proper language design. You need to combine your sub-languages in a way such that there aren't parsing problems. Your language needs to be parsed by humans, too, after all.
4. This is actually a solved problem, it's just the solution isn't evenly distributed. I think that parsing code is much more low-risk than other C code I've written, but there are people who do things a lot more bozotically...
5a. Does the parser have special obligations beyond recording comments and source locations here?
4a^H^H5b. Yeah, now, the question is, how did they solve that problem?
Your comments betray a lot of ignorance, so I will just point out the glaring inaccuracies in #1. Here are two different expression parsers that are likely installed on machines you use, and they absolutely make a function call for each level of precedence (eval1, eval2, ..., exp1, exp2, ...)
My point is not that this is BAD -- it probably doesn't matter on modern machines. I'm just saying there are choices to make. Parsing requires some knowledge and choices -- that's all. You don't just "type out the code".
#3 is not a refutation, because all the languages I mentioned are extraordinarily common. The answer to "parsing real languages is hard" is not "don't parse real languages".
I can also tell by #4 that you don't have any background in security. It's not about YOUR code per se. It's about code that is deployed and that we all rely on.
1. Did I not say there might be some living strawmen out there?
3. Did I say we shouldn't parse real languages? No, I said the problem was that it's not straightforward to combine two languages into a new one. The problem's not at the level of parsers.
"You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom."
So, how do you handle expression precedence? (I haven't hand written a parser since I made peace and developed a working relationship with bison.)
A Pratt-style parser makes expression precedence easy in a recursive descent parser:
"use strict";
var tokens = [2, '+', 5, '+', 9, '+', 4];
var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};
function expr(prec) {
var value = tokens.shift();
while (true) {
var nextToken = tokens[0];
var nextPrec = PREC[nextToken];
if (!nextPrec) break; // not an operator
if (nextPrec <= prec) break;
tokens.shift();
var rest = expr(nextPrec);
value = [nextToken, value, rest];
}
return value;
}
console.log(expr(0));
I find this technique useful because I can implement it in any language without being dependent on a parser generator.
Precedence is actually not hard. The way I do it is:
(1) Parse everything as though it were left-to-right.
(2) After each node is parsed, look at its immediate descendants and rearrange links as necessary. (Nodes in parentheses are flagged so you don't rearrange them.)
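That approach might look something like the following sketch. This is my reading of the description, with a made-up operator table; right-associative operators and the parenthesis flag are omitted for brevity:

```javascript
"use strict";
var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};

// Step (2): after building [op, left, right], rotate if the left child's
// operator binds less tightly than op. Children are already fixed up, so
// one rotation pass per newly built node is enough.
function fixup(node) {
  if (!Array.isArray(node)) return node;  // atom
  var op = node[0], left = node[1], right = node[2];
  if (Array.isArray(left) && PREC[left[0]] < PREC[op]) {
    // [op, [lop, a, b], c]  ->  [lop, a, fixup([op, b, c])]
    return [left[0], left[1], fixup([op, left[2], right])];
  }
  return node;
}

// Step (1): parse strictly left-to-right, fixing up after each node.
function parse(tokens) {
  var value = tokens.shift();
  while (tokens.length) {
    var op = tokens.shift();
    var rhs = tokens.shift();
    value = fixup([op, value, rhs]);
  }
  return value;
}

console.log(JSON.stringify(parse([2, '+', 5, '*', 9])));  // ["+",2,["*",5,9]]
```

The left-to-right pass produces a left-leaning tree, and the rotation restores precedence; equal-precedence operators are never rotated, which preserves left associativity.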
I can tell that the person above who is listing off a bunch of reasons not to use "recursive descent" hasn't written a compiler by hand ever (or not well). Most of the things he is talking about are easier to do by hand than in some complicated and relatively inflexible system.
Note that 'prediction' is mostly a red herring since you can look as many tokens ahead as you want before calling the appropriate function to handle the input. You would need to have a pathologically ambiguous language in order to make that part hard, and if your language is that ambiguous, it is going to confuse programmers!
In general, parsing is easy (if you know how to program well in the first place) and is only made more difficult/inflexible/user-unfriendly by using parsing tools. That doesn't mean that academic theories about parsing are bad -- it's good that we understand grammars deeply -- but that does not mean you should use those systems to generate your source code. (I do think it's a good idea to use a system like that to spot ambiguities in your grammar and decide how to handle them, because otherwise it's easy to be ignorant... But I would not use them to generate code!)
Precedence isn't hard -- but it's often not done with recursive descent. In other words, it's not just "typing out the code".
I think you're misreading my reply. A hand-written recursive descent parser works a lot of the time, and is almost certainly the right "default". But my point is that it doesn't work all the time. If it works for you, great! That doesn't mean all parsing is easy. Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser).
In particular, I'm NOT advocating that all parsers should be written with parsing tools. I think you must have read that somewhere else.
In fact, I used EXACTLY the same strategy as you mention on my recent shell parser. I machine-checked the grammar with ANTLR, and wrote some test cases for it. Then I transcribed that grammar to recursive descent by hand, along with several other ad hoc strategies that you need to parse an ad hoc language like shell.
(ANTLR is a great front end, but not necessarily a good back end. The generated Java code is probably OK, but the C++ code it generates is preposterous. Yacc is not that much better, i.e. with globals all over the place.)
"Just type out the code" is a horrible strategy because you will end up not knowing what language you've designed. If your post advocated "prototype your grammar with a tool, and then manually transcribe the grammar to code", then I'd be inclined to agree.
But that's not what you said. It seems like you are more interested in telling everyone what a great programmer you are, and how things are so easy for you, rather than spreading any interesting knowledge about parsing.
"Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser)."
It would not change at all, and I have no idea why you think it would, except to guess that the model you have in your head of a hand-written parser kind of sucks. They don't have to suck.
"...not knowing what language you've designed." I have no idea what you're on about here either.
Look, I think you are making things a lot harder than they are. I am not bragging ... I used to build lexers and parsers by hand 23+ years ago when I was a student in college and had almost no programming experience compared to what I have now. It is not hard. If you think it's hard, something is missing in your knowledge set.
(I also built stuff using parser tools 23+ years ago, and was able to very clearly contrast the two methods. Parser tools have gotten slightly better since then, but not much.)
I doubt your parser is even close to doing what he is talking about. I think you're an advocate of Microsoft's debugger over GDB, so hopefully what they are saying carries some weight (i.e. a little computer science doesn't hurt). They aren't just a bunch of eggheads; they are actually building language tools that people use.
And again, I'm not claiming that ALL parsing is hard. Sometimes you can just write a hand written lexer and recursive descent parser and be done with it. But that's not true in all cases, particularly for "real world" languages, or when you are designing a brand new language.
Saying parsing is easy is like saying "I wrote a game in BASIC in 8th grade, so writing games is easy". (And yes, I've actually heard statements like that ...)
I was researching Jai awhile back, and was definitely curious about it. I don't think you have released the code, so I can't tell which of these statements is true:
1) Parsing is easy for you
2) You think parsing is easy, but your language and your parser are full of bugs
I'm perfectly willing to believe #1 (honestly). But that doesn't mean all parsing is easy, for say a competent programmer in some other domain. It doesn't help anyone to say "just type out the code". Your comment about how you do precedence is a little more helpful.
I watched that talk when it appeared on the HN front page, and I actually think the whole methodology he is talking about is misguided. I don't find any of the "incremental program understanding" stuff in Visual Studio to be useful at all. I wish it were not there because it only causes problems and distractions.
It's a case where some people are choosing to do something that is a lot harder than a straightforward parse ... but as a user, a straightforward parse is actually what I want.
That said, even if you thought this was the right way to go, I am not sure that the internals of their code would look anything like the kinds of parsing tools you are talking about, so I am not sure it supports your point in any way.
> And again, I'm not claiming that ALL parsing is hard.
Parsing is easy. The video you link above is harder, but that's not really parsing any more, it's more like "make sense of this text that is sort of like a working program", which is more like an AI problem.
But anyway. It's pretty clear you haven't written many parsers (or any) so I am going to stop arguing. If I were to "win" this argument I wouldn't get anything out of it. I am trying to help by disabusing people of the notion that certain things are harder than they've been indoctrinated to think. If you don't want that help, fine ... just keep doing what you do and the world will keep going onward.
> I doubt your parser is even close to doing what he is talking about.
You act like if someone hasn't architecture-astronauted all their software to meet any hypothetical future requirement, you've won. The truth is, making a parser that can restart incrementally and reuse above-and-below data structures is just another requirement. It shapes the software design and would probably result in a more iterator-oriented approach -- see, we can already imagine how it might be built. It's like you're saying, "Writing a piddly game in BASIC isn't all that, let me show you how hard it is to write this Tetris program." Maybe a better analogy is those people that thought OOP was some deep advanced stuff.
And note again parsing is the easiest part of that software, a neat and well-contained problem.
Step 2. Notice that writing functions parsing "expressions," "expressions without equals signs," "expressions without equals signs or plus/minus signs," ... is very repetitive.
Step 3. Being a competent programmer, find a way to solve that problem. (Instead of N functions, make one function that takes a parameter specifying the precedence level. Edit: Durr, I'm a moron, this doesn't really describe how to solve the problem of making N recursive calls. But still, you just figure it out, see my other post for how I actually tend to do it. Edit: Or judofyr's post describes it too.)
And it's a natural match for Packrat, they work together really well.
> writing a hand-written parser is straightforward (although laborious)
Not that much more laborious than writing a BNF. You can still use a parser generator, just add some annotations for the error recovery, error reporting, pretty printing and indentation.
> It's not straightforward to combine two parsers for two sublanguages
With PEG or other lexerless parsing approaches, it's totally trivial. You can mix any languages you like.
> Writing reusable parsers.
Again, trivial with PEG, where parsers can be extensible and generic, and can be inherited in full or in part by subsequent parsers.
> Try writing a parser that allows you to reformat your source code, with whitespace and comments preserved.
And again, trivial with the lexerless PEG-based parsers.
> Try writing a parser with the hooks necessary to support code completion and auto-correct.
You're getting this for free when using PEG.
> The line between the lexer and parser is not clear
Just forget about the lexers, once and for all. It's 21st century already. Burn your Dragon Book.
> With PEG or other lexerless parsing approaches, it's totally trivial. You can mix any languages you like.
Not strictly true; combining PEG parsers will always work, but it might not give you the answers you want. If you have some outer layer that has 'value=' and you want to follow it by an expression in various sub-languages, you have to try each language in turn - if language A has some weird corner-case where it happens to accept something that's an important and common case in language B, language A will always win unless you swap the two languages around, in which case B might recognise and mask some important syntax from A.
Worse, your combined language parser cannot tell you that the combination of A and B cause a problem, because PEG parsers don't really support that kind of consistency-checking. It's just a bug that'll crop up at run-time.
You can get around this by manipulating the outer syntax to have special "language A" and "language B" prefixes to mark what syntax to expect, or by manually merging them to create "language AB" which has the syntax priorities you want. But in both cases, that's (potentially delicate and thoughtful) human intervention, not "straightforward combining of two parsers".
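For instance (with two made-up one-rule mini-languages), ordered choice makes A silently mask B on any shared prefix:

```javascript
"use strict";
// Two hypothetical sub-languages: A accepts integers, B accepts decimals.
function langA(s) { var m = /^\d+/.exec(s);      return m && m[0]; }
function langB(s) { var m = /^\d+\.\d+/.exec(s); return m && m[0]; }

// PEG-style ordered choice: try A first, fall back to B.
function choiceAB(s) {
  var a = langA(s);
  return a !== null ? a : langB(s);
}

// A matches the prefix "3" of "3.14", so B's fuller parse never runs;
// swapping the order would instead mask A on inputs B accepts.
console.log(choiceAB("3.14"));  // "3"
```

No error or warning is produced at combination time -- the masking only shows up when a "3.14"-shaped input arrives at run-time, which is exactly the consistency-checking gap described above.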
> because PEG parsers don't really support that kind of consistency-checking
Not true at all. You can easily check if a new parser is breaking anything in the old one.
And in practice you never mix languages at the same level. A more typical example of such mixing would be, say, regexp syntax in JavaScript.
EDIT: if you want more details on a theory of this static PEG verification, they will be available some time later when I polish my next bunch of commits.
I like PEGs a lot and even wrote my own PEG-like parsing language. The main problem I found was that, in practice, mixing lexing and parsing is a bad idea, so I have separate lexing in my system. In principle it depends on the language, but I would say it's true for all programming languages.
It's just obvious that programming languages have separate lexical and grammatical structure. If you want to disprove that, show me some languages like C, Java, Python, etc. expressed as PEGs.
PEGs have been around for 12 years now; I don't see them being deployed widely. There are probably multiple reasons for that, but I believe usability in terms of expressing real languages is a big one. (People always harp on ordered choice; I don't think it's that big a deal because when you write a recursive descent parser, you're mostly using ordered choice too.)
You want to do as much work in the less powerful computational paradigm as possible -- that is, lex with regular languages. And then use the more powerful paradigm (PEGs or CFG algorithms) on top of that token stream.
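As a sketch of that layering (with a made-up token set), the lexer is just a regular expression applied repeatedly, and the parser -- whatever paradigm it uses -- only ever sees whole tokens:

```javascript
"use strict";
// The weaker formalism first: a regular language cuts the input into tokens.
// (Token set is illustrative: integers, identifiers, single-char operators.)
var TOKEN = /\s*(\d+|[A-Za-z_]\w*|[-+*\/()=])/y;  // sticky regex

function lex(src) {
  src = src.trim();
  var tokens = [];
  var pos = 0;
  while (pos < src.length) {
    TOKEN.lastIndex = pos;
    var m = TOKEN.exec(src);
    if (!m) throw new Error('bad input at offset ' + pos);
    tokens.push(m[1]);
    pos = TOKEN.lastIndex;
  }
  return tokens;  // the parser (PEG, LL, LR, ...) consumes this stream
}

console.log(lex("x = 2 + 41"));  // [ 'x', '=', '2', '+', '41' ]
```

Keeping this stage separate also gives you one obvious place to handle whitespace and comments, instead of threading them through every grammar rule.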
I believe that lexing and parsing were combined in PEGs for reasons of academic presentation and bootstrapping, not for usability or practicality for recognizing real languages.
Several of your other points are wrong, but I'll leave it at that.