1. You're arguing against a strawman (maybe one that lives in some basements) if you think anybody parses expressions by writing down a CFG and translating it into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom. Also, you don't have to "compose two different algorithms" per se; there's no impedance mismatch.
2. I've found refactoring hand-rolled parsers to be just fine. I believe the things I needed to do would not have been as convenient with parsing tools, though it depends on which direction I'd have gone at a certain point in time.
3. The problem there is one of proper language design. You need to combine your sub-languages in a way such that there are no parsing problems. Your language needs to be parsed by humans, too, after all.
4. This is actually a solved problem, it's just the solution isn't evenly distributed. I think that parsing code is much more low-risk than other C code I've written, but there are people who do things a lot more bozotically...
5a. Does the parser have special obligations beyond recording comments and source locations here?
4a^H^H5b. Yeah, now, the question is, how did they solve that problem?
Your comments betray a lot of ignorance, so I will just point out the glaring inaccuracies in #1. Here are two different expression parsers that are likely installed on machines you use, and they absolutely make a function call for each level of precedence (eval1, eval2, ..., exp1, exp2, ...).
My point is not that this is BAD -- it probably doesn't matter on modern machines. I'm just saying there are choices to make. Parsing requires some knowledge and choices -- that's all. You don't just "type out the code".
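For concreteness, here is a toy JavaScript sketch of that layered style: one function per precedence level, each deferring to the next-tighter level. (This is just the shape of the eval1/eval2/exp1/exp2 cascades mentioned above, not their actual code.)

```javascript
"use strict";
// One function per precedence level, each deferring to the next-tighter
// level. Every atom is reached through a call at each level.
var tokens = [2, '+', 3, '*', 4];

function addExpr() {  // loosest level: + and -
  var node = mulExpr();
  while (tokens[0] === '+' || tokens[0] === '-') {
    node = [tokens.shift(), node, mulExpr()];
  }
  return node;
}

function mulExpr() {  // tighter level: * and /
  var node = atom();
  while (tokens[0] === '*' || tokens[0] === '/') {
    node = [tokens.shift(), node, atom()];
  }
  return node;
}

function atom() {  // a number
  return tokens.shift();
}

console.log(JSON.stringify(addExpr()));  // -> ["+",2,["*",3,4]]
```

Every expression walks through one call per level, which is exactly the cost being pointed out; the Pratt-style snippet quoted further down in this thread collapses the cascade into a single parameterized function.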
#3 is not a refutation, because all the languages I mentioned are extraordinarily common. The answer to "parsing real languages is hard" is not "don't parse real languages".
I can also tell by #4 that you don't have any background in security. It's not about YOUR code per se. It's about code that is deployed and that we all rely on.
1. Did I not say there might be some living strawmen out there?
3. Did I say we shouldn't parse real languages? No, I said the problem was that it's not straightforward to combine two languages into a new one. The problem's not at the level of parsers.
"You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom."
So, how do you handle expression precedence? (I haven't written a parser by hand since I made peace and developed a working relationship with bison.)
A Pratt-style parser makes expression precedence easy in a recursive descent parser:
"use strict";
var tokens = [2, '+', 5, '+', 9, '+', 4];
var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};
function expr(prec) {
var value = tokens.shift();
while (true) {
var nextToken = tokens[0];
var nextPrec = PREC[nextToken];
if (!nextPrec) break; // not an operator
if (nextPrec <= prec) break;
tokens.shift();
var rest = expr(nextPrec);
value = [nextToken, value, rest];
}
return value;
}
console.log(expr(0));
I find this technique useful because I can implement it in any language without being dependent on a parser generator.
Precedence is actually not hard. The way I do it is:
(1) Parse everything as though it were left-to-right.
(2) After each node is parsed, look at its immediate descendants and rearrange links as necessary. (Nodes in parentheses are flagged so you don't rearrange them.)
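A minimal sketch of that two-step approach, in the same style as the earlier snippet. (The token handling is simplified, and parenthesized sub-expressions are elided; a real parser would flag those nodes so they are never rearranged.)

```javascript
"use strict";
var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};

// Step 1: parse everything as though it were left-to-right, fixing
// each node up as soon as it is built.
function parseFlat(tokens) {
  var node = tokens.shift();
  while (tokens.length) {
    var op = tokens.shift();
    node = fixup([op, node, tokens.shift()]);
  }
  return node;
}

// Step 2: look at the new node's left child and rearrange links if our
// operator binds tighter. Parenthesized nodes would carry a flag
// (e.g. left.paren) so they are left alone; none are produced here.
function fixup(node) {
  var op = node[0], left = node[1], right = node[2];
  if (Array.isArray(left) && !left.paren && PREC[op] > PREC[left[0]]) {
    // [op, [lop, a, b], c]  becomes  [lop, a, [op, b, c]]
    return [left[0], left[1], fixup([op, left[2], right])];
  }
  return node;
}

console.log(JSON.stringify(parseFlat([2, '+', 3, '*', 4])));
// -> ["+",2,["*",3,4]]
```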
I can tell that the person above who is listing off a bunch of reasons not to use "recursive descent" hasn't written a compiler by hand ever (or not well). Most of the things he is talking about are easier to do by hand than in some complicated and relatively inflexible system.
Note that 'prediction' is mostly a red herring since you can look as many tokens ahead as you want before calling the appropriate function to handle the input. You would need to have a pathologically ambiguous language in order to make that part hard, and if your language is that ambiguous, it is going to confuse programmers!
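To make the lookahead point concrete, here's a toy sketch (the makeStream/peek helper and the two statement forms are invented for illustration): peek as far ahead as you like, then commit to the right parsing function, with no backtracking needed.

```javascript
"use strict";
// Token stream with arbitrary lookahead: peek(k) inspects the k-th
// upcoming token without consuming anything.
function makeStream(tokens) {
  var pos = 0;
  return {
    peek: function (k) { return tokens[pos + k]; },
    next: function () { return tokens[pos++]; },
  };
}

// Choose between two statement forms by peeking one token past the
// name: "name = expr" is an assignment, "name ( arg )" is a call.
function parseStatement(ts) {
  if (ts.peek(1) === '=') return parseAssignment(ts);
  if (ts.peek(1) === '(') return parseCall(ts);
  throw new Error('unexpected token: ' + ts.peek(1));
}

function parseAssignment(ts) {
  var name = ts.next();
  ts.next();  // consume '='
  return ['assign', name, ts.next()];
}

function parseCall(ts) {
  var name = ts.next();
  ts.next();  // consume '('
  var arg = ts.next();
  ts.next();  // consume ')'
  return ['call', name, arg];
}

console.log(JSON.stringify(parseStatement(makeStream(['x', '=', 42]))));
// -> ["assign","x",42]
```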
In general, parsing is easy (if you know how to program well in the first place) and is only made more difficult/inflexible/user-unfriendly by using parsing tools. That doesn't mean that academic theories about parsing are bad -- it's good that we understand grammars deeply -- but that does not mean you should use those systems to generate your source code. (I do think it's a good idea to use a system like that to spot ambiguities in your grammar and decide how to handle them, because otherwise it's easy to be ignorant... But I would not use them to generate code!)
Precedence isn't hard -- but it's often not done with recursive descent. In other words, it's not just "typing out the code".
I think you're misreading my reply. A hand-written recursive descent parser works a lot of the time, and is almost certainly the right "default". But my point is that it doesn't work all the time. If it works for you, great! That doesn't mean all parsing is easy. Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser).
In particular, I'm NOT advocating that all parsers should be written with parsing tools. I think you must have read that somewhere else.
In fact, I used EXACTLY the same strategy as you mention on my recent shell parser. I machine-checked the grammar with ANTLR, and wrote some test cases for it. Then I transcribed that grammar to recursive descent by hand, along with several other ad hoc strategies that you need to parse an ad hoc language like shell.
(ANTLR is a great front end, but not necessarily a good back end. The generated Java code is probably OK, but the C++ code it generates is preposterous. Yacc is not that much better, e.g. with globals all over the place.)
"Just type out the code" is a horrible strategy because you will end up not knowing what language you've designed. If your post advocated "prototype your grammar with a tool, and then manually transcribe the grammar to code", then I'd be inclined to agree.
But that's not what you said. It seems like you are more interested in telling everyone what a great programmer you are, and how things are so easy for you, rather than spreading any interesting knowledge about parsing.
"Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser)."
It would not change at all, and I have no idea why you think it would, except to guess that the model you have in your head of a hand-written parser kind of sucks. They don't have to suck.
"...not knowing what language you've designed." I have no idea what you're on about here either.
Look, I think you are making things a lot harder than they are. I am not bragging ... I used to build lexers and parsers by hand 23+ years ago when I was a student in college and had almost no programming experience compared to what I have now. It is not hard. If you think it's hard, something is missing in your knowledge set.
(I also built stuff using parser tools 23+ years ago, and was able to very clearly contrast the two methods. Parser tools have gotten slightly better since then, but not much.)
I doubt your parser is even close to doing what he is talking about. I think you're an advocate of Microsoft's debugger over GDB, so hopefully what they are saying carries some weight (i.e. a little computer science doesn't hurt). They aren't just a bunch of eggheads; they are actually building language tools that people use.
And again, I'm not claiming that ALL parsing is hard. Sometimes you can just write a hand written lexer and recursive descent parser and be done with it. But that's not true in all cases, particularly for "real world" languages, or when you are designing a brand new language.
Saying parsing is easy is like saying "I wrote a game in BASIC in 8th grade, so writing games is easy". (And yes, I've actually heard statements like that ... )
I was researching Jai a while back, and was definitely curious about it. I don't think you have released the code, so I can't tell which of these statements is true:
1) Parsing is easy for you
2) You think parsing is easy, but your language and your parser are full of bugs
I'm perfectly willing to believe #1 (honestly). But that doesn't mean all parsing is easy, for say a competent programmer in some other domain. It doesn't help anyone to say "just type out the code". Your comment about how you do precedence is a little more helpful.
I watched that talk when it appeared on the HN front page, and I actually think the whole methodology he is talking about is misguided. I don't find any of the "incremental program understanding" stuff in Visual Studio to be useful at all. I wish it were not there because it only causes problems and distractions.
It's a case where some people are choosing to do something that is a lot harder than a straightforward parse ... but as a user, a straightforward parse is actually what I want.
That said, even if you thought this was the right way to go, I am not sure that the internals of their code would look anything like the kinds of parsing tools you are talking about, so I am not sure it supports your point in any way.
> And again, I'm not claiming that ALL parsing is hard.
Parsing is easy. The video you link above is harder, but that's not really parsing any more, it's more like "make sense of this text that is sort of like a working program", which is more like an AI problem.
But anyway. It's pretty clear you haven't written many parsers (or any) so I am going to stop arguing. If I were to "win" this argument I wouldn't get anything out of it. I am trying to help by disabusing people of the notion that certain things are harder than they've been indoctrinated to think. If you don't want that help, fine ... just keep doing what you do and the world will keep going onward.
> I doubt your parser is even close to doing what he is talking about.
You act like if someone hasn't architecture-astronauted all their software to meet every hypothetical future requirement, you've won. The truth is, making a parser that can restart incrementally and reuse the data structures above and below an edit is just another requirement. It shapes the software design and would probably result in a more iterator-oriented approach -- see, we can already imagine how it might be built. It's like you're saying, "Writing a piddly game in BASIC isn't all that; let me show you how hard it is to write this Tetris program." Maybe a better analogy is those people who thought OOP was some deep advanced stuff.
And note again parsing is the easiest part of that software, a neat and well-contained problem.
Step 2. Notice that writing functions parsing "expressions," "expressions without equals signs," "expressions without equals signs or plus/minus signs," ... is very repetitive.
Step 3. Being a competent programmer, find a way to solve that problem. (Instead of N functions, make one function that takes a parameter specifying the precedence level. Edit: Durr, I'm a moron, this doesn't really describe how to solve the problem of making N recursive calls. But still, you just figure it out, see my other post for how I actually tend to do it. Edit: Or judofyr's post describes it too.)