I used Lex and Yacc (as the article mentions, the direct predecessors of the Flex and Bison it talks about) a bit a long time ago, but more recently when I've written parsers I've just used recursive descent - but I was parsing a well-defined grammar. It did leave me with the knowledge of how to describe grammars in Yacc though, which I have found useful.
Guy Steele said something about a use of Yacc for language design (rather than implementation) that stuck with me:
Be sure that your language will parse. It seems stupid to sit down and start designing constructs and not worry how they will fit together. You can get a language that's difficult if not impossible to parse, not only for a computer, but for a person. I use YACC constantly as a check of all my language designs, but I very seldom use YACC in the implementation. I use it as a tester, to be sure that it's LR(1) ... because if a language is LR(1) it's more likely that a person can deal with it.
From the Dynamic Languages Wizards series (in 2001), in the panel on language design (1:09:05) [1]
I've not yet employed Yacc in this fashion, but it did give me a tool for thinking about object models. A while ago, when I was puzzling over how some classes in an entity relationship diagram should be related, I considered it from the point of view of how I would design a grammar for serializing an instance of the model into text. This essentially made my decision for me in a principled way, though I never reached the point of writing up a grammar for the whole model; I just considered the implications for the local bit that was troubling me.
If you want a language easy to deal with for humans, you want LL(1) or possibly LL(k) for some reasonable k.
LR(1) isn't easy to deal with: LR(1) shift-reduce parsing allows an arbitrary number of tokens to be stuffed into the parser stack before making a single reduction.
If you want easy to parse, make sure that a left-to-right scan can make a parsing decision by looking at a small number of tokens (often one). A permanent parsing decision: parse it this way or that way, no backtracking to try something else.
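As a sketch of that idea (using a made-up two-statement grammar; all names here are hypothetical), a recursive descent parser in this style peeks at a single token and commits to a branch immediately, with no backtracking:

```python
# Minimal LL(1)-style sketch: each rule looks at one token and makes a
# permanent parsing decision. Hypothetical grammar with two statements.

def parse_stmt(toks):
    # Peek at toks[0] and commit to a branch right away.
    if toks[0] == "if":
        return parse_if(toks)
    elif toks[0] == "print":
        return parse_print(toks)
    raise SyntaxError(f"unexpected token {toks[0]!r}")

def parse_if(toks):
    toks.pop(0)                  # consume "if"
    cond = toks.pop(0)           # condition: a single identifier here
    body = parse_stmt(toks)      # one nested statement as the body
    return ("if", cond, body)

def parse_print(toks):
    toks.pop(0)                  # consume "print"
    return ("print", toks.pop(0))

print(parse_stmt(["if", "x", "print", "y"]))
# -> ('if', 'x', ('print', 'y'))
```

The point is that every decision is local and final: once the parser sees `if`, it never reconsiders.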
OK, the basics. But do not stop reading here if you want to write a parser. There are more modern tools to look at (e.g., antlr).
Warning 1: parsing Unicode streams well is awkward with flex -- it's from an age when ASCII ruled. Handling multiple input encodings may get weird. If it's only UTF-8, it may work, because UTF-8 is essentially bytes. But I find a hand-written scanner more convenient (the grammar is seldom too complex for that). And regexps based on General_Category or ID_Start etc.? Difficult...
Warning 2: for various reasons (usually flexibility, conflict resolution, error reporting, and/or error recovery), many projects move from bison to something else, even a handwritten recursive descent parser. It's longer, but not that difficult.
- You don't have UTF-8 everywhere in a language. You might have it only in comments and string literals, in which case you can just be 8-bit clean and don't bother.
- If you have UTF-8 in identifiers, you can recognize it via a few easy patterns.
- The tokenizer doesn't have to do full validation on the UTF-8; it can defer that to some routines. I.e. it has to recognize a potentially valid UTF-8 sequence, capture it in yytext[], and then a proper UTF-8 recognizer can be invoked on yytext which will do things like reject overlong forms. You can further sanitize it to reject characters that you don't want in an identifier, like various kinds of Unicode spaces or whatever your rules are.
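The deferred-validation idea can be sketched in Python terms: the tokenizer only has to capture a plausible byte sequence, and a strict decoder then does the real work, rejecting malformed input including overlong forms. (Illustrative, not a full validator.)

```python
# Two-phase UTF-8 handling sketch: accept "potentially valid" bytes in the
# tokenizer, then run a strict validator over the captured sequence.

def validate_utf8(data: bytes) -> str:
    # Python's strict UTF-8 decoder rejects overlong forms, surrogates,
    # and truncated sequences, which is exactly the second-phase check.
    return data.decode("utf-8", errors="strict")

print(validate_utf8("héllo".encode("utf-8")))   # -> héllo

try:
    validate_utf8(b"\xc0\xaf")   # overlong encoding of '/': rejected
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```

A flex scanner would do the same thing with yytext[]: grab the candidate bytes, then hand them to a validator like this before accepting the token.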
Multiple encodings though? Hmph. You can write a custom YYINPUT macro in Lex/Flex where you could convert any input encoding, normalizing it into UTF-8. Or read through a stream filtering abstraction which reads various encodings on its input end, pumping out UTF-8.
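The stream-filtering approach can be sketched in Python terms (codecs stream readers stand in for the filtering abstraction; the encoding names are just examples): decode whatever encoding arrives on the input end, and pump uniform UTF-8 out the other.

```python
# Sketch: normalize any input encoding to UTF-8 via a stream reader.
import codecs
import io

def to_utf8(raw_bytes: bytes, encoding: str) -> bytes:
    # Decode the input in its native encoding, re-encode as UTF-8,
    # so the parser only ever sees one encoding.
    reader = codecs.getreader(encoding)(io.BytesIO(raw_bytes))
    return reader.read().encode("utf-8")

latin1_input = "café".encode("latin-1")   # 'é' is one byte in Latin-1
print(to_utf8(latin1_input, "latin-1"))   # -> b'caf\xc3\xa9'
```

A custom YYINPUT doing the equivalent conversion in C would play the same role in a Lex/Flex scanner.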
In TXR Lisp, when you do (read "(1 2 3)"), what happens is that the string is made of wide characters: code points. But the parser takes UTF-8. So a UTF-8 stream is created which scans the wide string as UTF-8.
> I have to say though, most compilers courses I've seen have an inordinate emphasis on parsing and little else.
Which is sad, because parsing is simpler than the theory-heavy books make it out to be and understanding optimization is probably more practically important for programmers.
I mostly agree. Parsing is as complex as the theory-heavy books make it out to be (and more as you keep going down the rabbit hole). But the necessity of that complexity is isolated to some pretty niche areas. If you've been brushing up against the limits of straightforward parsing for a number of years and are still finding it wanting, then you might be one of the few people who need to delve deeper.
Almost everyone should just hand-code a recursive descent parser and then move on with their lives. I've been part of a few book clubs at work that tried to dig into compiler books, and the same thing kept happening: people got bogged down by the parser theory and gave up. One of the people even eventually left to work on cryptography for a company doing government contracts. These people could understand complex topics, but apparently parsing theory is a bridge too far for nearly everyone.
the happy path may very well be "simpler," but I get the impression a lot of the newer parsing techniques focus on error recovery which is what made clang and friends game-changers because they emit helpful and often actionable error messages on malformed input
it's a whole other ballgame when writing interactive parsers (as one would for IJ or I think tree-sitter, too) where the document spends the majority of its life in an erroneous state
now, I can appreciate diving into such an 80/20 topic may be too much for a compiler course, but as for really rolling out a compiler or editor plugin, it's the difference between "this is crap" and "wow, that's genuinely helpful"
If lex/yacc style parsing works for you, then great. However, I suspect most people are going to get more mileage out of just hand writing a recursive descent parser and moving on with their lives.
The benefit of recursive descent parsers is that they're easy to write, modify, and understand. You don't need any new paradigms, just write code like you typically do. If something goes wrong, your standard debugging skills will serve you well.
There are also a lot of other relatively easy parsing technologies out there. For example, you can consider monadic parsing, parser combinators, or PEG libraries.
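As a taste of the combinator style (a minimal sketch, not any particular library's API): each parser is a plain function from an input position to a result, and small functions compose them.

```python
# Tiny parser combinators: a parser takes (text, pos) and returns
# (value, new_pos) on success or None on failure.

def char(c):
    def p(s, i):
        return (c, i + 1) if i < len(s) and s[i] == c else None
    return p

def seq(*parsers):
    # Run parsers in order; fail if any fails.
    def p(s, i):
        out = []
        for sub in parsers:
            r = sub(s, i)
            if r is None:
                return None
            v, i = r
            out.append(v)
        return (out, i)
    return p

def alt(*parsers):
    # Try alternatives in order at the same position (this is
    # where the backtracking happens).
    def p(s, i):
        for sub in parsers:
            r = sub(s, i)
            if r is not None:
                return r
        return None
    return p

# Grammar: "ab" | "ac"
parser = alt(seq(char("a"), char("b")), seq(char("a"), char("c")))
print(parser("ac", 0))   # -> (['a', 'c'], 2)
```

Real combinator libraries add error reporting and nicer notation, but the core is this small.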
I spent a year trying to figure out which parser technique worked best for me, and I'm glad I didn't just stick with my starting point of lex/yacc. So again, if this guide allows parsing to just work for you, then great, stick with it. But if you find yourself encountering a lot of problems, it might be worth looking around, because other options exist and work just fine.
The problem of course is most people aren't going to deliberately produce an LL grammar. It's very easy to take your language into LR territory by accident and if you've centered your features around a particular LR-able process, converting to LL may be extremely hard or practically impossible.
LL can be augmented by backtracking approaches: parse it that way and if it doesn't work out, do a long return to such and such point, and try it another way.
It hasn't been historically popular because it's expensive. Particularly, when you do several such searches in a nested fashion, the combinations can explode.
That might not matter on today's hardware.
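The backtracking idea above can be sketched in a few lines (all names hypothetical): record the input position, try an alternative, and on failure do the "long return" by rewinding and trying the next one.

```python
# LL-with-backtracking sketch: save position, try, rewind on failure.

class Tokens:
    def __init__(self, toks):
        self.toks = toks
        self.pos = 0
    def next(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

def expect(toks, want):
    if toks.next() != want:
        raise SyntaxError(f"expected {want!r}")

def try_parse(toks, alternatives):
    for alternative in alternatives:
        saved = toks.pos              # remember where we were
        try:
            return alternative(toks)
        except SyntaxError:
            toks.pos = saved          # long return: rewind, try next branch
    raise SyntaxError("no alternative matched")

def ab(toks):
    expect(toks, "a"); expect(toks, "b")
    return "ab"

def ac(toks):
    expect(toks, "a"); expect(toks, "c")
    return "ac"

print(try_parse(Tokens(["a", "c"]), [ab, ac]))   # -> ac
```

The cost the parent comment mentions shows up when alternatives like these nest: each level may re-scan the same tokens, and the retries multiply.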
The fix against LR features creeping in is not to write a grammar. Just write the code. If you have some new idea, code it in the recursive descent parser. There is a risk that by doing that, you break something which used to parse (because you hijacked the prefix by which it was recognized). You have to have a test case for that; don't add anything to the parser without tests.
LL(1) or LL(k) parsing relies on identifying the construct by looking at a few prefix tokens of the input. So when you extend the language, you have to think about where in that prefix space you stick the new features.
If you make the language a Lisp, you have a set blueprint for how to avoid LR(1). ANSI Common Lisp defines the actual algorithm by which code is parsed, because it is a user-reprogrammable part of the run-time. The reader follows a read table, which dispatches functions based on what character (or in some cases what pair of characters) is seen. There are subtleties: some characters are token constituents while some are terminating (if a terminating character is seen in the middle of a token, that token is delimited, and the character is not part of the token). For instance, the ' character, in the standard read table, is tied to a function which will read the following object X and produce the object (quote X). The character is terminating, so that when we have abcd', an abcd symbol token will be recognized and come to an end because of the '. Then the function for ' will be dispatched to read the quote.
After taking a compiler course in uni I found the emphasis on dealing with syntax mostly a waste of time. To begin with, do yourself a favor and use S-expression syntax (like Lisp) for your language. They're dead simple to parse. With the syntax out of the way, you can get to meat and potatoes of implementing a language. Later on you can always define a "look" for your language, and you can spend an inordinate amount of time on that.
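To back up the "dead simple" claim, here is a minimal S-expression reader (a sketch in Python): tokenize parens and atoms, then build nested lists recursively.

```python
# Minimal S-expression parser: the whole grammar fits in a dozen lines.
import re

def tokenize(src):
    # Parens are their own tokens; everything else between
    # whitespace/parens is an atom.
    return re.findall(r"[()]|[^\s()]+", src)

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(parse(tokens))
        tokens.pop(0)   # discard ")"
        return lst
    return tok

print(parse(tokenize("(+ 1 (* 2 3))")))
# -> ['+', '1', ['*', '2', '3']]
```

That's essentially the entire front end; everything after this is semantics, which is the point of the comment above.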
> After taking a compiler course in uni I found the emphasis on dealing with syntax mostly a waste of time. To begin with, do yourself a favor and use S-expression syntax (like Lisp) for your language. They're dead simple to parse. With the syntax out of the way, you can get to meat and potatoes of implementing a language. Later on you can always define a "look" for your language, and you can spend an inordinate amount of time on that.
I maintained the same attitude for years.

I've changed my mind now. Any feature I want already exists in some programming language. The only distinguishing feature I can offer when designing a new programming language is "very readable", so syntax matters more than I used to think.
Whether you get traction or not depends a lot on syntax - if your syntax is too much of an outlier compared to mainstream languages, your features don't matter.
My suggestion was more for people who are starting out developing a programming language. It's the over-emphasis on the syntax so early on I find to be suboptimal. I also find that slapping on a more human readable or aesthetically pleasing syntax afterwards to be relatively easy.
I feel like you're discounting the importance of syntax. Python is as popular as it is because the syntax is so simple. Lisp isn't as popular because prefix notation doesn't make intuitive sense to beginners. I would argue that syntax is probably the most important feature of a language.
The problem I find with parenthesized languages is the amount of visual noise coming from the parentheses. If you're learning how to write your first interpreter or compiler, I kind of assume you already know how to program, and understanding S-expressions is not too difficult. Personally, I would have liked to learn and implement closures, continuations, macros, garbage collection, type systems, etc. earlier, and only then go deeper into syntax.
Python languished in obscurity for about a quarter century before taking off in popularity. The tide didn't turn because of changes in syntax.
Python's syntax inspires imitations that you will never hear about in spite of their syntax.
C doesn't have beginner-friendly syntax, yet it succeeded over languages like Pascal and Modula. C++ enjoys great popularity in spite of a mind-numbingly large grammar full of arcane syntax, which grows worse as time passes. There is a lot of interest in Rust, which has tons of bizarre syntax to learn.
Poorly-considered, simplistic hypotheses connecting syntax with popularity simply don't hold up to even a modest amount of scrutiny.
I contend that your hypothesis is equally simplistic and poorly considered. How would Python get an active community and ecosystem in the first place, if it languished for a quarter of a century?
> C++ enjoys great popularity in spite of a mind-numbingly large grammar full of arcane syntax, which grows worse as time passes. There is a lot interest in Rust, which has tons of bizarre syntax to learn.
C++ becomes less popular as its syntax becomes more arcane. Rust is a recent language and its long-term survival remains to be seen. Languages are like memes: they come and they go. BASIC used to be the beginner language; now it's Python. Lua is popular due to its syntax and ease of embedding. We'll see what simplistic whitespace syntaxes come out within the next 25 years. I predict many languages which exist today will fall by the wayside while newer ones replace them.
> How would Python get an active community and ecosystem in the first place, if it languished for a quarter of a century?
Development on Python started in 1989 and it was evidently released in 1991. The popularity didn't take off until well past Y2K. During all those intervening years, it had a small number of users, characteristic of an unpopular language.
If MicroPython is Python, I disagree. It reported a syntax error wherever there was one, and indentation makes it very hard to distinguish code blocks, especially for beginners, and especially on Windows, where the font in the interpreter is not proportional.
Kind of related, for anyone curious about parsing and JS: I have to recommend peggy for writing simple parsers for files to be consumed by JavaScript. Pretty niche, but it does it so well. I've developed a few packages using it so far.
[1] https://youtu.be/agw-wlHGi0E?si=n-ann0TYjvZ45ie5&t=4145
edit: added a few clarifying notes