Big shout out to "Writing an Interpreter In Go". I followed along for about a month. The code is clearly written and tested, and the book is practical while still explaining the theory along the way. Now I feel confident enough to write a Lua interpreter in Go (considering Lua is simple enough).
There is a popular saying (especially amongst The Olds) that "you don't understand programming until you understand pointers." I found it to be the opposite: I didn't understand pointers until after I had been programming for years and then implemented my first interpreted scripting language.
I think learning to write interpreters is one of those fundamental skills that everyone should learn at some point in their programming careers. Not because there's a good chance they'd create a popular language out of it (though who knows, I definitely think we could use some new ideas in PLT), but because it removes almost all of the "black magic" feel of programming.
You see arguments for days about "OOP vs FP" and other such faff. And once you've implemented OOP and implemented FP in your own language, it's all just... window dressing. It's all the same thing, just different ways of slicing it. There's nothing inherently evil about one or the other.
I second your opinion that learning to write an interpreter is a great idea at some point in a career. It definitely helped me gain a new level of understanding and, well, confidence in my knowledge of how pretty much any language “works”. It makes everything more interesting, too!
Random question from someone who is self-taught and doesn't know better: is there a significant difference in this saying about learning interpreters vs learning to write your own _compiler_? Or are the two functionally equivalent for learning purposes?
I'd say there's probably a half to two-thirds overlap between the two. You'd still need to do all the same lexing and parsing, and you'd still need to create a model of execution for your language.
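To make the "lexing" part concrete, here's a minimal sketch in Go (the language the book above uses). The token names and structure are invented for illustration, not taken from the book:

```go
package main

import (
	"fmt"
	"unicode"
)

// TokenType and the constants below are hypothetical token kinds for a toy language.
type TokenType string

const (
	IDENT  TokenType = "IDENT"
	NUMBER TokenType = "NUMBER"
	PLUS   TokenType = "PLUS"
)

type Token struct {
	Type    TokenType
	Literal string
}

// Lex turns source text into a flat list of tokens; a parser would then
// build a syntax tree out of them, and an evaluator (or code generator)
// would walk that tree.
func Lex(src string) []Token {
	var toks []Token
	runes := []rune(src)
	for i := 0; i < len(runes); {
		switch r := runes[i]; {
		case unicode.IsSpace(r):
			i++
		case r == '+':
			toks = append(toks, Token{PLUS, "+"})
			i++
		case unicode.IsDigit(r):
			start := i
			for i < len(runes) && unicode.IsDigit(runes[i]) {
				i++
			}
			toks = append(toks, Token{NUMBER, string(runes[start:i])})
		case unicode.IsLetter(r):
			start := i
			for i < len(runes) && (unicode.IsLetter(runes[i]) || unicode.IsDigit(runes[i])) {
				i++
			}
			toks = append(toks, Token{IDENT, string(runes[start:i])})
		default:
			i++ // skip anything this toy lexer doesn't understand
		}
	}
	return toks
}

func main() {
	fmt.Println(Lex("x + 42")) // [{IDENT x} {PLUS +} {NUMBER 42}]
}
```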
But the primitive substrate of the machine (be it real or virtual) turns implementing that execution model into a puzzle all its own. You might implement an array in your interpreted language as just an array in your host language. You might implement an object as just a HashMap from field names to values in your host language. Interpreters don't have to build their own interfaces to the operating system; they reuse the interfaces of the host language they're written in. A compiled language, on the other hand, will need some way of its own to execute system calls.
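As a hypothetical Go sketch of that "just reuse the host language's data structures" point (the `Object` interface and type names here are invented for illustration):

```go
package object

// Object is a hypothetical interface that every value in the guest
// (interpreted) language satisfies.
type Object interface {
	Inspect() string
}

// Array: a guest-language array is just a Go slice of guest values.
type Array struct {
	Elements []Object
}

func (a *Array) Inspect() string { return "<array>" }

// Instance: a guest-language object is just a Go map from field names
// to guest values -- the "HashMap from field names to values" idea.
type Instance struct {
	Fields map[string]Object
}

func (in *Instance) Inspect() string { return "<object>" }
```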
Like, printing some text to a terminal. Say we were implementing an interpreted language in C#. We get to the point of writing our own "print" function for our own language. We'll probably end up creating some kind of translation from our language's print format to a call to System.Console.WriteLine in C#. You might even like the format of Console.WriteLine, so you might just make it a straight one-to-one translation and call it a day.
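Here's the same idea sketched in Go rather than C# (to keep one language across the sketches here): the guest language's `print` becomes a thin wrapper around the host's own printing function. Everything below is made up for illustration, not the book's actual code:

```go
package main

import "fmt"

// Object is the same hypothetical guest-value interface as in the sketch above.
type Object interface {
	Inspect() string
}

// String is a guest-language string value.
type String struct{ Value string }

func (s *String) Inspect() string { return s.Value }

// builtins maps guest-language function names to host-language functions.
// The guest's "print" is close to a one-to-one translation to fmt.Println.
var builtins = map[string]func(args ...Object) Object{
	"print": func(args ...Object) Object {
		for _, a := range args {
			fmt.Println(a.Inspect())
		}
		return nil
	},
}

func main() {
	// Roughly what evaluating the guest program `print("hello")` would do.
	builtins["print"](&String{Value: "hello"})
}
```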
But if you're writing a compiled language, you'll need to know how the operating system you're running on expects to receive and execute commands. That's a whole other thing. To grossly oversimplify, it largely means getting a bunch of bytes, in the right format and order, into memory and then executing a specific CPU instruction. Want to allocate memory? There's another blob-of-memory-plus-execute-an-instruction interface you'll need to adhere to. Want to open a network socket? Same. And modern operating systems provide a lot of functionality.
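For a feel of what "a blob of memory plus a specific instruction" looks like from the host side, here's a Linux-only Go sketch that bypasses fmt and makes the write syscall directly. It's only meant to show the shape of the OS interface, not how a compiler would actually emit code:

```go
package main

import (
	"syscall"
	"unsafe"
)

func main() {
	msg := []byte("hello from a raw write(2) call\n")

	// At the OS level, "print to the terminal" on Linux means: put the
	// syscall number, the file descriptor, the buffer address, and the
	// length where the kernel expects them, then execute the syscall
	// instruction. Go's syscall package does that register setup for us.
	syscall.Syscall(
		syscall.SYS_WRITE,
		uintptr(1),                       // fd 1 = stdout
		uintptr(unsafe.Pointer(&msg[0])), // pointer to the bytes
		uintptr(len(msg)),                // how many bytes to write
	)
}
```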
But also, a lot of that work is kind of grunt work. There are certainly ways you can design a language that make it more difficult to implement as a compiled language than an interpreted one (dynamic typing, for example). Let's gloss over that issue. All else being equal, the language portion of the work being done in interpreters versus compilers is largely the same.
This way of describing compilers seems to imply that a compiler must emit machine specific assembly code, which seems overly narrow. A different way to think about compilers vs interpreters is that compilers are programs that read source code as input and generate an executable artifact as output while interpreters are programs that read source code as input and then, as a side effect, perform the instructions within the interpreter process.
Note that, taking this broader definition of compilers, it is not necessary for a compiler writer to target the host architecture or learn about syscalls. Many languages have a non-native target, e.g. TypeScript (JavaScript), Scala (JVM) and F# (.NET), but we still call the programs that translate source written in these languages to the target executable format compilers.
Going from an interpreter to a transpiler, which I personally consider a compiler, can be an almost trivial step. Let's assume there is already an interpreter for the language and that it is implemented as a giant switch statement over the op code of each instruction. Given an arbitrary target language in which all of the interpreter's required instructions have a concrete representation, one could write a transpiler to that target language by replacing the right-hand side of each case in the switch with code that appends to a source file in the target language (there'd also generally need to be some surrounding boilerplate to do things like import required headers).
In practice, these days it is quite common for languages to transpile to C, LLVM IR, the JVM, or JavaScript. Even if one does want to emit their own machine code, it would still probably make sense to first target something simpler and not waste time on the low-level details of language features that may or may not prove useful (or the language itself may not prove useful). Again, going from interpreter -> transpiler can be a simple step. It is not unrealistic to write a useful transpiler in a day, particularly if you make the language syntax very simple and/or use a parser generator.
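A toy sketch of that interpreter-to-transpiler step, in Go, with invented opcodes: the two functions share the same switch, but one case body executes the instruction while the other appends equivalent source in the target language (also Go here, to keep the sketch small):

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical opcodes for a tiny stack machine.
const (
	OpPush = iota
	OpAdd
	OpPrint
)

type Instr struct {
	Op  int
	Arg int
}

// Interpret executes the program directly: one switch case per opcode.
func Interpret(prog []Instr) {
	var stack []int
	for _, in := range prog {
		switch in.Op {
		case OpPush:
			stack = append(stack, in.Arg)
		case OpAdd:
			a, b := stack[len(stack)-2], stack[len(stack)-1]
			stack = append(stack[:len(stack)-2], a+b)
		case OpPrint:
			fmt.Println(stack[len(stack)-1])
			stack = stack[:len(stack)-1]
		}
	}
}

// Transpile has the same switch, but each case appends target-language
// source instead of executing the instruction.
func Transpile(prog []Instr) string {
	var b strings.Builder
	b.WriteString("package main\n\nimport \"fmt\"\n\nfunc main() {\n\tvar stack []int\n")
	for _, in := range prog {
		switch in.Op {
		case OpPush:
			fmt.Fprintf(&b, "\tstack = append(stack, %d)\n", in.Arg)
		case OpAdd:
			b.WriteString("\tstack = append(stack[:len(stack)-2], stack[len(stack)-2]+stack[len(stack)-1])\n")
		case OpPrint:
			b.WriteString("\tfmt.Println(stack[len(stack)-1]); stack = stack[:len(stack)-1]\n")
		}
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	prog := []Instr{{OpPush, 1}, {OpPush, 2}, {OpAdd, 0}, {OpPrint, 0}}
	Interpret(prog)            // runs the program directly: prints 3
	fmt.Print(Transpile(prog)) // emits an equivalent Go program as text
}
```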
> A different way to think about compilers vs interpreters is that compilers are programs that read source code as input and generate an executable artifact as output while interpreters are programs that read source code as input and then, as a side effect, perform the instructions within the interpreter process.
I think that's a very fair definition, and one I agree with completely. But I also admitted that I was grossly oversimplifying, which I thought was necessary given the stated background of the person I was responding to. As you pointed out, the step from interpreter to transpiler is almost trivial. My goal was to attempt to describe the much less trivial portions of the work without getting too bogged down in details.
But you make good points about transpilers that I probably should have mentioned. Lots of very good, very valuable work has been done with languages that have not gone all the way to emitting CPU-specific op codes.
From your compiler's perspective you shouldn't be messing around with all that; you should have an abstraction that lets you say "pass a by value, b by reference, and c as an out parameter using calling convention X". As long as malloc and opensocket or whatever use the same calling convention, all of the actual byte layout is a one-time effort.
I'd say that either one is good. Writing your own compiler, depending on how you do it, will get you into assembly and machine language, which will benefit your understanding of the lowest levels of computers, and of the low-level mechanisms behind concurrency and operating systems.
You can always do a second project with your interpreter, where you compile down to bytecode and then JIT the bytecode to machine language.
Funny enough, I started one a few weeks ago too. I wanted to do some basic automatic type checking. Though I will probably never finish it. https://github.com/codabrink/tolkien
The first parser that one writes for a real language is a big accomplishment. I remember the one I wrote for C89 back in high school, which was more annoying than many because it requires writing a preprocessor as well. All of the theory and toy language parsers just don't compare to being able to parse a real language. It's a baptism of fire, and it feels great afterward.
In my professional career, I have written dozens of domain-specific languages for dealing with things like asynchronous protocols, complex configuration, and even safer code generators. It's a great skill to have, and it has greatly contributed to building real products that just wouldn't be possible or wouldn't be built to such exacting quality otherwise.
I'm most familiar with specifying grammars for the ANTLR tool. Just out of curiosity, is an ANTLR grammar a parsing expression grammar? Does anyone know?
Separately, I'm curious about how a parser generated out of ANTLR/yacc etc compares to this hand written one? If you've done some benchmarking, it'd be great to see numbers!
At least according to Wikipedia, ANTLR accepts a CFG (context-free grammar), while a PEG (parsing expression grammar) is not a CFG; specifically, the choice operator in a PEG is ordered and selects the first alternative that matches.
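A tiny illustration of that ordered-choice behavior, written as a hand-rolled Go combinator rather than actual ANTLR or PEG syntax (all names here are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// parseFn tries to match a prefix of the input; ok reports success and
// rest is whatever input remains.
type parseFn func(input string) (rest string, ok bool)

// literal matches a fixed string at the start of the input.
func literal(s string) parseFn {
	return func(input string) (string, bool) {
		if strings.HasPrefix(input, s) {
			return input[len(s):], true
		}
		return input, false
	}
}

// orderedChoice mimics PEG's "/" operator: it commits to the FIRST
// alternative that matches and never reconsiders the others.
func orderedChoice(alts ...parseFn) parseFn {
	return func(input string) (string, bool) {
		for _, alt := range alts {
			if rest, ok := alt(input); ok {
				return rest, true
			}
		}
		return input, false
	}
}

func main() {
	// keyword <- "if" / "ifelse"   (a deliberately bad ordering)
	keyword := orderedChoice(literal("if"), literal("ifelse"))
	rest, ok := keyword("ifelse x")
	// "if" matches first, so "ifelse" is never even tried.
	fmt.Println(ok, rest) // true "else x"
}
```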
Many of them are incomplete, but they give a good idea of what a grammar for ANTLR looks like.
I used the ANTLR grammar for parsing CSS in DiffLens (https://github.com/marketplace/difflens). DiffLens uses the TypeScript compiler itself to parse TS and JS.
I've also used Roslyn to parse C# (and F#). I wonder if the Ruby compiler itself exposes a parser.
Seconding this recommendation. Bob (the author) did such an amazing job with that book. I proudly display it in my office. The illustrations alone are worth the cost.