I've written a couple of interpreters now, and even a simple maths-compiler that generated assembly-language output.
One of the things that made me postpone these projects for years was seeing so many parser/lexer examples that just evaluate simple expressions such as "3 + 4" or "3 + 4 * 5". Going from there to a real language with functions, conditionals, etc., is a big step with no guidance.
Still, this is a nice project, and it is well-documented even if it is "small".
Indeed. There's not even a meaningful line between compiler and transpiler. All compilers take code in one format and output it in another, and there are plenty of widely-accepted compilers that output C code or assembly, rather than machine code.
As far as I can tell, you are saying "just an interpreter" and referring to this as an "AST printer" because the output language is not machine code.
I'm not sure why that would be so significant. Don't get me wrong, I'm sure targeting some real machine code presents plenty of unique difficulties—but the interesting thing about compilation to me (and I'd wager a significant proportion of other readers here) is effective ways of structuring translation processes.
In the case of this project, the goal is to be a minimal demonstration, distilling the essence of how a compiler is structured. The fact that the output language is not machine code doesn't affect the essential structure of a compiler (as far as I know).
What most compiler tutorials are missing is coverage of intermediate representations. There are a billion parsing and calculator tutorials, but not nearly as many that show the process of turning, say, C-style code, into some sort of intermediate format, before emitting machine code.
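For what it's worth, here is a very rough sketch of that missing middle step, in TypeScript with invented names (nothing like a real compiler's IR): take the expression AST the calculator tutorials stop at and flatten it into a list of instructions over temporaries, before any machine code enters the picture.

    // Expression AST: the point where most calculator tutorials stop.
    type Expr =
      | { kind: "num"; value: number }
      | { kind: "add" | "mul"; left: Expr; right: Expr };

    // A flat, three-address-style instruction: dest = op(args).
    type Instr = { dest: string; op: "const" | "add" | "mul"; args: (string | number)[] };

    let counter = 0;
    const freshTemp = (): string => `t${counter++}`;

    // Lower a nested expression into a linear instruction list over temporaries.
    function lower(e: Expr, out: Instr[]): string {
      if (e.kind === "num") {
        const dest = freshTemp();
        out.push({ dest, op: "const", args: [e.value] });
        return dest;
      }
      const left = lower(e.left, out);
      const right = lower(e.right, out);
      const dest = freshTemp();
      out.push({ dest, op: e.kind, args: [left, right] });
      return dest;
    }

    // Lowering 3 + 4 * 5 produces, in order:
    //   t0 = const 3; t1 = const 4; t2 = const 5; t3 = mul t1 t2; t4 = add t0 t3

From a flat list like that, emitting machine code (or anything else) is a much smaller jump than it is straight from the tree.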
The AST is an intermediate format. Is it important to have an intermediate language? Otherwise why does the AST not qualify?
My understanding is that having something like LLVM’s IR, for instance, is part of its decoupling the compiler frontend/backend, but probably wouldn’t be desirable for a single language compiler. But maybe I’m mistaken, or you are referring to something else :)
Not OP, and I agree: "IR" is just the data structure that holds the AST, or a human-readable version of it.
That said, I think what they're getting at is that the interesting bits of modern compilers are the transforms between ASTs that lower from one IR to another, either to perform some kind of optimization or to replace an abstraction with an implementation before it gets to codegen.
For example if you have generators in your language it's pretty easy to see how to turn a "yield" statement into an AST node. But to actually make the system work you'll probably need a compiler pass over your AST to transform coroutine definitions to subroutine definitions and a state machine to represent the execution context and a constructor/destructor for the state.
The same goes for all interesting language concepts: in the compiler, the interesting bit is the pass that transforms the top-level AST/IR into more explicit IR, and the trip through that pipeline down to codegen. Which is as complex as everything else these days when it needs to be fast.
The example talks about this but doesn't dive too deep.
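To make that concrete, here's a hand-written before/after in TypeScript showing the kind of shape such a pass aims for; the lowered form is one possible encoding, heavily simplified, not the output of any particular compiler.

    // What the front end sees: a generator whose body contains "yield" AST nodes.
    function* countTo3() {
      yield 1;
      yield 2;
      yield 3;
    }

    // One possible result of the lowering pass, written out by hand: the coroutine
    // becomes an ordinary subroutine plus an explicit state machine that carries
    // the execution context between calls.
    function countTo3Lowered(): { next(): IteratorResult<number> } {
      let state = 0; // where to resume on the next call
      return {
        next(): IteratorResult<number> {
          switch (state) {
            case 0: state = 1; return { value: 1, done: false };
            case 1: state = 2; return { value: 2, done: false };
            case 2: state = 3; return { value: 3, done: false };
            default: return { value: undefined, done: true };
          }
        },
      };
    }

Local variables that live across a yield would also have to move into that captured state, which is where most of the real work in such a pass goes.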
Not really agreeing on this one. By IR, people mean a middle ground between various semantics. Parsing gets you an abstract structure of the semantic world you typed in. An IR would be something bridging the gap between concepts and register machines, or whatever.
I'm sure someone will disagree with me, but to me, the point of IR is that it's a concretization, not an abstraction. Consider that your source code can look like
    int foo(void);
or
    class C;
or
    template<class T> struct vector;
then you see that all of these translate into an AST, but don't result in any code getting generated. Conversely, given a template definition for vector, vector<int> and vector<string> would result in multiple intermediate representations for the exact same chunk of AST.
Calling the parsed representations of these "intermediate representations" when they're not corresponding to generated code would render the term practically useless. You might as well call the source code itself IR at that point and claim IR has no value.
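A toy way to picture that asymmetry, in TypeScript with everything hypothetical and string-based (nothing like a real C++ front end):

    // The template exists exactly once in the AST; parsed on its own it
    // generates no code at all.
    type TemplateAst = { name: string; typeParam: string; body: string };
    type IrUnit = { mangledName: string; body: string };

    const vectorAst: TemplateAst = {
      name: "vector",
      typeParam: "T",
      body: "struct { T* data; size_t len; }",
    };

    // Each use at a concrete type produces its own intermediate representation.
    function instantiate(def: TemplateAst, typeArg: string): IrUnit {
      return {
        mangledName: `${def.name}<${typeArg}>`,
        body: def.body.split(def.typeParam).join(typeArg),
      };
    }

    const ir: IrUnit[] = [instantiate(vectorAst, "int"), instantiate(vectorAst, "string")];
    // Two IR units from the same chunk of AST; the bare declaration alone yields zero.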
That's not the point. A compiler has many phases; tokenizing/parsing/building the AST is just the first one, and for many languages the easiest of them all, especially if performance is not a concern.
The JS example does everything from scratch, so to do the same I wrote my own parser combinator library (based on https://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf), but you could use an existing library and get rid of `Parser.hs`.
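In case it helps to see the shape of such a library, here's a tiny sketch of the combinator idea, transposed into TypeScript for illustration (the Haskell version in the paper is tidier, and none of these names come from Parser.hs):

    // A parser takes an input string and either fails or returns a value
    // plus the unconsumed remainder.
    type ParseResult<A> = { value: A; rest: string } | null;
    type Parser<A> = (input: string) => ParseResult<A>;

    // Succeed without consuming anything (return/pure).
    const pure = <A>(value: A): Parser<A> => (input) => ({ value, rest: input });

    // Match a single character satisfying a predicate.
    const satisfy = (pred: (c: string) => boolean): Parser<string> => (input) =>
      input.length > 0 && pred(input[0]) ? { value: input[0], rest: input.slice(1) } : null;

    // Sequence two parsers, feeding the first result into the second (monadic bind).
    const bind = <A, B>(p: Parser<A>, f: (a: A) => Parser<B>): Parser<B> => (input) => {
      const r = p(input);
      return r === null ? null : f(r.value)(r.rest);
    };

    // Try the first parser; fall back to the second on failure (choice).
    const orElse = <A>(p: Parser<A>, q: Parser<A>): Parser<A> => (input) =>
      p(input) ?? q(input);

    // Zero or more repetitions (the parser must consume input to make progress).
    const many = <A>(p: Parser<A>): Parser<A[]> => (input) => {
      const out: A[] = [];
      let rest = input;
      for (let r = p(rest); r !== null; r = p(rest)) {
        out.push(r.value);
        rest = r.rest;
      }
      return { value: out, rest };
    };

    // Example: an integer literal.
    const digit = satisfy((c) => c >= "0" && c <= "9");
    const integer: Parser<number> = bind(digit, (d) =>
      bind(many(digit), (ds) => pure(parseInt(d + ds.join(""), 10)))
    );
    // integer("42+x")  ->  { value: 42, rest: "+x" }

An expression grammar then becomes a handful of definitions built from bind/orElse/many, which is roughly what the paper builds up to.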
For those interested in this approach, one of the authors (Hutton) of the paper has a great video explaining functional parsing
https://youtu.be/dDtZLm7HIJs
I wrote one in Haskell (https://news.ycombinator.com/item?id=22523088), but I think it might be understandable for others, since I don't use many Haskell-isms other than do-notation and typeclasses.
See the page source for the code. (Probably fun: old syntax for both HTML and JS, pre-ECMA JavaScript [meaning the semicolon isn't a delimiter but a separator], and even the indentation style is outdated. Mind that back then, even support for JS object literals wasn't guaranteed.)
I don't know but I don't think I want any part of it.
A compiler is a very foundational piece of software where performance is important: you run it many times, often on large inputs. It should be written in a language with a very high performance ceiling and quick start-up time, imho.
In this case I believe the "compiler" is an "AST printer".