
MLs are much better than the average blub for writing compilers, but they still have some issues. The worst is the amount of boilerplate code needed to implement even the smallest pass over an AST. Say you want to visit and rewrite all the identifier nodes: you still have to implement recursive functions for all the node types. In Haskell a bit of Scrap Your Boilerplate magic relieves this somewhat, but there is still a long way to go to reach the convenience of Nanopass.
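
A minimal OCaml sketch of what I mean (toy AST, hypothetical names): the pass only cares about Ident, yet every constructor still needs its own recursive clause.

  (* Toy AST: only the Ident case does real work in the pass below,
     but every constructor still needs a clause of its own. *)
  type expr =
    | Ident  of string
    | IntLit of int
    | BinOp  of string * expr * expr
    | If     of expr * expr * expr
    | Let    of string * expr * expr

  let rec rename f = function
    | Ident x          -> Ident (f x)      (* the one interesting case *)
    | IntLit n         -> IntLit n         (* everything below is noise *)
    | BinOp (op, l, r) -> BinOp (op, rename f l, rename f r)
    | If (c, t, e)     -> If (rename f c, rename f t, rename f e)
    | Let (x, e, b)    -> Let (x, rename f e, rename f b)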

The next issue is related: slightly amended AST types are hard to define, because they cannot be derived from an existing one by a simple rewrite.

Also, MLs do not offer an easy way to handle metadata transparently.

What I really want is an ML type system and Nanopass density combined in one language (I'm working on it; it's not done yet).



There are many ways to avoid the boilerplate you mention. The deriving ppx extension is one of them. I personally used my own poor man's deriving, called ocamltarzan, in pfff to factor out boilerplate with auto-generated visitors. See https://github.com/facebook/pfff/blob/master/lang_php/parsin...
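
For instance, with a deriving-style visitors ppx (a sketch, assuming François Pottier's visitors package; the type and names are made up), you generate the whole traversal and override only the case you care about:

  type expr =
    | Ident of string
    | Add   of expr * expr
  [@@deriving visitors { variety = "map" }]

  (* override only the interesting case; the rest is generated *)
  let rename f ast =
    let v = object
      inherit [_] map
      method! visit_Ident _env x = Ident (f x)
    end in
    v#visit_expr () ast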


Yes, it looks much better than an ad hoc ML. As I said, tricks like Scrap Your Boilerplate are very useful, although they're still not as dense as Nanopass.

While a straightforward depth-first visitor is relatively easy to infer, more complex strategies (e.g., the one you'd need to resolve lexical scoping) are still a bit messy. And if your visitor builds a new AST out of the old one, things get much trickier.


You can usually generate the visitor function if your ML has a decent macro language (as OCaml does).

I'll give you the one about amended ASTs, though. I once had to write an HTML/CSS processor. It started with the HTML DOM tree and performed successive processing stages on it, each stage adding a little extra data and some extra node types to the tree. I had to tediously write an html type, an html1 type, an html2 type, and so on. I never did find out whether there's a better way to do this.


I don't know that anyone has really solved that problem:

http://lambda-the-ultimate.org/node/4170

Honestly, maybe the solution is just to look at the problem a different way, and use an attribute grammar instead, like Ted Kaminski mentioned.


I think I found a relatively sane solution to the expression problem, though it employs quite a weird type system. The trick is to use a permissive (nearly "dynamic") first typing pass built on "weak set" unification (a special kind of Prolog variable that unifies with any other weak set, merging their values), followed by a second typing pass that checks whether the union types inferred this way are isomorphic to the expected declarations.

This way, tagged unions can be expanded, cut down, and rewritten using simple rules - all the stuff you can do in Nanopass, but strictly typed, with Hindley-Milner type inference.
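
(Not the same mechanism, but for a rough flavor in plain OCaml: polymorphic variants already let one tagged union extend another without redeclaring it, and embedding into the wider union is a free coercion. Node names below are made up.)

  type base  = [ `Ident of string | `IntLit of int ]
  type typed = [ base | `Typed of base * string ]    (* base, expanded *)

  let embed (e : base) : typed = (e :> typed)        (* widening is free *)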


Asking as a complete layperson: are there any other strongly typed languages that don't have this issue when parsing an AST and adding metadata during subsequent passes?


Well, in ML you can easily create an AST type that lets you add metadata as you go, by using option types. The problem is that what you really want is multiple representations of the AST, each supporting only the data relevant to its pass.
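
Something like this (hypothetical node names), with one tree whose metadata slots start empty and get filled in by later passes:

  type ty = TInt | TBool

  (* ty is None right after parsing; the typing pass fills it in *)
  type expr = { node : node; ty : ty option }
  and node =
    | Ident  of string
    | IntLit of int
    | Add    of expr * expr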

You could potentially use macros to generate the different versions of the AST, but that still leaves you with multiple ASTs to deal with.

I suppose you could parameterize your type with whatever metadata should hang off it, which means you only have to create one tree. But then pattern matching on it becomes a little messier, and the type is a bit more complicated.
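
That parameterized version might look like this (again hypothetical), with the annotation as a type parameter, so parsing yields a unit expr and the typing pass a ty expr:

  type 'a expr =
    | Ident  of 'a * string
    | IntLit of 'a * int
    | Add    of 'a * 'a expr * 'a expr

  (* unit expr after parsing, ty expr after type checking;
     every match now has to carry the extra annotation around *)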

So it's not that the problem doesn't have any solutions at all, it's just that none of them are totally satisfactory.

The reason you don't have this problem in untyped languages is that you aren't trying to type them :). The first solution I mentioned, using option types, is effectively how you would handle it in a dynamic language.


Exactly: the best compilation technique involves dozens of small passes, ideally with a new AST after every pass. But rewriting the whole tree that many times is error-prone and clumsy in general, and if you add a new variant to the first AST you may end up amending dozens of the downstream ones.

My ideal approach is the one of Nanopass (or MBase, which was developed independently), where you can derive an AST from an existing one using simple rules.

For example, my usual trick for a typing pass is to enumerate all the expression nodes, giving them unique ids (that will become free variables in the type equations). The new AST is usually generated with a couple of lines of code:

  ast typed: simple (simpleexpr -> osimpleexpr) recform {
    simpleexpr = t(ident:ttag, osimpleexpr:e);
  }
And the rest of the AST can be very complex and elaborate, but we only wanted to rewrite `simpleexpr` nodes here and nothing else.


What about Lisps, then? They don't have the type system, but you probably win on verbosity. I thought Racket and other Lisps had some static typing/ADT support too, but I'm not familiar with them.

I first heard the term "nanopass" from this paper [1], which is done in Scheme:

https://scholar.google.com/scholar?cluster=18108599075280184...

[1] Sarkar, Dipanwita, Oscar Waddell, and R. Kent Dybvig. "A Nanopass Framework for Compiler Education."


> What about Lisps then?

That's what I'm using at the moment, but I really miss the strict typing. It's often hard to debug even simple passes when they construct invalid trees; a type system would have caught that easily.

For this reason I'm working on a hybrid DSL which combines the power of a Nanopass-like visitor language with Hindley-Milner type inference, plus a little twist to overcome the expression problem - along with bells and whistles like enforced totality combined with locally allowed imperative features (updating hash maps, accumulating data destructively in lists, etc.).


This article covers setting up Common Lisp with GADTs and exhaustive pattern-matching checks. https://chriskohlhepp.wordpress.com/metacircular-adventures-...

> We note that we get a non-exhaustive match warning at compile time, not run time. This is the behaviour we would expect for example from OCaml. Indeed, in our little game we might change the simple use of keyword symbols to algebraic data types and thus add compiler hosted verification of our pattern matches. To bring our exploration of the meta-circular abstraction to a full circle: Abstract data-types are an ideal medium in which to represent Abstract Syntax Trees or ASTs. Indeed the whole of Lisp syntax is itself fundamentally an Abstract Syntax Tree. So let’s see if we can represent trees using the cl-algebraic-datatype package. That would amount to an Abstract Syntax Tree represented inside an Abstract Syntax Tree. We will close with that little teaser.


Is the DSL embedded in a Lisp or is it an entirely new language? Isn't there a lot of existing work on type systems and ADTs in Lisps? e.g.

http://docs.racket-lang.org/ts-guide/types.html#%28part._.Un...

https://www.reddit.com/r/lisp/comments/16vwib/which_lisps_fe...

I ask because I have run into the typing problem writing language tools in Python, and then I switched to OCaml. I am not that experienced with it, but I see the problem you're talking about. I'm wondering if I should go to a Lisp. Racket looks interesting and this seems like something the community cares about.


Yes, it is an eDSL on top of Lisp (but I also have a proper ML that runs on top of Lisp, so it does not matter much).

The issue with the existing type systems is that they do not solve the expression problem, so I had to build a new one from scratch.

I'd suggest taking a look at Nanopass (there is a Racket port). I doubt it will play well with any external type systems, but at least some of the ideas in it are very valuable.


I proposed integrating the FLINT ML compiler and VLISP in some form for a stronger correctness argument, ML safety, and LISP power. The idea was several copies of the software, corresponding to the strengths of each, catching various problems and assuring the correctness of the tool itself.

It never occurred to me to embed ML in LISP. Looks so obvious in hindsight. You might be onto something there. Integrating yours with mine, if I get back to using those, might mean embedding CakeML or an OCaml subset in VLISP, with it emulated in Racket or a CL knockoff of it for fast, day-to-day development.

Btw, what's the expression problem? I've either forgotten or not got that deep into PL design yet.


Originally, ML was implemented in Lisp. It was the metalanguage for a theorem prover, HOL:

https://en.wikipedia.org/wiki/HOL_(proof_assistant)

An older ML (not something like SML) implemented in Lisp is still available in some old versions of that prover.


Interesting. So, basically taking it full circle, back to where it came from. That is, if one still uses it with HOL4 for verification.


The term was coined by Philip Wadler: https://en.wikipedia.org/wiki/Expression_problem
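
In short: with a variant type, adding a new operation over the type is easy, but adding a new variant forces you to edit every existing function; in OO languages the tension is reversed. A tiny OCaml illustration:

  type expr = Lit of int | Add of expr * expr

  let rec eval = function
    | Lit n      -> n
    | Add (a, b) -> eval a + eval b

  (* Adding a Mul variant means touching eval and every other match
     in the program; adding a new function (say, a pretty-printer)
     touches nothing existing. *)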

The older versions of Bigloo used to contain an ML compiler built on top of Scheme (they dropped it later for some reason), so the concept is not new.


Appreciate the reference. I will look into it in the future. The Wikipedia summary made sense, but the recompile part doesn't bother me as much as it does others. My preference is to make compiles as fast as possible with good incremental compilation and caching. I've even pushed for trying to FPGA-accelerate compiler passes where possible.


My solution allows incremental compilation. It is just a type system hack, and it does not even involve backtracking (my second favourite compilation technique) - just two typing passes instead of a single one.

As for compilation speed, the Nanopass-like approach is very parallelisable, and there is a lot of room for nice optimisations (like pass fusion).


Curious, do you publish your tools that implement your solution?


Only partially (see the discussion elsewhere in this thread). It's a work in progress. But I hope to be able to roll out a working prototype soon.

The main component of this type system (weak set unification) is already published.


Cool, cool.


IIRC, you're into both tech like ML and hardware. I know of Lisp machines and even a G-machine (Haskell) emulator on a Forth chip. However, given the verification and non-Turing push, I was looking for a CPU that runs algorithms written in Standard ML. I found one and thought you'd enjoy reading it too:

http://cstein.kings.cam.ac.uk/~chris/part2/eb379.pdf


Thanks! An interesting one. It's unusual to see an SKI-based implementation of an eager language.





