Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
From AST to Lossless Syntax Tree (oilshell.org)
133 points by g0xA52A2A on Feb 12, 2017 | hide | past | favorite | 56 comments


Funny, I actually did the same thing for python years ago and also talked about "Lossless Syntax Tree" then start calling it a Full Syntax Tree since CST doesn't really fit; I wasn't aware of the existence of lib2to3 at that time.

My goal was different than your: I wanted make writing custom refactoring code (that's it: code that modify source code) and code that works on source code a do-able task.

I end up doing some design decisions that I haven't found elsewhere (but this field is hard to explore):

- producing json, because datastructure doesn't lie to you and potential interoperability

- nodes are responsible for the formatting within itself, in opposition with lib2to3 where a node is responsible for the formatting before itself (or after, I'm not sure anymore)

- the tree is design for the human brain instead of an interpreter/compiler (for example having list instead of recursive structures)

The project is called Baron https://github.com/pycqa/baron and was actually a mean for me to work on what really interest me: the abstraction that attempt to make writing custom refactoring a doable task https://github.com/pycqa/redbaron

Good luck with your project :)


Cool, I didn't know about those projects. I have been using sed for a long time to refactor Python!

I think the goals are actually similar -- refactoring is a style-preserving transformation that is similar to changing the language. At Google there is a bunch of work on converting C++ 03 to 11 to 14 to 17, etc. and they're using similar techniques with Clang. That's an example of a refactoring that's also changing the language.

I think it would be nice to make a "LST-aware sed". Once you have the parsers, I don't think it's that hard to add something like that on top. The parsers are of course a lot of labor!


I've used RedBaron before when building some toy languages.

Thanks for a great project!


Oh, can you give more details/links please? I'm really curious about that!


The code never became public, but I can give some details.

The idea was to create an easier C, using Python as an intermediary language.

RedBaron provided the AST and AST->Python layers.

Unfortunately, the project had to pivot, because optimising Python to be fast enough was overly difficult/unpredictable.


Adding a comment as a reminder to look up your project. Which sounds really interesting!


Never did this before but after reading your comment, I thought there must be a better way. Click the "n units ago" timestamp next to the commentors user name, there's a "favorite" option you can then access through your profile page "favorite stories / comments".


Cool I didn't know this was a common pattern. I recently saw the same approach implemented in scala.meta [1] - it allows you to view the code as both parsed tokens with all syntax intact as well as more abstracted ASTs which only carry semantic meaning. Someone even built a code formatter called scalafmt[2] like the author mentions.

Its a really cool approach because I think we need to pay much more attention to making more/better structured data from the compiler available to tools.

[1] http://scalameta.org/tutorial/ [2] https://github.com/olafurpg/scalafmt


Back when we started scala.meta token-level granularity wasn't available in most metaprogramming frameworks. Clang [1] and Roslyn [2] seem to be the first two major industry-grade compilers that use this approach and re-use the compiler as the foundation for extensible tooling APIs.

[1] https://clang.llvm.org/

[2] https://roslyn.codeplex.com/


roslyn [1] uses red/green trees [2]. The lossless part, however, comes from the fact that the syntax tokens store associated trivia [3]

[1] https://github.com/dotnet/roslyn

[2] https://ericlippert.com/2012/06/08/red-green-trees/

[3] https://github.com/dotnet/roslyn/wiki/Roslyn-Overview#syntax...


From [2] - Incidentally, these are called “red/green trees” because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There’s no other meaning to the colours.

I love reading tidbits like that from other teams' work.


(author here)

Cool! Yes this is what I was getting at. I mentioned Clang and Microsoft's IDEs in the "related work" section. I haven't seen them describe their data structures very much.

Clang has some documentation, but it's not very clear. For example, this doc has a FIXME(describe design) on it.

http://clang.llvm.org/docs/LibFormat.html

I also think the Clang AST is absolutely enormous and thus hard to document.


We've not completely solved the problem of language complexity creeping in to the metaprogramming toolkit but we do have quasiquotes [1] to partially address the pain of ast constructors and destructors.

e.g. For code snippet "class C(x: Int)" Scala compiler create a tree that roughly resembles the following code:

    class C extends scala.AnyRef {
      <paramaccessor> private[this] val x: Int = _;
      def <init>(x: Int) = {
        super.<init>();
        ()
      }
    }
This tree looks even more terryfing under the hood in terms of AST constructors. Instead of making people figure out how that works we support nice high level sugar instead:

    q"class C(x: Int)"
Where q is a magic string interpolator that is compiled to generate an equivalent AST. It can be used for both construction of new AST nodes based on the older ones (we use $name syntax to substitute thing in) and also deconstuction of existing ones into smaller pieces via pattern matching ($name extracts parts out.)

[1] http://docs.scala-lang.org/overviews/quasiquotes/intro.html


OK interesting, Julia metaprogramming looks very similar to this, and I hope to take inspiration from it for oil:

http://docs.julialang.org/en/stable/manual/metaprogramming/

https://en.wikibooks.org/wiki/Introducing_Julia/Metaprogramm...

They use :(expr) or quote/end for quotation, and $var for interpolation. Elixir metaprogramming uses quote/end for quotation, and unquote for interpolation. (In fact the entire Elixir language appears to be done with AST metaprogramming, since it's on top of Erlang.)

I basically think of these schemes as "Lisp-like AST metaprogramming, but with Syntax". Thanks for the pointer on Scala.

I saw a video where people asked why Clang source tools generate textual changes rather than AST changes... and this is a good example. People for some reason think that ASTs are "cleaner" or more usable, but they can be a pain.


Have you looked at gofmt at all? I remember reading somewhere that it was a deceptively difficult project. Perhaps they weren't using the best data structure either?


I haven't yet but I plan to (this needs a wiki page). Part of the motivation here is to avoid writing two parsers. I believe that Go had a parser in C for compilation and then a parser in Go for reformatting, but I want to avoid that duplicate work.

I've looked a tiny bit at clang-format, but it's large and uses the large Clang AST.

And dartfmt too:

http://journal.stuffwithstuff.com/2015/09/08/the-hardest-pro...

The difference between gofmt and dartfmt is that Go doesn't impose any line length. And the Dart async style leads to a lot of chaining, which means more line break possibilities to consider.

I want to write an auto-formatter and it will be somewhere in the middle. It will support a max line length, so it probably needs these optimization functions like TeX and dartfmt, but the language doesn't have this async chaining pattern.

And now I have a pointer to scalafmt. Any other pointers are welcome!


"Part of the motivation here is to avoid writing two parsers."

Have you considered implementing a parser generator instead or taking the higher-order route and constructing a parser from higher-order primitives? Both ways provide a lot of flexibility to produce different trees for different problems, save a bit of time and a lot of code. Languages are mostly sequences, alternations, repetitions and characters anyway.


Yes, I'm thinking of writing my own parser generator / meta-language for the Oil language. Now that I have experience with OSH/bash, I can see what I need to support.

On the one hand, I know how to write the parser by hand for Oil. Now that I've written the parser for OSH, it's not too much work, and not too much code. The thing that might push me over the edge is handling "broken" code like Microsoft's Roslyn, but I may put that off for v2 and just get the shell working.

The first half of my blog is kind of about how nothing off the shelf like ANTLR or yacc will work. I will have to write my own.

There are posts there about parsing in a single pass, undecidable parsing, four mutually recursive sublanguages, lexer modes/lexical state, using two tokens of lookahead vs. arbitrary lookahead, and Pratt parsing for experssions.

http://www.oilshell.org/blog/

Oil won't be as complicated as OSH/bash, but it at least needs lexer modes and Pratt parsing.

I actually thought about trying a bounty / crowdsourcing for a meta-language that can express Oil in the fewest lines of code (with no dependencies). My claim is that no existing meta-language can handle it.


Iirc sean mcdirmid deals with non ast systems, so you interact at the user level when processing code (important to minimize changes so the user can track what's happening).


Technique as old as dirt. My thesis advisor used it in the 60s and 70s.

His favored approach had one artificial limitation, perhaps a remnant of the age: he limited the size of source files in bytes to some power of 2. This let him represent each token (incl. white space/new lines/comments) as a pair of fixed-size integers indexing into the file bytes. The tokenized file is an array of those pairs and the concrete syntax tree is a tree where leaf nodes are indices into the array of tokens.

Suitable for syntax-directed code generation, control-flow graph generation, static analysis, linting, pretty-printing. Super memory compact and even has an upper bound on memory footprint. The caveat is that the source language does need to support composing a "module" out of multiple files because of the limit on source file size.


I'm sure it's as old as dirt, but I don't think there is a good name for it. "Concrete Syntax Tree" is not a good name for the reasons pointed out in the article.

Do you have reference for this? I saw this problem mentioned in the write-up on the ZINC Abstract Machine by Xavier Leroy (author of OCaml). But I don't know of other papers that talk about this.

Also I believe that most open source tools do NOT have this functionality. Look at lib2to3. It's bolted on -- not exactly a clean design. Most open source front ends are not designed for tooling like Clang is.


The ANTLR folks call it a "parse tree".


Please read the article. I specifically mention ANTLR, parse trees, and why the lossless syntax is not a parse tree / concrete syntax tree.


My source is in-person conversations with a real-life human being, and working on one of his codebases that employed the technique. If you want to look up his work, his name is Bill McKeeman. I personally have never felt compelled to find secondary sources when I had primary sources.

Edit: I'm personally not surprised that open-source codebases don't employ this technique. Lots of great PL work was done for private companies until the 90s and while lots of work was published in papers and books, precious few open-source PL communities historically drew from academia. I'm sure you know the counter-examples.


The Go library's standard AST package[1] similarly stores comments and whitespace information. It has to, since it's used by gofmt, and you wouldn't be very happy you lost comments when formatting your code.

[1] https://golang.org/pkg/go/ast/


Unfortunately, most people agree that the ast package is a mess. It isn't used by the actual compiler. There is some talk of eventually deprecating it and publishing the internal ast/parser/type-checker packages.


I think calling it a "mess" is hyperbole, but yes, it's true that it has shortcomings and may be deprecated and replaced.


I've used it before for a tool that transforms code that throws away an error code to code that takes the error code and panics if not nil.

https://github.com/placeybordeaux/panic-attack

The package works fine, but I can't say it was fun to work with.


If I remember correctly rust-fmt deals with comments without them being explicitly part of the ast. Sure creates more work though


IDEs like IntelliJ need to do this as they support all sorts of transformations (refactorings) while preserving the original source code as much as possible. Their syntax trees must also handle errors, as they need to work robustly in the editor while the user is typing or otherwise has syntax errors. IntelliJ can even perform refactorings while the code has syntax or semantic errors.

See this for more details: http://www.jetbrains.org/intellij/sdk/docs/basics/architectu...


The Roslyn compiler (C#) stores "syntax trivia" as properties of the AST nodes they appear near. Whitespace and comments are considered trivia.



Thank you, this kind of design doc is exactly what I was looking for. I'm not really familiar with Microsoft's ecosystem, but I mentioned it in the blog post because I suspected that they had the most advanced technology in this domain.

From that doc, which I plan to read thoroughly:

This enables the second attribute of syntax trees. A syntax tree obtained from the parser is completely round-trippable back to the text it was parsed from. From any syntax node, it is possible to get the text representation of the sub-tree rooted at that node.

This is true with my representation too, but I don't actually attach "trivia" to trees. Instead I just have every node store a bunch of span IDs. And then if you want to reconstruct the text, then you just take min(span_ids of node) and max(span_ids of node) and then concatenate those spans.

I also think the name "lossless syntax tree" makes sense, because they are describing very specific properties that ASTs and CSTs / parse trees don't have.

They also have an immutable property which is cool. I recall that Hjelsberg had a video on this:

https://news.ycombinator.com/item?id=11685317

https://channel9.msdn.com/Blogs/Seth-Juarez/Anders-Hejlsberg...


Are there any parsers (or parser generators) that allow to generate a "Lossless Syntax Tree" for an arbitrary grammar? Specifically, a parser that tags every parsed token with a tuple (line_id, col, length) that tells me where it was in the original parsed file. I've been thinking for ages about writing some refactoring tools for Fortran that would need something like that.


This is a parser generator that I'm working on:

https://github.com/tree-sitter/tree-sitter

It produces concrete syntax trees that can be queried by line or character index. The library is specifically focused on incremental parsing, for use in a text editor, but it also works fine for normal 'batch' parsing workloads. It has a simple C API that you should be able to use from any language. Currently, there are bindings to JavaScript and Haskell. Here are some example grammars:

* C - https://github.com/tree-sitter/tree-sitter-c

* JavaScript - https://github.com/tree-sitter/tree-sitter-javascript

* Go - https://github.com/tree-sitter/tree-sitter-go

* Ruby - https://github.com/tree-sitter/tree-sitter-ruby

* Python - https://github.com/tree-sitter/tree-sitter-python


That looks extremely interesting!


PEGTL for C++ provides such a tuple to its semantic actions. It'd be up to you of course to store that information in to your parse tree.


For JavaScript we have https://github.com/estree/estree as our JS AST spec for parsers like esprima, acorn, babylon (under a config flag), etc.

JSCS (https://github.com/jscs-dev/node-jscs) (now merged with ESLint) started a CST project as well https://github.com/cst/cst to help deal with autoformatting by adding whitespace type nodes.

Currently the community has a lot of interest in https://github.com/jlongster/prettier which just simply reprints the file from scratch in a consistent way (posted in https://news.ycombinator.com/item?id=13365470).

We also have https://github.com/benjamn/recast to help source to source transformations and https://github.com/facebook/jscodeshift.


Using the line/column information in Python's ast objects, you can sort of, almost recreate the concrete syntax, and I've tried to do so in my toy project https://github.com/Mortal/fstrings which converts old %-style formatting to the new Python 3.6 f-string syntax. It's not feature complete, but it mostly works.


There's redbaron, which provides a full syntax tree, and is designed for making such changes: https://github.com/PyCQA/redbaron


Unfortunately, redbaron targets Python 2, which doesn't have the f-strings added in Python 3.6. The project looks quite nifty otherwise.


That looks very cool! Thanks for the pointer.


While line/column works nicely if no modification was done on the AST, it works pretty bad with modification.

Furthermore, you can't do cosmetic (e.g., indentation) changes on the AST that will reflect back on the source code (preserving user, i.e., non-modified, style decisions).

The Microsoft example (thinking of Roslyn) offers a great lossless syntax tree written with immutable data structures.


The lib2to3 library provides concrete syntax trees.


Yes I just mentioned lib2to3 in the post, that's probably the closest example. (There are also some Clang-based tools to convert C++ 03 to 11 to 14, etc.)

Although I believe that you need different terminology. A concrete syntax tree doesn't necessarily represent comments and whitespace. It represents the derivation of the string from the grammar. These are totally different things and I give an example in the article.

The CST is kind of wasteful and I'm not sure why you would use it other than not wanting to write semantic actions to prune the tree on the fly. If anyone has insight into this, please let me know.


This is the first I've read about the oil shell. I've tried a few times to move to fish shell, but I'm too simple a shell user not to be tripped up regularly when I paste bash in, so I end up back at zsh for interactive and bash for scripts.


oil shell is not ready to be used as a login shell, it is a prototype. The most interesting part of it currently is its blog at http://www.oilshell.org/blog/


I'm very much enjoying reading "Translating Shell to Oil" http://www.oilshell.org/blog/2017/02/05.html


By chance do you have or plan to have a RSS feed? One that is some 60 (!) posts long, so I can read everything you published in my RSS reader (I'm still using one!).


Author never commits with his real email except for the initial commit (which points to https://github.com/andychu) and the project is owned by organization.

I wonder if it is possible to run anonymous projects by creating organization and committing as "Anonymous <anonymous@example.com>". Of course you should just create another account using Tor in this case, but just from a theoretical point of view, can an outsider (not working at github) find out who the author is?


> I wonder if it is possible to run anonymous projects by creating organization and committing as "Anonymous <anonymous@example.com>". Of course you should just create another account using Tor in this case, but just from a theoretical point of view, can an outsider (not working at github) find out who the author is?

Should be trivially possible; GitHub identifies people via their SSH key, not via a "name" or email address.

According to their terms of service ( https://help.github.com/articles/github-terms-of-service ):

> You must provide your name, a valid email address, and any other information requested in order to complete the signup process.

Throwaway email addresses (e.g. mailinator.com ) are "valid", but not providing "your name" is a violation of the TOS, and hence:

> Violation of any of the terms below will result in the termination of your Account.

Of course, that's not saying much when their terms also state:

> GitHub, in its sole discretion, has the right to suspend or terminate your account and refuse any and all current or future use of the Service, or any other GitHub service, for any reason at any time. Such termination of the Service will result in the deactivation or deletion of your Account or your access to your Account, and the forfeiture and relinquishment of all Content in your Account. GitHub reserves the right to refuse service to anyone for any reason at any time.

Hence anybody who actually cares about privacy, anonymity, publishing code without it being taken down, etc. should probably avoid GitHub (other than for mirroring). It's a single point of failure which can be closed down or sent a subpoena.

The whole point of git is to be decentralised, so just use `git send-email` and keep it that way. No need to give any control to a centralised entity, especially a proprietary one!


> > You must provide your name

It never says "full legal name", just "your name". Suppose you decide to use an alias? (Especially if you use one that sounds like a legitimate name?) Does that violate the TOS?

I am not a lawyer so I am not going to even try to interpret GitHub's TOS. But I do think GitHub ought to clarify this point–either explicitly say that aliases/pseudonyms are permitted, or alternatively if they insist on your legal name, say that. Although demanding one's legal name, if taken literally, can produce some odd results – if everyone calls you "Peggy Smith" but your legal name is "Margaret Helen Smith", should it be a TOS violation to create an account as "Peggy Smith"? Or, to give another example, a woman who marries may decide to adopt her husband's surname as her legal surname (or in some countries that might even happen automatically by law without her having any say in it), but still use her maiden name for professional purposes–that's what my mother did at least.


I imagine many GitHub users don't have a "full legal name"; e.g. I'm in the UK and AFAIK there's no concept of a distinguished "legal name"; a name is anything you're known by, i.e. I could give "chriswarbo" as my name in an official setting, since that's one of the names I'm known by (although it would probably cause me headaches, since that name doesn't appear in any of my other official forms/documents).

Besides which, as I quoted above, GitHub can terminate any account at any time for any reason, so it's really just nitpicking.


> GitHub identifies people via their SSH key, not via a "name" or email address.

Contrary to this, GitHub leans heavily on email addresses to identify contributors—in fact it may be correct to say it relies entirely on email addresses. This comes from both the documentation[1] and is borne out by experiments I've done to verify this[2].

The only thing I'm aware of that's of note is if you superficially try to change the author of the patch to give a different attribution before pushing to GitHub, then GitHub will say something like "Other-Name committed with Real-You". In this case, it all comes from Git—Git has a notion of "committer" as distinct from "author". So what's happening in the case of e.g. `--author=Other-Name <fake@example>`, Git reads your config, which specifies e.g. Real-You, and so based on that and the `author` flag, will mark the commit with author Other-Name and committer Real-You. From this, GitHub displays the UI string of the form mentioned earlier.

For example, from a recent commit[3] to the vscode-docs repo, if you do:

  $ git cat-file -p caa4bc3e50a96a93a1d4f5e39597cc3fc1cb9a4f
... then you get:

  tree 86c221e0f9e7e986068c50971ac322c738f5812f
  parent ccb9478342bba878111baedb412fb7f6911b40ad
  author João Moreno <mail@joaomoreno.com> 1486451758 +0100
  committer Greg Van Liew <gregvanl@microsoft.com> 1486481582 -0800

  Improve `code` docs for macOS
Once again, this is all Git, and nothing specific to GitHub or any special inference that it derives from SSH keys (to my knowledge). If you have some other information, can you share it? Realistically, it's not feasible for GitHub to rely on SSH because it's not even necessary to set up SSH for your account. (Pushes from the command line can be done with a username/password pair—if you're hitting the https endpoint—and GitHub really likes people who do everything through their web UI. Forcing the latter group to mess with SSH is basically a non-starter, given GitHub's goals—which I'll clarify at every opportunity, is to be a social network first, and a software development platform second.)

1. https://help.github.com/articles/why-are-my-commits-linked-t...

2. http://www.colbyrussell.com/2016/02/13/keeping-a-low-profile...

3. https://github.com/Microsoft/vscode-docs/commit/caa4bc3e50a9...


Yes, I didn't mean to imply they're scraping information from keys or anything (e.g. looking them up in a web of trust, etc.); only that SSH keys are an "external" form of identifier.

The git cli uses 'email-like identifiers', but they don't have to be working email addresses unless you're using git's email integration (and even then, you could use sendmail et al to redirect these fake emails through a real one).

Of course, it's trivial to use multiple git configs (such that "Real-You" isn't really you), although it requires some vigilance. I worked with someone who maintained two separate GitHub accounts: one for "public" stuff, one for security/hacking related stuff (nothing nefarious; they went after bug bounties). They got really annoyed one day when they spotted that the drunk coding they'd done the night before had been committed using their public git config, so anyone who'd crawled/mirrored those repos (and of course GitHub themselves) would be able to tie the identities together!


HN feature request: "hide all from domain matching pattern".




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: