
Most languages don't have ASTs that work well in "degenerate" cases like unfinished/work in progress/sample code. These are things you often want to have in source control. There are also very few AST standards shared across languages, which makes for a lot more language-dependent work for an SCM.

(For what it is worth, I toyed with what I think to be a useful compromise to the idea, which is to use syntax highlighting tokenizers, which perform well and interoperate well with character-based diffs: https://github.com/WorldMaker/tokdiff)
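To make the token-diff idea concrete, here is a minimal sketch in Python. It is not how tokdiff itself works: the regex tokenizer and the `tokenize` and `token_diff` names are assumptions made up for illustration, standing in for a real syntax-highlighting tokenizer. The point is only that a lossless tokenizer never rejects unfinished code, so the diff still works when no valid AST exists.

```python
import difflib
import re

# Hypothetical stand-in for a syntax-highlighting tokenizer: words,
# whitespace runs, and single symbols. Every character lands in exactly
# one token, so joining the tokens reproduces the input.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split text into tokens without ever failing on invalid code."""
    return TOKEN_RE.findall(text)


def token_diff(old, new):
    """Return (tag, old_text, new_text) for each changed run of tokens."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return changes


# Both inputs are missing a closing parenthesis, so neither would parse.
print(token_diff("foo(bar", "foo(baz"))  # [('replace', 'bar', 'baz')]
```

Because the changes are still expressed as runs of text, they should map back onto character offsets, which is what lets this kind of diff sit alongside ordinary character-based tools.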



I agree that degenerate cases are one of the major problems with trying to move to higher-level editing abstractions (and it's one of the big issues I faced with my own application), but I don't see how it's a problem here.

If it doesn't have a valid AST, then it's not a valid program, either. If you're using a formatting program (like prettier or gofmt) and pass it an invalid program, you're either going to get an error, or undefined behavior. And anyone else opening it in a different editor is going to have their language-mode interpret the invalid program differently than yours, too. Source code tools like structured search won't work predictably, either.

This sounds to me like saying "An XML structure editor? But what if I want to put an invalid XML structure in a file called foo.xml and commit it to the repository?" Or my first boss complaining that visual text editors didn't let you see every byte. Or Mel needing to know drum addresses so he could use them for constants.

As time goes by, we move to higher level abstractions, and (thanks to tools like gofmt) we're already at the point where you probably shouldn't be pushing source code which doesn't even have a valid AST and expecting all your tools to work perfectly. There are plenty of ways to write and commit WIP code without needing an invalid AST on disk.


I thought similarly for many years. What's the use of code in source control that doesn't compile anyway?

Then I started paying attention to all the reasons why we naturally might want to check in "invalid" code.

All the cases where I'm never going to finish an entire refactor in a single commit, and the whole refactor makes far more sense in documented steps where many of the intermediate steps will never compile.

All the cases where sending broken code to a source control server at the end of the day is both the best way to back it up and the easiest way to get fresh eyeballs on it in the morning. Even with CI systems in place, sometimes the errors that the CI bot can tell you are as useful as the ones your own machine's build environment can tell you. (Especially in those weird cases where it turns out that maybe it's the build environment on your own machine that's the problem and you've been beating yourself up over a bad install of something that should be unrelated, and the build errors on the remote machine lead you to the real problem.)

All the cases where I might build tests first, and the compiler is a test, especially in a statically typed language, before working backwards to make the tests pass/run/compile.

A lot of people like to think of programming code as some purely logical construct, but good code is poetry, even in its mistakes. Sometimes we need drafts to tell our stories right. Sometimes we need bad poetry in source control as a warning to others to help them realize what good poetry can be.

It's interesting to want code editors to keep us from ever writing bad poetry to disk, but it's also somewhat inhumane. People write bad poetry all the time; it's a natural skill. Saving bad poetry for posterity is sometimes the only way we get better poetry.


I once saw a PhD thesis that argued that the best way to go was to use AST diffing at the top level (classes, method definitions, etc.) but line-based diffing for the inner parts (method bodies, etc.).

Small textual changes can lead to large changes in the AST, which results in confusing diffs. Merging is also non-trivial.

On the other hand, for top level structures following the AST keeps things saner.
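A rough sketch of that hybrid, assuming Python source and the standard `ast` and `difflib` modules (the `top_level_defs` and `hybrid_diff` names are invented here, and this is not the method from the thesis, just one way the split could look). Top-level definitions are matched by name through the AST, and only the text inside a changed definition is diffed line by line. Note that it needs both versions to parse.

```python
import ast
import difflib


def top_level_defs(source):
    """Map each top-level function/class name to its source lines."""
    lines = source.splitlines()
    defs = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            defs[node.name] = lines[node.lineno - 1:node.end_lineno]
    return defs


def hybrid_diff(old, new):
    """AST-level matching of definitions, line-level diff inside them."""
    old_defs, new_defs = top_level_defs(old), top_level_defs(new)
    result = {
        "added": sorted(new_defs.keys() - old_defs.keys()),
        "removed": sorted(old_defs.keys() - new_defs.keys()),
        "changed": {},
    }
    for name in sorted(old_defs.keys() & new_defs.keys()):
        if old_defs[name] != new_defs[name]:
            result["changed"][name] = list(
                difflib.unified_diff(old_defs[name], new_defs[name], lineterm="")
            )
    return result


old_src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
new_src = "def a():\n    return 10\n\ndef c():\n    return 3\n"
report = hybrid_diff(old_src, new_src)
print(report["added"], report["removed"], list(report["changed"]))
```

Matching by name means a rename shows up as one removal plus one addition; detecting renames would need extra heuristics that this sketch leaves out.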


It's also similar to the thought path that led me to trying a tokenizer for diffs. The tool, as is, essentially gives you character-based diffing everywhere, which keeps the inner parts sane/mergeable.

One of the things that I never got around to doing was exploring the higher-level opportunities, but they are there. Just as the tokenizer is the first pass in AST building, tokenized diffs should contain enough information that if you wanted to try to do higher level diff analysis you could do some interesting things.



