For those who don't already know, this is built on tree-sitter (https://tree-sitter.github.io/tree-sitter/) which does for parsing what LSP does for analysis. That is, it provides a standard interface for turning code into an AST and then making that AST available to clients like editors and diff tools. Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports. And if you're developing a new language, you can create a tree-sitter parser for it, and now every tool that speaks tree-sitter knows how to support your language.
Those two massive innovations are leading to an explosion of tooling improvements like this. Now every editor, diff tool, or whatever can support dozens or hundreds of languages without having to duplicate all the work of every other similar tool. That's freaking amazing.
Absolutely agreed, and copying from a comment I wrote last year: I think the fact that tree-sitter is dependency-free is worth highlighting. For context, some of my teammates maintain the OCaml tree-sitter bindings and often contribute to grammars as part of our work on Semgrep (Semgrep uses tree-sitter for searching code and parsing queries that are code snippets themselves into AST matchers).
Often when writing a linter, you need to bring along the runtime of the language you're targeting. E.g., in Python, if you're writing a parser using the built-in `ast` module, you need to match the language version and features. So you can't parse Python 3 code with Pylint running on Python 2.7, for instance. This ends up being more obnoxious than you'd think at first, especially if you're targeting multiple languages.
Before tree-sitter, using a language's built-in AST tooling was often the best approach because it is guaranteed to keep up with the latest syntax. IMO the genius of tree-sitter is that it's made it way easier than with traditional grammars to keep the language parsers updated. Highly recommend Max Brunsfeld's Strange Loop talk if you want to learn more about the design choices behind tree-sitter: https://www.youtube.com/watch?v=Jes3bD6P0To
And this has resulted in a bunch of new tools built on top of tree-sitter; off the top of my head, in addition to difftastic: Neovim, Zed, Semgrep, and GitHub code search!
LSP support is semi-built-in, but apparently lots of improvements are coming in that area to support more language servers. For Python it currently only has Pyright built in, which is an annoyance if you're working with code where the venv is inside a container, but there are very active tickets on their GitHub about building out the LSP support. I currently use it as my second editor: I have Sublime set up to be pretty much perfect for my usage, but Zed is catching up fast. I find I'm very fussy about editors; I can't get on with VSCode at all, but I feel warm and fuzzy toward Zed. The UX is great, the performance is superb, and external LSP support is probably the one feature stopping me from using it as my primary editor.
I've tried VS Code a ton of times. It's reasonably good, but I am SO used to Emacs that it's almost impossible for me to move away.
VS Code is better at debugging and maybe slightly better at remote connections, I'll grant that. But for everything else I am way more productive with Emacs than anything else.
Okay, but how does that work with language versions? Like, if I get a "C++ parser" for tree-sitter, how do I know whether it's C++03, C++17, C++20, or something else? Last time I checked (which was months ago, to be fair), this wasn't documented anywhere, nor were there any apparent mechanisms to support language versions and variants.
The second doc has a year in the title, so it's ancient af.
The first one has multiple `C++0x` red marks (whatever that means; AFAIR that's how C++11 was referred to before standardization). It mentions `constexpr` but doesn't know `consteval`, for example, and doesn't even mention any of the C++11 attributes, such as [[noreturn]]. So despite the "Last updated: 10-Aug-2021", it's likely pre-C++11, also ancient af, and of no use in the real world.
While I agree tree-sitter is an amazing tool, I found that writing a grammar can be incredibly difficult. I tried writing a grammar and highlighting query set for VHDL with tree-sitter, and ran into a lot of difficulty expressing VHDL's grammar in it.
I think many people are exhausted (at least I am) with the constant irrational exuberance of bolting AI onto every technology, product, and service in existence to end all of humanity's problems. It won't work like that.
In fact, it reminds me of the time when they used blockchain for everything.
It's just a bubble right now; it will come back to its natural uses after it bursts. Everyone is doing AI now, and I'm pretty sure it's to attract investment, even though some of them know their product will go nowhere.
Correction: Everybody says they're doing AI now because that's the magic buzzword for getting money.
I spent the 1990s building actual AI software, but we had to call it something else because if you even whispered "AI" in the 90s your funding would dry up instantly.
Someone should build a tool that augments any text with current year tech buzzwords for optimal investor appeal. I wonder what tech could be used for that… wait
I don't believe this is correct - there's no such thing as "speaking tree-sitter." Every tree-sitter parser emits a different concrete syntax tree, not a standard abstract syntax tree.
LSP truly solves the M-editors-to-N-languages problem of needing M * N integrations by providing a standard interface to a query-oriented compiler. Tree-sitter doesn't solve this problem; it just makes it way easier to write the N integrations for your editor/tool.
That depends on how deep you want to go with the result. I use the Nova editor which uses tree-sitter for syntax highlighting, and I've packaged several languages for it. Each time it goes like this:
1. Clone someone's tree-sitter grammar off GitHub.
2. Build it into a Mac .dylib.
3. Create a Nova extension that says "use this .dylib to highlight that language."
4. Use it.
I don't have to make any changes to Nova itself, and the amount of configuration I have to write is so tiny that Nova could have a DIY wizard if they wanted it to.
The source for Difftastic discussed here (at https://github.com/Wilfred/difftastic/blob/master/src/parse/...) is also very simple: for each of a list of supported languages, import the tree-sitter parser and wrap a teensy amount of configuration around it.
> 3. Create a Nova extension that says "use this .dylib to highlight that language."
How is that possible if the different tokens emitted by tree sitter don't have standardized names? Isn't there some kind of configuration that maps the rules in the grammar to whatever convention Nova uses for their token names?
Now, tree-sitter does make this super easy, but my point was that you still need some per-language configuration/logic, whereas the entire point of LSP is to need none.
The maintainer of the tree-sitter grammar is usually the one who maintains that mapping. At least, every time I've wanted to use it, all of that was already done and part of the grammar's repo.
The main issue I have with tree-sitter is that its approach can't work for many languages I care about: Common Lisp cannot be parsed without a full Lisp implementation; Haskell's syntax is complicated enough that the grammar is incomplete; C/C++ can't be parsed accurately, if only because of the preprocessor; parsing Perl is Turing-complete; etc. I think the suggestion elsewhere makes sense: don't make us write parsers in a new ecosystem, but instead define a format for existing parsers to produce as a side output.
> C/C++ can’t be parsed accurately if only because of the pre-processor
Yeah, I decided to check this out to see if it could help with review in our massive C-based project. Unfortunately, on a recent patch, 88 of the 90 "hunks" had fallen back to "normal diff" ("$N C parse errors, exceeded DFT_PARSE_ERROR_LIMIT").
Can one write a tree-sitter grammar for English (or any other natural language), that basically labels each sentence as a statement, so I can use difftastic to show changes on sentences rather than visual lines?
This is because visual line diffs for an essay are bonkers: usually the changed sentence starts in the middle of a visual line.
The common advice[0] is to just write one sentence per line. I usually split at commas etc. as well. Then use editor soft wrapping instead of fixing a maximum line length; but if your lines get longer than the screen width, that might be a sign your sentences are too complex.
[0]: anyone have a good source for this? I’m not sure where I first encountered it
I will write excessively complex sentences whenever I darn please, and will be hogtied before I stop at the whims of a /diff tool/.
Mock outrage aside, whimsy and play in written language is vastly cheaper than in industrial programming environments. Provided, of course, the author can yet communicate while horsing around.
It turns out that there is a lot of discourse out there about "semantic newlines", under a few different names. So far the names I've seen are:
- One Sentence Per Line (OSPL)
- Semantic Line Breaks (SemBr)
- Semantic Linefeeds
- Ventilated Prose
- Semantic newlines
Reading through the pages below was helpful in getting a better idea of what language people use to discuss this. They're mostly historical retrospectives or arguments for the merits of semantic newlines.
And often there is a middleground. In this case one could write a script that outputs a reformatted file with one sentence per line.
In Vim this could even be a simple macro as the editor already has a key for jumping to the next sentence.
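A minimal sketch of such a script in Python (a naive regex split; abbreviations like "e.g." will trip it up, so treat it as a starting point rather than a robust tool):

```python
import re

def one_sentence_per_line(text: str) -> str:
    """Naively reflow prose so each sentence ends a line.

    Joins hard-wrapped lines within a paragraph, then breaks after
    sentence-ending punctuation followed by whitespace.
    """
    out = []
    for para in text.split("\n\n"):
        # Collapse existing hard wraps into one flowed paragraph.
        flowed = " ".join(para.split())
        # Break after ., ! or ? that is followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", flowed)
        out.append("\n".join(s for s in sentences if s))
    return "\n\n".join(out)

print(one_sentence_per_line("One sentence. Another\none, wrapped. Done!"))
# One sentence.
# Another one, wrapped.
# Done!
```

Piping a prose file through this before committing would make line diffs align with sentence diffs.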
As soon as you said tree sitter I immediately understood. Yes, I can’t believe I never realized that you could totally build a syntax-aware VCS on top of it. That’s brilliant.
I just wrote a language parser a few months ago in tree sitter and it’s probably the most delightful software I’ve used apart from ffmpeg.
Tree-sitter is nice, but I would like parsers that make a better effort on invalid inputs: something like an Earley parser that maximizes some quality function. This would be useful for parsing (for example) C and C++, where the preprocessor prevents true parsing of unpreprocessed code. I understand that tree-sitter is intended for interactive use in editors, where it can't spend too much time parsing.
(1) The top comment is from the author of difftastic (the subject here), saying that treesitter Nim plugin can't be merged, because it's 60 MB of generated C source code. There's a scalability problem supporting multiple languages.
The author of Treesitter proposes using the WASM runtime, which is new.
(2) The original blog post concludes with some Treesitter issues, preferring Syntect (a Rust library that accepts TextMate grammars):
Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.
(3) The idea of a uniform api for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea
(4) [It] segfaults constantly ... More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.
---
I think some of the feedback was rude and harsh, and maybe even using Treesitter outside its intended use cases. But as someone who's been interested in Treesitter, but hasn't really used it, it seems real.
One problem I see is that Treesitter is meant to be incremental, so it can be used in an editor/IDE. And that's a significantly harder problem than batch syntax highlighting, parsing, semantic understanding.
---
That is, difftastic is a batch tool, i.e. you run it with git diff.
So to me the obvious thing for difftastic is to throw out the GLR algorithm, and throw out the heinous external lexers written in C that are constrained by it, and just use normal batch parsers written in whatever language, with whatever algorithm. Recursive descent.
These parsers can output a CST in the TreeSitter format, which looks pretty simple.
They don't even need to be linked into the difftastic binary -- you could emit a CST / S-expression format and match it with the text.
Unix style! Parsers can live in different binaries and still be composed.
The blog post's use case can also just use batch parsers that output a CST. You don't need Treesitter's incremental features to render HTML for your blog.
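A toy sketch of that idea: a hand-written recursive-descent batch parser for a tiny arithmetic language that emits a tree-sitter-style S-expression CST. The node names (`binary_expression`, `number`, and so on) are borrowed informally from tree-sitter conventions, not from any real grammar, and this is not difftastic's actual code:

```python
import re

TOKEN = re.compile(r"\s*(\d+|[+*()])")

def tokenize(src):
    """Split the source into integer and operator tokens."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    """Recursive-descent parser for sums and products of integers,
    emitting an S-expression CST as a string."""

    def __init__(self, tokens):
        self.toks = tokens
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, tok):
        assert self.peek() == tok, f"expected {tok!r}"
        self.i += 1

    def expr(self):          # expr := term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.eat("+")
            node = f'(binary_expression {node} "+" {self.term()})'
        return node

    def term(self):          # term := atom ('*' atom)*
        node = self.atom()
        while self.peek() == "*":
            self.eat("*")
            node = f'(binary_expression {node} "*" {self.atom()})'
        return node

    def atom(self):          # atom := number | '(' expr ')'
        if self.peek() == "(":
            self.eat("(")
            node = f"(parenthesized_expression {self.expr()})"
            self.eat(")")
            return node
        self.i += 1
        return "(number)"

print(Parser(tokenize("1 + 2 * (3 + 4)")).expr())
```

A diff tool could consume this text format from any parser in any language, no GLR or incremental machinery required.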
As one of the harsh and rude commentators, I would say I basically agree with your interpretation. You also correctly inferred that I have experience with working with it in an area that is arguably outside of its true use case.
At the same time, I believe that there needs to be a corrective about what tree-sitter should and should not be used for. There are companies building security products on top of tree-sitter which I think is an objectively bad idea given its problems and limitations. Difftastic is to me a grey area because it could lead hypothetically to a security issue if it generates an incorrect diff due to an incorrect tree-sitter grammar. Unlikely but not impossible.
Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill (I have written a recursive descent parser for a language in C that can do 3 million lines per second on a decent laptop, which is about 60k lines per 20 ms, the window I look to for reactivity). How many non-generated source files exceed, say, 100k lines? Incremental parsing feels like taking on a lot of complexity for rather limited benefit except in fairly niche use cases (granting that one person's niche is another's common case).
That being said, it is impressive that their incremental algorithm works as well as it does, but the cost is that grammar writers are forced to mold a language's grammar to fit the GLR algorithm. When it doesn't work as expected, which is not uncommon in my experience, the error messages are inscrutable, and debugging either the generator or the generated code is nigh impossible.
Most of the happy users have no idea how the sausage is made; they just see the prettier syntax highlighting that works with multiple tools. I get that my criticism is as welcome as a wet blanket, but I just think something much better is possible, which your comment hints at.
FWIW, as a happy user, I'm mainly happy that it exists at all. In the short term, it reduces the work of supporting M editors and N languages from M * N to M + N. That's nice. More importantly, it puts a bug in everyone's ear that this is a good and achievable thing. Maybe the next step will be a tree-sitter-API-compatible replacement that fixes some of those problems, and we can all migrate onto that.
That is, the big win is getting people to buy into the concept of syntax (and analysis) as a library and not as a feature of one specific editor. Once we're all spoiled by that, perhaps a better implementation or a nicer API will come along and astound us all.
> Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill
I'd understood that incremental was used so that as someone writes code the IDE can syntax highlight the incomplete and syntactically incorrect code with better accuracy. Is that not the case?
It is, but the counter argument is that parsers are already so fast that streaming and all-at-once parsing are indistinguishably quick on even huge files.
I don’t believe that’s true, but it’s likely correct for the common use case of files a few pages long, written in well supported languages.
I am quite sure that batch parsing will give good responsiveness for many, if not most, common languages, provided source files have fewer than, say, 30k lines. If you just think about the I/O performance of modern computers, it should not be that difficult to parse at 25 MB/sec, which I estimate translates to between 500K and 1M loc per second, which again is in the 15k-30k loc range per 30 ms.
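For what it's worth, the back-of-envelope arithmetic holds up, assuming an average line length in the 25-50 byte range (my assumption, not the parent's):

```python
# Sanity check of the parent comment's throughput estimate.
throughput_bytes_per_sec = 25 * 10**6        # 25 MB/sec
for avg_line_bytes in (25, 50):              # assumed average line lengths
    loc_per_sec = throughput_bytes_per_sec // avg_line_bytes
    loc_per_30ms = loc_per_sec * 30 // 1000
    print(f"{avg_line_bytes} B/line: {loc_per_sec:,} loc/s, "
          f"{loc_per_30ms:,} loc per 30 ms")
# 25 B/line: 1,000,000 loc/s, 30,000 loc per 30 ms
# 50 B/line: 500,000 loc/s, 15,000 loc per 30 ms
```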
I'm not saying that incremental is bad per se, but that the choice of guaranteeing incrementalism complicates things for cases where it isn't necessary. I am not super familiar with LSP, but I can imagine it having a syntax highlighting endpoint with both batch and incremental modes. A naive implementation could just run the batch mode when given an incremental request and later add incremental support as necessary. In other words, I think it would be best if there were another layer of indirection between the editor and the parser (whether that is tree-sitter or another implementation).
Right now, though, you have to opt in whole hog to the tree-sitter approach. As mentioned above, incrementalism has no benefit and only costs for a batch tool like difftastic or Semgrep, to name two mentioned in this thread.
That makes sense to me. I don't know for sure that you're right but it sure seems plausible.
I do wonder how much of a range there is on non-brand-new computers though. I'm typing this on an M2 Max with 64GB of RAM. I also have a Raspberry Pi in the other room, and I know from hard experience that what runs screamingly fast on my Mac may be painfully slow on the Pi.
I could also imagine power benefits to an incremental model. If I type a single character in the middle of a 30KLOC document, a batch process would need to rescan the entire thing where a smart incremental process could say "yep, you're still in the middle of a string constant".
I think it simply boils down to the requirements of interactive editors vs. batch tools.
I have no doubt that interactive editors like Atom/Zed can really make use of incremental parsing, and also lenient parsing.
Syntax highlighting and parsing isn't the only thing they do -- they still need the CPU for other things.
But yeah the problem is incremental is very different than batch, and lenient is very different than strict, so basically every language needs at least 2 separate parsers. That's kind of an unsolved problem, and I'm not sure it can be solved even in principle ...
This question is coming from a place of total ignorance:
One appeal of the general idea of a structural diff tool, for me, is ignoring the ordering of things for which ordering makes no difference.
x = 4
y = 7
are independent statements and the code will be no different if I replace those two statements with
y = 7
x = 4
However, this information is not actually present in the abstract syntax tree. If I instead consider these two statements:
x += 3
x *= 7
it is apparent that reordering them will change the meaning of the code. But as far as the AST is concerned, this looks just like the example where reordering was fine.
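The point is easy to see with Python's own `ast` module: a module body is just an ordered list of statement nodes, and nothing in the tree says whether a given reordering is safe:

```python
import ast

safe = ast.parse("x = 4\ny = 7")
swapped = ast.parse("y = 7\nx = 4")

# Same statements, different order; the tree records the order
# but carries no information about whether swapping is safe.
assert [ast.dump(s) for s in safe.body] == \
       [ast.dump(s) for s in reversed(swapped.body)]

unsafe = ast.parse("x += 3\nx *= 7")
# The dependent pair is also just two statements in a list; the
# AST alone looks equally "reorderable".
assert all(isinstance(s, ast.AugAssign) for s in unsafe.body)
```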
What kinds of things are we doing with our new AST tooling?
> x = 4
> y = 7
>
>are independent statements and the code will be no different if I replace those two statements with
>
> y = 7
> x = 4
Not always; e.g., in a multithreaded situation where x and y are shared atomics. Then, unless we authorize C++ to take more liberties in reordering, another thread can never see y as 7 while x is not yet 4 in the first example, but it can in the second. This kind of subtlety can't be determined from syntax alone.
OK, I tend to agree that the AST is inadequate for this task. But what are we doing with it? That's most of what I want from "structural code diff".
In a sense, plain old diff is a structural diff. The grammar is a sequence of lines of characters.
All tree-sitter gives you is a _different_ grammar, so that a structural diff can operate on trees rather than lines for the same text.
A parse tree still doesn't know anything about the meaning of a program, which is what you need to know in order to determine that those assignments to x and y are unordered.
What you want for determining this is not an AST; you want a Program Dependence Graph (PDG), which does encode this information. Creating one is nowhere near as simple as creating an AST, and for many languages it either requires assumptions that will be broken or results in something very similar to an AST (every node depending on the previous node).
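To illustrate the gap, here's a crude name-level dependence check over a Python AST; it's a tiny step toward what a PDG encodes, and the function names are my own. A real analysis would also need to handle aliasing, attribute and subscript targets, calls with side effects, and (as noted elsewhere in the thread) concurrency:

```python
import ast

def reads_writes(stmt):
    """Approximate name-level read/write sets for one statement.
    Crude: only looks at bare Name nodes and their Load/Store context."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:
                writes.add(node.id)
    return reads, writes

def commutes(a, b):
    """True if neither statement writes a name the other reads or
    writes (no read-write or write-write conflict)."""
    ra, wa = reads_writes(a)
    rb, wb = reads_writes(b)
    return not (wa & (rb | wb) or wb & (ra | wa))

s1, s2 = ast.parse("x = 4\ny = 7").body
print(commutes(s1, s2))   # independent assignments: True

t1, t2 = ast.parse("x += 3\nx *= 7").body
print(commutes(t1, t2))   # both touch x: False
```

This is dependence information layered on top of the AST; the tree alone never distinguishes the two cases.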
OK. What good is the AST? Why do I care about "structural diffs" that don't do this?
The page has several examples:
1. Understand what actually changed.
This appears to show that `guess(path, guess_src).map(tsp::from_language)` has been changed to `language_override.or_else(|| guess(path, guess_src)).map(tsp::from_language)`. The call to `map` is part of a single line of code in the old file, but has been split onto a line of its own in the new file to accommodate the greater complexity of the expression.
The bragging associated with the example is "Unlike a line-oriented text diff, difftastic understands that the inner expression hasn't changed here", but I don't really care about that. I need to pay close attention to which bits of the line have been manipulated into which positions anyway. I'm more impressed by ignoring the splitting of one line into several, which does seem to be a real benefit of basing the diff on an AST.
2. Ignore formatting changes.
This example shows that when I switch the source from which `mockable` is imported from "../common/mockable.js" to "./internal.js", the diff will actively obscure that information by highlighting `mockable` and pretending that `"./internal.js"` is uninteresting code that was there the whole time (because it was already the source of some other imports). This badly confuses a boring visual change ("let's use the syntax for importing several things, instead of one thing") with a very significant semantic change ("let's import this module from a completely different file"). I'm not comfortable with this; there must be a better way to present this information than by suggesting that I shouldn't be worried about it.
(A textual diff, in this case, has the same problem. But when the pitch is that your new tool is better than a textual diff because it understands the code, failing to highlight an important change to the code is worse than it used to be!)
3. Visualize wrapping changes.
This shows that when I change the type of some field from `String` to `Option<String>`, the diff will not highlight the text "String", because that part hasn't changed. This is a change from a textual diff, but it doesn't appear to add much value.
There's a second example to do with code that belongs both before and after other code, in this case an opening/closing tag pair in XML, but in that case the structural diff appears to be identical to a textual diff.
4. Real line numbers.
"Do you know how to read @@ -5,6 +5,7 @@ syntax? Difftastic shows the actual line numbers from your files, both before and after."
I agree that that's a real benefit, but again it doesn't seem to have anything to do with the difference between textual and structural diffs.
------
I think the conceptual appeal of a "structural diff" is that it fails to highlight changes to the code that don't change the behavior of the software. Difftastic clearly believes something different; in the second example, they are failing to highlight a change to the code that does change the behavior of the software. And in the other examples, they are failing to highlight things that haven't changed from some perspectives, but could be argued to have changed from other perspectives -- and that in either case don't derive much benefit from not being highlighted. If changing `String` to `Option<SpecialType>` produced a diff that highlighted `SpecialType` in a separate color from the surrounding `Option<>` wrapping, indicating that the one line of code contained two relevant changes, that might be interesting, but otherwise I don't see the point of not highlighting the inner `String` along with the new wrapping.
Honestly I agree that structural diffs don't solve a problem for me either. I care about formatting too much to only want to rely on them.
I was just replying that if you want the example I replied to not to produce a diff, you have to use a more advanced representation of the code; an AST won't be able to do it.
How close are we to being able to copy a function into the clipboard, then highlight some lines of code and paste the function around them (like highlighting and pressing quote marks)?
I don't know what exactly you mean by pasting a function around the selection, but you can paste selections, registers or even files at specific lines with some vim-fu. If it's generic enough you could write a function or even a keyboard shortcut if it's very simple.
I have set ",',(,[,{ in visual mode to cut the selection, insert the pairs, then paste it back, as a very hacky solution, but it gets the job done. If you want something more advanced to add or change anything around the selection, tpope has solved that with vim-surround[1].
I assume auto-wrapping a copied function signature around a selected block as if it were just parentheses or something. I don't think I've ever needed that, but a variant of it might be useful for XML, where wrapping something in a pair of tags is a common operation.
Mostly I would find this helpful during refactoring, where it's normal to move some code out into a specific function. It would also be helpful for loops, if/else, validation, and try/catch, since you could copy the boilerplate and paste it over the code block in one move.