(1) The top comment is from the author of difftastic (the subject here), saying that the Treesitter Nim plugin can't be merged because it's 60 MB of generated C source code. There's a scalability problem supporting multiple languages.
The author of Treesitter proposes using the WASM runtime, which is new.
(2) The original blog post concludes with some Treesitter issues, preferring Syntect (a Rust library that accepts TextMate grammars):
> Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.
(3) The idea of a uniform API for querying syntax trees is a good one and tree-sitter deserves credit for popularizing it. It's unfortunately not a great implementation of the idea.
(4) [It] segfaults constantly ... More than any NPM module I've ever used before. Any syntax that doesn't precisely match the grammar is liable to take down your entire thread.
---
I think some of the feedback was rude and harsh, and some of it may even come from using Treesitter outside its intended use cases. But as someone who's been interested in Treesitter, but hasn't really used it, the problems seem real.
One problem I see is that Treesitter is meant to be incremental, so it can be used in an editor/IDE. And that's a significantly harder problem than batch syntax highlighting, parsing, or semantic understanding.
---
That is, difftastic is a batch tool, i.e. you run it with git diff.
So to me the obvious thing for difftastic is to throw out the GLR algorithm, and throw out the heinous external lexers written in C that are constrained by it, and just use normal batch parsers written in whatever language, with whatever algorithm. Recursive descent.
These parsers can output a CST in the TreeSitter format, which looks pretty simple.
They don't even need to be linked into the difftastic binary -- you could emit a CST / S-expression format and match it with the text.
Unix style! Parsers can live in different binaries and still be composed.
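To make that concrete, here is a minimal sketch of what such a standalone batch parser could look like. Everything in it is an assumption for illustration: the toy language (atoms and parenthesized lists), the node names, and the `(kind start end children...)` output with character offsets are made up, and this is not the actual TreeSitter format or anything difftastic does today.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str    # hypothetical node type: "source_file", "list" or "atom"
    start: int   # start offset into the source text (inclusive)
    end: int     # end offset into the source text (exclusive)
    children: list = field(default_factory=list)


def parse(src: str) -> Node:
    """Plain batch recursive descent over a toy language of atoms and lists."""
    pos = 0

    def skip_ws() -> None:
        nonlocal pos
        while pos < len(src) and src[pos].isspace():
            pos += 1

    def parse_item() -> Node:
        nonlocal pos
        skip_ws()
        if pos >= len(src):
            raise SyntaxError("unexpected end of input")
        start = pos
        if src[pos] == "(":
            pos += 1
            children = []
            while True:
                skip_ws()
                if pos >= len(src):
                    raise SyntaxError(f"unclosed '(' at offset {start}")
                if src[pos] == ")":
                    pos += 1
                    return Node("list", start, pos, children)
                children.append(parse_item())
        if src[pos] == ")":
            raise SyntaxError(f"unexpected ')' at offset {pos}")
        # An atom runs until whitespace or a parenthesis.
        while pos < len(src) and not src[pos].isspace() and src[pos] not in "()":
            pos += 1
        return Node("atom", start, pos)

    root = Node("source_file", 0, len(src))
    while True:
        skip_ws()
        if pos >= len(src):
            return root
        root.children.append(parse_item())


def to_sexp(node: Node) -> str:
    """Serialize the tree as an S-expression that carries source offsets."""
    parts = [f"{node.kind} {node.start} {node.end}"]
    parts += [to_sexp(child) for child in node.children]
    return "(" + " ".join(parts) + ")"


print(to_sexp(parse("(add 1 (mul 2 3))")))
```

For that input it prints `(source_file 0 17 (list 0 17 (atom 1 4) (atom 5 6) (list 7 16 (atom 8 11) (atom 12 13) (atom 14 15))))`. A consumer only needs the offsets plus the original text to recover every token, so the parser can be a separate process written in any language.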
The blog post use case can also just use batch parsers that output a CST. You don't need Treesitter's incremental features to render HTML for your blog.
As one of the harsh and rude commentators, I would say I basically agree with your interpretation. You also correctly inferred that I have experience with working with it in an area that is arguably outside of its true use case.
At the same time, I believe that there needs to be a corrective about what tree-sitter should and should not be used for. There are companies building security products on top of tree-sitter which I think is an objectively bad idea given its problems and limitations. Difftastic is to me a grey area because it could lead hypothetically to a security issue if it generates an incorrect diff due to an incorrect tree-sitter grammar. Unlikely but not impossible.
Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill (I have written a recursive descent parser for a language in C that can do 3 million lines per second on a decent laptop, which is about 60k lines per 20 ms, which is the window I look to for reactivity). How many non-generated source files exceed say 100k lines? Incremental parsing feels like taking on a lot of complexity for rather limited benefit except in fairly niche use cases (granting that one person's niche is another's common case).
That being said, it is impressive that their incremental algorithm works as well as it does but the cost is that grammar writers are forced to mold a language grammar that might not fit into the GLR algorithm. When it doesn't work as expected, which is not uncommon in my experience, the error messages are inscrutable and debugging either the generator or the generated code is nigh impossible.
Most of the happy users have no idea how the sausage is made; they just see the prettier syntax highlighting that works with multiple tools. I get that my criticism is as welcome as a wet blanket, but I just think there is something much better possible which your comment hints at.
FWIW, as a happy user, I'm mainly happy that it exists at all. In the short term, it reduces the work of supporting M editors and N languages from M×N to M+N. That's nice. More importantly, it puts a bug in everyone's ear that this is a good and achievable thing. Maybe the next step will be a tree-sitter-API-compatible replacement that fixes some of those problems and we can all migrate onto that.
That is, the big win is getting people to buy into the concept of syntax (and analysis) as a library and not as a feature of one specific editor. Once we're all spoiled by that, perhaps a better implementation or a nice API will come along and astound us all.
> Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill
I'd understood that incremental was used so that as someone writes code the IDE can syntax highlight the incomplete and syntactically incorrect code with better accuracy. Is that not the case?
It is, but the counter argument is that parsers are already so fast that streaming and all-at-once parsing are indistinguishably quick on even huge files.
I don’t believe that’s true, but it’s likely correct for the common use case of files a few pages long, written in well supported languages.
I am quite sure that batch will work with good responsiveness for many, if not most, common languages provided source files have fewer than say 30k lines in them. If you just think about the I/O performance of modern computers, it should not be that difficult to parse at 25MB/sec, which I estimate translates to between 500K and 1M LOC per second, which again is in the 15k-30k LOC range per 30ms.
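For anyone who wants to check the arithmetic, here is a small sketch of the budget calculation. The function names are made up, and the 25 to 50 bytes-per-line range is an assumption (it is what a 500K to 1M lines-per-second estimate implies at 25MB/sec), not a measurement.

```python
def lines_per_window(lines_per_sec: int, window_ms: int) -> int:
    """Lines a batch parser can get through within one latency window."""
    return lines_per_sec * window_ms // 1000


def lines_per_sec_from_bytes(bytes_per_sec: int, bytes_per_line: int) -> int:
    """Convert byte throughput to line throughput for an assumed average line length."""
    return bytes_per_sec // bytes_per_line


# 3 million lines/sec with a 20 ms window, as claimed earlier in the thread.
print(lines_per_window(3_000_000, 20))  # 60000

# 25 MB/sec with assumed average line lengths of 50 and 25 bytes, 30 ms window.
for bytes_per_line in (50, 25):
    rate = lines_per_sec_from_bytes(25_000_000, bytes_per_line)
    print(bytes_per_line, rate, lines_per_window(rate, 30))
```

The loop prints `50 500000 15000` and `25 1000000 30000`, which matches the 15k-30k range. Slower hardware scales the rate down linearly, so a machine ten times slower would manage roughly 1.5k-3k lines in the same window.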
I'm not saying that incremental is bad per se, but that the choice of guaranteeing incrementalism complicates things for cases where it isn't necessary. I am not super familiar with LSP, but I can imagine LSP having a syntax highlighting endpoint that has both batch and incremental modes. A naive implementation could just run the batch mode when given an incremental request and later add incremental support as necessary. In other words, I think it would be best if there were another layer of indirection between the editor and the parser (whether that is tree-sitter or another implementation).
Right now though, you have to opt in whole hog to the tree-sitter approach. As mentioned above, incrementalism has no benefit and only cost for a batch tool like difftastic or semgrep, to mention two named in this thread.
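A rough sketch of that indirection layer, to show how small the naive version could be. The `Highlighter` class, its `open`/`edit` methods, and the toy `highlight_numbers` function are all hypothetical names for illustration; none of this is an existing LSP or tree-sitter API.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, token class)


class Highlighter:
    """Hypothetical editor-facing interface with batch and incremental entry points."""

    def __init__(self, batch_highlight: Callable[[str], List[Span]]):
        self._batch = batch_highlight
        self._text = ""

    def open(self, text: str) -> List[Span]:
        """Batch mode: highlight the whole document."""
        self._text = text
        return self._batch(self._text)

    def edit(self, start: int, end: int, replacement: str) -> List[Span]:
        """Incremental request, served naively: apply the edit, rerun batch.

        A smarter backend could reuse earlier work here without the
        editor-side calls changing.
        """
        self._text = self._text[:start] + replacement + self._text[end:]
        return self._batch(self._text)


def highlight_numbers(text: str) -> List[Span]:
    """Toy batch highlighter (an assumption): marks runs of digits as numbers."""
    return [(m.start(), m.end(), "number") for m in re.finditer(r"\d+", text)]


h = Highlighter(highlight_numbers)
print(h.open("x = 1"))     # [(4, 5, 'number')]
print(h.edit(4, 5, "42"))  # [(4, 6, 'number')]
```

The editor only ever talks to `open` and `edit`, so whether the backend is a batch parser, tree-sitter, or something else becomes an implementation detail that can change later.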
That makes sense to me. I don't know for sure that you're right but it sure seems plausible.
I do wonder how much of a range there is on non-brand-new computers though. I'm typing this on an M2 Max with 64GB of RAM. I also have a Raspberry Pi in the other room, and I know from hard experience that what runs screamingly fast on my Mac may be painfully slow on the Pi.
I could also imagine power benefits to an incremental model. If I type a single character in the middle of a 30KLOC document, a batch process would need to rescan the entire thing where a smart incremental process could say "yep, you're still in the middle of a string constant".
I think it simply boils down to the requirements of interactive editors vs. batch tools.
I have no doubt that interactive editors like Atom/Zed can really make use of incremental parsing, and also lenient parsing.
Syntax highlighting and parsing isn't the only thing they do -- they still need the CPU for other things.
But yeah the problem is incremental is very different than batch, and lenient is very different than strict, so basically every language needs at least 2 separate parsers. That's kind of an unsolved problem, and I'm not sure it can be solved even in principle ...
https://news.ycombinator.com/item?id=39762495
https://www.jonashietala.se/blog/2024/03/19/lets_create_a_tr...