BTW there is interesting feedback from 4 people on a Treesitter post yesterday: ...

diffxx · on March 22, 2024

As one of the harsh and rude commentators, I would say I basically agree with your interpretation. You also correctly inferred that I have experience working with it in an area that is arguably outside of its true use case.

At the same time, I believe that there needs to be a corrective about what tree-sitter should and should not be used for. There are companies building security products on top of tree-sitter, which I think is an objectively bad idea given its problems and limitations. Difftastic is to me a grey area because it could hypothetically lead to a security issue if it generates an incorrect diff due to an incorrect tree-sitter grammar. Unlikely but not impossible.

Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill (I have written a recursive descent parser for a language in C that can do 3 million lines per second on a decent laptop, which is about 60k lines per 20 ms, the window I look to for reactivity). How many non-generated source files exceed say 100k lines? Incremental parsing feels like taking on a lot of complexity for rather limited benefit except in fairly niche use cases (granting that one person's niche is another's common case).

That being said, it is impressive that their incremental algorithm works as well as it does, but the cost is that grammar writers are forced to mold a language's grammar to fit the GLR algorithm even when it is not a natural fit. When it doesn't work as expected, which is not uncommon in my experience, the error messages are inscrutable and debugging either the generator or the generated code is nigh impossible.

Most of the happy users have no idea how the sausage is made; they just see the prettier syntax highlighting that works with multiple tools. I get that my criticism is as welcome as a wet blanket, but I just think there is something much better possible, which your comment hints at.

kstrauser · on March 22, 2024

FWIW, as a happy user, I'm mainly happy that it exists at all. In the short term, it reduces the work of supporting M editors and N languages from M×N to M+N. That's nice. More importantly, it puts a bug in everyone's ear that this is a good and achievable thing. Maybe the next step will be a tree-sitter-API-compatible replacement that fixes some of those problems and we can all migrate onto that.
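A minimal sketch of the M×N vs M+N counting, for readers who want the numbers spelled out. The function names and the example counts (10 editors, 50 languages) are made up for illustration; nothing here comes from tree-sitter itself.

```python
def integrations_without_shared_library(editors: int, languages: int) -> int:
    """Every editor implements its own support for every language."""
    return editors * languages


def integrations_with_shared_library(editors: int, languages: int) -> int:
    """Each editor binds to the shared library once; each language ships one grammar."""
    return editors + languages


if __name__ == "__main__":
    m, n = 10, 50  # hypothetical counts, chosen only for illustration
    print("per-editor support:", integrations_without_shared_library(m, n))
    print("shared library:    ", integrations_with_shared_library(m, n))
```

The gap widens as either count grows, which is the short-term win described above.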

That is, the big win is getting people to buy into the concept of syntax (and analysis) as a library and not as a feature of one specific editor. Once we're all spoiled by that, perhaps a better implementation or a nice API will come along and astound us all.

porker · on March 22, 2024

> Your point about batch vs incremental is spot on, though even for IDEs, I think incremental is usually overkill

I'd understood that incremental was used so that as someone writes code the IDE can syntax highlight the incomplete and syntactically incorrect code with better accuracy. Is that not the case?

kstrauser · on March 22, 2024

It is, but the counterargument is that parsers are already so fast that streaming and all-at-once parsing are indistinguishably quick on even huge files.

I don’t believe that’s true, but it’s likely correct for the common use case of files a few pages long, written in well supported languages.

diffxx · on March 22, 2024

I am quite sure that batch will work with good responsiveness for many, if not most, common languages provided source files have fewer than say 30k lines in them. If you just think about the IO performance of modern computers, it should not be that difficult to parse at 25MB/sec, which I estimate translates to between 500K and 1M loc per second, which again is in the 15k-30k loc range per 30ms.
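The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the average line lengths of 25 and 50 bytes are assumptions picked because they reproduce the 500K to 1M loc per second range, and 1MB is taken as 10^6 bytes.

```python
def lines_per_second(bytes_per_second: float, avg_bytes_per_line: float) -> float:
    """Convert raw parse throughput into lines of code per second."""
    return bytes_per_second / avg_bytes_per_line


def lines_per_window(lines_per_sec: float, window_ms: float) -> float:
    """How many lines fit into one latency window of `window_ms` milliseconds."""
    return lines_per_sec * window_ms / 1000.0


THROUGHPUT = 25_000_000  # 25MB/sec from the comment, assuming 1MB = 10**6 bytes

if __name__ == "__main__":
    for avg in (50, 25):  # assumed average bytes per line, not measured
        lps = lines_per_second(THROUGHPUT, avg)
        print(f"{avg} bytes/line: {lps:,.0f} loc/sec, "
              f"{lines_per_window(lps, 30):,.0f} loc per 30ms")
```

Real averages vary by language and code style, so the budget is an order-of-magnitude guide rather than a guarantee.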

I'm not saying that incremental is bad per se, but that the choice of guaranteeing incrementalism complicates things for cases where it isn't necessary. I am not super familiar with LSP, but I can imagine LSP having a syntax highlighting endpoint that has both batch and incremental modes. A naive implementation could just run the batch mode when given an incremental request and later add incremental support as necessary. In other words, I think it would be best if there were another layer of indirection between the editor and the parser (whether that is tree-sitter or another implementation).
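One way to picture that layer of indirection is sketched below. Everything here is hypothetical: `Highlighter`, `Edit`, and `NumberHighlighter` are illustrative names, not part of LSP or tree-sitter. The point is only that the incremental entry point can default to re-running the batch path, and a smarter implementation can override it later without the editor noticing.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, token kind)


@dataclass(frozen=True)
class Edit:
    """Replace text[start:end] with `replacement` (hypothetical edit shape)."""
    start: int
    end: int
    replacement: str


class Highlighter(ABC):
    """Hypothetical interface sitting between the editor and any parser."""

    @abstractmethod
    def highlight(self, text: str) -> List[Span]:
        """Batch mode: highlight the whole document."""

    def highlight_after_edit(self, text: str, edit: Edit) -> Tuple[str, List[Span]]:
        """Incremental mode. Naive default: apply the edit, then redo the batch pass.

        An implementation with real incremental support would override this.
        """
        new_text = text[:edit.start] + edit.replacement + text[edit.end:]
        return new_text, self.highlight(new_text)


class NumberHighlighter(Highlighter):
    """Toy batch-only backend that marks runs of digits as numbers."""

    def highlight(self, text: str) -> List[Span]:
        return [(m.start(), m.end(), "number") for m in re.finditer(r"\d+", text)]


if __name__ == "__main__":
    backend = NumberHighlighter()
    text, spans = backend.highlight_after_edit("x = 42", Edit(4, 6, "1000"))
    print(text, spans)
```

The editor only ever calls the two methods on `Highlighter`, so swapping the backend for tree-sitter or a hand-written parser would not change the calling code.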

Right now though, you have to opt in whole hog to the tree-sitter approach. As mentioned above, incrementalism has no benefit and only cost for a batch tool like difftastic or semgrep, to mention two named in this thread.

kstrauser · on March 22, 2024

That makes sense to me. I don't know for sure that you're right but it sure seems plausible.

I do wonder how much of a range there is on non-brand-new computers though. I'm typing this on an M2 Max with 64GB of RAM. I also have a Raspberry Pi in the other room, and I know from hard experience that what runs screamingly fast on my Mac may be painfully slow on the Pi.

I could also imagine power benefits to an incremental model. If I type a single character in the middle of a 30KLOC document, a batch process would need to rescan the entire thing, whereas a smart incremental process could say "yep, you're still in the middle of a string constant".
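A toy illustration of that "still in the middle of a string constant" shortcut follows. It is not tree-sitter's algorithm. It assumes a made-up language where every `"` toggles string mode (no escapes), caches the lexer state at the end of each line, and only handles replacing an existing line. After an edit it rescans forward until the new end-of-line state matches the cached one.

```python
from typing import List


def scan_line(line: str, in_string: bool) -> bool:
    """Return whether we are inside a string after `line`; each '"' toggles the state."""
    for ch in line:
        if ch == '"':
            in_string = not in_string
    return in_string


class LineStateCache:
    """Caches the in-string state at the end of every line."""

    def __init__(self, lines: List[str]) -> None:
        self.lines = list(lines)
        self.end_states: List[bool] = []
        state = False
        for line in self.lines:
            state = scan_line(line, state)
            self.end_states.append(state)

    def replace_line(self, index: int, new_line: str) -> int:
        """Replace one line and return how many lines had to be rescanned."""
        self.lines[index] = new_line
        state = self.end_states[index - 1] if index > 0 else False
        rescanned = 0
        for i in range(index, len(self.lines)):
            state = scan_line(self.lines[i], state)
            rescanned += 1
            if state == self.end_states[i]:
                break  # later lines start in the same state as before, so stop
            self.end_states[i] = state
        return rescanned


if __name__ == "__main__":
    cache = LineStateCache(['x = "', "hello", "world", '"', "y = 1"])
    print(cache.replace_line(1, "hello!"))  # edit stays inside the string
    print(cache.replace_line(1, 'hello"'))  # edit closes the string early
```

Typing an ordinary character inside the string touches one line, while typing a quote flips the state of every following line and degrades to a scan of the rest of the file, so the saving depends heavily on the edit.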

chubot · on March 23, 2024

I think it simply boils down to the requirements of interactive editors vs. batch tools.

I have no doubt that interactive editors like Atom/Zed can really make use of incremental parsing, and also lenient parsing.

Syntax highlighting and parsing aren't the only things they do -- they still need the CPU for other things.

But yeah, the problem is that incremental is very different than batch, and lenient is very different than strict, so basically every language needs at least 2 separate parsers. That's kind of an unsolved problem, and I'm not sure it can be solved even in principle ...
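To make the strict vs. lenient distinction concrete, here is a deliberately tiny sketch over a made-up "language" of nested parentheses. The function names are illustrative, and this shows only the difference in behavior, not a way around the two-parser problem: real error recovery for a full grammar is far harder than skipping a stray bracket.

```python
from typing import List, Tuple


class ParseError(Exception):
    """Raised by the strict checker at the first problem it finds."""


def check_parens_strict(text: str) -> int:
    """Batch-tool style: return the max nesting depth, or raise on the first error."""
    depth = max_depth = 0
    for pos, ch in enumerate(text):
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            if depth == 0:
                raise ParseError(f"unmatched ')' at offset {pos}")
            depth -= 1
    if depth:
        raise ParseError(f"{depth} unclosed '('")
    return max_depth


def check_parens_lenient(text: str) -> Tuple[int, List[str]]:
    """Editor style: never raise; skip stray ')' and report everything found."""
    depth = max_depth = 0
    errors: List[str] = []
    for pos, ch in enumerate(text):
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            if depth == 0:
                errors.append(f"unmatched ')' at offset {pos}")
                continue  # recover by ignoring the stray bracket
            depth -= 1
    if depth:
        errors.append(f"{depth} unclosed '('")
    return max_depth, errors


if __name__ == "__main__":
    print(check_parens_strict("(a (b))"))
    print(check_parens_lenient("(a))(b"))
```

Even in this toy, the two functions have different return types and different failure contracts, which hints at why sharing one implementation between an editor and a batch tool is awkward.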