I'm a big believer in tree-sitter. I think it's going to continue to grow and become an important tool, especially in security and code-tooling contexts.
The main innovation of tree-sitter, even more than incremental parsing, as I see it, is that it provides a uniform API for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem, though, is that the tree-sitter grammar is nearly always going to be an approximation of the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing. To me, this is problematic for tooling because it is always possible for a tree-sitter-based tool to be flat-out wrong relative to the actual language. For syntax highlighting, this is generally not a huge deal (and tree-sitter does generally work well, though there are exceptions), but I'd be more cautious with security tools based on tree-sitter.
If all languages changed their reference parsers to tree-sitter, this would be moot, but that seems unlikely. Language parsers are often optimized beyond what is possible in a general-purpose parser generator like tree-sitter, and/or they have ambiguities that cannot be resolved with the tree-sitter DSL.
What seems likely in the future is that a standard parse tree API emerges, analogous to LSP, and then language parsers could emit trees traversable by this API. Maybe it's just the tree-sitter C API with an alternate front end? Hard to say, but I suspect either something better than (but likely at least partially inspired by) tree-sitter will emerge, or we will get stuck in a local minimum with tooling based on slightly incorrect language parsers.
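To make the "standard parse tree API" idea a bit more concrete, here is a minimal sketch in Python. It assumes a hypothetical language-neutral interface (`SyntaxNode`, `walk`, and `count_kind` are names invented for illustration, not part of tree-sitter or any existing standard) and adapts Python's own reference parser, the stdlib `ast` module, to it. The point is only that tooling written against the generic interface never touches parser-specific types.

```python
import ast
from dataclasses import dataclass
from typing import Iterator, Protocol


class SyntaxNode(Protocol):
    """Hypothetical language-neutral node interface (not a real standard)."""

    @property
    def kind(self) -> str: ...

    def children(self) -> Iterator["SyntaxNode"]: ...


@dataclass(frozen=True)
class PyAstNode:
    """Adapter exposing a node from Python's own parser via SyntaxNode."""

    node: ast.AST

    @property
    def kind(self) -> str:
        # Use the AST class name (e.g. "Call", "FunctionDef") as the node kind.
        return type(self.node).__name__

    def children(self) -> Iterator["PyAstNode"]:
        for child in ast.iter_child_nodes(self.node):
            yield PyAstNode(child)


def walk(root: SyntaxNode) -> Iterator[SyntaxNode]:
    """Pre-order traversal that relies only on the generic interface."""
    yield root
    for child in root.children():
        yield from walk(child)


def count_kind(root: SyntaxNode, kind: str) -> int:
    """Count nodes of a given kind; stands in for a real tool's query."""
    return sum(1 for n in walk(root) if n.kind == kind)


tree = PyAstNode(ast.parse("def f(x):\n    return g(x) + h(x)\n"))
print(count_kind(tree, "Call"))  # prints 2
```

A second language would need only its own adapter class; `walk` and `count_kind` would stay unchanged. Node kind names would still differ per language, which is where the idiosyncrasies show up.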
> as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing.
Author of DiffLens (https://marketplace.visualstudio.com/items?itemName=DiffLens...) here. A uniform API for traversing a parse tree for all languages would be amazing for DiffLens! However, I fear languages are different enough that this ideal may never be reached :) Or maybe there would be a core set of APIs plus extensions for the idiosyncrasies of each language. For DiffLens, though, we try to use the language's official parser/compiler if it exposes an AST.
> unless the language compiler/interpreter uses tree-sitter for parsing
Doubtful. Last time I tried, tree-sitter would parse invalid inputs without even tagging any errors in the parse tree. For example, it would silently accept extra tokens, or keywords in the place of identifiers. Replacing the built-in lexer and then validating the parse tree for correctness would be close to writing the grammar twice.
And accepting partially correct inputs within the compiler toolchain isn't too hard, so I don't really see the advantage of agreeing on tree-sitter and not just on a parse tree representation that editors can then query, as you then suggested. If the big deal is having it execute client-side or being sandboxed, I feel that's orthogonal to parsing algorithms.
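One way to picture the "silently accepts invalid input" concern is a cross-check against the reference parser. The sketch below uses CPython's stdlib `ast.parse` as the reference. The `approx_accepts` flag is a hypothetical stand-in for whatever verdict an approximate grammar gives, and `reference_parser_accepts` and `cross_check` are illustrative names. It makes no claim about what any particular tree-sitter grammar actually accepts.

```python
import ast


def reference_parser_accepts(source: str) -> bool:
    """Return True if CPython's own parser accepts `source`."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def cross_check(source: str, approx_accepts: bool) -> str:
    """Compare an approximate parser's verdict with the reference parser's.

    `approx_accepts` is an assumed input: the (hypothetical) verdict of an
    approximate grammar for the same source text.
    """
    ref = reference_parser_accepts(source)
    if approx_accepts and not ref:
        return "false-accept"  # approximate grammar let invalid code through
    if ref and not approx_accepts:
        return "false-reject"  # approximate grammar rejected valid code
    return "agree"


# A keyword in identifier position is rejected by the reference parser.
print(cross_check("class = 1", approx_accepts=True))  # prints false-accept
print(cross_check("x = 1", approx_accepts=True))      # prints agree
```

For a security tool, the "false-accept" case is the worrying one: the tool reasons about a tree for code the real compiler would never run.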
tree-sitter is a bit better than regexps, but it is not an actual parser of grammars. A fast, actual parser of all languages for syntax coloring is the future, I think; tree-sitter is a pragmatic middle ground while we wait for the prime solution.