This is true, but in other cases they added keywords in ways that could work with type stripping. For example, the `as` keyword for casts has existed for a long time, and type stripping could strip everything after the `as` keyword with a minimal grammar.
When TypeScript added const assertions, they spelled them `as const`, so type stripping could still have worked, depending on how loosely it is implemented.
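For instance, here is roughly what such a minimal pass could do (just a sketch, not how any real stripper is specified):

// A minimal "drop everything from `as` to the end of the expression" pass
// would turn this TypeScript:
const point = { x: 1, y: 2 } as const
// into plain JavaScript along the lines of:
//   const point = { x: 1, y: 2 }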
I think there is a world where type stripping exists (which the TS team has been in favor of) and the TS team might consider how it affects type stripping in future language design. For example, the `satisfies` keyword could have also been added by piggy-backing on the `as` keyword, like:
const foo = { bar: 1 } as subtype of Foo
(I think not using `as` is a better fit semantically but this could be a trade-off to make for better type stripping backwards compatibility)
I don't know a lot about parser theory, and would love to learn more about ways to make parsing resilient in cases like this one. Simple cases like "ignore rest of line" make sense to me, but I'm unsure about "adversarial" examples (in the sense that they are meant to beat simple heuristics). Would you mind explaining how e.g. your `as` stripping could work for one specific adversarial example?
function foo<T>() {
  return bar(
    null as unknown as T extends boolean
      ? true /* ): */
      : (T extends string
          ? "string"
          : false
        )
  )
}
function bar(value: any): void {}
Any solution I can come up with suffers from at least one of these issues:
- "ignore rest of line" will either fail or lead to incorrect results
- "find matching parenthesis" would have to parse comments inside types (probably doable, but could break with future TS additions)
- "try finding end of non-JS code" will inevitably trip up in some situations, and can get very expensive
I'd love a rough outline or links/pointers, if you can find the time!
Most parsers don't actually work with "lines" as a unit; lines are just user formatting. Generally the sort of building blocks you are looking for are more along the lines of "until end of expression" or "until end of statement". What defines an "expression" or a "statement" can be very complex depending on the parser and the language you are trying to parse.
In JS, because it is a fun example, "end of statement" is defined in large part by Automatic Semicolon Insertion (ASI), whether or not semicolons even exist in the source input. (Even if you use semicolons regularly in JS, JS will still insert its own semicolons. Semicolons don't protect you from ASI.) ASI is also useful here because it is an ancient instance of a language design intentionally trying to be resilient. Some older JS parsers would even ignore bad statements and continue on the next statement based on the ASI-determined statement break. We generally like our JS to be much more strict than that today, but early JS was originally built to be a resilient language in some interesting ways.
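A quick illustration of where ASI does and does not break statements (nothing to do with type stripping, just the classic behavior):

// ASI inserts a semicolon at each of these line breaks, so this is two statements:
const a = 1
const b = 2

// Here no semicolon is inserted, because the expression can legally continue
// across the newline. This parses as one statement,
// `"hello"[1, 2].forEach(...)`, and throws at runtime.
const greeting = "hello"
[1, 2].forEach(x => console.log(x))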
Thanks for the response, but I'm aware of the basics. My question is pointed towards making language parsers resilient towards separately-evolving standards. How would you build a JS parser so that it correctly parses any new TS syntax, without changing behavior of valid code?
The example snippet I added is designed to violate the rules I could come up with. I'd specifically like to know: what are better rules to solve this specific case?
> How would you build a JS parser so that it correctly parses any new TS syntax, without changing behavior of valid code?
I don't know anything about parsers besides what I learned from one semester's worth of an introductory class in college, but from what I understand of your question, I think the answer is that you can't, simply because we can't look into the future.
1. Automatic semicolon insertion would next want to kick in at the } token, so that's the obvious end of the statement. If you've asked it to ignore from `as` to the end of the statement (as you've established with your "ignore to the end of the 'line'"), that's where it stops ignoring.
1A. Obviously in that case `bar(null` is not a valid statement after ignoring from `as` to the end of the statement.
2. The trick to your specific case, the one you've stumbled into, is that `as` is an expression modifier, not a statement modifier. The argument to a function is an expression, not a statement. That definitely complicates things because "end of the current expression" is often a lot more complicated than ASI (and people think ASI is complicated). Most parsers are going to have some sort of token state counter for nested parentheses (this is a fun implementation detail of different parsers because while recursion is easy enough in "context-free grammars", the details of tracking that recursion are generally not technically "context-free" at that point, so sometimes it is in the tokenizer, sometimes it is a context extension to the parser itself, sometimes it is a stack implementation detail of the parser), and you are going to want to ignore to the next "," token that signals a new argument or the next ")" that signals the end of arguments, with respect to any () nesting (a rough sketch of that skipping logic follows after 2B).
2A. Because of how complicated expression parsing can get, that probably sets some resiliency bounds on your "ignorable grammar": it may require that internally it still follows most of the logic of your general expression language (balanced nested parentheses, no dangling commas, usual comment syntax, etc.).
2B. You probably want to define those sorts of boundaries anyway. The easiest way is to say that ignorable extensions such as `as` must themselves parse as if they were valid expressions, even if the language cannot interpret their meaning. You can think of this as the meta-grammar where one option for an expression might be `<expression> ::= <expression> 'as' <expression>`, with the second expression being parseable but ignorable after parsing to the language runtime and JIT. You can see that effectively in the syntax description for Python's original PEP 3107 syntax-only type hints standard [1]; it's surprisingly succinct there. (The possible proposed grammar in the Type Annotations proposal to TC39 is a lot more specific and a lot less succinct [2], for a number of reasons.)
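To make that concrete, here is a rough sketch of the "ignore from `as` to the end of the current expression" idea over a token stream. Everything here is made up for illustration (the Token shape, the function name, which brackets count as nesting); real parsers do this very differently, but the shape of the logic is similar:

// Hypothetical token shape, purely for illustration.
type Token = { value: string };

// Skip the tokens that make up the type written after `as`.
// `pos` points at the first token after `as`; the return value is the index of
// the first token that is no longer part of the type. Comments are assumed to
// have already been dropped by the tokenizer, so the `/* ): */` trick in the
// adversarial snippet never reaches this function.
function skipTypeAfterAs(tokens: Token[], pos: number): number {
  let depth = 0; // nesting of (), [], {} and <> opened inside the type itself
  while (pos < tokens.length) {
    const v = tokens[pos].value;
    if (v === "(" || v === "[" || v === "{" || v === "<") {
      depth++;
    } else if (v === ")" || v === "]" || v === "}" || v === ">") {
      if (depth === 0) return pos; // closes something the host expression opened
      depth--;
    } else if (depth === 0 && (v === "," || v === ";")) {
      return pos; // next call argument, or end of the statement
    }
    pos++;
  }
  return pos;
}

On your snippet this would swallow everything from the first `as` up to (but not including) the `)` that closes `bar(`, leaving `return bar(null)`. The `? :` of the conditional type and the second `as` just get skipped along with everything else. The obvious weakness is that it bakes in assumptions about which tokens can appear inside a type, which is exactly the kind of thing a future TS addition could break.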
CSS syntax has specific rules for how to handle unexpected tokens. E.g. if an unexpected character is encountered in a declaration, the parser ignores characters until the next ; or }. But CSS does not have arbitrary nesting, which makes this easier.
Comments such as the one in your example are typically stripped in the tokenization stage, so they would not affect parsing. The TypeScript type syntax has its own grammar, but it uses the same lexical syntax as regular JavaScript.
A “meta grammar” for type expressions could say: skip until the next comma or semicolon, but recognize parentheses and brackets as nesting and skip such blocks in full.
The problem with the ‘satisfies’ keyword is that a parser without support for it would not even know it is part of the type language. New ‘skippable’ syntax would have to be introduced as ‘as satisfies’ or similar, triggering the type-syntax parsing mode.
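Something like this, where the second declaration is deliberately not real TypeScript, just the hypothetical spelling (and ServerConfig is a made-up type):

type ServerConfig = { port: number };

// Real TS: a stripper that predates `satisfies` just sees an unexpected
// identifier after the object literal and cannot parse the statement.
const a = { port: 8080 } satisfies ServerConfig;

// Hypothetical spelling: the long-standing `as` keyword marks where skippable
// type syntax begins, so "skip the type after `as`" would keep working even
// without knowing what `satisfies` means.
const b = { port: 8080 } as satisfies ServerConfig;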
I understand that you can define a restricted grammar that will stay parseable, as the embedded language would have to adapt to those rules. But that doesn't solve the question, as TypeScript already has existing rules which overlap with JS syntax. The GP comment was:
> For example, the `as` keyword for casts has existed for a long time, and type stripping could strip everything after the `as` keyword with a minimal grammar.
My question is: what would a grammar like this look like in this specific case?
It can’t strip what’s after the `as` keyword without an up-to-date TS grammar, because an `as` cast is an expression. The parser needs to know how to parse type expressions in order to know where the RHS of the `as` expression ends.
Let’s say that TypeScript adds a new type operator “wobble T”. What does this desugar to?
x as wobble
T
Without knowing about the new wobble syntax this would be parsed as `x as wobble; T` and desugar to `x; T`
With the new wobble syntax it would be parsed as `x as (wobble T);` according to JS semicolon insertion rules because the expression wobble is incomplete, and desugar to `x`
The “as” expression is not valid JavaScript anyway, so the default rule for implicit semicolons does not apply. A grammar for type expressions could define if and how semicolons should be inserted.
TypeScript already has such type operators though. For example:
type T = keyof
{
  a: null,
  b: null
}
Here T is “a”|”b”, no automatic semicolon is inserted after `keyof`. While I don’t personally write code like this, I’m sure that someone does. It’s perfectly within the rules, after all.
While it’s true that TS doesn’t have to follow JS rules for semicolon insertion in type expressions, it always has done, and probably always should do.
This is just the default. Automatic semicolon insertion only happens in specific, well-defined cases, for example after the “return” keyword or when an otherwise-invalid parse can be made valid by inserting a semicolon. Neither applies here.
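For example, the `return` case (a sketch of the classic restricted-production behavior, unrelated to types):

function f() {
  return          // a line break is not allowed here, so ASI inserts a semicolon
    { ok: true }  // parsed as a block statement containing a label, not a return value
}
console.log(f())  // undefined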