Or, you just use the DOM API to manipulate the structure. You don't implement a ...

lolinder · 2024-07-22T16:59:45

If you don't want your users to produce HTML, then why would you use the DOM API to parse their text into an HTML data structure? Then you'd have code that's capable of producing <script> tags or who knows what else from untrusted user input and you now have to explicitly filter out tag types. Alternatively you can implement the middle bit of a compiler and map nodes to a new, safe data structure that you spit out at the end, but in the scenario we're discussing the user input was supposed to be unstructured text. HTML content is in most cases a malicious edge case, not expected data.

If you instead escape the user-provided unstructured text by replacing the very well-known set of special characters that could create tags, you know your users cannot produce active code, only text nodes.
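As a rough illustration of that kind of escaping, here is a minimal sketch. The `escapeHtml` name and the exact replacement table are assumptions for the example, not any particular library's API; in real code you would normally reach for the escaping routine your standard library or framework already ships.

```javascript
// Hypothetical helper: replaces the characters that can open tags, start
// entities, or terminate quoted attribute values in HTML.
function escapeHtml(text) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
  };
  // A single pass over the input, so the '&' we emit is never re-escaped.
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

const untrusted = '<script>alert("hi")</script>';
const escaped = escapeHtml(untrusted);
console.log(escaped);
// prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

Whatever the user types, the output contains no literal `<`, so a browser can only ever see it as a text node.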

It's the principle of least power: if you don't need users to access anything other than unstructured text then why feed their input into a parser that produces a data structure that represents code? Make illegal states unrepresentable by just escaping the text nodes as they're saved!

TeMPOraL · 2024-07-22T18:15:23

The problem isn't with what the user can do, but with what your code can. If you bork your escaping, which is context-dependent, then user data can turn into arbitrary HTML, complete with script tags. If you keep an abstract tree representation, and add the user-provided data by passing it verbatim to a "set text content" method on a node, then there's no possible way the user input can break it. That is exactly what it means to make illegal states unrepresentable!

Working on the data structures after parsing makes it impossible to accidentally break the structure itself. Like, maybe your string escaping is perfect, but if you do:

  $content = $templatePrefix + $sanitizedString + $templateSuffix;

Then you're still vulnerable to trivial errors in your template breaking the structure and creating an exploitable vulnerability, despite the $sanitizedString being correct. If you instead work at the parsed level and do:

  $result = $template.findNode("#foo").setText($unsanitizedInput)

Then there's just no way this can break (except bugs in the HTML parser and DOM API in general, which are much less likely to exist, and much easier to find and fix).
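To make the tree-based approach concrete, here is a small sketch using a toy tree in place of a real DOM. `ElementNode`, `TextNode`, `findNode`, `setText`, and `serialize` are all hypothetical names made up for this example (and `findNode` takes a bare id rather than a `#foo` selector); a real DOM implementation is far more involved, but the shape of the idea is the same: user input only ever becomes a text node, and escaping happens in one place at serialization time.

```javascript
// Escapes text for use in HTML text content and double-quoted attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical text node: holds raw text, escapes only when serialized.
class TextNode {
  constructor(text) {
    this.text = String(text);
  }
  serialize() {
    return escapeHtml(this.text);
  }
}

// Hypothetical element node: the only way to create markup in this tree.
class ElementNode {
  constructor(tag, id = null) {
    this.tag = tag;
    this.id = id;
    this.children = [];
  }
  append(child) {
    this.children.push(child);
    return this;
  }
  // Depth-first search for an element with the given id.
  findNode(id) {
    if (this.id === id) return this;
    for (const child of this.children) {
      if (child instanceof ElementNode) {
        const hit = child.findNode(id);
        if (hit) return hit;
      }
    }
    return null;
  }
  // Replaces all children with a single text node; never parses the input.
  setText(text) {
    this.children = [new TextNode(text)];
    return this;
  }
  serialize() {
    const idAttr = this.id === null ? '' : ` id="${escapeHtml(this.id)}"`;
    const inner = this.children.map((c) => c.serialize()).join('');
    return `<${this.tag}${idAttr}>${inner}</${this.tag}>`;
  }
}

const template = new ElementNode('div').append(new ElementNode('p', 'foo'));
template.findNode('foo').setText('</p><script>alert(1)</script>');
const html = template.serialize();
console.log(html);
// prints: <div><p id="foo">&lt;/p&gt;&lt;script&gt;alert(1)&lt;/script&gt;</p></div>
```

The hostile input tries to close the `<p>` and open a `<script>`, but since it is stored as a text node there is no code path that could turn it into an element.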

lolinder · 2024-07-22T23:36:03

I think we've been talking past each other.

  $result = $template.findNode("#foo").setText($unsanitizedInput)

This is not parsing the user input, this is letting the native API escape the input for you, which is exactly what I'm advocating for. See my note above:

> escape the HTML using the language's standard library or the web framework's tooling.

This is what parsing the user input would look like with the DOM API:

    const newDiv = document.createElement('div');
    newDiv.innerHTML = untrustedUserInput;
    // Do some work to attempt to sanitize the new HTML elements
    document.body.appendChild(newDiv);

To me this is definitively a Bad Idea™, and I thought this was what you were advocating for.

What you actually proposed is just escaping the HTML, not parsing user input, with the only twist being that you prefer to inject user input into your templating system imperatively with something resembling the DOM API instead of declaratively with something resembling JSX. That's fine, but not relevant to the question of what method we use to sanitize the untrusted input that we're injecting. On that front it sounds like we're in agreement that parsing user input is a terrible idea.