
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures.

For the record, the top 25 common weaknesses for 2023 are listed at:

* https://cwe.mitre.org/top25/archive/2023/2023_top25_list.htm...

Deserialization of Untrusted Data (CWE-502) was number fifteen. Number one was Out-of-bounds Write (CWE-787), and Use After Free (CWE-416) was number four.

CWEs that have been in every list since they started doing this (2019):

* https://cwe.mitre.org/top25/archive/2023/2023_stubborn_weakn...




# Top Stubborn Software Weaknesses (2019-2023)

Out-of-bounds Write

Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)

Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)

Use After Free

Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')

Improper Input Validation

Out-of-bounds Read

Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)

Cross-Site Request Forgery (CSRF)

NULL Pointer Dereference

Improper Authentication

Integer Overflow or Wraparound

Deserialization of Untrusted Data

Improper Restriction of Operations within Bounds of a Memory Buffer

Use of Hard-coded Credentials


Yup. Almost all of them are various flavors of fucking up a parser or misusing it (in particular, all the injection cases are typically caused by writing stupid code that glues strings together instead of proper parsing).


That's not parsing, that's the inverse of parsing. It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly. It's compiling, of a sort.

Parsing is the reverse—taking an untrusted string (or binary string) that is meant to be code and converting it into a data structure.

Both are the result of taking untrusted data and assuming it'll look like what you expect, but they are not both parsing issues.


> It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly.

Which is precisely why parsing should've been used here instead. The correct way to do this is to work at the level after parsing, not before it. "SELECT * FROM foo WHERE bar LIKE ${untrusted input}" is dumb. Parsing the query with a placeholder in it, replacing it as an abstract node in the parsed form with data, and then serializing to string if needed to be sent elsewhere, is the correct way to do it, and is immune to injection attacks.


For SQL we tend to use prepared statements as the answer, which probably do some parsing under the hood but that's not visible to the programmer. I'd raise a lot of questions if I saw someone breaking out a parser to handle a SQL injection risk.
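
As a rough illustration of the difference, here is a minimal sketch using Python's standard `sqlite3` module. The table `foo`, the column `bar`, and the helper names are made up for the example; `?` is sqlite3's placeholder syntax. Other databases and drivers use different placeholder styles, so treat this as a sketch rather than a recipe.

```python
import sqlite3

def make_db():
    # In-memory database with a hypothetical table `foo`.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE foo (bar TEXT)")
    conn.executemany("INSERT INTO foo (bar) VALUES (?)", [("alpha",), ("beta",)])
    return conn

def find_unsafe(conn, untrusted):
    # String gluing: the input becomes part of the SQL text and is parsed as code.
    query = "SELECT bar FROM foo WHERE bar LIKE '" + untrusted + "'"
    return [row[0] for row in conn.execute(query)]

def find_safe(conn, untrusted):
    # Placeholder: the statement text is fixed, and the value is bound separately as data.
    query = "SELECT bar FROM foo WHERE bar LIKE ?"
    return [row[0] for row in conn.execute(query, (untrusted,))]
```

With an input like `x' OR '1'='1`, `find_unsafe` matches every row because the quote closes the string literal early, while `find_safe` treats the whole input as one pattern and matches nothing.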


That's because prepared statements were developed before the understanding of langsec was mature enough. They provide a very simple API, but it's at (or above) the right level - you just get to use special symbols to mark "this node will be provided separately", and provide it separately, while the API makes sure it's correctly integrated into the whole according to the rules of the language.

(Probably one other factor is that SQL was designed in a peculiar way, for "readability to non-programmers", which tends to result in languages that don't map well to simple data structures. Still, there are tools that let you construct a tree, and will generate valid SQL from that.)

HTML is a better example, because it's inherently tree-structured, and trees tend to be convenient to work with in code. There it's more obvious when you're crossing from dumb string to parsed representation, and then back.


The same thing applies to HTML, though: I would shudder if I saw a parser implemented for most HTML injection prevention. The correct answer in almost all cases is to escape the HTML using the language's standard library or the web framework's tooling.

The only situation where a parser makes sense over simple escaping routines is if you actually intended to accept a subset of the language that you're injecting into rather than plain text, in which case you'll need more than just a parser to ensure you don't have anything dangerous—you'd need to do a lot of error-prone analysis of the AST afterward as well.


Or, you just use the DOM API to manipulate the structure. You don't implement a parser because one is already provided by the tooling - you use it to go from known-valid text to a data structure (here, DOM), and do your operations there.

You shouldn't do "escaping" and string concatenation. That's just parsing and unparsing while cutting corners, which is how you get injection bugs.

> The only situation where a parser makes sense over simple escaping routines is if you actually intended to accept a subset of the language that you're injecting into rather than plain text

And that's exactly what you're doing. With escaping, you're taking a serialized form of some data and splicing some other data into it, massaged in a way you hope will make it always parse to a string when something parses this later. It's going to eventually bite you; not necessarily with XSS - web template breakage is another common occurrence.

Working in string space is tricky, dangerous, and dumb - parsing, working on the parsed representation, and unparsing at the end, is how you do it correctly and safely.

(Another way to put it: plaintext is a wire format; you don't work in it if the data is structured.)

Note that the API may look like you're doing text - see JSX - but it internally goes through a parsing stage, and makes it impossible for you to do stupid things that break or transform the program, like working in string space lets you.


If you don't want your users to produce HTML, then why would you use the DOM API to parse their text into an HTML data structure? Then you'd have code that's capable of producing <script> tags or who knows what else from untrusted user input and you now have to explicitly filter out tag types. Alternatively you can implement the middle bit of a compiler and map nodes to a new, safe data structure that you spit out at the end, but in the scenario we're discussing the user input was supposed to be unstructured text. HTML content is in most cases a malicious edge case, not expected data.

If you instead escape the user-provided unstructured text by replacing the very well-known set of special characters that could create tags, you know your users cannot produce active code, only text nodes.

It's the principle of least power: if you don't need users to access anything other than unstructured text then why feed their input into a parser that produces a data structure that represents code? Make illegal states unrepresentable by just escaping the text nodes as they're saved!
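
For concreteness, a minimal sketch of that kind of escaping using Python's standard `html.escape`. The `render_comment` helper and the `<p>` wrapper are hypothetical, and this only covers the case where the text lands in element content, not inside attributes, scripts, or URLs.

```python
import html

def render_comment(untrusted_text):
    # html.escape replaces &, < and >, plus " and ' when quote=True (the default),
    # so the input can only ever end up as a text node.
    return "<p>" + html.escape(untrusted_text) + "</p>"
```

For example, `render_comment("<script>alert(1)</script>")` returns `<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>`.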


The problem isn't with what the user can do, but with what your code can. If you bork your escaping, which is context-dependent, then user data can turn into arbitrary HTML, complete with script tags. If you keep an abstract tree representation, and add the user-provided data by passing it verbatim to a "set text content" method on a node, then there's no possible way the user input can break it. That is exactly what it means to make illegal states unrepresentable!

Working on the data structures after parsing makes it impossible to accidentally break the structure itself. Like, maybe your string escaping is perfect, but if you do:

  $content = $templatePrefix + $sanitizedString + $templateSuffix;

Then you're still vulnerable to trivial errors in your template breaking the structure and creating an exploitable vulnerability, despite the $sanitizedString being correct. If you instead work at parsed level and do:

  $result = $template.findNode("#foo").setText($unsanitizedInput)

Then there's just no way this can break (except bugs in the HTML parser and DOM API in general, which are much less likely to exist, and much easier to find and fix).
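
A runnable approximation of the second snippet, as a sketch only: it uses Python's standard `xml.etree.ElementTree` as a stand-in for a DOM API (it is an XML parser, not an HTML one), and the template and the `render` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical trusted template; it must be well-formed XML for this stand-in.
TEMPLATE = '<div><p id="foo"></p></div>'

def render(untrusted_text):
    root = ET.fromstring(TEMPLATE)        # parse the trusted template
    node = root.find(".//p[@id='foo']")   # locate the target node in the tree
    node.text = untrusted_text            # stored as text, never parsed as markup
    # Serialization escapes &, < and > in text nodes.
    return ET.tostring(root, encoding="unicode")
```

Whatever the input contains, the serialized result still has exactly one `div` and one `p`; the input only changes the text inside the `p`.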


I think we've been talking past each other.

  $result = $template.findNode("#foo").setText($unsanitizedInput)

This is not parsing the user input, this is letting the native API escape the input for you, which is exactly what I'm advocating for. See my note above:

> escape the HTML using the language's standard library or the web framework's tooling.

This is what parsing the user input would look like with the DOM API:

    const newDiv = document.createElement('div');
    newDiv.innerHTML = untrustedUserInput;
    // Do some work to attempt to sanitize the new HTML elements
    document.body.appendChild(newDiv);

To me this is definitively a Bad Idea™, and I thought this was what you were advocating for.

What you actually proposed is just escaping the HTML, not parsing user input, with the only twist being that you prefer to inject user input into your templating system imperatively with something resembling the DOM API instead of declaratively with something resembling JSX. That's fine, but not relevant to the question of what method we use to sanitize the untrusted input that we're injecting. On that front it sounds like we're in agreement that parsing user input is a terrible idea.


> Number one was Out-of-bounds Write (CWE-787)

Surely many of these originate from deserialization of untrusted data (e.g., trusting a supplied length). It’s probably documented but I’m passively curious how they disambiguate these cases.
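
To make "trusting a supplied length" concrete, here is a minimal sketch of a length-prefixed record reader in Python. The wire format (4-byte big-endian length followed by the payload) and the `read_record` name are assumptions for illustration. Python is memory-safe, so this only shows the check itself; in C, skipping the equivalent check before copying is the kind of mistake that ends up as an out-of-bounds read or write.

```python
import struct

def read_record(buf):
    # Hypothetical wire format: 4-byte big-endian length, then that many payload bytes.
    if len(buf) < 4:
        raise ValueError("truncated header")
    (length,) = struct.unpack_from(">I", buf, 0)
    # Validate the declared length against the bytes actually present.
    if length > len(buf) - 4:
        raise ValueError("declared length exceeds available data")
    return buf[4:4 + length]
```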


>> Number one was Out-of-bounds Write (CWE-787)

> Surely many of these originate from deserialization of untrusted data (e.g., trusting a supplied length).

Then they would presumably be classified under "Deserialization of Untrusted Data", number fifteen.


That’s entirely my point. If a vulnerability happens due to writing out of bounds during untrusted deserialization, which category would you file it under?

“Deserialization of untrusted data” isn’t even a security bug like an out of bounds write is. Every meaningful program deserializes external input. It’s a common area where bugs occur, but it’s not a type of bug in and of itself. Every bug in that category “belongs” in a more proximate category.


Out-of-bounds write attacks are generally executed on parsers, to be fair.



