Surrogates are technically a UTF-16-only thing. Since they nevertheless sometimes escape out into the wild, WTF-8 defines a superset of UTF-8 that encodes them.
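Concretely, WTF-8 encodes a surrogate code point with the same three-byte bit layout UTF-8 uses for other code points in that range, which strict UTF-8 refuses to emit or accept. A minimal sketch in C (the value U+DF7C is just an illustrative example, chosen to match the mangling described below, not anything from the spec):

    #include <stdio.h>

    int main(void)
    {
        /* A lone surrogate, which strict UTF-8 must reject.  WTF-8 just
         * applies the normal three-byte UTF-8 bit layout to it. */
        unsigned cp = 0xDF7C;

        unsigned char wtf8[3] = {
            0xE0 | (cp >> 12),          /* 1110xxxx */
            0x80 | ((cp >> 6) & 0x3F),  /* 10xxxxxx */
            0x80 | (cp & 0x3F),         /* 10xxxxxx */
        };

        printf("U+%04X -> %02X %02X %02X\n", cp, wtf8[0], wtf8[1], wtf8[2]);
        /* prints: U+DF7C -> ED BD BC */
        return 0;
    }

Anything that was already well-formed UTF-8 stays byte-for-byte identical, which is what makes WTF-8 a superset rather than a different encoding.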
To be clear, this is not an official Unicode spec. It's a hack (albeit a pretty natural and obvious one) to deal with systems that don't do Unicode quite right.
I recently came across some old code that narrows wchar_t to UCS-2 by zeroing out the high-order bytes. Even though my test was careful not to generate any surrogates in the input, they showed up in the output when a randomly generated code point like U+1DF7C was mangled into U+DF7C.
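For a concrete picture of that mangling, here's a hypothetical reconstruction (not the code in question), assuming a platform with a 32-bit wchar_t:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Keeping only the low 16 bits of each wide character turns the
         * supplementary-plane code point U+1DF7C into the lone low
         * surrogate U+DF7C. */
        wchar_t wide = 0x1DF7C;
        unsigned short narrowed = (unsigned short)wide;  /* high-order bytes zeroed */

        printf("U+%05lX -> U+%04X\n", (unsigned long)wide, (unsigned)narrowed);
        /* prints: U+1DF7C -> U+DF7C, which lands in the surrogate range
         * D800..DFFF even though the input contained no surrogates */
        return 0;
    }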
A corrupted value like that is not necessarily a great example of something you want to preserve, but the truncation that produced it is exactly the sort of assumption a lot of late-90s code made about Unicode: that every code point fits in 16 bits.
Specifically, filenames on Windows are not UTF-16 (or UCS-2) but rather WTF-16: like UTF-16, but with possibly unpaired surrogates. WTF-8 provides an 8-bit encoding for such filenames that matches UTF-8 wherever the original was valid UTF-16 and converts the rest in the most straightforward way possible, meaning you need less code to go from WTF-16 to WTF-8 than to go from UTF-16 to UTF-8 while rejecting invalid input.
https://simonsapin.github.io/wtf-8/
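A minimal sketch of such a conversion (the names are my own invention, not from the spec): a strict UTF-16 to UTF-8 converter needs an extra branch to reject or replace every lone surrogate, while here a lone surrogate simply falls through to the ordinary three-byte encoding.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical WTF-16 -> WTF-8 converter.  Valid surrogate pairs are
     * joined and encoded exactly as UTF-8 would encode them; anything
     * else, including a lone surrogate, is encoded as-is.  Assumes `out`
     * has room for the worst case of 3 bytes per input unit. */
    static size_t wtf16_to_wtf8(const uint16_t *in, size_t n, uint8_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t cp = in[i];

            /* High surrogate followed by a low surrogate: a valid pair. */
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                i++;
            }

            if (cp < 0x80) {
                out[o++] = (uint8_t)cp;
            } else if (cp < 0x800) {
                out[o++] = (uint8_t)(0xC0 | (cp >> 6));
                out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             /* includes lone surrogates */
                out[o++] = (uint8_t)(0xE0 | (cp >> 12));
                out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
            } else {
                out[o++] = (uint8_t)(0xF0 | (cp >> 18));
                out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
                out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
            }
        }
        return o;  /* number of WTF-8 bytes written */
    }

    int main(void)
    {
        /* "caf\u00e9" plus a lone low surrogate, as a mangled filename might contain. */
        uint16_t name[] = { 'c', 'a', 'f', 0x00E9, 0xDF7C };
        uint8_t buf[3 * 5];
        size_t len = wtf16_to_wtf8(name, 5, buf);
        for (size_t i = 0; i < len; i++)
            printf("%02X ", buf[i]);
        printf("\n");   /* 63 61 66 C3 A9 ED BD BC */
        return 0;
    }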