
> only if you're decoding user input as UTF-7

Hmm, may I ask what guarantees UTF-8 won't produce U+DEADBEEF? Or something remotely like that?

Edit:

  >>> b'\xfb\x9b\xbb\xaf'.decode('utf8')
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 0: invalid start byte



When UTF-8 was first defined, they didn't know how big the Unicode range was going to be, so they defined it as a 1-6 byte encoding that could encode any 31-bit codepoint (up to U+7FFFFFFF).

When Unicode was deemed to end at U+10FFFF (because that's the largest value that UTF-16 can encode), UTF-8 was revised to be a 1-4 byte encoding that ends in the same place.
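In current CPython 3.x, at least, you can see the cap directly: the largest codepoint still fits in four bytes, and anything past it can't even be constructed.

  >>> chr(0x10FFFF).encode('utf8')
  b'\xf4\x8f\xbf\xbf'
  >>> chr(0x110000)
  ValueError: chr() arg not in range(0x110000)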

Python clearly implements UTF-8 in a way that uses at most four bytes per codepoint (why support five and six byte sequences if they'll never be used?). I think what we're seeing in b'\xfb\x9b\xbb\xaf' is the first four bytes of a five byte sequence: 0xFB has the 111110xx lead-byte pattern.
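A rough sketch of how the lead byte announces the sequence length under the original scheme (utf8_seq_len is just an illustrative helper, not anything from the stdlib):

  def utf8_seq_len(lead):
      # Sequence length announced by a lead byte under the original 1-6 byte scheme.
      if lead < 0x80: return 1      # 0xxxxxxx: plain ASCII
      if lead < 0xC0: return None   # 10xxxxxx: continuation byte, not a lead
      if lead < 0xE0: return 2      # 110xxxxx
      if lead < 0xF0: return 3      # 1110xxxx
      if lead < 0xF8: return 4      # 11110xxx: the modern (RFC 3629) maximum
      if lead < 0xFC: return 5      # 111110xx
      if lead < 0xFE: return 6      # 1111110x
      return None                   # 0xFE / 0xFF are never valid lead bytes

  print(utf8_seq_len(0xFB))  # -> 5, so a strict 1-4 byte decoder rejects it at position 0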


It's a bug in the UTF-7 decoder that yields an invalid codepoint (outside the Unicode codespace) and isn't checked anywhere.
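A decoder that validated its output would catch it; something along the lines of this check (a hypothetical helper, not the actual patch) is what's missing:

  def is_unicode_scalar(cp):
      # A decoded codepoint must land inside the codespace and not be a lone surrogate.
      return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

  print(is_unicode_scalar(0xDEADBEEF))  # -> False: far outside the codespace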



