
> only if you're decoding user input as UTF-7

Hmm, may I ask what guarantees UTF-8 won't produce U+DEADBEEF? Or something remotely like that?

Edit:

  >>> b'\xfb\x9b\xbb\xaf'.decode('utf8')
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 0: invalid start byte



When UTF-8 was first defined, they didn't know how big the Unicode range was going to be, so they defined it as a 1-6 byte encoding that could encode any 31-bit codepoint (up to U+7FFFFFFF).

When Unicode was deemed to end at U+10FFFF (because that's the largest value that UTF-16 can encode), UTF-8 was revised to be a 1-4 byte encoding that ends in the same place.
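In current CPython 3.x, at least, you can see the cap directly: the largest codepoint still fits in four bytes, and anything past it can't even be constructed.

  >>> chr(0x10FFFF).encode('utf8')
  b'\xf4\x8f\xbf\xbf'
  >>> chr(0x110000)
  ValueError: chr() arg not in range(0x110000)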

Python clearly implements UTF-8 in a way that uses at most four bytes per codepoint (why support five and six byte sequences if they'll never be used?). I think what we're seeing in b'\xfb\x9b\xbb\xaf' is the first four bytes of a five byte sequence: 0xFB has the 111110xx lead-byte pattern.
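A rough sketch of how the lead byte announces the sequence length under the original scheme (utf8_seq_len is just an illustrative helper, not anything from the stdlib):

  def utf8_seq_len(lead):
      # Sequence length announced by a lead byte under the original 1-6 byte scheme.
      if lead < 0x80: return 1      # 0xxxxxxx: plain ASCII
      if lead < 0xC0: return None   # 10xxxxxx: continuation byte, not a lead
      if lead < 0xE0: return 2      # 110xxxxx
      if lead < 0xF0: return 3      # 1110xxxx
      if lead < 0xF8: return 4      # 11110xxx: the modern (RFC 3629) maximum
      if lead < 0xFC: return 5      # 111110xx
      if lead < 0xFE: return 6      # 1111110x
      return None                   # 0xFE / 0xFF are never valid lead bytes

  print(utf8_seq_len(0xFB))  # -> 5, so a strict 1-4 byte decoder rejects it at position 0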


It's a bug in the UTF-7 decoder that yields an invalid codepoint (outside the Unicode codespace) and isn't checked anywhere.
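A decoder that validated its output would catch it; something along the lines of this check (a hypothetical helper, not the actual patch) is what's missing:

  def is_unicode_scalar(cp):
      # A decoded codepoint must land inside the codespace and not be a lone surrogate.
      return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

  print(is_unicode_scalar(0xDEADBEEF))  # -> False: far outside the codespace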



