It managed to insert invalid Unicode into a SQLite database, causing a subsequen...

icebraining · on Nov 21, 2013

Yes, but only if you're decoding user input as UTF-7, which would be insane.

doki_pen · on Nov 21, 2013

What if you were scraping a webpage and it reported its encoding as UTF-7?

mattdeboard · on Nov 21, 2013

...which is, in fact, exactly how this bug was exposed.

gsnedders · on Nov 22, 2013

Modern browsers don't support UTF-7 any more after a number of XSS attacks relying on inserting UTF-7 encoded script elements which then cause the document to be sniffed as UTF-7.

The only place UTF-7 is still widely used is in email clients.

est · on Nov 22, 2013

> only if you're decoding user input as UTF-7

Hmm, may I ask what prevents utf8 from producing U+DEADBEEF? Or something remotely like that?

Edit:

'\xfb\x9b\xbb\xaf'.decode('utf8')

UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 0: invalid start byte

rspeer · on Nov 23, 2013

When UTF-8 was first defined, they didn't know how big the Unicode range was going to be, so they defined it as a 1-6 byte encoding that could encode any 31-bit codepoint.

When Unicode was deemed to end at U+10FFFF (because that's the largest value that UTF-16 can encode), UTF-8 was revised to be a 1-4 byte encoding that ends in the same place.

Python clearly implements UTF-8 in a way that uses at most four bytes per codepoint (why support five and six byte sequences if they'll never be used?). I think what we're seeing in '\xfb\x9b\xbb\xaf' is four bytes out of a five byte sequence: 0xfb has the bit pattern 111110xx, which was the lead byte of a five-byte sequence in the original scheme.
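
To make the lead-byte point concrete, here is a rough Python 3 sketch. The helper name legacy_utf8_length is made up for illustration and is not part of Python; it only reads the bit pattern of the first byte under the original 1-6 byte scheme and does no validation of the rest of the sequence.

```python
def legacy_utf8_length(lead: int) -> int:
    """Sequence length implied by a lead byte under the original 1-6 byte scheme.

    Hypothetical helper for illustration only. Returns 0 for bytes that
    cannot start a sequence (continuation bytes, 0xFE and 0xFF).
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx
    if lead < 0xC0:
        return 0  # 10xxxxxx is a continuation byte, not a lead byte
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    if lead < 0xFC:
        return 5  # 111110xx, no longer valid in modern UTF-8
    if lead < 0xFE:
        return 6  # 1111110x, no longer valid in modern UTF-8
    return 0


data = b"\xfb\x9b\xbb\xaf"
implied_length = legacy_utf8_length(data[0])

# The modern decoder refuses the lead byte outright.
try:
    data.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# The largest codepoint, U+10FFFF, fits in four bytes.
largest = "\U0010FFFF".encode("utf-8")
```

So the four bytes in the example are one short of what the lead byte announces, and the decoder gives up at position 0 before it ever looks at the rest.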

gsnedders · on Nov 22, 2013

It's a bug in the UTF-7 decoder that yields an invalid codepoint (outside the Unicode codespace) and isn't checked anywhere.
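
For illustration, the kind of range check that was missing could look something like the Python 3 sketch below. This is not the actual CPython fix; in_unicode_codespace is a hypothetical name, and the real decoder is written in C.

```python
MAX_CODEPOINT = 0x10FFFF  # last codepoint in the Unicode codespace


def in_unicode_codespace(codepoint: int) -> bool:
    """Hypothetical check: is this integer a codepoint Unicode can represent?"""
    return 0 <= codepoint <= MAX_CODEPOINT


# Python 3's chr() applies the same bound and raises ValueError just past it.
try:
    chr(MAX_CODEPOINT + 1)
    chr_rejected = False
except ValueError:
    chr_rejected = True
```

A decoder that ran a check like this on every value it assembled would raise an error instead of handing back a string containing something like U+DEADBEEF.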