I suspect a reason Python 3 might 'be slower' would be precisely how expensive e...

guitarbill · on Dec 27, 2016

UTF-8/UTF-16 are encodings, used to encode strings for transport/exchange. Internally, strings are stored in UCS-2 or UCS-4 representation.

AFAIK, this hasn't changed between Python 2 and Python 3; after all, Python 2 also supports Unicode. It has more to do with what a "string literal" in source code means. I don't want to go into it too much, but in Py3, `"a"` is a unicode string literal equivalent to `u"a"` in Py2. Py3 `b"a"` is roughly the same as Py2 `"a"`. But note the ambiguity in Py2: it could be a sequence of bytes or an ASCII-encoded string. This is, of course, an oversimplification, but it's not too far off if you think about how e.g. `from __future__ import unicode_literals` works in Py2.
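A minimal sketch of the literal distinction described above, assuming Python 3.3 or later (where the `u` prefix is accepted again). The variable names are only for illustration.

```python
# Python 3: an unprefixed literal is text (str); a b-prefixed literal is bytes.
text = "a"
legacy = u"a"   # u prefix is allowed in Python 3.3+ and means the same as no prefix
raw = b"a"

print(type(text).__name__)           # str
print(type(raw).__name__)            # bytes
print(text == legacy)                # True: both are the same str
print(text == raw)                   # False: str and bytes never compare equal
print(raw.decode("ascii") == text)   # True once the bytes are decoded explicitly
```

In Py2 the unprefixed `"a"` would be the byte-string type instead, which is where the ambiguity comes from.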

Is parsing UTF-8 literals in a file slower? Maybe negligibly. Does it affect runtime? I have no idea. Probably not: you can still do string comparisons byte for byte once you've converted to UCS-2/4. It might use more memory, though.

gsnedders · on Dec 27, 2016

> UTF-8/UTF-16 are encodings, used to encode strings for transport/exchange. Internally, strings are stored in UCS-2 or UCS-4 representation.

That makes no sense; UCS-2 and UCS-4 are encodings too.

guitarbill · on Dec 27, 2016

It isn't completely false. It's a simplification (as UCS-* is often used to denote internal encodings), because oh god, Unicode. This does a great job at explaining some intricacies: http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/

Also, in Python 3, strings can be UCS-1 (latin1?), UCS-2 (UTF-16?), UCS-4 (UTF-32?) or other: https://www.python.org/dev/peps/pep-0393/
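A rough way to see the PEP 393 storage widths from inside the interpreter. This is a sketch that assumes CPython 3.3+; `sys.getsizeof` reports implementation-specific numbers, so other interpreters may behave differently. The helper name `bytes_per_code_point` is made up for this example.

```python
import sys

def bytes_per_code_point(ch):
    """Estimate storage per code point for a string made only of `ch`.

    Comparing 101 copies against 1 copy cancels out the fixed object header,
    leaving 100 code points' worth of storage.
    """
    return (sys.getsizeof(ch * 101) - sys.getsizeof(ch)) // 100

# On CPython 3.3+ these are expected to come out as 1, 2 and 4.
print(bytes_per_code_point("a"))           # code point below U+0100
print(bytes_per_code_point("\u20ac"))      # EURO SIGN, inside the BMP
print(bytes_per_code_point("\U0001F600"))  # emoji, outside the BMP
```

The width is chosen per string from its largest code point, so a single astral character makes every character in that string take 4 bytes.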

It's complicated. Really complicated. Sorry.

gsnedders · on Dec 27, 2016

I'd lean against referring to them with any Unicode-based terminology, as all of that is ultimately designed to represent all scalar values (e.g., even UTF-16 code units compose in such a way as to give some way to represent scalar values outside of the BMP, though they obviously cannot do so directly). The other notable thing about Python 3.3+'s representation is its ability to represent surrogates, which UTFs cannot.
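To illustrate the surrogate point, here is a small sketch assuming Python 3.3+: a `str` can hold a lone surrogate code point, but the strict UTF-8 codec refuses to encode it. The `surrogatepass` error handler used at the end is a Python-specific escape hatch, and its output is not valid UTF-8.

```python
lone = "\ud800"    # a lone high surrogate: a code point, but not a scalar value
print(len(lone))   # 1

try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)   # False: strict UTF-8 cannot represent surrogates

# Python-specific handler that forces the surrogate through anyway.
forced = lone.encode("utf-8", "surrogatepass")
print(forced)      # b'\xed\xa0\x80'
```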

Realistically, simply referring to it as "flexibly sized codepoint sequences" is probably about as good as we can get, IMO, because that's what they fundamentally are. (And there's very little terminology for a sequence of codepoints!)

yxhuvud · on Dec 27, 2016

I doubt that, as Ruby 1.9 to 2.x has done a lot of encoding changes to now default to utf8 while at the same time being a lot faster.