Hacker News new | past | comments | ask | show | jobs | submit login

My comment is about this sentence:

>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode

>standard library APIs that formerly took bytes now incorrectly take Unicode strings

What do you mean by "incorrectly"?




POSIX APIs take bytes, generally. Python wraps these APIs to take unicode and doesn't allow you to pass bytes, even if you need to. Filenames, for example, are just bytes, and if you force them to always be valid unicode you will make it so that you can't interact with files that have names that aren't valid unicode. That's just one example.



An extremely frustrating part of the Python 3 migration is how many times Python module maintainers have had to hear "oh, now it's safe to migrate." This page currently leads off with a comment saying it's been fine any time since 3.4. You say 3.6. When I was maintaining a popular Python module, I heard the same at 3.1, and 3.2. (I didn't maintain it long after that.)


There are very few places where the bytes/string difference matters for posix paths. Python is far from the only popular tool to assume paths must be valid unicode.


> There are very few places where the bytes/string difference matters for posix paths.

It's nothing to do with "places", points in your program, or entry points into the stdlib. It's entire about what path names you need to process, and for large classes of software you have zero control over that. If you have a path that doesn't encode properly with your LC_CTYPE, you're in for a bad time with Python 3. (Of course you won't if you control all your own path names, but then you also don't have a problem assuming and enforcing ASCII.)

People were still migrating home systems to Unicode-compatible encodings long after Py3 came out. I still find files in archives with paths in weird (and undeclared/undeclarable) encodings. Lots of people had such files; non-native English speakers were the most likely to have them.

> Python is far from the only popular tool to assume paths must be valid unicode.

It and Java are the only ones I use regularly. Java doesn't have a good reputation for playing well with the outside world, vs. Python which had been sold for years as "better shell scripts."


> There are very few places where the bytes/string difference matters for posix paths.

There’s only every single input from the system at large, no big.


I don't quite agree. There's lots of systems where it's always unicode, and the a lot of systems where it's always ASCII, and then some systems where stuff is weird (and should be unicode :x)


There was a different API to get this behavior since 3.4: https://www.python.org/dev/peps/pep-0428/#id39


Which means it's been true (and broken) for many many years until maintainers finally succumbed to external pressure and unbroke the API.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: