Mercurial’s journey to and reflections on Python 3

ploxiln on Jan 13, 2020 | [–]

I've been involved in multiple non-trivial libraries and frameworks that supported both python2 and python3 for many years with the same codebase ... and it really wasn't anything like this. The python3 "adaptation" effort for mercurial was just bungled by multiple terrible decisions.

First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

But you don't need all b"" everywhere. That was the second huge mistake. Don't just convert every natural string in the whole codebase to b"". The natural string type is the right type in many places, both for python2 (bytes-like) and python3 (unicode-like). The helpers for converting kwargs keys to/from bytes is a sign that you are way off track. This guy got really hung up on the fact that the python2 natural string type is bytes-like, and tryied to force explicit bytes everywhere (dict keys, http headers, etc) and was really tilting at windmills for most of these past 5 years.

Yes, you pretty much had to wait for python-3.4 to be released and for python-2.6 to be mostly retired in favor of python-2.7. Then, starting in early 2014, it was pretty straightforward to make a clean codebase compatible with python-2.7 and python-3.4+, and I saw it done for Tornado, paramiko, and a few other smaller projects.

pdonis on Jan 13, 2020 | | [–]

> The natural string type is the right type in many places

For many programs, yes. Not for a revision control system that needs to be sure it's working with the exact binary data that's stored in the repository. Repository data is bytes, not Unicode.

I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.

jharsman on Jan 13, 2020 | | | [–]

I was an early adopter of Mercurial and the teams insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support.

For example, when I converted our existing Subversion repository to Mercurial I had to rename a couple of files that had non ASCII characters in their names because Mercurial couldn't handle it. At least on Windows file names would either be broken in Explorer or in the command line.

In fact I just checked and it is STILL broken in Mercurial 4.8.2 which I happened to have installed on my work laptop with Windows. Any file with non ASCII characters in the name is shown as garbled in the command line interface on Windows.

I remember some mailing list post way back when where mpm said that it was very important that hg was 8-bit clean since a Makefile might contain some random string of bytes that indicated a file and for that Makefile to work the file in question had to have the exact same string of bytes for a name. Of course, if file names are just strings of bytes instead of text, you can't display them, or send them over the internet to a machine with another file name encoding or do hardly anything useful with them. So basic functionality still seems to be broken to support unix systems with non-ascii filenames that aren't in UTF-8.

pdonis on Jan 13, 2020 | | | [–]

> the teams insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support

File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.

hsivonen on Jan 14, 2020 | | | [–]

The Rust standard library has a solution for this that actually works: On Unix-like systems file paths are sequences of bytes and most of the time the bytes are UTF-8. On Windows, they are WTF-8, so the API users sees a sequence of bytes and most of the time they match UTF-8.

This means that there's more overhead on Windows, but it's much better to normalize what the application programmer sees across POSIX and NT while still roundtripping all paths for both than to make the code unit size difference the application programmer's problem like the C++ file system API does.

pdonis on Jan 14, 2020 | | | [–]

> On Windows, they are WTF-8

Seems like an apt acronym for Windows... :-)

On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.

simias on Jan 13, 2020 | | | | [–]

Not to mention case-sensitivity issues. Can you have two files, one named "FILE.txt" and the other "file.txt" in the same directory for instance?

SSLy on Jan 13, 2020 | | | [–]

On windows? Of course you can.

edgyquant on Jan 14, 2020 | | | [–]

I'm certain you can on Linux as well. Only Macs old HFS would not allow it.

cataflam on Jan 14, 2020 | | | | [–]

Isn't this a fairly recent change?

amaranth on Jan 14, 2020 | | | [–]

NTFS has always been case sensitive, Windows API just lets you treat it as case insensitive. If you pass `FILE_FLAG_POSIX_SEMANTICS` to `CreateFile` you can make files that differ only in case.

mathw on Jan 14, 2020 | | | [–]

Good luck using those in some tools which use the API differently though. Windows filenames are endless fun. What's the maximum length of the absolute path of a file? Why, that depends on which API you're using to access it!

rurban on Jan 15, 2020 | | | [–]

Even worse on Unix where it depends on the mount type. Haven't seen much proper long filename support in Unix apps or libs, it's much better in Windows land. Garbage in garbage out is also a security nightmare as names are not identifiable anymore. You can easily spoof such names.

gpderetta on Jan 13, 2020 | | | | [–]

Hum, any program that doesn't treat filenames as bytestreams on unix is broken. Doubly so if its primary purpose is preserving and archiving files.

Are you sure the issue wasn't something else?

lmm on Jan 14, 2020 | | | [–]

The point is that filenames aren't bytestreams on windows, and if you treat them as such then your program won't work.

WorldMaker on Jan 14, 2020 | | | [–]

By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

(The remarks in the post here that Mercurial on Python 3 on Windows is not yet stable and showing a lot of issues is possibly even an indicator/canary here. To my understanding, Python 2 Windows used to paper over some of these lowest common denominator encoding compatibility issues with a lot more handholding than they do with the Python 3 Unicode assumption.)

lmm on Jan 14, 2020 | | | [–]

> By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

Be that as it may, Mercurial has existing repositories that may use non-unicode filenames, and just crashing whenever you try to operate on them is probably not an acceptable way forward.

WorldMaker on Jan 14, 2020 | | | [–]

Sure, but that's also not the only resulting option; instead of erroring you could also do something nice like help those users migrate to cleaner Unicode encodings of their filenames by asking them to correct mistakes or provide information about the original encoding. It takes more code to do that than just throwing an error, of course, but who knows how many users that might help that don't even realize why their repositories don't work correctly on, say, Windows.

Dylan16807 on Jan 14, 2020 | | | | [–]

Windows filenames basically are bytestreams. But the bytes come in pairs.

lmm on Jan 15, 2020 | | | [–]

Not really. Certain byte sequences are invalid.

Dylan16807 on Jan 15, 2020 | | | [–]

Certain byte sequences are invalid in unix filenames too. So that can't be the factor that decides if they are bytestreams or not.

xorcist on Jan 13, 2020 | | | | [–]

If hg borked on non-ascii characters, it sounds like the problem was rather that it didn't treat that data as a bag-of-bytes. Not the other way around?

ploxiln on Jan 13, 2020 | | | [–]

He was trying to use Windows. For Windows, you pretty much have to go through unicode to utf-16, can't be arbitrary bytes, can't be utf8.

(I think that relatively recently it is possible to use utf8 with some new windows interfaces ... but this is probably not widely compatible with older windows releases ...)

Dylan16807 on Jan 14, 2020 | | | [–]

Windows uses arbitrary shorts that are sort of supposed to be utf-16. Just like Unix uses arbitrary bytes that are sort of supposed to be utf-8.

You have to convert between them, but neither uses proper Unicode to represent filenames.

cbsmith on Jan 13, 2020 | | | | [–]

Yeah, but utf-16 is still bytes. It's just bytes with a different encoding.

But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.

mynegation on Jan 14, 2020 | | | [–]

Everything is bytes but the meaning assigned to bytes, matters. Let’s say I create a file named «Файл» on Unix in UTF8 and put it into git repo. For Unix it is a sequence of bytes that is representation of Russian letters in UTF8. So far so good. Now I clone this repo to Windows, what should happen? The file can not be restored with the name as encoded into bytes on Unix, that will be garbage (that even has a special name “Mojibake”) in the best case or fail outright in the worst. What should happen is decoding of those bytes from UTF8 (to get original Unicode code points) Into Unicode code points, then encoding using Windows native encoding (UTF-16).

mikepurvis on Jan 14, 2020 | | | [–]

True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.

cbsmith on Jan 14, 2020 | | | | [–]

Everything isn't bytes. Strings without an encoding don't have a specific byte representation.

takeda on Jan 14, 2020 | | | [–]

It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.

cbsmith on Jan 15, 2020 | | | [–]

We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.

lmm on Jan 14, 2020 | | | | [–]

UTF-16 is not "just bytes". There are sequences of bytes that are not valid UTF-16, so if you want to roundtrip bytes through UTF-16 you have to do something smarter than just pretending the byte sequence is UTF-16.

cbsmith on Jan 14, 2020 | | | [–]

Sorry, I wasn't trying to imply that any permutation of bytes would work. If you encode it improperly, it's not going to work.

masklinn on Jan 13, 2020 | | | | [–]

> For many programs, yes.

For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

Much of the stdlib works with native strings and will either blow up or misbehave if fed anything else[0], which means much of your codebase will necessarily be native strings, with a subset being explicitly bytes or unicode.

> Repository data is bytes, not Unicode.

It's also mostly absent from the source code, and where it is present (e.g. placeholders or separators) it's easy to flag as explicitly bytes.

[0] though some e.g. the encoding layers or io module want either bytes or unicode depending what you're doing specifically, and not always the most sensible, like baseXY being bytes -> bytes conversions where 95% of the use case is to smuggle binary data through text… oh well

pdonis on Jan 13, 2020 | | | [–]

> For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

This is a problem with the Python 3 standard library; in many places it requires Unicode when it shouldn't.

takeda on Jan 14, 2020 | | | [–]

This is a really bad way of thinking. The distinction in Python 3 is between text (str) and bytes.

str is not Unicode in fact if you don't use fancy characters internally it stores text as a byte array.

You should think of text the same as of image or sound, what you see in the screen or hear in the speaker is the actual thing, but if you need to save it on disk you encode it as for example png or wav.

owl57 on Jan 14, 2020 | | | [–]

You can just read "requires text when it shouldn't". But I don't recommend this terminology: in most modern computer programs, including Python 3 implementations, "text" and "Unicode" mean the same thing, but outside of this context Unicode is just more precise: sometimes "text" means ASCII and sometimes it means things non-represantable in current version of Unicode.

pdonis on Jan 14, 2020 | | | | [–]

> The distinction in Python 3 is between text (str) and bytes.

Feel free to s/Unicode/str/ in what I posted if you prefer that terminology. The problem is still the same.

An example of the problem: Python's standard streams (stdin|out|err) in Python 2 are streams of bytes, but in Python 3 they're streams of Unicode (or str if you prefer that terminology) characters. The problem is twofold: first, if my standard streams are hooked to a console, Python can't always properly detect the encoding of the bytes coming from the console, so it can give me the wrong Unicode characters; second, if my standard streams are hooked to pipes, there is no encoding it can pick that is right, since the bytes aren't even coming from a console (where at least there is some plausible argument for saying the user meant to type Unicode characters, not bytes). What Python 3 should have done was keep the standard streams as bytes, since that's the only common denominator you can rely on, and then let the application decide how to decode them if it decides it needs to, just as in Python 2.

takeda on Jan 14, 2020 | | | [–]

I believe the behavior is correct though. Python uses encoding specified through LANG/LC_* which is the encoding that supposed to be used, and all properly behaved applications use it.

If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version. Most people will use it for text, so the defaults make sense. Personally I would like if there was no automatic conversion when using files/network/pipes etc. but I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.

pdonis on Jan 15, 2020 | | | [–]

> Python uses encoding specified through LANG/LC_

Yes, that's the best you can do, but it's still not always correct. I agree that it should be, but "should be" and "is" aren't always the same.

> If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version.

Yes, but there are still standard library functions that will use the regular streams, and that might conflict with what your application is doing. There is no way to tell Python as a whole "use binary streams everywhere because they are pipes for this application".

> Personally I would like if there was no automatic conversion when using files/network/pipes etc.

That would work if (a) Python could always detect that condition (it can't) and (b) the entire standard library adjusted itself accordingly.

> I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.

Python 2 worked fine with the standard streams being binary, and applications wrapping them to decode to Unicode when necessary. Python 2.7 even back ported the TextIOWrapper and similar classes to make the wrapping as simple as possible. A similar approach could have been taken in Python 3 (binary streams and a simple wrapper class), but it wasn't.

masklinn on Jan 14, 2020 | | | | [–]

Complaining that the world is not as it should be does not solve the issue.

ploxiln on Jan 13, 2020 | | | | [–]

Repository data bytes does not show up as string literals in your code, or keyword argument names, or http header names. The vast majority of code involved in this struggle is misc business logic, not repository tracked file contents itself.

reubenmorais on Jan 13, 2020 | | | [–]

Python 3's approach means bytes/str poisons the whole expression. So if you want to do something like:

"%s/%s" % (repository_data_1, repository_data_2)

And have it work on Python 2 and 3, you're screwed.

luhn on Jan 13, 2020 | | | [–]

And Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things. Python 2 would let you do that, and it would often cause subtle bugs with non-ASCII data. Python 3 requires you to encode/decode, so you're working consistently and explicitly with binary or text.

I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.

reubenmorais on Jan 14, 2020 | | | [–]

> I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.

We're discussing the linked article, so I'm talking in the context of the linked article. I know it works now, but Python 3 initially removed %-formatting for bytes. I guess I should have used past in my comment, "you were" screwed instead of "you are". From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.

pdonis on Jan 14, 2020 | | | | [–]

> Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things.

Python 3's behavior as far as forcing you to explicitly recognize data type conversions is more correct, yes.

Python 3's behavior in assuming that nobody would ever need to do "text-like" operations like string formatting on byte sequences was not. At least this particular wart was fixed. But there are still a lot of places where Python makes you use the str "textual" data type when it's not the right one.

Python 3's behavior in making individual elements of a byte string integers instead of length-one byte strings is, frankly, braindead.

acdha on Jan 14, 2020 | | | | [–]

That example works fine in both Python 2 and 3 if you’re not mixing types incorrectly. If you are, it will appears to work on Python 2 before failing the first time you encounter non-ASCII data and tends to greatly confuse people with errors which would have been caught immediately on Python 3. I’ve seen teams waste hours trying to track down errors like that.

zo1 on Jan 14, 2020 | | | [–]

Exactly this. The amount of times I saw juniors fixing thses sort of obscure subtle bugs with str_var.decode("utf-8").encode("latin-1") and this after attempting every which combination of the above two de/encode operations is mind boggling.

reubenmorais on Jan 14, 2020 | | | | [–]

It works after Python 3.5. From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.

takeda on Jan 13, 2020 | | | | [–]

The rule of thumb (not just for Python, but anything that deals with encoding) is to use binary encoding at the bounds of your program (reading/writing files, sending/receiving data over network etc) it applies to everything including tools like this. If you follow it your life will be simpler.

You just need to be aware that in some cases the work is already done for you by the language, for example in python if you open a file (without "b" option, the python will do the translation on the fly and you don't need to worry about it)

masklinn on Jan 13, 2020 | | | [–]

> You just need to be aware that in some cases the work is already done for you by the language, for example in python if you open a file (without "b" option, the python will do the translation on the fly and you don't need to worry about it)

Sadly they fucked up that part rather thoroughly, because the default encoding is `locale.getpreferredencoding()`, which ensures it's going to be wrong at the least possible convenient time and on the devices least accessible for debugging.

Do not ever use text-mode `open` without specifying an encoding.

mintplant on Jan 14, 2020 | | | [–]

Node.js tries to be helpful in defaulting file writes to UTF-8, but defaults file reads to returning a raw byte buffer [0]. So you have to either remember to treat the two operations differently, or, like in Python, manually specify the encoding for both.

[0] I seem to recall that it used to default to the locale's preferred encoding, but I could have my wires crossed with other languages' standard libraries there.

takeda on Jan 14, 2020 | | | | [–]

The locales are provided by LANG and other locate variables, so Python will use whatever is set in environment, you can also specify the encoding in one of open() parameters.

masklinn on Jan 14, 2020 | | | [–]

> The locales are provided by LANG and other locate variables

Which is absolutely not what you want when, say, opening your own data files. Even when opening the user’s files it’s likely not what you want.

> you can also specify …

And what I’m saying is this is not a “can also” it’s a “must”. Not doing so will bite you in the ass, because “whatever random garbage is on the machine” is really not what you want a default to be.

takeda on Jan 14, 2020 | | | [–]

Oh I see your point. Looks like they changed the behavior in 3.7 (they added -X UTF-8 option), but being able to set it from the application would be great.

slavik81 on Jan 13, 2020 | | | | [–]

> in python if you open a file (without "b" option, the python will do the translation on the fly and you don't need to worry about it)

Of course, if you don't know what encoding the file was opened with, you don't know what characters can be written to the file.

I was bitten by this with Python 3.5 on Windows. I naively assumed the default file encoding would be UTF-8 or UTF-16, but it was actually CP-1252, so my program would crash upon trying to write a non-ASCII character.

mark-r on Jan 13, 2020 | | | [–]

Every Python program should be tested with Emoji characters, they're a real torture test.

slavik81 on Jan 14, 2020 | | | [–]

Note that you need to test on every platform, as the default file encoding may vary. I missed that bug in part because it worked correctly on Linux.

mark-r on Jan 14, 2020 | | | [–]

Good point. I do almost all of my Python on Windows where it's much easier to get an error.

WorldMaker on Jan 14, 2020 | | | | [–]

Every program in general should be tested with Emoji characters at this point.

mark-r on Jan 14, 2020 | | | [–]

Not a bad idea, but I think Python is more likely to have hidden bugs that this will uncover. A language that accepts bytes as input and emits the same on output will probably work fine on UTF-8 for example.

WorldMaker on Jan 14, 2020 | | | [–]

That's the Python 2 mentality and a large part of this discussion was that it didn't work in hindsight, that you can't just be "encoding oblivious", but it usually doesn't show up as an obvious problem until you least expect it. Our input and output devices are aren't always homozygous on byte encoding (and quite possibly very rarely are; we have decades and decades of kludges around this), and testing every program with Emoji has become one of my favorite pastimes for finding failure cases.

takeda on Jan 14, 2020 | | | | [–]

It defaults to the system encoding. I don't use Python on Windows, but Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8. Perhaps Python needs to be updated to reflect that?

You can also specify encoding when calling open.

Dylan16807 on Jan 14, 2020 | | | [–]

> Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8.

They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

The actual code pages, to this day, are legacy things that are mostly 8 bits. My system is set to code pages 437 and 1252, for example.

They put together a code page for UTF-8 but it's behind a 'beta' warning.

ygra on Jan 15, 2020 | | | [–]

> They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

NT actually bolted on 8-bit versions of the native Unicode functions. FooBarA is a wrapper around FooBarW.

> They put together a code page for UTF-8 but it's behind a 'beta' warning.

Codepage 65001 has been a thing for quite a while. It's just that it's variable-width per character and few applications are ready to handle that when they assume a 1:1 or 2:1 relationship between bytes and characters. It does work sort of for applications that don't do too weird stuff to text, though, and can be a useful workaround in such cases to get UTF-8 support into legacy applications.

But in general, Windows is UTF-16LE and the code pages are indeed legacy cruft that no application should touch or even use. Sadly much software ported from Unix-likes notices »Hey, there's a default encoding in Windows too, so let's just use that«.

slavik81 on Jan 14, 2020 | | | | [–]

The default file encoding for Windows was changed to UTF-8 in Python 3.6. That particular problem on that particular platform is now a thing of the past.

It was just an example of why implicit conversions in the standard library functions don't save you from having to think about encodings. You get much more robust and user-friendly programs when you explicitly consider your encodings and the error-handling strategies to go with them.

cbsmith on Jan 13, 2020 | | | | [–]

To be fair... the problem was more in Python 2 where this stuff was often conflated. Python 3 really just brought the problem in to stark relief.

TBH I do think the problem is easier to address in a statically typed world.

speedplane on Jan 15, 2020 | | | | [–]

> I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.

The entire 2 to 3 transition is an excellent illustration of Python developers failing properly recognize the challenges in transition. What other popular language intentionally broke backwards comparability? It's hard to think of any.

Python set the entire community back 10 years or more by making this drastic mistake.

simias on Jan 13, 2020 | | | [–]

It might be my own pro-typed-language bias showing but this migration from byte strings to unicode strings is really where dynamically typed languages really don't shine.

If we imagine an alternative reality where Rust started only with byte-strings and added unicode as an afterthought like Python did, you'd definitely face a massive amount of churn but at least the compiler would yell at you every time you pass a byte string where unicode is expected and vice-versa. Once you'll have fixed all of the errors in the vast majority of cases there's a good chance that your program would work again. It would be very annoying but at least you know clearly where the problems occur.

In Python on the other hand this type of code refactoring is very painful in my experience. You may end up with the same function being called sometimes with unicode and sometimes with bytes. And then you have to look at the call stack to figure out where it comes from. And then you realize that you end up with, say, a list of records which sometimes contain unicode and sometimes byte arrays depending on whether the code that updated them used the old or the new version etc...

And if it turns out that you can't easily reproduce the problem and you just get a bug report sent from somewhere in production then Good Luck; Have Fun.

monoideism on Jan 14, 2020 | | | [–]

> added unicode as an afterthought like Python did

I agree with you on the benefits of static typing, but let's clear: Python didn't add unicode as an "afterthought". The initial release of Python predates the initial release of the Unicode standard, by almost a year.

Furthermore, even if this were not the case, it took a while before Unicode got any significant adoption among programming languages, well after the release of Python 1.0. I think Java in 1996 was the first language to adopt Unicode.

WorldMaker on Jan 14, 2020 | | | [–]

Another useful red letter date for language/tool adoption is the standardization of UTF-8 in 1993. Before UTF-8 there were a lot of tools, especially in the POSIX world, that didn't feel comfortable without an 8-bit safe encoding format.

Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then (before a large influx of users), but a corresponding complaint about UTF-8 is that because it was 8-bit safe, a lot of tools also felt they could kick the can on dealing with it more directly (as a default), and Python 2 seems to be among them. Hindsight has told us a lot about the problems to expect (and exactly why Python 3 did what it felt it had to do), but they probably weren't as clear in 2000. (In further hindsight, imagine if Astral Plane Emoji had been standard and been common around 2000 instead of 2010 how much further we might be in consistent Unicode implementation today. I suppose that makes 2010 another red letter date for Unicode adoption.)

im3w1l on Jan 14, 2020 | | | [–]

And it was much later than 1993 that unicode conclusively defeated latin-1. Something like 2010?

monoideism on Jan 14, 2020 | | | | [–]

> Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then

That's true, but I would argue that given the difficulty and backlash we've seen moving from Python 2 to Python 3, such a move would have risked destroying Python's rapid forward momentum and condemned it to the ash heap of programming language history.

monoideism on Jan 14, 2020 | | | [–]

To add on to this, I'm not agreeing with the backlash from Python 2 to 3. And I wouldn't want it in the ash heap of history - I definitely think there's a definite place for nice, quick, easy dynamic langs like Python, particularly for exploratory programming.

I'm just saying the move to Python 3 turned out to be a huge deal to a lot of people (it surprised me), and for that reason, trying such a big jump at Python 2 would have been risky and could have derailed Python's forward progress at a critical point.

Would the downvoters like to share their reasons for disagreement?

WorldMaker on Jan 14, 2020 | | | [–]

I think the question goes back to the size and scale of users at the 1 to 2 jump versus the 2 to 3 jump. Python didn't really start to hit most of its "forward progress" in terms of both user adoption and being so deeply integrated into systems. There was no Django for Python 1, for one example. As another example, I'm pretty sure Debian and its heavy reliance on Python for so much of its system scripting didn't happen until Python 2, either, but a quick search didn't turn up a reliable date.

It probably would have been a lot less risky with so many fewer daily users, so many fewer huge projects to migrate.

monoideism on Jan 14, 2020 | | | [–]

You may be right. I first used Python on a regular basis in 2002 (after release of Python 2), so I wasn't aware it had so little adoption prior to Python 2. But it definitely was picking up by 2002.

dkarl on Jan 13, 2020 | | | [–]

First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

When I read that, I was angry on behalf of the people doing the porting work who had their hands tied by it, and I was angry on behalf of the Mercurial developers who, I think, must have been underestimated. It's normal that platforms don't stand still and coding standards on a project evolve over time. Obviously it's not going to fly for open source contributors to be "voluntold" to do porting work, but to be aware of it and accommodate it and know enough about the new platform to mostly avoid creating new work for the porters seems like a small and reasonable ask, especially when you compare it to the effort required to make high-quality contributions in the first place.

I get that there are people who are bitter to this day about Python having a version 3, but surely by 2017 the vast, vast majority of developers who were going to rage quit the Python community over it were already gone.

mixmastamyk on Jan 13, 2020 | | | [–]

Yes, I was really surprised that they avoided upgrading to Python 2.7-level best practices and future statements for as long as they did and tried to hide it from most developers thru custom compatibility layers. Huh? That's step 0, getting except, stdlib imports, and print statements up to date. Folks can deal with that, that's the easy part.

Keeping blame details (and line-lengths, ha!) was given as the excuse and that is a nice feature and all. However they could have copied the repo over before porting to keep that information and saved time. Wouldn't be surprised if it was eventually lost anyway.

indygreg2 on Jan 14, 2020 | | | [–]

The late start was mostly due to having to retain Python 2.4/2.5 compatibility until May 2015 and it was literally impossible to use some future statements or some Python 3 syntax until 2.6 was required. I have updated the post to reflect this.

mixmastamyk on Jan 14, 2020 | | | [–]

IC, that’s unfortunate. Believe that is the time to cut a legacy branch/release rather than block progress for a decade.

CJefferson on Jan 13, 2020 | | | [–]

Interesting you mention http headers. I had a program converted Python 2 -> Python 2 which was crashing occasionally, and it turned out it was because I was being sent a http request which wasn't valid unicode, so decoding failed.

I had to switch back to treating headers as bytes for as long as possible.

It is a stupid client which doesn't send valid ascii for http headers of course.

takeda on Jan 13, 2020 | | | [–]

I believe the headers are encoded using ISO-8859-1 not Unicode. That encoding has 1:1 mapping with bytes so wouldn't break this way. Treating them as UTF-8 was the bug.

code_biologist on Jan 13, 2020 | | | [–]

This is exactly the sort of encoding issues that the python 2 to 3 transition has flushed out. People get frustrated with python 3, yet the actual failure was their mishandling of encoding issues -- papered over by python 2.

xorcist on Jan 13, 2020 | | | [–]

But that's not what frustrates people with the transition. It's that they suddenly get encoding issues where there should have been no encoding to begin with!

cbsmith on Jan 13, 2020 | | | [–]

No observed encoding issues.

CJefferson on Jan 13, 2020 | | | [–]

When I treated headers as bytes, there wasn't an "encoding".

What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

cbsmith on Jan 14, 2020 | | | [–]

> When I treated headers as bytes, there wasn't an "encoding".

If you are representing strings as bytes, you are intrinsically using an encoding.

> What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

Yes, it makes a mockery of the notion that "human readable data is easy". In many cases, you don't want to work with the actual strings in the data anyway, so bytes is the right thing to do.

But yes, this strategy largely avoids encoding issues... until it doesn't.

Dylan16807 on Jan 14, 2020 | | | [–]

> If you are representing strings as bytes, you are intrinsically using an encoding.

It's just binary data that might resemble a string. No encoding necessary.

fiedzia on Jan 14, 2020 | | | [–]

This is false more often than not. Many programs taking user input will treat it as string, assuming specific encoding or compatibility with screen output/some api, at least in some code paths. For example if you print an error message when you can't open some file, you are very likely to assume its encoded in a way terminal can handle, so its no longer "just binary data".

CJefferson on Jan 14, 2020 | | | [–]

Yes, I have to worry about how to make a "best effort" to show it to users, but in all internal code paths it must stay as "just binary data", else I lose information. This is exactly how chrome and Firefox handle headers internally.

cbsmith on Jan 15, 2020 | | | | [–]

It might resemble a particular encoding of a string... and the way you got that string to that particular sequence of bytes is by... encoding it.

Dylan16807 on Jan 15, 2020 | | | [–]

> and the way you got that string to that particular sequence of bytes

No I didn't. Those bytes came from an external source. My primary job is to preserve the exact sequence, whether I can make sense of it or not.

cbsmith on Jan 16, 2020 | | | [–]

In that context, you aren't using strings. You are using bytes. HTML without interpreting it as strings isn't really HTML, nor is it a string. It's just a blob that is passing through.

takeda on Jan 14, 2020 | | | | [–]

> When I treated headers as bytes, there wasn't an "encoding".

oh, actually there was (either us-ascii or more likely iso-8859-1) the bytes are just values 0-255 what these values mean is the encoding. You're confused because the encoding was implicit, rather than explicit.

It would perhaps be clearer to see it if you for example had to chose if you use ASCII or legacy EBCDIC encoding.

CJefferson on Jan 13, 2020 | | | | [–]

I'll admit, I'm not positive what the encoding should be. However there is a bunch of people who do clearly send UTF-8, and I can also promise you there are headers out there which just have binary nonsense in them. See for example https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...

If you want to handle all headers, you have to be prepared to just get binary data.

takeda on Jan 14, 2020 | | | [–]

Yes, and using ISO-8859-1 is the way to handle them without issues. You will never get error when decoding it that way. If you are using UTF-8 there are character combinations that are invalid.

dnautics on Jan 13, 2020 | | | | [–]

> It is a stupid client which doesn't send valid ascii for http headers of course.

...or a smart malicious actor.

utxaa on Jan 20, 2020 | | | [–]

> But you don't need all b"" everywhere.

as a mercurial user i never understood this decision. for instance look at this recent commit: https://www.mercurial-scm.org/repo/hg/rev/b4c82b704180

would anyone disagree with the fact that an error message should be a string?

a source transformer to add b'' all over the place? really?

and i still don't understand why the hg transition had to be more complex than: https://docs.djangoproject.com/en/1.11/topics/python3/

... and of course now this: https://www.mercurial-scm.org/wiki/OxidationPlan

i wonder what does matt mackall think of all these developments?

skywhopper on Jan 14, 2020 | | | [–]

Why are you so certain about your assertions here about when they did and did not need to use explicit byte strings?

fireattack on Jan 13, 2020 | | [–]

I understand author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3. Coming from C#, I never get used to Python 2's approach. It's a pain in the ass working with non-Latin characters in Py2 starting from simply output in console, especially on Windows.

>assuming the world is Unicode is flat out wrong

True, but Py2's approach makes lots of developers assume the world is Latin-1. I see way too many examples of things broken on a Chinese locale environment, including Python's official IDLE ([1]).

[1] https://bugs.python.org/issue15809 (Summary of this bug: in 2.x IDLE, an explicit unicode literal used to still be encoded using system's ANSI encoding instead of, well, unicode.)

int_19h on Jan 13, 2020 | | [–]

The most amusing quote in the entire article is this (emphasis mine):

> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

Requiring developers to think which one it should be is, of course, the whole point of the changes in Python 3 - and it's what produces better apps that are more aware of i18n issues in general and Unicode in particular.

And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else. Of course, the devil is in the details, which is reflected by the word "practically" in that sentence - this kinda implies that there are places where Unicode strings are used. At which point you do want the developers to think about bytes vs Unicode.

So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly. Which, of course, is the right change for the vast majority of code out there, that operates on higher level of abstraction, where "all strings are Unicode by default" is a perfectly reasonable assumption to force.

sfink on Jan 13, 2020 | | | [–]

> And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.

> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.

Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.

acdha on Jan 14, 2020 | | | [–]

Paths ARE Unicode strings on 99% of the computers with humans sitting in front of them. NTFS, HFS+, and APFS all use Unicode but more importantly, the experience of not using valid Unicode where that’s possible is horrible: undeletable files, crashes, etc. I’ve seen that many times over the years (it was popular with malware authors) but never a time where this was a desirable behavior.

The default should always be Unicode with only people writing low-level backup and security tools dealing with bytes.

jgraham on Jan 14, 2020 | | | [–]

This just isn't true. In Windows paths are UCS2 i.e. arbitary sequences of unicode code units, inclusing unpaired surrogates. This means that there are paths that will work on Windows but cannot be encoded as e.g. valid UTF-8. As a result Rust has a bespoke encoding just for representing Windows paths in a way that's compatible with UTF-8 ("WTF-8"). It also means that you can't make a guaranteed lossless conversion from a filesystem path to a Rust string; you have to handle the possibility of errors.

On Mac paths are some weird NFKD-ish thing, so equality comparisons are complicated.

As a rule, if you think that filesystem paths as easy then you're probably ignoring all the edge cases. In application where you don't deal with arbitary user files that's fine. In a programming language that's a huge design error.

int_19h on Jan 14, 2020 | | | [–]

This all - including complicated equality comparisons - is why paths should have their own dedicated type, and not just be raw strings. Thankfully, Python has had pathlib for a while now.

WorldMaker on Jan 14, 2020 | | | | [–]

Paths are Unicode strings on Windows. Yes, POSIX adds a lot more spice to the mix, but if the intent is a cross-platform tool, then Unicode is a reasonable lowest-common-denominator assumption for filenames in 2020.

Conan_Kudo on Jan 14, 2020 | | | [–]

Paths are Unicode strings everywhere but Unix/Linux. And I would even argue that this is a broken aspect of POSIX today. We should make Unicode the baseline for paths in POSIX-compliant systems, but there's probably too much hand-wringing for that to ever happen.

ygra on Jan 14, 2020 | | | | [–]

Paths are sequences of 16-bit values on Windows, not necessarily valid UTF-16. It's basically the same as in POSIX, just one byte wider per character.

markbnj on Jan 13, 2020 | | | | [–]

> if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The author explains later in the article that many system level python 3 apis that are important to a vcs require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.

int_19h on Jan 13, 2020 | | | [–]

Right. But that's a very different issue, and it's not at all about string literals as such.

Furthermore, the way they solve it - by using their own wrapper helpers that allow bytes - means that the end result should be b'' throughout, no?

phkahler on Jan 13, 2020 | | | | [–]

>> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated

The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.

Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.

With 3, any implicit had to get b added, while any string with u had to be made implicit (drop the u). You couldn't tell by looking at code if it was converted or not. At least that's how I read it.

int_19h on Jan 13, 2020 | | | [–]

The lack of u'' in early versions of Python 3 is a valid complaint, but it's a separate one.

It's also not that big of a deal in practice, because you could always write a helper function like u('foo') that would call unicode() on Python 2, and just pass the value through on Python 3. This only breaks when you need a Unicode literal with actual Unicode characters inside, which is a rare case - and should be especially rare in something like Mercurial.

epage on Jan 14, 2020 | | | | [–]

Another reason the complaint doesn't make sense is that the author then praises Rust which is more similar to Python 3 than 2.

afiori on Jan 14, 2020 | | | [–]

From other comments the annoyances for the author were about the standard library using Unicode for system level API; Rust had a OSString type that works with the GIGO model of posix

the_mitsuhiko on Jan 13, 2020 | | | [–]

> but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I'm also a "non-latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superious solutions to solving this problem and most of them would have been closer to what we had in Python 2 than 3.

Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

A Unicode model that was a bad idea in 2005 was picked and we now have it in 2020 where it's a lot worse because thanks to emojis we now are well outside the basic plane.

harikb on Jan 13, 2020 | | | [–]

> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.

That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't read the simple Wikipedia page for utf-8.

Python 2 was already half way there, they just to had to tweak a few places bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.

PS: I also blame all the "encoding detection" libraries which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on others by now.

the_mitsuhiko on Jan 13, 2020 | | | [–]

> Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.

Python 3 predates Rust and Go and I can tell you from personal interactions with people how much opposition there was against UTF-8 as either default or internal encoding. A lot of the arguments against it were already not valid then and they definitely are not today.

Python 3 launched despite a lot of vocal opposition against it. I think many do not even remember how badly broken the URL, HTTP and Email modules were when they were first ported to Python 3. There was a complete misunderstanding of how platform abstractions should look like.

All of this was known back then.

dralley on Jan 14, 2020 | | | [–]

Is there any hope of "fixing" it now without going through another massive migration struggle (which will simply not happen)?

morelisp on Jan 13, 2020 | | | | [–]

No one is complaining that Python 2 didn't DTRT when it comes to Unicode.

But when Python 3 made its decision, it was known to be the wrong thing. People who had done Unicode in other languages told them it was the wrong thing. People who had taken the effort to do Unicode right in Python 2 told them it was the wrong. The only people telling them they were doing the right thing, were Python 2 programmers who thought they were going to get Unicode support for free without thinking about it (or worse, who had done horribly wrong things in Python 2 - the mess PyGTK wrote itself into, for example).

Python 3 has no excuses for what are now often unusable APIs when you truly do need to process binary data. And all we gained is that we don't need to type "u" before some string constants anymore. It wasn't worth it, and it's still not good.

lmm on Jan 14, 2020 | | | | [–]

> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.

I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, Javascript), and seems to imply that Han unification is forced at the language level.

burntsushi on Jan 14, 2020 | | | [–]

str -> [u8] is free from a performance perspective. It is internally equivalent to a type cast.

[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.

FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.

In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.

lmm on Jan 14, 2020 | | | [–]

With an opaque string type there's nothing stopping a particular Python implementation from using UTF-8 as an internal representation - it would likely perform worse than CPython at iterating over the code units of a string, but that's likely an acceptable cost. Particularly for a language like Python, defining the precise performance characteristics is rarely the priority, especially if it comes at the cost of confusing the semantics.

> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.

I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).

burntsushi on Jan 14, 2020 | | | [–]

I feel like you picked at the least interesting aspects of my comment. It continues to be frustrating to talk to you. :-(

And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.

Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.

Sohcahtoa82 on Jan 13, 2020 | | | | [–]

> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.

masklinn on Jan 13, 2020 | | | [–]

> Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

No, the types are separate and not implicitly converted P2-style, however "unicode strings" are guaranteed to be proper UTF8 so encoding to UTF-8 is completely free, and decoding from UTF8 just requires validating.

Python's maintainers rejected this approach because "it doesn't provide non-amortised O(1) access to codepoints", and while Python 3 broke a lot of things they sadly refused to break this one completely useless thing, only to have to come up with PEP 393 a few years later.

Sohcahtoa82 on Jan 13, 2020 | | | [–]

Ah, that makes sense. Thank you for the clarification.

Lev1a on Jan 14, 2020 | | | [–]

To add to your earlier dialog partner, here are the doc pages for the relevant Rust functions/methods, embedded with runnable examples:

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/primitive.str.html#method.as_b...

https://doc.rust-lang.org/std/str/fn.from_utf8.html

Also, as explained in those docs, if and when you are absolutely sure that the Vec or slice of bytes is valid UTF-8, you could use the following "unsafe" methods to not incur the overhead of validation (warnings in the docs):

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...

takeda on Jan 13, 2020 | | | | [–]

IMO Python is doing exactly the same thing that Go does (I know too little about Rust to comment) the only difference is that Python respects the LANG variable while Go is just fixed on using UTF-8.

the_mitsuhiko on Jan 13, 2020 | | | [–]

> Python is doing exactly the same thing that Go does

It doesn't. Go's internal string encoding is UTF-8 and it can even be malformed. Go in fact does pretty much what Python 2's byte strings were just that string operations such as converting to uppercase or iteration over runes understands UTF-8 and Unicode.

takeda on Jan 14, 2020 | | | [–]

Here's your problem: you should not care how python is representing it internally.

> Go in fact does pretty much what Python 2's byte strings were just that string operations such as converting to uppercase or iteration over runes understands UTF-8 and Unicode.

Why do you care about internal representation though, what are you gaining if Go's string and Python's str can express all characters. In Go you still need to convert string into byte[] when doing I/O.

earthboundkid on Jan 13, 2020 | | | [–]

Python 2's approach was bad, no argument, but the transition plan for 2-to-3 just didn't work. They thought everyone would run 2to3 in a big bang, and then we'd all switch over to 3 in a few years. Instead it dragged out over a decade because in reality we needed to write code that was compatible with both 2 and 3 (the "6" approach) until enough things were on 3 to drop 2 support.

Hindsight is 20/20 naturally, but in retrospect, they should have just made `bytes` into the name for old `str` and used `from __future__ import` to create a gradual system for moving from 2 to 3 instead of a big bang "we'll break everything once and then never again".

cbsmith on Jan 13, 2020 | | | [–]

I'm not sure they really thought 2to3 would be used for a big bang. I seem to recall the general initial messaging was that Python 3 was a new language and you would need to do a language port to get to it.

kibwen on Jan 13, 2020 | | | [–]

> I understand author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I think this is misreading the author's criticism. The fact that string literals are now Unicode is not the fundamental problem; the fact that standard library APIs that formerly took bytes now incorrectly take Unicode strings is the problem.

IMO it's great that the world is moving towards opaque blobs of Unicode for strings, but that requires understanding when something shouldn't simply be a string in the first place (for reasons of legacy or otherwise).

fireattack on Jan 13, 2020 | | | [–]

My comment is about this sentence:

>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode

>standard library APIs that formerly took bytes now incorrectly take Unicode strings

What do you mean by "incorrectly"?

ngoldbaum on Jan 13, 2020 | | | [–]

POSIX APIs take bytes, generally. Python wraps these APIs to take unicode and doesn't allow you to pass bytes, even if you need to. Filenames, for example, are just bytes, and if you force them to always be valid unicode you will make it so that you can't interact with files that have names that aren't valid unicode. That's just one example.

chaosite on Jan 13, 2020 | | | [–]

This is false since Python 3.6.

https://docs.python.org/3/glossary.html#term-path-like-objec...

morelisp on Jan 13, 2020 | | | [–]

An extremely frustrating part of the Python 3 migration is how many times Python module maintainers have had to hear "oh, now it's safe to migrate." This page currently leads off with a comment saying it's been fine any time since 3.4. You say 3.6. When I was maintaining a popular Python module, I heard the same at 3.1, and 3.2. (I didn't maintain it long after that.)

joshuamorton on Jan 13, 2020 | | | [–]

There are very few places where the bytes/string difference matters for posix paths. Python is far from the only popular tool to assume paths must be valid unicode.

morelisp on Jan 13, 2020 | | | [–]

> There are very few places where the bytes/string difference matters for posix paths.

It's nothing to do with "places", points in your program, or entry points into the stdlib. It's entire about what path names you need to process, and for large classes of software you have zero control over that. If you have a path that doesn't encode properly with your LC_CTYPE, you're in for a bad time with Python 3. (Of course you won't if you control all your own path names, but then you also don't have a problem assuming and enforcing ASCII.)

People were still migrating home systems to Unicode-compatible encodings long after Py3 came out. I still find files in archives with paths in weird (and undeclared/undeclarable) encodings. Lots of people had such files; non-native English speakers were the most likely to have them.

> Python is far from the only popular tool to assume paths must be valid unicode.

It and Java are the only ones I use regularly. Java doesn't have a good reputation for playing well with the outside world, vs. Python which had been sold for years as "better shell scripts."

masklinn on Jan 14, 2020 | | | | [–]

> There are very few places where the bytes/string difference matters for posix paths.

There’s only every single input from the system at large, no big.

joshuamorton on Jan 14, 2020 | | | [–]

I don't quite agree. There's lots of systems where it's always unicode, and the a lot of systems where it's always ASCII, and then some systems where stuff is weird (and should be unicode :x)

chaosite on Jan 13, 2020 | | | | [–]

There was a different API to get this behavior since 3.4: https://www.python.org/dev/peps/pep-0428/#id39

masklinn on Jan 13, 2020 | | | | [–]

Which means it's been true (and broken) for many many years until maintainers finally succumbed to external pressure and unbroke the API.

branko_d on Jan 13, 2020 | | | [–]

Just beware that C# is not exactly "Unicode" either.

C# char is a UTF-16 code unit, not a Unicode code point.

Most code points "fit" into just one UTF-16 code unit, but not all.

For example: 𝐀 ("Mathematical Bold Capital A", code point U+1D400) is encoded in UTF-16 as a surrogate pair of code units: U+D835 and U+DC00. So reversing "x𝐀y" should produce "y𝐀x" ("y\ud835\udc00x") - note how U+D835 and U+DC00 were not reversed in the result.

ygra on Jan 14, 2020 | | | [–]

C# isn't exactly quiet about this property, and yes, it can be annoying from an API perspective, but in C# this was likely a pragmatic choice to remain compatible (and familiar) with C++, COM, etc. where most developers would be coming from.

API members that operate on code points universally take a string and an index.

That being said, treating strings as arrays of characters is fraught with peril in most cases anyway. You can't trivially reverse strings in any encoding, as you need to reverse the sequence of grapheme clusters (to account for diacritics, etc.). You can't trivially truncate strings either, for pretty much the same reason. You can't trivially grab a single character from the middle of a string, again, for the same reason. So basically, indexing, reversing, truncating, copying a subsequence, etc. are all not trivially possible regardless of the encoding. UTF-16 is not the main problem here, as even in UTF-32 it'd be broken.

ak217 on Jan 13, 2020 | | | [–]

I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding, combined with silent coercion between unicode/bytes whenever needed. Those two features in combination made Python brittle and dangerous when handling non-ascii characters, not the "strings are bytes" default.

Making strings Unicode by default is wonderful compared to the alternatives (and OP's assertion that this amounts to "assuming the world is Unicode" is disingenuous: there's nothing stopping programs from handling bytes correctly - Python 3 merely resolved the ambiguity).

kibwen on Jan 13, 2020 | | | [–]

> I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding

The decision of a default encoding surely dates back to Python 1.0 or earlier, which predates not just UTF-8 but even Unicode itself. Python is an old language!

And if the assertion is that Python 2.0 should have made the tumultuous Unicode jump when it released in 2000, I could get behind that (especially in retrospect!), but enthusiasm for both Unicode and UTF-8 was not nearly as high then as it is today, so I don't begrudge them for not jumping at the opportunity.

steveklabnik on Jan 13, 2020 | | | [–]

Interestingly enough, Ruby 1.8 -> 1.9, the big version jump there, there was this kind of transition. The remainder of this post is all IIRC, it's been a while...

Ruby 1.8 had "everything as bytes" and there was no concept of encodings.

Ruby 1.9 introduced explicit encodings on every string. By default, strings would be encoded as the same encoding as your source file. The default was ASCII. You could control this explicitly with a magic comment, and so many folks added the "UTF-8" comment, to get strings encoded as utf-8 by default.

Ruby 2.0, which was not as large a transition as Ruby 1.8 -> 1.9, even though it sounds like a larger one, said that encodings of files were UTF-8 by default, and therefore, strings generally became UTF-8 by default as well. Most folks just removed their magic comments.

mark-r on Jan 13, 2020 | | | [–]

It's surprising how many people believe that you can use a magic comment to make Python use UTF-8 encoding as the default. All the magic comment affects is the encoding of the source file, not the run-time.

morelisp on Jan 13, 2020 | | | | [–]

Enforcing UTF-8 as the default encoding, barring a magic comment otherwise, would hardly have been the biggest compatibility break in the 2.x line. It could have been done in any minor release, IMO.

im3w1l on Jan 14, 2020 | | | [–]

To be fair, IDLE is pretty garbage in most ways.

jmilloy on Jan 13, 2020 | | [–]

> in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case.

I feel like this is the essence of the article: specific constraints/choices of Mecurial made their port to Python 3 difficult. Working with early Python 3 certainly did not help. But there seems to have been some stubbornness here mixed with a lot of retroactive justification.

> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code.

This is almost ridiculous. You are going to write a JIT partial 2to3 instead of just increasing your length limits and/or using an autoformatter? (Of course, it turns out they eventually did do that... after a bit more stubborness regarding the autoformatter.)

> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial.

Couldn't this have been a very occasional copy and paste, instead of a downstream dependency? [six](https://six.readthedocs.io/) "consists of only one Python file, so it is painless to copy into a project."

> Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility.

Yes, can't disagree. Early adopters who attempted to write 2- and 3- compatible code suffered the most.

intrepidhero on Jan 13, 2020 | | [–]

> Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.

Having just done transitions on a number of much smaller projects I had the same thought. Changes to string handling tripped me up and the changes to relative imports took some thinking. But the biggest frustration was the nagging question: Why am I doing this?

edit: missing word

libria on Jan 13, 2020 | | [–]

> Why am [I] doing this?

Lack of security updates past 2019 forced our hand. Did you find a way around that?

falcolas on Jan 13, 2020 | | | [–]

> Lack of security updates past 2019 forced our hand. Did you find a way around that?

Amazon is maintaining Python 2 for at least 4 years, as part of their Amazon Linux long term support release. Google app engine will support Python 2 for an unknown amount of time; they haven't announced an end date. PyPy is Python 2, with (to the best of my limited knowledge) no plans to deprecate support. There are also other LTS releases out there which include Python 2 support.

IOW, the forcing function of the PSF no longer supporting Python is not as big a factor as was hoped.

rst on Jan 13, 2020 | | | [–]

Security updates in Python itself aren't the only issue; a Python 2 project may also depend on packages with security issues of their own, which require continued upstream maintenance.

For example, the python-saml package (for managing SAML-based single sign-on) has separate Python 2 and Python 3 versions, and implements a security-sensitive protocol which means it has (in the fairly recent past) gotten security updates for issues serious enough to rate an assigned CVE. If you're using it, having the current maintainers walk away from the Python 2 version is a serious risk...

jnwatson on Jan 13, 2020 | | | | [–]

I'm a maintainer for a somewhat popular Python package that had support all the way back to 2.4, but I've had to systematically remove support for those versions. The problem is all the CI infrastructure and testing packages are removing support.

Is Amazon planning to support pytest for at least 4 years? It will have its last 2.7-supporting release very soon.

viraptor on Jan 13, 2020 | | | | [–]

This would only help the server side of mercurial though. There's no client-side supported distribution really. Pypy is not that popular yet.

falcolas on Jan 13, 2020 | | | [–]

I don't know about others, but when I used Mercurial, it was via installing it through brew. And if brew installed pypy as a dependency so Mercurial could still use Python 2, I probably wouldn't have noticed.

viraptor on Jan 14, 2020 | | | [–]

You'd notice, because it wouldn't work: https://www.mercurial-scm.org/wiki/PyPyPlan

kick on Jan 13, 2020 | | | | [–]

PyPy is keeping Python 2 support indefinitely, I believe.

hsivonen on Jan 13, 2020 | | | | [–]

There's a project for keeping Python 2 alive: https://github.com/naftaliharris/tauthon

It's particularly uncool that Guido brought up the prospect of lawyers (https://github.com/naftaliharris/tauthon/issues/47#issuecomm...) to force it not to be called Python and opposed to letting people who care about keeping Python 2 alive evolve it as "Python 2". (I know he has the legal right to insist on the name change. Still uncool.)

simias on Jan 13, 2020 | | | [–]

I understand your point of view but on the other hand we can make the parallel with Perl 5 and 6. Having incompatible forks of the language share the same name is a pain for everyone involved. I can completely understand the "mainline" python maintainers not wanting to have to deal with that.

Besides if the Tauthon people are serious about maintaining their fork long term it needs to become more than a mere fork and a real language ecosystem of its own, in the long run having a different name will probably help with that, assuming that they ever get there.

EDIT: Also reading the rest of the thread I realize that the post that you linked out of context is slightly misleading (but I blame github's aggressive folding more than you here). Guido's answer comes after the following exchange:

stefantalpalaru: "Disregard Guido's objection. The "Python" trademark doesn't extend to "py2" or "py28". Read this for details: https://www.python.org/psf/trademarks/"

Guido: "Isn't the whole point that we're trying to solve this without lawyers?"

stefantalpalaru: "The whole point is that you've been sabotaging Python 2 for years and when someone does what needed to be done from the start, you come up with silly objections."

Guido: "OK, bring in the lawyers."

In that light, and given the other poster's ridiculously inflammatory take, Guido's answer seems rather level headed and appropriate IMO. He stands his ground, so to speak.

lizmat on Jan 14, 2020 | | | [–]

Re: I understand your point of view but on the other hand we can make the parallel with Perl 5 and 6.

Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media). So Perl and Raku are now considered to be different languages, albeit from the same inspiration.

Now, if Python 2 people would decide to rename Python 2 to something else, I guess it would be a mirrored parallel :-)

simias on Jan 14, 2020 | | | [–]

That's precisely what's happening with the third-party Python2 forks. The renaming of Perl 6 occurred last October, specifically because of the problems caused by the confusion between the incompatible Perl 5 and 6 that caused a lot of trouble to the Perl people on either side for many years.

It's not a mirrored parallel, it's the Python folks learning from Perl's mistakes and making sure that this parallel won't come to be.

mjw1007 on Jan 13, 2020 | | | | [–]

I think it makes a great deal of sense for the Python core team to say "we're finished with Python 2 and want nothing more to do with it".

But I'm very disappointed that the Python Software Foundation isn't explictly supporting people who want to keep Python 2 compiling and running on modern systems. I think that would be well within their remit to "promote, protect, and advance the Python programming language".

This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.

Even before Python 3.0 appeared, I came across scientists saying "I prefer to stick with Fortran because new Python versions break old code too frequently".

int_19h on Jan 13, 2020 | | | [–]

PSF does not object to people who keep Python 2 compiling and running, such as ActiveState (https://www.activestate.com/company/press/press-releases/act...).

This case is different, because it's a project that uses the Python name, but actively adds features to the language. This is the classic example of brand confusion - someone might try to use it, find something to complain about, and PSF's reputation suffers as the result. They also get support overhead from the users of the fork (even if all they do is tell them to go away, that is still triage time that could be spent on other issues).

mjw1007 on Jan 13, 2020 | | | [–]

"Does not object" is better than nothing, but I think it would be better if the PSF actively helped to coordinate this work (again, without bothering the core team). As far as I'm concerned, this is exactly the sort of thing that the PSF exists for.

forgotpwd16 on Jan 14, 2020 | | | | [–]

>This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.

You can always download an old version and the respective libraries and use them to reproduce any results you want. That doesn't mean that old version should be supported anymore.

Klonoar on Jan 13, 2020 | | | | [–]

Longtime Python dev who was also annoyed by the 2 - 3 transition here.

I don't see Guido as in the wrong for that. It'd be a smack in the face when you spend years trying to finally push people to switch (for better or for worse) and then a project like this takes the SEO and gets to run freely with it.

hsivonen on Jan 13, 2020 | | | [–]

Why should Guido or PSF get to tell people to stop using Python 2 even if they no longer want to work on it? It's ungraceful not to hand off maintainership on good terms to someone who wants to do the work.

Imagine if Stroustrup had done D and insisted that it be called C++ and wanted everyone to stop using the language everyone knew as C++ on Jan 1st 2020.

dragonwriter on Jan 13, 2020 | | | [–]

> Why should Guido or PSF get to tell people to stop using Python 2 even if they no longer want to work on it?

They aren't stopping people from using Python 2, the language or Python 2, the software.

They are stopping people from using the name “Python” as the name of forked implementations of Python 2 not maintained by the PSF. No implementation not maintained by PSF is allowed to be called unqualified Python; the name is an important indicator of provenance. There are and have been plenty of third-party Python (2 and otherwise) implementations, the implementations just need their own names.

hsivonen on Jan 13, 2020 | | | [–]

> They aren't stopping people from using Python 2, the language or Python 2, the software.

The effort to claim the binary name python for Python 3 is actively hostile to Python 3 and a thing that runs unmodified Python 2 unmodified on the same operating system installation. (It's unclear to me how much this is a PSF push, but at least the PEP isn't telling distros to refrain from this hostile-to-comaptibility action.)

> No implementation not maintained by PSF is allowed to be called unqualified Python

The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.

joshuamorton on Jan 13, 2020 | | | [–]

> The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.

For who? This costs the PSF manpower/overhead that they don't want to expend on a thing they don't want to maintain. It dilutes the language that the PSF are stewards of, and would further cause a schism in the python community. None of those things sounds good for python, its ecosystem, or the PSF. They sound good for, like, a few curmudgeonly companies and individuals that don't want to migrate.

I can't parse your first sentence, so I can't respond to it.

hsivonen on Jan 13, 2020 | | | [–]

> For who?

For users of the Python 2 language who have a lot of Python 2 code and for whom migration doesn't make cost/benefit sense on technical merits of Python 3.

There's Tauthon. There's Active State's long-term support for Python 2. There's presumably Red Hat's long-term for Python 2. There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.

It would be great if there was a common venue for collaboration for these by the parties who are interested in keeping Python 2 going. (I'm not suggesting that Python 3 core devs should do the work.) Like a foundation for Python software.

The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.

joshuamorton on Jan 13, 2020 | | | [–]

> The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.

Yes, but I don't believe I've seen any (real) suggestions to change PEP 394.

> There's Tauthon.

Which I claim is actively bad for python's ecosystem in the long term. It shouldn't be supported by any organization that wants what is best for Python.

> There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.

That works just fine without any help. pypi continues to support python2 tags and wheels, and I doubt that'll change anytime soon.

> There's Active State's long-term support for Python 2. There's presumably Red Hat's long-term for Python 2.

So the entire reasonable bit here is that the PSF should provide something to help various enterprise companies manage backporting security patches. Which, like, I'm not sure what infrastructure is actually needed for that. They already make security patches public. Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.

hsivonen on Jan 14, 2020 | | | [–]

> Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.

If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.

That such improvements are considered actively harmful comes from a point of view where there's a top-down imperative to shut down Python 2 in order to make Python 3 succeed. It's not harmful from the point of view of the code people have written in Python 2 being valuable.

The notion that there user community needs to work for Python (by porting to Python 3) and that Python 2 needs to be shut down as opposed to Python development valuing the existing code that had been developed is the core problem with Python 3.

joshuamorton on Jan 14, 2020 | | | [–]

> If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.

But it really doesn't. If the new features are that valuable, you can convert your code. It's not actually that hard (I have a few 100kloc ported forward now, with millions of lines of dependencies that says so).

takeda on Jan 13, 2020 | | | | [–]

That project is not Python 2 though, it added features that made it incompatible with both Python 2 and Python 3. Just look at their effort to add wheel support.

Any project that forks changes name:

nagios -> icigna

mysql -> mariadb

NetBSD -> OpenBSD

FreeBSD -> DragonflyBSD

Python -> PyPy, Jython, IronPython

It would be crazy for them to keep the same name and not be compatible. It would cause confusion and also lead to increase of support tickets in wrong bug trackers.

hsivonen on Jan 14, 2020 | | | [–]

In those cases, the original project lived on. Here Python 3 is the incompatible fork, but because the technical fork is done by the folks who control the name and who want to shut the old thing down, so the compatible evolution of Python 2 had to change the name.

gjulianm on Jan 13, 2020 | | | | [–]

Your analogy is not appropriate. The actual situation with Tauthon is as if someone was not happy with C++17, so they forked C++14, added new features and changed syntax and then insisting to call it C++. It's just confusion for the users and it's in the best interest of the PSF to protect the Python name.

hsivonen on Jan 13, 2020 | | | [–]

I'd agree if the Python core devs were still interested in evolving Python 2.x. But they aren't, so now no one else gets to do Python 2.8, either. It would be the best if the PSF provided a venue for Python 2.x development even if the folks who went on to do Python 3 weren't the people working on it.

Anyway, the core problem is a top-down effort to try to make a programming language of Python 2.x’s level of usage stop to the extent it’s stoppable under its license, because its creators wanted to do something else, as opposed to facilitating its user community to pool resources to continue its development. Does the PSF have a legal obligation to do such facilitation? No. Is the lack of such facilitation bad for parties who bought into Python when it was Python 2? Yes.

joshuamorton on Jan 14, 2020 | | | [–]

> core devs were still interested in evolving Python 2.x.

They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.

You're arguing that the psf should treat python2 and 3 as different languages. In their (any my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).

In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.

hsivonen on Jan 14, 2020 | | | [–]

> They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.

I meant compatible (in the sense that old programs keep running and you can add new stuff to old programs using the new features) evolutions.

> You're arguing that the psf should treat python2 and 3 as different languages.

For practical purposes, they are different languages and the PSF has been treating them as distinct things.

> In their (any my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).

It indeed is bad. I hope that every other programming language community and designer takes a close look at what happened and makes sure never to do a Python 3 analog of their language.

> In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.

That's the core problem from the perspective of Python 2 users. The organization that was the steward of the language that they invested in (in the form of writing code in the language) decided not only that a different programming language is more important for the org but that the old language needed to be shut down in order to benefit their new thing.

It's OK for people to get bored with a project and move onto something else, but with the level of usage that Python 2 had and has, it's very problematic for the language steward organization to turn around and seek to shut the language down instead of continuing to evolve it in a way that's respectful of the language users' investment in the language.

joshuamorton on Jan 14, 2020 | | | [–]

> a way that's respectful

You had like 10 years of warning and it's "disrespectful"? I don't think there's a chance of productivity if you're starting from that baseline level of entitlement. Sure, mandates are annoying. But I just can't fathom that.

hsivonen on Jan 14, 2020 | | | [–]

It's not about how many years of warning there was. It's about making users of the language to rewrite by mandate as opposed to the new features being incrementally adoptable into existing code bases. Sure, that means there are some language changes you never get to make.

Java, JavaScript, C, and C++, for example don't break investment in old code like Python 3 did. They form a reasonable baseline.

joshuamorton on Jan 14, 2020 | | | [–]

And we have kotlin, typescript, and rust due to those languages unwillingness to make breaking changes. The cpp committees unwillingness to remove old garbage from the language is iirc the most cited issue with the language by longtime users.

There are tradeoffs.

hsivonen on Jan 14, 2020 | | | [–]

You can add Kotlin to you app without rewriting all Java. You can add TypeScript to your app without rewriting all JavaScript. You can add Rust to your app with with rewriting all C++. Seems reasonable.

That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

joshuamorton on Jan 14, 2020 | | | [–]

> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

You're mistaken. I have python3 binaries and python2 binaries that share dependencies.

You're correct that fully automatic transpilation is impossible, but that doesn't mean that there can't be shared source. It does however mean that things like per-file flags or whatnot aren't possible. Python became a better language with text vs. bytes support, but that support couldn't be done in a backwards compatible way. Oh well.

> You can add Rust to your app with with rewriting all C++.

It's not as good as you seem to think. It's a nonstarter for a lot of people otherwise interested in adopting rust into existing codebases. Certainly not better than the py2/3 situation.

Kotlin interop also is troublesome, although granted better than rust/cpp or py2/3.

> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

That python didn't get replaced by a different language is an incredible testament to the foresight of the python language stewards.

gjulianm on Jan 13, 2020 | | | | [–]

> as opposed to facilitating its user community to pool resources to continue its development

How does reusing the name facilitate the development? Every time there is a fork of an open-source project the name changes, precisely to avoid confusions. Reusing the Python name in a fork that is not just a redistribution, but a new version with new features and syntax, is just confusing, unusual and does not help anyone.

umanwizard on Jan 13, 2020 | | | | [–]

The difference is that the major C++14 implementations are still supported, and probably will be for as long as the concept of C++ exists.

There's no `--std=python2` flag you can pass to the interpreter, unfortunately.

gjulianm on Jan 13, 2020 | | | [–]

There is no '--std=C++14-with-arbitrary-things-from-c++20' flag either, which is what this fork does. We can discuss whether breaking backwards compatibility was bad or necessary, but creating another fork of Python that backports some features of Python 3 is just adding confusion. If their primary purpose is supporting Python 2.7 applications, they can do just fine without calling it Python.

comex on Jan 13, 2020 | | | [–]

There doesn’t need to be such a flag because C++17 was fully backwards compatible with 14, with some tiny exceptions nobody cares about.

Indeed, C++ has rarely made any breaking changes. A decade or so ago, GCC did cause some major ecosystem breakage, by cracking down on C++ constructs which had never been valid according to the spec but which GCC had previously allowed. When that happened, there was a flag to (at least partially) revert to the old behavior: -fpermissive.

Conan_Kudo on Jan 14, 2020 | | | [–]

> There doesn’t need to be such a flag because C++17 was fully backwards compatible with 14, with some tiny exceptions nobody cares about.

This literally does not parse. How do you know "nobody" cares about those exceptions?

wott on Jan 14, 2020 | | | | [–]

'-std=C++14' already includes a few extensions, it is not pure C++14 but a superset. And then there is '-std=gnu++14' too.

CJefferson on Jan 14, 2020 | | | | [–]

No, this is like if someone was not happy with C++17 AND gcc removed its support for C++14. Instead, I can still happily compile C++14, C++11, C++03 and C++98 code with gcc.

hackbinary on Jan 13, 2020 | | | | [–]

I disagree. If someone else wants to continue to develop Python 2 outside the Python foundation and formalised development community, then that is their prerogative, but Python has the right to decide what is and is not Python.

Dilution of what is commonly accepted to be Python would not be a good thing, and would further add to confusion.

I know that platform upgrades are painful, but we need to move with the times or we'll all be mired in technical debt and old technology.

takeda on Jan 13, 2020 | | | [–]

Yes, them keeping Python 2 alive for 10 years where Python 3 was developed caused a lot of issues, it would be extremely short sighted to allow third (incompatible) python into the mix.

hsivonen on Jan 14, 2020 | | | [–]

> it would be extremely short sighted to allow third (incompatible) python into the mix

The whole point of Tauthon is that it is compatible with Python 2 (in the direction that old programs work).

camgunz on Jan 13, 2020 | | | | [–]

Letting "Python 2" zombie around is unacceptable. Python 3 is better in every way, and has been since 3.3 (which Armin deserves a lot of credit for).

Consider anyone who wants to build something with Python, whether it's a library, application, or service. What's better, having to build for Python 3 and 2, or just Python 3?

Thank God that Guido did this, despite knowing all the blowback he'd get. To me, that's super cool.

eesmith on Jan 13, 2020 | | | [–]

"better in every way" ... except for 1) startup time (according to the linked-to article), 2) support for existing Python 2 code, and 3) support for Python 2 C extensions.

For example, https://blog.khinsen.net/posts/2017/11/16/a-plea-for-stabili... describes the "Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread[1]) for which resources (funding plus competent staff) are very difficult to find."

[1] The thread at https://twitter.com/khinsen/status/930749714567434240 includes "Lots of C modules written for Python 1.4 are waiting for enthusiastic code archeologists ;-)".

I don't think Hinsen is alone in that situation. I can well believe there are some people who, for example, plan to retire in about 5 years and would rather keep with with a Python 2 zombie than spend time to port working code to Python 3.

camgunz on Jan 14, 2020 | | | [–]

Startup time is complex, but base startup only increased about 20ms, and that's being generous.

I'll admit Python 3 is still slower at a lot of things. But that feels like saying your new dog is even worse at math than your old one.

The C extension thing isn't Python's fault. It's the job of library and app authors to update. Do we complain that Vulkan has bad SunOS support? This is totally backwards.

Could Hinsen (and others) not just version their deps? It's not like people are erasing Python 2 off the internet. If his main worry is reproducibility, he should be doing that anyway.

---

I don't want to give the impression I like the whole Python 3 thing. I think it was a pretty big mistake and a huge missed opportunity. I'm very sympathetic to people who had to put in a lot of work for basically no good reason--Python 3 didn't really offer anything significantly better than 2 until... 3.5 (3.4 if you think the first pass at async was useful, I personally don't).

But I also find the ballyhooing about it really insufferable. Yeah it was a mistake; Armin Ronacher (as usual) was right. It was also over 11 years ago. Time to forget all about this and build cool stuff, please please please.

tick_tock_tick on Jan 14, 2020 | | | [–]

It takes me 37ms to load hackernews a 20ms start time is embarrassing what is it even doing?

eesmith on Jan 14, 2020 | | | [–]

It does a lot of module imports. Most of those probe the filesystem.

Try "python -vv -c 'pass'" - I'm only showing the first few dozen lines, and I've trimmed some of the paths for conciseness:

    % python -vv -c 'pass'
    import _frozen_importlib # frozen
    import _imp # builtin
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    # installing zipimport hook
    import 'zipimport' # <class '_frozen_importlib.BuiltinImporter'>
    # installed zipimport hook
    import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
    import '_io' # <class '_frozen_importlib.BuiltinImporter'>
    import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
    import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
    import _thread # previously loaded ('_thread')
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import _weakref # previously loaded ('_weakref')
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    # miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/__init__.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc'
    # trying miniconda3/lib/python3.7/codecs.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/codecs.abi3.so
    # trying miniconda3/lib/python3.7/codecs.so
    # trying miniconda3/lib/python3.7/codecs.py
    # miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc matches miniconda3/lib/python3.7/codecs.py
    # code object from 'miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc'
    import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
    import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44c90>
    # trying miniconda3/lib/python3.7/encodings/aliases.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/aliases.abi3.so
    # trying miniconda3/lib/python3.7/encodings/aliases.so
    # trying miniconda3/lib/python3.7/encodings/aliases.py
    # miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/aliases.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc'
    import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd67d10>
    import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd440d0>
    # trying miniconda3/lib/python3.7/encodings/utf_8.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.abi3.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.py
    # miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/utf_8.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc'
    import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44bd0>
    import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
    # trying miniconda3/lib/python3.7/encodings/latin_1.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.abi3.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.py
    # miniconda3/lib/python3.7/encodings/__pycache__/latin_1.cpython-37.pyc
    matches miniconda3/lib/python3.7/encodings/latin_1
       ... many, many more lines omitted ...

This can be sped up a lot using a zipimport of the Python standard library, https://docs.python.org/3/library/zipimport.html?highlight=z... , where all of the standard library is put in a zipfile. Then there's a file access to get the zip metadata.

One of the things that bugged me in Python2 was that every startup imported UserDict:

    # trying python2.7/UserDict.so
    # trying python2.7/UserDictmodule.so
    # trying python2.7/UserDict.py
    # python2.7/UserDict.pyc matches python2.7/UserDict.py
    import UserDict # precompiled from python2.7/UserDict.pyc

This is because os.environ was an instance of UserDict:

    % python2.7 -c 'import os; print(os.environ.__class__.__bases__)'
    (<class UserDict.IterableUserDict at 0x1029b14c8>,)

Under Python3 this is spelled collections.abc.MutableMapping:

    % python3 -c 'import os; print(os.environ.__class__.__bases__)'
    (<class 'collections.abc.MutableMapping'>,)

which triggers its own set of imports:

    # trying python3.6/collections/abc.cpython-36m-darwin.so
    # trying python3.6/collections/abc.abi3.so
    # trying python3.6/collections/abc.so
    # trying python3.6/collections/abc.py
    # python3.6/collections/__pycache__/abc.cpython-36.pyc matches python3.6/collections/abc.py
    # code object from 'python3.6/collections/__pycache__/abc.cpython-36.pyc'
    import 'collections.abc' # <_frozen_importlib_external.SourceFileLoader object at 0x103c2ecf8>

There's better performance using SDD than a HDD, which is in turn better than using a networked filesystem.

tick_tock_tick on Jan 14, 2020 | | | [–]

I'm guessing startup time just isn't an important goal for cython then? I've know they refused to implements some optimization that would significantly increase complexity but this seem like low hanging fruit?

eesmith on Jan 14, 2020 | | | [–]

Oh, there's been plenty of work to reduce the Python startup cost.

It's just hard to fix.

I'm not sure the os.environ example I gave is low-hanging fruit now. The collections.abc module might be imported anyway.

This is neat! Python 3.7 added the `PYTHONPROFILEIMPORTTIME=1` environment variable to help track down these sorts of import overheads:

  % env PYTHONPROFILEIMPORTTIME=1 python -c pass
  import time: self [us] | cumulative | imported package
  import time:       523 |        523 | zipimport
  import time:       722 |        722 | _frozen_importlib_external
  import time:       156 |        156 |     _codecs
  import time:      2254 |       2409 |   codecs
  import time:      1293 |       1293 |   encodings.aliases
  import time:      7192 |      10893 | encodings
  import time:      1108 |       1108 | encodings.utf_8
  import time:       182 |        182 | _signal
  import time:      1069 |       1069 | encodings.latin_1
  import time:       395 |        395 |     _abc
  import time:      1486 |       1881 |   abc
  import time:      1540 |       3420 | io
  import time:       100 |        100 |       _stat
  import time:       975 |       1075 |     stat
  import time:      1481 |       1481 |       genericpath
  import time:      1734 |       3214 |     posixpath
  import time:      2558 |       2558 |     _collections_abc
  import time:      2234 |       9079 |   os
  import time:      1407 |       1407 |   _sitebuiltins
  import time:      3498 |       3498 |   sitecustomize
  import time:        85 |         85 |   usercustomize
  import time:      4129 |      18196 | site

Investigating further, the "import os" which triggered the UserDict/collections.abc is a consequence of "import site". If I use "python -S" then those aren't imported.

eesmith on Jan 14, 2020 | | | | [–]

I can't help but interpret your response as "Python 3 is better in every way ... for the ways I think are important."

For some of my programs, Python startup time is the main overhead. I avoid NumPy and SciPy if at all possible because they have a huge startup overhead.

Some of this is inherent in those packages. NumPy internally imports everything so someone can do "import numpy as np; np.package.subpackage.module.function()" without doing the intermediate imports.

This means NumPy is optimized for programmers (especially novice programmers) using NumPy in long-lived processes where startup cost is a negligible overhead.

Which isn't all use-cases for numeric computing.

15 years ago I supported a CGI-based web app. It was very important to pull out all the stops (delay imports until needed, use zip packages) because it was easier to do that than to re-write everything for another architecture.

The dog does count pretty well after all.

> It's the job of library and app authors to update.

Why? Linus Torvalds doesn't agree with you, for one.

As Hinsen points out,

] Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it’s mostly a calamity for computational science.

Some scientific code has been able to run unchanged since the 1970s, through multiple new Fortran language releases.

Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities.

But why not recognize that for some situations Python 3 is not better?

Hinsen also comments on your proposal:

] The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don’t agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter.

> Could Hinsen (and others) not just version their deps?

He addresses that, I think. One of the other commenters gives a more complete reply at https://metarabbit.wordpress.com/2017/11/18/numpy-scipy-back... ending "Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.".

> Time to forget all about this and build cool stuff, please please please.

I'll quote Hinsen again "I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter."

Your implicit statement is that mmtk (Hinsen's code base) isn't "cool stuff". Why? Simply because it's old, or because you don't know about it or need it? What other cool old stuff will die because it's part of a community without the resources to update?

Instead, accept that that loss is part of the trade-offs, be empathetic to those who suffer, and bear those lessons in mind for future work you do.

camgunz on Jan 14, 2020 | | | [–]

Well, I'd like to start off by saying I think we agree overall. I do think Python 3's advantages didn't merit its disadvantages until many years after the initial release.

Second, I admit to engaging in hyperbole when I said "Python 3 is better in every way"; usually I'm on the other side of these, but I'm just so fed up with people complaining. But you're right, there are still ways Python 3 isn't "better". I'd love to have productive, technical discussions about them, but we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that.

But beyond that, you and I are mostly talking about different things. Python 3 isn't NumPy or SciPy. If you're building extensions on top of them, you need to look at their compatibility commitments. If you want them to make more commitments, you have to convince them. This isn't specific to software engineering; this is due diligence for anything you're gonna put years of work into.

Django's page [1] is a great example of this. Python has one too [2]. I don't have any idea about SciPy/NumPy; It looks like SciPy 1.2.0 was an LTS release supported until 1/1/2020, but what do I know.

But importantly, the end result of this "hey, do 100x the work otherwise our science won't be reproducible" stuff will be to force people out of producing free software for scientific computing. And the non-free stuff is expensive, good god. Surely this isn't what you want.

A better tactic here is to work with the developers in establishing more compatibility between releases. You probably aren't gonna get Fortran levels of compatibility--a language and platform that's seen very, very little change over the decades. But then again, the core selling point of scientific Python is that you get to use a modern platform with modern features. Asking for that along with a 50 year compatibility guarantee is a laughably tall order: you can't have it both ways without exponential amounts of work. So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources. And the best place to do that is probably their contact page [3], not Twitter, HN, or random blogs.

[1]: https://www.djangoproject.com/download/#supported-versions

[2]: https://devguide.python.org/#status-of-python-branches

[3]: https://www.scipy.org/scipylib/mailing-lists.html

eesmith on Jan 15, 2020 | | | [–]

You write "we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that."

Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?

> Python 3 isn't NumPy or SciPy. ... this is due diligence for anything you're gonna put years of work into.

Hinsen's essay discussed these issues related to "software layers and the lifecycle of digital scientific knowledge". He put Python in layer 1, and NumPy/Scipy in layer 2.

In his essay he also said "I would like to see the SciPy community define its point of view on these issues openly and clearly. ... It’s OK to say that the community’s priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don’t have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community’s business, and that those who have such needs should look elsewhere."

So I'm not convinced that we are talking about different things as you are making points I already referred to, albeit indirectly.

I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"

But Hinsen said "Layer 4 code is the focus of the reproducible research movement" and "the best practices recommended for reproducible research can be summarized as “freeze and publish layer 4 code” -- a solution you mentioned earlier.

It's just that reproducibility isn't the only goal for stability.

Another is to be able to go back to a 15 year old project and keep working on it, without taking the hit of rewriting it to a new, albeit similar, language.

I also have a small amount of umbrage about your comment:

> So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources.

I earlier wrote "Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities."

Did you overlook that because of your '"fed up"-ness', or was that not enough for you?

camgunz on Jan 15, 2020 | | | [–]

> Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?

I do put you in that category, because you seem to be focused much more on the negative, rather than being constructive and trying to find solutions to problems.

> I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"

I've read and directly disagreed with his essay. His points are:

- Python 2 going away orphans a lot of software, because there's a lack of resources/willingness to port to Python 3.

- Python 3 didn't provide enough value to the scientific community to justify all the breakage (this is true for almost every community, btw).

- SciPy breaks compatibility roughly every 2-3 years, which is a bad fit for the pace of scientific computing.

- Beyond that, breaking compatibility threatens reproducibility.

- The SciPy community doesn't seem to know or care about compatibility concerns.

- Projects written on top of SciPy libraries ("Layer 3" code) have to keep updating, and they don't always have resources/willingness to do that.

- It would be cool if SciPy laid out a support schedule.

- It isn't cool that SciPy says, "hey use us", and then breaks compat all the time.

- There are some languages/platforms that haven't changed in decades, this isn't an excuse.

Here's what I've said:

- Agree Python 3 didn't provide enough value.

- If you want to build something on SciPy that you expect to last for decades, you should look for a compat guarantee. If you don't, that's on you.

- If you want new features plus decades of compat, that's a ludicrous amount of work.

- If you want to find a way forward, start a dialogue with SciPy devs.

Hinson's examples of Fortran and Java are illuminating. Fortran's a platform that's seen very minimal evolution over its history. That's exactly the reason people want to use SciPy instead of Fortran. Java's a platform with... billions of engineering hours? It's ironic that a guy who doesn't want to spend the resources to update his own software is asserting that someone else can continually deliver a modern scientific computing platform with new features while never breaking compat, they just don't feel like it ("It's all a matter of policy, not technology"). That's wrong, it's a question of resources.

---

My diagnosis here is communication breakdown. Everyone here wants the same thing: use a modern software stack for scientific computing. So again I'll say get on the mailing lists, get on IRC, go to the conferences, and talk to the engineers. Be constructive.

_pd19 on Jan 13, 2020 | | | | [–]

Did you see the other replies on the thread?

Guido has absolutely every right here.

Shorel on Jan 13, 2020 | | | | [–]

I downgraded Deluge to the Python 2 version because the new one doesn't work in Windows and I use both operating systems.

makecheck on Jan 13, 2020 | | [–]

It’s funny, on the Mac one becomes used to constant changes, rewriting damn near everything just to stand still. Yet I designed my Mac app long ago to depend on the system “Python 2” (bound to C++), because it seemed that both the installation itself and the Python language and libraries were very stable. Looking back, this turned out to be sustainable for a remarkably long time, as “Python 2” really did evolve only additively and there was almost no reason to even touch 15-year-old code that was relying on Python 2. For the Mac platform especially, this reliability is unheard of.

More amazing to me is that in Catalina, the release famous for breaking just about everything else, “Python 2” is still there and works as it always has! Of course, Apple did announce that it will be ripped out in the next release. :)

pfranz on Jan 13, 2020 | | [–]

I think this weird thing happened with Python 2. I believe Python 2.6 (Oct-2008) was the last "feature release" and 2.7 (Jul-2010) was intended as a bridge. So since 2008, 2.x users have been shielded from most all of the normal churn of any widely used language that's in active development.

What I don't think people realize is that not only are you expected to move to 3.x, but you'll have to keep up or fall behind with new 3.x releases. During that same period (since 2008) 3.x has had 9 big releases. Of course that 2.x stability was done with the assumption you'd move to 3.x and isn't sustainable for PSF indefinitely.

cannam on Jan 13, 2020 | | | [–]

> Of course, Apple did announce that it will be ripped out in the next release

They did? Damn, I was using that...

adontz on Jan 13, 2020 | | [–]

I have never seen such rejection in Django community, despite real problems, like with WSGI design, handling I/O and thus working with bytes a lot.

Every huge task, like porting from Python 2 to Python 3 or any other huge task is either everybody's task or just a small group's one. And since latter seems more reasonable to not interfere with ongoing development, former is the only way I have seen such tasks to succeed.

Artificial rules to create comfort for one group at the expense of another group, like the following

>> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

sound pretty much wrong to me.

If there is a pain, it should become everybody's pain, or otherwise people will simply burn out and hate own work, like the author did. There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

Overall, described situation looks like management issue and not a technical one to me.

Edit: typos.

reubenmorais on Jan 13, 2020 | | [–]

> There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

The author addresses this. The difference is that when porting to Rust you'd likely get a faster and more correct program in the end. (Huge caveat of big rewrites, of course). Whereas with Python 3 they feel like they did all the porting work and got nothing valuable in return.

roca on Jan 13, 2020 | | | [–]

> There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

The Rust compiler statically checks those decisions, while in Python issues with string types will only be caught at run-time, so everywhere your test suite has missing coverage, porting is likely to introduce regressions. That is one way in which a Rust port would be easier.

hinkley on Jan 14, 2020 | | | [–]

I used to switch unit tests from jasmine 1.3 to mocha because jasmine is kind of a mess, and jasmine 1.3 tests look too much like they should still work in jasmine 2.0, except some of the corner cases on equivalence of objects are wrong. So some of your tests would go red with no code change, but others would be green and stay green even when the code no longer functions properly. Like cutting the wires to your smoke detector.

It would take quite a bit of change in a language for a port to be safer than an upgrade, but it's not completely impossible.

war1025 on Jan 13, 2020 | | [–]

We are on the brink of completing the transition to python3 at my work.

The end result of this is that I just spent a good chunk of last week reviewing a pull request with 70,000 lines of changes, which was one of the final in a series of ~10k line pull requests that came in through the fall.

All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes. Here is an api boundary where we need to encode / decode." etc.

It was a nightmare of effort that I'm glad to have behind us.

vmchale on Jan 13, 2020 | | [–]

> All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes.

Dynamic typing!

war1025 on Jan 13, 2020 | | | [–]

Not dynamic typing's fault.

The issue is they changed the types out from underneath you.

And then left it to each library to decide which type it was actually going to accept.

magicalhippo on Jan 13, 2020 | | | [–]

But with static typing, the compiler can let you know when you're doing something wrong with the new Unicode-based string.

war1025 on Jan 13, 2020 | | | [–]

Well, it can, except you then need to go through and update all of your internal APIs to be correct.

Really the string transition was just a poor choice in my opinion. Python2 already had unicode strings that were easy enough to specify (just prefix with a `u`).

It would have been better to just delineate that barrier better from an API standpoint.

I understand the appeal of having unicode for the default string literal type, but it was actively hostile to existing projects.

lmm on Jan 14, 2020 | | | [–]

> Well, it can, except you then need to go through and update all of your internal APIs to be correct.

You do, but it's easy: run a compile, fix the errors, repeat until no more errors.

> It would have been better to just delineate that barrier better from an API standpoint.

Isn't that exactly what the Python 3 transition was? i.e. stop accepting non-unicode "strings" (actually just arbitrary byte sequences) for APIs that semantically require a string, reserve them for APIs that actually want a byte sequence.

war1025 on Jan 14, 2020 | | | [–]

> You do, but it's easy: run a compile, fix the errors, repeat until no more errors.

The reason this doesn't work is that previously the double-quote literal was a "string" type. The string type was, yes, just a sequence of bytes, but in an ascii-centric world that also mean text.

Python2 added unicode string literals that accepted unicode code points. Most APIs were happy to sloppily mix the two and generally work quite adequately.

Python3 then made the hard distinction between byte-string and unicode-string. Not an unreasonable position to take on the face of it. The issue is many python2 APIs were written from the perspective of "accepts string literal types", where that could be either bytestring or unicode string.

Now suppose you have a large codebase in python that spans the entire stack from database interaction, to webserver, to desktop application. All built on double quoted string literals. Accepting unicode strings in the places that needed that (user-facing places mainly, utf-8 bytestrings anywhere being stored on disk or sent over network)

Then you go to switch to python3, and suddenly all of your string literals are interpreted as unicode instead of bytestring / ascii sequences. So now you need to go through every place in your codebase that accepts strings as an argument and determine, "is this a user-facing string, or a utf-8 bytestring", because they used to be basically the same thing, and now they aren't.

It's not "difficult" really, it's just a pain in the neck.

lmm on Jan 14, 2020 | | | [–]

None of that would be a problem in a typed language. The ultimate destination of any string literal is some standard library function, whether that's write to network socket, display to user, or something else. So you just ripple backwards from that through your own functions that are calling those standard library functions, until you get to the point where you're passing in the literal, and then you know what kind of literal it needs to be.

vmchale on Jan 14, 2020 | | | [–]

> None of that would be a problem in a typed language.

Python is dynamically typed and weakly typed, but still typed. That's precisely the problem! The difference is just that a statically typed language gives you all the information, and a dynamically language doesn't, but still fails. Just without providing you the necessary information up-front.

There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan...

lmm on Jan 15, 2020 | | | [–]

> Python is dynamically typed and weakly typed, but still typed.

People who claim that dynamic typing is a thing claim that Python is strongly typed. (This is of course nonsense; there's no such thing as dynamic typing, because types are by definition something that expressions in a language have, not something that runtime values have).

> There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan....

That is not a "nice explanation". It is writing to obscure rather than to clarify. And it certainly acknowledges that one cannot have differently typed values in a dynamic language.

war1025 on Jan 14, 2020 | | | | [–]

That's not really much different from the path we took. It's just instead of running the compiler, we ran the linter and test suite until things passed. It's just when you have a million lines of code that takes quite a while.

magicalhippo on Jan 13, 2020 | | | [–]

Delphi also went through a similar transition from "strings are in whatever the local code page says" with one byte chars to Unicode strings (Windows-style).

However the makers of Delphi spent many years preparing for this, so when the time came for us to switch we only had to spend half a day or so to migrate our half a million lines of code.

d0mine on Jan 13, 2020 | | | [–]

Something is wrong if there is no third type: the "natural" string (bytes on Python 2, unicode on Python 3).

war1025 on Jan 13, 2020 | | | [–]

I assume many of the strings were left untouched. But you still have to audit all of it to know which needs to be used where.

eesmith on Jan 13, 2020 | | | | [–]

I believe that's included in the "etc."

quietbritishjim on Jan 13, 2020 | | | | [–]

Surely any "natural" string would be better represented as unicode in Python 2? What is an example that wouldn't be?

masklinn on Jan 14, 2020 | | | [–]

> Surely any "natural" string would be better represented as unicode in Python 2?

No because much of the stdlib works in terms of native strings and will choke (or worse silently fuck up) on the other. Yes also in python 2, the stdlib was absolutely not “unicode clean”.

So a transitional / polyglot codebase has and needs not 2 but 3 string types: bytes, unicode, and native. And neither “unicode literals” nor “bytes literals” were good things to apply across the board.

cjbillington on Jan 14, 2020 | | | | [–]

I've found myself defining `native_str = bytes if PY2 else str` (with `if PY2: str = unicode` at the top of the file, as in all my py2/py3 polyglot code) because there are some things that need bytes on Python 2 and unicode on Python 3 - e.g. the `__file__` attribute of a dynamically created module or other low-level things.

war1025 on Jan 13, 2020 | | | | [–]

I believe what they meant was that for many strings it really shouldn't matter if they were bytes or unicode. They would perform their function correctly either way. That's completely true, but you do still have to go through and find the cases where that doesn't work.

jnwatson on Jan 13, 2020 | | | | [–]

There is. It looks like this:

  u"Hello World"

michaelhoffman on Jan 13, 2020 | | [–]

The biggest problem with the Python 2 to Python 3 transition was not that breaking changes were made. It’s that breaking changes were made in a way such that you could not easily have code that worked both on Python 2 and Python 3.

It took years before the advent of six, Python 3 u’’ literals, and modernize. The author discusses this at length.

choppaface on Jan 13, 2020 | | [–]

Another big problem is there was no significant incentive to adopt Python 3. That’s why it took so long for large projects to transition. In comparison, during the last decade, C++ went from dodgy C++11 toy projects to all new code being written in modern C++. The modern feature set is that good.

Jasper_ on Jan 13, 2020 | | | [–]

C++ doesn't mandate you switch from std::cout to fmt in order to use lambdas. If they did that, I think we'd see a lot less modern C++.

choppaface on Jan 13, 2020 | | | [–]

That’s a find-and-replace fix that can be addressed reliably. A relatively smaller problem versus moving off of boost.

The compiler support for C++11 (and especially inconsistencies in Debian packages, compiled flags, etc) was a very painful issue for several years. But auto is that useful ...

Jasper_ on Jan 13, 2020 | | | [–]

Right, moving to std::cout to fmt could be as simple as a find-and-replace fix. That the C++ committee could have inflicted this minimal pain on their users, but chose not to do it, shows some amount of concern for backwards compatibility and old codebases. By comparison, Python 3 changed the entire text model and dropped the mic, and waited for 8 years to start to pick the pieces back up.

choppaface on Jan 14, 2020 | | | [–]

I guess I don’t understand your argument that “Python 3 changed the entire text model and dropped the mic.” format strings are optional; the old % operator still works fine. The change to unicode is dramatic, but personally I haven’t run into major problems. I’ve had unit tests break because of it, but that’s why one has unit tests. I’ve also worked on a very large python webapp that underwent painful internationalization, and in that case we ended up using unicode strings everywhere anyways.

richardwhiuk on Jan 19, 2020 | | | [–]

The % didn't use to work fine. .iteritems() was made for no good reason.

Python 3 could have required all strings began with u" or b", but they didn't - they did something which encouraged breakage.

j88439h84 on Jan 13, 2020 | | | [–]

Six was available for years (2011) before Mercurial even started porting (2015).

https://github.com/benjaminp/six/graphs/contributors

eesmith on Jan 13, 2020 | | | [–]

That was part of the "discusses this at length". Part of the relevant discussion is:

> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)

nemothekid on Jan 13, 2020 | | [–]

> Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. [..] However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).

Isn't this more a problem with Python not easily differentiating between String and Byte types? Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.

Jasper_ on Jan 13, 2020 | | [–]

Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism. Python 2 had different types but you could be sloppy and it would kinda work out.

There was this assumption that Unicode code points were the correct single unit to talk about Unicode. You iterate over code points, you talk about string lengths in terms of code points, you slice in terms of code points. Much like the infamy of 16-bit Unicode, this is an assumption that has kinda gotten worse over time. Now we can and do want to talk about bytes, code points, and newer sets like extended grapheme clusters. I think this is probably the big failing of Python 3's Unicode model. Making a string type operate on extended grapheme clusters might fix it, but we'd be in for the same sort of pain, and the flexibility of "everything is bytes, we can iterate over it differently" of Go and Rust is much nicer in comparison.

The second thing was this assumption that everything remotely looking like text was Unicode, despite this maybe not being true. HTTP has parts that look like plain text, like "GET" and "POST" and the headers like "Content-Type: text/html". But the correct way to view this as ASCII bytes, and no other encoding makes sense; binary data intermixed with "plain text" definitely happens, and the need to pick and choose between either Unicode or Bytes caused major damage in the standard library which still persists to this day -- some parts definitely chose the wrong side. Take a look at the craziness in the "zipfile" module for one other example. It's probably fixed now, but back then, I basically had to rewrite it from scratch in one of my other projects.

They eventually relented and added back a lot of the conveniences to blur the line between bytes and unicode again, like adding the % formatting operator for bytes, which I think shows that their insistence on separating the two didn't really pan out in practice. And yet, migration is still a pain.

int_19h on Jan 13, 2020 | | | [–]

> Python 2 had different types but you could be sloppy and it would kinda work out.

It would "kinda work out", if your Unicode strings were ASCII in practice, and only then. Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.

Which is to say, it only worked out for English input, and even then only until the point where you hit a foreign name, or something like "naïve". Then you'd suddenly get an exception - and it happened not at the point where the offending input was generated, but at the point where two strings happened to be combined.

This was a horrible state of affairs for basically everybody except the English speakers, because there was a lot of Python code out there that was written against and tested solely on inputs that wouldn't break it like that.

Intermixing binary data with text can be represented just fine in a type system where the two are different. For your HTTP example, the obvious answer is that the values that are fundamentally binary, like the method name or the headers, should be bytes, while the parts that have a known encoding should be str - there's nothing there that requires actually mixing them in a single value. In those very rare cases where you genuinely do have something like Unicode followed by binary followed by Unicode in a single value, that is trivially represented by a (str, bytes, str) tuple.

The problem with the Python stdlib isn't that bytes and Unicode are distinct. It's that it's overly strict about only accepting Unicode in some places where bytes should be legal, too. This is orthogonal to them being separate types.

otabdeveloper4 on Jan 14, 2020 | | | [–]

> Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.

They could have just changed the default encoding to utf8. (For those too lazy to configure their Python properly.)

There, problem solved - and no need for a breaking Python 3.

int_19h on Jan 14, 2020 | | | [–]

It would still be a mess any time you have to deal with byte strings that aren't UTF-8. The problem is with the implicit conversion itself - it shouldn't happen, because there's no way to properly guess the encoding. But there was no way to get rid of it without breaking things.

otabdeveloper4 on Jan 14, 2020 | | | [–]

> But there was no way to get rid of it without breaking things.

Even such a breaking change would be a molehill compared to the mountain of breaking changes in Python 3.

Point is, they had one job, and they failed.

int_19h on Jan 14, 2020 | | | [–]

That change was at the heart of the breaking changes around strings in Python 3. If the conversions remained implicit, most people would probably have never even noticed that string literals default to Unicode, or that some library functions now require Unicode strings.

hsivonen on Jan 13, 2020 | | | | [–]

> There was this assumption that Unicode code points were the correct single unit to talk about Unicode.

The most messed-up thing about Python 3 is that it's supposed to be justified by doing Unicode right and they still got it wrong.

Having strings be sequences of Unicode code points is a super-bizarre design. That is, Python 3 strings indeed are semantically sequences of Unicode code points rather than sequences of Unicode scalar values. You can not only materialize lone surrogates (defensible for compatibility with UTF-16) but you can also materialize surrogate pairs in addition to actual astral characters. You still can't materialize units that are above the Unicode range, though, so it's not like C++'s std::u32string.

Looking at the old PEPs, it appears to have arisen by accident rather than as an actual design.

joshuamorton on Jan 13, 2020 | | | | [–]

I'm confused, there isn't an insistence that everything is unicode. Http headers are treated as bytes before you decode them, but you can totally decode an http request or response as ASCII. At least until you're interacting with a website that has unicode codepoints in it's url.

takeda on Jan 13, 2020 | | | [–]

I think the issue is with people getting used to python 2 approach, where the distinction was between str (bytes) and unicode. In python 3 you should not think of bytes vs unicode, you should think as text vs bytes and you should use text as long as necessary.

BTW: I believe the http headers supposed to be encoded using ISO-8859-1 it's essentially same thing as US-ASCII, but it covers an entire byte.

takeda on Jan 13, 2020 | | | | [–]

> Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism

Go has string and byte[], and you can't mix it, you have to cast. Java has String, char[] and byte[] and similarly you need to do cast. Rust has Bytes and String (I don't know Rust enough, but I'm pretty sure it doesn't implicit conversion between them).

Also Python 3 doesn't distinct between Bytes and Unicode, Python 3 has distinction between bytes and text (str - BTW: Guido actually expressed regret that he did use "str" instead of "text", because it would be much clearer)

In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes, how the bytes are stored internally is an implementation detail, if you need to write to a file or to network, you encode the text using various encodings (most popular is UTF-8) and you decode it back when reading.

Jasper_ on Jan 13, 2020 | | | [–]

Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

> In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes

Python 3 strings store Unicode code points. When you iterate over a Python 3 str, you get back Unicode code points. As mentioned elsewhere, this is not a Unicode scalar value, and can include things like unpaired surrogates. This is also not an extended grapheme cluster, which is the current best-effort description as to what counts as a "single character".

So, you really do need to be concerned about what your strings contain. If you don't want people to care, don't give them the ability to iterate, slice, or index into str to retrieve Unicode code points, and leave them as opaque blobs, as some of those other languages do.

takeda on Jan 14, 2020 | | | [–]

> Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

Yes, but at this point you're arguing about implementation details. The idea is that if you use it as a string it is string, if you need bytes, you need to perform a conversion. It shouldn't be your concern how it is stored internally.

If we are going into Python internals, the string can be stored as multiple versions from basic C-string to unicode code points. If you perform conversion it will cache the result so it can be used in other places. I don't remember the details, since I looked at the code long time ago, but it isn't that simple.

Jasper_ on Jan 14, 2020 | | | [–]

I don't know how to explain it any simpler. Iterating over a str type in Python 3 enumerates Unicode code points. The length of a str type is the number of code points it contains. Reversing a str will reverse the Unicode code points it contains (not guaranteed to be a sane operation). Indexing into a str with foo[0] gives you back a str containing a single Unicode code point.

This is not an implementation detail, it is fundamental to how the str type in Python 3 operates. I have not talked at any point about the internal storage of this type, just the interface it publicly exposes.

takeda on Jan 14, 2020 | | | [–]

This is called a leaky abstraction. I can't find a good behavior for a high level language to do it this way. If you use index in a string you always will get something that's invalid, at least in Python or Java you get code points.

lmm on Jan 14, 2020 | | | | [–]

Python 3 strs should not be iterated over, sure. Ban that in your linter, then you're in the same position you would be in Rust. It's a misfeature but it's still a detail.

toyg on Jan 13, 2020 | | | | [–]

Zipfile has always been a mess. I have no idea why, but its interfaces have been consistently poor from a usability perspective. This well before py3 was a factor.

steveklabnik on Jan 13, 2020 | | | [–]

The blog post talks about this a bit in Rust, but we don't actually say that. We do make that the default, but we also give you the ability to get at the underlying things as well. There's a lot of interesting work here, actually, like WTF-8...

kevingadd on Jan 13, 2020 | | | [–]

In the wild WTF-8 and its 16-bit equivalent show up more often than you'd expect. I ended up discovering a case recently where part of the .NET executable file format is actually encoding strings as WTF-16 (not UTF-16) and any internal lowering needs to map them to WTF-8 instead of UTF-8. Until that point I had expected to only ever encounter WTF-8 in web browsers!

vmchale on Jan 13, 2020 | | | [–]

> Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.

When working with e.g. filepaths, Rust has an OsStr type.

dilap on Jan 13, 2020 | | | [–]

A go string is just a sequence of bytes, which is usually/by convention utf8. But you can store anything you want in there, if necessary.

bjoli on Jan 13, 2020 | | | [–]

I would say that it is just shitty design not not differentiate between bytestrings and regular strings in a way that causes problems. The biggest design flaw here was not forcing people to understand the difference in python2

rossdavidh on Jan 13, 2020 | | [–]

Having worked in python for about a decade, first in python2 and lately in python3, and having seen projects convert, I find this article baffling. I found Six to work pretty well, and where it didn't it wasn't hard to change.

I think the core error here was in NOT doing what he calls a "flag day" conversion. Sometimes it is easier to do something quickly, than to live with it happening slowly. I've done "flag day" conversions, and they were pretty painless, if stressful at the time.

the_mitsuhiko on Jan 13, 2020 | | [–]

> Having worked in python for about a decade, first in python2 and lately in python3, and having seen projects convert, I find this article baffling. I found Six to work pretty well, and where it didn't it wasn't hard to change.

It matters a lot where you work. If you are in high level land Python 3 is not much of a chance. If you work at the boundary (wire protocols, OS interop, text transformation) then Python 3 is a significant step back, especially before 3.6. A lot of the mud that Mercurial stepped through is also where I went through with my libraries. The day I managed to get the PEP through that reintroduced the u prefix on strings was also the last time I voluntarily participated in a language summit. The atmosphere was awful and not evidence based.

Jasper_ on Jan 13, 2020 | | | [–]

Yeah, I was around for the sidelines of PEP-461, reintroducing % onto the bytes type, trying to get things done behind the scenes, and it was just a miserable experience all around. I don't think anybody cared to understand our concerns about why we should bother making the bytes type useful. At times it seemed like they believed leaving in the concatenation on the bytes type was a mistake.

Read through the receipts here: https://bugs.python.org/issue3982

the_mitsuhiko on Jan 13, 2020 | | | [–]

A lot of present day Python 3 programmers also have not been part of the very emotional conversations about getting some of these things back in the early days of Python 3.

klodolph on Jan 13, 2020 | | | [–]

I think Six works pretty well for a certain percentage of projects, and then completely blows up for others.

Same thing applies to using Mypy. Some modules are easy to add annotations for, other modules have insanely complicated types.

CJefferson on Jan 13, 2020 | | | [–]

Mercurial still can't have a "flag day" as Macs are still distributed with Python 2, and not 3. Therefore it would make Mercurial significantly worse for mac users if it didn't support Python 2.

krupan on Jan 13, 2020 | | | [–]

I love mercurial to death, but let's be honest, how many Mac/mercurial users are there? Very few, I would guess. Now, how many of those users don't install a version of python not included with the OS? I'd guess we are getting pretty darn close to zero there.

CJefferson on Jan 14, 2020 | | | [–]

I work with 2 such people (academics, prefer mercurial, use R rather than Python so I can't imagine would have had any reason to install Python 3). Lots of people use version control without being "developers", so not needing a more up to date Python (or, at just happy with Python 2).

eesmith on Jan 13, 2020 | | | | [–]

"how many Mac/mercurial users are there?" - Hi! I'm one!

"how many of those users don't install a version of python not included with the OS". FWIW:

    % which hg
    /usr/local/bin/hg
    % ls -l /usr/local/bin/hg
    lrwxr-xr-x  1 eesmith  admin  30 May 23  2019 /usr/local/bin/hg -> ../Cellar/mercurial/5.0/bin/hg

McP on Jan 13, 2020 | | | | [–]

Last time I used Mercurial (it's been a while) it shipped with its own python executable

pseudalopex on Jan 14, 2020 | | | [–]

You can install it with pip.

ufov2 on Jan 13, 2020 | | [–]

The approach of doing the transition slowly over many years maybe was a mistake here, and another thing making it harder seems to have been little support from the top of the project.

I ported two projects with ~200000 Python-SLOC (about the same size as Mercurial according to sloccount) back in the early 3.x days. Doing this via more or less flag-day conversions within a few months, converting the codebases first to 2to3-able subset, and as a second step later on dropping 2to3 via common dialect of Python 2/3 with six, was not very painful in the end.

sfink on Jan 13, 2020 | | [–]

Did you have a large ecosystem of third party extension modules that also had to obey the flag day?

masklinn on Jan 13, 2020 | | | [–]

> I ported two projects with ~200000 Python-SLOC (about the same size as Mercurial according to sloccount) back in the early 3.x days. Doing this via more or less flag-day conversions within a few months, converting the codebases first to 2to3-able subset, and as a second step later on dropping 2to3 via common dialect of Python 2/3 with six, was not very painful in the end.

Sounds like you used the same method, just over a smaller timeframe: convert to a common 2/3 subset, then drop Python 2 at some later point.

epage on Jan 14, 2020 | | [–]

I can't believe their leadership de-prioritized the port until the last minute when they have an ecosystem on top of them that also has to port. I feel that was irresponsible.

The project lead said to not push `b""` on people. That was a mistake imo that led them down a very frustrating rabbit hole (transformers, `pycompat`) that probably greatly extended their port time. One reason given is to not confuse devs with those details but they are critical details and ones you can't avoid with Rust. This inconsistency makes me wonder if the post is mostly misdirected frustration. A lot of it centers on.

I agree about the early python3 releases making it harder. I don't remember what the python leadership's intent was but i think I actually agree with what they did, now. Over my career, I've come to appreciate starting with the ideal and working backwards. This let's you learn what is needed rather than wasting time on speculation (planning or dev) or making a more crippled product.

I can understand frustrations with bugs / differences in python versions. I ran into that a lot just within `2.7.*`

In my mind, the most notable complaint is the stdlib's mixed efforts in supporting str or bytes. I feel "batteries included" maked this harder. They had to port a lot. Not everything can get the same level of scrutiny, especially from domain experts that represent a variety of use cases. They also can't break compat. If they weren't battries included, the porting efforts would be more directed, pull in the right people, and you can fix things later if you get it wrong.

What I find interesting is how different our experiences are that lead to the same place. My frustrations with python are rooted in build tools and packaging and have been loving Rust.

EDIT: I'm also surprised at the hostility towards distribution packagers. Instead of working with them to find mutually valid solutions, the express frustration at distributions and cripple themselves in not allowing third-party dependencies.

Conan_Kudo on Jan 14, 2020 | | [–]

> EDIT: I'm also surprised at the hostility towards distribution packagers. Instead of working with them to find mutually valid solutions, the express frustration at distributions and cripple themselves in not allowing third-party dependencies.

These days, it's "cool" to hate your downstreams (y'know, bite the hands that feed you and all that).

Seriously though, as one of those "distribution packagers" (Fedora, Mageia, OpenMandriva, and openSUSE!), it sucks that I encounter this more and more often. I try to be somewhat involved in the projects I package and contribute where I can, be it code, advice, or anything in between. Ten years ago, people were generally friendly to me. These days? It's rare to get a thank-you. Usually I get grumbles and anger for daring to ship it in a distro package. I've even had a couple of patches rejected that fix real bugs simply because they were discovered as part of my packaging and testing something because it doesn't happen on the dev's machine in his virtualenv on his Mac...

brohee on Jan 14, 2020 | | | [–]

In the case of Debian, the Debian official policy was for a long time to explicitly unbundle everything, which for many things amounted to sabotage. Things like rvm were born of distributions unusable packages.

peatmoss on Jan 13, 2020 | | [–]

I think my takeaway lesson is that it’s very hard to introduce large, breaking changes to a language and not alienate a large proportion of existing users. I don’t know that there’s a right way.

I look at Perl, which was a juggernaut when I first used Python, and announcements of Perl 6 certainly didn’t help Perl’s slide. Often cited is the fact that Perl 6 is a totally different language unrelated by anything but creator and name. The Perl brand was not enough to carry the bulk of Perl users from Perl 5 to Perl 6. Perl 6 is now called Raku, which probably better reflects the magnitude of the change.

On the other hand Python 3 is a small but still significant departure from Python 2. If they’d called Python 3 something else, we’d probably be griping about how superficially different from Python 2 it was without bringing substantially new ideas.

Oddly my feeling is that Racket, in its departure from mainline Scheme, largely did retain its core audience, but that may have been a feature of its usage in academia.

Fast forward to last year when a prominent Racket architect announced “Racket 2” which would completely change the syntax of the language. Prominent community members reacted negatively, due to fears of Perl 6’s fate. But now they’ve decided to simply call the new research language Rhombus and have reiterated plans to continue supporting Racket. I went from feeling very negative to the change to being okay with the direction.

I’m not sure there are lessons to draw, other than noting than version bumping versus making a new language with a new name can be bad for entirely different reasons.

intrepidhero on Jan 13, 2020 | | [–]

I'm not well informed about the breaking changes made under the hood in Python 3. But I wonder if breaking backwards compatibility in this case wasn't simple an externalization of costs. The community certainly went through a TON of work to port libraries, etc. Could all those man-hours have gone towards making incremental, backwards compatible changes instead?

I think the takeaway is that if you want to make breaking changes, make a new thing and turn the old thing over to a maintenance team. If after a while you learn some things that could improve the old thing see if they can be incorporated without breaking compatibility or if the new thing is really so much better people will switch.

b2gills on Jan 18, 2020 | | | [–]

Everything that I've heard that was changed from Python 2 to 3 are reminiscent of things that Perl5 handled while maintaining backwards compatibility.

I mean Perl5 is still mostly backwards compatible back to the original version released in 1987. (There were a few rarely used bad features that should have never been there which have been removed.)

The way it does this is by having you specifically ask for the new features, if they would otherwise break code.

edflsafoiewq on Jan 13, 2020 | | | [–]

I think that's the wrong takeaway, especially since it kind of absolves Python of doing anything wrong. There were many examples of ways that this was unduly hard just because of how poorly the transition was designed for.

> While hindsight is 20/20, many of the issues with Python 3 were obvious at the time and could have been mitigated had the language maintainers been more accommodating - and dare I say empathetic - to its users.

mikepurvis on Jan 13, 2020 | | | [–]

The author's suggestion of permitting a certain set of "from __past__" imports seems especially astute. This would have made it much more possible much earlier to have a single large codebase running natively on the Python 3 interpreter, but with modules (especially leaf modules) at varying degrees of ported-ness.

In contrast, the original porting guidance for module authors was actually to maintain the Python 2 source as the master copy, and use 2to3 to transform it for running tests or cutting a Python 3 release. How is a transition ever supposed to happen if the new hotness is perpetually a second class citizen?

masklinn on Jan 13, 2020 | | | [–]

> The author's suggestion of permitting a certain set of "from __past__" imports seems especially astute.

Hell that's essentially what ended up happening over time, as past features got reintroduced (the `u` prefix, bytes.__mod__, `callable` being reintroduced, `range` being improved, …), as well as the serious Python3-ification of the Python2 branch that was 2.7.

mikepurvis on Jan 13, 2020 | | | [–]

I feel like "from __past__" would have allowed projects to get to the point where they were "on Python 3" much sooner, though.

The mentality would have been "we're on Python 3, and we have a long tail of cleanup to do excising from-past out of our codebase" rather than "we and our users are all still on Python 2 but we thiiiiink our code is mostly using constructs that are Python 3 safe. We get a green CI checkmark, anyway, but who really knows."

peatmoss on Jan 13, 2020 | | | | [–]

Hmm, I didn’t mean to convey that the Python 2/3 transition was done well. I think lots of projects have tried to port their community success to substantially different projects, and that most have failed for a variety of different reasons.

Python 3 is a “success” in that a lot of people have moved. But it was, as you rightly point out, a hard won victory that left a lot of people unhappy.

tus88 on Jan 13, 2020 | | | [–]

> introduce large, breaking changes without major benefits

FTFY

lizmat on Jan 14, 2020 | | | [–]

Re: Often cited is the fact that Perl 6 is a totally different language unrelated by anything but creator and name

Indeed. That is why Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media).

I'm not sure what lessons can be drawn from this, other than being indecisive has its price.

mark_l_watson on Jan 13, 2020 | | | [–]

I totally agree with you, I am uncomfortable with Racket 2 having a non-Lisp syntax. As someone who has used Lisp languages to get stuff done for over 30 years, I would say NO to the syntax change.

That said, Racket is open source the maintainers have good reasons for a change based on getting a larger user base. I wish them great success.

peatmoss on Jan 13, 2020 | | | [–]

I disliked the “Racket2” name because it strongly implied that Racket 1 (the one with parentheses on the outside and few commas) was the past, and that something with a vaguely Algol-ish syntax was the future.

But Racket as a project was always about language experiments. I certainly didn’t bristle at Typed Racket or any of the other languages. And so, in changing “Racket 2” to “Rhombus” and committing to mainline support of Racket, I feel pretty comfortable with the direction. I find this fascinating that I feel this way given that nothing has really changed but the name.

reggieband on Jan 13, 2020 | | [–]

In the past I have been a vocal advocate for the way the transition from Python 2 to Python 3 was handled. However, it should be said I use Python primarily as scripting glue, e.g. for build scripts and automation tasks. I have never worked on a "large" Python code base nor did I have to migrate anything. Almost everything I had written in Python 2 was just naturally replaced by newer scripts in the due course of time.

I also remember my first forays into Python 3 and the annoyance I had at some of the decisions. I recall when they relented on the % operator for string interpolation and I agree it was a poor initial choice to leave it out. I totally agree with the author that Python 3 could have made some subtle changes earlier on to help those with massive codebases.

And I still feel it was the right move. Somehow Python is even more relevant today than it was when this painful process began. While some may say that popularity is despite missteps I actually believe the general slow and cautious push forward is one of the primary reasons Python continues to succeed. There is a balance between completely abandoning old users (e.g. Perl 5 to Perl 6) and keeping every historical wart (e.g. C++). IMO, the Python community found a middle ground and made it work.

b2gills on Jan 21, 2020 | | [–]

I have yet to find out about a Python 3 change that couldn't have been handled in a backward compatible way.

I know this because every change I've heard about is reminiscent of a Perl5 change where backwards compatibility was not broken.

The transition to Python 3 was not handled anywhere as well as it could have been.

There is a reason Go2 isn't copying Python3. (Strangely they seem to be copying the Perl5 update model even though they don't realize it.)

The thing is that because Python has an unhealthy fixation on “There should only be one way to do things” they rejected things that would have made the transition easier. (Or even less necessary.)

I think it is kind-of telling that someone thought it necessary to create Tauthon. Tauthon is sort-of applying the Perl5 update model to Python 2.7.

lizmat on Jan 14, 2020 | | | [–]

Re: There is a balance between completely abandoning old users (e.g. Perl 5 to Perl 6)

Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media).

In the original design of Perl 6, a source-level compatibility layer ("use v5") was envisioned that should allow Perl 5 source code to run inside of Perl 6. So the plan was to actually not abandon old users.

In my opinion, this failed for two reasons:

1. Most of Perl 5 actually depends on XS code, the hastily devised and not very well thought out interface to C code of Perl 5. Being able to run Perl 5 source code in Perl 6 doesn't bring you much, unless you have a complete stack free of XS. Although some people tried to achieve that (with many PurePerl initiatives), this really never materialized.

2. Then when the Inline::Perl5 module came along, allowing seamless integration of a Perl 5 interpreter inside Perl 6, using Perl 5 modules inside of Perl 6 as if they were Perl 6 modules, it basically nailed the coffin in which the "use v5" initiative found itself already in.

And now they're considered different languages after the rename to Raku, dividing already limited resources. I guess that's the way of life.

reggieband on Jan 14, 2020 | | | [–]

I think my wording "abandoning" was more inflammatory than I would have liked. And I didn't want to call out or target Perl 6 / Raku. What I meant to convey was that the language team behind Perl 6 (as it was known before the rebranding) made a decision that it would be a new language and not an evolution of the existing language. It was the first example in my mind, one most people would recognize, that anchored one side of the continuum I was describing. I assume there are better examples (or worse offenders) but I don't know of any off hand.

mikl on Jan 13, 2020 | | [–]

There’s no doubt that the 2 -> 3 transition was rough for the Python community. I personally stopped using Python as my go-to language in the early Python 3, since writing Python 2 code felt stupid since it was outdated the moment you wrote it, and 3 wasn’t really well supported by the community and tooling yet.

On the other hand, Python adoption has really taken off since Python 3.5-ish. Python has never been more popular.

So while you may wonder what might have been, had the transition been smoother, it’s hard to argue that Python 3 is a failure. All’s well that ends well, I guess?

Although it’s sad that Guido felt the need to step down. It’ll be interesting to see where Python goes this decade, now the transition is over and there’s a wealth of possibilities in front of it.

I expect there’ll be a lot of people looking to replace JavaScript with Python once you can run it in the browser with WASM.

roca on Jan 13, 2020 | | [–]

"All's well that ends well" neglects the costs to the community of bad decisions. It also encourages people to think that those decisions must not have been very bad, and not learn from those mistakes.

You can see that at work in the responses here. "And I still feel it was the right move. Somehow Python is even more relevant today than it was when this painful process began." I.e. success is thought to justify every decision made along the way.

I see this fallacy at work in Linux too. "Linux is successful, therefore haphazard CI and using email to track bugs and patches must be a fine way to operate".

mrr54 on Jan 14, 2020 | | | [–]

Haphazard CI and using email to track bugs and patches is a fine way to operate.

zmmmmm on Jan 13, 2020 | | [–]

It mostly makes me question the wisdom of implementing such a tool in Python in the first place - if you want low level access to raw underlying representations etc, using a super high level scripting language seems like a "wrong tool for the job" scenario. I am sure on the other hand they got a lot of productivity benefits from doing that, which is great, but having taken that tradeoff I don't think it is fair to "sour" on a language when you clearly applied it out of it's domain and then encountered problems due to that.

Tanooki_Mario on Jan 14, 2020 | | [–]

You can't really say this after the language has supported the feature as a design decision for 15 years and then removes it. Part of the popularity of python is it made things easier for systems programmers. Unless you want us to write everything in C/C++ again?

souprock on Jan 13, 2020 | | [–]

Not what you want to hear about a version control system: "Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years."

indygreg2 on Jan 14, 2020 | | [–]

C/C++ is a language with limited facilities to ensure correctness at compile time. The languages are riddled with undefined behavior in common features that programmers with multiple decades of experience still get tripped up by. NULL access - the so called "billion dollar mistake" - out of bounds reads and writes, and use after free create a litany of security issues and create massive liability for companies who choose to author software in these languages.

Not what you want to hear about an operating system :p

otabdeveloper4 on Jan 14, 2020 | | | [–]

"C/C++" is not a language.

Good lord, how much ignorance can hackernews handle??

weberc2 on Jan 13, 2020 | | [–]

> One of the biggest early hurdles in our porting effort was how to overcome the string literals type mismatch between Python 2 and 3. In Python 2, a '' string literal is a sequence of bytes. In Python 3, a '' string literal is a sequence of Unicode code points. These are fundamentally different types. And in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case. We knew that Mercurial would need to eventually switch many string literals from '' to b'' to preserve type compatibility. But doing so would be problematic.

> In the early days of Mercurial's Python 3 port in 2015, Mercurial's project maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly disrupt others: he wanted the Python 3 port to more or less happen in the background and not require every developer to be aware of Python 3's low-level behavior in order to get work done on the existing Python 2 code base. This may seem like a questionable decision (and I probably disagreed with him to some extent at the time because I was doing Python 3 porting work and the decision constrained this work). But it was the correct decision. Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case).

As a general rule, this seems like good practice, but surely b-strings, print_function, etc are a trivial upfront cost, and one that would have to be paid sooner or later anyway?

lacker on Jan 13, 2020 | | [–]

It seems like a lot of the cost was when system libraries made different choices than Mercurial would have made. For example, the Python 3 filesystem libraries often used unicode as a wrapper around an underlying bytes interface, and Mercurial really wanted to be able to pass bytes directly to the underlying interface. So it isn't just that they had to update their data types, they also had to adjust code to work with system libraries of slightly different semantics.

wnoise on Jan 13, 2020 | | | [–]

The python interface usually would take bytes, and if it did would also return bytes. But there were a lot of things that didn't take arguments, so always returned Unicode strings. Instead of e.g. getcwd(), you would have to use getcwdb(). Which naturally didn't have an equivalent in python 2 though they did add the complementary getcwdu() (which one should basically never use).

novok on Jan 13, 2020 | | | [–]

Having done a py2 to py3 migration and ran into the same issues, these wouldn't of been that big of a deal IF python had static typing from the get go.

The static compiler would notice all the breaking type changes at compile time and you can systematically fix all of them at once. You wouldn't miss one or two and have to run your unit testing suite to exercise the type system underneath.

I really believe big breaking changes like this in a language causing migration stagnation is a property of dynamically typed languages. With other statically typed languages like swift or rust, it happened quite frequently but wasn't as big of a deal in practice.

TylerE on Jan 13, 2020 | | | [–]

Python with static typing wouldn't be python. It wouldn't even be python-adjacent.

weberc2 on Jan 13, 2020 | | | [–]

Python has had static types for years. https://www.python.org/dev/peps/pep-0484/

TylerE on Jan 14, 2020 | | | [–]

Yes, because optional annotations are totally the same thing as changing the languages basic functionality.

weberc2 on Jan 14, 2020 | | | [–]

Oof, that’s quite a take. Static type checkers by definition don’t change the language’s functionality; they merely check the types.

TylerE on Jan 14, 2020 | | | [–]

Static typing is ceremony. That is anti-Python.

If you want Java, write Java. Don't try to make Python Java.

weberc2 on Jan 14, 2020 | | | [–]

It’s literally Python, and if you’re writing sane/correct Python, you should hardly notice (things will work a little better, documentation will be more complete and actually accurate, etc). You can even do many of the same bizarre things that shitty Python programmers do without static typing, such as return differently types data based on the value of an input parameter. Someone passes in the string “asdf” so we’ll return an int, but if they pass “xyzfoo” we return a datetime. Next thing you know you’ve built matplotlib.

kingemer on Jan 13, 2020 | | | [–]

It sounds like the cost was non trivial for them, partially because they weren’t allowed to break things for python2, or even disrupt the efforts of those using it.

The language wasn’t ready for the transition, but it feels like it may have been even harder on them because of the requirements imposed on their project.

kingemer on Jan 13, 2020 | | | [–]

Considering how much opposition there is in moving to python3, has there been any significant effort in the community keeping python2 alive?

acdha on Jan 13, 2020 | | | [–]

Most of what opposition there is comes from people with projects which are some combination of under-staffed, poorly tested, or with a marginal approach to scheduling necessary maintenance work. That is not a great place to expect to find reliable maintenance contributions.

Where you are more likely to find this is from the major Linux vendors: e.g. Red Hat is committed to support it through 2024 and I would expect that they won't be alone in offering paid support for remaining users.

weberc2 on Jan 13, 2020 | | | [–]

I'm actually (pleasantly) surprised that someone like Google (or a league of someones like Google) who have deep pockets and so much Python 2 that it's cheaper to maintain Python 2 than it is to port to Python 3. Perhaps they (correctly) were concerned about the ecosystem moving on toward Python 3, leaving them behind?

weberc2 on Jan 13, 2020 | | | | [–]

Right, they could have paid a nontrivial cost (asking devs to use b-strings, print_function, etc) but didn't by fiat, choosing instead to pay a greater cost during the migration in addition to the nontrivial cost. My comment expresses skepticism about that decision.

j88439h84 on Jan 13, 2020 | | [–]

A lot of Mercurial's issues would have been resolved much easier if they'd used the common tools for maintaining polyglot 2/3 code instead of trying to invent everything themselves.

Futurize and Pasturize in particular provide essentially all of the features that this post laments missing.

https://python-future.org/

lacker on Jan 13, 2020 | | [–]

The author does touch on that.

When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code.

Some environments just can't use dependencies like this. IMO Python 3 was too much of a breaking change, and in particular, the ability to transition from 2 to 3 should have been better in Python itself.

j88439h84 on Jan 13, 2020 | | | [–]

Six, for example, is designed to be a single file -- specifically to ease copying it directly into the code base. But the idea that Mercurial couldn't use dependencies because of fear for what Debian might do...I find it so hard to believe that's the best choice. Vendor if you must, but do not reinvent the wheel.

masklinn on Jan 13, 2020 | | | [–]

Having done so, six really is not necessary for a transition. We went with forking werkzeug's minimalist compat file because six was waaay overkill, and that made it easier to progressively rip it out later.

python-future is invaluable for the fixers (which are way better than 2to3's), the runtime stuff you can easily do without.

morelisp on Jan 13, 2020 | | | [–]

The first versions of Flask that supported Python 3 absolutely tanked performance. Some memory k/v store esque APIs we had that spent a good chunk of their time processing headers / query params to know what keys to look up got something like 8x slower. I remember ripping all the logic out of one endpoint and finding out it was >1ms minimum round trip, something like 85% of the time spent switching pure ASCII back and forth in six. It's a shame, because I absolutely love Flask, but we couldn't tolerate that so new APIs ended up in other languages.

I can't imagine what it would do to Mercurial's performance to have picked the wrong migration library early on.

untitaker_ on Jan 13, 2020 | | | [–]

Those tools were not popular at the time for sure. I remember that Python-Future only got traction sometime after Flask has been ported to Python 3.

j88439h84 on Jan 13, 2020 | | | [–]

FWIW, Flask supported Python 3 in 0.10.1, released 2013-06-14 and Mercurial started porting in 2015.

mbar84 on Jan 14, 2020 | | [–]

I guess this is as good a time as any to pimp my work in this area: https://pypi.org/project/lib3to6/

lib3to6 is a Python compatibility library similar to to BableJS. It translates (most) valid Python 3.7 syntax to valid Python 2.7 and Python 3 syntax (aka. universal python). If you would like to develop with a modern python version and yet still maintain backward compatibility or if you want to bring a legacy codebase forward step by step (my use case), then please have a look.

TeeWEE on Jan 13, 2020 | | [–]

We just migrated a big codebase to python 3 in about a year. It was not easy, but also not super hard with tools like futurize, and mypy.

Gladly we already had mypy hints, this helped us find a lot of mistakes when (not) using bytes.

Now we're on python 3 we're auto-migrate the type hints to be inlined with tools, like com2ann https://github.com/ilevkivskyi/com2ann And we're auto rewriting code to be more python 3 like with libcst and custom codemods...

alangpierce on Jan 13, 2020 | | [–]

It's interesting that they wanted to add b' prefixes to all strings, and I wonder if they would have had a better experience by embracing regular strings instead. At least in Python 3, if your string only contains ASCII, then the underlying representation will use one byte per character, so ASCII-only strings are stored just as efficiently as ASCII-only bytes instances.

I think there are two mental models for how to approach the str/bytes split:

1.) A `str` is for unicode use cases, and a `bytes` is better for cases that don't support unicode.

2.) A `bytes` is an array of numbers between 0 and 255. A `str` should almost always be used when your value is conceptually a sequence of characters. `str` doesn't imply that arbitrary unicode is allowed, and it's fine to have a convention that a particular `str` is ASCII-only, just like other conventions you might have on variable values.

My impression is that #1 is the Python 2 mental model and is tempting for Python 3, but that #2 often works better when writing Python 3 code. Under mental model #2, asking for "%s" formatting is really asking for a replacement strategy that detects the number 37 followed by the number 155 in an array of numbers and fills in a sub-array, which seems more strange and likely to get false positives if you're really working with binary data like the bytes of a .jpg file.

That said, I'm sure the devil is in the details, and maybe a project like mercurial has to stay backcompat with bytes data that is neither ASCII nor valid UTF-8, or some other compelling reason to stick with bytes everywhere.

CJefferson on Jan 13, 2020 | | [–]

The problem is many strings might contain things like commit messages, or filenames, neither of which has to be valid unicode.

I've had the same problem with a few Python 2 -> 3 conversions -- everything is fine until you have to operate on text or filenames which aren't valid utf8/unicode.

alangpierce on Jan 13, 2020 | | | [–]

Got it. So I understand, maybe someone saved a filename as the latin-1 encoding of some non-ASCII text, and Mercurial would need to support such files (but also would have no contextual information that it's latin-1)?

I'm tempted to say "nobody should have filenames like that", but I guess a project like Mercurial needs to be as compatible as possible. Are there modern use cases for filenames like that, or is it fair to say it's all legacy data?

xorcist on Jan 13, 2020 | | | [–]

It's going to be the case every time you mount a Windows file system, for example.

A big part of the problem is that a project like Mercurial doesn't have control over what files people use it on. They have to design for the pessimal scenario, because when the tool breaks, users complain.

CJefferson on Jan 13, 2020 | | | | [–]

If you want to write a version control system, banning a big chunk of perfectly legal filenames on both linux and windows seems like a bad choice. Users do have such filenames, and saying you can't store their files "because they aren't UTF-8" will annoy them.

I've seen such filenames occur from people using the name as a binary encoding in some way. As long as you miss (from wikipedia) NUL, \, /, :, *, ", <, >, | you will end up with a filename which all OSes support, and some systems do that.

gfxgirl on Jan 14, 2020 | | [–]

I wish more devs cared about backward compatibility not just in python but in general. I know a particular, very popular library, 60k stars on github, who's maintainers break stuff every month. They don't care how many developers time it wastes they just decide FooBar should really be named FooB and rename it. No Effs given how many people it disrupts. You'd think people would complain but cult of personality and/or popularity of library turns people in to fan boys where they seem to think "If these geniuses are doing it this way then it must be good". .... sigh

rurban on Jan 15, 2020 | | [–]

For me this is the most exciting outcome:

> The only Python 3 feature that Mercurial developers seem to almost universally get excited about is type annotations. We already have some people playing around with pytype using comment-based annotations and pytype has already caught a few bugs. We're eager to go all in on type annotations and uncover lots of dynamic typing bugs and poorly implemented APIs.

Over in perl land people still spill their hate on types, which caused hard forks.

raymondh on Jan 13, 2020 | | [–]

This criticism of the dev team seems naïve, "It should not have taken 11 years to get to where we are today." Core developers can make tooling available, but they can't control adoption. That is a user decision. Users switch-over on a timetable governed by their own individual cost-benefit analysis.

yjftsjthsd-h on Jan 13, 2020 | | [–]

> Core developers can make tooling available, but they can't control adoption

Core developers made the design decisions that made nobody want to adopt it.

> the ecosystem of users and projects are collectively much better-off than if the transition had not occurred at all.

The question seems more like, "could the same benefits have been had with less pain", and a reasonable reading is that the answer is yes. (ex. 4 years of not being able to work with bytes reasonably even if you did need them)

bschwindHN on Jan 14, 2020 | | [–]

I wonder how many human lives' worth of work has been wasted from the decision to use Python and having to deal with 2/3 transitions, and if it was worth it for the speedup of using an interpreted language.

2T1Qka0rEiPr on Jan 14, 2020 | | [–]

Not knowing really anything about Mercurial, the `skip-blame` feature seems interesting, but seems Git doesn't have something similar (has to be constrained when calling `blame`)

loxs on Jan 13, 2020 | | [–]

It would probably be less painful and much better (for other reasons) to migrate to some other language. Some projects did that successfully, or are in the process of doing so. Most notably reposurgeon: https://gitlab.com/esr/reposurgeon

sfink on Jan 13, 2020 | | [–]

Greg says as much in the article: in hindsight, porting to Rust would have worked out better. Which is a pretty bold statement, but very interesting to hear from someone with intimate experience to back the opinion up.

cookiecaper on Jan 13, 2020 | | | [–]

Mercurial's dependence on Python has always held it back, IMO. Self-contained Rust or Go-style static binaries work much better for "install everywhere" system utilities. I'd love to see Hg port to a more concise ecosystem and potentially claw some of the market away from Git.

mixmastamyk on Jan 14, 2020 | | | [–]

The author wrote pyoxidizer for that purpose.

ufmace on Jan 13, 2020 | | | | [–]

In a sense, Rust may have been a better choice for Mercurial overall, but it's hard to imagine how much of a pain the migration process would be. I don't think you could make much of any headway going Python 2 -> Rust with automated tools. That means the transition would look like, stop all Mercurial dev in its tracks, have all current contributors (who can and care to) learn Rust, bring on a couple of devs with experience in architecting large Rust projects, spend however long redesigning and rewriting in Rust, release a roughly feature-equal version a year or 2 later. Good way to move Mercurial from second-place to Git to barely known.

cesarb on Jan 13, 2020 | | | [–]

> That means the transition would look like, stop all Mercurial dev in its tracks

Not necessarily. One of the cool things of Rust is that it can easily expose and use a C-compatible API; one of the cool things of Python is that objects can somewhat easily be implemented or accessed through a C-compatible API. This allows for gradual replacement: the code can be piece by piece rewritten in C (actually, Rust pretending to be C), while still looking like Python objects to the rest of the code.

And from what I have heard, mixing Python and Rust in Mercurial is actually being done: https://www.mercurial-scm.org/wiki/OxidationPlan

raymondh on Jan 14, 2020 | | | | [–]

FWIW, Rust didn't become feature stable until mid-2015. So even there, the world was changing.

Conan_Kudo on Jan 14, 2020 | | | [–]

Rust is still not stable as a language. Things are still changing, behaviors are still breaking, and the language is iterating every six weeks as if it doesn't matter if stuff breaks because of it.

Funnily enough, the major thing that breaks is Firefox, and they say to always use the latest stable version.

So Rust isn't winning points there either... :/

cesarb on Jan 14, 2020 | | | [–]

> Things are still changing, behaviors are still breaking, and the language is iterating every six weeks as if it doesn't matter if stuff breaks because of it.

It does matter: the Rust developers are careful to avoid breaking working Rust code, unless the breakage is caused by fixing a soundness hole (in which case, one could argue that the code was already broken). They even have a tool (crater) which tries to compile every Rust library they know of, which they use whenever they suspect a change might break working code.

The things which are still changing are mostly new features. For instance, async was recently introduced; introducing it didn't break anything (even the new reserved keywords are reserved only if the code is built with "edition=2018", while old code predating that change will be using the default "edition=2015").

> Funnily enough, the major thing that breaks is Firefox, and they say to always use the latest stable version.

Isn't that on the other direction, with Firefox using new features which are only available on the latest stable version of Rust, meaning it "breaks" on old versions of Rust?

Conan_Kudo on Jan 14, 2020 | | | [–]

> Isn't that on the other direction, with Firefox using new features which are only available on the latest stable version of Rust, meaning it "breaks" on old versions of Rust?

No. In openSUSE, we've been blocked for far too long because the latest Firefox releases did not compile with the latest Rust releases. Some enterprising individual went and debugged it and patched Firefox to work with newer Rust and we were finally able to land it, but it took months and we missed out on 1.37, 1.38, and almost 1.39. This happens because openSUSE's build service rebuilds everything dependent on Rust at build-time automatically, and if things fail to build, new Rust can't land in the distro.

This is not the first time it has happened, and I suspect it won't be the last.

So, I meant what I said.

steveklabnik on Jan 14, 2020 | | | [–]

Did you file those bugs upstream? This is the first time I’m hearing about this.

Conan_Kudo on Jan 17, 2020 | | | [–]

I have not, but the ones who fixed the issues specifically have been filing bugs, sending patches, etc.

Whether it'll go anywhere is another question...

Ohn0 on Jan 14, 2020 | | [–]

For such a thorough article, I wish there were mention of Python 4

luord on Jan 15, 2020 | | [–]

1. Introduce a new version with the plan of discontinuing the previous version 11 years later (that's almost half of the time that, by then, python had been a thing), that itself was released only three years after the very tool you're talking about was released.

2. Don't even pretend to be interested in trying to do a migration until seven years later.

3. Make sure that your migration plan includes a development cycle that's deliberately hostile to the migration process.

4. ?

5. How could the python maintainers do this to us.

The description of the migration process was a good read. The fud afterwards... wasn't.

And there were a few inaccuracies (I'm being charitable, some of them were straight up lies).

> Python 3.0 was released on December 3, 2008. And it took the better part of a decade for the community to embrace it.

False, I've been using python 3, python 3 exclusively, since 2014, for all my projects.

> Yes, Python is still healthy today and Python 3 is (finally) being adopted at scale

False, same as above.

> I am ecstatic the community is finally rallying around Python 3

Again, false. Not only did "the community" rallied around python 3 years ago, he isn't really happy about it, but I'll get to that later.

> For nearly 4 years, Python 3 took away the consistent syntax for denoting bytes/Unicode string literals.

Or, to put it another way, python 3 was compatible with python 2's string types almost eight years before python 2 reached end of life.

> An ecosystem that falters for that long is generally not healthy

This entire paragraph was a hypothetical. It seems he really wanted to criticize something that did not happen.

> The only language I've seen properly implement higher-order abstractions on top of operating system facilities is Rust

And here's where his true point becomes evident: this is a hype piece for a language he found that he likes better. He's just attacking something in his previous language that he thinks is valid just as an attempt to highlight why the new toy is truly better. In short: He felt like complaining about the migration would be a good way to proselytize.

Just in case: no, it isn't better, and I say this as someone who currently isn't using python nor rust. I'm using a language that I'm quickly growing to hate more than I do either of them at their worst (no, it's not JavaScript).

> if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3. As crazy as it initially sounded, I think I agree with that assessment.

So... The best he can say about rust is that it might be better than python 3 five years ago that, by his own opinion on everything he wrote before this, was terrible? Well, that's a recommendation not to use rust if I ever saw any.

When a hype piece defeats its own point.

> And speaking as a maintainer, I have mad respect for the people leading such a large community.

No, he doesn't; he used several appeals to emotion beforehand to try to paint them as terrible people.

> It should not have taken 11 years to get to where we are today.

This statement by itself is a truism that doesn't really mean anything, but the implication is that python 3 is only worthwhile 11 years later and it took that long for it to be so I'll reply to that.

No, it didn't. It didn't even take that long for mercurial, they started the migration four and half years ago, not eleven.

> am confident it will grow stronger by taking the time to do so

What is it to him? He should just move on to rust and be happy with it (sure, there are many people unhappy with it, but he wouldn't take the effort to proselytize if he wasn't).

In conclusion, I just don't understand the need to tear something else down to prop up a new thing. I'm sure I would have liked a post about things he could do with rust, but now...

ascotan on Jan 13, 2020 | | [–]

Python 3 is the new Windows Vista.

marcosdumay on Jan 13, 2020 | | [–]

You mean they just have to fix the issues behind the scenes, then rename the last version (like to "Python 4"), and it will become the greatest version ever?

yjftsjthsd-h on Jan 13, 2020 | | | [–]

Yes, actually. Now that we've gotten through the pain of the first years of Python 3, if we could have a clean start and call Python 3.8 Python4, it would probably be well-received.

sprash on Jan 13, 2020 | | [–]

The transition from Python 2 to 3 was one worst things that could happened to the whole community. The costs of the transition never justified the benefits. The new features were negligible at best or a regression at worst and in some cases performance got even worse. One could even assume flat out sabotage.

Let's just hope there will never be a Python 4 and the developers now finally start focusing on the greatest flaw of Python: performance.

Tanooki_Mario on Jan 14, 2020 | | [–]

What's crazy is if they had removed the GIL python 3 adoption would have been huge. All python 3 offered were marginal benefits for adopting functionality that broke huge code bases.

raverbashing on Jan 13, 2020 | [–]

Ok, yeah, maybe mercurial's case was special, but still, this seems like they made it harder on themselves needlessly.

> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code

ORLY?! Well, guess what: hard line size limits are stupid. Now you know why.

That's why "foolish consistency is the hobgoblin of little minds" is one of the 1st phrases of PEP-8.

But I'm tired of people saying "oooh let's cut all lines to be under 80-characters" like it's some kind of Biblical Mandate. No, it isn't. And the 80 chars limit is BS. Probably the part I hate the most about PEP-8 (and especially how people interpret the PEP-8)

> is its insistence that the world is Unicode

Oh please. Yes, the world is Unicode. Get over it. Maybe not bytes on disk/network. But apart from that? Yes. If libraries take bytes or unicodes I can agree it's a thorny issue, but let's move on because a happy day is a day where I don't get an UnicodeDecodeError because Python2, to add insult to the injury thinks the world is not only not Unicode, but it's all ASCII.

Windows made the right call a long time ago when it decided to make all strings Unicode. Ok, maybe UTF-8 would be better than 16, but it still does the job.

But I have to agree with them that any version < Py3.4 or 3.5 was really not worth it.

yjftsjthsd-h on Jan 14, 2020 | [–]

> the world is Unicode. Get over it. Maybe not bytes on disk/network. But apart from that? Yes

So, just ignoring the 2 things that a version control system exists to work with directly?

raverbashing on Jan 14, 2020 | | [–]

You're ignoring the parts where they have several hardcoded byte/unicode strings.

Otherwise just convert to and from when saving and sending it to network