> Every string is internally encoded in utf-8, all the string operations are Unicode-safe
This seems very slightly disingenuous, if my memory is correct. I don't remember all the details, but a while back I ran into a Unicode-support issue with a Tcl application. Digging into it, I recall that the Tcl interpreter actually represents every character as a predefined number of bytes, set by a preprocessor definition. The default was 2, and it cut off any Unicode character that needed more than 2 bytes to encode, unless you were willing to recompile the interpreter to use 4 bytes, at the cost of doubling memory consumption for every string.
Very nitpicky, but I think it's important to point out it's not "quite" utf-8, because Tcl needs each codepoint to be O(1) indexable in an array, something normal utf-8 can't do.
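To illustrate the indexing point, here's a rough sketch (not Tcl's actual implementation): finding the n-th codepoint in raw UTF-8 means walking lead bytes from the start of the string, whereas a fixed-width representation is a single offset calculation.

```python
# Illustrative sketch only (not Tcl's code): why raw UTF-8 can't give O(1)
# codepoint indexing, while a fixed-width representation can.

def utf8_codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th codepoint of well-formed UTF-8 `data` by scanning: O(n)."""
    seen = 0
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes this codepoint occupies.
        if lead < 0x80:
            width = 1
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        if seen == index:
            return data[i:i + width].decode("utf-8")
        seen += 1
        i += width
    raise IndexError(index)

def utf32_codepoint_at(data: bytes, index: int) -> str:
    """Fixed-width (UTF-32) indexing is just an offset multiplication: O(1)."""
    return data[4 * index:4 * index + 4].decode("utf-32-le")

s = "héllo\U00010348"  # last codepoint is outside the BMP
print(utf8_codepoint_at(s.encode("utf-8"), 5))       # found after a linear scan
print(utf32_codepoint_at(s.encode("utf-32-le"), 5))  # found with direct arithmetic
```

That linear scan is what a fixed-bytes-per-character representation avoids, at the memory cost described above.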
> Digging into it, I recall that the Tcl interpreter actually represents every character as a predefined number of bytes, set by a preprocessor definition.
Ah, that’s exactly what old Python did. Wonder if it was inspired by the Tcl solution.
Fwiw, because they didn't want to give up O(1) indexing, recent CPython uses an encoding chosen per string based on its contents (the possibilities are ISO-8859-1, UCS-2, or UCS-4). That does mean adding a single astral codepoint to an ASCII string quadruples its size.
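You can see that effect with sys.getsizeof (a rough check; the exact byte counts depend on the CPython version and build):

```python
import sys

ascii_s = "a" * 1000
wide_s = ascii_s + "\U0001F600"  # one astral codepoint appended

print(sys.getsizeof(ascii_s))  # ~1000 bytes of payload plus header: 1 byte per codepoint
print(sys.getsizeof(wide_s))   # ~4004 bytes of payload plus header: every codepoint now stored as UCS-4
```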
I assume it's because the original Unicode only supported 2-byte characters, and even after astral characters became a thing it took a while for them to be used for non-CJK things.
> That's pretty much what every older language did.
The fixed size, but I’m referring to the compile-time width switch.
I may be wrong, but I was under the impression most older languages simply remained on their original character width (or left the issue ill-defined and/or added an ancillary type, as C did).