TL;DR : Characters and Strings considered harmful.
And he's right, they totally are! (Also, 'string' can mean an ordered sequence of similar objects of any kind, not just characters.)
But (as these discussions also mention) replacing them with much more clearly defined concepts like byte arrays, codepoints, glyphs, grapheme clusters and text fields is only the first step...
The big question these days is what to do with text, specifically the 'code' kind of text (programming or markup; poor separation between 'plain' text and code keeps causing security issues).
To start with, even code needs formatting, at minimum some way to signal a new line, or it will end up unreadable.
Then, code can't be just arbitrary Unicode text; some limits have to apply, because Unicode can get verrrry 'fancy'!
(Arbitrary Unicode is fine in text fields and in comments embedded in code.)
So, I'm curious: is there any Unicode normalization specifically designed for code? (If not, why not, and which existing one is the closest?)
I'm thinking of Python 3, which has what seems to be a somewhat arbitrary list of what can and can't be used in a variable name. (And the language itself seemingly only uses ASCII, though this shouldn't be a restriction for programming/markup languages!)
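For what it's worth, Python 3's rules are less arbitrary than they look: per PEP 3131, identifiers follow Unicode's identifier recommendations and are normalized with NFKC, so visually distinct spellings can collapse into one name. A small sketch of the observable behaviour:

```python
import unicodedata

# 'ℌ' (U+210C, BLACK-LETTER CAPITAL H) is a letter, so it is a valid identifier...
print("ℌ".isidentifier())                   # True
# ...but NFKC (which the parser applies to identifiers) folds it to plain 'H':
print(unicodedata.normalize("NFKC", "ℌ"))   # 'H'
# Superscript digits like '²' are excluded from identifiers entirely:
print("²".isidentifier())                   # False
```

So `ℌ` and `H` name the same variable in Python source, while `x²` is simply a syntax error.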
Also, I hear that Julia goes much further than that (even offering (La)TeX-like shortcuts for characters that might not be available on some keyboards); what kind of 'normalization' have they adopted?
Yes, Julia really lets one get wild with Unicode. There are certain classes of Unicode characters that we have marked as invalid for identifiers, some that are used for infix operators, and some that count as modifiers on previously typed characters, which is useful for creating new infix operators; e.g. one might define
julia> +²(x, y) = x^2 + y^2
+² (generic function with 1 method)
such that
julia> -2 +² 3
13
If someone doesn't know how to type this, they can just hit the `?` key to open help mode in the REPL and then paste it:
help?> +²
"+²" can be typed by +\^2<tab>
search: +²
No documentation found.
+² is a Function.
# 1 method for generic function "+²":
[1] +²(x, y) in Main at REPL[65]:1
Note how it says
"+²" can be typed by +\^2<tab>
Generally speaking, we don't have a ton of strict rules on Unicode, but it's a community convention that if you have a public-facing API that uses Unicode, you should provide an alternative Unicode-free API. This works pretty well for us, and I think it can be quite useful for some mathematical code if you don't overdo it (the above example was not an example of 'responsible' use).
I know we have a code formatter, but it doesn't do any Unicode normalization. We generally just accept Unicode as a first-class citizen in code. This tends to cause some programmers to 'clutch their pearls' and act horrified, but in practice it works well. Maybe that's just because we have a cohesive community, though.
Nice! Python allows defining operators too, but AFAIK you can't use Unicode in them? And ² (or any other sub/superscript digit; at least some letters are fine) isn't allowed in identifiers either.
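To be precise about what Python allows: you can't mint new operator symbols at all, only overload the fixed built-in set via dunder methods. A sketch of the closest analogue to the `+²` example above (the `Py2` wrapper type here is made up for illustration):

```python
# Overloading the existing '+' on a hypothetical wrapper type; Python cannot
# define a brand-new operator like '+²', only reuse its fixed operator set.
class Py2:
    def __init__(self, v):
        self.v = v

    def __add__(self, other):
        # Same semantics as the Julia '+²' example: sum of squares.
        return self.v ** 2 + other.v ** 2

print(Py2(-2) + Py2(3))  # (-2)^2 + 3^2 = 13
```

So the behaviour is reproducible, but only by hijacking plain `+` on a dedicated type, not by naming the operator itself.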
The point is to get closer to math notation, though; if anything, x +² y is IMHO even farther away from it than (x + y)*2!
Is there any way to get (x + y)² or √(x + y) to work?
––––
The new AZERTY has a lot of improvements: ∞, ±, ≠, √, the whole Greek alphabet, (), [], and {} next to each other... but for some reason they've removed the ² that the old AZERTY had?
The sqrt one works out of the box actually, no new definitions required:
julia> √(1 + 3)
2.0
The second one does not work because we specifically ban identifiers from starting with superscript or subscript numerals. If it were allowed, we could work some black magic with juxtaposition to make it work.
Basically, we have a system called 'juxtaposition' where 2x is parsed as 2*x (but not x2). It generalizes in funky ways that one can abuse if they really want (kinda discouraged, though).
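For contrast, Python has no juxtaposition rule at all: multiplication must always be explicit. A quick check that `2x` simply fails to parse:

```python
import ast

def parses(src: str) -> bool:
    """Return True if src is syntactically valid Python."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

print(parses("2*x"))  # True  — explicit multiplication
print(parses("2x"))   # False — no implicit juxtaposition in Python
```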