We don't need a string type (2013) (mortoray.com)
28 points by grep_it on Feb 11, 2021 | 68 comments


Curious. I have come to exactly the opposite conclusion — that we should drop the idea of a fixed-length character type, and instead _only_ have (Unicode) string types. Actually, I'd prefer something like `std::text` to finally be free of the baggage of "string". Operations on text should work on logical text concepts. For example, something like `someText.firstCharacter()` would have a return type of `text`, with logical length 1. Its _data_ length is variable, since a Unicode character is variable length. So many Unicode-containing string design problems arise because of the stubborn insistence on having an integral character type.

I should be able to extract UTF-8, UTF-16 or whatever encoding I want from a `text` value. Something like `c_str()` would be pretty important, but the semantics would be a design problem, not an encoding problem. Any Unicode-encoding string should be able to encode U+0000, so you'd need to figure out how to handle that from `c_str()` (perhaps a substitution ASCII character could be specified to encode embedded nulls).

Basically, users should definitely _not_ need to understand the deeper details of Unicode. They shouldn't need to understand and worry about different entities such as code units, code points, graphemes, and the like, though they should be able to extract such encodings on demand.


This!

Raku introduced the concept of NFG - Normal Form Grapheme - as a way to represent any Unicode string in its logical ‘visual character’ grapheme form. Sequences of combining characters that don’t have a canonical single codepoint form are given a synthetic codepoint so that string methods including regexes can operate on grapheme characters without ever causing splitting side effects.

Of course there are methods for manipulating at the codepoint level as well.


Essentially, different tools for different applications.

"A string is a vector of characters, which happen to each be one byte in length" was more of an artifact of a time where there happened to be representational overlap than some deep truism about proper data structure. Strings intended to be displayed to humans are specialized constructs, much as a "button" or a "file handle" are. A buffer of unstructured bytes is a separate specialized construct, suitable for tasks unrelated to "displaying text to a human."


I fully endorse the general idea here, but this:

> `someText.firstCharacter()` would have a return type of `text`, with logical length 1

is a huge mistake. There are operations that make sense on characters that do not make sense on texts whose length happens to be 1. The most obvious of these is inquiring about the numerical value of the unicode code point of a character. Conflating characters and texts-of-length-1 is a mistake of the same order as conflating strings and byte vectors. Python makes this mistake even in version 3. As a result, a function like this:

def f(s, n, m): return ord(s[n:m])

will return a value iff m is one more than n. Not good.
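For contrast, here is a rough sketch in Go (just an illustration, not the parent's proposal), where a character-like value (a rune) is a distinct type from a text, so asking for its numeric code point value is a direct operation rather than a length-1 special case:

    package main

    import "fmt"

    func main() {
        s := "héllo"

        // []rune(s)[1] is a rune (an int32 code point), not a text of length 1,
        // so its numeric value is just its value; no ord()-style conversion needed.
        r := []rune(s)[1]
        fmt.Println(r)        // 233, i.e. U+00E9
        fmt.Printf("%c\n", r) // é
    }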


Only if you ignore the rest of what the post said. First it should make things easy for ‘normal’ tasks, then it should make everything else possible.

> Basically, users should definitely _not_ need to understand the deeper details of Unicode. They shouldn't need to understand and worry about different entities such as code units, code points, graphemes, and the like, though they should be able to extract such encodings on demand.


Except that the "users" of a string type are programmers, and a "normal task" for a programmer often requires things like this. I'll give you an example from a project I am currently working on: a spam filter. One of the things my filter does is count the number of Chinese characters in a string. I implement this as n1<=ord(c)<=n2 where n1 and n2 are integers representing the start and end of the range of Unicode Chinese characters. This seems like a "normal task" to me and I don't see how conflating characters and texts-of-length-1 would make this any easier.
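A rough sketch of that kind of check in Go, using the standard unicode package's Han range table rather than the parent's hard-coded ord() bounds (one of several reasonable ways to express it):

    package main

    import (
        "fmt"
        "unicode"
    )

    // countHan counts the runes in s that belong to the Unicode Han script
    // (roughly "Chinese characters"). A code-point-level test like this is
    // hard to express if the only primitive is a text of logical length 1.
    func countHan(s string) int {
        n := 0
        for _, r := range s {
            if unicode.Is(unicode.Han, r) {
                n++
            }
        }
        return n
    }

    func main() {
        fmt.Println(countHan("Hello, 世界!")) // 2
    }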


Functors are everywhere. That's why we need monads!


Gnats and sledgehammers something something...


My future perfect programming language will have explicit native types for strings: ansi, utf8, utf16, and unicode.

The nth element of any of these works as expected for that type. Convert to bytes as needed.

  ansi "abcd"[1] -> byte
  utf8 "abcd"[1] -> char
  utf16 "abcd"[1] -> char
  utf8 "abcd".toBytes()[1] -> byte
  unicode "abcd"[1] -> word
Were it still the 90s, I'd probably care about locales, so somehow imbue ansi arrays with that metadata.


> Actually, I'd prefer something like `std::text` to finally be free of the baggage of "string". Operations on text should work on logical text concepts. For example, something like `someText.firstCharacter()` would have a return type of `text`, with logical length 1. Its _data_ length is variable, since a Unicode character is variable length.

I don't see how you came to the "opposite conclusion" when the author basically says the same thing?


Go's immutable UTF-8 string type is one of the nice things about the language

A Go string is almost exactly like this C struct:

  struct String {
      uint8_t* addr;
      ptrdiff_t len;
  };
The language guarantees you can't modify the bytes in memory range [addr, addr+len)

Go's garbage collection makes it simple and natural to have one string alias ("point into", "overlap") part of another string. This works because strings are immutable. Compare this to the nightmare in C++, where substrings require copying or explicit handling

The rune iterator (which decodes UTF-8 on the fly) and other facilities make Unicode handling natural in Go
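For example, a minimal sketch: indexing a string yields raw bytes, while ranging over it decodes UTF-8 into runes (code points):

    package main

    import "fmt"

    func main() {
        s := "héllo"

        // Indexing yields bytes: s[1] is the first byte of the two-byte
        // UTF-8 encoding of 'é'.
        fmt.Println(s[1]) // 195

        // Ranging decodes UTF-8: i is the byte offset, r is the code point.
        for i, r := range s {
            fmt.Printf("%d: %c\n", i, r)
        }
    }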

In summary, Go's string type is a huge win


I'd argue Go's string type is "somewhat unusable"* since it doesn't enforce the guarantees it says/implies it does. The byte slice it points to is not guaranteed to be valid UTF-8.

* of course to a degree, let's be reasonable, it's usable in a _lot_ of contexts, but I like my types to actually mean something.


I think of a string as an immutable byte slice. This is a little confusing since the language only supports UTF-8 literals and also lets you iterate over individual runes with for loops, but those are just conveniences over the fact that these are really just immutable byte slices. You could probably make your own “UTF8” type with the invariants you want (or at least someone would have to drop down into unsafe to violate them), but Go programs don't typically go that far, presumably because it doesn't add much value in practice. That would suggest your “somewhat unusable” claim (even with its caveat) is too strong. That said, I think it would be nice if Go made it a little easier/clearer to model a type that can only be created by a particular constructor or some such.
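For what it's worth, a minimal sketch of such a wrapper type (the package and names here are made up for illustration, not from any library):

    package utf8str

    import (
        "errors"
        "unicode/utf8"
    )

    // UTF8 holds a string that has been checked for valid UTF-8 encoding.
    // The field is unexported, so other packages can only obtain a non-zero
    // value through New (short of resorting to unsafe).
    type UTF8 struct {
        s string
    }

    var ErrInvalid = errors.New("utf8str: not valid UTF-8")

    func New(s string) (UTF8, error) {
        if !utf8.ValidString(s) {
            return UTF8{}, ErrInvalid
        }
        return UTF8{s: s}, nil
    }

    func (u UTF8) String() string { return u.s }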


Go doesn't guarantee any encoding for strings, very deliberately (so that, eg, they can be used to represent file names).


Filesystem paths are not strings. Linux doesn't enforce an encoding. Windows, at least, didn't use to enforce proper use of UTF-16 surrogate pairs (see the WTF-8 encoding).

I think OS X does perform UTF-8 normalization, which might include sanity checking and rejecting malformed UTF-8, but I'm not sure.

A byte array (or a ref-counted singly-linked list of immutable byte arrays to save space/copying) is a much better representation for a file system path. That doesn't have great interaction with GUIs, but there are other corner cases that are often problematic for GUIs. In high school, one of my friends had a habit of putting games on the school library computers, and renaming them to names with non-printable characters using alt+number pad. (He used 129, IIRC, which isn't assigned a character in CP-1252.) The Windows 95 graphical shell would convert the non-printable characters to spaces for display, but when the librarian tried to delete the games, it would pass the display name to the kernel, which would complain that the presented path didn't exist.


It is not clear to me if you're elaborating or think you're disagreeing, but that is what Go does. It is generally assumed in Go that strings are UTF-8, but in practice what they actually are are just bags of bytes. Nothing really "UTF-y" will happen to them until you directly call UTF functions on them, which may produce new strings.

It's something that I don't think could work unless your language is as recent as Go, and perhaps even Go 1.0 was pushing it, but it is an increasingly viable answer. For as thin as Go's encoding support really is in some sense, it has almost never caused me any trouble. The contexts where you are actively unsafe in assuming UTF-8 are decreasing, and the ones that are going to survive are the ones where there's some sort of explicit label, like in email. (Not that those are always trustworthy either.)


I'm saying it's useful to have valid strings and paths as separate types, but Go conflates the two types. Conflating the two is likely to lead to confused usage (such as programmers assuming there's a bijective mapping between valid paths and valid sequences of Unicode codepoints.)

Pervasive confused usage of this sort in the wild in Python 2 was the motivation behind splitting bytes and strings in Python 3.


As you pointed out, path types are awfully specialized to the OS and really even the file system itself. It is not clear that "Go" could provide such a thing. It doesn't need to, really, you can relatively easily create a type for the specific case you have.

    type PathSegment struct {
        path string // not exported, so only the empty one can be created externally
    }

    func MakePath(in string) (PathSegment, error) {
        // Validate the input here; what counts as "valid" is up to your application.
        // As one example (using the standard strings and fmt packages), reject
        // empty segments and embedded separators or NUL bytes:
        if in == "" || strings.ContainsAny(in, "/\x00") {
            return PathSegment{}, fmt.Errorf("invalid path segment %q", in)
        }
        return PathSegment{path: in}, nil
    }
You'll need some more supporting types, of course, but it doesn't have to be provided by "Go" itself. (I have something rather like this in my codebase, though it is specialized to just Unix paths since I have no need to care about all the cross-platform details in this code base.)

I wouldn't expect this to be something the language itself provides, and I'm not even that worried about it being missing from the standard library because it's awfully detail-oriented even for that.


A string is a byte array for all intents and purposes. In Go specifically, it’s an immutable byte slice with some built-in operator overloading, some of which is sugar for dealing with utf-8, but there’s nothing that suggests a string must be encoded any particular way.


I'm saying that it's useful to not conflate the types for sequences of Unicode codepoints and filesystem paths. Using the same type for both is likely to result in code with baked-in assumptions that for any path, there is a standard encoding that will yield a sequence of Unicode codepoints.

Pervasive code with this sort of type confusion in the wild in Python 2 is why Python 3 separated bytes and strings.


Maybe, but a decade of experience with Go suggests that this isn’t a significant problem (i.e., more than a handful of instances).


> A string is a byte array for all intents and purposes.

This smacks of reductionism. String as an abstract type only needs to conform to a number of certain axioms and support certain operations. (Thus, for example, a text editor, where a string can be mutable, could choose a representation of this type that is different from a simple byte array.)


Based on the context of the thread, the definition of "string" in use here must also include the properties possessed by Go strings for the original criticism to be coherent. It seems more likely (and more charitable) that the criticism is incorrect rather than incoherent.

In whatever case, Go strings have all of the relevant properties for modeling file paths.



"String" has multiple meanings in this context. In the context of that manpage, it means "nul-terminated array of char" which is the C language meaning. In the context of what you're replying to, a "string" is a sequence of bytes (octets) in a specific Unicode Transformation Format. Those are very different things when it comes to programmatic manipulation of those things.


What can you do with a "string" that you can't do with a C string?


From on-screen to in-memory representation, we go from glyphs to grapheme clusters, to unicode 'characters', to codepoints, to encoded bytes. None of these steps are bijections (ligatures, multi-character graphemes, invalid characters, encoding errors).

I'd argue a 'proper' string type should operate at the grapheme cluster and/or character level and take care of things like normalization (eg for string comparisons) and validation.
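As a sketch of the normalization part, one way to do normalization-aware comparison in Go is the golang.org/x/text/unicode/norm package (NFC here; which form is appropriate depends on the application):

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        a := "\u00e9"  // "é" as one precomposed code point
        b := "e\u0301" // "e" followed by a combining acute accent

        fmt.Println(a == b) // false: the byte sequences differ
        fmt.Println(norm.NFC.String(a) == norm.NFC.String(b)) // true: same text after NFC
    }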


Go's standard library works with both possibly-malformed and verified UTF-8 strings, which is a nice property.

The type system needed to explain what they actually do (take one of two possible input types and return the corresponding output type) would require generics, which we don't have yet.

An alternative would be to duplicate the code to account for the different types, but we already have that for []byte versus string and that's bad enough already.


In Go, malformed UTF-8 encodings are expected

They are handled in a well-defined and graceful manner by all aspects of the language, runtime, and library
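A small sketch of what "graceful" means in practice: decoding never fails, and invalid bytes simply come out as the replacement rune:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "ok\xffok" // contains a byte that is not valid UTF-8

        fmt.Println(utf8.ValidString(s)) // false

        // Ranging still works: the bad byte decodes to utf8.RuneError (U+FFFD).
        for i, r := range s {
            if r == utf8.RuneError {
                fmt.Printf("invalid byte at offset %d\n", i)
            }
        }
    }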


Java's substrings used to work sort of like this, but this was changed to copy semantics. The structure was to have a "char[]" and an "offset" into that array. This allowed substrings to share the underlying array. However, if you had a 1-char substring of a 1 GB array, the underlying array was never trimmed for garbage collection.

In the case of a 1-char substring of a 1 GB string, is Go smart enough to free the rest of the array and keep only the 1 char?


I wonder how hard the JVM folks looked into specialized weak references for solving this issue. The mark phase would treat all Strings with zero offset and full length as strongly referencing the byte[], and weakly otherwise. At the end of each full GC, you could iterate over all of your Strings (custom allocate/compact them to their own ranges of the heap for faster scanning), use some heuristics and probabilistic sampling to select some of the weakly reachable byte[]s for size reduction. A specialized copy/compact pass over the Strings could in-place replace byte[] references and fix up offsets.

You'd probably also want to modify String.equals() to internally mutate equal strings to point to the same byte[], preferring smaller offsets, and when offsets are equal, preferring lower addresses. This is a light weight lazy version of the background String byte[] interning done by some JVMs.


In Go, as in C, there's no magic. A programmer using Go thinks of a string variable as a pointer/length pair, and knows what will happen. Just like with slices

If you keep a pointer into an allocation (in your example, a small Go string pointing into a much larger Go string) the allocation is preserved by the garbage collector

You should explicitly copy the substring out (instead of aliasing the underlying string) if retaining the underlying string causes you a problem
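A tiny sketch of that explicit copy (strings.Clone is Go 1.18+; string([]byte(sub)) does the same thing in older versions):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        big := strings.Repeat("x", 1<<20) // stand-in for a large string

        alias := big[:1]               // shares big's backing memory, keeping it alive
        copied := strings.Clone(alias) // fresh 1-byte allocation; big can now be collected

        fmt.Println(len(alias), len(copied)) // 1 1
    }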


Okay I understand. In Java, the default String class has the option to "intern" a string which just maintains a list of strings that can be shared.

The change was made because often the devs were unaware of the string manipulation taking place in a third party library (eg XML/JSON/HTML parsing). You'd see the memory balloon, investigate and notice that String/char[] instances were dominating your heap. Instead of changing the entire implementation of the standard String class, they changed the semantics of the "substring()" call from O(1) to O(n) + memory side-effects.


Oh well. They could've (should've?) used the layout of _bstr_t instead.


I think the problem is that, a lot of the time when we deal with strings, we are thinking about ASCII strings instead of other encodings like UTF-8. If we treat them as ASCII strings, an array of characters makes sense, but it is not that simple for other encodings.

One of the languages that has considered the issue is Rust. In Rust, we don't really index into strings, but use iterators or other methods to do the operations required. https://doc.rust-lang.org/std/string/struct.String.html


I really don’t think many programmers nowadays actually think this.


I would hazard that very few people think about what an underlying String is at all.

String encoding is something I encountered as a problem in college, but is up there with implementing a homemade red-black tree in terms of “things that are asked in interviews but have little to no bearing on my day-to-day.”


Really, they don't run into string/character issues regularly? Because I do...


I certainly run into them rarely, and if I do have an issue it is usually solved by bunging it into some purpose built standard or third party library and calling it a day.

I’m sure people have jobs that deal with this, but the low-level form of the problem is not something that I could see one encountering in a meaningful way for building a standard CRUD app or service.


I completely agree with you, but everyone who doesn't have to deal with Unicode/strings on a regular basis should consider themselves lucky.

Once you add RTL text (with the matching bidi algorithm) or a grapheme-based writing system such as Devanagari, which doesn't really have characters at all, it becomes such a mess so fast.


The date should be (2013) not (2018), as that dates it before Rust 1.0 (which does have a UTF-8 string type) and before the Julia 1.0 release date (which implements UTF-8 strings as arrays with irregularly spaced indexes, eg, the valid indexes may be 1, 2, 4, 5, if the character at 2 takes up two bytes). Both would be interesting examples to compare against if this article was written today.


I've fixed the date now. Actually the date at the top of the article "2013-08-13" is in a font that somehow makes it look like 2018. I had to squint a couple times to make sure I was reading it right! The year in the URL is easier to read.


I think the author started from an assertion ("This primary difference between a C++ ‘string’ and ‘vector’ is really just a historical oddity that many programs don’t even need anymore") that highlights an error in the C++ model of strings, not in the way we must think about strings.

Contrast NSString in Cocoa (https://developer.apple.com/documentation/foundation/nsstrin...). The Cocoa string is extremely opaque; it's basically an object. And under the hood, that opacity allows for piles of optimization that are unsafe if the developer is allowed to treat the thing as just a vector of bytes or codepoints. Under the hood, Cocoa does all kinds of fanciness to the memory representation of the string (automatically building and cutting cords, "interning" short strings so that multiple copies of the string are just pointers to the same memory, caching of some transforms under the assumption that if it's needed once, it's often needed again).

Taken this way, one can even start to talk about things like "Why does 'indexing' into a string always return a character, instead of, say, a word?" and other questions that are harder to get into if one assumes a string is just 'vector of characters' or 'vector of bytes.'


Today I learned that Python does interning of short strings too:

https://news.ycombinator.com/item?id=26097732


The article is an argument against types, in general.

The point that characters can be stored in other containers is meaningless: the question is whether, conceptually, a specific sequence of character values distinct from another sequence has compile-time meaning. It does. Therefore, it needs a type.

Such a sequence has numerous special characteristics. In particular, element at [i] often has an essential connection to element at [i+1] such that swapping them could turn a valid string to an invalid one. In fact, that an invalid sequence is even possible is another such characteristic.


I actually read it as an argument FOR types, and against modern languages' choice to make the String class a weak proxy for typeless byte arrays. See all the arguments (in these HN comments, no less!) for just using UTF-8 byte arrays as strings.

He's saying that semantically there's no difference between arrays and string classes, except that with string classes we let you do all kinds of dangerous byte manipulation that we would never dream of with any other type. Moreover, most of the uses for this dangerous access aren't real usages, because if you're manipulating strings you're almost certainly actually manipulating code points. So why wouldn't you just use a code point array and give yourself real type safety instead?
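In Go terms, for instance, the "code point array" is just []rune, which you can index and slice with no risk of splitting an encoded character (a rough illustration):

    package main

    import "fmt"

    func main() {
        cps := []rune("héllo")   // decode the string into code points
        cps[1] = 'e'             // replace a code point, not a byte
        fmt.Println(string(cps)) // "hello"
    }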


I did not get that at all. Anyway a code point array would not serve the purpose: most possible sequences of valid code points are not valid strings.

A variable-size array of code points is also useful, just as, in C++, a std::vector<char> is useful, but that doesn't make it a string.

That C++ std::string is wrong for what we now think of as strings is a whole other argument. People once hoped that std::basic_string<wchar_t> or std::basic_string<char32_t> might be the useful string, but they were disappointed. C++ does not have a useful string type at this time, but there is ongoing work on one. It should appear in C++26.


> most possible sequences of valid code points are not valid strings.

Could you clarify? In what way are they not valid strings?


Some code points are characters. Others are operators with constrained contexts in which they operate. Sufficiently long random sequences of characters and these context-specific operators are likely to apply the operators in invalid contexts. Invalid characters mean invalid strings.

For instance, there are code points that are effectively operators that add continental European accents (umlaut, accent grave, etc.) to Latin characters. (Also, there are redundant code points for accented characters.) There's a whole set of code points that are combinators for primitive components of Han characters, etc. (Also, there are redundant code points for pre-composed Han characters.) One way of writing Korean syllables strictly requires triplets of individual jamo components: initial consonant jamo, vowel jamo, and final consonant jamo. (Also, there are redundant code points for every valid triple-jamo syllable in Korean.)

A Han character with an ancient Greek digamma in its "radical" position, a poo emoji inside a box, a thousand umlauts, all three French accents, a Hangul jamo vowel sticking through its center, a Hebrew vowel point, and a Thai tone mark is not a valid character. Any string containing invalid characters is not a valid string.


Let me respond to you again in a different way, this time referencing some Unicode definitions I like (https://stackoverflow.com/a/27331885).

I don't think we can have a meaningful conversation in terms of characters, so I'm going to ignore that and reference your last paragraph. You seem to be arguing that string as a type has use when viewed as a collection of methods that allow access to Code Points given an underlying storage of Code Units. The article is arguing that unless you're writing a Unicode encoder/decoder, you probably don't care about manipulating Code Units (except that modern languages have given you these byte arrays whose length you reference for memory purposes). What you really usually care about is searching, replacing, concatenating, and cutting collections of Code Points. But languages have only given you this hodgepodge grouping of Code Unit arrays and specialty methods for Code Point access, so that's what you're used to dealing with, and of course you want some kind of abstraction, like a string type, so you don't end up with the scenario you describe where you screw up a Code Unit sequence trying to manipulate a Code Point.

So the final point is that unless you're working with unicode encoding/decoding, you really only care about Code Points. And once you create a String class that only exposes Code Points, you have got something equivalent to a simple array.


You can mess up any ordered sequence in this way.


That is an argument for making and using ordered-sequence types, not an argument against a string type.

Generic ordered-sequence containers have not appeared in standard libraries (except where the container itself depends on the ordering), for various practical and historical reasons, but it is very useful to wrap, say, a vector instantiated on a particular type, often with some metadata stuck on.


String should be an interface/protocol. When I log a message, I want to pass a string. If I have to append large strings for a log message, I don't want to run out of memory; I should be able to pass a rope/cord [1]. We've known how to abstract this forever and should work to optimize our compilers/runtimes accordingly. I'm not aware of a language which has got this right. Java, for example, has the ugly CharSequence interface that nobody uses. StringProtocol in Swift (can I implement it?) makes you pay a character tax rather than just letting you pass a string. Rust/C++ give various non-abstracted types.

[1] https://en.wikipedia.org/wiki/Rope_(data_structure)
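A hedged sketch of what such a protocol could look like in Go (the Text interface here is hypothetical; the closest real pieces are fmt.Stringer and io.WriterTo):

    package text

    import "io"

    // Text is a hypothetical read-only string protocol: anything that can
    // stream its contents to a sink and report its length. A flat string,
    // a rope, or a lazy builder could all implement it, so a logger that
    // accepts Text never forces callers to materialize one big string.
    type Text interface {
        io.WriterTo
        Len() int
    }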


Erlang/Elixir's iolists, which are heavily utilized in Phoenix's templating engine, are essentially a rope and are extremely efficient (for a dynamic language). Phoenix's templating is very fast.


Can't agree more. Java in particular suffers greatly from Object.toString having a weak contract and from the lack of a global String interface. If String were an interface instead of an implementation, then any method signature could accept multiple implementations. This allows for really effective type aliases which even support strong typing, so if you have a signature with multiple String values you can use the strong types to ensure you don't transpose arguments.


I think that the problem with text is that the basic operation you want to do is inserts. The way memory works in a computer makes that an inherently inefficient operation. I'm a bit fascinated by how bad computers are at text, given that that is what we use so much of them for.

As a C programmer I think that it's not really possible to implement an efficient general-purpose text processing library, because there is no good universal way to store text. So much depends on the pattern of the processing functions. If you want to avoid allocating new memory and moving a lot of text for each operation, the implementation needs to make speculative choices about how text can best be stored. How you store text depends so much on your access pattern. Do you need to be able to get to a line fast? Or know how long the text is? Or insert something? And if so, how much?

A C-style string would, for instance, be terrible for something like a text editor, because every key press would cause a complete copy of the document to be allocated and copied over. So maybe a linked list? But you don't want just one character in each link, because that thrashes the cache, right? But then it's still slow to skip forward quickly, so maybe an array of pointers to snippets? Or maybe a linked list of pointers to snippets? So many possibilities that all impact performance differently depending on what you do with it.

When I see higher-level languages with nice, easy-to-use string functionality, I always consider the impossible choices that had to be made under the hood.


I think you want a "gap buffer".


A gap buffer is an example of a data structure for text that is optimized for one usage pattern, and performs badly with other patterns. Generalized text structures are hard.
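For reference, a minimal gap-buffer sketch in Go (insert-at-cursor only; cursor movement and deletion are omitted), which illustrates exactly the "optimized for one usage pattern" trade-off described above:

    package gapbuf

    // Buffer stores text in one slice with a gap of free space at the cursor,
    // so repeated inserts at the cursor are cheap; the cost shows up when the
    // cursor moves far away (not implemented here).
    type Buffer struct {
        data     []byte
        gapStart int // cursor position
        gapEnd   int // index of the first byte after the gap
    }

    func New(capacity int) *Buffer {
        return &Buffer{data: make([]byte, capacity), gapEnd: capacity}
    }

    // Insert writes b at the cursor, growing the buffer when the gap is full.
    func (g *Buffer) Insert(b byte) {
        if g.gapStart == g.gapEnd { // gap exhausted: grow and re-open it
            grown := make([]byte, len(g.data)*2+1)
            copy(grown, g.data[:g.gapStart])
            tail := len(g.data) - g.gapEnd
            copy(grown[len(grown)-tail:], g.data[g.gapEnd:])
            g.gapEnd = len(grown) - tail
            g.data = grown
        }
        g.data[g.gapStart] = b
        g.gapStart++
    }

    // String flattens the buffer (skipping the gap) for display.
    func (g *Buffer) String() string {
        return string(g.data[:g.gapStart]) + string(g.data[g.gapEnd:])
    }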


Surprising to a Tcl programmer! 8-)) Because

"Everything is a String":

https://wiki.tcl-lang.org/page/everything+is+a+string

and

"Everything is a Symbol":

https://wiki.tcl-lang.org/page/Everything+is+a+Symbol


Looks like what Tcl means by 'string', the author calls 'text'?

What does Tcl mean by 'character'?

See, for instance, the author's HTML example:

> Combining characters can create an accented version of that symbol, <̧. In text this is clearly a different symbol: it’s a distinct grapheme cluster. The HTML parser doesn’t care about that. It sees code #60 followed by #807 (combining cedilla). It thus sees the opening of an element. However, since it isn’t followed by a valid naming character most parsers just ignore this element (I’m not positive that is correct to do). This is not the case with an accented quote, like "̧. Here the parsers (at least the browsers I tested), let the quote end an attribute and then have a garbage character lying around.

https://mortoray.com/2014/03/17/strings-and-text-are-not-the...

EDIT: Ok, it looks like by 'character', Tcl means what the author (and Unicode?) calls a 'grapheme cluster'?

https://wiki.tcl-lang.org/page/Characters%2C+glyphs%2C+code%...

https://mortoray.com/2016/04/28/what-is-the-length-of-a-stri...




I can't speak for C++, but for C, the repeated issue is that a null-terminated string has lots of utility routines that are handy for manipulating it. Without 3rd-party libraries, plain length-header buffers don't. Hence things like Antirez's sds library, which, by nature, is a compromise. I get that you can't fundamentally change C now, but a buffer type with a rich manipulation library would have been nice.


Anyone else think that we missed an opportunity to make text much simpler to deal with by not increasing the size of a byte from 8 to 32 bits when we moved from 32-bit to 64-bit word length CPUs?

I mean, isn't 7-bit ASCII text the reason why the byte length was standardized to the next power of two bits?

(With e-mail still supporting non-padded 7-bit ASCII until recently for performance reasons.)


TL;DR: Characters and Strings considered harmful.

And he's right, they totally are! (Also, 'string' can mean an ordered sequence of similar objects of any kind, not just characters.)

But (as these discussions also mention) replacing them by much more clearly defined concepts like byte arrays, codepoints, glyphs, grapheme clusters and text fields is only the first step...

The big question (these days) is what to do with text, specifically the 'code' kind of text (either programming or markup, and poor separation between 'plain' text and code keeps causing security issues).

To start with, even code needs formatting, specifically some way to signal a new line, or it will end up unreadable.

Then, code can't be just arbitrary Unicode text, some limits have to apply, because Unicode can get verrrry 'fancy'! (Arbitrary Unicode is fine in text fields and comments embedded in code.)

So, I'm curious, is there any Unicode normalization specifically designed for code? (If not, why, and which is the closest one?)

I'm thinking of Python (3), which has what seems to be a somewhat arbitrary list of what can and can't be used in a variable name? (And the language itself seemingly only uses ASCII, though this shouldn't be a restriction for programming/markup languages!)

Also, I hear that Julia goes much further than that (with even (La)TeX-like shortcuts for characters that might not be available on some keyboards); what kind of 'normalization' have they adopted?


Yes, Julia really lets one get wild with Unicode. There are certain classes of Unicode characters that we have marked as invalid for identifiers, some of which are used for infix operators, and some of which count as modifiers on previously typed characters, which is useful for creating new infix operators, e.g. one might define

    julia> +²(x, y) = x^2 + y^2
    +² (generic function with 1 method)
such that

    julia> -2 +² 3
    13
If someone doesn't know how to type this, they can just hit the `?` button to open help mode in the repl and then paste it:

    help?> +²
    "+²" can be typed by +\^2<tab>

    search: +²

      No documentation found.

      +² is a Function.

      # 1 method for generic function "+²":
      [1] +²(x, y) in Main at REPL[65]:1
Note how it says

    "+²" can be typed by +\^2<tab> 
Generally speaking we don't have a ton of strict rules on unicode, but it's a community convention that if you have a public facing API that uses unicode, you should provide an alternative unicode-free API. This works pretty well for us, and I think can be quite useful for some mathematical code if you don't overdo it (the above example was not an example of 'responsible' use).

I know we have a code formatter, but it doesn't do any unicode normalization. We generally just accept unicode as a first class citizen in code. This tends to cause some programmers to 'clutch their pearls' and act horrified, but in practice it works well. Maybe just because we have a cohesive community though


Nice! Python allows you to define operators too, but AFAIK you can't use Unicode in those? And ² (or any other sub/superscript number - at least some letters are fine) is not allowed in identifiers either.

The point is to get closer to math notation though; if anything, x +² y is IMHO even farther away than (x + y)*2!

Any way to have (x + y)² or √(x + y) work?

––––

The new AZERTY has a lot of improvements: ∞, ±, ≠, √, the whole Greek alphabet, () and [] and {} next to each other... but for some reason they've removed the ² that the old AZERTY had?

http://norme-azerty.fr/


> if anything x +² y is IMHO even farther away than (x + y) * 2 !

Yeah, it was just a random example that came to mind, not to be taken seriously. Here's perhaps one example of unicode being used in a way that's pleasing to some and upsetting to others: https://www.reddit.com/r/programminghorror/comments/jqdi4i/y...

> Any way to have (x + y)² or √(x + y) to work ?

The sqrt one works out of the box actually, no new definitions required:

    julia> √(1 + 3)
    2.0
The second one does not work because we specifically ban identifiers from starting with superscript or subscript numbers. If it was allowed, we could work some black magic with juxtaposition to make it work.

Here's an example with the transpose of an array:

    julia> struct ᵀ end

    julia> Base.:(*)(x, ::Type{ᵀ}) = transpose(x)

    julia> [1, 2, 3, 4]ᵀ
    1×4 transpose(::Vector{Int64}) with eltype Int64:
     1  2  3  4

Basically, we have a system called 'juxtaposition' where 2x is parsed as 2*x (but not x2). It generalizes in funky ways one can abuse if they really want (kinda discouraged though)




