Filesystem paths are not strings. Linux doesn't enforce an encoding. Windows, at least historically, didn't enforce proper pairing of UTF-16 surrogates (see the WTF-8 encoding).
I think OS X does perform UTF-8 normalization, which might include sanity checking and rejecting malformed UTF-8, but I'm not sure.
A byte array (or a ref-counted singly-linked list of immutable byte arrays to save space/copying) is a much better representation for a file system path. That doesn't have great interaction with GUIs, but there are other corner cases that are often problematic for GUIs. In high school, one of my friends had a habit of putting games on the school library computers, and renaming them to names with non-printable characters using alt+number pad. (He used 129, IIRC, which isn't assigned a character in CP-1252.) The Windows 95 graphical shell would convert the non-printable characters to spaces for display, but when the librarian tried to delete the games, it would pass the display name to the kernel, which would complain that the presented path didn't exist.
It is not clear to me if you're elaborating or think you're disagreeing, but that is what Go does. It is generally assumed in Go that strings are UTF-8, but in practice they are just bags of bytes. Nothing really "UTF-y" will happen to them until you directly call UTF functions on them, which may produce new strings.
It's something that I don't think could work unless your language is as recent as Go, and perhaps even Go 1.0 was pushing it, but it is an increasingly viable answer. For as thin as Go's encoding support really is in some sense, it has almost never caused me any trouble. The contexts where you are actively unsafe in assuming UTF-8 are decreasing, and the ones that are going to survive are the ones where there's some sort of explicit label, like in email. (Not that those are always trustworthy either.)
I'm saying it's useful to have valid strings and paths as separate types, but Go conflates the two types. Conflating the two is likely to lead to confused usage (such as programmers assuming there's a bijective mapping between valid paths and valid sequences of Unicode codepoints.)
Pervasive confused usage of this sort in the wild in Python 2 was the motivation behind splitting bytes and strings in Python 3.
As you pointed out, path types are awfully specialized to the OS and really even the file system itself. It is not clear that "Go" could provide such a thing. It doesn't need to, really; you can relatively easily create a type for the specific case you have.
type PathSegment struct {
	path string // not exported, so only the empty one can be created externally
}

func MakePath(in string) (PathSegment, error) {
	// Validate the input here; for a single Unix path segment that
	// means rejecting the empty string and anything containing a
	// separator or a NUL byte.
	if in == "" || strings.ContainsAny(in, "/\x00") {
		return PathSegment{}, errors.New("invalid path segment")
	}
	return PathSegment{path: in}, nil
}
You'll need some more supporting types, of course, but it doesn't have to be provided by "Go" itself. (I have something rather like this in my codebase, though it is specialized to just Unix paths since I have no need to care about all the cross-platform details in this code base.)
I wouldn't expect this to be something the language itself provides, and I'm not even that worried about it being missing from the standard library because it's awfully detail-oriented even for that.
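For illustration, here is a self-contained sketch of how such a type might be used, repeating a minimal `MakePath` so the example runs on its own; the validation rules (non-empty, no '/', no NUL) are my assumptions for a Unix path segment, not anything the standard library provides.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PathSegment holds one validated Unix path component. The field is
// unexported, so callers outside the package must go through MakePath.
type PathSegment struct {
	path string
}

func MakePath(in string) (PathSegment, error) {
	if in == "" || strings.ContainsAny(in, "/\x00") {
		return PathSegment{}, errors.New("invalid path segment")
	}
	return PathSegment{path: in}, nil
}

func main() {
	seg, err := MakePath("notes.txt")
	fmt.Println(seg.path, err) // notes.txt <nil>

	_, err = MakePath("evil/../name") // contains '/', rejected
	fmt.Println(err)                  // invalid path segment
}
```

The payoff is that any function taking a `PathSegment` can skip re-validation: the type's constructor is the only way to obtain a non-empty value.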
A string is a byte array for all intents and purposes. In Go specifically, it’s an immutable byte slice with some built-in operator overloading, some of which is sugar for dealing with UTF-8, but there’s nothing that suggests a string must be encoded any particular way.
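A short sketch of that distinction: indexing a Go string yields raw bytes, while `range` is the built-in sugar that decodes UTF-8 into runes.

```go
package main

import "fmt"

func main() {
	s := "héllo" // 6 bytes: 'é' occupies two bytes in UTF-8

	// Indexing sees raw bytes; no decoding happens.
	fmt.Println(len(s), s[1]) // 6 195 (first byte of 'é', 0xC3)

	// Ranging decodes UTF-8: indices are byte offsets, values are runes.
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println() // 0:h 1:é 3:l 4:l 5:o  (index 2 is skipped)
}
```

Neither operation validates anything; `range` simply substitutes U+FFFD for byte sequences it can't decode, which is as close as the language itself gets to caring about encoding.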
I'm saying that it's useful not to conflate the types for sequences of Unicode codepoints and filesystem paths. Using the same type for both is likely to result in code with baked-in assumptions that for any path, there is a standard encoding that will yield a sequence of Unicode codepoints.
Pervasive code with this sort of type confusion in the wild in Python 2 is why Python 3 separated bytes and strings.
> A string is a byte array for all intents and purposes.
This smacks of reductionism. A string as an abstract type only needs to conform to certain axioms and support certain operations. (Thus, for example, a text editor, where a string can be mutable, could choose a representation of this type that is different from a simple byte array.)
Based on the context of the thread, the definition of "string" used in this thread must also include the properties possessed by Go strings in order for the original criticism to be coherent. It seems more likely (and charitable) that the criticism is incorrect rather than incoherent.
In whatever case, Go strings have all of the relevant properties for modeling file paths.
"String" has multiple meanings in this context. In the context of that manpage, it means "nul-terminated array of char" which is the C language meaning. In the context of what you're replying to, a "string" is a sequence of bytes (octets) in a specific Unicode Transformation Format. Those are very different things when it comes to programmatic manipulation of those things.
From on-screen to in-memory representation, we go from glyphs, to grapheme clusters, to Unicode 'characters', to codepoints, to encoded bytes. None of these steps are bijections (ligatures, multi-character graphemes, invalid characters, encoding errors).
I'd argue a 'proper' string type should operate at the grapheme cluster and/or character level and take care of things like normalization (e.g. for string comparisons) and validation.