Filesystem paths are not strings. Linux doesn't enforce an encoding. Windows, at least historically, didn't enforce proper pairing of UTF-16 surrogates (see the WTF-8 encoding).
I think OS X does perform UTF-8 normalization, which might include sanity checking and rejecting malformed UTF-8, but I'm not sure.
A byte array (or a ref-counted singly-linked list of immutable byte arrays to save space/copying) is a much better representation for a file system path. That doesn't have great interaction with GUIs, but there are other corner cases that are often problematic for GUIs. In high school, one of my friends had a habit of putting games on the school library computers, and renaming them to names with non-printable characters using alt+number pad. (He used 129, IIRC, which isn't assigned a character in CP-1252.) The Windows 95 graphical shell would convert the non-printable characters to spaces for display, but when the librarian tried to delete the games, it would pass the display name to the kernel, which would complain that the presented path didn't exist.
It is not clear to me if you're elaborating or think you're disagreeing, but that is what Go does. It is generally assumed in Go that strings are UTF-8, but in practice they are just bags of bytes. Nothing really "UTF-y" will happen to them until you directly call UTF functions on them, which may produce new strings.
It's something that I don't think could work unless your language is as recent as Go, and perhaps even Go 1.0 was pushing it, but it is an increasingly viable answer. For as thin as Go's encoding support really is in some sense, it has almost never caused me any trouble. The contexts where you are actively unsafe in assuming UTF-8 are decreasing, and the ones that are going to survive are the ones where there's some sort of explicit label, like in email. (Not that those are always trustworthy either.)
I'm saying it's useful to have valid strings and paths as separate types, but Go conflates the two types. Conflating the two is likely to lead to confused usage (such as programmers assuming there's a bijective mapping between valid paths and valid sequences of Unicode codepoints.)
Pervasive confused usage of this sort in the wild in Python 2 was the motivation behind splitting bytes and strings in Python 3.
As you pointed out, path types are awfully specialized to the OS and really even the file system itself. It is not clear that "Go" could provide such a thing. It doesn't need to, really; you can relatively easily create a type for the specific case you have.
type PathSegment struct {
	path string // not exported, so only the empty one can be created externally
}

func MakePath(in string) (PathSegment, error) {
	// Validate the input here; for a single Unix path segment that
	// means rejecting the empty string and anything containing a
	// separator or a NUL byte.
	if in == "" || strings.ContainsAny(in, "/\x00") {
		return PathSegment{}, errors.New("invalid path segment")
	}
	return PathSegment{path: in}, nil
}
You'll need some more supporting types, of course, but it doesn't have to be provided by "Go" itself. (I have something rather like this in my codebase, though it is specialized to just Unix paths since I have no need to care about all the cross-platform details in this code base.)
I wouldn't expect this to be something the language itself provides, and I'm not even that worried about it being missing from the standard library because it's awfully detail-oriented even for that.
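For illustration, here is a self-contained sketch of how such a type might be used, repeating a minimal `MakePath` so the example runs on its own; the validation rules (non-empty, no '/', no NUL) are my assumptions for a Unix path segment, not anything the standard library provides.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PathSegment holds one validated Unix path component. The field is
// unexported, so callers outside the package must go through MakePath.
type PathSegment struct {
	path string
}

func MakePath(in string) (PathSegment, error) {
	if in == "" || strings.ContainsAny(in, "/\x00") {
		return PathSegment{}, errors.New("invalid path segment")
	}
	return PathSegment{path: in}, nil
}

func main() {
	seg, err := MakePath("notes.txt")
	fmt.Println(seg.path, err) // notes.txt <nil>

	_, err = MakePath("evil/../name") // contains '/', rejected
	fmt.Println(err)                  // invalid path segment
}
```

The payoff is that any function taking a `PathSegment` can skip re-validation: the type's constructor is the only way to obtain a non-empty value.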
A string is a byte array for all intents and purposes. In Go specifically, it’s an immutable byte slice with some built-in operator overloading, some of which is sugar for dealing with UTF-8, but there’s nothing that suggests a string must be encoded any particular way.
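A short sketch of that distinction: indexing a Go string yields raw bytes, while `range` is the built-in sugar that decodes UTF-8 into runes.

```go
package main

import "fmt"

func main() {
	s := "héllo" // 6 bytes: 'é' occupies two bytes in UTF-8

	// Indexing sees raw bytes; no decoding happens.
	fmt.Println(len(s), s[1]) // 6 195 (first byte of 'é', 0xC3)

	// Ranging decodes UTF-8: indices are byte offsets, values are runes.
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println() // 0:h 1:é 3:l 4:l 5:o  (index 2 is skipped)
}
```

Neither operation validates anything; `range` simply substitutes U+FFFD for byte sequences it can't decode, which is as close as the language itself gets to caring about encoding.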
I'm saying that it's useful not to conflate the types for sequences of Unicode codepoints and filesystem paths. Using the same type for both is likely to result in code with baked-in assumptions that for any path, there is a standard encoding that will yield a sequence of Unicode codepoints.
Pervasive code with this sort of type confusion in the wild in Python 2 is why Python 3 separated bytes and strings.
> A string is a byte array for all intents and purposes.
This smacks of reductionism. A string as an abstract type only needs to conform to certain axioms and support certain operations. (Thus, for example, a text editor, where a string can be mutable, could choose a representation of this type that is different from a simple byte array.)
Based on the context of the thread, the definition of "string" used in this thread must also include the properties possessed by Go strings in order for the original criticism to be coherent. It seems more likely (and charitable) that the criticism is incorrect rather than incoherent.
In whatever case, Go strings have all of the relevant properties for modeling file paths.
"String" has multiple meanings in this context. In the context of that manpage, it means "nul-terminated array of char" which is the C language meaning. In the context of what you're replying to, a "string" is a sequence of bytes (octets) in a specific Unicode Transformation Format. Those are very different things when it comes to programmatic manipulation of those things.
From on-screen to in-memory representation, we go from glyphs, to grapheme clusters, to Unicode 'characters', to codepoints, to encoded bytes. None of these steps are bijections (ligatures, multi-character graphemes, invalid characters, encoding errors).
I'd argue a 'proper' string type should operate at the grapheme cluster and/or character level and take care of things like normalization (e.g. for string comparisons) and validation.