_There are_ programs that are constantly running strlen(). C strings are the def...

WalterBright · on Aug 9, 2022

> that has an acceptable tradeoff for performance vs space and simplicity for where they are used

Is it? I've been programming strings for 45 years now. Including on 8 and 16 bit machines. All that space efficiency goes out the window when one wants a subset of a string that isn't a common tail.

The simplicity goes out the window as soon as you want a substring that isn't a common tail. Now you have memory allocation to deal with.

The performance goes out the window because now the entire string contents has to be loaded into the cache to determine its length.
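To make the "common tail" point concrete, here is a minimal sketch in C. The helper names `tail_from` and `substr_copy` are hypothetical, chosen only for illustration. A tail can share the original's terminator, while any other substring needs its own storage for its own terminator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A tail of a zero-terminated string is free: it is just a pointer into
   the original and shares the original's terminator. */
const char *tail_from(const char *s, size_t start)
{
    assert(start <= strlen(s)); /* note: the check itself walks the string */
    return s + start;
}

/* Any other substring needs its own terminator, hence its own storage.
   The caller owns the result and must free() it. Returns NULL on
   allocation failure. */
char *substr_copy(const char *s, size_t start, size_t len)
{
    assert(start + len <= strlen(s));
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s + start, len);
    out[len] = '\0';
    return out;
}
```

With `"src/main.c"`, `tail_from(path, 4)` yields `"main.c"` without copying, whereas getting just `"main"` goes through `substr_copy(path, 4, 4)` and a later `free()`.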

> length-prefixed

Are worse. Which is why I didn't mention them.

> Sane programs use store length

Meaning they become length-delineated programs, except it's done manually, tediously, and error-prone.

Whenever I review C code, the first thing I look at are the strlen/strncpy/str* sequences. It's almost always got a bug in it, an off-by-one error.

jstimpfle · on Aug 9, 2022

Again, I'm not saying you should represent substrings, or strings in general for that matter, as zero terminated strings, and I'm not saying use zero terminated strings for anything longer than a couple bytes.

No, I recommend everyone to use whatever fits the situation best. It might be a 2 byte start index and a 1 byte length field that expresses the length as a multiple of 12 bytes. It might be a rope data structure. Or it might be whatever. "String" is not a super well defined thing, and I don't understand why everybody is so super concerned about a canonical string data type. String data types are for scripting languages. 99% of my usage of string (literals) is just printf and opening files, and C does these just fine.

Zero terminated strings are only a default thing for string literals that does indeed bring a little bit of simplicity and convenience (no need for a builtin string type and the associated bike shedding, and only need to pass a single pointer to functions like printf).

> Meaning they become length-delineated programs, except it's done manually, tediously, and error-prone.

Not sure when is the last time I found it "manually, tediously, and error-prone". There are very rare cases where I have to construct zero-terminated strings from code, or need to strlen() something because of an API. And even when these cases occur they don't bother me at all. Stuff just works for me generally and I'm moving on. I have probably 500 stupid bugs unrelated to string handling before I once forget a zero terminator, and when that one time happens I just fix it and move on. On the plus side, given that we're in C where there are no slice types, zero-terminated strings spare me from passing extra length values for format strings or filepaths.

Sometimes I envision being able to use slices but I have some concerns if that would be an actual improvement. Importantly it should be about arrays and not just about strings. Strings are arrays, they aren't special.

I think a good design for slices could be one whose length can never be accessed by the programmer, but which can be used for automated bounds checks. Keeping size/capacity/offset and 43 cursors into whatever buffers separate is actually correct in my view from a modularization standpoint, because "String <-> Index/Size/Offset etc." isn't a 1:1 relationship.

> Whenever I review C code, the first thing I look at are the strlen/strncpy/str* sequences. It's almost always got a bug in it, an off-by-one error.

You will have to look quite a bit to find strlen() or strncpy() in my code. I'm not advocating for them, and not advocating to build serious string processing on top of zero-terminated strings.

WalterBright · on Aug 9, 2022

D doesn't have a builtin string type. A string in D is an array of characters. All arrays are length delineated.

> You will have to look quite a bit to find strlen() or strncpy() in my code. I'm not advocating for them, and not advocating to build serious string processing on top of zero-terminated strings.

Rolling your own string mechanism is simply not a strength of C. The downside of rolling your own is it is incompatible with everyone else's notion of how to avoid using 0 termination.

jstimpfle · on Aug 9, 2022

I haven't even suggested to roll your own "string" type. Not more than rolling any other type of array or slice. In my programs I normally do not define a "string" type. Not a central one at least. Zero-terminated strings work just fine for the quick printf() or fopen().

Instead, I might have many string-ish types. A type to hold strings in the UI (may include layout information!), a type of string slice that points into some binary buffer, a rope string type to use in my editor, a fixed-size string as part of some message payload, a string-builder string that tries to be fast without imposing a fixed length... Again, there is little point in an "optimized" generic string type for systems programming, because... generic and optimized is a contradiction.

WalterBright · on Aug 9, 2022

Any length delineated string you're using, and you did say you were using length delineation, suffers from the problem of not being compatible with any other C code. There's a good reason operating system API calls tend to use 0 terminated strings.

If you want to do a quick debug printf() on it, well, you could use %.*s, but it's awkward and ugly (I speak from lots of experience). Otherwise, you gotta append the zero.
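For readers who have not used it, this is roughly what the `%.*s` route looks like. The `str_slice` type and the `format_slice` helper are hypothetical names for this sketch. It uses `snprintf` so the result can be inspected, but the same format string works with `printf`.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical slice type: pointer plus length, no terminator required. */
struct str_slice {
    const char *ptr;
    size_t len;
};

/* Format a slice with "%.*s". The precision argument must be an int,
   hence the range check and the cast. Returns snprintf's result. */
int format_slice(char *buf, size_t bufsize, struct str_slice s)
{
    assert(s.len <= INT_MAX);
    return snprintf(buf, bufsize, "[%.*s]", (int)s.len, s.ptr);
}
```

The mandatory `(int)` cast and the need to pass two arguments per string in the right order are the sort of friction described above.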

I'm not a C newbie. I've been programming C for 40 years now. I've written 2 professional C compilers, the most recent one I finished this year. When I started D, a major priority was doing strings a better way, as C ranks among the most inconvenient string processing languages :-)

jstimpfle · on Aug 9, 2022

Sure, I know who you are but I hold opinions too :-)

I don't care about having to provide zero-terminated strings to OS and POSIX APIs, because somehow I almost always have the zero already. Maybe I'm a magician.

Sometimes I have not, but >99% of what I give to printf is actually "text", and that pretty much always has the zero anyway. It's a C convention, you might not like it, but I don't sweat it.

If I want to "print", or rather "write", something other than a zero-terminated string, which is normally "binary data", I use... fwrite() or something analogous.

> C ranks among the most inconvenient string processing languages

I've written my share of parsers and interpreters (including also a dysfunctional toy compiler with x64 assembler backend, but doesn't matter here), so I'm not entirely a stranger to this game either.

I find parsing strings in C is extremely _easy_, and I find it in fact easier than say in Python where going through a stream of characters one-by-one feels surprisingly unpythonic.

Writing a robust, human-friendly parser with good error reporting and some nice recovery attributes is on the harder side, but that has nothing to do with C strings. A string input for the average parser isn't even required, you just read char by char, frankly I don't understand what you're doing that is hard about it. It doesn't matter one bit if there's a zero at the end or not.

WalterBright · on Aug 9, 2022

The inconvenience and inefficiency is apparent when building functions to do things like break up a path & filename & extension into components and reassemble them. You wind up, for each function, dealing with 0 termination or length, separately allocated or not, tracking who owns the memory, etc. There's just no satisfying set of choices. Maybe you've found an elegant solution that never does a defensive copy, never leaks memory, etc., but I never have, and I've never seen anyone else manage it, either.
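As an illustration of the kind of function in question, here is one possible sketch that splits a path into directory, stem, and extension using pointer + length slices. Everything here is an assumption made for the example: the `str_slice` and `path_parts` types, the `split_path` name, `/` as the only separator, and the decision to ignore edge cases such as a root-only directory or trailing slashes.

```c
#include <assert.h>
#include <string.h>

struct str_slice { const char *ptr; size_t len; };

struct path_parts {
    struct str_slice dir;  /* text before the last '/', empty if none */
    struct str_slice stem; /* file name without extension */
    struct str_slice ext;  /* text after the last '.', empty if none */
};

/* Split without allocating: every slice borrows from `path`. */
struct path_parts split_path(const char *path, size_t len)
{
    assert(path != NULL);
    struct path_parts p = { {path, 0}, {path, len}, {path + len, 0} };

    /* Scan backwards for the last '/'. */
    size_t name_start = 0;
    for (size_t i = len; i > 0; i--) {
        if (path[i - 1] == '/') {
            name_start = i;
            p.dir.len = i - 1;
            break;
        }
    }

    /* Scan backwards for the last '.', skipping a leading dot so that
       a name like ".bashrc" has no extension. */
    size_t dot = len;
    for (size_t i = len; i > name_start + 1; i--) {
        if (path[i - 1] == '.') {
            dot = i - 1;
            break;
        }
    }

    p.stem.ptr = path + name_start;
    p.stem.len = dot - name_start;
    if (dot < len) {
        p.ext.ptr = path + dot + 1;
        p.ext.len = len - dot - 1;
    }
    return p;
}
```

The split itself needs no copy, but the ownership questions remain: the slices are only valid while `path` is, and reassembling the parts or handing one of them to an API that expects a terminator still means allocating or copying somewhere.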

jstimpfle · on Aug 9, 2022

I agree filepath related tasks are ugly. But there are a number of reasons for that that aren't related to zero termination. First, there is syntax & semantics of filepaths. Strings (whatever kind, just thinking about their monoidic structure) are a convenient user interface for specifying filepath constants, but they're annoying to construct from, and disassemble into, filepath components programmatically (relative to how easy I think it should be). Because of complicated syntax and especially semantics of components and paths, there are a lot of pitfalls. Filepath handling is most conveniently done in the shell, where also nobody has any illusion about it being fragile.

Second, you're talking about memory allocation, and this is arguably orthogonal to the string representations we're discussing here. Whether you make a copy or not for example totally depends on your specific situation. The same considerations arise for any array or slice type.

Third, again, you're free to make substrings using pointer + length or whatever, and this is in many cases the best solution. I could even agree that format strings should have better standardized support for explicit length, but it's really not a pain point for me. I'm only stating that zero-terminated is an acceptable default for string literals, and I want to stress this with another example: Last time you were looking at a binary using your editor or pager, how much better has your experience been thanks to NUL terminators? This argument can also extend to runtime debugging somewhat.

WalterBright · on Aug 10, 2022

> memory allocation, and this is arguably orthogonal to the string representations

A substringz cannot be produced from a stringz without doing an allocation.

> you're free to make substrings using pointer + length or whatever, and this is in many cases the best solution

Right, I can. And it's an ongoing nuisance in C to do so, because it doesn't have proper abstractions to build new types with. Even worse, if I switch my stringz to length delimited, and then pass it to fopen() which wants a stringz, I have to convert my length delimited string to stringz even though it is already a stringz. Because my length delimited API has no mechanism to say it also is 0 terminated.
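A sketch of the conversion step being described, under assumed names (`str_slice`, `slice_to_cstr`, `fopen_slice`) and an assumed 4096-byte limit on file names. Because the slice type carries no flag saying "a terminator already follows", the helper copies unconditionally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct str_slice { const char *ptr; size_t len; };

/* Copy the slice into buf and append a terminator.
   Returns buf, or NULL if the slice plus terminator does not fit. */
char *slice_to_cstr(char *buf, size_t bufsize, struct str_slice s)
{
    assert(buf != NULL);
    if (s.len >= bufsize)
        return NULL;
    memcpy(buf, s.ptr, s.len);
    buf[s.len] = '\0';
    return buf;
}

/* Open the file named by a slice. The copy happens even when the
   slice's bytes are already followed by a 0 in memory, because the
   slice type has no way to say so. */
FILE *fopen_slice(struct str_slice name, const char *mode)
{
    char tmp[4096]; /* assumed upper bound for this sketch */
    if (slice_to_cstr(tmp, sizeof tmp, name) == NULL)
        return NULL;
    return fopen(tmp, mode);
}
```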

You wind up with two string representations in your code, and then what? Have each string function come in a pair?

Believe me, I've done this stuff, I've thought about it a lot, and there is no happy solution. It annoys me enough that C is just not a tool I want to reach for anymore. I'm just tired of ugly, buggy C string code.

The good news is there is a fix, and I've proposed it, but it gets zero traction:

https://www.digitalmars.com/articles/C-biggest-mistake.html

jstimpfle · on Aug 10, 2022

> You wind up with two string representations in your code, and then what? Have each string function come in a pair?

As said, I don't think this is the end of the world, and I'm likely to add a number of other string representations. While it happens rarely, I don't worry about formatting a string for an API into a temporary buffer before calling it. Because most "string" things are small and dispensable. Zero-terminated strings are the cheap plastic solution that just works for submitting string-literals to printf, and that just works to view directly in a binary. And they're compatible with length delineated in the sense that you can supply a (cheap plastic) zero-terminated string to a (more serious) length delineated API. Also the other way, many length delineated APIs are designed to work with both - supply -1 as length, and you can happily put a string literal as argument, don't even have to macro your way with sizeof then to supply the right length.
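A minimal sketch of the "-1 means zero-terminated" convention mentioned here. `count_byte` is a hypothetical function invented for the example, not an API from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count occurrences of byte c in s. `len` is the byte count, or -1 to
   mean "s is zero-terminated, measure it for me". */
size_t count_byte(const char *s, ptrdiff_t len, char c)
{
    assert(s != NULL && len >= -1);
    /* The sentinel check and the strlen() fallback live in the callee. */
    size_t n = (len < 0) ? strlen(s) : (size_t)len;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] == c)
            count++;
    }
    return count;
}
```

A literal can then be passed as `count_byte("a/b/c", -1, '/')`, while a slice of a larger buffer passes its real length. The cost is that every function following the convention carries the length test and the `strlen()` branch.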

> The good news is there is a fix, and I've proposed it, but it gets zero traction

I'm aware of this and I like it ("fat pointers") but I wouldn't like it if the APIs would miss the explicit length argument because there's a size field glued to the slice.

WalterBright · on Aug 10, 2022

> many length delineated APIs are designed to work with both - supply -1 as length, and you can happily put a string literal as argument, don't even have to macro your way with sizeof then to supply the right length.

I'm sorry, I just have to say "no thanks" to that. I don't really want each string function to test the length and run strlen if it isn't there.

By now, the D community has 20 years experience with length as part of the string type. Nobody wants to go back to the C way. It's probably the most unambiguously successful and undisputed feature of D. C code that gets converted to D gets scrubbed of the stringz code, and the result is cleaner and faster.

D still interfaces with C and C strings. The conversion is done as the last step before calling the C function. (There's a clever way to add a 0 that only rarely requires an allocation.) Any C strings returned get immediately converted with the slice idiom:

    string s = p[0 .. strlen(p)];

> I wouldn't like it if the APIs would miss the explicit length argument because there's a size field glued to the slice.

I bet you would like it! (Another problem with a separate length field is there's no obvious connection between it and the string - which is another source of bugs.)

WalterBright · on Aug 10, 2022

> Last time you were looking at a binary using your editor or pager, how much better has your experience been thanks to NUL terminators?

Not perceptibly better. And yeah, I do look at binary dumps now and then, after all, I wrote the code that generates ELF, OMF, MachO, and MSCOFF object file formats, and librarians for them :-)

jstimpfle · on Aug 10, 2022

I wrote simple ELF and PE/COFF writers too, but independently of that, zero terminators are what lets you find strings in a binary. And what allows the "strings" program to function. It simply couldn't work without those terminators.

Similarly, the text we're exchanging consists of words and sentences that are terminated using not zero bytes, but other terminators. I'm very happy that they're not length delineated.

WalterBright · on Aug 10, 2022

> It simply couldn't work without those terminators.

Yeah, it will. For a related example, I use `grep` all the time to find strings in source code. Source code is not 0 terminated. It works fine.
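For context, here is a sketch of one way a `strings`-style scan can be written. This is an illustration under stated assumptions, not the implementation of the actual `strings` utility: the function name `count_printable_runs` is hypothetical, and it treats any non-printable byte (or the end of the buffer) as the end of a run.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count runs of at least min_len consecutive printable bytes.
   A run ends at any non-printable byte or at the end of the buffer. */
size_t count_printable_runs(const unsigned char *buf, size_t len,
                            size_t min_len)
{
    assert(buf != NULL && min_len > 0);
    size_t runs = 0;
    size_t run = 0; /* length of the run currently being scanned */
    for (size_t i = 0; i < len; i++) {
        if (isprint(buf[i])) {
            run++;
        } else {
            if (run >= min_len)
                runs++;
            run = 0;
        }
    }
    if (run >= min_len) /* a run may extend to the end of the buffer */
        runs++;
    return runs;
}
```

In a scan like this a 0 byte is one delimiter among many. Where the two positions above differ is in how precisely the reported runs line up with the strings the program actually stored.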

jstimpfle · on Aug 10, 2022

I use "grep -w foo" (or something like "grep '\<foo\>'"), because when I look for "foo" I don't want "bazfoobar". grep -w only works because the end of words is signaled in-band (surrounding / terminating words with whitespace).

kaba0 · on Aug 10, 2022

Zero-terminated strings were a bad decision even back then, let alone now. They make vectorization very painful, and you just needlessly have to iterate over strings at every use-site.

jstimpfle · on Aug 10, 2022

Except nobody cares about vectorization of your printf("Hello, World\n") or other 12-character strings. Vectorization here would in fact be a waste of build time as well as object output size, and the runtime performance would be not measurably different, possibly even slower in some cases. It's a total waste.

When you're processing actual buffers full of text or binary data, and performance matters, of course you are not advised to use an in-band signaled sentinel like a zero terminator. Use an explicit length for those cases.
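A small sketch of the explicit-length style for buffer processing. `count_lines` is a hypothetical helper for this example. Because the bound is known up front, the scan can be delegated to `memchr`, and embedded 0 bytes are just data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count '\n' bytes in a buffer of known length. The buffer may contain
   0 bytes; they are treated like any other byte. */
size_t count_lines(const char *buf, size_t len)
{
    assert(buf != NULL);
    size_t lines = 0;
    const char *p = buf;
    const char *end = buf + len;
    while (p < end) {
        /* memchr knows how far it may read, so the library is free to
           scan in whatever chunk size it likes. */
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        if (nl == NULL)
            break;
        lines++;
        p = nl + 1;
    }
    return lines;
}
```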