> In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere. Because of that, the author of the file copy utility would not need to care about Unicode
It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid. (Windows filenames don’t have to be proper UTF-16 either)
Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid. (Windows filenames don’t have to be proper UTF-16 either)
A decent fraction of software can impose rules on the portion of the filesystem within its control. A tool like mv or vim has to be prepared to handle any filepath encoding. But something like a VCS could reasonably insist on supporting only file trees with normalized UTF-8 names and no case-insensitivity conflicts, since that's about all that works reliably cross-platform.
The history of Git and Subversion handling filenames makes me think that the opposite is true: A VCS which doesn't handle arbitrary byte-strings will have weird edge cases which prevent users from adding files or accessing them, possibly even “losing” data in a local checkout. This is especially tedious because it'll appear to work for a while until someone first tries to commit an unusual file or checks it out with a previously-unused client.
My understanding is, you can't treat the filename as an arbitrary bytestring, since you have to transcode it across platforms, otherwise the filename won't show up properly everywhere. E.g. if I make a file named "test" on Unix, it will be UTF-8 (assuming a sane Unix). If on Windows I create a file with the filename "test", encoded as UTF-8, it will show up as worthless garbage in explorer.exe, since explorer will decode it as UTF-16.
So a VCS needs to know the filename encoding in order to work properly.
The actual text isn't an arbitrary byte string. There is logical data and then there is its representation. char, short, int, string can all logically refer to the number 0, but the representation is completely different. With char it is even possible to represent the same number in two ways: as a binary 0 or as the character code for 0. Allowing byte strings as the physical representation is not a bad idea to stay future-proof, but you then have to provide additional information by storing the character encoding that was used to create the arbitrary byte string. If you fail to do that, this information has to be provided through convention, and that's how we get "stuck" with UTF-8, and although I like UTF-8, this doesn't feel like the right solution. If everyone agrees to use UTF-8 then we should stop pretending that something is just an arbitrary byte string and formalize UTF-8.
The idea of an arbitrary byte string is fooling people into believing something that is not true. Developers falsely think their software can handle any character encoding. However, once you decide to support only a single character encoding you will notice that if something better comes along you need a way to differentiate the old and new codec. Then you decide to add a field that declares the character encoding type and suddenly it's obvious that your arbitrary byte string is a bad way of dealing with the problem. That byte string has meaning. Don't throw that meaning away.
Sure, as long as you don't have to be compatible with anything else, you can assume whatever encoding you want. That doesn't change the point that general programs can't make that assumption.
Yet, your shell will treat them like UTF-8 just as well. As will the standard library of almost every programming language, as you noticed.
If you open one such file in most text editors, they will render whatever is in it as UTF-8. If you use text manipulating utilities, they will work with it as if it was encoded in UTF-8.
It's mostly the Linux kernel that disagrees. Everything else considers them UTF-8.
At least for source-based Linux distributions (Gentoo, Exherbo) I remember that you have to define the locales you want to use and which ones should be the default. And when I build a system without UTF-8 locales, I doubt that the shell will treat paths as UTF-8.
The shell, like most programs, doesn't need to bother with the encoding of filenames. Mostly. I can use LANG=C and TAB still autocompletes filenames, even Cyrillic ones, because bash doesn't care about encodings: the terminal uses UTF-8, so it can output UTF-8 without any help from bash. It's nevertheless sometimes a pain to work with, because readline fails to count visible characters (it counts bytes instead). You type characters into the command line, fill it to the end, then the cursor jumps to the left side of the terminal and continues, placing characters over other characters. It's as if \r were used instead of \n.
`LANG=C ls` tries to be smarter and uses escape syntax for everything except printable ASCII characters. But other utilities from coreutils work even with a locale that doesn't match the file name encoding: cp, mv, grep, ...
The point is: it doesn't matter what encoding strings use until you try to render the string on a screen.
Which is a silly position since the kernel is the only thing that matters. You're right that not too many people will complain if your program crashes on non-UTF-8 paths. Same with spaces in group names. 100% valid and accepted. Breaks a ridiculous amount of software if you actually do it.
But that doesn't mean it's right. It just means that we have a calcified convention.
> narrow strings are considered UTF-8 by default almost everywhere
It means that this is mostly true.
I dunno what it should be. There are benefits and costs to both allowing and restricting the names, and there are good reasons for the kernel alone to support them even though all of userland doesn't. But it does mean that, in practice, you just use UTF-8 and you're done.
Exactly. And they still refuse to acknowledge that treating public names, like a file path, as binary only is a well-known security issue. Names are identifiers and must be recognizable.
With UTF-8 it is trivial to create similar-looking names and fool the user into thinking it is a valid name. You know this concept from domain names, which use Punycode as an escape mechanism. But both the kernel and the various libcs are too lazy to treat confusables with escapes, to normalize Unicode, or to use the proper Unicode security mechanisms for identifiers, like handling mixed scripts, right-to-left text and such.
E.g. searching a file path needs to follow Unicode rules, as we are dealing with identifiers. I believe my libc, the safeclib, is the only one even offering such functionality.
Likewise the presentation layer on the UI (shell, windows) doesn't present confusables as such, but happily takes i18n seriously. Convenience first, security last.
Apple's previous HFS+ normalized names, the new one is insecure again.
> Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
Imagine if languages allowed subtypes of strings which are not directly assignment compatible.
HtmlString
SqlString
String
A String could be converted to HtmlString not by assignment, but through a function call, which escapes characters that the browser would recognize as markup.
Similarly a String would be converted to a SqlString via a function.
It would be difficult to accidentally mix up strings because they would be assignment incompatible without the functions that translate them.
There could be mixed "languages" within a string. Like a JSP or PHP that might contain scripting snippets, and also JavaScript and CSS snippets, each with different syntax rules and escaping conventions.
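A minimal sketch of the idea in Python (HtmlString and escape_html are made-up names for illustration; a static checker like mypy is what enforces the separation, since NewType is erased at runtime):

    import html
    from typing import NewType

    # A plain str is not an HtmlString until it has been escaped; a type
    # checker flags any attempt to pass one where the other is expected.
    HtmlString = NewType("HtmlString", str)

    def escape_html(s: str) -> HtmlString:
        # The only sanctioned conversion from str to HtmlString.
        return HtmlString(html.escape(s))

    def render(fragment: HtmlString) -> str:
        return "<p>" + fragment + "</p>"

    print(render(escape_html("<script>alert(1)</script>")))
    # render("<script>alert(1)</script>")   # rejected by mypy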
It's absolutely useful enough, it's just that it's awful in C++ due to language limitations as opposed to other languages such as Haskell, where it is standard.
How would it be awful in C++? It seems trivial to do: basic_string is already templated, and distinct instantiations are not mutually compatible by default. In fact wstring, u8string, u16string, u32string exist today in the language simply as distinct instantiations of basic_string. You can create your own by picking a new char type. Algorithms can be, and are, generic and work on any string type.
Not quite at that level, but Rust does have OsString (managed the same way as the OS, often but not always UTF-8) and CString (basically just a byte buffer, just like C likes), with special rules around the inclusion of nulls and null terminators. That gives you the benefit of the behaviour you mentioned: not allowing an invalid string type for a function call.
The sqlx crate for Rust also has a macro called query!, which (at compile time) validates the SQL and creates a value of type "record". Similar idea there, since you'll get early errors from the compiler if you write SQL with mistakes in it.
Go is like that. Not the "mixed within" part, though html/template's AST understands the context where you're using a value and escapes it differently. For example, https://golang.org/pkg/html/template/#HTML
Yes. But having the compiler enforce it is your first line of defense. If it doesn't compile, you know there is an actual problem. In modern IDEs, you see these compile errors as quickly as you type them.
This pattern (newtyping) is a huge weakness of Java in general, and even more so older Java, and people who like newtyping are not going to like Java.
Because creating newtypes in Java is
1. verbose, defining a trivial wrapper takes half a dozen lines before you've even done anything
2. slow, because you're paying for the overhead of an extra allocation and pointer indirection every time, unless you jump through unreadable hoops making for even more verbose newtypes[0]
It is a much more convenient (and thus frequent) pattern in languages like Haskell. Or Rust.
I used Pascal through the '80s and part of the '90s. Currently I use Java. I almost tried Delphi, but my shop moved on to something else between Pascal and Java.
Now the string types have an encoding, and the strings themselves do, too. When you assign a string to a string variable with a type of a different encoding, the string is automatically converted.
But it is causing a huge mess, especially with existing code. When you have one library using UTF-8 and another library using the default codepage, that no longer works. And you can manually override the encoding for each string, so any string might have any encoding regardless of its type.
I have a benchmark of various maps in freepascal. The benchmark creates strings of random bytes to use as keys.
A classic key-value store is the sorted TStringList.
Now the benchmark of the TStringList fails, apparently because it now assumes the keys are valid UTF-8 when the UTF-8 codepage is the default codepage.
The default codepage can be changed. When I start the benchmark with LANG=C .. it works with the random byte keys. On Windows, the default codepage is usually latin1, so it would work there, too.
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8
Yes but, most programs expect to be able to print filepaths at least under some circumstances, like printing error messages. Even if a program is fully correct and doesn't assume an encoding in normal operation, it still has to assume one for printing. Filepaths that aren't utf-8 lead to a bunch of ����� in your output (at best). So I think it's fair to say that Unix paths are assumed to be utf-8 by almost all programs, even if being invalid utf-8 doesn't actually cause a correct program to crash.
In the Rust std one can easily use the lossless presentation with file APIs, and print a lossy version in error messages. I find this to be good enough.
I dunno. That sounds like proposing to render "foo.txt" as "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or something. I think you probably meant something like "print the regular characters if the string is UTF-8, or a lossless fallback representation of the bytes otherwise." That's a good idea, and I think a lot of programs do that, but at the same time "if the string is UTF-8" is problematic. There's no reliable way for us to know what strings are or are not intended to be decoded as UTF-8, because non-UTF-8 encodings can coincidentally produce valid UTF-8 bytes. For example, the two characters "&!" are the same bytes in UTF-8 as the character "Ω" is in UTF-16. This works in Python:
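Something like this, for example (taking the "Ω" above to be U+2126 OHM SIGN, whose single UTF-16-LE code unit is literally the bytes 0x26 0x21):

    # The same two bytes are simultaneously valid UTF-8 text ("&!") and a
    # single valid UTF-16-LE character, so the bytes alone can't tell you
    # which encoding was intended.
    data = "&!".encode("utf-8")
    print(data)                      # b'&!'
    print(data.decode("utf-16-le"))  # Ω  (U+2126 OHM SIGN)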
So I think I want to claim something a bit stronger:
1) Users demand, quite rightly, to be able to read paths as text.
2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata.
3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
Maybe in an alternate reality, the system locale could've been the reliable source of truth for string encodings? But of course if we were starting from scratch today, we'd just mandate UTF-8 and be done with it :)
> 2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata. 3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
No, there are locale settings (in envvars), and software should assume the path encoding based on the locale's encoding.
It is true that today the locale is usually UTF-8 based, but if I use a non-UTF-8 locale then tools should not assume paths are in UTF-8 and recode them.
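For what it's worth, that's roughly how Python behaves: it derives the filesystem encoding from the locale (unless UTF-8 mode is forced) and round-trips undecodable bytes with the surrogateescape error handler instead of dropping them. A rough sketch:

    import os, sys

    print(sys.getfilesystemencoding())   # picked from the locale/environment

    raw = b"caf\xe9.txt"                 # Latin-1-style bytes, not valid UTF-8
    name = os.fsdecode(raw)              # under a UTF-8 locale: 'caf\udce9.txt'
    assert os.fsencode(name) == raw      # lossless round trip back to the bytes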
No, the proposal is not for crazy encoding schemes, like for domain names, that's up to the presentation layer.
The need is to follow the Unicode security guidelines for identifiers. A path is an identifier, not a binary chunk, thus it needs to follow some rules. Lately some filesystem drivers have agreed, but it's still totally insecure all over.
OSX will most likely barf at or mangle invalid file names (HFS+ requires well-formed UTF-16, which translates to well-formed UTF-8 at the POSIX layer), and there are ZFS systems which are configured with utf8only set.
It would be more precise to say that you can't assume UNIX paths are anything other than garbage.
Yes, but the only way to interop multiple scripts on a POSIX filesystem is to use UTF-8. I can forgive people for not realizing that filenames in POSIX are a weird animal: they are NUL-terminated strings of characters (char) in some arbitrary codeset and encoding, but US-ASCII '/' is special.
EDIT: Also, "considered UTF-8 by default almost everywhere" is... not necessarily wrong -- nowadays users should be using UTF-8 locales by default. Maybe "almost everywhere" is an exaggeration, but I wouldn't really know.
> Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid
How about a new mount option utf8_only? When that is set on a volume, the VFS would block any attempt to create a new file/directory if the name isn't valid UTF-8. (Pre-existing file/directories with invalid UTF-8 can still be accessed.) Distributions could set it by default on all filesystems, but a user could turn it off if it caused a problem for them (which in practice is probably going to be rare.)
One could also have a flag set on the filesystem (e.g. in the superblock) similar to utf8_only. It could only be set at filesystem creation time. If it is set, then any invalid UTF-8 in a filename is a filesystem corruption which fsck could repair. A filesystem with such a flag set would ban invalid UTF-8 irrespective of any utf8_only mount option.
If we are going to ban invalid UTF-8, it would be a good idea for security reasons to ban C0 controls as well (i.e. all characters in range U+0001 to U+001F), see [1]. This could be included in the utf8_only mount option / filesystem flag, or be an independent mount option / filesystem flag. If going with the same flag for both, maybe "sane_filenames_only" might be a better name.
(Actually, for security, one should ban the UTF-8 encodings of the C1 controls as well... the CSI character U+009B might be interpreted as an ESC[ by some applications, which could have nefarious consequences. Likewise, the APC (application program command) and OSC (operating system command) characters could cause security issues, although in practice support for them is rather limited, which limits the scope of the security issues they pose.)
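A userspace sketch of roughly that check (the real thing would live in the kernel's VFS; the function name and sample names below are just illustrative):

    def is_sane_filename(name: bytes) -> bool:
        # Reject names that aren't valid UTF-8 or that contain C0/C1
        # control characters (U+0001..U+001F, U+0080..U+009F).
        try:
            text = name.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return not any(0x01 <= ord(c) <= 0x1F or 0x80 <= ord(c) <= 0x9F for c in text)

    assert is_sane_filename("report.txt".encode())
    assert not is_sane_filename(b"caf\xe9.txt")              # not valid UTF-8
    assert not is_sane_filename("evil\x1b[2Jname".encode())  # embedded ESC (C0)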
And a lucky thing too; OSes that do have UTF-8 filesystems don’t always agree on how to apply canonicalization, much less how to deal with canonicalization differences between user entered data and normalized filesystem names.
They are pretty straightforward: they are just path structures rather than path names that may turn into single strings when supplied to your kernel. Or, depending on the OS maybe only part of the name is turned into a string and part determines which device or syntax applies. All of which is abstracted away by the path objects.
Back in the 1970s when this first appeared on Lisp machines, it was not uncommon to use remote file systems transparently, and those remote file systems could be on quite different OSes like ITS, TOPS-10 or -20, VMS, one of the Lisp machine file systems, and even Unix (though networking came quite late to Unix). “MC:GUMBY; FOO >” and “OZ:<GUMBY>FOO.TXT;0” were perfectly reasonable filenames. Some of those systems had file versioning built into them. So if the world looks like Unix to you, some of that additional expressive power could be confusing.
C++17 path support is a neutered version of Common Lisp’s.
(Seriously though, is it pathnames you don't understand or logical hosts? Because CL pathnames are actually pretty straightforward. Logical hosts, on the other hand, are a hot mess.)
? (type-of "this is a string")
(SIMPLE-BASE-STRING 16)
? (type-of #P"/this/is/a/pathname")
PATHNAME
You can't perform string operations on a pathname.
? (subseq "This is a string" 5 15)
"is a strin"
? (subseq #P"/This/is/a/pathname" 5 15)
> Error: The value #P"/This/is/a/pathname" is not of the expected type SEQUENCE.
You can perform pathname operations on a string, but only because the string is automatically converted into a pathname first.
Maybe Linux (and other OSes) should deprecate non-UTF-8 filenames and start disallowing the creation of filenames that aren't valid UTF-8?
It seems silly that directory entries are just binary blobs and yet 99.99% of all software I know of passes around paths as strings. We could ask all software to stop that (boil the ocean) or we could just ask the OSes to stop it (there are far fewer OSes than there is other software).
Read that quote again: 'considered UTF-8 by default almost everywhere'. It is absolutely the truth. While you can stuff non-UTF-8 in, almost all of your tools will handle it badly. Even Rust programs wanting to log the file name. It is the same as considering email addresses case sensitive: technically correct, practically shooting yourself in the foot.
> Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
Rust is one of a few programming languages that incorrectly treat "string" as if it were a coherent concept distinct from byte buffers.
Among those, it has the distinction of not forcing file paths into this inherently incorrect model.
(In practice, if you have a type system that can distinguish arbitrary byte buffers from ones with a known encoding, that is far from the most useful thing to distinguish about them anyway.)
git will also do this, so on an fs that allows arbitrarily-byte-named files, you can end up with tree objects of the same name, which makes digging them out later "fun".
It's a reflection of the fact people aren't going to throw out existing filesystems because they aren't in a specific character encoding. There's nothing the OS can do about that, there's nothing programmers in general can do about that, and the only way to fix it is with a time machine and enough persuasion to force everyone to implement Unicode and UTF-8 to the exclusion of any other character encoding schemes.
And it would still be wrong, because the rules of what constitutes valid unicode have changed (what's a surrogate?), and also why would that be a good idea to bake into your filesystem??
It would be a very good idea to acknowledge the existence of codecs by storing the identifier of the chosen codec but forcing a specific one doesn't appear to be that useful.
> one of the few programming languages that correctly doesn’t treat file paths as strings
I hear: one of those few programming languages that, despite its vaunted type-safety, makes it possible to accidentally create a file with a completely bogus name that I won't be able to view or open correctly with half the programs on my computer.
Languages which allow arbitrary byte sequences in paths are the cause of, and solution to, all of Unix's pathname problems.
No, it’s impossible to do that accidentally, due to its type safety. You have to be pretty explicit about passing a non-string in (all Rust strings are valid UTF-8).
However, sometimes you're in a layer where ASCII is fine, and you should just be explicit about that.
Server Name Indication (in RFC 3546) is flawed in several ways; it's a classic unused extension point, for example, because it has an entire field for what type of server name you mean, with only a single value for that field ever defined. But one flaw that stands out is that it uses UTF-8 encoding rather than insisting on ASCII for the server name.
You can see the reasoning: international domain names are a big deal, we should embrace Unicode. But IDNA already needed to handle all this work; the DNS A-labels are already ASCII even for IDNs.
Essentially choosing UTF-8 here only made things needlessly more complicated in a critical security component. Users, the people who IDNs were for, don't know what SNI is, and don't care how it's encoded.
Trying to figure out how to express this without making people mad at me. I think the conflation of Unicode with "plain text" might be a mistake. Don't get me wrong, Unicode serves an important purpose. But bumping the version from plain text 1.0 (ASCII) to plain text 2.0 (Unicode) introduced a ton of complexity, and there are cases where the abstractions start leaking (iterating characters etc).
With things like data archival, if I have a hard drive with the Library of Congress stored in ASCII, I need half a sheet of paper to understand how to decode it.
Whereas apparently UTF8 requires 7k words just to explain why it's important. And that's not even looking at the spec.
Just to be crystal clear, I'm not advocating to not use Unicode, or even use it less. I'm just saying I think it maybe shouldn't count as plain text, since it looks a lot like a relatively complicated binary format to me.
Unicode is complicated because the languages it needs to handle are, alas, complicated. UTF-8 is super simple. It's a variable-length encoding for 21-bit unsigned integers. Wikipedia gives a handy table showing how it works: a lead byte whose high bits say how many bytes the sequence has (one to four), followed by continuation bytes of the form 10xxxxxx.
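The gist of that table, seen through Python's encoder (the sample characters are arbitrary, one per sequence length):

    # One sample character per sequence length: 1, 2, 3 and 4 bytes.
    for ch in ["A", "é", "€", "🙂"]:
        print(f"U+{ord(ch):04X} -> {list(ch.encode('utf-8'))}")
    # U+0041 -> [65]
    # U+00E9 -> [195, 169]
    # U+20AC -> [226, 130, 172]
    # U+1F642 -> [240, 159, 153, 130]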
When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8's design. For example: the first byte says how many bytes the character requires. At first it was daunting, but when I put two and two together, it really opened up.
I am sure there are many aspects I am missing about UTF-8, but it is all reasonable in its design and implementation.
For reference, I was converting between code points and actual bytes, and also implemented strlen and strcmp (the latter of which the standard library apparently handles fine as-is).
The self-synchronizing property is also very clever. If you land on an arbitrary byte, you can find a character boundary by scanning forward at most three bytes.
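A small sketch of that resynchronisation (the helper is just for illustration):

    def next_boundary(data: bytes, i: int) -> int:
        # Continuation bytes all match 10xxxxxx, so skip them until we land
        # on the first byte of a character (or the end of the data).
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
        return i

    s = "naïve 🙂".encode("utf-8")
    print(next_boundary(s, 3))  # 4: index 3 is the second byte of "ï",
                                # so we land on the "v" that follows it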
Yeah, this. I have a pat "Unicode Rant" that boils down to this essentially.
Having a catalog of standard numbers-to-glyphs (or symbols or whatever, little pictures humans use to communicate with) is awesome and useful (and all ASCII ever was) but trying to digitalize all of human language is much much more challenging.
But human language doesn't stop being "much much more challenging" if you decide not to engage.
Sometimes (and this can even be an admirable choice) in some specialist applications it's acceptable to decide you won't embrace the complexity of human language. But in a lot of places where that's fine we already did this with the decimal digits such as in telephone numbers, or UPC/EAN product codes, so we don't need ASCII.
In most other places insisting upon ASCII is just an annoying limitation, it's annoying not being able to write your sister's name in the name of the JPEG file, regardless of whether her name is 林鳳嬌 or Jenny Smith, and it jumps out at you if the product you're using is OK with Jenny Smith but not 林鳳嬌.
You might think well, OK, but there weren't problems in ASCII. The complexity is Unicode's fault. Think about Sarah O'Connor? That apostrophe will often break people's software without any help from Unicode.
Your sister's name doesn't render in my browser (stable Firefox on Linux 5.6). I'm sure I'm missing a fontpack or something. Again, I'm not saying ASCII is the solution, I'm saying Unicode is much more difficult to get right, and maybe we should call it something other than "plain text", since we already had a generally accepted meaning for that for many years. I'm usually in favor of making a new name for a thing rather than overloading an old name.
Firefox does full font fallback. So this means your system just isn't capable of rendering her name (which yes you might be able to fix if you wanted to by installing font packages). If you don't understand Han characters that's an acceptable situation, the dotted boxes (which I assume rendered instead) alert you that there is something here you can't display properly but if you know you can't understand it even if it's displayed there's no need to bother.
It really is just plain text. Human writing systems were always this hard, and "for many years" what you had were separate independent understandings of what "plain text" means in different environments, which makes interoperability impossible. Unicode is mostly about having only one "plain text" rather than dozens.
It is not mandatory that your 80x25 terminal learn how to display Linear B, you can't read Linear B and you probably have no desire to learn how and no interest in any text written in it. But Unicode means your computer agrees with everybody else's computer that it's Linear B, and not a bunch of symbols for drawing Space Invaders, or the manufacturer's logo, if you fix a typo in a document I wrote that has some Linear B in it, your computer doesn't replace the Linear B with question marks, or erase the document, since it knows what that is even if you can't read it and it doesn't know how to display it.
But I'm not saying we shouldn't engage, I'm just pointing out that the catalog of lil pictures is the easy part of the task.
One way I put it is, imagine if one of the first-class outputs of the Unicode Consortium was standard libraries for different human languages for different computer languages.
As a person who comes from a country with non-ASCII alphabet, I strongly disagree. Since UTF-8 became de-facto standard everywhere, so many headaches went away.
That complexity comes from the fact that you are using non-ASCII characters. UTF-8 is a superset of standard ASCII. If you are using only standard ASCII characters, they're exactly the same thing.
And you're naïve if you think ASCII suffices for English. I wouldn't give you ½¢ for an OS incapable of handling Unicode and UTF-8 even if you told me every language other than English were mysteriously destroyed. Going back to ASCII is 180° from what would enrich English-language text.
> You only need one sentence to explain why ASCII isn't sufficient
Nitpick: ASCII is sufficient when you consider that Base64, despite its 33% overhead from representing 6 bits with 8 bits, makes life easier for certain classes of software.
What I was alluding to is, I often convert any binary data, including text, to Base64 to avoid dealing with cross platform, cross language, cross format, cross storage, cross network data-handling. Only the layer that needs to deal with the blob's actual string representation needs to worry about encoding schemes that are outside the purview of the humble ASCII table.
Base64 encodes sextets. The mapping from octets to sextets is mostly settled for sets of three octets at a time, but the situation for input lengths that aren't a multiple of three octets is a mess.
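A quick look at that mess in Python: one, two or three input octets all occupy a single four-character output group, with '=' padding marking the unused positions.

    import base64

    for data in (b"a", b"ab", b"abc"):
        print(data, base64.b64encode(data))
    # b'a'   b'YQ=='
    # b'ab'  b'YWI='
    # b'abc' b'YWJj'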
ASCII is English and limiting access to knowledge for the rest of humanity for a simpler encoding is just not an acceptable option. Someone needs to interpret those 7k words and write a (complicated?) program once so that billions can read in their own language? Sounds like an easy win to me.
Sure, spoken, but both Arabic and CJK ideograms are written in far more countries in the world, by far more people, and for far longer in history than the ASCII set. The oldest surviving great works of mathematics were written in Arabic and some of the oldest surviving great works of poetry were written in Chinese, as just two easy and obvious examples of things worth preserving in "plain text".
Playing the devil's advocate here. I am not a native English speaker, I'm a French speaker, but I'm happy that English is kind of the default international language. It's a relatively simple language; I actually make fewer grammar mistakes in English than I do in my native language. I suppose it's probably not a politically correct thing to say, the English being the colonists, the invaders, the oppressors, but eh, maybe it's also kind of a nice thing for world peace if there is one relatively simple language that's accessible to everyone?
Go ahead and make nice libraries that support Unicode effectively, but I think it's fair game, for a small software development shop (or a one-person programming project), to support ASCII only for some basic software projects. Things are of course different when you're talking about governments providing essential services, etc.
I know almost no one who actually types the accented e, let alone the c with the cedilla. I scarcely ever see the degree symbol typed. Rather, I see facade, cafe, and "degrees".
That aside, the big problem with Unicode is not those characters; they're a simple two-byte extension. They obey the simple bijective mapping of binary character <-> character on screen. Unicode as a whole doesn't. You have to deal with multiple code points representing one on-screen grapheme, which in turn may or may not translate into a single on-screen glyph. Also bi-directional text, or even vertical text (see the recent post about Mongolian script). Unicode is still probably one of the better solutions possible, but there's a reason you don't see it everywhere: it means not just updating to wide chars but having to deal with a text shaper, redo your interfaces, and handle tons of other messy stuff. It's very easy for most people to look at that and ask why they'd bother if only a tiny percentage of users use, say, vertical text.
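A concrete illustration of the "multiple code points, one grapheme" point, using Python's unicodedata module:

    import unicodedata

    # "é" as one precomposed code point vs. "e" plus a combining accent:
    # they render identically but are different code point sequences until
    # normalised.
    a = "\u00e9"      # é, precomposed
    b = "e\u0301"     # e + COMBINING ACUTE ACCENT
    print(a == b)                                 # False
    print(len(a), len(b))                         # 1 2
    print(unicodedata.normalize("NFC", b) == a)   # True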
The first point is just because of the keys on a keyboard.
I see many uses of "pounds" or "GBP" on HN. Anyone with the symbol on the keyboard (British and Irish obviously, plus several other European countries) types £. When people use a phone keyboard, and a long-press or symbol view shows $, £ and €, they can choose £.
Danish people use ½ and § (and £). These keys are labelled on the standard Danish Windows keyboard.
There's plenty of scope for implementing enough Unicode to support most Latin-like languages without going as far as supporting vertical or RTL text.
For some reason people seem to think that the only options are UTF-8 and ASCII. That choice never existed. There are thousands upon thousands of character encodings in use. Before Unicode every single writing system had its own character encoding that is incompatible with everything else.
You didn't say spoken by every person. Merely spoken in every country. Even the existence of tourists in a country would pass this incredibly low bar...
Of course ASCII is simpler than Unicode; it handles only 128 characters. If you restrict yourself to those characters, ASCII is binary-equivalent to UTF-8.
So yeah, maybe you shouldn't use characters 128+ for data archival, I doubt that's a good idea, but that's irrelevant to whether UTF-8 is plain text or not.
I think that sometimes it makes sense to enforce strict limitations early on (eg: overly strict input validation). You can then remove such limitations in later versions of your software, after careful consideration and after inserting the necessary tests. The reverse usually doesn't work. If you didn't have those limitations early on, and your database is full of strings with characters that should never have been allowed in there, you will have a hard time cleaning up the mess.
This seems especially true to me in the design of programming languages. If you have useless, badly thought out features in your programming language, people will begin to rely on them, and you will never be able to get rid of them... So start with a small language, and make it strict. Grow it gradually.
There are tens of thousands of characters in all the human scripts. If you're a librarian, scholar, researcher -- why would you not want to be able to use them seamlessly??
If there was a complicated tool that claimed it could do the job of every tool in history, or a simple tool that was focused to cover 99% of the work you do-- and we lived on planet earth-- which would you choose?
As I understand it, it's impossible to have a txt file that uses Japanese and Chinese characters at the same time. The file will either use the Chinese or Japanese forms of the characters, depending on your font. I would think this is a big gotcha people must run into all the time, but I never hear anyone talk about it.
I’m not going to try and minimize the problem, here. Han unification was pushed through by western interests, by my understanding.
However, most Unicode characters are identical or nearly identical in Chinese and Japanese. Characters with “significant” visual differences got encoded as different Unicode characters. The same thing applies to simplified and traditional Chinese characters.
So for a given “Han character”, there might be between one and three different Unicode characters, and there might be between one and three different ways of writing it.
So the issue does come up when mixing Chinese and Japanese text, but it's not really one that has a big impact on the legibility of the text. You would definitely be concerned, though, if you were writing a Japanese textbook for Chinese students, or vice versa.
Beyond that, it is usually fairly trivial to distinguish between Japanese and Chinese text, so you could just lean on simple heuristics to get the work done (Japanese text, with the exception of fairly ancient text or very short fragments, contains kana, but Chinese does not).
Han unification was pushed through by western interests, by my understanding.
Note that as far as I'm aware, the interest in question was the initial 16-bit limit of the character set and later on the non-proliferation of competing standards.
Also note that while Han unification is the most prominent example, there are technically similar cases, which just aren't as charged culturally. For one, Unicode doesn't encode German Fraktur: While some characters are available due to their use in mathematics, it's lacking the corresponding variants of ä, ö, ü, ß, ſ as well as specific ligatures. So if you want to intermix modern with old German writing, you'll also have to go out-of-band.
Let's not excuse the utter irresponsibility of deciding on 16 bits: the initial 16-bit limit of the character set is instantly invalidated by looking at any comprehensive Chinese character dictionary, no reasonable choice of which will give you an estimate of under about 30k characters, even excluding graphical variants.
Even assuming that we discount 80k+ estimates by collapsing graphical variants, that's over half of your code space right off the bat. For this to seem like a good idea, you'd need to assume that Chinese is a uniquely bad one-off case. Not a good bet to stake your character set on.
It's actually exactly the same thing. The Han Unification didn't smash together unrelated squiggles that just happened to look similar, they were semantically the same - scholars of the Han writing system spent a bunch of time deciding what is or is not the same squiggle just drawn differently, like Fraktur, and today people are annoyed because, as you'd expect some of them believed that "style of fonts" was integral to the meaning anyway.
Chinese characters represent the Chinese words or parts thereof, Japanese ones represent Japanese words and parts thereof. That is a semantic difference.
So what you're saying is that because 'chat' in English and 'chat' in French are quite different words with very different meanings, you believe there should be a separate letter 'c' for English and French to enable us to tell those words apart?
It is not logographic, but characters still have meaning - associated phonemes. Although this is less clear in English, it is emphasized in other languages.
And this mapping is different between languages. So 'c' in English has different meaning to 'c' in Czech.
There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally? There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode. What if I wanted to preserve that distinction in historic writing?
edit:
[1] I think the mandatory ones are actually there (just not in Fraktur), it's some optional ones like ſch that are missing.
> There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally?
No, since "now" is an English word, not a Japanese or Chinese one.
> There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode.
Unicode doesn't encode ligatures except for backwards compatibility.
Of course it is. Ligatures aren't characters, they're glyphs that represent multiple characters. Unicode does not encode glyphs, that's simply not its job. No more than encoding what font to use or when to render text in italic.
Which is the whole point of Han unification, the argument being that whether or not a particular line in U+4ECA is horizontal or diagonal is just like that. What's the difference?
To the contrary: What any line in any glyph looks like is of no concern because Unicode doesn't deal with glyphs. It deals with abstract characters that don't have appearances to begin with.
"Α" and "A" look exactly the same (at least in most fonts). But each has its own code point because the GREEK CAPITAL LETTER ALPHA simply isn't the LATIN CAPITAL LETTER A or any other Latin letter.
As I understand it Han unification happened because at the time all there was was UCS-2 -no UTF-16, no UTF-8- so codespace was tight and precious, and that motivated codespace preserving optimizations, of which Han unification is the notable one.
To avoid that they needed to have invented UTF-8 many years earlier. Perhaps if the people designing Unicode were more diverse they might have felt the necessity of inventing UTF-8 strongly enough to actually do it, but then perhaps they might have done it poorly. At any rate, I don't know enough details to really know if "Han unification was pushed through by western interests" is remotely fair.
UTF-8 was sketched on a placemat as a response to a different idea. It seems likely that had it not arisen in a moment of inspiration by a genius, we would be stuck with another inferior design by committee.
I agree. But too, necessity is the mother of invention. GP seems to argue that Han unification happened because the UC was not diverse enough. Maybe, and maybe if it had been diverse enough the need would have arisen sooner. But again, the thing they came up with could have been garbage, who knows!
What I do know is that UTF-8 is genius. The Han unification problems seem mostly minor -- I suspect code can detect language and do the right thing, for example, and again, we could revive language tags if need be.
Here's some text I could write about some Japanese characters, that, thanks to Han Unification, may be confusing:
In 1946, the Japanese government created a (non-exhaustive) list of common characters, some of which were simplified from their more traditional form. One of them is 臭. Its older form was 臭. Another character that shares the same root, 嗅, was not part of that list of common characters. It was added later, in 2010, and was never simplified, such that the stroke that was removed in 臭 is still there, making it just slightly different.
If your fonts are biased towards Chinese, 臭 and 臭 will be identical, and you won't know what I'm talking about. The former is 自 above 大, the latter is 自 above 犬.
You could think the difference is trivial, but 大 is big and 犬 is dog. Not that it alters the meaning of 臭, 臭, or 嗅, but when talking about how 嗅 is not 口 alongside 臭 anymore, it does make a difference.
Yes, the real problem is when you start mixing all four (or five) of them together: Traditional Chinese, Simplified Chinese, Korean, Japanese. Things become extremely problematic.
I think it is by luck that all four writing systems have significant usage within their own regions. Imagine if one of them were significantly smaller and over time were forced (by ease of use or whatever reason) to switch to a different style without knowing it.
First of all, there is no new unification work ongoing. The Unicode Consortium moved on from that by moving on from UCS-2. UCS-2 drove unification as a way to preserve precious codespace.
There used to be language tag codepoints for this, but they've been deprecated. Han unification is an accident of history: a result of UTF-8 not having existed until it was too late!
There's not going to be a different new Unicode for doing away with Han unification, which is why no one mentions it: besides crying about it, what else can one do? Maybe we should revive language tags?
Anyways, isn't the difference between unified Han/Kanji characters mostly stylistic rather than semantic? I'm not denying that many users would get annoyed, but again, what to do about it??
It's different enough that users will immediately complain if you get it wrong. And it means that you, as a developer who might not understand either Chinese or Japanese, now have to deal with the fallout by setting a different font in your application depending on which of the two languages it is.
This happened to us in Factorio, and it was super annoying, because it's really hard to spot the problem before it goes live: you A: don't know the problem exists (how would you?), and B: have a hard time seeing it even when you do know.
The whole point of Unicode is to not have to think about this crap or handle it explicitly, and this breaks that guarantee fantastically.
See the whole point about history. All you're doing is crying about it :(
Here's a question: when a native Chinese speaker reads a Japanese text, do they want to see it in Chinese style or Japanese style? If the former, then just know that that's their preference and always use their preference -- easy fix. If the latter... you need to know the language of a text (or sub-text), and that requires either language tags or language recognition.
I expect it's the latter, to make it easier to recognize foreign text, which is not necessarily easy to read. After all, native Chinese, Japanese, and Korean speakers who don't speak the other languages can only glean so much meaning from Han/Kanji text in the others' languages. That's because while often ideographic characters are used for (common) meaning, sometimes they are used for the sounds of the words they identify but not their meanings.
In the original language, of course. Why is that even a question? That is like asking whether Greek people would want to read Latin script using the Greek alphabet or not.
You keep using the word 'style', so you agree that α is a style of a? Then I have no more comment. It's not 'style' at all.
The same could be said about whether è and é should be the same as e with different fonts. People who care about it would complain. To those who only use English, it is just the same e.
Not just different pronunciations. ê isn't about pronunciation but about indicating that "there used to be an 's' after this e". French written w/o circumflex accents doesn't change in pronunciation, not really, nor -mostly- in meaning, but it does look very annoying to French speakers as well as to native speakers of other Romance languages: the reason is that that reminder is in fact useful for translation.
I'm guessing Han unification is at least annoying like losing circumflex accents would be.
Relatively few people frequently look at different Han languages, and relatively few people are looking at txt files containing Han characters (and I expect those that do are typically running with their OS locale set to one of the Han languages?).
Enough CJK HTML content is tagged and heuristics are mostly good enough that incorrect font selection isn't a massive issue on the web, and AFAIK most major word processors include metadata in the file that suffices to distinguish language.
> Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
They probably couldn't even if they wanted to, by this point there will be too much software out there depending on "utf8" meaning "MySQL's weird proprietary hacked-up version of UTF-8".
The only real solution is to hammer home the message that "utf8mb4" is what you put into MySQL if you want UTF-8.
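The underlying distinction, in a couple of lines of Python: MySQL's legacy "utf8" (utf8mb3) stores at most three bytes per character, and anything outside the BMP needs four.

    print(len("中".encode("utf-8")))    # 3 -> fits utf8mb3 or utf8mb4
    print(len("🙂".encode("utf-8")))    # 4 -> fits utf8mb4 only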
> For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
Is "ch" really considered one _character_ in Czech and Slovak? I'm Polish and we do have "ch" and consider it one ... sound... represented by two letters? I mean... if you asked anyone to count letters/characters in a word, they would count "ch" as two. So I wonder if that's different in Slovakia or Chech Republic, or is just my definition of "character" wrong.
> So I wonder if that's different in Slovakia or Chech Republic, or is just my definition of "character" wrong.
According to wikipedia, "Ch" is a character of the Czech alphabet in the sense that it impacts alphabetical ordering ("Ch" sorts between H and I), in the same way Ł or Ę are apparently characters from the Polish alphabet distinct from L and E respectively (wikipedia mentions that "być comes after bycie").
That is unlike, say, French, where É and E are the same character alphabetically.
This depends on your definition of informal terms like "letter", "character" etc.
The typographic term for combinations like this is "digraph". (Wikipedia's definition: "A digraph [...] is a pair of characters used in the orthography of a language to write either a single phoneme [...] or a sequence of phonemes that does not correspond to the normal values of the two characters combined".)
Whether digraphs have separate keys on a keyboard, are treated as distinct for the purposes of alphabetisation, whether speakers of the language think of them as separate "letters" when spelling out a word and so on, are all separate issues and varies between languages (or, more precisely, between the conventions for writing a certain language).
A better example would probably be "ij" in Dutch. That's definitely considered a single letter, as words starting with ij in Dutch are capitalised IJ. Though there are glyphs for IJ /ij already in unicode.
"Ij" is also one sounds represented bij two letters, and I think capitalizing just the 'I' is pretty standard. As a Dutch person myself, I didn't even know that there's a glyph for it!
We also have "ei", which sounds the same and was invented to annoy people learning Dutch. Then there's "oe", "eu", "ui". And just to fuck even more with people learning the language, we have "au" and "ou" which also sound the same. Oh, and "ch" and "g".
Hans Brinker, the inventor of the Dutch language, famously would toss a florijn to decide between using ei/ij and au/ou, as he was not fond of foreigners. He's mostly known for saving our country though when he plugged a hole in a dyke with his finger (yes, I know what you're thinking, and no, we do not appreciate your dirty minds making light of this heroic act).
Interesting. I never really gave it much thought, but Ij actually bothers me so much that I usually try to avoid using it at the beginning of a sentence, and I cringe when I need to capitalize because it's a place (like Ijsselmeer).
Just did some googling. Turns out that unlike the other combinations, capitalizing both letters is mandatory for 'IJ'. TIL...
Nobody has that as a letter on the keyboard here though, so it doesn't matter. It's normally typed as a digraph. Would be nice if we just switched over to using y at this point. Makes me wonder, has the use of diacritics been declining since ASCII keyboards became the norm?
"IJ (lowercase ij; Dutch pronunciation: [ɛi]) is a digraph of the letters i and j. Occurring in the Dutch language, it is sometimes considered a ligature, or a letter in itself. In most fonts that have a separate character for ij, the two composing parts are not connected but are separate glyphs, which are sometimes slightly kerned."
I don't know that that's correct. That there exists a ligature character doesn't mean the ligature is a character of the language.
It could, mind, I don't know Dutch. But in French "œ" (which has a ligatured character as you can see) is canonically equivalent to "oe". It is not a separate letter of the alphabet even though:
* many words should not be written with the ligatured form
* many words should be written with the ligatured form
* it has a different pronunciation than the base form
Based on my experience learning Czech (not native at all, just interested):
- it's typically listed as a separate letter when writing out the alphabet
- but in practice it's typed out as "c h" and not as a single character
- it occupies its own place in Czech standard alphabetical order, my English-Czech dictionary has all the "ch" words after "h" (so interestingly in order to do a proper sort programmatically you need to possibly look 2 characters ahead)
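That two-characters-ahead lookahead is what locale-aware collation does for you. A quick check from Python, assuming the cs_CZ.UTF-8 locale is installed:

    import locale

    # Czech collation treats "ch" as its own letter, sorting after "h".
    locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")
    words = ["chata", "cena", "hora"]
    print(sorted(words))                      # ['cena', 'chata', 'hora']  (naive)
    print(sorted(words, key=locale.strxfrm))  # ['cena', 'hora', 'chata']  (Czech)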
As a native Czech speaker, I never really understood what it means that 'ch' is one letter in Czech. It is clearly two graphemes representing one phoneme, so one could think it is a digraph, but it has some special properties, like being one element in the collating order. I think people just started to call it one letter to have the one-letter-one-sound property.
At first I thought they simply meant the letter "č", but no, it turns out that "ch" (and also "dz") is a digraph with a separate place in the Czech and Slovak alphabets.
I came to the same conclusion years ago. My app is Win32, but I never defined UNICODE or used the TCHAR abomination. All strings are stored as UTF8 until they are passed to Win32 APIs, whereupon they are converted to UCS-2. I explicitly call the wchar version of functions (ex: TextOutW). This strategy enabled me to transition easily and safely from single-byte ASCII (Windows 3.1) to Unicode.
Calling the "A", instead of "W" functions might be some small perf hit (don't know if it matters), but for some functionality you need to call the "W" functions, for example to break the limit of 256 or was it 260 characters, up to 32768 (or was it 16384).
Is java.lang.String still UTF-16? Is there any plan to fix that? Once Windows and Java take care of it, I can't think of any other major UTF-16 uses left. Are there any that I've forgotten about?
I don't think they can fix that without completely breaking backwards compatibility. The basic char type in Java is defined as a 16 bit wide unsigned integer value and String doesn't abstract over that.
Only for ASCII text. There is still no UTF-8 support (it's even called out as a non-goal in the JEP: "It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings.")
I don't think it's a big deal for Java because it's always easy to transfer in from and out to UTF-8. Very few Java programs use UTF-16 as a persistence format, and Java-native applications can directly marshal strings around as they are a first-class datatype.
You’re right! I’m surprised I didn’t know that. It looks like it can also be UCS-2, going by the spec:
> A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.
> A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form.
> A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
That's not really possible as strings are defined in terms of char and guarantee O(1) access to UTF16 code units. They might try to switch to "indexed UTF8" (as pypy did in the Python ecosystem whereas "CPython proper" refused to switch to UTF8 with the Python 3 upheaval and went with the death trap that is PEP 393 instead).
However it's not quite unequivocal. Windows still uses UTF-16 in the kernel (or actually an array of 16bit integers, but UTF-16 is a very strong convention). The code page will often allow the Win32 API to perform the conversion back and forth instead of your application doing it.
AFAICT, it's not only "internal representation". .NET strings are defined as a sequence of UTF-16 units, including the definition of the Char type representing a single UTF-16 code unit. I can't imagine how such a change could be implemented (other than changing the internal representation but converting on all accesses which would be nonsense, I think).
Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
Then WTF-8 is what you get if you naively transform invalid UTF-16 into UTF-8. It is a superset of UTF-8.
This is very useful when dealing with applications like Java and Javascript that treat strings as sequences of 16-bit code points, even though not all such strings are valid UTF-16.
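You can poke at the distinction from Python: a lone surrogate is a perfectly storable 16-bit code unit, but it isn't valid UTF-16/UTF-8, and the surrogatepass error handler produces essentially the WTF-8 bytes for it.

    s = "\ud800"                 # a lone (unpaired) high surrogate
    try:
        s.encode("utf-8")        # strict UTF-8 refuses it
    except UnicodeEncodeError as e:
        print("not valid UTF-8:", e.reason)
    print(s.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\x80'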
> Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
If WTF-16 is the ability in potentia to store and return invalid UTF-16 without signalling errors, I don't know that there's any actual UTF-16 system out there, with the possible exception of… HFS+ maybe?
That's good news. Last time I looked, more than a decade ago admittedly, that bug was WONTFIX.
In fact I was so surprised I just wrote a test program. They have fixed it!
It was the dumbest bug I ever saw in Windows. It was special case code in the console output code path of the user mode part of WriteFile. It only existed to make utf8 work, and it didn't even do that.
Ah, that's surprising, Microsoft was very stubbornly not doing that for at least a decade and a half.
In fact, the FAQ in TFA (questions 9 and 20) mentions that there are still problems with CP_UTF8 (65001). Is the article out of date? Can someone respond to those statements?
The article is outdated; it's from 2012. Not only did they fix the problems, but in Windows 10 1803 they also added an option to globally and permanently set both the OEM and ANSI(!) codepages to 65001.
It can be enabled by checking the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox in region settings.
When I used to do a lot of windows programming in the late 90s, I wish that I had a sensible guide like this for handling strings. TCHAR was always a source of subtle bugs.
I suppose, though, that the underlying problem was that Microsoft was so late to ship a compatibility solution for Windows 9x. Most software of the time ended up targeting the "ANSI" multibyte character set (MBCS) simply because otherwise you would need to either ship two executables or do your own thunking. This solution would be a double thunk on 9x, because you'd be thunking your UTF-8 to Unicode and then thunking that back to MBCS.
> When writing a UTF-8 string to a file, it is the length in bytes which is important. Counting any other type of ‘characters’ is, on the other hand, not very helpful.
So, suppose I have a UTF-8 string of n code units (bytes) length. Unfortunately my data structure only permits strings of length m < n bytes.
How do I correctly truncate the string so it doesn't become invalid UTF-8 and won't show any unexpected gibberish when rendered? (E.g., the truncated string doesn't suddenly contain any glyphs or grapheme clusters that weren't in the original string)
> How do I correctly truncate the string so it doesn't become invalid UTF-8 and won't show any unexpected gibberish when rendered? (E.g., the truncated string doesn't suddenly contain any glyphs or grapheme clusters that weren't in the original string)
Cropping strings is a hard problem for ASCII strings as well. It can even be a security problem if the cropped part contains important information that alters the meaning of the first part (something like "DELETE FROM table_name [WHERE condition]", or natural language where the cropped part is a condition or a negation).
But even if you don't care about this: if you care about cropping visually nicely, you want some ellipsis at the end, you don't want to crop in the middle of a word (if possible), etc. In the end, you need some nice text processing anyway.
Avoiding invalid UTF-8 is easy, almost trivial: just make sure you don't truncate in the middle of a code point.
The latter is fiendishly difficult to get right in all cases, the ugliest case being emoji flags. Being all-or-nothing on both sides of a ZWJ will get you most of the way there, however.
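For the easy half (staying valid UTF-8), a boundary-safe truncation might look like the sketch below (the helper name is mine; it deliberately does nothing about grapheme clusters or ZWJ sequences):

```cpp
#include <string>

// Sketch: truncate a UTF-8 string to at most max_bytes without splitting a
// code point. Continuation bytes have the form 10xxxxxx, so we back up past
// them until we hit a lead byte. The result stays valid UTF-8, but this does
// NOT prevent splitting a grapheme cluster (e.g. a flag or ZWJ emoji sequence).
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        --end;  // don't cut in the middle of a multi-byte sequence
    return s.substr(0, end);
}
```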
What do you think distinguishes the truncated string containing "glyphs or grapheme clusters" that weren't in the original string from the truncated string containing words that weren't in the original string? Is the latter somehow more acceptable? How about missing necessary context from the end of a sentence?
Refuse to accept a string that is too long, and require an interactive user (hopefully one literate in the language) to truncate it for you. In a non-interactive context, you can't.
As someone who experienced serious pain with broken strings, sometimes discovered only after the original files were gone and new special characters had been introduced, I directed quite some anger at the fact that computer systems are internally operated mostly in English, so usually nobody notices bugs with wrong character encodings. So I share the sentiment of the article.
I do not want to think about UTF encodings when I simply create a 7z or tar file, without even programming. But I learned the hard way that I had to. I never even found out, for example, whether it was (or is) a bug in 7z, tar, rsync, the SciTE text editor, Notepad++, or just wrong usage/configuration. I just had (and still have, even now that my workflow is clean) a special first file/code line with special characters that I checked for correctness after compressing and rsyncing between different systems, especially between Windows and Linux. But it probably helps that I don't have to do that anymore.
> Many third-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API. Sometimes, even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done by Windows programmers for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. It is not possible if the library is supposed to create a non-existing file.
Yikes. That's a fascinating use of 8.3 paths. Sometimes when I look at really old Windows cruft I wonder when it will go away. 8.3 paths seemed like an easy thing to get rid of, but with 8.3 paths used to hack around encoding issues in 3rd party libraries... that's going to stick around...
Anyone know which libraries this is talking about?
> Q: What do you think about Byte Order Marks? A: According to the Unicode Standard (v6.2, p.30): "Use of a BOM is neither required nor recommended for UTF-8". [...] Using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation. This is unacceptable.
Then your site "UTF-8 everywhere" is misnamed, because standards-following UTF-8 can have a BOM.
It's not required or recommended, but it is possible and allowable, so you might see them, and if you follow the standard you have to deal with them. It's not a matter of "this would require all existing code to handle them": that is not hypothetical, that is the current world; to be standards-compliant, all existing code already needs to be aware of them. It isn't, which means it's broken. Declaring it "unacceptable" is meaningless, except to say you're rejecting the standard and doing something incompatible and broken because it's easier.
Which is a position one can take and defend, but it's not a good position for a site claiming to be pushing for people to follow the standard. What it is, is yet another non-standard ad-hoc variant defined by what some subset of tools the authors use can/can't handle in April 2020.
> "the UTF-8 BOM exists only to manifest that this is a UTF-8 stream"
Throwing the word "only" in there doesn't make it go away. It exists as a standards-compliant way to distinguish UTF-8 from ASCII, not recommended but not forbidden.
> "A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything"
Well, in the same way, how does the fact that UTF-8+BOM is rare practically change anything? At some level, you're either pushing for everyone to follow standards even when it's inconvenient, because that makes life better for everyone overall (as you are with surrogate pairs and indexing), or you're creating another ad-hoc incompatible variation of UTF-8 which you prefer to the standard and trying to strong-arm everyone else into using it, with threats of being incompatible with all the code that already does it wrong.
Being wary of Chesterton's Fence, presumably there's some company or system which got UTF-8+BOM added to the standard because they wanted it, or needed it.
> using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation
Absolutely! Any app that writes UTF files can (and probably should) avoid writing a BOM. But any program that reads UTF files must handle a BOM. A lot of apps write UTF-8 including the BOM by default, Visual Studio for example.
You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream. NO tool should assume that, ever.
> You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream.
Actually you can; the ability to concatenate UTF-8 streams is an intentional part of the design of UTF-8. The BOM is an ordinary Unicode code point and can occur in the middle of a valid UTF-8 stream, where it should be treated as either a zero-width non-breaking space or an unsupported character (which only affects rendering). So concatenating two UTF-8 streams with leading BOMs still results in a valid UTF-8 stream, albeit with an extra zero-width non-breaking space in the middle.
The bigger problem with the BOM is that it breaks transparent compatibility with ASCII. Absent a leading BOM character, a UTF-8 stream containing only codepoints 0-127 is binary-identical to an ASCII-encoded text stream and can be handled with tools that are not UTF-8 aware. This was an explicit design consideration for both Unicode and UTF-8. Add the BOM, however, and your file is no longer plain text, which can lead to syntax errors or other issues that are difficult to diagnose because the BOM is invisible in UTF-8 aware text editors.
I think the BOM was a mistake—along with the variable-length multi-byte encodings it was created to support—but unfortunately at this point we're stuck with it. (Actually the BOM is prohibited in the multi-byte formats with an explicit byte order, like UTF-16BE; it would have been really nice if the same policy had been applied to UTF-8 where byte order is irrelevant.) The best we can do is recommend that new programs omit the BOM when outputting UTF-8 and either skip it at the beginning or convert it to U+2060 WORD JOINER anywhere else when it appears in the input.
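Skipping a leading BOM on input is only a few lines; a minimal sketch (the helper name is mine):

```cpp
#include <string>

// Sketch: tolerate (and drop) a leading UTF-8 BOM (bytes EF BB BF) when
// reading input. Writers are better off omitting it, but readers still
// have to cope with it.
std::string strip_utf8_bom(const std::string& s) {
    if (s.size() >= 3 &&
        static_cast<unsigned char>(s[0]) == 0xEF &&
        static_cast<unsigned char>(s[1]) == 0xBB &&
        static_cast<unsigned char>(s[2]) == 0xBF)
        return s.substr(3);
    return s;
}
```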
Interesting, I thought a BOM-in-the-middle was invalid. I know apps are even more likely to choke on that than a leading BOM though.
In any case, you need to handle it in every app that claims to read UTF. The loss of compatibility is indeed the biggest problem and I agree the BOM should be omitted when possible, but that doesn’t change that it’s part of the spec and millions of UTF files have a BOM.
Even if 100% of all apps stopped using a BOM today you couldn’t ignore it in a parser.
Downvoting doesn't make the BOM stop being part of the standard either, btw.
Yes, supporting the BOM on arbitrary UTF-8 streams varies between difficult and impossible, but then get it removed from the standard, or state that you don't support the standard. Don't pretend you support the standard while ignoring the bits you don't like; that's dishonest and unhelpful.
I'd argue for some standard tests for UTF-8 strings:
- Basic - UTF-8 byte syntax correct.
- Unambiguous - similar to the rules for Unicode domain names. The rules are complicated, but basically they prohibit homoglyphs, mixing glyphs from different character sets, forwards and backwards modifiers in the same string, emoji, modifiers, etc. Use where people have to visually compare two things for identity or retype them, such as file names.
- Unambiguous, light version - as above, but allow emoji and modifiers. Normal form for documents.
Still doesn't solve the fact that filesystems across different OSes allow invalid UTF-8 sequences in filenames.
Maybe 99% of apps do not care, but even a simple "cp" tool should care. Filenames (and maybe other named resources) should be treated completely differently, and not blindly assumed to be UTF-8 compatible.
> 2) Bye bye backward compatibility and interoperability
It's already not really a thing.
Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with the exception of 0000 and 002f, and I think HFS+ requires valid UTF-16 and allows everything (including NUL).
The OS then adds its own limitations, e.g. Win32 forbids \, :, *, ", ?, <, >, | (as well as a few special names, I think), and OSX forbids 0000 and 003a (":"), the latter of which gets converted to and from "/" (and similarly forbidden) by the POSIX compatibility layer.
The latter is really weird to see in action, if you have access to an OSX machine: open a terminal, try to create a file called "/" and it'll fail. Now create one called ":". Switch over to the Finder, and you'll see that that file is now called "/" (and creating a file called ":" fails).
Oh yeah, and ZFS doesn't really care, but it can require that all paths be valid UTF-8 (by setting the utf8only flag).
> Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with the exception of 0000 and 002f.
For just Windows -> Linux you can represent everything by mapping WTF-16 to WTF-8.
It sounds like they're saying the opposite. All programs dealing with filenames need to be able to support an arbitrary stream of bytes, they can't just assume UTF-8.
1) Nope.
2) Yes, we need to keep backward compatibility.
What I'm saying is that promoting UTF-8 everywhere, without specifically stressing the fact that filesystems (in general) do not observe UTF-8, leads to API/library designs that lack good support there.
Path/filename/dirname/whatever should be a different kind of "string".
Backward compatibility is a laudable goal and is not to be broken lightly. But sometimes, things are so fundamentally broken that we would be far better off with a clean break.
Interoperability is quite possibly a good argument for coming up with some reasonable restrictions on filenames. Today you could easily create a ZIP file or similar (case-sensitive names, special characters, etc.) that cannot be successfully extracted on this platform or that.
In an excellent article, David A. Wheeler [1] lays out a compelling case against the status quo. TL;DR: bad filenames are too hard to handle correctly. Programs, standards, and operating systems already assume there are no bad filenames. Your programs will fail in numerous ways when they encounter bad filenames. Some of these failures are security problems.
He concludes: "In sum: It’d be far better if filenames were more limited so that they would be safer and easier to use. This would eliminate a whole class of errors and vulnerabilities in programs that “look correct” but subtly fail when unusual filenames are created (possibly by attackers)." He goes on to consider many ideas towards getting to this goal.
To me, that's a design flaw. Would we really be any worse off if we simply declared filenames must be UTF-8?
That seems to be the only case where a user-visible and user-editable field is allowed to be an arbitrary byte sequence, and its primary purpose seems to be allowing this argument to pop up on HN every month.
I've never seen any non-malicious use of it. All popular filesystems already disallow specific sets of ASCII characters in names. Any database which needs to save data in files by number has no problem using safe hex filenames.
Sure, we could declare that, but then what? Non-Unicode filenames won't suddenly disappear. Operating systems won't suddenly enforce Unicode. Filesystems will still allow non-Unicode names.
Simply declaring it doesn't help anybody. In the meantime, your application still needs to handle non-Unicode filenames, otherwise those malicious ones are free to be malicious.
I'd assume that the proper place to define what counts as a valid filename would be at the filesystem level, so a filesystem following standard ABC v123 would not allow non-Unicode names; non-Unicode filenames would either get refused or modified when copied or written to the filesystem.
This is not new; it would match the current behavior of the OS/filesystem enforcing other character restrictions, such as when writing a file name with an asterisk or colon to a FAT32 USB flash drive.
What is this C++ `narrow()/widen()` function mentioned in the Windows section? At the risk of asking to be spoonfed, can someone give the source code of a function that takes a UTF-8 `std::string` and gives a UTF-16 `std::wstring`?
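As far as I can tell, narrow()/widen() aren't standard library functions; they're small helpers the article expects you to write yourself (or take from a library such as Boost.Nowide) around the Win32 conversion APIs. A minimal widen() sketch using MultiByteToWideChar, not the article's exact code and with error handling reduced to a throw:

```cpp
#include <stdexcept>
#include <string>
#include <windows.h>

// Sketch of a widen() helper for Windows: convert a UTF-8 std::string to a
// UTF-16 std::wstring (wchar_t is 16-bit on Windows). MB_ERR_INVALID_CHARS
// makes invalid UTF-8 fail instead of being silently replaced.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);          // first call: measure
    if (len == 0) throw std::runtime_error("invalid UTF-8");
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], len);                // second call: convert
    return utf16;
}
```

The inverse narrow() is the same two-call dance with WideCharToMultiByte and CP_UTF8.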
> In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere
I think in the Unix world, null-terminated strings are the default. They don't even need to be valid UTF-8. For display purposes, the shell uses the locale setting.
I love the typesetting on the page. It is content-first, clean, and simple.
It lacks all the usual noise like modal dialogs, headers and footers, social media icons, colorful sidebars, newsletter sign-ups, cookie warnings, etc.
I'd be happy if I could just get consistent encoding. I have to handle way too many files with mixed encodings, even XML files with an explicit encoding header.
It is a pain in the ass to have a variable number of bytes per char.
In ASCII, you could easily know every character personally. No strange surprises.
Also no surprises while reading black on white text and suddenly being confronted with clors [1].
[1] Also no surprises when writing a comment on HN like this one and having some characters stripped. I put in a smiley as the first "o" in colors, but it was stripped out. Looks like the makers of HN don't like UTF-8 either.
You're conflating code points and some encoding; more importantly, you're conflating "an array of encoded objects (bytes)" with "a string of text". They're not, and never have been, the same.
> It is a pain in the ass to have a variable number of bytes per char.
Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.
And as the article points out, even then you might have more than one code point for a character.
> For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
You can't even write proper English in ASCII. ASCII is an absolute dead end. It's history.
Actually representing human language is HARD. It is also absolutely necessary. Whatever solution you choose is going to be complicated, because it is solving a very complicated problem.
Throwing your hands up and going "oh this is too hard, I don't like it" will get you nowhere.
ASCII doesn't have a direct representation of all the punctuation used in English print, like curly “66”/“99” quotes and the different kinds of dashes (distinct from the minus sign). For non-print, it's entirely fine.
Typesetting should be handled by a markup language anyway. Adding a few characters to Notepad doesn't create a typesetting system. A typesetting system needs to be able to do kerning, ligatures, justification. Not to mention bold, italics, and different fonts.
> It is a pain in the ass to have a variable number of bytes per char.
This comes from API and language mistakes more than from an issue with UTF-8 itself.
If you actually design your API and system around being UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules and still gives you things like a simple character iterator (with characters being 32-bit, so a code point actually fits: https://doc.rust-lang.org/std/char/index.html). The String type handles all the multi-byte stuff for you; you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html
Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.
For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is never a substring of the encoding of any other character or sequence of characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.
And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points, so having those code points be made up of a variable number of bytes doesn't make anything worse.
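A quick illustration of the parsing point: splitting on a (possibly multi-byte) delimiter with plain byte-wise search is safe, because lead bytes and continuation bytes live in disjoint ranges, so a delimiter's bytes can never match inside some other character's encoding. The helper below is mine, just a sketch:

```cpp
#include <string>
#include <vector>

// Sketch: split a UTF-8 string on a UTF-8 delimiter using plain byte-wise
// search. A byte-wise match can only start at a real character boundary and
// only on a real occurrence of the delimiter, so the result is the same as
// splitting on code points.
std::vector<std::string> split_utf8(const std::string& text,
                                    const std::string& delim) {
    std::vector<std::string> parts;
    std::size_t start = 0, pos;
    while ((pos = text.find(delim, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();
    }
    parts.push_back(text.substr(start));
    return parts;
}
// e.g. split_utf8("a→b→c", "→") yields {"a", "b", "c"},
// assuming the source and execution charsets are UTF-8.
```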
It's not as straightforward or sensible as you think. It's case-insensitive; it's case-preserving; and C0 control characters, SPC, and DEL are allowed. The case-differentiating bits for letters are nowadays sometimes used in an attempt to foil attackers. If you want things to look back on and say "I think that X was a mistake," then forget UTF of any stripe. The DNS is full of them.
This pops up every so often, and is wrong on several fronts (UNIX is UTF-8, UTF-8/32 lexicographically sort, etc.) There's not really a good reason to support UTF-8 over UTF-16; you can quibble over byte order (just pick one) and you can try and make an argument about everything being markup (it's not), but the fact is that UTF-16 is a more efficient encoding for the languages a plurality of people use natively.
But more broadly, being able to assume $encoding everywhere is unrealistic. Write your programs/whatevers allowing your users to be aware of and configure encodings. It might not be ideal, but such is life.
> There's not really a good reason to support UTF-8 over UTF-16
Two big reasons:
1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to support i18n doesn't require you to convert all your files that were in ASCII.
2. UTF-16 gives people the mistaken impression that characters are fixed-width instead of variable-width, and this causes things to break horribly on non-BMP data. I've seen amusing examples of this.
> Write your programs/whatevers allowing your users to be aware of and configure encodings.
Internally, your program should be using UTF-8 (or UTF-16 if you have to for legacy reasons), and you should convert from non-Unicode charsets as soon as possible. But if you're emitting stuff... you should try hard to make sure that UTF-8 is the only output charset you have to support. Letting people select non-UTF-8 charsets for output adds lots of complication (now you have to have error paths for characters that can't be emitted), and you need to have strong justification for why your code needs that complication.
> 1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to support i18n doesn't require you to convert all your files that were in ASCII.
Eh, realistically if you're doing this, you should be validating it like converting from one encoding to another anyway. I get that people won't and haven't, but that's because UTF-8 has this anti-feature where ASCII is compatible with it, and that's led to a lot of problems.
> 2. UTF-16 gives people the mistaken impression that characters are fixed-width instead of variable-width, and this causes things to break horribly on non-BMP data. I've seen amusing examples of this.
This is one of those problems, and it's way worse with UTF-8 because it encodes ASCII the same way ASCII does. It's let programmers stay naive about this stuff for... decades?
> Internally, your program should be using UTF-8 (or UTF-16 if you have to for legacy reasons), and you should convert from non-Unicode charsets as soon as possible.
There are all kinds of reasons to not use UTF-8. tialaramex pointed out one above. "UTF-8 everywhere" is simply unrealistic, and it forces a lot of applications to be slower, or to take on unnecessary complexity. Maybe it's worth it to "never have to think about encodings again", but that's pretty hard to verify and there's no way it happens in our lifetimes anyway.
> and you need to have strong justification for why your code needs that complication.
Yeah see, I strongly disagree with this. I'll choose whatever encoding I like, thanks. Maybe you don't mean to be super prescriptive here, but I think a little more consideration by UTF-8 advocates wouldn't hurt.
If everyone chooses whatever encoding they like, then the charset being used has to be recorded somewhere. The problem is, there are lots of places where the charset isn't recorded (such as your filesystem). That this is a problem can be missed, because almost all charsets are a strict superset of ASCII (in the top 99.99% of usage, UTF-7 and UTF-16 are the only ones that aren't), so it's only when you try your first non-ASCII characters that problems emerge.
Unicode has its share of issues, but at this point, Unicode is the standard for dealing with text, and all i18n-aware code is going to be built on Unicode internally. The only safe way to handle text that has even the remotest chance of being i18n-aware is to work with charsets that support all of Unicode, and given its compatibility with ASCII, UTF-8 is the most reasonable one to pick.
If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.
> If everyone chooses whatever encoding they like, then the charset being used has to be encoded somewhere.
This is gonna be the case for the foreseeable future, as you point out. Settling on one encoding only fixes this like, 100 years from now. I'd prefer to build encoding-aware software that solves this problem now.
> given its compatibility with ASCII, UTF-8 is the most reasonable one to pick
This only makes sense if your system is ASCII in the first place, and if you can't build encoding-aware software. I think we can both agree that's essentially legacy ASCII software, so you don't get to choose anything anyway. And any system that interacts with it should be encoding-aware and still validate the encoding anyway, as though it might be BIG5 or whatever. Assuming ASCII/UTF-8 is a bad idea, always and forever.
> If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.
I'm not obligated to write software for every possible user at every point in time. It's perfectly acceptable for me to say, "I'm writing this program for my 1 friend who speaks Spanish" and have that be my requirements. But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there. I'd have to build it to be encoding-aware, and let my users configure the encoding(s) it uses.
> But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there.
Actually, it does.
Right now, in 2020, if you're writing a new programming language, you can insist that the input files must be valid UTF-8 or it's a compiler error. If you're writing a localization tool, you can insist that the localization files be valid UTF-8 or it's an error. Even if you're writing a compiler for an existing language (e.g., C), it would not be unreasonable to say that the source file must be valid UTF-8 or it's an error--and let those not using UTF-8 right now handle it by converting their source code to use UTF-8. And this has been the case for a decade or so.
That's the point of UTF-8 everywhere: if you don't have legacy concerns [someone actively using a non-ASCII, non-UTF-8 charset that you have to support], force UTF-8 and be done with it. And if you do have legacy concerns, try to push people to using UTF-8 anyways (e.g., default to UTF-8).
I can't insist that other systems send your program UTF-8, or that the users' OS use UTF-8 for filenames and file contents, or that data in databases uses UTF-8, or that the UTF-8 you might get is always valid. The end result of all these things you're raising is "you can't assume, you have to check always, UTF-8 everywhere buys you nothing". Even if we did somehow get there, you'd still have to validate it.
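Validation, at least, is cheap and local either way. A sketch of a strict check (my code, not from the article or either commenter) that rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and values above U+10FFFF:

```cpp
#include <cstdint>
#include <string>

// Sketch of a strict UTF-8 validity check.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char b = s[i];
        std::size_t len;
        std::uint32_t cp;
        if      (b < 0x80)           { ++i; continue; }      // ASCII
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;            // stray continuation or invalid lead byte
        if (i + len > n) return false;                       // truncated
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;            // bad continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000)) return false;        // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // surrogate
        if (cp > 0x10FFFF) return false;                     // out of range
        i += len;
    }
    return true;
}
```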
> not really a good reason to support UTF-8 over UTF-16
Of course there is, the fact that if you're dealing only with ASCII characters then it's backwards-compatible. Which is a nice convenience in a great number of situations programmers encounter.
The minor details of encoding efficiency these days aren't particularly relevant: sure, UTF-16 is better for Chinese, but the average webpage usually has way more markup, CSS and JavaScript than text, and gzipping it on delivery will result in a similar payload pretty much independent of the encoding you choose.
UTF-8's ASCII compatibility is an anti-feature; it's allowed us to continue to use systems that are encoding naive (in practice ASCII-only). It's no substitute for creating encoding-aware programs, libraries, and systems.
The vast majority of text is not in HTML or XML, and there's no reason you can't use Chinese characters in JavaScript besides (your strings and variable/class/component/file names will surely outpace your use of keywords).
It's not an anti-feature, it's a benefit that is a huge asset in the real world. For example, you can be on a legacy ASCII system, inspect a modern UTF-8 file, and if it's in a Latin language then it will still be readable as opposed to gibberish. Yes all modern tools should be (and these days generally are) encoding-aware, but in the real world we're stuck with a lot of legacy tools too.
And of course the vast majority of transmitted digital text is in HTML and similar! What do you think it's in instead?
By sheer quantity of digital words consumed by the average person, it's news and social media delivered in browsers (HTML), followed by apps (still using HTML markup to a huge degree) and ebooks (ePub based on HTML). And of course plenty of JSON and XML wrapping too.
And of course you can use Chinese characters in JavaScript/JSON, but development teams are increasingly international and English is the de facto lingua franca.
That huge asset has become a liability. We always needed to become encoding-aware, but UTF-8's ASCII compatibility has let us delay it for decades, and caused exactly the confusion causing us to debate right now. So many engineers have been foiled by putting off learning about encodings. Joel Spolsky wrote an article, Atwood wrote an article, Python made a backwards incompatible change, etc. etc. etc.
To be honest, I'm just guessing about what text is stored in--I'll cop to it being very hard to prove. But my guess is the vast majority of text is in old binary formats, executables, log files, firmware, or in databases without markup. That's pretty much all your webpages right there.
n.b. JSON doesn't really fit the markup argument. The whole idea is that HTML is super noisy and the noise is 1 byte per character in UTF-8 versus 2 bytes in UTF-16. JSON isn't noisy, so the overhead is very low.
You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place. What exactly are we delaying for decades? Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day, dwarfing executables, firmware, etc. And if it supports any kind of formatting (bold/italics etc.) -- which most does -- then it's virtually always stored in HTML or similar (XML). I mean, what are even the alternatives? Neither RTF nor Markdown come even close in terms of adoption.
> You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place.
Totally agree.
> What exactly are we delaying for decades?
Learning how encodings work and using that knowledge to write encoding-aware software.
> Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
They do, but they're frequently foiled by on-disk encodings, filenames, internal string formats, network data, etc. etc. etc. All this stuff is outlined in TFA.
> And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day
I concede I'm not likely to convince you here, but like, do you think Twitter is storing markup in their persistence layer? I doubt it. And even if there is some formatting, we're talking about <b> here, not huge amounts of angle brackets.
But think about any car display. That's probably not markup. Think about ATMs. Log files. Bank records. Court records. Label makers. Airport signage. Road signage. University presses.
The reason most programmers use English in their source code has nothing to do with file size (for that, there are JS minifiers) or supported encodings. It comes down to two things: English is the most used language in the industry, so if you want to cooperate with programmers from other parts of the world, English is a good idea; and it frankly looks ugly to mix languages in the same file, so when the standard library is in English, your source code will be too.
So since most source code is in English (and JS is minified anyway), UTF-8 works perfectly there too.
I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if only for the little/big-endian mess alone, and the fact that UTF-16 isn't a fixed-length encoding either).
From that perspective, keeping the data in UTF-8 for most of its lifetime, including when it's loaded into a program, and only converting "at the last minute" when talking to underlying operating system APIs makes a lot of sense, except for some very specific application types which do heavy text processing.
I'm gonna do little quotes, but I don't mean to be passive-aggressive. It's just that this stuff comes up all the time.
> I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if just for the little/big endian mess alone...
This should be the responsibility of a string library internally, and if you're saving data to disk or sending it over the network, you should be serializing to a specific format. That format can be UTF-8, or it can be whatever, depending on your application's needs.
> and that UTF-16 isn't a fixed-length encoding either)
We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.
> keeping the data in UTF-8 for most of its lifetime also when loaded into a program, and only convert "at the last minute" when talking to underlying operating system APIs makes a lot of sense, except for some very specific application types which do heavy text processing.
Well, you're essentially saying "I know about your use case better than you do". It might be important to me to not blow space on UTF-8. But if my platform/libraries have bought into "UTF-8 everywhere" and don't give me knobs to configure the encoding, I have no recourse.
And that's the entire basis for this. It's "having to mess with encodings is worse than the application-specific benefits of being able to choose an encoding". I think that's... at best an impossible claim and at worst pretty arrogant. Again here I don't mean you, but this "UTF-8 everywhere" thing.
>We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.
Mistaking a variable-width encoding for a fixed-width one is specifically a UTF-16 problem. UTF-8 is so obviously not fixed-width that such an error could not happen by mistake, because even before the widespread use of emoji, multi-byte sequences were not in any way a corner case for UTF-8 text. (For additional reference, compare UTF-16 String APIs in Java/JavaScript/etc. with UTF-8 ones in, say, Rust and Go, and see which ones allow you to easily split a string where you shouldn't be able to, or access "half-chars" as a datatype called "char".)
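Here's a tiny demonstration of the trap with std::u16string (the same thing happens with Java's 16-bit char and JavaScript string indexing, where "😀".length is 2):

```cpp
#include <iostream>
#include <string>

int main() {
    // U+1F600 (grinning face) is outside the BMP, so in UTF-16 it is a
    // surrogate pair: 0xD83D 0xDE00. A "one character" emoji has size() == 2,
    // and s[0] is half a character.
    std::u16string s = u"\U0001F600";
    std::cout << s.size() << "\n";                                   // 2, not 1
    std::cout << std::hex << static_cast<unsigned>(s[0]) << "\n";    // d83d
    std::cout << std::hex << static_cast<unsigned>(s[1]) << "\n";    // de00
}
```

Any code that naively indexes, slices, or counts by code unit will pass its ASCII-and-BMP test cases and then quietly mangle that string.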
I mean, I think we're both in the realm of [citation needed] here. I would argue that people index into strings quite a lot; whether that's because we thought UCS-2 would be enough for anybody, or because UTF-8 == ASCII and "it's probably fine", is academic. The solution is the same though: don't index into strings, and don't assume an encoding until you've validated it. That makes any "advantage" UTF-8 has disappear.
If you really think no one made this mistake with UTF-8, just read up on Python 3.
The difference is that with UTF-8 you're much more likely to trip over those bugs in random testing. With UTF-16 you're likely to pass all your test cases if you didn't think to include a non-BMP character somewhere. Then someone feeds you an emoji character and you blow up.
Yeah, ASCII is such a powerful mental model that I think anyone working with Unicode made a lot of concessions to convert people, no argument there. But I think we need to say we're done with that and move on to phase 2. Here's what I advocate:
- Encodings should be configurable. Programmers get to decide what format their strings are internally, users get to decide what encoding programs use when dealing with filenames or saving data to disk, etc. Defaults matter, and we should employ smarts, but we should never say "I know best" and remove those knobs.
- Engineers need to internalize that "strings" conceal mountains of complexity (because written language is complex), and default to using libraries to manage them. We should start viewing manual string manipulation as an anti-pattern. There isn't an encoding out there that we can all standardize on that makes this untrue, again because written language is complex.
But is it really a plurality? Portuguese, English, Spanish, Turkish, Vietnamese, French, Indonesian and German are stored more efficiently in UTF-8, while Chinese, Korean and Japanese are stored less efficiently. My gut feel is that more people use the Latin script than use CJK scripts. Indic scripts, Thai, Cyrillic, etc. are stored using two bytes in both UTF-8 AND UTF-16.
Looking at the basic multilingual plane [1], UTF-8 will use > 2 bytes to encode essentially anything that isn't:
* ASCII/Latin
* Cyrillic
* Greek
* Most of Arabic
That leaves out:
* China
* India
* Japan
* Korea
* All of Southeast Asia
Re: markup, think about any text that's in a database, stored in RAM, or stored on a disk: relatively little of it will be in noisy ASCII markup formats like HTML or XML.
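For what it's worth, the size difference above is easy to check; a quick illustration (the sample string is mine, compiled as C++17 since C++20 changes the type of u8 literals):

```cpp
#include <iostream>
#include <string>

int main() {
    // Devanagari (like most Indic, CJK and Southeast Asian scripts in the BMP)
    // takes 3 bytes per code point in UTF-8 but one 16-bit unit in UTF-16.
    std::string    u8_text  = u8"नमस्ते";   // 6 code points, UTF-8
    std::u16string u16_text = u"नमस्ते";    // same 6 code points, UTF-16

    std::cout << "UTF-8:  " << u8_text.size() << " bytes\n";                      // 18
    std::cout << "UTF-16: " << u16_text.size() * sizeof(char16_t) << " bytes\n";  // 12
}
```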