But all mathematical symbols, except for styled math letters (which would have b...

johncolanduoni · on June 23, 2016

> How useful is it to standardize hieroglyphics, ancient greek musical notation, and emojis as standard text characters esp. without standardizing their screen representation?

In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation. There is semantic content associated with each of these symbols that persists even if there is significant variation in how they are presented. Of the ones you list, emojis are the only ones where this is any more a problematic approach than it is for letters in various alphabets. And as people who don't approve of Unicode adding emojis like to point out, emojis aren't that critical so having some loss in the translation isn't a huge deal.

Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased. The alternative to Unicode not standardizing them was to have a repeat of the OEM font gold rush in the SMP.

> styled math letters (which would have been equally well served by simply rendering them in italics or in a special math font)

That was my first reaction as well, but there are a few problems with that approach:

* Math italic characters look very different from normal italics, and are shaped and kerned very differently because they are commonly used for single-letter variables which will be juxtaposed together in expressions. If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.

* Many of the math letters and "letter-like symbols" have associated semantic content (like bold for vectors), which it makes sense to preserve. MathML alleviates this to a significant degree but I don't believe these codepoints were intended only for MathML usage.

* On the technical side, OpenType math fonts need to carry associated metadata for many of these characters. Putting them in separate fonts complicates this, since these tables need to refer to glyphs (general codepoints are unsuitable in a number of cases) and each font file would have a different glyph address space.

masklinn · on June 23, 2016

> Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased.

Things were way worse than that: to add emoji to text, NTT DoCoMo used private-use codepoints, AU used embedded image tags and Softbank wrapped emoji codes in SI/SO escape sequences.

pron · on June 23, 2016

> In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation.

I disagree. Say the name of a letter in any alphabet, and people will draw it in ways that are similar enough for automatic recognition. This is not true for pictograms and emojis.

> The alternative to Unicode not standardizing them was to have a repeat of the OEM font gold rush in the SMP.

I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.

> If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.

So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.

Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.

johncolanduoni · on June 23, 2016

> I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.

So you think instituting a system based on links not rotting would better preserve meaning? Not to mention that:

* Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.

* Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.

The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff. If the choice was "stick with UCS-2 and be totally fine language wise, or add more bits just for emojis and pictograms" I'd probably agree with you. If you think that's what happened, your timeline for this process is way off.

> So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.

Sure, but the different letter types carry crucial meaning in math formulas. "sup" in upright letters is the math operator supremum, "sup" in math italics is s * u * p. This kind of thing applies to every one of the mathematical letter variants.

> Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.

You're zooming way out on this one. Math symbols have a lot more in common with letters and "normal" symbols than "all specialized usage of human-readable data". Remember that when Unicode added the math symbols, things like MathJax were simply impossible. Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.

pron · on June 23, 2016

> Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.

It can still display a box.

> Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.

The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs. Their OS could provide this service for them just as it allows them the use of fonts.

> The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff.

That is a good point, one of which I was not aware, but I still don't think it justifies standardization of Chinese characters, emojis, and ancient Greek musical notation by the same standards body.

> Sure, but the different letter types carry crucial meaning in math formulas.

The use of italics in text may also carry crucial meaning. But if a textual representation as sup is supported, I don't see why the specialized rendering should be supported, too, but for math and not plain text.

> Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.

I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas convenient forms anyway and requires a specialized renderer anyway, I don't see the reason for this extra effort.

johncolanduoni · on June 23, 2016

> It can still display a box.

If it knows about your new codepoint. Everyone using an implementation that doesn't yet support it is going to show the full URL. If history repeats itself, these implementations will be the majority for at least a decade.

> The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs.

Except they need to do literally nothing different. Hieroglyphs are vectorized, and emojis are either bitmapped (which existing font formats already supported) or vectorized. None of this required a single line of code in any text shaping library to change. Text shaping libraries generally don't even need to understand Unicode categories or other metadata: font files already contain all the relevant information (directionality, combining mark, etc.).

> The use of italics in text may also carry crucial meaning. But if a textual representation as sup is supported, I don't see why the specialized rendering should be supported, too, but for math and not plain text.

If you scrub all bold and italics from text, do any of the words turn into different words? That's what happens with sup (or other named operators). Same thing for blackboard letters, fraktur, etc.

> I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas convenient forms anyway and requires a specialized renderer anyway, I don't see the reason for this extra effort.

My point is it was better than nothing, which was the alternative when Unicode added these. By adding mathematical characters that worked exactly the same as all other characters, we could get some of the advantages of MathML without needing everybody to implement a special math renderer or learn any new markup, just download new font files.

If getting everyone to accept MathML or something similar in an expedited fashion was a reasonable proposition, then I might agree that they should've kept it out of Unicode. Those were not the facts on the ground when this decision was made. Note that even now that we have MathML, the few browsers that support it (IIRC Firefox & Safari only) have complete shit implementations that look terrible.

The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards to get accepted and implemented is a Herculean effort. Unicode's expansion into these domains required nobody to do anything differently, let alone decide to up and write an implementation of a completely different standard. I think this is a case were our alternative was to let the perfect be the enemy of the good, or at least the working.

pron · on June 23, 2016

> That's what happens with sup (or other named operators).

And that's what happens when scrubbing subscripts or binomial coefficients. When you want to represent math as text, you need to change your representation (add multiplication signs, forgo subscripts, use confusing parentheses etc.). This is still true with Unicode. The contribution of the non-BMP special math characters is quite minimal.

> The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards to get accepted and implemented is a Herculean effort.

Sure, but the emoji craziness continues, and there's little sign it would ever stop. Instead of saying "this isn't text; if you want, call it 'special text', escape it, and let a different body standardize it", the body entrusted with standardizing text representation worries about how to represent a picture of two people kissing as text. What next? Kids would want to add tunes to their text messages. Would the Unicode Consortium add code points for MIDI? And maybe managers would want to standardize code points for organizational diagrams. Would that be the consortium’s responsibility, too? The BMP contains all the characters for reasonable text-art.

johncolanduoni · on June 23, 2016

Unicode has added 1791 emojis[1]. Note that it has used significantly fewer codepoints than this for emojis, because some (e.g. flags) are done via a small number of combining marks (flags use 26 code points for letters in country codes). They've been slowing down the rate of additional emojis since they started doing this. Do you really think they'll accelerate at some point and use up the almost 1 million unassigned code points Unicode has left? Not to mention the fact that many UTF8/UTF16 implementations are already fine with full 32-bit code points, instead of the 21 bits Unicode I'm using above (they have said they'll only ever use 21 bits, but the option is there).

Suffice it to say this getting "out of control" and eating up the remaining space in our lifetimes, or our children's lifetimes would be pretty impressive. This means that the worst we have to fear is more boxes, assuming OSes don't keep up with their fallback fonts. Saying adding more symbols-designed-to-be-just-kind-of-placed-in-line-with-other-symbols-like-they-always-have-been is the first step towards MIDI and organization diagrams is like saying you're vegetarian because it's a slippery slope from eating meat to eating people.

Also note that unlike every other excess of the Unicode standard, these would require massive changes to the code that handles text. This means that if Unicode decided to do this, you wouldn't have to worry about negative effects because nobody would implement it.

[1]: http://unicode.org/emoji/charts/full-emoji-list.html

pron · on June 23, 2016

I'm not afraid of running out of codepoints. I just think it is misguided that non-text is standardized as text. It just doesn't make sense. I don't know how people will communicate 50 years from now, but I think it's funny that even then, text strings would still need to support all those vegetables and hand gestures. Written text is pretty much eternal; emojis and pictograms? I doubt it. It doesn't make sense for them to get a similar treatment.

macintux · on June 23, 2016

On a long-enough time scale, emojis and pictograms is what we're relying on[1] for nuclear waste warnings.

1: http://large.stanford.edu/courses/2011/ph241/dunn2/