It was just a tweak to emoji characters to mark them all as East Asian Full Width instead of Narrow or Ambiguous so that they displayed correctly when using a fixed-width font in a terminal console. This probably only matters if you like to use emoji filenames (you mad person), but it felt like a wart, so I reported it & had a short back-and-forth with the chair of the emoji-related subcommittee, which resulted in a proposal that was eventually accepted by the committee into Unicode 9.0. The committee were great: they took my tiny bug report seriously, wrote huge long treatises to justify the change & eventually voted it into the standard.
(This was pretty much my peak geek achievement of 2016 so far :) )
Holy crap I appreciate this change! I thought they'd never fix it because of compatibility. Thanks for the effort you put in.
It's not that I use emoji filenames, it's that I deal with real-world natural language text all the time, including at the console.
(In terms of compatibility, my text-justifying function is going to stop working correctly for the period of time between when gnome-terminal updates to Unicode 9 and when Python 3.x does. Still worth it.)
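(For anyone wondering what such a function does under the hood: a minimal Python 3 sketch of width computation from the East Asian Width property - a toy, since real code also has to handle combining marks and decide what to do with the Ambiguous class:)

    import unicodedata

    def display_width(s):
        # 'W' (Wide) and 'F' (Fullwidth) occupy two terminal columns;
        # everything else is approximated as one column here.
        return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
                   for c in s)

    print(display_width("abc"))    # 3
    print(display_width("日本語"))  # 6
    # Pre-Unicode-9 tables classify most emoji as Narrow/Ambiguous,
    # so this undercounts them until the data is updated.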
If they get their Unicode data from the same OS-supplied source that provides the wcwidth() function (or are using wcwidth() themselves), a libc update should fix both, I think.
They don't. Python's "unicodedata" module updates with the minor version of Python.
This is good, actually, because the meaning of a string operation should be consistent when run on the same version of Python.
(If only this applied to the "default encoding". The default encoding should be UTF-8, not whatever you get by asking the user's likely-misconfigured locale. As it is, you can't rely on the default encoding if you want your code to work consistently.)
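(Concretely - "data.txt" here is just a made-up example file:)

    # Locale-dependent: decodes as whatever the user's locale happens
    # to say (cp1252, latin-1, ...), so behaviour varies per machine.
    # text = open("data.txt").read()

    # Consistent everywhere: pin the encoding explicitly.
    with open("data.txt", encoding="utf-8") as f:
        text = f.read()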
Ideally yes. I'd have to confirm that this extends to Pizza and Koalas. We fixed that issue as much as we could, even going as far as generating our own Unicode width tables extracted from unifont, but it wasn't possible to fix in general without support from the terminals. Now that the standard is fixed (hopefully), I don't see a reason why the terminals wouldn't update their tables.
The success of the unicodepowersymbol proposal inspired me to suggest a couple of characters to Unicode (the Bitcoin sign and IBM's group mark from 1960s mainframes, which were accepted). The point is that Unicode really is open to proposals from random people; you don't need to be part of a big company to influence Unicode.
Although I had never worked with the Unicode Consortium, I [submitted a proposal][1] for an international symbol for an observer and it was eventually accepted.
It's one thing to have absolute, iron-fisted control over your own platform - it's another to intentionally seek to limit people's self-expression on other platforms by influencing the standard in this way.
Are you saying that a company with voting rights shouldn't be allowed to have influence on what goes into Unicode? That doesn't make any sense. Also, if you read the article, Apple wasn't the only party in favor of nixing the emoji.
> Are you saying that a company with voting rights shouldn't be allowed to have influence on what goes into Unicode?
One might have a right to do something and yet be wrong to do it.
Apple had every right to do what they did, but they were completely, totally and undeniably in the wrong to do it. Everyone associated with their action should be ashamed. Honestly, they should all resign: their behaviour demonstrates that they have no business being associated with this sort of work.
Your comment is ridiculously extreme. They were not "undeniably" in the wrong. You think they were wrong, but that is a highly subjective opinion. In fact, I don't even agree that they were wrong to do this at all. I think it's perfectly reasonable to argue against the inclusion of more gun imagery in Unicode.
Also, if you think Apple was wrong, you must also think that Microsoft was (they voiced support), and everyone else at the meeting who agreed with the move. As the article says,
> Davis confirmed to BBC News on Monday that "there was consensus to remove" the emojis, but that he couldn't comment on the details.
So it's clearly not just Apple that thought this was the appropriate move.
Thank goodness for apple and other members which rejected the starter pistol/rifle proposal. Imagine how many mass shootings we've prevented by people not being able to communicate their plans using the rifle emoji.
Just like how North Korea isn't "undeniably" wrong in censoring speech to such an extreme. This is a very 1984-esque "solution" to a problem -- don't want to acknowledge positive use of guns? Good news, we can just erase them from our language! The way we treat emoji has some very serious similarities to Newspeak.
That's absurd. Nobody's censoring anything here. Apple not wanting to add a new gun emoji is in no way preventing you from talking about guns. Emoji isn't a replacement for English and nobody is forcing you to "speak" in all emoji.
An emoji cannot be an emoji without having emoji presentation. There is a rifle character that resembles, in form and function, the other Olympic emoji; it's just not sitting in the set reserved for emoji. Any platform that wants to pick it up can use it, and they can even put it next to all their other emoji if they want to. Apple et al. did not 'prevent' Rifle from being encoded. They just decided they wouldn't pick it up as an emoji for their platform, and the Unicode Consortium decided to move it to a different category based on that decision.
There are literally thousands of characters that Apple doesn't encode for their platforms. I haven't heard whether Apple will be supporting:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
The Bravanese dialect of Swahili, used in Somalia
The Warsh orthography for Arabic, used in North and West Africa
Tangut, a major historic script of China
All of which were added to Unicode 9.
>The two characters will still be part of the Unicode spec, but they'll be classed as black-and-white "symbols" instead of regular emoji
I assume they reserve the unicode character, but anyone who wants to use it decides what it looks like (so the rifle could look different on different platforms, which isn't a big issue)
This is always true for Emoji. The platforms always decide what their presentation of emoji will look like, just as they determine what font they will use for the unicode traditional letters. In this case, Apple and others decided they didn't want a rifle emoji, so it was moved to the 'black and white symbols' section so that those platforms wouldn't be 'missing' an emoji, which has other technical implications.
Heck, Unicode is a mess already. But that's mostly because (a) language and scripts are messy, and (b) the original aim was to unify and encompass all existing character sets, and some of those were messy as well.
While there was considerable uproar over Emoji and there still often is over yet another fifteen symbols that everyone thinks no one would ever need or use, the bulk of the Unicode character set is still scripts for human languages. And some of those are only relevant for a very small minority, say, archaeologists. But that's fine. There's enough space, we're nowhere near to running out and Unicode enabled all sorts of cool things in computing that simply were not possible before or only with awful hacks and workarounds.
A text file, a webpage, or a database table can only contain textual data in a given encoding.
That's because every byte stored in the file, for example byte number 188, either means "¼" (as it does in ISO/IEC 8859-1, aka. Latin-1 or ANSI), or it means "ỳ" (as in ISO/IEC 8859-14) or "シ" (in JIS X 0201, one of the many Japanese encodings that were devised over the years.)
How do you know which encoding a certain file uses? In general YOU CAN'T and this was the source of many problems and "solutions" which caused even more problems over the years.
Well then, how did you mix symbols from different alphabets, say in a dictionary or in a post that talks about them, like this very post? YOU COULDN'T, short of doing ugly hacks and other subterfuges, like using GIFs for all foreign characters.
Unicode gave a distinct number (or "codepoint") to every character and symbol known to man (within reasonable limits), and this allowed a lot of things that we take for granted nowadays, including this very post, where I just copied and pasted various symbols from their Wikipedia pages and expect them to work.
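(You can watch the byte-188 story above play out in any Python 3, which bundles codecs for all of these legacy encodings:)

    b = bytes([188])               # byte number 188, i.e. 0xBC
    print(b.decode("latin-1"))     # ¼ (ISO/IEC 8859-1)
    print(b.decode("iso8859-14"))  # ỳ (ISO/IEC 8859-14)
    print(b.decode("shift_jis"))   # ｼ (the JIS X 0201 half-width katakana range)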
Fun fact: Japanese even has a word for foreign characters, "gaiji", and it's extremely common in Japanese ePUBs to use small square images for characters not in the current font, with the term "gaiji" in the CSS class names used for these characters. And at least one mainstream ePUB reader has special code to detect these gaiji and adjust its rendering to make them behave better.
Similarly, the word "loanword" (used to describe words borrowed from other languages) is gairaigo 外来語 which is literally 外来 (gairai) "foreign" + 語 (go) "word/language". Japanese is filled with words where you can often figure out the meaning purely from the characters used!
Conversion between different character codes became much easier. (Å (U+00C5) is canonically the same as Å (U+212B), but the latter exists only for round-trip compatibility; see the sketch below.)
Unicode defines normalization algorithms (is é its own character, or e with a modifier character?).
I can have a document which combines English, Russian, Arabic, and Chinese, and expect it to be readable and editable by many different tools.
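(The normalization and round-trip points are easy to demonstrate with Python's unicodedata module:)

    import unicodedata

    s1 = "\u00e9"    # é as a single precomposed code point
    s2 = "e\u0301"   # e + COMBINING ACUTE ACCENT
    print(s1 == s2)                                  # False
    print(unicodedata.normalize("NFC", s2) == s1)    # True

    # U+212B ANGSTROM SIGN is canonically equivalent to U+00C5 Å,
    # so normalization folds the round-trip duplicate away:
    print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")  # True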
> I can have a document which combines English, Russian, Arabic, and Chinese
English combines with top-down Chinese, and English combines with right-to-left Arabic, but top-down Chinese and right-to-left Arabic don't combine properly in the same document using Unicode -- the Arabic will be written bottom-up instead of top-down when embedded in the top-down Chinese.
I meant something simpler, like: The word 'computer' in English is 计算者 [jì suàn zhě] in Chinese, Компьютер in Russian, and حاسوب in Arabic."
Try that without Unicode.
It's of course possible with TeX, and no doubt other solutions. Which is why I added "and expect it to be readable and editable by many different tools".
(As a real-world use case, look at Knuth's "The Art of Computer Programming" and see how he credits people using their full names, in their own written language.)
Do you know when you would use 计算者 over 电脑 [dian nao]?
计算者 I guess more literally translates to "one who computes", whereas 电脑 translates to "electric brain", which is a way more fun image, but I have no idea how the usage varies.
I am not a native speaker and terrible at googling this, but I once heard that 電腦 was originally the Taiwanese way of saying it and 計算機 the Chinese way, in the same way that 'program' is still called 程式 on one side of the strait and 程序 on the other, or 'internet' is either 網絡 or 網路. The names of movies and video games are also generally different (PRC, HK, TW).
计算机 [jisuanji] is the older usage that harkens back to the days when computers were mainframes and terminals and not in every household. Generally everyone says 电脑 [diannao] these days.
Someone I know worked at a company that sells a content management system with version control. Bigcos use it to keep and update lots of marketing materials and manuals. Those responsible for product x or area y can see what's been added by others, and update their own versions, and record that.
In this way, new FAQs and other updates spread to all the markets easily.
It's vastly easier to do this stuff when all the documents use the same text encoding. Even if no one can read everything, the fact that everything uses the same encoding means that any pair of languages you can read are technically readable.
You don't even need to be part of a big company to go to their meetings. I went to their conference in San Jose just to see the proposal for the Chinese take-out box/chopsticks/fortune cookie emojis. I met a ton of nice people, too.
How did the Unicode Consortium turn around? I remember 10 years ago they were refusing to add standard media icons because:
>The scope of the Unicode Standard (and ISO/IEC 10646) does not extend to encoding every symbol or sign that bears meaning in the world.
>This list has been round and round and round on this -- regular as clockwork, about once a year, the topic comes up again. And I see no indication that the UTC or WG2 are any closer to concluding that bunches of icons should start being included in the character encoding standards simply on the basis of their being widespread and recognizable icons.
>Where is the defensible line between "Fast Forward" and "Women's Restroom" or "Right Lane Merge Ahead" or "Danger Crocodiles No Swimming"?
Unicode is supposed to include symbols that appear in "running text", not standalone icons. So no on traffic signs for instance. (There are exceptions for historical reasons. And emoji are a totally separate story.)
In all seriousness, I'm not sure emoji really belong in text encoding. Even though it's more convenient, based on where they're most frequently used I don't think they need to be universal.
1) everybody uses them on their phones, they're in Unicode, consistent and compatible between devices and messaging programs. In the far flung future, researchers will be able to study their linguistic role in communication, confident in understanding what the characters were.
2) everybody uses them on their phones, they're proprietary fonts and codepoints (in the Unicode private use area if you're lucky, just random data if you're not), there's no consistency between phone models, manufacturers, or cell networks. Future researchers can pound sand.
We were at #2 pre-Unicode. It was a goddamn mess, especially in Japan. Lord knows why anyone would prefer it. There's no value in being a snob about what kinds of incredibly frequently used characters we think are Worthy of inclusion, imo.
3) People who love colourful images will use stickers in Facebook Messenger, LINE, Viber, and soon iMessage. I'm sure WeChat has them too.
It's basically like 2), except we've moved from proprietary codepoints to proprietary protocols.
I don't mind characters like and or even good old ︎ (which has always been too tiny for its own good). These work in black and white, in different artistic styles, and they're a fairly limited set.
But now we're going down the road where we get new stuff like tacos and unicorns every year. And even though Unicode is an industry standard, the pictures need to look like Apple's bitmaps to avoid confusion, and the Unicode standard changes so often that you have to manually keep track of who can already see and whose computer/phone/browser/messenger software is too old.
Oops, thanks. Well, that explains why I've never seen Emoji on Hacker News. And I've missed the edit window, so I can't fix my post.
Should have been:
> I don't mind characters like ((yellow smiling Emoji)) and ((thumbs up Emoji)) or even good old ︎((pre-Emoji Unicode smiley)) (which has always been too tiny for its own good). These work in black and white, in different artistic styles, and they're a fairly limited set.
> But now we're going down the road where we get new stuff like tacos and unicorns every year. And even though Unicode is an industry standard, the pictures need to look like Apple's bitmaps to avoid confusion, and the Unicode standard changes so often that you have to manually keep track of who can already see ((upside-down smiling Emoji)) and whose computer/phone/browser/messenger software is too old.
The "universal" in Unicode means that it aspires to include all symbols used in any form of text; not that it should only include symbols that are used in all forms of text.
Emoji were added to Unicode for compatibility with various mobile phones, so they would have a standard encoding. That's how Unicode ended up with the poop emoji for example - they didn't sit around thinking "what we really need is...". Since people really, really want more emoji, Unicode is sort of stuck constantly adding more. If you want to propose new emoji, the rules are at http://www.unicode.org/emoji/selection.html
Text symbols (as opposed to emoji) have different rules. Basically, the symbol needs to be used in "running text" (i.e. normal text), like "containers with [recycling symbol] can be recycled" or "he bid 2[club]". Traffic signs for example are not normally used in the middle of text, so they aren't encoded in Unicode. To get the Bitcoin symbol encoded, I needed to show that it was used in text, not just as a standalone icon. The full rules for symbols in Unicode are at http://www.unicode.org/pending/symbol-guidelines.html
TL;DR: Don't argue "Why does Unicode have a poop emoji but no symbol for X?" - the rules are totally different for emoji and symbols.
Edit: does HN strip out arbitrary Unicode characters now? I originally had Unicode characters in place of [recycling symbol] and [club], but they disappeared when I submitted.
IIRC HN might have a character whitelist to prevent overloads of combining or layout-altering characters, and not have to worry about the behavior of newly added characters. There were some comment threads a few years ago that were just stacks of hundreds of combining diacritics that would crash some rendering engines and create odd decorated text on others.
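(The effect is trivial to reproduce - a sketch of the kind of stacked-diacritic "zalgo" text those threads were full of, in any Python 3:)

    import random

    # The Combining Diacritical Marks block, U+0300 through U+036F.
    COMBINING = [chr(cp) for cp in range(0x0300, 0x0370)]

    def zalgo(text, depth=15):
        # Pile `depth` random combining marks onto every character.
        return "".join(c + "".join(random.choice(COMBINING) for _ in range(depth))
                       for c in text)

    print(zalgo("hello"))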
Tom Scott has a great video about the history of emoji. Basically, some companies in far east countries were encoding these icons in various proprietary codes for their messaging systems. The ecosystem became widespread and consistent enough that the unicode consortium saw it fit to include these emoji in the standard.
The snowman, on the other hand, is a weather symbol for snow, I assume. It appears alongside other symbols for meteorological phenomena, so I imagine was added around the same time and with similar reasoning: http://www.fileformat.info/info/unicode/block/miscellaneous_...
I'm not sure about the 'running text' thing, but in my view Traffic Signs are not globally universal (yet), so you'd have to have regional variants which is impractical.
Honestly, I think "SVG over UTF" makes a lot more sense. It's impossible to make a character set that supports every character known to man, because that just adds undue effort on every computer maker, etc., to keep up.
So why don't we pick a very good set: perhaps every letter in every language in common use for the past 200 years? Then, for the oddball symbols that someone wants to mix in text, there can be some kind of SVG-like convention. This allows publishing textual information without requiring that every device maker updates their device to support a 1-off symbol.
> This allows publishing textual information without requiring that every device maker updates their device to support a 1-off symbol.
The main purpose of Unicode is to encode the information. How the information is turned into its visual counterpart is outside the scope of unicode. For what it's worth this could be done by linking unicode code points to matching SVGs in a document. Wait, exactly that is already a W3C standard: https://www.w3.org/TR/SVG/fonts.html
Because it's easier to throw in random icons than to actually accomplish the goal of "every letter in every language in common use for the past 200 years", or even "past 20 years".
Or, put another way:
'We have an unambiguous, cross-platform way to represent “PILE OF POO” (), while we’re still debating which of the 1.2 billion native Chinese speakers deserve to spell their own names correctly.'
This is a link by the article's author that is intended to make it easier for us to add useful symbols: https://github.com/jloughry/Unicode
I recommend you use it to add any glyphs that you feel are being neglected.
That article raises an interesting issue about a character in the author's name that is missing from Unicode. Unfortunately the article is (how to put this?) not constructive. The complex reasons that Unicode excluded the character are described in [1]. If the author addresses those issues, there's a much better chance to get the desired character into Unicode.
Correct me if I'm wrong, but isn't the Han Unification project more about unifying semantically distinct, but visually identical characters under the same codepoint (rather than grouping together similar-looking codepoints as the article suggests)? As far as I'm aware it's more along the lines of reusing the codepoint for 'a' when encoding both English and Spanish text. Am I mistaken in thinking this?
But if the shape itself is embedded in the text, font choice becomes meaningless.
> undue effort on every computer maker, ect, to keep up.
The effort to update the font files every few years? Unless you insist on supporting a new Unicode version the second it comes out, I don't see the big effort here? Of course there is effort for font makers, but this is quite centralised.
What about the oddest oddballs whose "symbols" are animations http://www.reactiongifs.com/r/tww.gif? They are used a lot on reddit sometimes even with sound.
As the story mentions regarding the off symbol (a circle), there are many visually identical code points that have different semantic meanings. But in this case, they added an additional semantic meaning to an existing code point.
So which is it? Does each code point represent a visual image? A semantic meaning? Both? It depends? Something else?
I've tried to decipher that on my own and only learned that the answer to these sorts of questions are complicated, because it's very complicated to represent all written human language via one set of rules.
So I know some of the answers to my questions above, but I'm hoping someone with real expertise can provide the fundamental rules/policies - if there are any.
"every code point represents a semantic meaning" is completely consistent with the notion that some code points e.g. have differing visual representation depending on their position in the word.
Try looking up han-unification and its justification and you'll see the exact opposite approach to encoding characters into unicode.
For CJK characters, they unified all semantically similar han-characters, even when they have visual forms that are quite different between Japanese, Chinese and Korean.
If you want to write Japanese and Chinese in the same document, you need to mark up the section to tell the system that renders it, to render different visual forms for similar codepoints depending on whether they are used in Japanese or Chinese.
> For CJK characters, they unified all semantically similar han-characters, even when they have visual forms that are quite different between Japanese, Chinese and Korean.
This isn't true. 青 and 靑 are the same character written differently; they have their own codepoints. Ditto for a huge number of simplified Chinese characters; 语 is mainland Chinese and 語 is the same character in Japanese.
It is true for lots of characters (so I guess I was being a little hyperbolic when I said "all"), and you cannot rely on choosing the correct code points in order to have a text display Japanese or Chinese. You need to tell your rendering program (often through choice of font) if things are to be rendered with Japanese or Chinese forms.
I wouldn't know how to show you examples here, as 直 and 直 will display the same since they have the same code point, but a different number of strokes in Japanese and Chinese.
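(The most you can show is that there is nothing to show - the distinction simply doesn't exist at the encoding level:)

    # Whether 直 is drawn with the Japanese or the Chinese stroke shape
    # is decided by the font; the text itself can't tell them apart.
    print(hex(ord("直")))   # 0x76f4, the single unified code point
    print("直" == "直")      # True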
Aren't they putting the disunified characters into the U+2xxxx plane now?
Han unification is generally seen as a bad choice in retrospect, but it was something Unicode had to do when it looked like 2^16 codepoints were all they were going to get.
Never heard of that, but I would appreciate if all the characters with different glyphs had different codepoints. Do you have a source? Do you know what happens to the "unified" code-points?
It is true to some extent. While 青 and 靑 have different codepoints, there are plenty of characters with the same codepoint that are rendered differently depending on the language specified:
Han characters that are traditionally viewed as variants of one another, or that are simplified from more complex logograms (such as 龜, which was simplified into 亀 in Japan and 龟 in mainland China) tend to have different codepoints, but the stylistically different ones usually belong to the same codepoint.
Han Unification "rules" were an inconsistent mess, but I do know that in Japanese 靑 was at one time a printer's simplification of 青, so you could find either in texts, and the Consortium tended to encode a character separately if you could find printed examples of both in the same language.
That is not the compromise struck, though; there are even many Cyrillic glyphs that are visually identical to those in Latin, but assigned differing codepoints.
There are multiple reasons for that, one of which is compatibility with previous encodings and standards. If a previous encoding Unicode wanted to be compatible with encoded these as different characters, Unicode needs these to have separate code points for them too.
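(This is why visually identical strings can compare unequal - the classic homoglyph situation:)

    import unicodedata

    latin = "A"          # U+0041
    cyrillic = "\u0410"  # А
    print(latin == cyrillic)           # False
    print(unicodedata.name(latin))     # LATIN CAPITAL LETTER A
    print(unicodedata.name(cyrillic))  # CYRILLIC CAPITAL LETTER A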
But how serious is the problem? How many times do you need to test whether a given character is one of the 26 allowed letters of the English alphabet, and how often do you implement that by testing it against a contiguous range?
Typically you write it as "islower_english(c)" once, and be done with it. Is that really hard?
If you do think that's a serious problem, then what of those programmers who need to test for lowercase letters in "España", "München", "Diyarbakır", and "façade"?
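(In Python 3 the contrast is easy to see: the naive range test fails on all of those letters, while str.islower(), which consults the Unicode database, handles them:)

    def islower_english(c):
        return "a" <= c <= "z"   # the range test under discussion

    for c in "ñüıç":
        print(c, islower_english(c), c.islower())
    # ñ False True
    # ü False True
    # ı False True
    # ç False True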
I mean, if you really want to get into it, it was a huge pain in the ass at a time when paying the cost of a call to islower_english was much more expensive than a hardware less-than instruction.
We've broadly moved beyond that, but there's still value in grouping sets together in a way that makes certain kinds of frequent tests less computationally expensive than they would be if codepoints were randomly distributed.
I didn't dispute that. I'm just stating that trying to remain compatible for the sake of being compatible is a great way to design a convoluted and difficult-to-understand standard.
Of course, but lacking backward compatibility is a great way to make sure a standard is not adopted. For example, the reason that UTF-8 'won' is that it has a great backward-compatibility story with other ASCII-based encodings and systems.
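(The whole compatibility story in two lines: any pure-ASCII text is already valid UTF-8, byte for byte, and non-ASCII characters can never be mistaken for ASCII:)

    s = "plain ASCII"
    print(s.encode("ascii") == s.encode("utf-8"))  # True

    # Non-ASCII characters become multi-byte sequences in which every
    # byte is >= 0x80, so they never collide with ASCII bytes:
    print("é".encode("utf-8"))                     # b'\xc3\xa9'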
A pointer would let you ask "does a character lie between 0x12 and 0xBC" but would not hold the character itself; it would make possible to implement different "characters" with the same representation.
Perhaps Unicode could just have tables listing all relevant sequences of symbols, instead. So "latin letters lowercase" would list the codepoints for a-z in order, for example. Would no longer matter if the codepoints themselves are sequential or not.
(And relevant to my country, "Swedish characters lowercase" would map to latin letters lowercase + åäö.)
Characters have a script associated with them (e.g. Latin), and caseness is also part of a character's properties.
Now, language-specific subsets¹ of those are a bit iffy to deal with. Especially when text can contain loan words from other languages, so in my experience it's rarely a useful thing to ask for.
¹ Yes, subsets. Latin letters lowercase is not the set abcdefghijklmnopqrstuvwxyz. It is the far larger set of every lowercase letter belonging to the Latin script - ä, é, ø, and hundreds more.
How do you condense that again into language-specific subsets? Every letter that appears in a word in a dictionary? Then at least é belongs to German as well, even though it's usually not considered part of the German Latin subset. Unicode stays clear of that issue by simply not defining what script subsets a character belongs to (rightfully so, IMHO).
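(The standard library gives a rough feel for how large that footnote's set really is - a heuristic using character names, since the stdlib doesn't expose the Script property directly:)

    import sys
    import unicodedata

    latin_lower = [
        chr(cp) for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Ll"
        and unicodedata.name(chr(cp), "").startswith("LATIN")
    ]
    print(len(latin_lower))   # several hundred letters, not 26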
Yes, but they have alternate italic forms, for example. Sure, some glyphs like с don't have an alternate italic form. Since the other ones do, it would be weird to only assign a separate codepoint to some of them and overlap the others. It would be a workable solution, but still weird.
Which is a bad idea because the characters don't look right unless you use a Japanese font. But if you want to write an article comparing Japanese and Chinese characters, you have to use two different fonts.
The emoji code points can be represented differently on different systems given their meaning.
So it makes sense to have different emojis for different 'meanings'.
The 'moon' switch here does not mean 'moon' - it means 'standby' or whatever.
It may look noticeably different on different systems.
Think from a design perspective: you have 5 emojis to represent 'clouds, sky, earth', etc. - and a different set of 5 to represent 'on, off, sleep, shutdown'. Those icons will be markedly different in terms of representation, groupings, colour coding, and underlying functionality if they are integrated into an experience in any meaningful way.
Text your car with the 'shutdown' symbol to tell it to shut down.
Your bot texts your friend with a moon symbol to tell him you're asleep. Or whatever.
But that doesn't explain the inconsistency in the current case.
So if a system wants to render "on" differently than "straight vertical line", that's possible.
However, if "off" should be rendered differently than "circle", that's not possible. (Or only possible with out-of-band information or modifier characters which would still have to be defined)
It's a mess. If you want to write a document in Japanese that talks about a Chinese character which is written differently than its Japanese version, you can, or can't, achieve this in Unicode, depending on the character, its history, and the mood of the consortium the day it was assigned.
The reality is that Unicode is governed by people, some of those people are grumpy reductionists who push for a minimum of symbols and a maximum of meaning-overloads, and others are more liberal and tend to advocate the opposite, and the result is a compromise, and is in areas very messy.
Do note that they did not include a generic "off" symbol, they included the IEEE 1621 off symbol - which must be rendered as a circle; while on the other hand the IEEE 1621 on symbol must be rendered in a manner that is often different from just "straight vertical line" in particular regarding the corners of that line.
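(For reference, the assignments in question - running this needs a Python whose unicodedata tables are at least Unicode 9.0, and note that U+2B58 is the pre-existing code point that merely gained the "power off" meaning:)

    import unicodedata

    for cp in (0x23FB, 0x23FC, 0x23FD, 0x2B58):
        print("U+{:04X}".format(cp), unicodedata.name(chr(cp), "(unknown here)"))
    # U+23FB POWER SYMBOL
    # U+23FC POWER ON-OFF SYMBOL
    # U+23FD POWER ON SYMBOL
    # U+2B58 HEAVY CIRCLE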
I didn't see special Unicode symbols for the SI units in the link you gave or in a more general search. I found no match for "ampere" which is the SI base unit for electric current, or for "candela", used for luminous intensity.
> Unicode also has encoded U+212B Å ANGSTROM SIGN. However, that is canonically equivalent to the ordinary letter Å. The duplicate encoding at U+212B is due to round-trip mapping compatibility with an East-Asian character encoding, but is otherwise not to be used.
I'm a bit confused about Unicode. I thought it was a repository of linguistic symbols, not arbitrary symbols. More and more it looks like Wingdings. Isn't this putting a burden on font support and text processing (what's the lexicographic order of such symbols - do you use the abstract name?)?
They want every symbol used in a document to have a unique encoding, so that you can change fonts without losing meaning. Fonts like wingdings are a horrible hack.
The idea is one (complex) encoding that will represent the info until the end of time. It creates a lot of trouble, but it's still a good idea.
Technically, glyphs are supposed to meet some standards, like being shown in use in running text, before they can be added to unicode. It's not supposed to be a repository of every picture anyone ever dreamed up.
The standards are not applied consistently. Even leaving emoji out of it, the chinese "character" 囍 never occurs in running text, but there it is in unicode.
I don't think it's true that 囍 never occurs in running text - it's used in company names which would be used in text. It would be odd not to have an encoding for such a common character.
All right, I spent some time trying to find the requirement. I did not find it, but my tentative conclusion is that it does not apply to chinese characters.
FROM MEMORY, a while back there was an article on HN complaining that emoji seemed to magically bypass the requirements other characters needed to meet for inclusion in unicode, and that in fact they were commonly in violation. The taco symbol was called out as an example. I can no longer find this article, but it mentioned the running text requirement, and -- I believe -- specifically indicated that use in names does not count as use in running text. (For an idea of why that might be the case, check out http://tvtropes.org/pmwiki/pmwiki.php/Main/LuckyCharmsTitle .)
HOWEVER, I was not even able to find, on the unicode web site, any discussion of a running text requirement at all, for any kind of symbol. Some example proposals do refer to "running text" by name, but they don't indicate why. The example proposal given for adding characters to an existing block ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2934.pdf , suggested as a prototype in http://www.unicode.org/faq/char_proposal.html ) does not mention "running text" at all, and doesn't appear to go to much trouble to document it, although some such documentation is given. The rough guidelines for character proposals at http://unicode.org/pending/proposals.html do not refer to "running text" at all, but they do suggest that, late in the process (specifically, on a proposal summary form, which is different from, and subsequent to, an actual proposal), "references to dictionaries and descriptive texts establishing authoritative information" are required.
I conclude that the Unicode standard's preferred criterion for chinese character inclusion is "would an authoritative chinese dictionary include this character", and while the answer to that question for 囍 is not unambiguous -- a lot of dictionaries don't include it -- it's easy to imagine that some do.
I would appreciate a pointer to the actual running text requirements, as well as what they are supposed to apply to, if anyone can provide that.
> It would be odd not to have an encoding for such a common character.
Outside of its use as a wedding decoration, which is plainly nonlinguistic, how common is it?
You're not wrong, but I don't know if it's a great example.
CJK characters are, broadly, an example of the Unicode Consortium trying to be way too reductive about what they'd accept, leading to a lot of bad decisions like Han Unification, which caused a lot of damage and which the Consortium has generally now backed away from and recognized as a bad idea.
So, yes, if you look closely at CJK character sets in Unicode, you can find a lot of decision making that appears to contradict decision making elsewhere in the standard. This is in large part because the decisions they made wrt CJK characters turned out to be largely wrong, and they've since changed their approach.
Unification is a mistake. But 囍 has nothing to do with unification. Do you think that 福倒 should have a code point? Would it be considered one of the chinese characters (very iffy) or one of the holiday symbols?
The 天书 ( https://en.wikipedia.org/wiki/A_Book_from_the_Sky ), by design, consists solely of chinese characters that don't exist. (Theoretically. A couple of them, by oversight, did exist.) They are still recognizably "chinese characters" by virtue of being composed of the same components. Should they have unicode points?
囍 plainly exists, but has no textual use. Is it more similar to 靑 or to ️U+2764 "heavy black heart"?
> Unification is a mistake. But 囍 has nothing to do with unification.
I'm not saying it does, I'm saying Unification illustrates the fact that the Consortium's decision-making with respect to CJK has changed over time, has frequently been illogical, and shouldn't be pointed at as an example of anything good or sane or worthy of precedent.
The fact that 囍 has a code point but 福倒 doesn't have a codepoint is another example of the Consortium being unnecessarily reductive and intransigent about CJK.
> Do you think that 福倒 should have a code point?
Yes. If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way.
> Should they have unicode points?
I'd lean towards no, as they're one-offs, not something broader that people want to discuss and use in text. But I'd be ok with adding them, too. We're not running out of space. There's no value in making CJK so much harder to interop with than everything else, in general.
> Yes. If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way.
This doesn't make any sense. We talk about things in text by using words, not direct representations. A dog emoji is not necessary or desirable for discussing dogs in text, and a 福倒 emoji is not necessary or desirable for discussing 福倒s in text.
Should the wikipedia page https://en.wikipedia.org/wiki/Statue_of_Liberty be edited to replace the cumbersome phrase "statue of liberty" with the more modern and convenient U+1F5FD 'STATUE OF LIBERTY'?
> A dog emoji is not necessary or desirable for discussing dogs in text, and a 福倒 emoji is not necessary or desirable for discussing 福倒s in text.
"Necessary" is an ill-defined and reductive way of looking at communication. History has shown us that you can't draw bright lines between things you, in the abstract, have decided are the "necessary" subset, and expect the world to follow along.
Linguists have come to understand that you can only describe and follow human behaviour, not prescribe it.
Anyways, humans plainly found it necessary to annotate their text messages with pictorial indicators of their mood, to the point where it became so widespread and such a mess that we felt it desirable to standardize the code-point representations. That it isn't desirable in all circumstances or appropriate in all registers of formality does not mean that it isn't an emergent behaviour which will continue to arise whether or not it is "necessary".
tl;dr I don't really give a shit that "dog emoji" isn't appropriate for an academic text on canine surgery. It's more than sufficient to me that it is used millions of times in text messages between regular human beings. Text needn't be formal text to deserve respect in encoding.
> "Necessary" is an ill-defined and reductive way of looking at communication.
I took "If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way" as implying that the two clauses were related to each other. Saying "if we want to be able to talk about it" means you're talking about what's necessary for that purpose.
Interesting comments, thanks. It is used in company names, which would make it awkward not to have an encoding for it: there are many characters used just in names in Chinese that would leave locations and people having unencodable names if the characters were not in Unicode.
> there are many characters used just in names in Chinese that would leave locations and people having unencodable names if the characters were not in Unicode
I was under the impression that this describes the current state of affairs, and has since before Unicode came around. I know I've read an article about someone whose 姓 was 马 and whose personal name was a character composed of three 马 stacked left-to-right (which might have been pronounced cheng?) getting harassed because the government couldn't encode the name.
It may be the case, but usage of many of these characters is rarer than 囍. Looking through many of the characters in the cjk Unicode extensions it is not hard to conclude that characters like 福倒 should exist. E.g. 2010f and 20114 which look like 了 and 予 upside down. Or 255d0 which is 石磊 joined together.
The word "running" doesn't appear on that page. (Actually, no requirements at all appear on that page; it speaks strictly in terms of strengthening or weakening the case for inclusion, not disqualifying.) Can you explain briefly why that page is evidence that the running text requirement does not apply to Chinese, and where it specifies what the running text requirement is?
Alternatively, what requirements do apply to Chinese, and would they preclude an invented character like one with 女 on the left and 离 on the right?
I thought the symbol guideline page discussed "running text", but I guess not. Apparently the "running text" requirement isn't part of the published criteria even though it is enforced in discussion.
That character is super common. Way more common than any emoji I'd expect (even teens using emojis won't outweigh the number of weddings and new years in Chinese speaking countries).
Well, even if it doesn't meet unicode inclusion requirements, it is necessary for printing in one of the largest markets in the world. Without that character in unicode, Chinese display systems and printers probably won't use unicode at all (and before unicode they used some standard of their own) - meaning the question is whether unicode wants to be relevant or not, not whether the inclusion requirements fit.
This is used in discussions of the character - would you not consider that text? It does seem to have a more figurative than literal reference than most characters, in a way that I am not sure how to translate into English.
Use of 囍 in discussions of the character 囍 can be reasonably considered nominal use (that is, use within a name). The use of a concrete object to directly represent itself isn't really the same thing as the use of language to refer to a concrete object.
edit: I'd be interested in hearing your thoughts about "it does seem to have a more figurative than literal reference than most characters, in a way that I am not sure how to translate into English", in Chinese if necessary. (No guarantee I'll understand it, but I'm interested.)
In response to your edit, I mean that it has cultural resonance that is unusually strong in relation to its linguistic overtones, in many ways similar to the semantic timbre of a character like 福. The level of abstraction is different from English because of the ideographic nature of characters that means the visual appearance is emphasised, so the boundary that you pick out between reference and referent is more blurred.
I'm not really clear why exactly this character isn't more widely used in text, but I feel this might not be a bright dividing line from more common characters. I think inclusion of the 福倒 is a harder case to make, but the examples I quoted elsewhere in this thread make me think it should be included. Perhaps not what you were hoping for in terms of elaboration; the problem is more conceptual fuzziness on my side perhaps than language of expression.
Assuming that 囍事 in that passage refers to "a wedding", first I'd admit that that passes pretty much any test of "linguistic use in running text".
Having said that, I note that 喜事 appears in my dictionaries with the gloss "wedding" (well, "any occasion meriting joy, particularly a wedding"), 囍事 does not, and since 囍 is a symbol of weddings which is generally assumed by the Chinese to have the same pronunciation as 喜 it makes for very natural wordplay to substitute it into the word for wedding. I would draw a pretty close analogy with the $ of "Micro$oft" -- it's use in running text, but it shouldn't be taken as evidence that $ is a letter in English.
You just aren't getting it. Do you actually know Chinese, or are you just looking things up in a dictionary?
It is pretty natural to jump from 喜 to 囍 because that is how Chinese works. You take radicals, and you bundle them up. You have the "busho" system, where people in the past bundled up little bits and pieces to form new words. No reason why people in the present can't do the same.
Re: Micro$oft being outrageous if $ becomes a part of the alphabet. You are misapplying an English oriented viewpoint. In Chinese, there is no objection to forming words in that way, by incorporating radicals together. It's similar in theme to how in German, you can just keep stringing words together to form larger words. In fact, I actually think in the future, words like Micro$oft should entitle $ to become part of the alphabet! That's a very Chinese way of looking at things.
Language is not static. Systems that try to encode language are descriptive. They can never be prescriptive - otherwise we as a civilization die.
If 囍 wants to be a character point, let it be one. If 福倒 wants to be one, there should be one. Isn't the point of unicode to have enough space to include all these kinds of language artifacts (artifact as in a cultural / historical item thought up by humans) in order so people can uniquely reference each one? They are distinct logical units.
If the unicode rulebooks are too rigid, the rules need to change or the approach needs to change. It's useless to try to argue that xyz character in another language shouldn't/can't be a character - people will just stop using unicode if it doesn't suit their needs.
Reeks of colonialism, that's what it is.
EDIT: as an additional gloss, here's why I think 喜 and 囍 are sometimes used differently, even though by the dictionary definition they seem to be the same. I will explain why I think logically they are different concepts.
喜 is happiness, delight, joy. It is probably an adjective in the English sense (I can't map grammar rules through different languages easily).
事 is an occurrence, an item, something that happens.
When you put them together,
喜事 literally means something happy is happening.
The cultural meaning has turned that into a connotation of "wedding", but it could actually be a ton of happy things. Promotions, and yes - one other really big thing in a person's life: having a baby.
有喜 (meaning "having happiness") is the traditional way of referring to a woman being pregnant.
You can turn that into 家有喜事 - meaning home having something happy - as in this household is having a baby. And you can use it without the 有 - and just use 喜事 to refer to having a baby.
This is different from a wedding.
囍 is a modification of 喜, by doubling up the character and treating it as a radical, people are referring to the idea that there are "two people having happiness" - like a doubled amount of happiness.
In the article linked http://www.chinatimes.com/newspapers/20160623000760-260115 - the 囍事 is used to specifically identify the "wedding" type of 喜事 - it's like trying to avoid the ambiguity and double-entendres that Chinese writing typically embraces and just presents things matter of fact, which is ideal because the article is a newspaper article about customs of towns. Not really something you want people to have multiple interpretations like an essay or a poem, for example.
So logically, there is a difference when trying to use 喜 vs 囍 and I actually really appreciate the author's use of the double version in the text.
I know that not everyone reads these characters in this way, but I do - and I'm sure other people will notice this too. It's the best part of Chinese - not knowing, and not seeing the ambiguity, and one day, someone tells you about it .. and you're like - OMG that's what that means ...
For my earlier indication that this type of character modification is common in chinese:
木 = wood
林 = common last name Lin, also means forest (uncommon on its own)
森 = common character for forest.
The English word "forest" is usually 森林
It's just a doubling and tripling of the 木 radical.
What does it matter that this character is super old - people thousands of years ago thought this up.
Also, if this character weren't so old, would you say that 森 and 林 are both forests and thus don't need separate character points in unicode? That's outrageous!
So now we have a modern version of this modification 喜 -> 囍
And I showed how I think they are different logical concepts.
hmm what's the issue with it being a unicode character point?
POST Edit
In the writing of this post, I think I've come to identify Chinese as an "ambiguity-first" language - I learned Chinese as my mother tongue, but stopped at an elementary-school level, and switched over to learning English to a Bachelor's-degree level.
In Chinese, puns, double-entendres, and ambiguity just "happens" by default, and you have to work your way to be crystal clear.
English is more straightforward, with a speaker having to actively try to make puns or double-entendres.
In the case of 囍, it's a reduction in scope. Modern Chinese people had to create a new word just to narrow down the meaning of 喜 - so that it specifically refers to weddings.
Your whole line of thinking was that 喜 already had meanings inclusive of wedding, so 囍 can't possibly add any more meaning when it also means wedding. In actuality, it took away a bunch of extraneous connotations, and in Chinese, the reduction in complexity is so valuable that it's worth a new word.
I think that it's a mistake to try to over-legislate and reduce languages into a set of rulebooks for character encoding - that's all I am trying to put forth - it's best for the person or peoples who speak the language to come up with the encoding for it. I have an elementary-school knowledge of Chinese and already I am kinda miffed at why people have an objection to 喜 vs 囍.
Imagine how the people who have Bachelor's degrees in Chinese must feel.
In this case, the codepoints were added in part because the proposers could show many printed works (user manuals, I guess) that included sentences such as "to turn the foobar on, press the ■ button", which shows that the glyph between "the" and "button" is in some way like the surrounding glyphs. Chessmen were added for similar reasons, even though very few people actually read either user manuals or chess literature.
The difference between an icon and a letter is small and unclear. For example, & is a symbol but was once considered a letter. Chinese characters are words, etc.
Good point. Letters, punctuation, symbols... the lines are blurry. If I may, I'd say that & is a symbol that represents a grammatical connective. Which is a generic abstraction and won't cause an explosion like having symbols for every word out there.
We may think that we are enlightened beings but the fact is that pictures comprise a lot of how we communicate now and in the past. Are emojis that different from hieroglyphics?
Last I checked, Unicode don't actually have anything like coverage of the entirety of every script and alphabet. On the other hand, approving emoji and random icons delights Westerners.
You don't have to point to human history, though that's a good source of missing scripts. Waving off scripts actually in use as "increasingly obscure", while cheering Unicode throwing in any icon random geeks pitch to them, misses the purpose of Unicode.
You're tossing that assertion around without supporting it – what commonly used characters are not in Unicode? How many people use them? Are they not in Unicode because nobody cares or because there is a lack of someone authoritative helping codify the list or contentious disagreements about some aspects of that work?
Seriously, you're able to Google up those other links, but you somehow can't find the (non-exhaustive) Unsupported Scripts list or the Proposed New Scripts pages on the Unicode site? And, without knowing the situation for any of them, you're going to throw out excuses for why the absences don't matter?
These aren't characters, but entire scripts that are not part of the standard. Nor are major scripts like kanji complete.
Again, you're the one making the claim. Can you precisely state what you believe to be the problem and cite some sources that this is a major problem and that nobody is working on it?
More importantly, ask why it seems unreasonable that a small number of very widely-used ISO standard symbols were incorporated quickly? Wouldn't that be the most reasonable expectation since it lacks the political heat of e.g. Han unification and doesn't require any research or debate to establish that they are used, have a precise meaning, and are not covered by existing codepoints?
I've already stated my complaint: that getting gratuitous icons into Unicode is easier than actual scripts for human languages. Since you're having Google issues, I'll link you to a page I already mentioned, which it self links to other relevant pages http://unicode.org/standard/unsupported.html
I love the echoing nature of these counter-arguments, that a problem doesn't even exist unless it's "major" and "nobody is working on it". I wonder how many actual different human beings have responded to me in this thread...
You might find your conversations work better if you respond to what people are actually saying rather than repeating yourself or assuming that other people don't know how to use Google, particularly after they've already sent you links which comprehensively disprove your assertion by demonstrating how many new characters are being added and that emoji constitute less than 1% of the 7,500 new characters in Unicode 9.0.
Your original claim was that “Unicode don't actually have anything like coverage of the entirety of every script and alphabet” but you're arguing about things which affect something like 0.008% of people – not even their primary usage – and for which there is work in progress to support!
Nobody is saying that Unicode is complete, but like any other human effort there's a limited amount of time to work on things. At some point things which are used daily by billions of people are going to get prioritized over things which are used infrequently by thousands of people, and it's hard to argue that this is wrong even if you – like me – want to have 100% of human language represented in Unicode.
"You might find your conversations work better if you respond to what people are actually saying rather than repeating yourself"
You're going to seriously say that after your last few posts? Two posts into this exchange, you moved the goalposts, and you hammered that button repeatedly.
But at least you actually looked at the proof you repeatedly demanded, even though I had already mentioned the pages. You didn't bother reading much of it, or to note that it goes well beyond a couple of scripts on that page to other incomplete scripts and as-yet entirely unimplemented scripts. But you at least made that minimum effort.
And the limited time to work on these things is exactly the issue. There are scripts not yet in the standard and major language scripts that aren't complete - but we've got "pile of poo" and a slew of emoji. And now, we've got four power button icons that a handful of people demanded.
> You're going to seriously say that after your last few posts? Two posts into this exchange, you moved the goalposts, and you hammered that button repeatedly.
You started this conversation with “Unicode don't actually have anything like coverage of the entirety of every script and alphabet.” It's hardly moving the goalposts to question how complete Unicode has to be to qualify as “anything like” or how much weight usage should have.
> But at least you actually looked at the proof you repeatedly demanded, even though I had already mentioned the pages. You didn't bother reading much of it, or to note that it goes well beyond a couple of scripts on that page to other incomplete scripts and as-yet entirely unimplemented scripts. But you at least made that minimum effort.
Before you could call that proof, you have to clearly articulate the questions it could answer. Note that my first comment indicated a clear understanding of how Unicode works – the process is not in question here, only the thresholds you haven't articulated. All I've been trying to get you to state is precisely what your rules would be for coverage of human languages before we can add anything else, and how much usage should factor into that. There's also a much harder question of trying to come up with a rule that says why a pictograph, the Phaistos Disc symbols, etc. are valid for inclusion but a modern symbol used millions of times a day around the world to communicate is not.
While thinking about this, it's also worth remembering that despite your apparent belief that emoji are a Western novelty, the question was how to improve Unicode adoption in Japan and that required having an answer for the millions of people who were using systems which relied on non-standard encodings and by most accounts Japanese carriers were resistant to adopting Unicode without having a standard to replace those ad-hoc systems. I think that decision should have been handled differently (i.e. assigning an emoji plane) but it was driven by understandable technical reasons affecting large numbers of people on a daily basis. Since that decision was made, the additional cost to add a small number of non-controversial additions which do not require scholarly research or documentation does not seem excessive — we are, after all, talking about a small percentage of the new symbols in Unicode 9.0.
But why? The trend towards putting icons into Unicode may be a mistake. Unless it's a symbol one uses in a sentence, there's no real reason to have it in Unicode. Unicode should not be viewed as a standard clip art library.
I believe they are much more legitimate symbols to put into Unicode than emoji. They are standardized (IEEE 1621), they frequently appear in running text (of a particular kind, but widespread enough), and they have distinct semantics from existing characters and symbols. In terms of worthiness they are on par with math symbols, which is enough to justify inclusion.
I hear you. If you want pictures then use a markup language. Unfortunately it is too late now. We finally had an almost universally supported character set, and then we ruined it with levitating men in business suits. Recent Unicode versions introduce far more technical challenges than they solve. For instance, now that code points can come with colour, there are conflicting requirements between the requested text colour and the intrinsic colour of a symbol.
This is about icon fonts and font rendering and has nothing to do with Unicode. Code points don't have colour; there's no colour requirements anywhere in the spec.
Your fonts don't have to support the entirety of Unicode. That's why we have font stacks and fallbacks.
U+1F499 BLUE HEART
U+1F49A GREEN HEART
U+1F49B YELLOW HEART
U+1F49C PURPLE HEART
U+1F53D DOWN-POINTING SMALL RED TRIANGLE
U+1F536 LARGE ORANGE DIAMOND
U+1F537 LARGE BLUE DIAMOND
...
What is "plain text format" though? If 'text' isn't limited to Western ASCII characters (which it very obviously shouldn't be considering many people use other character sets), then the idea of a text standard should be to encode all the glyphs people use, so "plain text format" becomes a canonical list of all the communicative symbols in all languages. That's what Unicode aims to be.
In my opinion, if they're used for communication, it doesn't seem unreasonable that such a canon of characters should include universal iconographic symbols like the standby icon.
If I understand correctly, Unicode only provides the semantic meaning, not the actual rendering. The font provides information for how to render it. Am I right?
Yes, though they provide guidance on how they should look, and in most cases, there is only one font on the system that has a glyph for some of these more esoteric code points, and it usually provides a reasonable representation.
I already ranted about Unicode earlier today. My main argument is that Unicode is what happens when everybody qualified thinks: "That's a great idea, of course you have to handle X and Y and Z, and I just remembered that I forgot to fill out several warranty cards."
This blog post is a nice example: I have absolutely no idea what these new code points are supposed to look like, since I only spent an afternoon implementing the Unicode best practices from the Arch wiki instead of subscribing to some Unicode standard mailing list. (Except the one symbol which was redefined to a symbol that does not carry the semantic meaning of "standby symbol" anywhere outside of the Unicode standard.)
In my opinion there are two ways forward: either burn the entire thing, or force the Unicode committee to produce an authoritative and complete font, in triplicate, and in their own blood.
The Unicode tables include examples for all graphical code points: http://unicode.org/charts/. If you really wanted to, you could make them into a font (most of them seem to be vectorized), but since I'm guessing you see most of the added code points as useless, why do you care if they show up as boxes? What harm is this stuff causing, or going to cause, to the standard? We have hundreds of thousands of unassigned code points.
Meanwhile, a lot of the "Ys and Zs" added to Unicode have proved to be extremely useful. Unicode's math operator and letter-styling support is what made MathJax (and more generally MathML) possible. They've also helped big time when it comes to accessibility (e.g. screen readers) for mathematics on the internet. Should we have shunted that off to another standard and made the creators of screen readers completely restructure their offerings so they can deal with Unicode characters and "Mathicode" characters? Assuming anyone bothered to implement it, how would that be better than just adding a Unicode category and spending a meager amount of space?
Your answer illustrates my point perfectly. First, the lack of a reasonable fallback mechanism: of course I can start a hex editor, get the UTF-8 encoding, and then look up the code point (and theoretically add that character to an open font). A default font would just ship with every OS out there, and suddenly there would be a working fallback.
Second, mathematical symbols: consider the case where I get a text file consisting mostly of 7-bit ASCII and some mathematical symbols, which may render as mathematical symbols or as Chinese characters, since there is no way to specify the encoding in a text file and so I have to guess the encoding. (That is not helped by the roughly 17 standardized encodings that mostly agree with UTF-8.)
> Second, mathematical symbols: consider the case where I get a text file consisting mostly of 7-bit ASCII and some mathematical symbols, which may render as mathematical symbols or as Chinese characters, since there is no way to specify the encoding in a text file and so I have to guess the encoding.
What does that have to do with Unicode adding anything? Are you really claiming that if we threw out Unicode like you recommend, and (if I'm understanding your point correctly) chose an encoding for the new version that looks nothing like ASCII, the encoding mess would get better? I think continuing the migration of most transmission of text to UTF-8 and explicitly specifying encodings for everything that needs to stick with Latin-1, etc. is a better option, unless you propose codifying the new encoding in law to force adoption.
I checked your link, then I proceeded to the terms of use, and (IANAL) there they claim that all 'Unicode software' is basically MIT licensed. (Please check with a lawyer before you conclude from that wording that it is MIT licensed.)
Fonts

The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts. The particular fonts used in these charts were provided to the Unicode Consortium by a number of different font designers, who own the rights to the fonts. See http://www.unicode.org/charts/fonts.html for a list.

Terms of Use

You may freely use these code charts for personal or internal business uses only. You may not incorporate them either wholly or in part into any product or publication, or otherwise distribute them without express written permission from the Unicode Consortium. However, you may provide links to these charts.

The fonts and font data used in production of these code charts may NOT be extracted, or used in any other way in any product or publication, without permission or license granted by the typeface owner(s).

The Unicode Consortium is not liable for errors or omissions in this file or the standard itself. Information on characters added to the Unicode Standard since the publication of the most recent version of the Unicode Standard, as well as on characters currently being considered for addition to the Unicode Standard, can be found on the Unicode web site.
Now you can try to track down the actual typeface owners one by one (this alone seems hopeless) and even then I doubt that you can get permission from all of them.
But all mathematical symbols, except for styled math letters (which would have been equally well served by simply rendering them in italics or in a special math font), were already in the original 16-bit Unicode, as were any characters humans normally associate with "text" (all alphabets except for Egyptian hieroglyphics and other extinct scripts). How useful is it to standardize hieroglyphics, ancient Greek musical notation, and emojis as standard text characters, especially without standardizing their screen representation?
> How useful is it to standardize hieroglyphics, ancient Greek musical notation, and emojis as standard text characters, especially without standardizing their screen representation?
In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation. There is semantic content associated with each of these symbols that persists even if there is significant variation in how they are presented. Of the ones you list, emojis are the only ones where this is any more a problematic approach than it is for letters in various alphabets. And as people who don't approve of Unicode adding emojis like to point out, emojis aren't that critical so having some loss in the translation isn't a huge deal.
Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased. The alternative to Unicode not standardizing them was to have a repeat of the OEM font gold rush in the SMP.
> styled math letters (which would have been equally well served by simply rendering them in italics or in a special math font)
That was my first reaction as well, but there are a few problems with that approach:
* Math italic characters look very different from normal italics, and are shaped and kerned very differently because they are commonly used for single-letter variables which will be juxtaposed together in expressions. If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.
* Many of the math letters and "letter-like symbols" have associated semantic content (like bold for vectors), which it makes sense to preserve. MathML alleviates this to a significant degree but I don't believe these codepoints were intended only for MathML usage.
* On the technical side, OpenType math fonts need to carry associated metadata for many of these characters. Putting them in separate fonts complicates this, since these tables need to refer to glyphs (general codepoints are unsuitable in a number of cases) and each font file would have a different glyph address space.
> Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased.
Things were way worse than that: to add emoji to text, NTT DoCoMo used private-use codepoints, AU used embedded image tags and Softbank wrapped emoji codes in SI/SO escape sequences.
> In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation.
I disagree. Say the name of a letter in any alphabet, and people will draw it in ways that are similar enough for automatic recognition. This is not true for pictograms and emojis.
> The alternative to Unicode not standardizing them was to have a repeat of the OEM font gold rush in the SMP.
I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.
> If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.
So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.
Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.
> I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.
So you think instituting a system based on links not rotting would better preserve meaning? Not to mention that:
* Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.
* Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.
The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff. If the choice was "stick with UCS-2 and be totally fine language wise, or add more bits just for emojis and pictograms" I'd probably agree with you. If you think that's what happened, your timeline for this process is way off.
> So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.
Sure, but the different letter types carry crucial meaning in math formulas. "sup" in upright letters is the math operator supremum, "sup" in math italics is s * u * p. This kind of thing applies to every one of the mathematical letter variants.
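To see concretely that these are distinct characters rather than mere styling, here is a quick check with Python's unicodedata module (a minimal sketch, just printing the character names):

    import unicodedata

    # Upright s-u-p and mathematical-italic s-u-p are different
    # code points, so the semantic distinction survives in plain text:
    for ch in "sup" + "\U0001D460\U0001D462\U0001D45D":  # s u p, then 𝑠 𝑢 𝑝
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")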
> Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.
You're zooming way out on this one. Math symbols have a lot more in common with letters and "normal" symbols than "all specialized usage of human-readable data". Remember that when Unicode added the math symbols, things like MathJax were simply impossible. Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.
> Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.
It can still display a box.
> Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.
The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs. Their OS could provide this service for them just as it allows them the use of fonts.
> The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff.
That is a good point, one of which I was not aware, but I still don't think it justifies standardization of Chinese characters, emojis, and ancient Greek musical notation by the same standards body.
> Sure, but the different letter types carry crucial meaning in math formulas.
The use of italics in text may also carry crucial meaning. But if a textual representation as "sup" is supported, I don't see why specialized rendering should be supported for math but not for plain text.
> Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.
I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas in convenient form and a specialized renderer is required anyway, I don't see the reason for this extra effort.
If it knows about your new codepoint. Everyone using an implementation that doesn't yet support it is going to show the full URL. If history repeats itself, these implementations will be the majority for at least a decade.
> The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs.
Except they need to do literally nothing different. Hieroglyphs are vectorized, and emojis are either bitmapped (which existing font formats already supported) or vectorized. None of this required a single line of code in any text shaping library to change. Text shaping libraries generally don't even need to understand Unicode categories or other metadata: font files already contain all the relevant information (directionality, combining mark, etc.).
> The use of italics in text may also carry crucial meaning. But if a textual representation as "sup" is supported, I don't see why specialized rendering should be supported for math but not for plain text.
If you scrub all bold and italics from text, do any of the words turn into different words? That's what happens with sup (or other named operators). Same thing for blackboard letters, fraktur, etc.
> I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas in convenient form and a specialized renderer is required anyway, I don't see the reason for this extra effort.
My point is it was better than nothing, which was the alternative when Unicode added these. By adding mathematical characters that worked exactly the same as all other characters, we could get some of the advantages of MathML without needing everybody to implement a special math renderer or learn any new markup, just download new font files.
If getting everyone to accept MathML or something similar in an expedited fashion was a reasonable proposition, then I might agree that they should've kept it out of Unicode. Those were not the facts on the ground when this decision was made. Note that even now that we have MathML, the few browsers that support it (IIRC Firefox & Safari only) have complete shit implementations that look terrible.
The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards accepted and implemented is a Herculean effort. Unicode's expansion into these domains required nobody to do anything differently, let alone decide to up and write an implementation of a completely different standard. I think this is a case where the alternative was to let the perfect be the enemy of the good, or at least the working.
> That's what happens with sup (or other named operators).
And that's what happens when scrubbing subscripts or binomial coefficients. When you want to represent math as text, you need to change your representation (add multiplication signs, forgo subscripts, use confusing parentheses etc.). This is still true with Unicode. The contribution of the non-BMP special math characters is quite minimal.
> The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards accepted and implemented is a Herculean effort.
Sure, but the emoji craziness continues, and there's little sign it will ever stop. Instead of saying "this isn't text; if you want, call it 'special text', escape it, and let a different body standardize it", the body entrusted with standardizing text representation worries about how to represent a picture of two people kissing as text. What next? Kids will want to add tunes to their text messages. Will the Unicode Consortium add code points for MIDI? And maybe managers will want to standardize code points for organizational diagrams. Would that be the consortium's responsibility, too? The BMP contains all the characters for reasonable text-art.
Unicode has added 1791 emojis[1]. Note that it has used significantly fewer code points than this for emojis, because some (e.g. flags) are composed from a small set of shared code points (flags pair up 26 regional indicator letters that spell country codes). They've been slowing down the rate of emoji additions since they started doing this. Do you really think they'll accelerate at some point and use up the almost 1 million unassigned code points Unicode has left? Not to mention that many UTF-8/UTF-16 implementations are already fine with full 32-bit code points, instead of the 21 bits of Unicode I'm using above (they have said they'll only ever use 21 bits, but the option is there).
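As a concrete illustration of the flag mechanism, here is a minimal Python sketch (the `flag` helper is a hypothetical name, not anything from the standard):

    def flag(country_code: str) -> str:
        # Map each ASCII letter onto the Regional Indicator block
        # (U+1F1E6..U+1F1FF); two in a row render as a flag where
        # the font supports it.
        return "".join(chr(0x1F1E6 + ord(c) - ord("A"))
                       for c in country_code.upper())

    print(flag("JP"))  # 🇯🇵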
Suffice it to say this getting "out of control" and eating up the remaining space in our lifetimes, or our children's lifetimes would be pretty impressive. This means that the worst we have to fear is more boxes, assuming OSes don't keep up with their fallback fonts. Saying adding more symbols-designed-to-be-just-kind-of-placed-in-line-with-other-symbols-like-they-always-have-been is the first step towards MIDI and organization diagrams is like saying you're vegetarian because it's a slippery slope from eating meat to eating people.
Also note that unlike every other excess of the Unicode standard, these would require massive changes to the code that handles text. This means that if Unicode decided to do this, you wouldn't have to worry about negative effects because nobody would implement it.
I'm not afraid of running out of codepoints. I just think it is misguided that non-text is standardized as text. It just doesn't make sense. I don't know how people will communicate 50 years from now, but I think it's funny that even then, text strings would still need to support all those vegetables and hand gestures. Written text is pretty much eternal; emojis and pictograms? I doubt it. It doesn't make sense for them to get a similar treatment.
I think that the BMP -- the Basic Multilingual Plane, the first 65,536 code points of Unicode -- is pretty reasonable, and covers fairly well everything we may consider as text (all alphabets in current use plus mathematical symbols). Anything beyond that, from emojis and pictograms to ancient Greek musical notation, is pretty... weird.
I think it would have made much more sense to have something like image tags: a special codepoint would introduce a link to a URL containing a sequence of glyphs, followed by an index into that sequence. Those glyphs would be guaranteed not to change (in any meaningful way), and devices would be free to cache them. This way, anything that isn't real text, would standardize representation, too, instead of just a vague "meaning". Another standard could relate those glyphs to one another in some way, giving them standard semantics and means of translation (i.e. "Egyptian hieroglyphics"). This would also allow each of those (emojis or hieroglyphics) to evolve their standards independent of a single universal standard that means little.
That's nice, except the BMP doesn't encode all of Chinese, and it includes a number of weird control characters for compatibility with ASCII. Like Vertical Tab. Who uses Vertical Tab anymore?
The dream of a 16-bit Unicode washed up on the rocks of CJK scripts. It's dead and it isn't going to be revived. You can argue for a simpler standard, with fewer assigned codepoints, but the original BMP isn't it and was never going to be it.
I mostly agree. In my opinion, separating code points from encodings (without providing a replacement for .txt files) was the original sin of Unicode. That just adds a lot of complexity with very little gain. (Suppose we had standardized a move from ASCII straight to a 64-bit-per-character encoding: text files would have grown by a factor of 8 and nothing important would have happened.^1)
^1 I am pretty sure that some use cases actually profit a lot, but neither text file formats (since most text does not contain 2^64 different characters, zip would work nicely) nor networking (since most data on the internet is either video or torrents) seem to be among them. So those probably would not be big problem areas.
It's not bad, but it's complicated, as it requires an O(n) algorithm to jump to a specific character. Unicode should have been capped at 16 bits, and doubling text files in size would have been fine. An alternate representation like a simplified UTF-8 would have kept compatibility with old ASCII files.
Doubling text files is a waste for the most part, but what makes it tolerable is compression. Still, 16 bits would not be enough: it's only 65,536 different code points, less than half of what is currently in Unicode. 24 bits is sufficiently out of alignment with modern hardware and algorithms, so it's 32 bits for efficiency. That is now four times the size, and compression becomes a requirement.
In any case, UTF-8 has a place, and if you want easy manipulation and search, convert it to UTF-32 - it's fixed width.
> Still, 16 bits would not be enough: it's only 65,536 different code points, less than half of what is currently in Unicode.
But that's only because Unicode has ventured well beyond what we consider to be text. The BMP is enough to represent all text (including math).
The BMP is mostly filled with symbols from logographic languages, and currently has around 100 free code points. 16 bits simply isn't enough for the scope of capturing all written languages in history.
But I don't think all written languages in history should get the same treatment when it comes to standardized data representation, or should all be standardized by the same body. It's OK to have alphabets that no one but specialized researchers has used for thousands of years standardized separately from the Latin or Chinese scripts.
But if you don't, what would the point of a unified standard be? What should happen to glyphs that no one has used outside of research for 100 years? 1,000 years?
They should all be standardized under separate standards. You can call them "extended text", if you like. Mathematical notation isn't really supported by Unicode, either (just mathematical symbols), and that's fine. Math should have its own standard, and so should hieroglyphs, emojis, and musical notation.
What happens if you want to use two different extended-text code points in one blog post? How would they interact? How do they avoid assigning the same code point to different symbols?
How would browsers support this? What's the actual plan, not just a handwave? Do you think it'd be more efficient to have to support 6 different standards than one?
> What happens if you want to use two different extended text code points in one blog post?
There are no codepoints if it's not text. How do you use codepoints for embedding a video or a picture on your blog? You don't! But, if you want to treat something as if it were text, then I suggested doing something similar to an XML namespace: "the next segment is hieroglyphics, you can get their glyphs from here, and these are their indices...". That "extended text" is still not text, and it still doesn't use any Unicode codepoints, but it can work according to similar principles.
> Do you think it'd be more efficient to have to support 6 different standards than one?
Then why don't we let the Unicode Consortium take over standardizing video or audio? If something isn't text, why is it standardized by a text standardization body?
If you think of Unicode as a standard that enables the ease of exchange of textual data among parties, would that change your mind? People have been putting emojis alongside text (think SMS) long before Unicode put them into the standard at the request of Google and Apple.
People have been putting images alongside text since the invention of print. That still doesn't make the images text. So it's really great that for a few years now some people are embedding icons in their SMS messages, but I think that incorporating those fashionable 2-5-year-old icons into a standard that's mostly about standardizing hundreds-of-years-old text doesn't make much sense.
You can exchange text with embedded icons just as easily without requiring OS vendors to come up with their own versions of vaguely-defined pictures by... simply embedding pictures.
I can already see the people on whatever would be the HN 20 years from now complaining how "bloated" Unicode is, full of thousands of symbols that no one ever uses, and calling to replace the whole thing, costing the industry even more money to replace a standard yet again.
If compression would be a requirement for 32 bits, then it was definitely a requirement for 8-bit text 25 years ago when memory and bandwidth were typically 1/1000th of today. Of course it often wasn't, and isn't. And where it is, it's still a requirement with UTF-8.
> it requires an O(n) algorithm to jump to a specific character
If you are trying to index into a string by "character" you are almost certainly already doing it wrong. Meaningful indexing almost always has to be by grapheme cluster. See Swift's string API as a great example of this done right.
ASCII is only 7 bit so UTF-8 is fully compatible, at least when compared to ISO 2022 and other similar horrors from the same era. Are you thinking of other encodings?
Emojis and hieroglyphs aside, 16 bits was not enough for CJK characters. It's a real-world problem---a character set that can't spell people's names or locations can't be universally adopted.
How do you define "character"? You typically need to take into account grapheme clusters / composed character sequences, which make jump-to-character O(n) regardless of encoding.
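To make that concrete, here is a small sketch using the third-party regex module (pip install regex), whose \X pattern matches one grapheme cluster; the counts below are what I'd expect, though exact segmentation depends on the module's Unicode version:

    import regex  # third-party; the stdlib re module has no \X

    s = "e\u0301le\u0301phant"  # "éléphant" built with combining accents
    print(len(s))                        # 10 code points
    print(len(regex.findall(r"\X", s)))  # 8 user-perceived characters

Even in "fixed width" UTF-32, indexing by code point lands you in the middle of the first é.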
Well, if you want to know what they can look like, the blog post has images, an embedded webfont and links to the reference font for the new symbols. And AFAIK, providing freely usable reference images is required for all new symbol proposals.
It does not work in Firefox with NoScript. (And as a matter of fact, prohibiting random blog posts from rendering Unicode, executing complex numerical calculations on my graphics card or delivering exploits is kind of the purpose of NoScript...)
NoScript does not prevent pages from "rendering Unicode", whatever that's supposed to mean. I use it myself, and the only thing it does is selectively block Javascript.
Legitimate question: Why is Unicode littered with all those useless symbols?
I can see the reasoning behind the standard (or very common) symbols or things like emoji, but having every possible glyph in Unicode seems like a horrible waste.
What if we want to add new glyphs in the next 10 years for emerging standards?
having every possible glyph in Unicode seems like a horrible waste
A horrible waste of what? Unicode 9.0 encodes 128,172 characters, out of a possible 1,112,064 code points. The addressable space is about 11.5% full. Clearly there's enough left to keep adding more and more characters for a really long time.
If your complaint is that it's a waste of resources, time, etc - surely it's up to the people who are members of the consortium to decide how they want to spend their energy?
I believe it's an issue of time resources. I would argue that new emoji characters are among the less important uses of Unicode [1]. You are right that it is entirely up to the members of the Unicode consortium to manage their efforts themselves, but that doesn't mean we can't complain about it. I see a lot of these cases as bikeshedding.
People aren't fungible and there's no special set of people who have been specially blessed to decide what's important all-up. Each area of Unicode is handled by a different set of people who are experts in different areas.
If a number of people who use language "a" know that Unicode isn't handling their language, some of them need to step up and provide a solution. Part of that stepping up might be as easy as complaining about the problems they are running into :-), but eventually for a solution to emerge, some set of people need to step forward and handle the Unicode research and paperwork.
So? In this 11% it already covers almost all written languages in use and tons of dead ones. It also has lots of classic symbols from math, book ornaments, and standard typographic stuff (left arrow, etc.).
So all the basics are covered.
We could fill the remaining 89% with variations of the turd emoticon and we'd still be perfectly fine.
The original 16-bit encoding (UCS-2) can only represent 65,536 code points, which is less than half the number of Unicode code points assigned today. It was broken by the expansion beyond the BMP.
There's a newer, incompatible ("mostly compatible" may describe it better) UTF-16 encoding that represents all Unicode code points, but two formats under the same name is even more broken than one broken format.
UTF-32 will suffer the same fate as UTF-16 if Unicode ever expands beyond 21 bits. UTF-8, meanwhile, is capable of representing an absolutely huge number of codes, requiring only non-breaking extensions.
True, but I think that is also problematic.
The more symbols there are the less likely it is that every font covers all the symbols that someone might find important.
For instance, I often get e-mails with characters that can't be displayed because my standard fonts (on Linux) don't have them. The "missing Unicode symbol" box is the new "picture not found" icon[0]. In fact, it is even worse, as there is no alt attribute to tell me what I am supposed to see.
> the less likely it is that every font covers all the symbols that someone might find important
That is a meaningless requirement. The symbols I use on a daily basis already don't exist in a single font. Operating systems handle font fallback just fine.
to make sure that even the most idiosyncratic choice of characters can be displayed everywhere according to the intention of the person who picked them.
> The more symbols there are the less likely it is that every font covers all the symbols that someone might find important.
Why would every font need to cover them in the first place? One font with the symbolic characters is enough to display them. Text renderers can deal with that.
No font coverage is also fine, at least the data is preserved.
Which ones do you find useless? And why do you find them less useful than those, in my eyes, extremely useless emojis? I'd say power-on and power-off symbols are pretty useful things to have in a font.
At some point, someone will realize there is a need to standardize a fixed, practical subset of Unicode that contains all the essential symbols used around the world, so that all devices complying with the standard can __actually__ interchange text in readable, printable and visually presentable form.
It's nice to have a catalogue of symbols and a tight encoding for them, but full support of the Unicode encoding has very little to do with support for Unicode in an application.
Yes, basically. I think we want normalization and other simplifications and restrictions on processing. What I have in mind is a restricted standard subset bundle that makes it possible to send text that ends up looking right on every device, now and in the future.
Imagine that you are developing a wristband device and you can buy an ASIC or FPGA chip module that eats grapheme clusters and spits out the bitmap for the right glyph every time.
The only problem I see is that OSX/iOS, Windows, and Android don't ship with some universal, but shitty, font that has every single last glyph ever, always immediately updated to the new Unicode standard.
You mean 0x23E9 to 0x23FA, just before these new power symbols? I only noticed them because the unicode power symbol site has an image of what comes before their symbols.
Unicode symbols... it seems like we should've developed them the way languages develop: start with the most important symbols, ones for food, water, shelter, danger, etc., then expand them into the abstract mess they are today.
Emoji were not developed haphazardly. They evolved naturally in Japan, then were adopted by the rest of the world. That is why there are so many Asia / Japan themes in the standard emoji set. The problem is Westerners don't understand the Japanese emotion behind the symbol. The symbol for bookbag looks exactly like a Japanese school kid's backpack. It's why there is a kimono. Bamboo wind chimes. Tsunami. Shinkansen... I could go on and on.
In some respects, they are getting jumbled up because of international pressure for the base emoji set to be stretched into a be-all for the global market. An example is the taco. There are tacos in Japan, but they are hard to find, and when you do find one, you definitely don't want to eat it. Mexican food is one of the rare cuisines the Japanese don't do better.
I hope that ligatures will become more popular than using characters like "½", because such characters are very difficult to find in text using standard ASCII input, e.g. in Firefox by typing 1/2 into quick find (Ctrl+F).
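As an aside, normalization can bridge part of that search gap; a minimal Python sketch (note that NFKC yields U+2044 FRACTION SLASH rather than the ASCII '/', so a literal 1/2 search still misses it):

    import unicodedata

    # NFKC decomposes the precomposed vulgar fraction into
    # digit, U+2044 FRACTION SLASH, digit:
    print(unicodedata.normalize("NFKC", "½"))           # '1⁄2'
    print(unicodedata.normalize("NFKC", "½") == "1/2")  # False: not ASCII '/'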
I'm always weirded out by that, because it implies we should support the full gamut of math: superscripted/subscripted text, large fractions, the text above and below the sigma in discrete sums, etc.
For the same reason that they also include symbols for many obsolete (dead) languages and writing systems, as well as (per a comment above) a character used by the 1959 IBM 1401 computer (https://github.com/shirriff/groupmark). The need to be able to discuss a certain technology in writing does not disappear just because it is out of general use.
I was actually wondering about the electrical symbols for logic gates, such as AND, OR, NOR, XOR, NOT, etc. I would hope they are universally accepted by now; they would help when writing books or describing logic. A quick DuckDuckGo search revealed nothing...?
Electrical symbols in general don't do so well when scaled down to the size of text. Plus, it is very uncommon to encounter the electrical gate symbols inline with text - usually the symbols are sitting in a separate circuit diagram.
Nevertheless, Unicode does have all of the logical symbols from mathematics, which are pretty commonly understood: ∧ (AND), ∨ (OR), ¬ (NOT), ⊕ (XOR), ⊼ (NAND) and ⊽ (NOR).
I was wondering why they would have snowmen in the standard. And then it occurred to me that maybe, since the Unicode set has so much room for characters, they were planning to allow cross-language communication through emoticons.
Think about it: if you can represent anything human with emoticons, then you can communicate through emoticons alone! Maybe that's what the ancient Egyptians were hoping for?
Actually there is something like that in the internationalised DNS standards (that is, internet domain names that look like .xn--xyz-abc sometimes and .日本 other times.) There's a blacklist of certain Unicode characters that are disallowed in domain names because they resemble more commonly used characters. See https://en.wikipedia.org/wiki/IDN_homograph_attack
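For the curious, the xn-- form is Punycode, the ASCII-compatible encoding of an internationalised label; Python's built-in idna codec (which implements the older IDNA 2003 rules, if I remember right) can round-trip it:

    # Round-tripping a non-ASCII domain label through its ASCII form:
    print("bücher".encode("idna"))          # b'xn--bcher-kva'
    print(b"xn--bcher-kva".decode("idna"))  # 'bücher'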
You don't have to be able to read the word ON to recognize it as a symbol. "Circle next to zigzag-thing" is as good as circle with line sticking out of it.
There's also precedent. Continental Europe standardized on "STOP" on stop signs back in the 70's, even though no continental language has "stop" in it.
Excuse me? No continental European language has the word stop in it? You should probably learn more languages before making claims like that.
Stop is a word in Dutch (the first recognisable use of the word I could find dates back to 1287). And German has stopf. I couldn't find a date for that, because my German isn't good enough to read their etymology dictionary, but its source is Old High German, so it's safe to say that stop has been around in continental Europe in Germanic languages since medieval times at the very least.
Edit: A quick further Google search reveals that Norwegian and Danish (and thus most likely Swedish too) have the variation "stoppe", derived from Low German.
It seems a bit disingenuous to claim it was adopted despite "not being an existing word" when two languages use the exact rendering on the sign and four others use or have words so closely related that they share the spelling apart from one or two additional letters.
Further edit: French apparently adopted the word stop from English...in 1792
The German Wikipedia, at https://de.wikipedia.org/wiki/Stoppschild , gives an example of "stop" used in the Protectorate of Bohemia and Moravia (after the German occupation of Czechoslovakia) in 1939, though it says the sign was an imported variant.
> As in Dutch stoppen, the sense “to stop” is figurative from water flow being stopped by plugging. Only in this figurative meaning has the form been adopted into standard German proper, under the reinforcing influence of English to stop.
https://en.wiktionary.org/wiki/stop gives a list of continental European languages where 'stop' is part of the language. Nearly all borrow from the English. Not Dutch, however.
German stop signs used the word "HALT" before. My German dictionary defines "stopf" as darning yarn, and "stoppen" as stop. Not quite the same spelling. Typing "stop" into French Google translate autocompletes to "stoppé". I wouldn't be in the least surprised if the American spelling crept into many European languages as a result of the sign being ubiquitous for 41 years - it would be surprising if it didn't.
I wouldn't underestimate the influence on the language of the American occupation after the war, either, nor the global influence of American business since the war. English words have crept in everywhere.
Because it's a zero and the | is a one. Arabic numerals are more universal, and the convention of 0 for off and 1 for on was established precisely to avoid picking a language. Then the combined glyph for an on/off button was created, along with the similar broken-circle glyph for on/standby. Those have squarish proportions, so the corresponding 0 and 1 glyphs are needed to match those proportions. Hence the four symbols now existing in Unicode (well, 3½).
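For reference, these seem to be the code points in question; printing the names requires Unicode 9.0 tables (Python 3.6's unicodedata is new enough, if I recall correctly):

    import unicodedata

    # Three new power symbols plus the pre-existing HEAVY CIRCLE
    # pressed into service as "power off" -- hence "3½":
    for cp in (0x23FB, 0x23FC, 0x23FD, 0x2B58):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")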
> the convention of 0 for off and 1 for on was established precisely to avoid picking a language
I figured that politics was the reason. What is less supportable is working backwards from that to concoct a rational justification. People who understand digital electronic conventions are highly unlikely not to recognize "ON" and "OFF". Furthermore, anyone who knows neither would find "ON" just as easy to learn as "|".
We could also use "OHM" instead of Ω and "EUR" instead of €, but symbols provide more concise representations of meaning, especially in the case of a combined ON/OFF button. Much easier to fit the universal circle with 1 in it than "ON/OFF". The broken circle with 1 in it is way smaller than "ON/STANDBY". The other two symbols are then necessary to keep the proportions consistent across all four related glyphs.
That's the only argument I've seen for it that makes any sense. I'll counter by saying only "ON" is needed, not "ON/OFF", as the off part is implicit. Same goes for STANDBY.
This has been standard practice for a long time, I'm not just making things up on the fly. BTW, SBY is a standard abbreviation for STANDBY used by the military, if space is a problem. And SBY is a lot more google-able than an icon.
It's not obvious that the "0" is a zero, as opposed to the letter "O" (Cyrillic, Latin) or something entirely misleading, like this gesture: http://i.imgur.com/6KZ1nKG.jpg
I suspect the vertical slash | has as many lookalikes as the O, but I won't go there. I do want to mention that, as you demonstrated, the "1" that means on is more often represented as the Latin I or just a vertical stroke as in |, making it just as hard to identify as a numeral, especially if you don't know that the 1/0 are derived from binary logic.
I would argue that the words "ON" and "OFF" when seen as glyphs are much less ambiguous than I/O.
It can be treated as a glyph. Why would it be inappropriate?
And besides, languages the world over use plenty of words borrowed from English, and English itself is loaded with borrow words from other languages.
I've thought the mania for icons to replace common words silly ever since the Mac. Why is a picture of a Kleenex box more understandable than 'PRINT'? I have no idea what half the icons on my iPhone mean.
No way to google icons, either. I know, I'm supposed to learn them by pressing them to see what happens, but as someone who has learned not to learn how to operate machinery that way, I find it distasteful.
I'm still waiting for a Unicode codepoint for Love Symbol #2 (aka The Artist Formerly Known as Prince). There are codepoints for dead Chinese emperors, there should be one for Prince.
I'll add these (and the IBM-related symbols @kens mentioned, which are especially appropriate) to https://github.com/rbanffy/3270font for the next release (this weekend, I think - still lots of Cyrillic cleanup to do in the develop branch).
When do we stop handing out IPv4s by the hundreds of thousands? "When we run out". When do we stop doing 120km/h? "When we run out of road". When do we stop spending money? "When we're flat out broke".
(Note: I am not advocating for or against the unicode case, I'm making a point about the specific mindset used to justify this)
I phrased it rather crudely. I really meant "when there is no more to add". With the caveat that running out is the other option.
I think the question represents an equally terrible mindset, "when do we stop?" How can I answer that, honestly? Is there a number? Or is it when a certain date has come and gone? How do we pick that number? Or that date? Why do we need to stop? I understand why we might want to rate-limit the adoption of new glyphs, but I don't see why we'd ever draw a line in the sand and say, "Ok, now it's frozen." Imagine if that had happened before the creation of the Euro as a currency.
How would search work with SVG? What about support for screen readers or translation? Having standard symbols like this means that all of those are easily solved without needing something like an image classifier.
I've been saying for a while now that the proposed conlang block (for Klingon, Tolkien's Elvish scripts, et al.) was shut down before Unicode expanded into the Astral Plane, and it's past time to revisit that proposal and seek a good spot in the Astral Plane. Similar academic criteria to those used for encoding dead and historic languages could be applied to conlang proposals.
Maybe there should be a universal Unicode block for all the icons that apps might need for common functionality, and some of them should be animated to represent an on or off state. Also, top brands' square logos should be added to Unicode. And then different forks/variations could be submitted and accepted if they are useful and good-looking. Also they should all be black and white and have the same theme/look similar. U like my idea?? :DD
https://materialdesignicons.com/ makes a font available with a reasonably large selection of brand and functionality icons, using Unicode private-use ranges (also SVG or whatever).
I don't think I personally agree with brand logos in Unicode.
> Ask the people behind your Operating System and those who design your favourite fonts to start supporting Unicode 9!
Surely not every font has to create glyphs for every Unicode character...how does that work? Is there a kind of "fall-back" font for characters not implemented?
Systems have a way of finding fall-back fonts. At least on OS X (err, macOS) they will satisfy glyph requests in order from the font you specify, some built-in fonts, and then any fonts that have the required glyphs. Finally, a "last resort" font is used to fill in any remaining glyphs[1].
A result of this is that if you ask for a character such as "𓀴" (U+13034 EGYPTIAN HIEROGLYPH A044) in a monospace font, the symbol you get back can be variable width.
I have a font for that, but the character is unreadably small at a size that's perfectly fine for Latin characters.
Many other unicode symbols also suffer from this problem. E.g. ␀ is the printable version of the unprintable NUL (\0) control character, but it's so small at 13.3px / 10pt CSS font size that it's difficult to distinguish from the other control pictures.
I didn't know there's a whole ligature available as a single code point, U+FDFD; thanks sscotth. The code point is also easy to remember.
(It, accidentally, also most probably represents the words first said by the Orlando shooter when calling 911, when the translation in the transcription is compared.)
That looks like your browser is somehow misinterpreting the character set of the page. What you're seeing appears to be something like code page 437 (https://en.wikipedia.org/wiki/Code_page_437) rather than unicode.
What browser/os are you using? Is it possible you're going through a proxy that is altering the HTTP headers?
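For what it's worth, that failure mode is easy to reproduce; here is a minimal Python sketch of UTF-8 bytes being misread as code page 437 (the exact garbage depends on your terminal's glyphs):

    # U+23FB POWER SYMBOL is three bytes in UTF-8; decoded as cp437,
    # those bytes become unrelated glyphs -- classic mojibake:
    raw = "\u23fb".encode("utf-8")  # b'\xe2\x8f\xbb'
    print(raw.decode("cp437"))      # something like 'ΓÅ╗'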
The basic problem is that Unicode characters (which may consist of as many codepoints as you like) have varying width. For instance Chinese has characters that are displayed over a width of two normal monospace characters.
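The two-column behaviour is driven by the East Asian Width character property, which you can query directly; a quick Python example:

    import unicodedata

    # 'W' (Wide) characters take two terminal columns;
    # 'Na' (Narrow) characters take one:
    print(unicodedata.east_asian_width("中"))  # 'W'
    print(unicodedata.east_asian_width("a"))   # 'Na'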
On my Mac it gets displayed with "Noto Sans Egyptian Hieroglyphs", which appears to be included in the system (/Library/Application Support/Apple/Fonts/Language Support) but can be downloaded for other platforms from here: https://www.google.com/get/noto/
There are some, and I have tried to make one myself [1] to learn OpenType and scripts. It is hard and takes much time and effort to get a good one (the Noto fonts [2] are probably the closest to covering all of Unicode while looking great). I admit that it is still hard even without the looks-good requirement.
This isn't even possible for most font formats. There are over 100k assigned code points, but TTF and OTF both use unsigned 16-bit glyph indices, giving a maximum of 65,536 glyphs per font. Not all code points map to glyphs, but some glyphs are also not directly associated to code points (for example, some Unicode combining marks require separate glyphs in certain combinations).
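To illustrate, the glyph count is easy to inspect with the fontTools library (the font path below is a made-up example):

    from fontTools.ttLib import TTFont  # pip install fonttools

    # The 'maxp' table stores the glyph count as an unsigned 16-bit
    # value, which is where the 65,536-glyph ceiling comes from:
    font = TTFont("SomeFont.ttf")  # hypothetical font file
    print(font["maxp"].numGlyphs)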
It hasn't disappeared (Google Ngram viewer shows a consistent occurrence of "led" since 1800). It's just a common error due to (1) inconsistency with "read" (same spelling for present and past tense) and (2) the noun "lead" being pronounced "led".
I wonder if HN is going to complain about the frivolity and uselessness of these the way they incessantly complain about emoji every time a Unicode thread comes up?