It was just a tweak to emoji characters to mark them all as East Asian Full Width instead of Narrow or Ambiguous so that they displayed correctly when using a fixed-width font in a terminal console. This probably only matters if you like to use emoji filenames (you mad person), but it felt like a wart, so I reported it & had a short back-and-forth with the chair of the emoji-related subcommittee, which resulted in a proposal that was eventually accepted by the committee into Unicode 9.0. The committee were great: they took my tiny bug report seriously, wrote huge long treatises to justify the change & eventually voted it into the standard.
(This was pretty much my peak geek achievement of 2016 so far :) )
Holy crap I appreciate this change! I thought they'd never fix it because of compatibility. Thanks for the effort you put in.
It's not that I use emoji filenames, it's that I deal with real-world natural language text all the time, including at the console.
(In terms of compatibility, my text-justifying function is going to stop working correctly for the period of time between when gnome-terminal updates to Unicode 9 and when Python 3.x does. Still worth it.)
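(For anyone wondering what such a function does under the hood: a minimal Python 3 sketch of width computation from the East Asian Width property - a toy, since real code also has to handle combining marks and decide what to do with the Ambiguous class:)

    import unicodedata

    def display_width(s):
        # 'W' (Wide) and 'F' (Fullwidth) occupy two terminal columns;
        # everything else is approximated as one column here.
        return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
                   for c in s)

    print(display_width("abc"))    # 3
    print(display_width("日本語"))  # 6
    # Pre-Unicode-9 tables classify most emoji as Narrow/Ambiguous,
    # so this undercounts them until the data is updated.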
If they get their Unicode data from the same OS-supplied source that provides the wcwidth() function (or are using wcwidth() themselves), a libc update should fix both, I think.
They don't. Python's "unicodedata" module updates with the minor version of Python.
This is good, actually, because the meaning of a string operation should be consistent when run on the same version of Python.
(If only this applied to the "default encoding". The default encoding should be UTF-8, not whatever you get by asking the user's likely-misconfigured locale. As it is, you can't rely on the default encoding if you want your code to work consistently.)
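(Concretely - "data.txt" here is just a made-up example file:)

    # Locale-dependent: decodes as whatever the user's locale happens
    # to say (cp1252, latin-1, ...), so behaviour varies per machine.
    # text = open("data.txt").read()

    # Consistent everywhere: pin the encoding explicitly.
    with open("data.txt", encoding="utf-8") as f:
        text = f.read()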
Ideally yes. I'd have to confirm that this extends to Pizza and Koalas. We fixed that issue as much as we could, even going as far as generating our own Unicode width tables extracted from unifont, but it wasn't possible to fix in general without support from the terminals. Now that the standard is fixed (hopefully), I don't see a reason why the terminals wouldn't update their tables.
The success of the unicodepowersymbol proposal inspired me to suggest a couple of characters to Unicode (the Bitcoin sign and IBM's group mark from 1960s mainframes, which were accepted). The point is that Unicode really is open to proposals from random people; you don't need to be part of a big company to influence Unicode.
Although I had never worked with the Unicode Consortium, I [submitted a proposal][1] for an international symbol for an observer and it was eventually accepted.
It's one thing to have absolute, iron-fisted control over your own platform - it's another to intentionally seek to limit people's self-expression on other platforms by influencing the standard in this way.
Are you saying that a company with voting rights shouldn't be allowed to have influence on what goes into Unicode? That doesn't make any sense. Also, if you read the article, Apple wasn't the only party in favor of nixing the emoji.
> Are you saying that a company with voting rights shouldn't be allowed to have influence on what goes into Unicode?
One might have a right to do something and yet be wrong to do it.
Apple had every right to do what they did, but they were completely, totally and undeniably in the wrong to do it. Everyone associated with their action should be ashamed. Honestly, they should all resign: their behaviour demonstrates that they have no business being associated with this sort of work.
Your comment is ridiculously extreme. They were not "undeniably" in the wrong. You think they were wrong, but that is a highly subjective opinion. In fact, I don't even agree that they were wrong to do this at all. I think it's perfectly reasonable to argue against the inclusion of more gun imagery in Unicode.
Also, if you think Apple was wrong, you must also think that Microsoft was (they voiced support), and everyone else at the meeting who agreed with the move. As the article says,
> Davis confirmed to BBC News on Monday that "there was consensus to remove" the emojis, but that he couldn't comment on the details.
So it's clearly not just Apple that thought this was the appropriate move.
Thank goodness for apple and other members which rejected the starter pistol/rifle proposal. Imagine how many mass shootings we've prevented by people not being able to communicate their plans using the rifle emoji.
Just like how North Korea isn't "undeniably" wrong in censoring speech to such an extreme. This is a very 1984-esque "solution" to a problem -- don't want to acknowledge positive use of guns? Good news, we can just erase them from our language! The way we treat emoji has some very serious similarities to Newspeak.
That's absurd. Nobody's censoring anything here. Apple not wanting to add a new gun emoji is in no way preventing you from talking about guns. Emoji isn't a replacement for English and nobody is forcing you to "speak" in all emoji.
An emoji cannot be an emoji without having emoji presentation. There is a rifle character that resembles, in form and function, the other Olympic emoji; it's just not sitting in the set reserved for emoji. Any platform that wants to pick it up can use it, and they can even put it next to all their other emoji if they want to. Apple et al. did not 'prevent' Rifle from being encoded. They just decided they wouldn't pick it up as an emoji for their platform, and the Unicode Consortium decided to move it to a different category based on that decision.
There are literally thousands of characters that Apple doesn't encode for their platforms. I haven't heard whether Apple will be supporting:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
The Bravanese dialect of Swahili, used in Somalia
The Warsh orthography for Arabic, used in North and West Africa
Tangut, a major historic script of China
All of which were added to Unicode 9.
>The two characters will still be part of the Unicode spec, but they'll be classed as black-and-white "symbols" instead of regular emoji
I assume they reserve the unicode character, but anyone who wants to use it decides what it looks like (so the rifle could look different on different platforms, which isn't a big issue)
This is always true for Emoji. The platforms always decide what their presentation of emoji will look like, just as they determine what font they will use for the unicode traditional letters. In this case, Apple and others decided they didn't want a rifle emoji, so it was moved to the 'black and white symbols' section so that those platforms wouldn't be 'missing' an emoji, which has other technical implications.
Heck, Unicode is a mess already. But that's mostly because (a) language and scripts are messy, and (b) the original aim was to unify and encompass all existing character sets, and some of those were messy as well.
While there was considerable uproar over Emoji and there still often is over yet another fifteen symbols that everyone thinks no one would ever need or use, the bulk of the Unicode character set is still scripts for human languages. And some of those are only relevant for a very small minority, say, archaeologists. But that's fine. There's enough space, we're nowhere near to running out and Unicode enabled all sorts of cool things in computing that simply were not possible before or only with awful hacks and workarounds.
A text file, a webpage, or a database table can only contain textual data in a given encoding.
That's because every byte stored in the file, for example byte number 188, either means "¼" (as it does in ISO/IEC 8859-1, aka. Latin-1 or ANSI), or it means "ỳ" (as in ISO/IEC 8859-14) or "シ" (in JIS X 0201, one of the many Japanese encodings that were devised over the years.)
How do you know which encoding a certain file uses? In general YOU CAN'T and this was the source of many problems and "solutions" which caused even more problems over the years.
Well then, how did you mix symbols from different alphabets, say in a dictionary or in a post that talks about them, like this very post? YOU COULDN'T, short of doing ugly hacks and other subterfuges, like using GIFs for all foreign characters.
Unicode gave a distinct number (or "codepoint") to every character and symbol known to man (within reasonable limits), and this allowed a lot of things that we take for granted nowadays, including this very post, where I just copied and pasted various symbols from their Wikipedia pages and expect them to work.
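(You can watch the byte-188 story above play out in any Python 3, which bundles codecs for all of these legacy encodings:)

    b = bytes([188])               # byte number 188, i.e. 0xBC
    print(b.decode("latin-1"))     # ¼ (ISO/IEC 8859-1)
    print(b.decode("iso8859-14"))  # ỳ (ISO/IEC 8859-14)
    print(b.decode("shift_jis"))   # ｼ (the JIS X 0201 half-width katakana range)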
Fun fact: Japanese even has a word for foreign characters, "gaiji", and it's extremely common in Japanese ePUBs to use small square images for characters not in the current font, with the term "gaiji" in the CSS class names used for these characters. And at least one mainstream ePUB reader has special code to detect these gaiji and adjust its rendering to make them behave better.
Similarly, the word "loanword" (used to describe words borrowed from other languages) is gairaigo 外来語 which is literally 外来 (gairai) "foreign" + 語 (go) "word/language". Japanese is filled with words where you can often figure out the meaning purely from the characters used!
Conversion between different character codes became much easier. (Å (U+00C5) is canonically the same as Å (U+212B), but the latter exists only for round-trip compatibility; see the sketch below.)
Unicode defines normalization algorithms (is é its own character, or e with a modifier character?).
I can have a document which combines English, Russian, Arabic, and Chinese, and expect it to be readable and editable by many different tools.
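(The normalization and round-trip points are easy to demonstrate with Python's unicodedata module:)

    import unicodedata

    s1 = "\u00e9"    # é as a single precomposed code point
    s2 = "e\u0301"   # e + COMBINING ACUTE ACCENT
    print(s1 == s2)                                  # False
    print(unicodedata.normalize("NFC", s2) == s1)    # True

    # U+212B ANGSTROM SIGN is canonically equivalent to U+00C5 Å,
    # so normalization folds the round-trip duplicate away:
    print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")  # True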
> I can have a document which combines English, Russian, Arabic, and Chinese
English combines with top-down Chinese, and English combines with right-to-left Arabic, but top-down Chinese and right-to-left Arabic don't combine properly in the same document using Unicode -- the Arabic will be written bottom-up instead of top-down when embedded in the top-down Chinese.
I meant something simpler, like: The word 'computer' in English is 计算者 [jì suàn zhě] in Chinese, Компьютер in Russian, and حاسوب in Arabic."
Try that without Unicode.
It's of course possible with TeX, and no doubt other solutions. Which is why I added "and expect it to be readable and editable by many different tools".
(As a real-world use case, look at Knuth's "The Art of Computer Programming" and see how he credits people using their full names, in their own written language.)
Do you know when you would use 计算者 over 电脑 [dian nao]?
计算者 I guess more literally translates to "one who computes", whereas 电脑 translates to "electric brain", which is a way more fun image, but I have no idea how the usage varies.
I am not a native speaker and terrible at googling this, but I once heard that 電腦 was originally the Taiwanese way of saying it and 計算機 the Chinese way, in the same way that 'program' is still called 程式 on one side of the strait and 程序 on the other, or 'internet' is either 網絡 or 網路. The names of movies and video games are also generally different (PRC, HK, TW).
计算机 [jisuanji] is the older usage that harkens back to the days when computers were mainframes and terminals and not in every household. Generally everyone says 电脑 [diannao] these days.
Someone I know worked at a company that sells a content management system with version control. Bigcos use it to keep and update lots of marketing materials and manuals. Those responsible for product x or area y can see what's been added by others, and update their own versions, and record that.
In this way, new FAQs and other updates spread to all the markets easily.
It's vastly easier to do this stuff when all the documents use the same text encoding. Even if no one can read everything, the fact that everything uses the same encoding means that any pair of languages you can read are technically readable.
You don't even need to be part of a big company to go to their meetings. I went to their conference in San Jose just to see the proposal for the Chinese take-out box/chopsticks/fortune cookie emojis. I met a ton of nice people, too.
How did the Unicode Consortium turn around? I remember 10 years ago they were refusing to add standard media icons because:
>The scope of the Unicode Standard (and ISO/IEC 10646) does not extend to encoding every symbol or sign that bears meaning in the world.
>This list has been round and round and round on this -- regular as clockwork, about once a year, the topic comes up again. And I see no indication that the UTC or WG2 are any closer to concluding that bunches of icons should start being included in the character encoding standards simply on the basis of their being widespread and recognizable icons.
>Where is the defensible line between "Fast Forward" and "Women's Restroom" or "Right Lane Merge Ahead" or "Danger Crocodiles No Swimming"?
Unicode is supposed to include symbols that appear in "running text", not standalone icons. So no on traffic signs for instance. (There are exceptions for historical reasons. And emoji are a totally separate story.)
In all seriousness, I'm not sure emoji really belong in text encoding. Even though it's more convenient, based on where they're most frequently used I don't think they need to be universal.
1) everybody uses them on their phones, they're in Unicode, consistent and compatible between devices and messaging programs. In the far flung future, researchers will be able to study their linguistic role in communication, confident in understanding what the characters were.
2) everybody uses them on their phones, they're proprietary fonts and codepoints (in the Unicode private use area if you're lucky, just random data if you're not), there's no consistency between phone models, manufacturers, or cell networks. Future researchers can pound sand.
We were at #2 pre-Unicode. It was a goddamn mess, especially in Japan. Lord knows why anyone would prefer it. There's no value in being a snob about what kinds of incredibly frequently used characters we think are Worthy of inclusion, imo.
3) People who love colourful images will use stickers in Facebook Messenger, LINE, Viber, and soon iMessage. I'm sure WeChat has them too.
It's basically like 2), except we've moved from proprietary codepoints to proprietary protocols.
I don't mind characters like and or even good old ︎ (which has always been too tiny for its own good). These work in black and white, in different artistic styles, and they're a fairly limited set.
But now we're going down the road where we get new stuff like tacos and unicorns every year. And even though Unicode is an industry standard, the pictures need to look like Apple's bitmaps to avoid confusion, and the Unicode standard changes so often that you have to manually keep track of who can already see and whose computer/phone/browser/messenger software is too old.
Oops, thanks. Well, that explains why I've never seen Emoji on Hacker News. And I've missed the edit window, so I can't fix my post.
Should have been:
> I don't mind characters like ((yellow smiling Emoji)) and ((thumbs up Emoji)) or even good old ︎((pre-Emoji Unicode smiley)) (which has always been too tiny for its own good). These work in black and white, in different artistic styles, and they're a fairly limited set.
> But now we're going down the road where we get new stuff like tacos and unicorns every year. And even though Unicode is an industry standard, the pictures need to look like Apple's bitmaps to avoid confusion, and the Unicode standard changes so often that you have to manually keep track of who can already see ((upside-down smiling Emoji)) and whose computer/phone/browser/messenger software is too old.
The "universal" in Unicode means that it aspires to include all symbols used in any form of text; not that it should only include symbols that are used in all forms of text.
Emoji were added to Unicode for compatibility with various mobile phones, so they would have a standard encoding. That's how Unicode ended up with the poop emoji for example - they didn't sit around thinking "what we really need is...". Since people really, really want more emoji, Unicode is sort of stuck constantly adding more. If you want to propose new emoji, the rules are at http://www.unicode.org/emoji/selection.html
Text symbols (as opposed to emoji) have different rules. Basically, the symbol needs to be used in "running text" (i.e. normal text), like "containers with [recycling symbol] can be recycled" or "he bid 2[club]". Traffic signs for example are not normally used in the middle of text, so they aren't encoded in Unicode. To get the Bitcoin symbol encoded, I needed to show that it was used in text, not just as a standalone icon. The full rules for symbols in Unicode are at http://www.unicode.org/pending/symbol-guidelines.html
TL;DR: Don't argue "Why does Unicode have a poop emoji but no symbol for X?" - the rules are totally different for emoji and symbols.
Edit: does HN strip out arbitrary Unicode characters now? I originally had Unicode characters in place of [recycling symbol] and [club], but they disappeared when I submitted.
IIRC HN might have a character whitelist to prevent overloads of combining or layout-altering characters, and not have to worry about the behavior of newly added characters. There were some comment threads a few years ago that were just stacks of hundreds of combining diacritics that would crash some rendering engines and create odd decorated text on others.
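(The effect is trivial to reproduce - a sketch of the kind of stacked-diacritic "zalgo" text those threads were full of, in any Python 3:)

    import random

    # The Combining Diacritical Marks block, U+0300 through U+036F.
    COMBINING = [chr(cp) for cp in range(0x0300, 0x0370)]

    def zalgo(text, depth=15):
        # Pile `depth` random combining marks onto every character.
        return "".join(c + "".join(random.choice(COMBINING) for _ in range(depth))
                       for c in text)

    print(zalgo("hello"))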
Tom Scott has a great video about the history of emoji. Basically, some companies in far east countries were encoding these icons in various proprietary codes for their messaging systems. The ecosystem became widespread and consistent enough that the unicode consortium saw it fit to include these emoji in the standard.
The snowman, on the other hand, is a weather symbol for snow, I assume. It appears alongside other symbols for meteorological phenomena, so I imagine was added around the same time and with similar reasoning: http://www.fileformat.info/info/unicode/block/miscellaneous_...
I'm not sure about the 'running text' thing, but in my view Traffic Signs are not globally universal (yet), so you'd have to have regional variants which is impractical.
Honestly, I think "SVG over UTF" makes a lot more sense. It's impossible to make a character set that supports every character known to man, because that just adds undue effort on every computer maker, etc., to keep up.
So why don't we pick a very good set: perhaps every letter in every language in common use for the past 200 years? Then, for the oddball symbols that someone wants to mix in text, there can be some kind of SVG-like convention. This allows publishing textual information without requiring that every device maker updates their device to support a 1-off symbol.
> This allows publishing textual information without requiring that every device maker updates their device to support a 1-off symbol.
The main purpose of Unicode is to encode the information. How the information is turned into its visual counterpart is outside the scope of unicode. For what it's worth this could be done by linking unicode code points to matching SVGs in a document. Wait, exactly that is already a W3C standard: https://www.w3.org/TR/SVG/fonts.html
Because it's easier to throw in random icons than to actually accomplish the goal of "every letter in every language in common use for the past 200 years", or even "past 20 years".
Or, put another way:
'We have an unambiguous, cross-platform way to represent “PILE OF POO” (), while we’re still debating which of the 1.2 billion native Chinese speakers deserve to spell their own names correctly.'
This is a link by the article's author that is intended to make it easier for us to add useful symbols: https://github.com/jloughry/Unicode
I recommend you use it to add any glyphs that you feel are being neglected.
That article raises an interesting issue about a character in the author's name that is missing from Unicode. Unfortunately the article is (how to put this?) not constructive. The complex reasons that Unicode excluded the character are described in [1]. If the author addresses those issues, there's a much better chance to get the desired character into Unicode.
Correct me if I'm wrong, but isn't the Han Unification project more about unifying semantically distinct, but visually identical characters under the same codepoint (rather than grouping together similar-looking codepoints as the article suggests)? As far as I'm aware it's more along the lines of reusing the codepoint for 'a' when encoding both English and Spanish text. Am I mistaken in thinking this?
But if the shape itself is embedded in the text, font choice becomes meaningless.
> undue effort on every computer maker, ect, to keep up.
The effort to update the font files every few years? Unless you insist on supporting a new Unicode version the second it comes out, I don't see the big effort here? Of course there is effort for font makers, but this is quite centralised.
What about the oddest oddballs whose "symbols" are animations http://www.reactiongifs.com/r/tww.gif? They are used a lot on reddit sometimes even with sound.
As the story mentions regarding the off symbol (a circle), there are many visually identical code points that have different semantic meanings. But in this case, they added an additional semantic meaning to an existing code point.
So which is it? Does each code point represent a visual image? A semantic meaning? Both? It depends? Something else?
I've tried to decipher that on my own and only learned that the answer to these sorts of questions are complicated, because it's very complicated to represent all written human language via one set of rules.
So I know some of the answers to my questions above, but I'm hoping someone with real expertise can provide the fundamental rules/policies - if there are any.
"every code point represents a semantic meaning" is completely consistent with the notion that some code points e.g. have differing visual representation depending on their position in the word.
Try looking up han-unification and its justification and you'll see the exact opposite approach to encoding characters into unicode.
For CJK characters, they unified all semantically similar han-characters, even when they have visual forms that are quite different between Japanese, Chinese and Korean.
If you want to write Japanese and Chinese in the same document, you need to mark up the section to tell the system that renders it, to render different visual forms for similar codepoints depending on whether they are used in Japanese or Chinese.
> For CJK characters, they unified all semantically similar han-characters, even when they have visual forms that are quite different between Japanese, Chinese and Korean.
This isn't true. 青 and 靑 are the same character written differently; they have their own codepoints. Ditto for a huge number of simplified Chinese characters; 语 is mainland Chinese and 語 is the same character in Japanese.
It is true for lots of characters (so I guess I was being a little hyperbolic when I said "all"), and you cannot rely on choosing the correct code points in order to have a text display Japanese or Chinese. You need to tell your rendering program (often through choice of font) if things are to be rendered with Japanese or Chinese forms.
I wouldn't know how to show you examples here, as 直 and 直 will display the same since they have the same code point, but a different number of strokes in Japanese and Chinese.
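(The most you can show is that there is nothing to show - the distinction simply doesn't exist at the encoding level:)

    # Whether 直 is drawn with the Japanese or the Chinese stroke shape
    # is decided by the font; the text itself can't tell them apart.
    print(hex(ord("直")))   # 0x76f4, the single unified code point
    print("直" == "直")      # True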
Aren't they putting the disunified characters into the U+2xxxx plane now?
Han unification is generally seen as a bad choice in retrospect, but it was something Unicode had to do when it looked like 2^16 codepoints were all they were going to get.
Never heard of that, but I would appreciate if all the characters with different glyphs had different codepoints. Do you have a source? Do you know what happens to the "unified" code-points?
It is true to some extent. While 青 and 靑 have different codepoints, there are plenty of characters with the same codepoint that are rendered differently depending on the language specified:
Han characters that are traditionally viewed as variants of one another, or that are simplified from more complex logograms (such as 龜, which was simplified into 亀 in Japan and 龟 in mainland China) tend to have different codepoints, but the stylistically different ones usually belong to the same codepoint.
Han Unification "rules" were an inconsistent mess, but I do know that in Japanese 靑 was at one time a printer's simplification of 青, so you could find either in texts, and the Consortium tended to encode a character separately if you could find printed examples of both in the same language.
That is not the compromise struck, though; there are even many Cyrillic glyphs that are visually identical to those in Latin, but assigned differing codepoints.
There are multiple reasons for that, one of which is compatibility with previous encodings and standards. If a previous encoding Unicode wanted to be compatible with encoded these as different characters, Unicode needs these to have separate code points for them too.
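(This is why visually identical strings can compare unequal - the classic homoglyph situation:)

    import unicodedata

    latin = "A"          # U+0041
    cyrillic = "\u0410"  # А
    print(latin == cyrillic)           # False
    print(unicodedata.name(latin))     # LATIN CAPITAL LETTER A
    print(unicodedata.name(cyrillic))  # CYRILLIC CAPITAL LETTER A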
But how serious is the problem? How many times do you need to test whether a given character is one of the 26 allowed letters of the English alphabet, and how often do you implement that by testing it against a contiguous range?
Typically you write it as "islower_english(c)" once, and be done with it. Is that really hard?
If you do think that's a serious problem, then what of those programmers who need to test for lowercase letters in "España", "München", "Diyarbakır", and "façade"?
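(In Python 3 the contrast is easy to see: the naive range test fails on all of those letters, while str.islower(), which consults the Unicode database, handles them:)

    def islower_english(c):
        return "a" <= c <= "z"   # the range test under discussion

    for c in "ñüıç":
        print(c, islower_english(c), c.islower())
    # ñ False True
    # ü False True
    # ı False True
    # ç False True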
I mean, if you really want to get into it, it was a huge pain in the ass at a time when paying the cost of a call to islower_english was much more expensive than a hardware less-than instruction.
We've broadly moved beyond that, but there's still value in grouping sets together in a way that makes certain kinds of frequent tests less computationally expensive than they would be if codepoints were randomly distributed.
I didn't dispute that. I'm just stating that trying to remain compatible for the sake of being compatible is a great way to design a convoluted and difficult-to-understand standard.
Of course, but lacking backward compatibility is a great way to make sure a standard is not adopted. For example, the reason that UTF-8 'won' is that it has a great backward-compatibility story with other ASCII-based encodings and systems.
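(The whole compatibility story in two lines: any pure-ASCII text is already valid UTF-8, byte for byte, and non-ASCII characters can never be mistaken for ASCII:)

    s = "plain ASCII"
    print(s.encode("ascii") == s.encode("utf-8"))  # True

    # Non-ASCII characters become multi-byte sequences in which every
    # byte is >= 0x80, so they never collide with ASCII bytes:
    print("é".encode("utf-8"))                     # b'\xc3\xa9'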
A pointer would let you ask "does a character lie between 0x12 and 0xBC" but would not hold the character itself; it would make possible to implement different "characters" with the same representation.
Perhaps Unicode could just have tables listing all relevant sequences of symbols, instead. So "latin letters lowercase" would list the codepoints for a-z in order, for example. Would no longer matter if the codepoints themselves are sequential or not.
(And relevant to my country, "Swedish characters lowercase" would map to latin letters lowercase + åäö.)
Characters have a script associated with them (e.g. Latin), and caseness is also part of a character's properties.
Now, language-specific subsets¹ of those are a bit iffy to deal with. Especially when text can contain loan words from other languages, so in my experience it's rarely a useful thing to ask for.
¹ Yes, subsets. Latin letters lowercase is not the set abcdefghijklmnopqrstuvwxyz. It is the far larger set of every lowercase letter belonging to the Latin script - ä, é, ø, and hundreds more.
How do you condense that again into language-specific subsets? Every letter that appears in a word in a dictionary? Then at least é belongs to German as well, even though it's usually not considered part of the German Latin subset. Unicode stays clear of that issue by simply not defining what script subsets a character belongs to (rightfully so, IMHO).
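(The standard library gives a rough feel for how large that footnote's set really is - a heuristic using character names, since the stdlib doesn't expose the Script property directly:)

    import sys
    import unicodedata

    latin_lower = [
        chr(cp) for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Ll"
        and unicodedata.name(chr(cp), "").startswith("LATIN")
    ]
    print(len(latin_lower))   # several hundred letters, not 26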
Yes, but they have alternate italic forms, for example. Sure, some glyphs like с don't have an alternate italic form. Since the other ones do, it would be weird to only assign a separate codepoint to some of them and overlap the others. It would be a workable solution, but still weird.
Which is a bad idea because the characters don't look right unless you use a Japanese font. But if you want to write an article comparing Japanese and Chinese characters, you have to use two different fonts.
The emoji code points can be represented differently on different systems given their meaning.
So it makes sense to have different emojis for different 'meanings'.
The 'moon' switch here does not mean 'moon' - it means 'standby' or whatever.
It may look noticeably different on different systems.
Think from a design perspective: you have 5 emojis to represent 'clouds, sky, earth', etc. - and a different set of 5 to represent 'on, off, sleep, shutdown'. Those icons will be markedly different in terms of representation, groupings, colour coding, and underlying functionality if they are integrated into an experience in any meaningful way.
Text your car with the 'shutdown' symbol to tell it to shut down.
Your bot texts your friend with a moon symbol to tell him you're asleep. Or whatever.
But that doesn't explain the inconsistency in the current case.
So if a system wants to render "on" differently than "straight vertical line", that's possible.
However, if "off" should be rendered differently than "circle", that's not possible. (Or only possible with out-of-band information or modifier characters which would still have to be defined)
It's a mess. If you want to write a document in Japanese that talks about a Chinese character which is written differently than its Japanese version, you can, or can't, achieve this in Unicode, depending on the character, its history, and the mood of the consortium the day it was assigned.
The reality is that Unicode is governed by people, some of those people are grumpy reductionists who push for a minimum of symbols and a maximum of meaning-overloads, and others are more liberal and tend to advocate the opposite, and the result is a compromise, and is in areas very messy.
Do note that they did not include a generic "off" symbol, they included the IEEE 1621 off symbol - which must be rendered as a circle; while on the other hand the IEEE 1621 on symbol must be rendered in a manner that is often different from just "straight vertical line" in particular regarding the corners of that line.
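(For reference, the assignments in question - running this needs a Python whose unicodedata tables are at least Unicode 9.0, and note that U+2B58 is the pre-existing code point that merely gained the "power off" meaning:)

    import unicodedata

    for cp in (0x23FB, 0x23FC, 0x23FD, 0x2B58):
        print("U+{:04X}".format(cp), unicodedata.name(chr(cp), "(unknown here)"))
    # U+23FB POWER SYMBOL
    # U+23FC POWER ON-OFF SYMBOL
    # U+23FD POWER ON SYMBOL
    # U+2B58 HEAVY CIRCLE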
I didn't see special Unicode symbols for the SI units in the link you gave or in a more general search. I found no match for "ampere" which is the SI base unit for electric current, or for "candela", used for luminous intensity.
> Unicode also has encoded U+212B Å ANGSTROM SIGN. However, that is canonically equivalent to the ordinary letter Å. The duplicate encoding at U+212B is due to round-trip mapping compatibility with an East-Asian character encoding, but is otherwise not to be used.
I'm a bit confused about Unicode. I thought it was a repository of linguistic symbols, not arbitrary symbols. More and more it looks like Wingdings. Isn't this putting a burden on font support and text processing (what's the lexicographic order of such symbols - do you use the abstract name?)?
They want every symbol used in a document to have a unique encoding, so that you can change fonts without losing meaning. Fonts like wingdings are a horrible hack.
The idea is one (complex) encoding that will represent the info until the end of time. It creates a lot of trouble, but it's still a good idea.
Technically, glyphs are supposed to meet some standards, like being shown in use in running text, before they can be added to unicode. It's not supposed to be a repository of every picture anyone ever dreamed up.
The standards are not applied consistently. Even leaving emoji out of it, the chinese "character" 囍 never occurs in running text, but there it is in unicode.
I don't think it's true that 囍 never occurs in running text - it's used in company names which would be used in text. It would be odd not to have an encoding for such a common character.
All right, I spent some time trying to find the requirement. I did not find it, but my tentative conclusion is that it does not apply to chinese characters.
FROM MEMORY, a while back there was an article on HN complaining that emoji seemed to magically bypass the requirements other characters needed to meet for inclusion in unicode, and that in fact they were commonly in violation. The taco symbol was called out as an example. I can no longer find this article, but it mentioned the running text requirement, and -- I believe -- specifically indicated that use in names does not count as use in running text. (For an idea of why that might be the case, check out http://tvtropes.org/pmwiki/pmwiki.php/Main/LuckyCharmsTitle .)
HOWEVER, I was not even able to find, on the unicode web site, any discussion of a running text requirement at all, for any kind of symbol. Some example proposals do refer to "running text" by name, but they don't indicate why. The example proposal given for adding characters to an existing block ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2934.pdf , suggested as a prototype in http://www.unicode.org/faq/char_proposal.html ) does not mention "running text" at all, and doesn't appear to go to much trouble to document it, although some such documentation is given. The rough guidelines for character proposals at http://unicode.org/pending/proposals.html do not refer to "running text" at all, but they do suggest that, late in the process (specifically, on a proposal summary form, which is different from, and subsequent to, an actual proposal), "references to dictionaries and descriptive texts establishing authoritative information" are required.
I conclude that the Unicode standard's preferred criterion for chinese character inclusion is "would an authoritative chinese dictionary include this character", and while the answer to that question for 囍 is not unambiguous -- a lot of dictionaries don't include it -- it's easy to imagine that some do.
I would appreciate a pointer to the actual running text requirements, as well as what they are supposed to apply to, if anyone can provide that.
> It would be odd not to have an encoding for such a common character.
Outside of its use as a wedding decoration, which is plainly nonlinguistic, how common is it?
You're not wrong, but I don't know if it's a great example.
CJK characters are, broadly, an example of the Unicode Consortium trying to be way too reductive about what they'd accept, leading to a lot of bad decisions like Han Unification, which caused a lot of damage and which the Consortium has generally now backed away from and recognized as a bad idea.
So, yes, if you look closely at CJK character sets in Unicode, you can find a lot of decision making that appears to contradict decision making elsewhere in the standard. This is in large part because the decisions they made wrt CJK characters turned out to be largely wrong, and they've since changed their approach.
Unification is a mistake. But 囍 has nothing to do with unification. Do you think that 福倒 should have a code point? Would it be considered one of the chinese characters (very iffy) or one of the holiday symbols?
The 天书 ( https://en.wikipedia.org/wiki/A_Book_from_the_Sky ), by design, consists solely of chinese characters that don't exist. (Theoretically. A couple of them, by oversight, did exist.) They are still recognizably "chinese characters" by virtue of being composed of the same components. Should they have unicode points?
囍 plainly exists, but has no textual use. Is it more similar to 靑 or to ️U+2764 "heavy black heart"?
> Unification is a mistake. But 囍 has nothing to do with unification.
I'm not saying it does, I'm saying Unification illustrates the fact that the Consortium's decision-making with respect to CJK has changed over time, has frequently been illogical, and shouldn't be pointed at as an example of anything good or sane or worthy of precedent.
The fact that 囍 has a code point but 福倒 doesn't have a codepoint is another example of the Consortium being unnecessarily reductive and intransigent about CJK.
> Do you think that 福倒 should have a code point?
Yes. If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way.
> Should they have unicode points?
I'd lean towards no, as they're one-offs, not something broader that people want to discuss and use in text. But I'd be ok with adding them, too. We're not running out of space. There's no value in making CJK so much harder to interop with than everything else, in general.
> Yes. If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way.
This doesn't make any sense. We talk about things in text by using words, not direct representations. A dog emoji is not necessary or desirable for discussing dogs in text, and a 福倒 emoji is not necessary or desirable for discussing 福倒s in text.
Should the wikipedia page https://en.wikipedia.org/wiki/Statue_of_Liberty be edited to replace the cumbersome phrase "statue of liberty" with the more modern and convenient U+1F5FD 'STATUE OF LIBERTY'?
> A dog emoji is not necessary or desirable for discussing dogs in text, and a 福倒 emoji is not necessary or desirable for discussing 福倒s in text.
"Necessary" is an ill-defined and reductive way of looking at communication. History has shown us that you can't draw bright lines between things you, in the abstract, have decided are the "necessary" subset, and expect the world to follow along.
Linguists have come to understand that you can only describe and follow human behaviour, not prescribe it.
Anyways, humans plainly found it necessary to annotate their text messages with pictorial indicators of their mood, to the point where it became so widespread and such a mess that we felt it desirable to standardize the code-point representations. That it isn't desirable in all circumstances or appropriate in all registers of formality does not mean that it isn't an emergent behaviour which will continue to arise whether or not it is "necessary".
tl;dr I don't really give a shit that "dog emoji" isn't appropriate for an academic text on canine surgery. It's more than sufficient to me that it is used millions of times in text messages between regular human beings. Text needn't be formal text to deserve respect in encoding.
> "Necessary" is an ill-defined and reductive way of looking at communication.
I took "If we want to be able to talk about it in text (like now), I want to be able to encode it in a standardized way" as implying that the two clauses were related to each other. Saying "if we want to be able to talk about it" means you're talking about what's necessary for that purpose.
Interesting comments, thanks. It is used in company names, which would make it awkward not to have an encoding for it: there are many characters used just in names in Chinese that would leave locations and people having unencodable names if the characters were not in Unicode.
> there are many characters used just in names in Chinese that would leave locations and people having unencodable names if the characters were not in Unicode
I was under the impression that this describes the current state of affairs, and has since before Unicode came around. I know I've read an article about someone whose 姓 was 马 and whose personal name was a character composed of three 马 stacked left-to-right (which might have been pronounced cheng?) getting harassed because the government couldn't encode the name.
It may be the case, but usage of many of these characters is rarer than 囍. Looking through many of the characters in the cjk Unicode extensions it is not hard to conclude that characters like 福倒 should exist. E.g. 2010f and 20114 which look like 了 and 予 upside down. Or 255d0 which is 石磊 joined together.
The word "running" doesn't appear on that page. (Actually, no requirements at all appear on that page; it speaks strictly in terms of strengthening or weakening the case for inclusion, not disqualifying.) Can you explain briefly why that page is evidence that the running text requirement does not apply to Chinese, and where it specifies what the running text requirement is?
Alternatively, what requirements do apply to Chinese, and would they preclude an invented character like one with 女 on the left and 离 on the right?
I thought the symbol guideline page discussed "running text", but I guess not. Apparently the "running text" requirement isn't part of the published criteria even though it is enforced in discussion.
That character is super common. Way more common than any emoji I'd expect (even teens using emojis won't outweigh the number of weddings and new years in Chinese speaking countries).
Well, even if it doesn't meet unicode inclusion requirements, it is necessary for printing in one of the largest markets in the world. Without that character in unicode, Chinese display systems and printers probably won't use unicode at all (and before unicode they used some standard of their own) - meaning the question is whether unicode wants to be relevant or not, not whether the inclusion requirements fit.
This is used in discussions of the character - would you not consider that text? It does seem to have a more figurative than literal reference than most characters, in a way that I am not sure how to translate into English.
Use of 囍 in discussions of the character 囍 can be reasonably considered nominal use (that is, use within a name). The use of a concrete object to directly represent itself isn't really the same thing as the use of language to refer to a concrete object.
edit: I'd be interested in hearing your thoughts about "it does seem to have a more figurative than literal reference than most characters, in a way that I am not sure how to translate into English", in Chinese if necessary. (No guarantee I'll understand it, but I'm interested.)
In response to your edit, I mean that it has cultural resonance that is unusually strong in relation to its linguistic overtones, in many ways similar to the semantic timbre of a character like 福. The level of abstraction is different from English because of the ideographic nature of characters that means the visual appearance is emphasised, so the boundary that you pick out between reference and referent is more blurred.
I'm not really clear why exactly this character isn't more widely used in text, but I feel this might not be a bright dividing line from more common characters. I think inclusion of the 福倒 is a harder case to make, but the examples I quoted elsewhere in this thread make me think it should be included. Perhaps not what you were hoping for in terms of elaboration; the problem is more conceptual fuzziness on my side perhaps than language of expression.
Assuming that 囍事 in that passage refers to "a wedding", first I'd admit that that passes pretty much any test of "linguistic use in running text".
Having said that, I note that 喜事 appears in my dictionaries with the gloss "wedding" (well, "any occasion meriting joy, particularly a wedding"), 囍事 does not, and since 囍 is a symbol of weddings which is generally assumed by the Chinese to have the same pronunciation as 喜 it makes for very natural wordplay to substitute it into the word for wedding. I would draw a pretty close analogy with the $ of "Micro$oft" -- it's use in running text, but it shouldn't be taken as evidence that $ is a letter in English.
You just aren't getting it. Do you actually know Chinese, or are you just looking things up in a dictionary?
It is pretty natural to jump from 喜 to 囍 because that is how Chinese works. You take radicals, and you bundle them up. You have the "busho" system, where people in the past bundled up little bits and pieces to form new words. No reason why people in the present can't do the same.
Re: Micro$oft being outrageous if $ becomes a part of the alphabet. You are misapplying an English oriented viewpoint. In Chinese, there is no objection to forming words in that way, by incorporating radicals together. It's similar in theme to how in German, you can just keep stringing words together to form larger words. In fact, I actually think in the future, words like Micro$oft should entitle $ to become part of the alphabet! That's a very Chinese way of looking at things.
Language is not static. Systems that try to encode language are descriptive. They can never be prescriptive - otherwise we as a civilization die.
If 囍 wants to be a character point, let it be one. If 福倒 wants to be one, there should be one. Isn't the point of unicode to have enough space to include all these kinds of language artifacts (artifact as in a cultural / historical item thought up by humans) in order so people can uniquely reference each one? They are distinct logical units.
If the unicode rulebooks are too rigid, the rules need to change or the approach needs to change. It's useless to try to argue that xyz character in another language shouldn't/can't be a character - people will just stop using unicode if it doesn't suit their needs.
Reeks of colonialism, that's what it is.
EDIT: as an additional gloss, here's why I think 喜 and 囍 are sometimes used differently, even though by the dictionary definition they seem to be the same. I will explain why I think logically they are different concepts.
喜 is happiness, delight, joy. It is probably an adjective in the English sense (I can't map grammar rules through different languages easily).
事 is an occurrence, an item, something that happens.
When you put them together,
喜事 literally means something happy is happening.
The cultural meaning has turned that into a connotation of "wedding", but it could actually be a ton of happy things. Promotions, and yes - one other really big thing in a person's life: having a baby.
有喜 (meaning "having happiness") is the traditional way of referring to a woman being pregnant.
You can turn that into 家有喜事 - meaning home having something happy - as in this household is having a baby. And you can use it without the 有 - and just use 喜事 to refer to having a baby.
This is different from a wedding.
囍 is a modification of 喜, by doubling up the character and treating it as a radical, people are referring to the idea that there are "two people having happiness" - like a doubled amount of happiness.
In the article linked http://www.chinatimes.com/newspapers/20160623000760-260115 - the 囍事 is used to specifically identify the "wedding" type of 喜事 - it's like trying to avoid the ambiguity and double-entendres that Chinese writing typically embraces and just presents things matter of fact, which is ideal because the article is a newspaper article about customs of towns. Not really something you want people to have multiple interpretations like an essay or a poem, for example.
So logically, there is a difference when trying to use 喜 vs 囍 and I actually really appreciate the author's use of the double version in the text.
I know that not everyone reads these characters in this way, but I do - and I'm sure other people will notice this too. It's the best part of Chinese - not knowing, and not seeing the ambiguity, and one day, someone tells you about it .. and you're like - OMG that's what that means ...
For my earlier indication that this type of character modification is common in chinese:
木 = wood
林 = common last name Lin, also means forest (uncommon on its own)
森 = common character for forest.
The English word "forest" is usually 森林
It's just a doubling and tripling of the 木 radical.
What does it matter that this character is super old - people thousands of years ago thought this up.
Also, if this character weren't so old, would you say that 森 and 林 are both forests and thus don't need separate character points in unicode? That's outrageous!
So now we have a modern version of this modification 喜 -> 囍
And I showed how I think they are different logical concepts.
hmm what's the issue with it being a unicode character point?
POST Edit
In the writing of this post, I think I've come to identify Chinese as an "ambiguity-first" language - I learned Chinese as my mother tongue, but stopped at an elementary-school level, and switched over to learning English to a Bachelor's-degree level.
In Chinese, puns, double-entendres, and ambiguity just "happens" by default, and you have to work your way to be crystal clear.
English is more straightforward, with a speaker having to actively try to make puns or double-entendres.
In the case of 囍, it's a reduction in scope. Modern Chinese people had to create a new word just to narrow down the meaning of 喜 - so that it specifically refers to weddings.
Your whole line of thinking was that 喜 already had meanings inclusive of wedding, so 囍 can't possibly add any more meaning when it also means wedding. In actuality, it took away a bunch of extraneous connotations, and in Chinese, the reduction in complexity is so valuable that it's worth a new word.
I think that it's a mistake to try to over-legislate and reduce languages into a set of rulebooks for character encoding - that's all I am trying to put forth - it's best for the person or peoples who speak the language to come up with the encoding for it. I have an elementary-school knowledge of Chinese and already I am kinda miffed at why people have an objection to 喜 vs 囍.
Imagine how the people who have Bachelor's degrees in Chinese must feel.
In this case, the codepoints were added in part because the proposers could show many printed works (user manuals, I guess) that included sentences such as "to turn the foobar on, press the ■ button", which shows that the glyph between "the" and "button" is in some way like the surrounding glyphs. Chessmen were added for similar reasons, even though very few people actually read either user manuals or chess literature.
The difference between an icon and a letter is small and unclear. For example, & is a symbol but was once considered a letter. Chinese characters are words, etc.
Good point. Letters, punctuation, symbols... the lines are blurry. If I may, I'd say that & is a symbol that represents a grammatical connective. Which is a generic abstraction and won't cause an explosion like having symbols for every word out there.
We may think that we are enlightened beings but the fact is that pictures comprise a lot of how we communicate now and in the past. Are emojis that different from hieroglyphics?
Last I checked, Unicode don't actually have anything like coverage of the entirety of every script and alphabet. On the other hand, approving emoji and random icons delights Westerners.
You don't have to point to human history, though that's a good source of missing scripts. Waving off scripts actually in use as "increasingly obscure", while cheering Unicode throwing in any icon random geeks pitch to them, misses the purpose of Unicode.
You're tossing that assertion around without supporting it – what commonly used characters are not in Unicode? How many people use them? Are they not in Unicode because nobody cares or because there is a lack of someone authoritative helping codify the list or contentious disagreements about some aspects of that work?
Seriously, you're able to Google up those other links, but you somehow can't find the (non-exhaustive) Unsupported Scripts list or the Proposed New Scripts pages on the Unicode site? And, without knowing the situation for any of them, you're going to throw out excuses for why the absences don't matter?
These aren't characters, but entire scripts that are not part of the standard. Nor are major scripts like kanji complete.
Again, you're the one making the claim. Can you precisely state what you believe to be the problem and cite some sources that this is a major problem and that nobody is working on it?
More importantly, ask why it seems unreasonable that a small number of very widely-used ISO standard symbols were incorporated quickly? Wouldn't that be the most reasonable expectation since it lacks the political heat of e.g. Han unification and doesn't require any research or debate to establish that they are used, have a precise meaning, and are not covered by existing codepoints?
I've already stated my complaint: that getting gratuitous icons into Unicode is easier than actual scripts for human languages. Since you're having Google issues, I'll link you to a page I already mentioned, which it self links to other relevant pages http://unicode.org/standard/unsupported.html
I love the echoing nature of these counter-arguments, that a problem doesn't even exist unless it's "major" and "nobody is working on it". I wonder how many actual different human beings have responded to me in this thread...
You might find your conversations work better if you respond to what people are actually saying rather than repeating yourself or assuming that other people don't know how to use Google, particularly after they've already sent you links which comprehensively disprove your assertion by demonstrating how many new characters are being added and that emoji constitute less than 1% of the 7,500 new characters in Unicode 9.0.
Your original claim was that “Unicode don't actually have anything like coverage of the entirety of every script and alphabet” but you're arguing about things which affect something like 0.008% of people – not even their primary usage – and for which there is work in progress to support!
Nobody is saying that Unicode is complete, but like any other human effort there's a limited amount of time to work on things. At some point things which are used daily by billions of people are going to get prioritized over things which are used infrequently by thousands of people, and it's hard to argue that this is wrong even if you – like me – want to have 100% of human language represented in Unicode.
"You might find your conversations work better if you respond to what people are actually saying rather than repeating yourself"
You're going to seriously say that after your last few posts? Two posts into this exchange, you moved the goalposts, and you hammered that button repeatedly.
But at least you actually looked at the proof you repeatedly demanded, even though I had already mentioned the pages. You didn't bother reading much of it, or to note that it goes well beyond a couple of scripts on that page to other incomplete scripts and as-yet entirely unimplemented scripts. But you at least made that minimum effort.
And the limited time to work on these things is exactly the issue. There are scripts not yet in the standard and major language scripts that aren't complete - but we've got "pile of poo" and a slew of emoji. And now, we've got four power button icons that a handful of people demanded.
> You're going to seriously say that after your last few posts? Two posts into this exchange, you moved the goalposts, and you hammered that button repeatedly.
You started this conversation with “Unicode don't actually have anything like coverage of the entirety of every script and alphabet.” It's hardly moving the goalposts to question how complete Unicode has to be to qualify as “anything like” or how much weight usage should have.
> But at least you actually looked at the proof you repeatedly demanded, even though I had already mentioned the pages. You didn't bother reading much of it, or to note that it goes well beyond a couple of scripts on that page to other incomplete scripts and as-yet entirely unimplemented scripts. But you at least made that minimum effort.
Before you could call that proof, you have to clearly articulate the questions it could answer. Note that my first comment indicated a clear understanding of how Unicode works – the process is not in question here, only the thresholds you haven't articulated. All I've been trying to get you to state is precisely what your rules would be for coverage of human languages before we can add anything else, and how much usage should factor into that. There's also a much harder question of trying to come up with a rule that says why a pictograph, the Phaistos Disc symbols, etc. are valid for inclusion but a modern symbol used millions of times a day around the world to communicate is not.
While thinking about this, it's also worth remembering that despite your apparent belief that emoji are a Western novelty, the question was how to improve Unicode adoption in Japan and that required having an answer for the millions of people who were using systems which relied on non-standard encodings and by most accounts Japanese carriers were resistant to adopting Unicode without having a standard to replace those ad-hoc systems. I think that decision should have been handled differently (i.e. assigning an emoji plane) but it was driven by understandable technical reasons affecting large numbers of people on a daily basis. Since that decision was made, the additional cost to add a small number of non-controversial additions which do not require scholarly research or documentation does not seem excessive — we are, after all, talking about a small percentage of the new symbols in Unicode 9.0.
But why? The trend towards putting icons into Unicode may be a mistake. Unless it's a symbol one uses in a sentence, there's no real reason to have it in Unicode. Unicode should not be viewed as a standard clip art library.
I believe they are much more legitimate symbols to put into Unicode than emoji. They are standardized (IEEE 1621), they frequently appear in running text (of a particular kind, but widespread enough), and they have distinct semantics from existing characters and symbols. In terms of worthiness they are on par with math symbols, which is enough to justify inclusion.
I hear you. If you want pictures then use a markup language. Unfortunately it is too late now. We finally had an almost universally supported character set, and then we ruined it with levitating men in business suits. Recent Unicode versions introduce far more technical challenges than they solve. For instance, now that code points can come with colour, there are conflicting requirements between the requested text colour and the intrinsic colour of a symbol.
This is about icon fonts and font rendering and has nothing to do with Unicode. Code points don't have colour; there's no colour requirements anywhere in the spec.
Your fonts don't have to support the entirety of Unicode. That's why we have font stacks and fallbacks.
U+1F499 BLUE HEART
U+1F49A GREEN HEART
U+1F49B YELLOW HEART
U+1F49C PURPLE HEART
U+1F53D DOWN-POINTING SMALL RED TRIANGLE
U+1F536 LARGE ORANGE DIAMOND
U+1F537 LARGE BLUE DIAMOND
...
What is "plain text format" though? If 'text' isn't limited to Western ASCII characters (which it very obviously shouldn't be considering many people use other character sets), then the idea of a text standard should be to encode all the glyphs people use, so "plain text format" becomes a canonical list of all the communicative symbols in all languages. That's what Unicode aims to be.
In my opinion, if they're used for communication, it doesn't seem unreasonable that such a canon of characters should include universal iconographic symbols like the standby icon.
If I understand correctly, Unicode only provides the semantic meaning, not the actual rendering. The font provides information for how to render it. Am I right?
Yes, though they provide guidance on how they should look, and in most cases, there is only one font on the system that has a glyph for some of these more esoteric code points, and it usually provides a reasonable representation.
I already ranted about Unicode earlier today. My main argument is that Unicode is what happens when everybody qualified thinks: "That's a great idea, of course you have to handle X and Y and Z, and I just remembered that I forgot to fill out several warranty cards."
This blog post is a nice example: I have absolutely no idea what these new code points are supposed to look like, since I only spent an afternoon implementing the Unicode best practices from the Arch wiki instead of subscribing to some Unicode standard mailing list. (Except the one symbol which was redefined to a symbol that does not carry the semantic meaning of "standby symbol" anywhere outside of the Unicode standard.)
In my opinion there are two ways forward: either burn the entire thing, or force the Unicode committee to produce an authoritative and complete font, in triplicate, and in their own blood.
The Unicode tables include examples for all graphical code points: http://unicode.org/charts/. If you really wanted to, you could make them into a font (most of them seem to be vectorized), but since I'm guessing you see most of the added code points as useless, why do you care if they show up as boxes? What harm is this stuff causing, or going to cause, to the standard? We have hundreds of thousands of unassigned code points.
Meanwhile, a lot of the "Ys and Zs" added to Unicode have proved to be extremely useful. Unicode's math operator and letter-styling support is what made MathJax (and more generally MathML) possible. They've also helped big time when it comes to accessibility (e.g. screen readers) for mathematics on the internet. Should we have shunted that off to another standard and made the creators of screen readers completely restructure their offerings so they can deal with Unicode characters and "Mathicode" characters? Assuming anyone bothered to implement it, how would that be better than just adding a Unicode category and spending a meager amount of space?
Your answer illustrates my point perfectly. First, the lack of a reasonable fallback mechanism: of course I can start a hex editor, get the UTF-8 encoding, and then look up the code point (and theoretically add that character to an open font). A default font would just ship with every OS out there, and suddenly there would be a working fallback.
Second, mathematical symbols: consider the case where I get a text file consisting mostly of 7-bit ASCII and some mathematical symbols, which may render as mathematical symbols or as Chinese characters, since there is no way to specify the encoding in a text file and so I have to guess the encoding. (That is not helped by the roughly 17 standardized encodings that mostly agree with UTF-8.)
> Second, mathematical symbols: consider the case where I get a text file consisting mostly of 7-bit ASCII and some mathematical symbols, which may render as mathematical symbols or as Chinese characters, since there is no way to specify the encoding in a text file and so I have to guess the encoding.
What does that have to do with Unicode adding anything? Are you really claiming that if we threw out Unicode like you recommend, and (if I'm understanding your point correctly) chose an encoding for the new version that looks nothing like ASCII, the encoding mess would get better? I think continuing the migration of most transmission of text to UTF-8 and explicitly specifying encodings for everything that needs to stick with Latin-1, etc. is a better option, unless you propose codifying the new encoding in law to force adoption.
I checked your link, then I proceeded to the terms of use, and (IANAL) there they claim that all 'Unicode software' is basically MIT licensed. (Please check with a lawyer before you conclude from that wording that it is MIT licensed.)
Fonts

The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts. The particular fonts used in these charts were provided to the Unicode Consortium by a number of different font designers, who own the rights to the fonts. See http://www.unicode.org/charts/fonts.html for a list.

Terms of Use

You may freely use these code charts for personal or internal business uses only. You may not incorporate them either wholly or in part into any product or publication, or otherwise distribute them without express written permission from the Unicode Consortium. However, you may provide links to these charts.

The fonts and font data used in production of these code charts may NOT be extracted, or used in any other way in any product or publication, without permission or license granted by the typeface owner(s).

The Unicode Consortium is not liable for errors or omissions in this file or the standard itself. Information on characters added to the Unicode Standard since the publication of the most recent version of the Unicode Standard, as well as on characters currently being considered for addition to the Unicode Standard, can be found on the Unicode web site.
Now you can try to track down the actual typeface owners one by one (this alone seems hopeless) and even then I doubt that you can get permission from all of them.
But all mathematical symbols, except for styled math letters (which would have been equally well served by simply rendering them in italics or in a special math font), were already in the original 16-bit Unicode, as were any characters humans normally associate with "text" (all alphabets except for Egyptian hieroglyphics and other extinct scripts). How useful is it to standardize hieroglyphics, ancient Greek musical notation, and emojis as standard text characters, especially without standardizing their screen representation?
> How useful is it to standardize hieroglyphics, ancient Greek musical notation, and emojis as standard text characters, especially without standardizing their screen representation?
In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation. There is semantic content associated with each of these symbols that persists even if there is significant variation in how they are presented. Of the ones you list, emojis are the only ones where this is any more a problematic approach than it is for letters in various alphabets. And as people who don't approve of Unicode adding emojis like to point out, emojis aren't that critical so having some loss in the translation isn't a huge deal.
Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased. The alternative to Unicode not standardizing them was to have a repeat of the OEM font gold rush in the SMP.
> styled math letters (which would have been equally well served by simply rendering them in italics or in a special math font)
That was my first reaction as well, but there are a few problems with that approach:
* Math italic characters look very different from normal italics, and are shaped and kerned very differently because they are commonly used for single-letter variables which will be juxtaposed together in expressions. If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.
* Many of the math letters and "letter-like symbols" have associated semantic content (like bold for vectors), which it makes sense to preserve. MathML alleviates this to a significant degree but I don't believe these codepoints were intended only for MathML usage.
* On the technical side, OpenType math fonts need to carry associated metadata for many of these characters. Putting them in separate fonts complicates this, since these tables need to refer to glyphs (general codepoints are unsuitable in a number of cases) and each font file would have a different glyph address space.
> Remember that before emoji standardization various cell phone manufacturers (particularly in Japan if I remember correctly) started using codepoints for whatever they pleased.
Things were way worse than that: to add emoji to text, NTT DoCoMo used private-use codepoints, AU used embedded image tags and Softbank wrapped emoji codes in SI/SO escape sequences.
> In the same way it's useful to standardize letters in various alphabets without standardizing their screen representation.
I disagree. Say the name of a letter in any alphabet, and people will draw it in ways that are similar enough for automatic recognition. This is not true for pictograms and emojis.
> The alternative to Unicode not standardizing them was to have a repeat of the OEM font gold rush in the SMP.
I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.
> If your goal is to be able to preserve some math formulas in a purely line based text format, preserving this aspect makes a big difference in readability.
So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.
Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.
> I disagree. The alternative is a much simpler and faster standardization, of the kind I offered here: https://news.ycombinator.com/item?id=11958903 There is absolutely no need for a fixed codepoint for most of the non-BMP characters.
So you think instituting a system based on links not rotting would better preserve meaning? Not to mention that:
* Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.
* Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.
The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff. If the choice was "stick with UCS-2 and be totally fine language wise, or add more bits just for emojis and pictograms" I'd probably agree with you. If you think that's what happened, your timeline for this process is way off.
> So is rendering text in Arial vs. Comic Sans, but they haven't made separate codepoints for those.
Sure, but the different letter types carry crucial meaning in math formulas. "sup" in upright letters is the math operator supremum, "sup" in math italics is s * u * p. This kind of thing applies to every one of the mathematical letter variants.
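To see concretely that these are distinct characters rather than mere styling, here is a quick check with Python's unicodedata module (a minimal sketch, just printing the character names):

    import unicodedata

    # Upright s-u-p and mathematical-italic s-u-p are different
    # code points, so the semantic distinction survives in plain text:
    for ch in "sup" + "\U0001D460\U0001D462\U0001D45D":  # s u p, then 𝑠 𝑢 𝑝
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")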
> Also, where this makes a lot of difference, would count as "specialized usage". I don't think it makes sense to have a single universal standard to standardize all specialized usage of human-readable data.
You're zooming way out on this one. Math symbols have a lot more in common with letters and "normal" symbols than "all specialized usage of human-readable data". Remember that when Unicode added the math symbols, things like MathJax were simply impossible. Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.
> Every text renderer that doesn't support your codepoint now displays a full URL, instead of a box, making text using these emojis very difficult to read.
It can still display a box.
> Instead of making implementation easy by requiring nothing new of text shaping libraries, they now have to be able to both connect to the internet and tie into a file cache.
The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs. Their OS could provide this service for them just as it allows them the use of fonts.
> The supplementary space was already there when we got to emojis, and UTF-8 and UTF-16 already had to deal with SMP codepoints for some of the less common CJK characters. Not everything above 0xFFFF is "weird" non-human language stuff.
That is a good point, one of which I was not aware, but I still don't think it justifies standardization of Chinese characters, emojis, and ancient Greek musical notation by the same standards body.
> Sure, but the different letter types carry crucial meaning in math formulas.
The use of italics in text may also carry crucial meaning. But if a textual representation as "sup" is supported, I don't see why specialized rendering should be supported for math but not for plain text.
> Being able to write at least some formulas, which consist of letters and symbols, without losing tons of semantic information seems like exactly the kind of thing character encodings should do.
I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas in convenient form and a specialized renderer is required anyway, I don't see the reason for this extra effort.
If it knows about your new codepoint. Everyone using an implementation that doesn't yet support it is going to show the full URL. If history repeats itself, these implementations will be the majority for at least a decade.
> The question is implementation of what. I think that it is not an onerous requirement from applications that need to display emojis or ancient Egyptian hieroglyphs.
Except they need to do literally nothing different. Hieroglyphs are vectorized, and emojis are either bitmapped (which existing font formats already supported) or vectorized. None of this required a single line of code in any text shaping library to change. Text shaping libraries generally don't even need to understand Unicode categories or other metadata: font files already contain all the relevant information (directionality, combining mark, etc.).
> The use of italics in text may also carry crucial meaning. But if a textual representation as "sup" is supported, I don't see why specialized rendering should be supported for math but not for plain text.
If you scrub all bold and italics from text, do any of the words turn into different words? That's what happens with sup (or other named operators). Same thing for blackboard letters, fraktur, etc.
> I agree, but I think that that semantic information is preserved when writing N or NN instead of 𝑵 or ℕ. Considering that Unicode isn't enough to write most mathematical formulas in convenient form and a specialized renderer is required anyway, I don't see the reason for this extra effort.
My point is it was better than nothing, which was the alternative when Unicode added these. By adding mathematical characters that worked exactly the same as all other characters, we could get some of the advantages of MathML without needing everybody to implement a special math renderer or learn any new markup, just download new font files.
If getting everyone to accept MathML or something similar in an expedited fashion was a reasonable proposition, then I might agree that they should've kept it out of Unicode. Those were not the facts on the ground when this decision was made. Note that even now that we have MathML, the few browsers that support it (IIRC Firefox & Safari only) have complete shit implementations that look terrible.
The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards accepted and implemented is a Herculean effort. Unicode's expansion into these domains required nobody to do anything differently, let alone decide to up and write an implementation of a completely different standard. I think this is a case where the alternative was to let the perfect be the enemy of the good, or at least the working.
> That's what happens with sup (or other named operators).
And that's what happens when scrubbing subscripts or binomial coefficients. When you want to represent math as text, you need to change your representation (add multiplication signs, forgo subscripts, use confusing parentheses etc.). This is still true with Unicode. The contribution of the non-BMP special math characters is quite minimal.
> The crux of this is that making changes to a standard to support something new and propagating the changes is super hard; getting new standards accepted and implemented is a Herculean effort.
Sure, but the emoji craziness continues, and there's little sign it will ever stop. Instead of saying "this isn't text; if you want, call it 'special text', escape it, and let a different body standardize it", the body entrusted with standardizing text representation worries about how to represent a picture of two people kissing as text. What next? Kids will want to add tunes to their text messages. Will the Unicode Consortium add code points for MIDI? And maybe managers will want to standardize code points for organizational diagrams. Would that be the consortium's responsibility, too? The BMP contains all the characters for reasonable text-art.
Unicode has added 1791 emojis[1]. Note that it has used significantly fewer code points than this for emojis, because some (e.g. flags) are composed from a small set of shared code points (flags pair up 26 regional indicator letters that spell country codes). They've been slowing down the rate of emoji additions since they started doing this. Do you really think they'll accelerate at some point and use up the almost 1 million unassigned code points Unicode has left? Not to mention that many UTF-8/UTF-16 implementations are already fine with full 32-bit code points, instead of the 21 bits of Unicode I'm using above (they have said they'll only ever use 21 bits, but the option is there).
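As a concrete illustration of the flag mechanism, here is a minimal Python sketch (the `flag` helper is a hypothetical name, not anything from the standard):

    def flag(country_code: str) -> str:
        # Map each ASCII letter onto the Regional Indicator block
        # (U+1F1E6..U+1F1FF); two in a row render as a flag where
        # the font supports it.
        return "".join(chr(0x1F1E6 + ord(c) - ord("A"))
                       for c in country_code.upper())

    print(flag("JP"))  # 🇯🇵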
Suffice it to say this getting "out of control" and eating up the remaining space in our lifetimes, or our children's lifetimes would be pretty impressive. This means that the worst we have to fear is more boxes, assuming OSes don't keep up with their fallback fonts. Saying adding more symbols-designed-to-be-just-kind-of-placed-in-line-with-other-symbols-like-they-always-have-been is the first step towards MIDI and organization diagrams is like saying you're vegetarian because it's a slippery slope from eating meat to eating people.
Also note that unlike every other excess of the Unicode standard, these would require massive changes to the code that handles text. This means that if Unicode decided to do this, you wouldn't have to worry about negative effects because nobody would implement it.
I'm not afraid of running out of codepoints. I just think it is misguided that non-text is standardized as text. It just doesn't make sense. I don't know how people will communicate 50 years from now, but I think it's funny that even then, text strings would still need to support all those vegetables and hand gestures. Written text is pretty much eternal; emojis and pictograms? I doubt it. It doesn't make sense for them to get a similar treatment.
I think that the BMP -- the Basic Multilingual Plane, the first 65,536 code points of Unicode -- is pretty reasonable, and covers fairly well everything we may consider as text (all alphabets in current use plus mathematical symbols). Anything beyond that, from emojis and pictograms to ancient Greek musical notation, is pretty... weird.
I think it would have made much more sense to have something like image tags: a special codepoint would introduce a link to a URL containing a sequence of glyphs, followed by an index into that sequence. Those glyphs would be guaranteed not to change (in any meaningful way), and devices would be free to cache them. This way, anything that isn't real text, would standardize representation, too, instead of just a vague "meaning". Another standard could relate those glyphs to one another in some way, giving them standard semantics and means of translation (i.e. "Egyptian hieroglyphics"). This would also allow each of those (emojis or hieroglyphics) to evolve their standards independent of a single universal standard that means little.
That's nice, except the BMP doesn't encode all of Chinese, and it includes a number of weird control characters for compatibility with ASCII. Like Vertical Tab. Who uses Vertical Tab anymore?
The dream of a 16-bit Unicode washed up on the rocks of CJK scripts. It's dead and it isn't going to be revived. You can argue for a simpler standard, with fewer assigned codepoints, but the original BMP isn't it and was never going to be it.
I mostly agree. In my opinion, separating code points from encodings (without providing a replacement for .txt files) was the original sin of Unicode. That just adds a lot of complexity with very little gain. (Suppose we had standardized a move from ASCII straight to a 64-bit-per-character encoding: text files would have grown by a factor of 8 and nothing important would have happened.^1)
^1 I am pretty sure that some use cases actually profit a lot, but neither text file formats (since most text does not contain 2^64 different characters, zip would work nicely) nor networking (since most data on the internet is either video or torrents) seem to be among them. So those probably would not be big problem areas.
It's not bad, but it's complicated, as it requires an O(n) algorithm to jump to a specific character. Unicode should have been capped at 16 bits, and doubling text files in size would have been fine. An alternate representation like a simplified UTF-8 would have kept compatibility with old ASCII files.
Doubling text files is a waste for the most part, but what makes it tolerable is compression. Still, 16 bits would not be enough: it's only 65,536 different code points, less than half of what is currently in Unicode. 24 bits is sufficiently out of alignment with modern hardware and algorithms, so it's 32 bits for efficiency. That is now four times the size, and compression becomes a requirement.
In any case, UTF-8 has a place, and if you want easy manipulation and search, convert it to UTF-32 - it's fixed width.
> Still, 16 bits would not be enough: it's only 65,536 different code points, less than half of what is currently in Unicode.
But that's only because Unicode has ventured well beyond what we consider to be text. The BMP is enough to represent all text (including math).
The BMP is mostly filled with symbols from logographic languages, and currently has around 100 free code points. 16 bits simply isn't enough for the scope of capturing all written languages in history.
But I don't think all written languages in history should get the same treatment when it comes to standardized data representation, or should all be standardized by the same body. It's OK to have alphabets that no one but specialized researchers has used for thousands of years standardized separately from the Latin or Chinese scripts.
But if you don't, what would the point of a unified standard be? What should happen to glyphs that no one has used outside of research for 100 years? 1,000 years?
They should all be standardized under separate standards. You can call them "extended text", if you like. Mathematical notation isn't really supported by Unicode, either (just mathematical symbols), and that's fine. Math should have its own standard, and so should hieroglyphs, emojis, and musical notation.
What happens if you want to use two different extended-text code points in one blog post? How would they interact? How do they avoid assigning the same code point to different symbols?
How would browsers support this? What's the actual plan, not just a handwave? Do you think it'd be more efficient to have to support 6 different standards than one?
> What happens if you want to use two different extended text code points in one blog post?
There are no codepoints if it's not text. How do you use codepoints for embedding a video or a picture on your blog? You don't! But, if you want to treat something as if it were text, then I suggested doing something similar to an XML namespace: "the next segment is hieroglyphics, you can get their glyphs from here, and these are their indices...". That "extended text" is still not text, and it still doesn't use any Unicode codepoints, but it can work according to similar principles.
> Do you think it'd be more efficient to have to support 6 different standards than one?
Then why don't we let the Unicode Consortium take over standardizing video or audio? If something isn't text, why is it standardized by a text standardization body?
If you think of Unicode as a standard that enables the ease of exchange of textual data among parties, would that change your mind? People have been putting emojis alongside text (think SMS) long before Unicode put them into the standard at the request of Google and Apple.
People have been putting images alongside text since the invention of print. That still doesn't make the images text. So it's really great that for a few years now some people are embedding icons in their SMS messages, but I think that incorporating those fashionable 2-5-year-old icons into a standard that's mostly about standardizing hundreds-of-years-old text doesn't make much sense.
You can exchange text with embedded icons just as easily without requiring OS vendors to come up with their own versions of vaguely-defined pictures by... simply embedding pictures.
I can already see the people on whatever would be the HN 20 years from now complaining how "bloated" Unicode is, full of thousands of symbols that no one ever uses, and calling to replace the whole thing, costing the industry even more money to replace a standard yet again.
If compression would be a requirement for 32 bits, then it was definitely a requirement for 8-bit text 25 years ago when memory and bandwidth were typically 1/1000th of today. Of course it often wasn't, and isn't. And where it is, it's still a requirement with UTF-8.
> it requires an O(n) algorithm to jump to a specific character
If you are trying to index into a string by "character" you are almost certainly already doing it wrong. Meaningful indexing almost always has to be by grapheme cluster. See Swift's string API as a great example of this done right.
ASCII is only 7 bit so UTF-8 is fully compatible, at least when compared to ISO 2022 and other similar horrors from the same era. Are you thinking of other encodings?
Emojis and hieroglyphs aside, 16 bits was not enough for CJK characters. It's a real-world problem---a character set that can't spell people's names or locations can't be universally adopted.
How do you define "character"? You typically need to take into account grapheme clusters / composed character sequences, which make jump-to-character O(n) regardless of encoding.
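To make that concrete, here is a small sketch using the third-party regex module (pip install regex), whose \X pattern matches one grapheme cluster; the counts below are what I'd expect, though exact segmentation depends on the module's Unicode version:

    import regex  # third-party; the stdlib re module has no \X

    s = "e\u0301le\u0301phant"  # "éléphant" built with combining accents
    print(len(s))                        # 10 code points
    print(len(regex.findall(r"\X", s)))  # 8 user-perceived characters

Even in "fixed width" UTF-32, indexing by code point lands you in the middle of the first é.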
Well, if you want to know what they can look like, the blog post has images, an embedded webfont and links to the reference font for the new symbols. And AFAIK, providing freely usable reference images is required for all new symbol proposals.
It does not work in Firefox with NoScript. (And as a matter of fact, prohibiting random blog posts from rendering Unicode, executing complex numerical calculations on my graphics card or delivering exploits is kind of the purpose of NoScript...)
NoScript does not prevent pages from "rendering Unicode", whatever that's supposed to mean. I use it myself, and the only thing it does is selectively block Javascript.
Legitimate question: Why is Unicode littered with all those useless symbols?
I can see the reasoning behind the standard (or very common) symbols or things like emoji, but having every possible glyph in Unicode seems like a horrible waste.
What if we want to add new glyphs in the next 10 years for emerging standards?
having every possible glyph in Unicode seems like a horrible waste
A horrible waste of what? Unicode 9.0 encodes 128,172 characters, out of a possible 1,112,064 code points. The addressable space is about 11.5% full. Clearly there's enough left to keep adding more and more characters for a really long time.
If your complaint is that it's a waste of resources, time, etc - surely it's up to the people who are members of the consortium to decide how they want to spend their energy?
I believe it's an issue of time resources. I would argue that new emoji characters are among the less important uses of Unicode [1]. You are right that it is entirely up to the members of the Unicode consortium to manage their efforts themselves, but that doesn't mean we can't complain about it. I see a lot of these cases as bikeshedding.
People aren't fungible and there's no special set of people who have been specially blessed to decide what's important all-up. Each area of Unicode is handled by a different set of people who are experts in different areas.
If a number of people who use language "a" know that Unicode isn't handling their language, some of them need to step up and provide a solution. Part of that stepping up might be as easy as complaining about the problems they are running into :-), but eventually for a solution to emerge, some set of people need to step forward and handle the Unicode research and paperwork.
So? In this 11% it already covers almost all written languages in use and tons of dead ones. It also has lots of classic symbols from math, book ornaments, and standard typographic stuff (left arrow, etc.).
So all the basics are covered.
We could fill the remaining 89% with variations of the turd emoticon and we'd still be perfectly fine.
The original 16-bit encoding (UCS-2) can only represent 65,536 code points, which is less than half the number of Unicode code points assigned today. It was broken by the expansion beyond the BMP.
There's a newer, incompatible ("mostly compatible" may describe it better) UTF-16 encoding that represents all Unicode code points, but two formats under the same name is even more broken than one broken format.
UTF-32 will suffer the same fate as UTF-16 if Unicode ever expands beyond 21 bits. UTF-8, meanwhile, is capable of representing an absolutely huge number of codes, requiring only non-breaking extensions.
True, but I think that is also problematic.
The more symbols there are the less likely it is that every font covers all the symbols that someone might find important.
For instance, I often get e-mails with characters that can't be displayed because my standard fonts (on Linux) don't have them. The "missing Unicode symbol" box is the new "picture not found" icon[0]. In fact, it is even worse, as there is no alt attribute to tell me what I am supposed to see.
> the less likely it is that every font covers all the symbols that someone might find important
That is a meaningless requirement. The symbols I use on a daily basis already don't exist in a single font. Operating systems handle font fallback just fine.
to make sure that even the most idiosyncratic choice of characters can be displayed everywhere according to the intention of the person who picked them.
> The more symbols there are the less likely it is that every font covers all the symbols that someone might find important.
Why would every font need to cover them in the first place? One font with the symbolic characters is enough to display them. Text renderers can deal with that.
No font coverage is also fine, at least the data is preserved.
Which ones do you find useless? And why do you find them less useful than those, in my eyes, extremely useless emojis? I'd say power-on and power-off symbols are pretty useful things to have in a font.
At some point, someone will realize there is a need to standardize a fixed, practical subset of Unicode that contains all the essential symbols used around the world, so that all devices complying with the standard can __actually__ interchange text in readable, printable and visually presentable form.
It's nice to have a catalogue of symbols and a tight encoding for them, but full support of the Unicode encoding has very little to do with support for Unicode in an application.
Yes, basically. I think we want normalization and other simplifications and restrictions on processing. What I have in mind is a restricted standard subset bundle that makes it possible to send text that ends up looking right on every device, now and in the future.
Imagine that you are developing a wristband device and you can buy an ASIC or FPGA chip module that eats grapheme clusters and spits out the bitmap for the right glyph every time.
The only problem I see is that OSX/iOS, Windows, and Android don't ship with some universal, but shitty, font that has every single last glyph ever, always immediately updated to the new Unicode standard.
You mean 0x23E9 to 0x23FA, just before these new power symbols? I only noticed them because the unicode power symbol site has an image of what comes before their symbols.
Unicode symbols... it seems like we should've developed them the way languages develop: start with the most important symbols, ones for food, water, shelter, danger, etc., then expand them into the abstract mess they are today.
Emoji were not developed haphazardly. They evolved naturally in Japan, then were adopted by the rest of the world. That is why there are so many Asia / Japan themes in the standard emoji set. The problem is Westerners don't understand the Japanese emotion behind the symbol. The symbol for bookbag looks exactly like a Japanese school kid's backpack. It's why there is a kimono. Bamboo wind chimes. Tsunami. Shinkansen... I could go on and on.
In some respects, they are getting jumbled up because of international pressure for the base emoji set to be stretched into a be-all for the global market. An example is the taco. There are tacos in Japan, but they are hard to find, and when you do find one, you definitely don't want to eat it. Mexican food is one of the rare cuisines the Japanese don't do better.
I hope that ligatures will become more popular than using characters like "½", because such characters are very difficult to find in text using standard ASCII input, e.g. in Firefox by typing 1/2 into quick find (Ctrl+F).
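As an aside, normalization can bridge part of that search gap; a minimal Python sketch (note that NFKC yields U+2044 FRACTION SLASH rather than the ASCII '/', so a literal 1/2 search still misses it):

    import unicodedata

    # NFKC decomposes the precomposed vulgar fraction into
    # digit, U+2044 FRACTION SLASH, digit:
    print(unicodedata.normalize("NFKC", "½"))           # '1⁄2'
    print(unicodedata.normalize("NFKC", "½") == "1/2")  # False: not ASCII '/'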
I'm always weirded out by that, because it implies we should support the full gamut of math: superscripted/subscripted text, large fractions, the text above and below the sigma in discrete sums, etc.
For the same reason that they also include symbols for many obsolete (dead) languages and writing systems, as well as (per a comment above) a character used by the 1959 IBM 1401 computer (https://github.com/shirriff/groupmark). The need to be able to discuss a certain technology in writing does not disappear just because it is out of general use.
I was actually wondering about the electrical symbols for logic gates, such as AND, OR, NOR, XOR, NOT, etc. I would hope they are universally accepted by now; they would help when writing books or describing logic. A quick DuckDuckGo search revealed nothing...?
Electrical symbols in general don't do so well when scaled down to the size of text. Plus, it is very uncommon to encounter the electrical gate symbols inline with text - usually the symbols are sitting in a separate circuit diagram.
Nevertheless, Unicode does have all of the logical symbols from mathematics, which are pretty commonly understood: ∧ (AND), ∨ (OR), ¬ (NOT), ⊕ (XOR), ⊼ (NAND) and ⊽ (NOR).
I was wondering why they would have snowmen in the standard. And then it occurred to me that maybe, since the Unicode set has so much room for characters, they were planning to allow cross-language communication through emoticons.
Think about it: if you can represent anything human with emoticons, then you can communicate through emoticons alone! Maybe that's what the ancient Egyptians were hoping for?
Actually there is something like that in the internationalised DNS standards (that is, internet domain names that look like .xn--xyz-abc sometimes and .日本 other times.) There's a blacklist of certain Unicode characters that are disallowed in domain names because they resemble more commonly used characters. See https://en.wikipedia.org/wiki/IDN_homograph_attack
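For the curious, the xn-- form is Punycode, the ASCII-compatible encoding of an internationalised label; Python's built-in idna codec (which implements the older IDNA 2003 rules, if I remember right) can round-trip it:

    # Round-tripping a non-ASCII domain label through its ASCII form:
    print("bücher".encode("idna"))          # b'xn--bcher-kva'
    print(b"xn--bcher-kva".decode("idna"))  # 'bücher'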
You don't have to be able to read the word ON to recognize it as a symbol. "Circle next to zigzag-thing" is as good as circle with line sticking out of it.
There's also precedent. Continental Europe standardized on "STOP" on stop signs back in the 70's, even though no continental language has "stop" in it.
Excuse me? No continental European language has the word stop in it? You should probably learn more languages before making claims like that.
Stop is a word in Dutch (the first recognisable use of the word I could find dates back to 1287). And German has stopf. I couldn't find a date for that, because my German isn't good enough to read their etymology dictionary, but its source is Old High German, so it's safe to say that stop has been around in continental Europe in Germanic languages since medieval times at the very least.
Edit: A quick further Google search reveals that Norwegian and Danish (and thus most likely Swedish too) have the variation "stoppe", derived from Low German.
It seems a bit disingenuous to claim it was adopted despite "not being an existing word" when two languages use the exact rendering on the sign and four others use or have words so closely related that they share the spelling apart from one or two additional letters.
Further edit: French apparently adopted the word stop from English...in 1792
The German Wikipedia, at https://de.wikipedia.org/wiki/Stoppschild , gives an example of "stop" used in the Protectorate of Bohemia and Moravia (after the German occupation of Czechoslovakia) in 1939, though it says the sign was an imported variant.
> As in Dutch stoppen, the sense “to stop” is figurative from water flow being stopped by plugging. Only in this figurative meaning has the form been adopted into standard German proper, under the reinforcing influence of English to stop.
https://en.wiktionary.org/wiki/stop gives a list of continental European languages where 'stop' is part of the language. Nearly all borrow from the English. Not Dutch, however.
German stop signs used the word "HALT" before. My German dictionary defines "stopf" as darning yarn, and "stoppen" as stop. Not quite the same spelling. Typing "stop" into French Google translate autocompletes to "stoppé". I wouldn't be in the least surprised if the American spelling crept into many European languages as a result of the sign being ubiquitous for 41 years - it would be surprising if it didn't.
I wouldn't underestimate the influence on the language of the American occupation after the war, either, nor the global influence of American business since the war. English words have crept in everywhere.
Because it's a zero and the | is a one. Arabic numerals are more universal, and the convention of 0 for off and 1 for on was established precisely to avoid picking a language. Then the combined glyph for an on/off button was created, along with the similar broken-circle glyph for on/standby. Those have squarish proportions, so the corresponding 0 and 1 glyphs are needed to match those proportions. Hence the four symbols now existing in Unicode (well, 3½).
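For reference, these seem to be the code points in question; printing the names requires Unicode 9.0 tables (Python 3.6's unicodedata is new enough, if I recall correctly):

    import unicodedata

    # Three new power symbols plus the pre-existing HEAVY CIRCLE
    # pressed into service as "power off" -- hence "3½":
    for cp in (0x23FB, 0x23FC, 0x23FD, 0x2B58):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")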
> the convention of 0 for off and 1 for on was established precisely to avoid picking a language
I figured that politics was the reason. What is less supportable is working backwards from that to concoct a rational justification. People who understand digital electronic conventions are highly unlikely not to recognize "ON" and "OFF". Furthermore, anyone who knows neither would find "ON" just as easy to learn as "|".
We could also use "OHM" instead of Ω and "EUR" instead of €, but symbols provide more concise representations of meaning, especially in the case of a combined ON/OFF button. Much easier to fit the universal circle with 1 in it than "ON/OFF". The broken circle with 1 in it is way smaller than "ON/STANDBY". The other two symbols are then necessary to keep the proportions consistent across all four related glyphs.
That's the only argument I've seen for it that makes any sense. I'll counter by saying only "ON" is needed, not "ON/OFF", as the off part is implicit. Same goes for STANDBY.
This has been standard practice for a long time, I'm not just making things up on the fly. BTW, SBY is a standard abbreviation for STANDBY used by the military, if space is a problem. And SBY is a lot more google-able than an icon.
It's not obvious that the "0" is a zero, as opposed to the letter "O" (Cyrillic, Latin) or something entirely misleading, like this gesture: http://i.imgur.com/6KZ1nKG.jpg
I suspect the vertical slash | has as many lookalikes as the O, but I won't go there. I do want to mention that, as you demonstrated, the "1" that means on is more often represented as the Latin I or just a vertical stroke as in |, making it just as hard to identify as a numeral, especially if you don't know that the 1/0 are derived from binary logic.
I would argue that the words "ON" and "OFF" when seen as glyphs are much less ambiguous than I/O.
It can be treated as a glyph. Why would it be inappropriate?
And besides, languages the world over use plenty of words borrowed from English, and English itself is loaded with borrow words from other languages.
I've thought the mania for icons to replace common words silly ever since the Mac. Why is a picture of a Kleenex box more understandable than 'PRINT'? I have no idea what half the icons on my iPhone mean.
No way to google icons, either. I know, I'm supposed to learn them by pressing them to see what happens, but as someone who has learned not to learn how to operate machinery that way, I find it distasteful.
I'm still waiting for a Unicode codepoint for Love Symbol #2 (aka The Artist Formerly Known as Prince). There are codepoints for dead Chinese emperors, there should be one for Prince.
I'll add these (and the IBM-related symbols @kens mentioned, which are especially appropriate) to https://github.com/rbanffy/3270font for the next release (this weekend, I think - still lots of Cyrillic cleanup to do in the develop branch).
When do we stop handing out IPv4s by the hundreds of thousands? "When we run out". When do we stop doing 120km/h? "When we run out of road". When do we stop spending money? "When we're flat out broke".
(Note: I am not advocating for or against the unicode case, I'm making a point about the specific mindset used to justify this)
I phrased it rather crudely. I really meant "when there is no more to add". With the caveat that running out is the other option.
I think the question represents an equally terrible mindset, "when do we stop?" How can I answer that, honestly? Is there a number? Or is it when a certain date has come and gone? How do we pick that number? Or that date? Why do we need to stop? I understand why we might want to rate-limit the adoption of new glyphs, but I don't see why we'd ever draw a line in the sand and say, "Ok, now it's frozen." Imagine if that had happened before the creation of the Euro as a currency.
How would search work with SVG? What about support for screen readers or translation? Having standard symbols like this means that all of those are easily solved without needing something like an image classifier.
I've been saying for a while now that the proposed conlang block (for Klingon, Tolkien's Elvish scripts, et al.) was shut down before Unicode expanded into the Astral Plane, and it's past time to revisit that proposal and seek a good spot in the Astral Plane. Similar academic criteria to those used for encoding dead and historic languages could be applied to conlang proposals.
Maybe there should be a universal Unicode block for all the icons that apps might need for common functionality, and some of them should be animated to represent an on or off state. Also, top brands' square logos should be added to Unicode. And then different forks/variations could be submitted and accepted if they are useful and good-looking. Also they should all be black and white and have the same theme/look similar. U like my idea?? :DD
https://materialdesignicons.com/ makes a font available with a reasonably large selection of brand and functionality icons, using Unicode private-use ranges (also SVG or whatever).
I don't think I personally agree with brand logos in Unicode.
> Ask the people behind your Operating System and those who design your favourite fonts to start supporting Unicode 9!
Surely not every font has to create glyphs for every Unicode character...how does that work? Is there a kind of "fall-back" font for characters not implemented?
Systems have a way of finding fall-back fonts. At least on OS X (err, macOS) they will satisfy glyph requests in order from the font you specify, some built-in fonts, and then any fonts that have the required glyphs. Finally, a "last resort" font is used to fill in any remaining glyphs[1].
A result of this is that if you ask for a character such as "𓀴" (U+13034 EGYPTIAN HIEROGLYPH A044) in a monospace font, the symbol you get back can be variable width.
I have a font for that, but the character is unreadably small at a size that's perfectly fine for Latin characters.
Many other unicode symbols also suffer from this problem. E.g. ␀ is the printable version of the unprintable NUL (\0) control character, but it's so small at 13.3px / 10pt CSS font size that it's difficult to distinguish from the other control pictures.
I didn't know there's a whole ligature available as a single code point, U+FDFD; thanks sscotth. The code point is also easy to remember.
(It, accidentally, also most probably represents the words first said by the Orlando shooter when calling 911, when the translation in the transcription is compared.)
That looks like your browser is somehow misinterpreting the character set of the page. What you're seeing appears to be something like code page 437 (https://en.wikipedia.org/wiki/Code_page_437) rather than unicode.
What browser/os are you using? Is it possible you're going through a proxy that is altering the HTTP headers?
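For what it's worth, that failure mode is easy to reproduce; here is a minimal Python sketch of UTF-8 bytes being misread as code page 437 (the exact garbage depends on your terminal's glyphs):

    # U+23FB POWER SYMBOL is three bytes in UTF-8; decoded as cp437,
    # those bytes become unrelated glyphs -- classic mojibake:
    raw = "\u23fb".encode("utf-8")  # b'\xe2\x8f\xbb'
    print(raw.decode("cp437"))      # something like 'ΓÅ╗'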
The basic problem is that Unicode characters (which may consist of as many codepoints as you like) have varying width. For instance Chinese has characters that are displayed over a width of two normal monospace characters.
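The two-column behaviour is driven by the East Asian Width character property, which you can query directly; a quick Python example:

    import unicodedata

    # 'W' (Wide) characters take two terminal columns;
    # 'Na' (Narrow) characters take one:
    print(unicodedata.east_asian_width("中"))  # 'W'
    print(unicodedata.east_asian_width("a"))   # 'Na'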
On my Mac it gets displayed with "Noto Sans Egyptian Hieroglyphs", which appears to be included in the system (/Library/Application Support/Apple/Fonts/Language Support) but can be downloaded for other platforms from here: https://www.google.com/get/noto/
There are some, and I have tried to make one myself [1] to learn OpenType and scripts. It is hard and takes much time and effort to get a good one (the Noto fonts [2] are probably the closest to covering all of Unicode while looking great). I admit that it is still hard even without the looks-good requirement.
This isn't even possible for most font formats. There are over 100k assigned code points, but TTF and OTF both use unsigned 16-bit glyph indices, giving a maximum of 65,536 glyphs per font. Not all code points map to glyphs, but some glyphs are also not directly associated to code points (for example, some Unicode combining marks require separate glyphs in certain combinations).
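To illustrate, the glyph count is easy to inspect with the fontTools library (the font path below is a made-up example):

    from fontTools.ttLib import TTFont  # pip install fonttools

    # The 'maxp' table stores the glyph count as an unsigned 16-bit
    # value, which is where the 65,536-glyph ceiling comes from:
    font = TTFont("SomeFont.ttf")  # hypothetical font file
    print(font["maxp"].numGlyphs)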
It hasn't disappeared (Google Ngram viewer shows a consistent occurrence of "led" since 1800). It's just a common error due to (1) inconsistency with "read" (same spelling for present and past tense) and (2) the noun "lead" being pronounced "led".
I wonder if HN is going to complain about the frivolity and uselessness of these the way they incessantly complain about emoji every time a Unicode thread comes up?