Even more specifically, emoji paved the way for proper support of Unicode characters from beyond the Basic Multilingual Plane (BMP).
There are 17 of these planes (numbered 0 through 16). The first one, the BMP, holds the 65,536 characters you can encode with a single two-byte unit (as UCS-2 does, and as UTF-16 does for everything inside it), and it includes most of what anyone alive needs to encode their languages adequately. For a long while anything encoded beyond this block had only limited support, and plenty of bugs and limitations meant that using it was tricky (well, it worked fine in LaTeX of course, via xelatex for example). This was back in 2008/2009.
Characters encoded beyond the BMP, in planes 1 and 2, included things like esoteric CJKV additions (East Asian ideographs) not usually in daily use, but needed for historic documents.
Then came the emoji additions (a core set is part of the BMP and came from Japanese telecom standards), and support is now ubiquitous. Using UTF-8 is a no-brainer for most applications, and a good thing that is, too!
The planes beyond the Basic Multilingual Plane are usually referred to as the "astral planes", and they hold things like Gothic, runes, alchemical symbols, Egyptian hieroglyphs, and emoji: https://justine.storage.googleapis.com/astralplanes.txt
The etymology is that Dungeons and Dragons has a "Prime Material Plane" and an "Astral Plane", where the Astral Plane connects the PMP to various "Outer Planes" made of ridiculous, not-oft-encountered stuff.
But whoever came up with this cute analogy got it backwards: the higher Unicode planes are analogous to the "outer planes" themselves, while the "astral plane" would be some sort of glue allowing you to access those outer planes from within the BMP. Like... surrogate-pair characters! One could nickname the reserved surrogate range in the BMP the "astral projection" range ;)
"Astral plane" predates Dungeons and Dragons by centuries. Looking at old discussions, I couldn't find any evidence that Unicode's usage is connected with D&D.
The term "astral plane" is older than D&D, and I would assume they took it from the more general usage, not the specific usage in D&D. https://en.wikipedia.org/wiki/Astral_plane
I’ve met several members of the Unicode standards committee. They’re nerds: the kind of nerds for whom “Astral Plane” is a multilayered joke. It’s not not about the general usage, but nor is it not about the D&D term.
> Characters encoded beyond the BMP, in planes 1 and 2, included things like esoteric CJKV additions (East Asian ideographs) not usually in daily use, but needed for historic documents.
Unfortunately, this hasn't been true for a long time. The BMP turned out to be nowhere near enough even by Unicode 3.0 [1], where the initial set of Unicode emoji (722 characters) would barely have fit in the BMP's undesignated area. Many important characters, starting with larger sets like the CJKV extensions and eventually almost everything new by Unicode 6.0 [2], were allocated in the SMP and SIP instead. The HKSCS additions in the CJK Unified Ideographs Extension B block (U+20000..U+2A6DF) are a notable example.
CJK Unified Ideograph Extension B/C/D are all pretty exotic though. In normal daily use you won't encounter them, because input methods rarely offer them and people simply don't need them. These are important characters, but only a handful of them will ever be used by the average writer of Chinese or Japanese.
I was using some of these (from B and probably C) for very specific purposes at that time, and general support was a long way off in 2009 (although already good on GNU/Linux distributions).
To expand on this comment, UCS-2 defines a fixed-length, 2-byte encoding of Unicode. It can therefore only represent the first 65536 characters in the Basic Multilingual Plane (BMP).
UTF-16 allows representing characters outside of the BMP by using a reserved area to split a single codepoint into two surrogates that form a pair.
This makes UTF-16 complicated and in some ways worse than UTF-8: the encoding is longer for many typical texts, yet still not fixed-width. The bug you typically see is that code points outside the BMP get munged when clipping text to a certain length (or when reversing it, though that rarely happens in real systems).
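For the curious, here's a minimal sketch in Python (which exposes both encodings in its stdlib) of how a surrogate pair is built, and of the clipping bug just described:

    # U+1F600 (GRINNING FACE) lies outside the BMP, so UTF-16 must split
    # it across two reserved 10-bit ranges (the algorithm from RFC 2781).
    cp = 0x1F600
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)     # 0xD83D, the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # 0xDE00, the low surrogate
    assert (high, low) == (0xD83D, 0xDE00)

    # The classic clipping bug: cut the UTF-16 bytes after one code unit
    # and you're left with a lone, undecodable high surrogate.
    data = "\U0001F600".encode("utf-16-le")        # b'=\xd8\x00\xde'
    clipped = data[:2]                             # just the high surrogate
    print(clipped.decode("utf-16-le", "replace"))  # '\ufffd' (U+FFFD)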
The reason some older mobile phones struggle with SMS containing emoji, instead of just displaying tofu in place of unsupported characters, is that there's no way to send emoji in accordance with the SMS standard: it defines the encoding to be UCS-2. To put emoji in an SMS, newer phones send the message as UTF-16 instead, technically violating the standard, which can break parsers that expect only UCS-2.
UTF-16 is the worst of both worlds compared to UTF-8 and UTF-32. The only reason it exists (and is, unfortunately, prevalent) is that a number of popular technologies (Java, JavaScript, Windows) thought they were being smart by building their Unicode support on UCS-2, and now here we are.
Now, the issue of clipping or reversing strings is a problem not just because of encoding. It simply doesn't work even with UTF-32: you're going to end up cutting off combining characters, for example. Manipulating strings is very difficult, and software should never really try unless its authors know what they're doing, and even then you need a library to help you do it.
I said they thought they were smart. I'm not going to judge whether it actually was smart based on the situation then.
That said, UTF-8 was already four years old by the time Java came out. Surrogate pairs were added to Unicode in 1996 (Unicode 2.0), around the same time Java 1.0 was released.
I joined Sun Microsystems around that time, and Unicode really wasn't a thing in the Solaris world for a few more years, so the fact that people weren't aggressively pushing good Unicode support at the time is understandable. People just didn't have much experience with it.
Surrogates are technically a UTF-16-only thing, but sometimes they nevertheless escape out into the wild, so WTF-8 defines a superset of UTF-8 that encodes them:
To be clear, this is not an official Unicode spec. It's a hack (albeit a pretty natural and obvious one) to deal with systems that don't do Unicode quite right.
I recently came across some old code that narrows wchar_t to UCS-2 by zeroing out the high-order bytes. Even though my test was careful not to generate any surrogates in the input, they showed up in the output when a randomly generated code point like U+1DF7C was mangled into U+DF7C.
A corrupted value like that is not necessarily a great example of something you want to preserve, but it's the sort of thing that late 90s code assumed about Unicode.
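A toy reproduction of that narrowing bug, sketched in Python (the helper name is mine, not from the original code):

    def narrow_to_ucs2(cp: int) -> int:
        # Zero out everything above the low-order 16 bits, as the old
        # wchar_t-narrowing code effectively did.
        return cp & 0xFFFF

    mangled = narrow_to_ucs2(0x1DF7C)
    assert mangled == 0xDF7C                # a lone low surrogate
    assert 0xD800 <= mangled <= 0xDFFF      # invalid as a standalone scalar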
Specifically, filenames on Windows are not UTF-16 (or UCS-2) but rather WTF-16: like UTF-16, but with possibly unpaired surrogates. WTF-8 provides an 8-bit encoding for such filenames that matches UTF-8 wherever the original was valid UTF-16, while converting the rest in the most straightforward way possible, meaning you need less code to go from WTF-16 to WTF-8 than to go from UTF-16 to UTF-8 while rejecting invalid sequences.
It's invalid according to the spec. They are permanently reserved code points for use in UTF-16.
> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
They could be replaced by the replacement character to produce a valid string.
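Python happens to demonstrate both options: its "surrogatepass" error handler produces the same bytes WTF-8 uses for a lone surrogate, while strict decoding with "replace" sanitizes them to U+FFFD. A small sketch:

    lone = "\ud83d"                     # an unpaired high surrogate

    # Strict UTF-8 refuses it, as the spec quoted above requires:
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError:
        pass

    # WTF-8-style bytes, and the lossless round trip back:
    wtf8 = lone.encode("utf-8", "surrogatepass")    # b'\xed\xa0\xbd'
    assert wtf8.decode("utf-8", "surrogatepass") == lone

    # Or sanitize: strict decoding with "replace" yields U+FFFD instead.
    print(wtf8.decode("utf-8", "replace"))          # '\ufffd\ufffd\ufffd'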
Nitpick: UCS-2 actually isn't fixed-length either, e.g. "ẍ̊" (small x + umlaut + ring above) is two code units (1E8D 030A) or possibly three (0078 0308 030A).
UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point). Of course, to represent a grapheme cluster, more than one code point may be needed, but that's true of Unicode in general.
Yes, that was rather my point: if you're using a Unicode-based character encoding, you're going to have variable-width characters regardless, so you might as well use UTF-8.
> UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point).
Sure, but that's an implementation detail of the mapping from characters (at the application level) to bytes (at the physical(-ish) representation level).
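A short Python illustration of the point upthread: a fixed number of code units per code point still doesn't give you fixed-width user-perceived characters:

    import unicodedata

    s = "\u1E8D\u030A"            # x-with-diaeresis + combining ring above
    assert len(s) == 2            # two code points...

    decomposed = unicodedata.normalize("NFD", s)
    assert decomposed == "\u0078\u0308\u030A"
    assert len(decomposed) == 3   # ...or three, yet one visible character

    # The trailing code points are combining marks (nonzero combining
    # class); real grapheme segmentation follows UAX #29, e.g. via the
    # third-party `regex` module's \X.
    assert all(unicodedata.combining(c) > 0 for c in decomposed[1:])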
UTF-8 is simple enough to implement and yet I've seen it done improperly more than once.
The problem with UTF-8 is that the density is really good for North America and Western Europe but drops off quite a bit for other languages, and you have to trade CPU for bandwidth (e.g. gzip) to do much about it.
Japan has several more compact encodings. ISO-2022-JP is the one that uses escape characters to switch code pages (Shift JIS, the one everyone recalls, instead packs kanji into plain two-byte sequences without escapes). As long as you don't switch too rapidly between kanji and borrowed words, they're more compact than UTF-8, but more complex to implement (though less so, I'd say, than implementing gzip; and if you aren't using zlib, one of the most portable libraries in existence, you have much bigger issues than character encoding).
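You can watch the escape-sequence scheme in action from Python, which ships codecs for these legacy encodings:

    # ISO-2022-JP switches code pages with ESC sequences; ASCII runs cost
    # one byte each, kanji two, plus the escapes at each switch.
    jis = "abc日本abc".encode("iso2022_jp")
    print(jis)   # b'abc\x1b$BF|K\\\x1b(Babc'
                 # ESC $ B ... into JIS X 0208; ESC ( B ... back to ASCII

    # Shift JIS avoids the escapes entirely: two bytes per kanji, flat.
    print("日本".encode("shift_jis"))   # b'\x93\xfa\x96{'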
UTF-8 takes 3 bytes for most of the BMP. Only the first 2,048 code points fit into 2 bytes, which mostly covers European and Middle Eastern alphabets.
Outside of embedded software this really isn't that much of a problem any more.
Taking a random Wikipedia page as a sample, I get 46 kB (UTF-8) versus 35 kB (Shift JIS). A random Japanese text from Project Gutenberg is, in Shift JIS, roughly ⅔ the size of the UTF-8 text.
Those are impressive enough numbers, but add just a single photograph to the Wikipedia page and it doesn't matter at all. Text is just pretty efficient, even if you use an encoding that supports every language in the world.
First, that's because European languages have small alphabets. It's not like Chinese or Japanese with their many thousands of characters could have fit in those 2,048 spots anyways. So it makes sense to allocate the small common alphabets there.
Second, text is so comparatively tiny relative to photos, video, code, etc. that it really doesn't matter at all anyways.
Third, text is often zipped as well. It's often zipped over HTTP. It's zipped when it sits inside of an EPUB. It's zipped when it sits inside a Word document. You can even configure MySQL to zip text fields in a database. Basically, whenever space is an issue, you can fix it.
So it's hard to see how this is any problem in practice at all, when phones and computers mostly ship with 32 GB of SSD minimum.
The density drops off but it's still good density. It's not a problem. And the amount of CPU you need to do the bit shifts is negligible.
Shift JIS requires you to track extra context while you're actively using the text, and that's going to take extra space. I bet that in most of the situations where Shift JIS meaningfully wins out, you could get more benefit from a combination of UTF-8 and Zstd.
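That bet is easy to check at home. A rough sketch in Python, with zlib standing in for Zstd since it's in the stdlib (the repeated filler line makes compression unrealistically good, but the direction of the result holds for real prose too):

    import zlib

    # Opening line of "I Am a Cat", repeated to simulate a longer text.
    text = "吾輩は猫である。名前はまだ無い。" * 1000

    utf8 = text.encode("utf-8")
    sjis = text.encode("shift_jis")
    packed = zlib.compress(utf8)

    print(len(utf8), len(sjis), len(packed))
    # Shift JIS is ~2/3 the raw UTF-8 size, matching the Wikipedia
    # measurement above, but compressed UTF-8 is far smaller than either.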
An entertaining article, but it's not historically accurate. If you look at measured usage, UTF-8 took off around 2005 and was the dominant web encoding by 2008. Emojis weren't added to Unicode until 2010, at which point UTF-8 usage continued to increase at exactly the same rate as before.
Hmm, in MySQL land you have utf8, which means utf8mb3, and utf8mb4. Only the latter supports emoji. And it is only in recent releases (MySQL 8.0) that utf8mb4 became the default character set.
I work with WordPress a lot, and until about three years ago it was quite common for MySQL setups at shared hosting providers to only support utf8mb3. Emoji support really did help move things forward here.
It's probably down to browser/OS support and the increasing importance of user-generated content on the web. Windows only got Unicode support by default on the consumer side with Windows XP, which launched in 2001. Mac OS X was also released in 2001 and had Unicode support from the start (unlike Classic Mac OS, which had it bolted on with poor app support). Similarly, according to an old blog post I dug up[1], Internet Explorer only got good Unicode support with 5.5 in 2000, and Netscape never had good Unicode support. Firefox, which had Unicode support from the start, was released around 2002. By 2006, which is clearly an inflection point on that graph, browsers and OSes with poor Unicode support were highly obsolete, and global social media platforms like Myspace, Facebook, and Twitter, which needed Unicode to let users writing in different languages share posts, were in the ascendant.
Dealing with the hell of mixed ISO-8859-1/15 vs CP-1252 vs UTF-8 vs plain ASCII was enough to make me an early embracer of UTF-8-everywhere, despite mostly having to deal with English-language sources. "Someone copy/pasted from Word and broke the database again" are words you never want to hear.
UTF-8 was the sane refactoring after the initial incompatible parallel standards were established.
The ascendency of the CJK market, followed by Google Chrome, paved the way for UTF-8 everywhere.
The more interesting thing is why basically no one uses Eastern ideograms in the West, except maybe the Korean ideogram for crying (ㅠㅠ) and rarely, other kaomoji-like stuff. Some kanji also tell visual stories, and most children learn them just fine, so it’s not as simple as accessibility. Borrowing kanji was also anticipated by many sci fi writers and yet is not to be.
I'll bite: what has Chrome to do with UTF-8? As far as I can find, the last browser to struggle with it was IE5, and IE6 was released almost a decade before Chrome was a thing.
I’d quibble with that characterisation. JavaScript uses neither UCS-2 nor UTF-16. Rather, its strings are sequences of 16-bit code units, with surrogates handled according to UTF-16 rules, except that unmatched surrogates are permitted to remain (that is, JavaScript strings aren’t necessarily valid Unicode).
There is no requirement whatsoever that the browser actually use 16-bit code units to represent strings. This is what Simon Sapin achieved for Servo with the WTF-8 encoding, which extends UTF-8 to allow surrogates, including unmatched surrogates: it can represent these strings with 8-bit code units, commonly halving memory usage and improving the speed of various sorts of linear processing (though at the cost of random access, which becomes O(n)).
First of all, they ARE getting traction. Many YouTubers and Twitch streamers have started to use them in stream/video titles. I hadn't seen corner brackets at all ten years ago, and these days I see them in use at least once a week.
Some programming languages are starting to adopt them, too. Raku is the one I know of (it allows French and German quotes as well). Maybe Julia, too? I think some language communities tend to be more open to widespread Unicode usage in source code than others.
Those are called corner brackets. It took me a while to find out that I have to have a CJK font installed on my computer to use them. And the file sizes of CJK font families are huge! More than 100 MB.
Oh, GNU Unifont is new to me, thanks for sharing. I used another source for CJK (the one that is 100 MB) to ensure I have every single possible uncommon character/glyph installed without chasing more fonts: one source that has it all in one file. I went that route because other sources don't always have a full set.
GNU Unifont is a bitmap font with a tiny resolution. Nice to have as a last level fallback, but too ugly for any characters you actually see more than once in a fortnight.
Huh, I didn't know about those. I've been using euro-style « and » though, to be able to copy-paste things that already include " and ', and still delimit what I am quoting.
Guillemets are used in many languages like «this», but 'euro-style' is a bit of a misnomer. They are used all over the world, and in many European languages different pairs are used, such as guillemets the »other« way around, and „this” matching pair.
I believe it's „this“, actually (U+201E to start, U+201C to end), but the distinction between all those quotation marks is hellishly difficult, and I bet native speakers get it wrong all the time.
Once upon a time, I wanted to rely on these distinctions in a TTS frontend to distinguish between 5" floppy disk and "Mambo No. 5"
I soon realized that people use quotation marks and dashes in such a random manner that insisting on treating the semantics literally would create more confusion than it would resolve.
Edit: Now that my comment has posted, I realized you used the correct characters, but the font on my browser rendered them in a seemingly incorrect way, so they looked like double primes.
I take great pride in always typing things like 5¼″ and “Mambo № 5” with the exact Unicode scalars that I desire. I love my Compose key. (I have Compose+'+` produce ′ (prime), Compose+"+` produce ″ (double prime), Compose+:+: produce “, Compose+;+; produce ‘, Compose+"+" produce ”, Compose+'+' produce ’, Compose+N+o produce №, and Compose+1+4 produce ¼.)
> Out of curiosity, which Korean ideograph would that be?
Yu. Having two of them looks like a crying face. Although tha (ಥ) is also a common component of crying faces (ಥ_ಥ). They're talking about kaomoji, which use various non-Latin or fullwidth symbols (though you're right that they're largely not ideograms) to compose pretty extensive "smileys", e.g. the look of disapproval uses Kannada, denko uses Greek and katakana, ...
The Gboard keyboard on android has a tab for many of these common “emoticon” faces / character sequences. If you open the emoji picker on the keyboard and then tap the far right bottom tab icon “:-)”
They can get very elaborate though, these are just very basic common faces.
> The Gboard keyboard on android has a tab for many of these common “emoticon” faces / character sequences. If you open the emoji picker on the keyboard and then tap the far right bottom tab icon “:-)”
iOS also has that on the standard Japanese "Kana" keyboard (and possibly others), under the "^_^" key.
> basically no one uses Eastern ideograms in the West
That’s false. There are a lot of students of CJK languages in European universities, with Japanese and more recently Korean being popular choices. Chinese and Japanese can be learned in some high schools, too. There are publishers putting out books containing such characters (e.g. You Feng in Paris). And of course the diaspora and heritage learners... That's a lot of people, even more in countries with significant East Asian populations like Australia or the US.
It's accessibility in the sense that computer input methods popular in the West can't generate them. As far as I know, there's no way to get my computer or phone keyboards to produce 水 without switching to one of the CJK input modes.
On Mac, at least, the "Emoji keyboard" accessible through Cmd+Ctrl+Space in all standard text controls makes it possible to add basically any character in Unicode if you know its name. For example, you can type "water" to get ⽔ (along with other characters, like the water droplet emojis). I use this often to type the Greek beta symbol, for example.
...and then when you look it up that way, if you note that it is U+2F54, and if you have your input source set to "Unicode Hex Input", you can enter ⽔ by holding down the option key (⌥) and typing 2f54.
As far as I know this only works for characters in the BMP, but for such characters it may be faster than picking them from that cmd-ctrl-space popup.
If you don't have that input source available, you can add it via the "Input Sources" tab of Keyboard preferences. There you can also enable showing the input menu in the menu bar giving you an easy way to switch between "Unicode Hex Input" and your normal input source.
If this was ever a conscious decision, it was genuinely brilliant. Much better to have a carrot rather than the stick of "you'll get hacked (except probably not)"
As a millennial whose only chatting activities are confined to IRC and who doesn’t really get emojis: could someone please articulate this phenomenon in terms that would be meaningful to me?
I’ve seen sentences sometimes where words are actually replaced with emojis... Is this how some subset of people actually communicates online, or is that just for some ironic effect?
It can be a way of adding intonation and other affective, out-of-band communication back into text. I never really see people use them purely to replace words one-for-one. But interspersed throughout a message, emojis can, as you say, add irony, but also communicate that a message that could be taken as ironic isn't, or communicate some other subtext.
Text is a flattening of speech, and emojis can add some of those missing dimensions back -- and, like our IRL verbal cues, tics, and gestures, they can be hard to decode if you're not "in" on the game.
It’s an ingroup signal, indicating (by “correctly” using emoji) that you’re part of a specific subculture to the recipient. “praying_emoji flame_emoji YASS” indicates to me that the writer is young and hip. “Do you want to have a BBQ bbq_emoji?” says to me that the writer is older and less hip.
It can be hard to communicate tone through writing. Emoji allow one to instantly mark a piece of writing as informal/non-serious with minimal effort. This includes irony: “eggplant_emoji” is often a non-serious reply indicating “I am jokingly acting like this is sexual or attractive”.
It’s a proxy for longer writing. “Thumbsup_emoji” is a substitute for some marginally harder to articulate feelings of “looks good / I like it”
Of course, there are many subcultures that use emoji in different ways and as proxies for other things as well. At a previous employer we’d often just send “taco_emoji?” to ask who was buying lunch. It’s the sort of thing that can be used/abused in many different ways.
I really strongly recommend 'Because Internet: Understanding the New Rules of Language' by Gretchen McCulloch. It's a lighthearted linguistic look at how our written communication has changed over the last couple of decades since the advent of mainstream internet access.
Ironic effect is, of course, communication all by itself.
My use of emoji in messaging applications is primarily limited to quick rebus-like reaction replies to meme images, a quick and dirty reaction to a message in slack expressing some vague emotional response, or making complicated fart-jokes with my partner that rely on a lot of out of band information.
(i also use IRC on the daily, and have been known to use emoji there too, so these communication forms are not disjoint)
On my Mac, I spent an evening figuring out how Apple's emoji picker works and backporting all the emojis instead. But I'm not entirely sure what this says about me.
Emoji are just fonts, and with some search engine sleuthing, you can find an OG pistol emoji out there, pop open your system emoji font in an editor, and replace the water pistol with a pistol pistol.
Everyone else will still see the Nerfed version, of course.
:burrito: and :taco: are not emoji as such, though; they are more akin to good old "smileys" that turn character sequences into images. Supporting emoji would be to support UTS #51 https://www.unicode.org/reports/tr51/ natively.
In the Russian segment of the internet, the various Cyrillic encodings (win, dos, mac, koi-8) were a huge problem, and only UTF-8 finally solved it, long before emoji became a thing.
It is known that developers from English-speaking countries are generally oblivious to encoding problems. They could probably get by on ASCII far longer than the rest of the world, so it's no wonder they might confuse cause and effect in this case.
Can confirm, at least anecdotally. As someone who runs a website but doesn't have nearly enough time to do everything he wants to do, upgrading to UTF-8 (and specifically utf8mb4) was never a priority - until my users started using emojis and breaking things left, right and center.
Emoji probably contributed to widespread support of supplemental planes (fixing systems which treated UTF-16 as UCS-2), but I doubt they contributed much to UTF-8's popularity.
Or treated UTF-8 as the nonsense that MySQL's utf8 is (it's 3-byte UTF-8, a.k.a. only the BMP, and it silently drops everything from the first non-BMP code point onward).
Indeed. To ensure MySQL stores "real" UTF-8 one has to use "utf8mb4" instead of "utf8", which just rolls off the tongue. Backwards (in)compatibility seems to be the reason MySQL can't just DWIM the old name... "utf8mb4 or bust" it is, then!
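The underlying arithmetic is easy to see. A quick Python check of how many UTF-8 bytes different characters need (utf8mb3 columns stop at three):

    for ch in "é", "中", "😀":
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} bytes")
    # U+00E9 -> 2 bytes
    # U+4E2D -> 3 bytes
    # U+1F600 -> 4 bytes   (the ones utf8mb3 silently mangles)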
Information density matters, and I can't wait for someone to replace the "old" color coding of files (.dircolors in bash) with a single emoji: a music note for music files, etc.
I'm glad about that. I find it weird reading through comment threads or forums and seeing mobile phone emojis scattered through. I find them distracting.
I'm not really too sure why. I don't mind them in personal messages or texts and stuff, but seeing them on public pages just kind of annoys me for some reason.
I'm not, because it's completely arbitrary about it: e.g. you can include 🁓, 🀕, ⏱, 🃅, box drawing, or Z̸̠̽a̷͍̟̱͔͛͘̚ĺ̸͎̌̄̌g̷͓͈̗̓͌̏̉o̴̢̺̹̕, but not trigrams, die faces, box elements, musical notes, or flags. They just whitelisted/blacklisted entire blocks and called it a day.
Which is obviously par for the course when it comes to HN's comment box; the markup system is even more half-assed.
Sounds like they blacklisted things likely to clutter up the comment threads and left things unlikely to be used.
Country flags seem like they could be used for political trolling.
Die faces could lead to weird rolling threads or other things.
Musical notes, you got me, can't really think of anything too bad for those.
The markup's not great, but too much formatting is distracting. I personally prefer the limited options. You focus more on the content of your comment than making it look pretty.
The only thing I really despise about HN's formatting is the code blocks, or whatever they are: the ones that vanish off the side on mobile so you have to scroll horizontally to read everything. I really can't stand when people use those for quotes.
Other than that though, hn's formatting makes everything uniform and fairly easy to read through. There's no fancy nonsense getting in the way of things.
Actually, that's part of why those code block things piss me off: they're probably the fanciest piece of formatting you can do, and all they do is obstruct information and make me waste time while reading.
> Sounds like they blacklisted things likely to clutter up the comment threads and left things unlikely to be used.
That’s not really believable given how arbitrary it is.
> Die faces could lead to weird rolling threads or other things.
As if tiles or playing cards could not be used that way.
> The markup's not great, but too much formatting is distracting.
The problem is that despite having only two directives, half of HN’s markup is actively detrimental: because there is no escaping, no inline literals, and the parsing is sub-par, in my experience the “emphasis” directive causes issues more often than it helps. HN’s markup would be significantly improved by removing it entirely.
> I really can't stand when people use those for quotes
Which would be way less likely if HN actually supported quotes.
>> I really can't stand when people use those for quotes
>Which would be way less likely if HN actually supported quotes.
But look how well this works ;p.
Sorry...couldn't resist.
I dunno, I like the 'hackish' nature of it.
You're right, I'm sure the tiles or playing cards could be used like that too; it may be arbitrary, I don't know. Those were just some reasons off the top of my head. I'm sure a bit more thought went into it when HN was being programmed, or maybe not, who knows?
My main point is, I like the simplicity of it all, sure it could be better, but better doesn't necessarily lead to better quality content.
There's a minimal amount of distraction, most users find reasonable ways to communicate the context of their posts, and scrolling through most threads is a mostly uniform experience: as long as users follow a few established conventions, you can follow the flow of things pretty well.
It's not perfect, it's not the best, but I feel like it fits the general vibe and nature of the site. It gives HN an identity among all the other news aggregators and forums.
I wish they would add U+2009 (thin space) to that list. That's the standard way under the SI system to separate digit groups, e.g., 1 234 567. HN just treats it as a regular space.
(The SI standard for separating the integer part from the fractional part is to use "." or ",", whichever is customary in your location. Using thin space for grouping removes the ambiguity that you get in places that use one of "."/"," for grouping and the other for a decimal point).
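If you ever want to emit that style yourself, one hack (a Python sketch; the function name is mine, and locale-aware formatting can do better) is to format with comma grouping and swap in the thin space:

    THIN_SPACE = "\u2009"

    def si_group(n: int) -> str:
        # Format with comma grouping, then swap in U+2009 THIN SPACE.
        return f"{n:,}".replace(",", THIN_SPACE)

    print(si_group(1234567))   # 1 234 567, with thin spaces between groups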
I wonder if that's putting things through Unicode "canonical normalization", or just custom rules.
Let's see what it does with `U+00BC Vulgar Fraction One Quarter Unicode Character`... ¼
Nope, it allows it instead of turning it into `1/4`, so HN isn't doing compatibility normalization (NFKC), which is what would rewrite it; canonical normalization wouldn't touch ¼ in the first place. I guess it's custom rules? Or some other Unicode transformation we're not thinking of, or another third-party reusable transformation.
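The distinction is easy to check with Python's unicodedata:

    import unicodedata

    quarter = "\u00BC"   # VULGAR FRACTION ONE QUARTER

    # Canonical normalization leaves it untouched:
    assert unicodedata.normalize("NFC", quarter) == quarter

    # Compatibility normalization is what rewrites it:
    assert unicodedata.normalize("NFKC", quarter) == "1" + "\u2044" + "4"
    # i.e. '1', FRACTION SLASH, '4'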
True, there's almost certainly a whitespace normalisation pass at some point as well, likely during / around the processing of what little markup HN has.
It's interesting to watch the evolution of written language in action. I expect in 20 years we will routinely see emojis in written English novels and news articles. In 50 years we'll see them in textbooks and scientific journal articles.
Yet another reason I'm glad for the inevitability of death.
I can't be the only one who thinks emoji are a terrible idea. Granted, I also don't think logographic characters are a good idea but at least they have thousands of years of use and agreed upon semantic meaning behind them.
If everything you care to talk about can be easily described using thousand-year-old ideas, then I can see why you are against emojis. But this isn't true for many things people want to talk about today.
Language is just an encoding for ideas, and emojis are a new compression algorithm. Using a single character, you can now convey certain thoughts and sentiments that you previously needed many more characters to reference or explain.
"So then just explain it!" some might respond. "Why can't people be bothered to spend even a little time to write down what they think?" It's an accessibility issue. People have limited time every day to get their ideas across, and they deserve ways of conveying their ideas concisely. There is precedent for this too -- this is why we have acronyms and new words. "lol" and "minivan" don't have thousands of years of agreed-upon semantic meaning behind them.
A final thought: whether you think emojis are a terrible idea might not be relevant to whether they should exist. Letting people live their own lives to the fullest is much more important than making sure you, I, a future historian, or any other third party understands what they are saying. But you don't have to worry about not being able to understand conversations. Given what you prefer, if someone wants to address you as a target audience, they probably won't use emojis.
What an overreaction. You'll find the proliferation of emojis distributed appropriately according to the genre of writing (among good writing of the genre). For example, you'll still hardly see emoji in newswriting where they don't have much to add to the semantic content, but you already see it liberally used in places where stickers and drawings are already expected: e.g. in edited Instagram photos.
> In 50 years we'll see them in [..] scientific journal articles.
I highly doubt this. Common shortcuts (such as "it's" for "it is") and dialect are still absent from serious journals (unless they're the research target, of course), and I'm rather sure emojis will similarly be disregarded.
"And it's amusing to see Apple using new emojis as a carrot to get people to install the latest security patches."
OMG, that makes so much sense. I was the opposite, grumbling about not caring about that silliness, not realizing the psychology of things.
Having suffered through the dark ages with Microsoft increasingly ruining the world with an endless stream of proprietary crap (I still hate them for making people think tab width is configurable), it's amazing to step back and witness how much things have improved (on this narrow slice).
Emojis are what got me to learn about Unicode, personally. We had a big, bloated library handling emoji stuff for us in our product [0]. Then one day, someone said they wanted the comment threads to behave more like Slack: if only an emoji was typed in a comment (without any additional text), it should appear big, otherwise small. This required detecting when text was only an emoji, which required me to learn how the heck this stuff actually works. Great experience, and I'm forever indebted to emojis.
0: https://kitemaker.co, an awesome new issue tracker with tons of hotkeys (and emojis)
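For anyone wondering what that check involves: there's no single "is emoji" bit, so you end up testing code points against ranges from the emoji data files. A crude stdlib-only Python approximation (the names are mine, the ranges deliberately incomplete; real detection should follow UTS #51):

    EMOJI_RANGES = (
        (0x1F300, 0x1FAFF),   # pictographs through Symbols Extended-A
        (0x2600, 0x27BF),     # Miscellaneous Symbols, Dingbats
        (0x1F1E6, 0x1F1FF),   # regional indicators (flag pairs)
        (0xFE0E, 0xFE0F),     # variation selectors
        (0x200D, 0x200D),     # zero-width joiner for ZWJ sequences
    )

    def is_emoji_only(text: str) -> bool:
        text = text.strip()
        return bool(text) and all(
            any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES)
            for ch in text
        )

    assert is_emoji_only("😀")
    assert is_emoji_only("👨\u200d👩\u200d👧")   # ZWJ family sequence
    assert not is_emoji_only("nice 😀")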
What I don't like about this approach is that everybody potentially sees different versions of the emoji.
(And by the way, HN stripped out the emoji I put in this comment; perhaps that is for the better, but it's kind of funny that the software we use here is quite opposite to the software we make in our day jobs)
Do e-mail clients like Thunderbird and Outlook already use UTF-8 by default? The last time I checked, they used ANSI/ISO code pages for new e-mails. I mean the desktop MS Office Outlook app primarily, but I'm curious about the web apps as well.
It didn't seem like it was emoji paving the path so much as web-based email and the internationalization of websites. The whole move to the web in key areas like email meant encoding across countries and platforms became less of a hair-pulling nightmare for developers to deal with. Throw in the dawn of smartphones (and emoji did come along with that, yes) and people moving between desktop/mobile/web, and that was more problems on top. UTF-8 took care of a lot of the headache.
Pretty sure the dominance of ASCII on the internet and the efficiency/compatibility of UTF-8 in relation to ASCII paved the way for UTF-8 everywhere. It is the standard Unicode encoding of the internet.
If anything, I would say UTF-8 paved the way for emoji, not the other way around, as the ubiquity of a Unicode encoding allowed emoji to exist. You can't encode emoji with ASCII. You have to have Unicode and an encoding of it before you can have emoji.