Even more specifically, emoji paved the way for proper support of Unicode characters from beyond the Basic Multilingual Plane (BMP).
There are 17 of these planes (numbered 0 through 16). The first one, the BMP, holds the 65,536 characters you can encode with a single two-byte unit (as UCS-2 does, and as UTF-16 does for everything inside it), and it includes most of what anyone alive needs to encode their languages adequately. For a long while anything encoded beyond this block had only limited support, and plenty of bugs and limitations meant that using it was tricky (well, it worked fine in LaTeX of course, via xelatex for example). This was back in 2008/2009.
Characters encoded beyond the BMP, in planes 1 and 2, included things like esoteric CJKV additions (East Asian ideographs) not usually in daily use, but needed for historic documents.
Then came the emoji additions (a core set is part of the BMP and came from Japanese telecom standards), and support is now ubiquitous. Using UTF-8 is a no-brainer for most applications, and a good thing that is, too!
The planes beyond the Basic Multilingual Plane are usually referred to as the "astral planes", and they hold things like Gothic, runes, alchemical symbols, Egyptian hieroglyphs, and emoji: https://justine.storage.googleapis.com/astralplanes.txt
The etymology is that Dungeons and Dragons has a "Prime Material Plane" and an "Astral Plane", where the Astral Plane connects the PMP to various "Outer Planes" made of ridiculous, not-oft-encountered stuff.
But whoever came up with this cute analogy got it backwards: the higher Unicode planes are analogous to the "outer planes" themselves, while the "astral plane" would be some sort of glue allowing you to access those outer planes from within the BMP. Like... surrogate-pair characters! One could nickname the reserved surrogate range in the BMP the "astral projection" range ;)
"Astral plane" predates Dungeons and Dragons by centuries. Looking at old discussions, I couldn't find any evidence that Unicode's usage is connected with D&D.
The term "astral plane" is older than D&D, and I would assume they took it from the more general usage, not the specific usage in D&D. https://en.wikipedia.org/wiki/Astral_plane
I’ve met several members of the Unicode standards committee. They’re nerds: the kind of nerds for whom “Astral Plane” is a multilayered joke. It’s not not about the general usage, but nor is it not about the D&D term.
> Characters encoded beyond the BMP, in planes 1 and 2, included things like esoteric CJKV additions (East Asian ideographs) not usually in daily use, but needed for historic documents.
Unfortunately, this hasn't been true for a long time. The BMP turned out to be nowhere near enough even by Unicode 3.0 [1], where the initial set of Unicode emoji (722 characters) would barely have fit in the BMP's undesignated area. Many important characters, starting with larger sets like the CJKV extensions and eventually almost everything new by Unicode 6.0 [2], were allocated in the SMP and SIP instead. The HKSCS additions in the CJK Unified Ideographs Extension B block (U+20000..U+2A6DF) are a notable example.
CJK Unified Ideograph Extension B/C/D are all pretty exotic though. In normal daily use you won't encounter them, because input methods rarely offer them and people simply don't need them. These are important characters, but only a handful of them will ever be used by the average writer of Chinese or Japanese.
I was using some of these (from B and probably C) for very specific purposes at that time, and general support was a long way off in 2009 (although already good on GNU/Linux distributions).
To expand on this comment, UCS-2 defines a fixed-length, 2-byte encoding of Unicode. It can therefore only represent the first 65536 characters in the Basic Multilingual Plane (BMP).
UTF-16 allows representing characters outside of the BMP by using a reserved area to split a single codepoint into two surrogates that form a pair.
This makes UTF-16 complicated and in some ways worse than UTF-8: the encoding is longer for many typical texts, yet still not fixed-width. The bug you typically see is that code points outside the BMP get munged when clipping text to a certain length (or when reversing it, though that rarely happens in real systems).
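For the curious, here's a minimal sketch in Python (which exposes both encodings in its stdlib) of how a surrogate pair is built, and of the clipping bug just described:

    # U+1F600 (GRINNING FACE) lies outside the BMP, so UTF-16 must split
    # it across two reserved 10-bit ranges (the algorithm from RFC 2781).
    cp = 0x1F600
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)     # 0xD83D, the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # 0xDE00, the low surrogate
    assert (high, low) == (0xD83D, 0xDE00)

    # The classic clipping bug: cut the UTF-16 bytes after one code unit
    # and you're left with a lone, undecodable high surrogate.
    data = "\U0001F600".encode("utf-16-le")        # b'=\xd8\x00\xde'
    clipped = data[:2]                             # just the high surrogate
    print(clipped.decode("utf-16-le", "replace"))  # '\ufffd' (U+FFFD)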
The reason some older mobile phones struggle with SMS containing emoji, instead of just displaying tofu in place of unsupported characters, is that there's no way to send emoji in accordance with the SMS standard: it defines the encoding to be UCS-2. To put emoji in an SMS, newer phones send the message as UTF-16 instead, technically violating the standard, which can break parsers that expect only UCS-2.
UTF-16 is the worst of both worlds compared to UTF-8 and UTF-32. The only reason it exists (and is, unfortunately, prevalent) is that a number of popular technologies (Java, JavaScript, Windows) thought they were being smart by building their Unicode support on UCS-2, and now here we are.
Now, the issue of clipping or reversing strings is a problem not just because of encoding. It simply doesn't work even with UTF-32: you're going to end up cutting off combining characters, for example. Manipulating strings is very difficult, and software should never really try unless its authors know what they're doing, and even then you need a library to help you do it.
I said they thought they were smart. I'm not going to judge whether it actually was smart based on the situation then.
That said, UTF-8 was already four years old by the time Java came out. Surrogate pairs were added to Unicode in 1996 (Unicode 2.0), around the same time Java 1.0 was released.
I joined Sun Microsystems around that time, and Unicode really wasn't a thing in the Solaris world for a few more years, so the fact that people weren't aggressively pushing good Unicode support at the time is understandable. People just didn't have much experience with it.
Surrogates are technically a UTF-16-only thing, but sometimes they nevertheless escape out into the wild, so WTF-8 defines a superset of UTF-8 that encodes them:
To be clear, this is not an official Unicode spec. It's a hack (albeit a pretty natural and obvious one) to deal with systems that don't do Unicode quite right.
I recently came across some old code that narrows wchar_t to UCS-2 by zeroing out the high-order bytes. Even though my test was careful not to generate any surrogates in the input, they showed up in the output when a randomly generated code point like U+1DF7C was mangled into U+DF7C.
A corrupted value like that is not necessarily a great example of something you want to preserve, but it's the sort of thing that late 90s code assumed about Unicode.
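A toy reproduction of that narrowing bug, sketched in Python (the helper name is mine, not from the original code):

    def narrow_to_ucs2(cp: int) -> int:
        # Zero out everything above the low-order 16 bits, as the old
        # wchar_t-narrowing code effectively did.
        return cp & 0xFFFF

    mangled = narrow_to_ucs2(0x1DF7C)
    assert mangled == 0xDF7C                # a lone low surrogate
    assert 0xD800 <= mangled <= 0xDFFF      # invalid as a standalone scalar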
Specifically, filenames on Windows are not UTF-16 (or UCS-2) but rather WTF-16: like UTF-16, but with possibly unpaired surrogates. WTF-8 provides an 8-bit encoding for such filenames that matches UTF-8 wherever the original was valid UTF-16, while converting the rest in the most straightforward way possible, meaning you need less code to go from WTF-16 to WTF-8 than to go from UTF-16 to UTF-8 while rejecting invalid sequences.
It's invalid according to the spec. They are permanently reserved code points for use in UTF-16.
> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
They could be replaced by the replacement character to produce a valid string.
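Python happens to demonstrate both options: its "surrogatepass" error handler produces the same bytes WTF-8 uses for a lone surrogate, while strict decoding with "replace" sanitizes them to U+FFFD. A small sketch:

    lone = "\ud83d"                     # an unpaired high surrogate

    # Strict UTF-8 refuses it, as the spec quoted above requires:
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError:
        pass

    # WTF-8-style bytes, and the lossless round trip back:
    wtf8 = lone.encode("utf-8", "surrogatepass")    # b'\xed\xa0\xbd'
    assert wtf8.decode("utf-8", "surrogatepass") == lone

    # Or sanitize: strict decoding with "replace" yields U+FFFD instead.
    print(wtf8.decode("utf-8", "replace"))          # '\ufffd\ufffd\ufffd'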
Nitpick: UCS-2 actually isn't fixed-length either, e.g. "ẍ̊" (small x + umlaut + ring above) is two code units (1E8D 030A) or possibly three (0078 0308 030A).
UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point). Of course, to represent a grapheme cluster, more than one code point may be needed, but that's true of Unicode in general.
Yes, that was rather my point: if you're using a Unicode-based character encoding, you're going to have variable-width characters regardless, so you might as well use UTF-8.
> UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point).
Sure, but that's an implementation detail of the mapping from characters (at the application level) to bytes (at the physical(-ish) representation level).
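A short Python illustration of the point upthread: a fixed number of code units per code point still doesn't give you fixed-width user-perceived characters:

    import unicodedata

    s = "\u1E8D\u030A"            # x-with-diaeresis + combining ring above
    assert len(s) == 2            # two code points...

    decomposed = unicodedata.normalize("NFD", s)
    assert decomposed == "\u0078\u0308\u030A"
    assert len(decomposed) == 3   # ...or three, yet one visible character

    # The trailing code points are combining marks (nonzero combining
    # class); real grapheme segmentation follows UAX #29, e.g. via the
    # third-party `regex` module's \X.
    assert all(unicodedata.combining(c) > 0 for c in decomposed[1:])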
UTF-8 is simple enough to implement and yet I've seen it done improperly more than once.
The problem with UTF-8 is that the density is really good for North America and Western Europe but drops off quite a bit for other languages, and you have to trade CPU for bandwidth (e.g. gzip) to do much about it.
Japan has several more compact encodings. ISO-2022-JP is the one that uses escape characters to switch code pages (Shift JIS, the one everyone recalls, instead packs kanji into plain two-byte sequences without escapes). As long as you don't switch too rapidly between kanji and borrowed words, they're more compact than UTF-8, but more complex to implement (though less so, I'd say, than implementing gzip; and if you aren't using zlib, one of the most portable libraries in existence, you have much bigger issues than character encoding).
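You can watch the escape-sequence scheme in action from Python, which ships codecs for these legacy encodings:

    # ISO-2022-JP switches code pages with ESC sequences; ASCII runs cost
    # one byte each, kanji two, plus the escapes at each switch.
    jis = "abc日本abc".encode("iso2022_jp")
    print(jis)   # b'abc\x1b$BF|K\\\x1b(Babc'
                 # ESC $ B ... into JIS X 0208; ESC ( B ... back to ASCII

    # Shift JIS avoids the escapes entirely: two bytes per kanji, flat.
    print("日本".encode("shift_jis"))   # b'\x93\xfa\x96{'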
UTF-8 takes 3 bytes for most of the BMP. Only the first 2,048 code points fit into 2 bytes, which mostly covers European and Middle Eastern alphabets.
Outside of embedded software this really isn't that much of a problem any more.
Taking a random Wikipedia page as a sample, I get 46 kB (UTF-8) versus 35 kB (Shift JIS). A random Japanese text from Project Gutenberg is, in Shift JIS, roughly ⅔ the size of the UTF-8 text.
Those are impressive enough numbers, but add just a single photograph to the Wikipedia page and it doesn't matter at all. Text is just pretty efficient, even if you use an encoding that supports every language in the world.
First, that's because European languages have small alphabets. It's not like Chinese or Japanese with their many thousands of characters could have fit in those 2,048 spots anyways. So it makes sense to allocate the small common alphabets there.
Second, text is so comparatively tiny relative to photos, video, code, etc. that it really doesn't matter at all anyways.
Third, text is often zipped as well. It's often zipped over HTTP. It's zipped when it sits inside of an EPUB. It's zipped when it sits inside a Word document. You can even configure MySQL to zip text fields in a database. Basically, whenever space is an issue, you can fix it.
So it's hard to see how this is any problem in practice at all, when phones and computers mostly ship with 32 GB of SSD minimum.
The density drops off but it's still good density. It's not a problem. And the amount of CPU you need to do the bit shifts is negligible.
Shift JIS requires you to track extra context while you're actively using the text, and that's going to take extra space. I bet that in most of the situations where Shift JIS meaningfully wins out, you could get more benefit from a combination of UTF-8 and Zstd.
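That bet is easy to check at home. A rough sketch in Python, with zlib standing in for Zstd since it's in the stdlib (the repeated filler line makes compression unrealistically good, but the direction of the result holds for real prose too):

    import zlib

    # Opening line of "I Am a Cat", repeated to simulate a longer text.
    text = "吾輩は猫である。名前はまだ無い。" * 1000

    utf8 = text.encode("utf-8")
    sjis = text.encode("shift_jis")
    packed = zlib.compress(utf8)

    print(len(utf8), len(sjis), len(packed))
    # Shift JIS is ~2/3 the raw UTF-8 size, matching the Wikipedia
    # measurement above, but compressed UTF-8 is far smaller than either.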
An entertaining article, but it's not historically accurate. If you look at measured usage, UTF-8 took off around 2005 and was the dominant web encoding by 2008. Emojis weren't added to Unicode until 2010, at which point UTF-8 usage continued to increase at exactly the same rate as before.
Hmm, in MySQL land you have utf8, which means utf8mb3, and utf8mb4. Only the latter supports emoji. And it is only in recent releases (MySQL 8.0) that utf8mb4 became the default character set.
I work with WordPress a lot, and until about three years ago it was quite common for MySQL setups at shared hosting providers to only support utf8mb3. Emoji support really did help move things forward here.
It's probably down to browser/OS support and the increasing importance of user-generated content on the web. Windows only got Unicode support by default on the consumer side with Windows XP, which launched in 2001. Mac OS X was also released in 2001 and had Unicode support from the start (unlike Classic Mac OS, which had it bolted on with poor app support). Similarly, according to an old blog post I dug up[1], Internet Explorer only got good Unicode support with 5.5 in 2000, and Netscape never had good Unicode support. Firefox, which had Unicode support from the start, was released around 2002. By 2006, which is clearly an inflection point on that graph, browsers and OSes with poor Unicode support were highly obsolete, and global social media platforms like Myspace, Facebook, and Twitter, which needed Unicode to let users writing in different languages share posts, were in the ascendant.
Dealing with the hell of mixed ISO-8859-1/15 vs CP-1252 vs UTF-8 vs plain ASCII was enough to make me an early embracer of UTF-8-everywhere, despite mostly having to deal with English-language sources. "Someone copy/pasted from Word and broke the database again" are words you never want to hear.
UTF-8 was the sane refactoring after the initial incompatible parallel standards were established.
The ascendency of the CJK market, followed by Google Chrome, paved the way for UTF-8 everywhere.
The more interesting thing is why basically no one uses Eastern ideograms in the West, except maybe the Korean ideogram for crying (ㅠㅠ) and rarely, other kaomoji-like stuff. Some kanji also tell visual stories, and most children learn them just fine, so it’s not as simple as accessibility. Borrowing kanji was also anticipated by many sci fi writers and yet is not to be.
I'll bite: what has Chrome to do with UTF-8? As far as I can find, the last browser to struggle with it was IE5, and IE6 was released almost a decade before Chrome was a thing.
I’d quibble with that characterisation. JavaScript uses neither UCS-2 nor UTF-16. Rather, its strings are sequences of 16-bit code units, with surrogates handled according to UTF-16 rules, except that unmatched surrogates are permitted to remain (that is, JavaScript strings aren’t necessarily valid Unicode).
There is no requirement whatsoever that the browser actually use 16-bit code units to represent strings. This is what Simon Sapin achieved for Servo with the WTF-8 encoding, which extends UTF-8 to allow surrogates, including unmatched surrogates: it can represent these strings with 8-bit code units, commonly halving memory usage and improving the speed of various sorts of linear processing (though at the cost of random access, which becomes O(n)).
First of all, they ARE getting traction. Many YouTubers and Twitch streamers have started to use them in stream/video titles. I hadn't seen corner brackets at all ten years ago, and these days I see them in use at least once a week.
Some programming languages are starting to adopt them, too. Raku is the one I know of (it allows French and German quotes as well). Maybe Julia, too? I think some language communities tend to be more open to widespread Unicode usage in source code than others.
Those are called corner brackets. It took me a while to find out that I have to have a CJK font installed on my computer to use them. And the file sizes of CJK font families are huge! More than 100 MB.
Oh, GNU Unifont is new to me, thanks for sharing. I used another source for CJK (the one that is 100 MB) to ensure I have every single possible uncommon character/glyph installed without chasing more fonts: one source that has it all in one file. I went that route because other sources don't always have a full set.
GNU Unifont is a bitmap font with a tiny resolution. Nice to have as a last level fallback, but too ugly for any characters you actually see more than once in a fortnight.
Huh, I didn't know about those. I've been using euro-style « and » though, to be able to copy-paste things that already include " and ', and still delimit what I am quoting.
Guillemets are used in many languages like «this», but 'euro-style' is a bit of a misnomer. They are used all over the world, and in many European languages different pairs are used, such as guillemets the »other« way around, and „this” matching pair.
I believe it's „this“, actually (U+201E to start, U+201C to end), but the distinction between all those quotation marks is hellishly difficult, and I bet native speakers get it wrong all the time.
Once upon a time, I wanted to rely on these distinctions in a TTS frontend to distinguish between 5" floppy disk and "Mambo No. 5"
I soon realized that people use quotation marks and dashes in such a random manner that insisting on treating the semantics literally would create more confusion than it would resolve.
Edit: Now that my comment has posted, I realized you used the correct characters, but the font on my browser rendered them in a seemingly incorrect way, so they looked like double primes.
I take great pride in always typing things like 5¼″ and “Mambo № 5” with the exact Unicode scalars that I desire. I love my Compose key. (I have Compose+'+` produce ′ (prime), Compose+"+` produce ″ (double prime), Compose+:+: produce “, Compose+;+; produce ‘, Compose+"+" produce ”, Compose+'+' produce ’, Compose+N+o produce №, and Compose+1+4 produce ¼.)
> Out of curiosity, which Korean ideograph would that be?
Yu. Having two of them looks like a crying face. Although tha (ಥ) is also a common component of crying faces (ಥ_ಥ). They're talking about kaomoji, which use various non-Latin or fullwidth symbols (though you're right that they're largely not ideograms) to compose pretty extensive "smileys", e.g. the look of disapproval uses Kannada, denko uses Greek and katakana, ...
The Gboard keyboard on android has a tab for many of these common “emoticon” faces / character sequences. If you open the emoji picker on the keyboard and then tap the far right bottom tab icon “:-)”
They can get very elaborate though, these are just very basic common faces.
> The Gboard keyboard on android has a tab for many of these common “emoticon” faces / character sequences. If you open the emoji picker on the keyboard and then tap the far right bottom tab icon “:-)”
iOS also has that on the standard Japanese "Kana" keyboard (and possibly others), under the "^_^" key.
> basically no one uses Eastern ideograms in the West
That’s false. There are a lot of students of CJK languages in European universities, with Japanese and more recently Korean being popular choices. Chinese and Japanese can be learned in some high schools, too. There are publishers putting out books containing such characters (e.g. You Feng in Paris). And of course the diaspora and heritage learners... That's a lot of people, even more in countries with significant East Asian populations like Australia or the US.
It's accessibility in the sense that computer input methods popular in the West can't generate them. As far as I know, there's no way to get my computer or phone keyboards to produce 水 without switching to one of the CJK input modes.
On Mac, at least, the "Emoji keyboard" accessible through Cmd+Ctrl+Space in all standard text controls makes it possible to add basically any character in Unicode if you know its name. For example, you can type "water" to get ⽔ (along with other characters, like the water droplet emojis). I use this often to type the Greek beta symbol, for example.
...and then when you look it up that way, if you note that it is U+2F54, and if you have your input source set to "Unicode Hex Input", you can enter ⽔ by holding down the option key (⌥) and typing 2f54.
As far as I know this only works for characters in the BMP, but for such characters it may be faster than picking them from that cmd-ctrl-space popup.
If you don't have that input source available, you can add it via the "Input Sources" tab of Keyboard preferences. There you can also enable showing the input menu in the menu bar giving you an easy way to switch between "Unicode Hex Input" and your normal input source.
If this was ever a conscious decision, it was genuinely brilliant. Much better to have a carrot rather than the stick of "you'll get hacked (except probably not)"
As a millennial whose only chatting activities are confined to IRC and who doesn’t really get emojis: could someone please articulate this phenomenon in terms that would be meaningful to me?
I’ve seen sentences sometimes where words are actually replaced with emojis... Is this how some subset of people actually communicates online, or is that just for some ironic effect?
It can be a way of adding intonation and other affective, out-of-band communication back into text. I never really see people use them purely to replace words one-for-one. But interspersed throughout a message, emojis can, as you say, add irony, but also communicate that a message that could be taken as ironic isn't, or communicate some other subtext.
Text is a flattening of speech, and emojis can add some of those missing dimensions back -- and, like our IRL verbal cues, tics, and gestures, they can be hard to decode if you're not "in" on the game.
It’s an ingroup signal, indicating (by “correctly” using emoji) that you’re part of a specific subculture to the recipient. “praying_emoji flame_emoji YASS” indicates to me that the writer is young and hip. “Do you want to have a BBQ bbq_emoji?” says to me that the writer is older and less hip.
It can be hard to communicate tone through writing. Emoji allow one to instantly mark a piece of writing as informal/non-serious with minimal effort. This includes irony: “eggplant_emoji” is often a non-serious reply indicating “I am jokingly acting like this is sexual or attractive”.
It’s a proxy for longer writing. “Thumbsup_emoji” is a substitute for some marginally harder to articulate feelings of “looks good / I like it”
Of course, there are many subcultures that use emoji in different ways and as proxies for other things as well. At a previous employer we’d often just send “taco_emoji?” to ask who was buying lunch. It’s the sort of thing that can be used/abused in many different ways.
I really strongly recommend 'Because Internet: Understanding the New Rules of Language' by Gretchen McCulloch. It's a lighthearted linguistic look at how our written communication has changed over the last couple of decades since the advent of mainstream internet access.
Ironic effect is, of course, communication all by itself.
My use of emoji in messaging applications is primarily limited to quick rebus-like reaction replies to meme images, a quick and dirty reaction to a message in slack expressing some vague emotional response, or making complicated fart-jokes with my partner that rely on a lot of out of band information.
(i also use IRC on the daily, and have been known to use emoji there too, so these communication forms are not disjoint)
On my Mac, I spent an evening figuring out how Apple's emoji picker works and backporting all the emojis instead. But I'm not entirely sure what this says about me.
Emoji are just fonts, and with some search engine sleuthing, you can find an OG pistol emoji out there, pop open your system emoji font in an editor, and replace the water pistol with a pistol pistol.
Everyone else will still see the Nerfed version, of course.
:burrito: and :taco: are not emoji as such, though; they are more akin to good old "smileys" that turn character sequences into images. Supporting emoji would be to support UTS #51 https://www.unicode.org/reports/tr51/ natively.
In the Russian segment of the internet, the various Cyrillic encodings (win, dos, mac, koi-8) were a huge problem, and only UTF-8 finally solved it, long before emoji became a thing.
It is known that developers from English-speaking countries are generally oblivious to encoding problems. They could probably get by on ASCII far longer than the rest of the world, so it's no wonder they might confuse cause and effect in this case.
Can confirm, at least anecdotally. As someone who runs a website but doesn't have nearly enough time to do everything he wants to do, upgrading to UTF-8 (and specifically utf8mb4) was never a priority - until my users started using emojis and breaking things left, right and center.
Emoji probably contributed to widespread support of supplemental planes (fixing systems which treated UTF-16 as UCS-2), but I doubt they contributed much to UTF-8's popularity.
Or treated UTF-8 as the nonsense that MySQL's utf8 is (it's 3-byte UTF-8, a.k.a. only the BMP, and it silently drops everything from the first non-BMP code point onward).
Indeed. To ensure MySQL stores "real" UTF-8 one has to use "utf8mb4" instead of "utf8", which just rolls off the tongue. Backwards (in)compatibility seems to be the reason MySQL can't just DWIM the old name... "utf8mb4 or bust" it is, then!
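The underlying arithmetic is easy to see. A quick Python check of how many UTF-8 bytes different characters need (utf8mb3 columns stop at three):

    for ch in "é", "中", "😀":
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} bytes")
    # U+00E9 -> 2 bytes
    # U+4E2D -> 3 bytes
    # U+1F600 -> 4 bytes   (the ones utf8mb3 silently mangles)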
Information density matters, and I can't wait for someone to replace the "old" color coding of files (.dircolors in bash) with a single emoji: a music note for music files, etc.
I'm glad about that. I find it weird reading through comment threads or forums and seeing mobile phone emojis scattered through. I find them distracting.
I'm not really too sure why. I don't mind them in personal messages or texts and stuff, but seeing them on public pages just kind of annoys me for some reason.
I'm not, because it's completely arbitrary about it: e.g. you can include 🁓, 🀕, ⏱, 🃅, box drawing, or Z̸̠̽a̷͍̟̱͔͛͘̚ĺ̸͎̌̄̌g̷͓͈̗̓͌̏̉o̴̢̺̹̕, but not trigrams, die faces, box elements, musical notes, or flags. They just whitelisted/blacklisted entire blocks and called it a day.
Which is obviously par for the course when it comes to HN's comment box; the markup system is even more half-assed.
Sounds like they blacklisted things likely to clutter up the comment threads and left things unlikely to be used.
Country flags seem like they could be used for political trolling.
Die faces could lead to weird rolling threads or other things.
Musical notes, you got me, can't really think of anything too bad for those.
The markup's not great, but too much formatting is distracting. I personally prefer the limited options. You focus more on the content of your comment than making it look pretty.
The only thing I really despise about HN's formatting is the code blocks, or whatever they are: the ones that vanish off the side on mobile so you have to scroll horizontally to read everything. I really can't stand when people use those for quotes.
Other than that though, hn's formatting makes everything uniform and fairly easy to read through. There's no fancy nonsense getting in the way of things.
Actually, that's part of why those code block things piss me off: they're probably the fanciest piece of formatting you can do, and all they do is obstruct information and make me waste time while reading.
> Sounds like they blacklisted things likely to clutter up the comment threads and left things unlikely to be used.
That’s not really believable given how arbitrary it is.
> Die faces could lead to weird rolling threads or other things.
As if tiles or playing cards could not be used that way.
> The markup's not great, but too much formatting is distracting.
The problem is that despite having only two directives, half of HN’s markup is actively detrimental: because there is no escaping, no inline literals, and the parsing is sub-par, in my experience the “emphasis” directive causes issues more often than it helps. HN’s markup would be significantly improved by removing it entirely.
> I really can't stand when people use those for quotes
Which would be way less likely if HN actually supported quotes.
>> I really can't stand when people use those for quotes
>Which would be way less likely if HN actually supported quotes.
But look how well this works ;p.
Sorry...couldn't resist.
I dunno, I like the 'hackish' nature of it.
You're right, I'm sure the tiles or playing cards could be used like that too; it may be arbitrary, I don't know. Those were just some reasons off the top of my head. I'm sure a bit more thought went into it when HN was being programmed, or maybe not, who knows?
My main point is, I like the simplicity of it all, sure it could be better, but better doesn't necessarily lead to better quality content.
There's a minimal amount of distraction, most users find reasonable ways to communicate the context of their posts, and scrolling through most threads is a mostly uniform experience: as long as users follow a few established conventions, you can follow the flow of things pretty well.
It's not perfect, it's not the best, but I feel like it fits the general vibe and nature of the site. It gives HN an identity among all the other news aggregators and forums.
I wish they would add U+2009 (thin space) to that list. That's the standard way under the SI system to separate digit groups, e.g., 1 234 567. HN just treats it as a regular space.
(The SI standard for separating the integer part from the fractional part is to use "." or ",", whichever is customary in your location. Using thin space for grouping removes the ambiguity that you get in places that use one of "."/"," for grouping and the other for a decimal point).
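If you ever want to emit that style yourself, one hack (a Python sketch; the function name is mine, and locale-aware formatting can do better) is to format with comma grouping and swap in the thin space:

    THIN_SPACE = "\u2009"

    def si_group(n: int) -> str:
        # Format with comma grouping, then swap in U+2009 THIN SPACE.
        return f"{n:,}".replace(",", THIN_SPACE)

    print(si_group(1234567))   # 1 234 567, with thin spaces between groups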
I wonder if that's putting things through Unicode "canonical normalization", or just custom rules.
Let's see what it does with `U+00BC Vulgar Fraction One Quarter Unicode Character`... ¼
Nope, it allows it instead of turning it into `1/4`, so HN isn't doing compatibility normalization (NFKC), which is what would rewrite it; canonical normalization wouldn't touch ¼ in the first place. I guess it's custom rules? Or some other Unicode transformation we're not thinking of, or another third-party reusable transformation.
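The distinction is easy to check with Python's unicodedata:

    import unicodedata

    quarter = "\u00BC"   # VULGAR FRACTION ONE QUARTER

    # Canonical normalization leaves it untouched:
    assert unicodedata.normalize("NFC", quarter) == quarter

    # Compatibility normalization is what rewrites it:
    assert unicodedata.normalize("NFKC", quarter) == "1" + "\u2044" + "4"
    # i.e. '1', FRACTION SLASH, '4'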
True, there's almost certainly a whitespace normalisation pass at some point as well, likely during / around the processing of what little markup HN has.
It's interesting to watch the evolution of written language in action. I expect in 20 years we will routinely see emojis in written English novels and news articles. In 50 years we'll see them in textbooks and scientific journal articles.
Yet another reason I'm glad for the inevitability of death.
I can't be the only one who thinks emoji are a terrible idea. Granted, I also don't think logographic characters are a good idea but at least they have thousands of years of use and agreed upon semantic meaning behind them.
If everything you care to talk about can be easily described using thousand-year-old ideas, then I can see why you are against emojis. But this isn't true for many things people want to talk about today.
Language is just an encoding for ideas, and emojis are a new compression algorithm. Using a single character, you can now convey certain thoughts and sentiments that you previously needed many more characters to reference or explain.
"So then just explain it!" some might respond. "Why can't people be bothered to spend even a little time to write down what they think?" It's an accessibility issue. People have limited time every day to get their ideas across, and they deserve ways of conveying their ideas concisely. There is precedent for this too -- this is why we have acronyms and new words. "lol" and "minivan" don't have thousands of years of agreed-upon semantic meaning behind them.
A final thought: whether you think emojis are a terrible idea might not be relevant to whether they should exist. Letting people live their own lives to the fullest is much more important than making sure you, I, a future historian, or any other third party understands what they are saying. But you don't have to worry about not being able to understand conversations. Given what you prefer, if someone wants to address you as a target audience, they probably won't use emojis.
What an overreaction. You'll find the proliferation of emojis distributed appropriately according to the genre of writing (among good writing of the genre). For example, you'll still hardly see emoji in newswriting where they don't have much to add to the semantic content, but you already see it liberally used in places where stickers and drawings are already expected: e.g. in edited Instagram photos.
> In 50 years we'll see them in [..] scientific journal articles.
I highly doubt this. Common shortcuts (such as "it's" for "it is") and dialect are still absent from serious journals (unless they're the research target, of course), and I'm rather sure emojis will similarly be disregarded.
"And it's amusing to see Apple using new emojis as a carrot to get people to install the latest security patches."
OMG, that makes so much sense. I was the opposite, grumbling about not caring about that silliness, not realizing the psychology of things.
Having suffered through the dark ages with Microsoft increasingly ruining the world with an endless stream of proprietary crap (I still hate them for making people think tab width is configurable), it's amazing to step back and witness how much things have improved (on this narrow slice).
Emojis are what got me to learn about Unicode, personally. We had a big, bloated library handling emoji stuff for us in our product [0]. Then one day, someone said they wanted the comment threads to behave more like Slack: if only an emoji was typed in a comment (without any additional text), it should appear big, otherwise small. This required detecting when text was only an emoji, which required me to learn how the heck this stuff actually works. Great experience, and I'm forever indebted to emojis.
0: https://kitemaker.co, an awesome new issue tracker with tons of hotkeys (and emojis)
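For anyone wondering what that check involves: there's no single "is emoji" bit, so you end up testing code points against ranges from the emoji data files. A crude stdlib-only Python approximation (the names are mine, the ranges deliberately incomplete; real detection should follow UTS #51):

    EMOJI_RANGES = (
        (0x1F300, 0x1FAFF),   # pictographs through Symbols Extended-A
        (0x2600, 0x27BF),     # Miscellaneous Symbols, Dingbats
        (0x1F1E6, 0x1F1FF),   # regional indicators (flag pairs)
        (0xFE0E, 0xFE0F),     # variation selectors
        (0x200D, 0x200D),     # zero-width joiner for ZWJ sequences
    )

    def is_emoji_only(text: str) -> bool:
        text = text.strip()
        return bool(text) and all(
            any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES)
            for ch in text
        )

    assert is_emoji_only("😀")
    assert is_emoji_only("👨\u200d👩\u200d👧")   # ZWJ family sequence
    assert not is_emoji_only("nice 😀")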
What I don't like about this approach is that everybody potentially sees different versions of the emoji.
(And by the way, HN stripped out the emoji I put in this comment; perhaps that is for the better, but it's kind of funny that the software we use here is quite opposite to the software we make in our day jobs)
Do e-mail clients like Thunderbird and Outlook already use UTF-8 by default? The last time I checked, they used ANSI/ISO code pages for new e-mails. I mean the desktop MS Office Outlook app primarily, but I'm curious about the web apps as well.
It didn't seem like it was emoji paving the path so much as web-based email and the internationalization of websites. The whole move to the web in key areas like email meant encoding across countries and platforms became less of a hair-pulling nightmare for developers to deal with. Throw in the dawn of smartphones (and emoji did come along with that, yes) and people moving between desktop/mobile/web, and that was more problems on top. UTF-8 took care of a lot of the headache.
Pretty sure the dominance of ASCII on the internet and the efficiency/compatibility of UTF-8 in relation to ASCII paved the way for UTF-8 everywhere. It is the standard Unicode encoding of the internet.
If anything, I would say UTF-8 paved the way for emoji, not the other way around, as the ubiquity of a Unicode encoding allowed emoji to exist. You can't encode emoji with ASCII. You have to have Unicode and an encoding of it before you can have emoji.