> One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.
I believe that's what checkContinuation() is doing, based on its use of the "counts" parameters. I don't understand how it works, but I don't see any other reason for count_nibbles() to compute the '->count' member.
The counts are actually there to find the lengths of the runs of continuation bytes, and to mark them so we can look for overlaps or gaps between the end of one code point and the start of the next.
With such a high percentage of text on the web in UTF-8, and so many uses for variable-length integers, I hope we'll see single instructions for reading and writing Unicode code points to/from UTF-8, as well as for reading and writing 128-bit variable-length integers (preferably prefix-encoded[0] rather than LEB128).
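For concreteness, here is a minimal sketch of the two varint styles being compared (just an illustration of the encodings, not any particular ISA proposal; the function names are made up):

    #include <stddef.h>
    #include <stdint.h>

    /* LEB128: 7 payload bits per byte, high bit set on every byte except
       the last, so a decoder has to examine each byte to find the length. */
    static size_t leb128_encode(uint64_t v, uint8_t out[10]) {
        size_t n = 0;
        do {
            uint8_t b = v & 0x7F;
            v >>= 7;
            out[n++] = b | (v ? 0x80 : 0x00);
        } while (v);
        return n;
    }

    /* Prefix-coded: the length is announced up front, UTF-8-style, so the
       decoder learns it from the first byte alone. Sketch covers values up
       to 14 bits only, to keep it short. */
    static size_t prefix_encode(uint32_t v, uint8_t out[2]) {
        if (v < 0x80) {                     /* 0xxxxxxx          */
            out[0] = (uint8_t)v;
            return 1;
        }
        if (v < 0x4000) {                   /* 10xxxxxx xxxxxxxx */
            out[0] = 0x80 | (uint8_t)(v >> 8);
            out[1] = (uint8_t)v;
            return 2;
        }
        return 0;                           /* longer forms omitted */
    }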
A while back, I read that the Chinese Loongson processors implemented a (subset?) of the MIPS instruction set with added instructions for Unicode handling, but that's all I've heard of processors with Unicode-accelerating instructions, and I'm not sure which encoding(s) was/were accelerated.
> With such a high percentage of text on the web in UTF-8, and many uses for variable-length integers, I hope that we'd see single instructions for reading and writing Unicode codepoints to/from UTF-8
If we had taken this attitude with every new technology, we'd now have a large number of useless instructions. At one time it probably seemed like a good idea to have custom instructions for parsing XML, and people really did build custom instructions for interpreting JVM bytecode.
The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way) should make it clear that CPU designers don't really care about longevity; all that matters is current performance characteristics.
Not to mention that I think it's fair to say that UTF-8 will outlast AES (it's actually older than AES). After all, AES was only standardised in 2001 -- and AES-NI was added when it was 7 years old. Unicode was standardised in 1991, and UTF-8 followed in 1993. So if we have AES instructions, we should already have UTF-8 instructions.
> The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way)
Say what? Unless you mean something other than ‘elliptic curve’ by ‘EC,’ that doesn’t make sense to me: AES & EC are completely different kinds of thing, the former being symmetric & the latter being asymmetric. Indeed, it’s quite common to use an EC-derived secret as the key for AES-encrypted material.
> Plus, there are already several instructions on x86 for string manipulation.
Also SHA1, AES, CRC32, and other specialized functions that may have less staying power than UTF-8 (SHA1 in particular, but also AES: being a block cipher, it has some nuisances that mean it is not always used in new algorithms).
What irks me is that the CRC32 instruction in the x86 architecture uses a different polynomial than many of the CRC32 implementations in the wild, which makes the instruction unusable for them...
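To make the mismatch concrete: the SSE4.2 instruction computes CRC-32C (Castagnoli, reflected polynomial constant 0x82F63B78), while zlib/PNG/Ethernet-style CRC-32 uses the reflected polynomial 0xEDB88320, so the two disagree on the same input. A small sketch (needs -msse4.2; the bitwise version is the textbook one, not tuned):

    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>   /* SSE4.2 _mm_crc32_* intrinsics */

    /* Classic (zlib/Ethernet) CRC-32, reflected polynomial 0xEDB88320. */
    static uint32_t crc32_ieee(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            crc ^= p[i];
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    /* The hardware instruction computes CRC-32C (polynomial 0x82F63B78),
       which will NOT match crc32_ieee() for the same bytes. */
    static uint32_t crc32c_hw(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++)
            crc = _mm_crc32_u8(crc, p[i]);
        return ~crc;
    }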
It had been essentially dead for a while even before AArch64. This is slightly obscured by the fact that there's a version of Jazelle that's basically a no-op mode, which traps on each bytecode.
Thanks to CPUID and analogous hardware capability registers on other platforms, outdated instruction sets can be deprecated and removed when there is no longer market demand for them.
Motorola removed 64-bit MUL/DIV as well as the transcendental floating-point instructions (FCOS, FSIN, ...) in the 68060, ARM removed the Jazelle instructions, and AMD removed the 3DNow! instructions.
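Which is also why portable code tends to probe at runtime rather than assume an extension exists. A sketch using GCC/Clang's __builtin_cpu_supports (the two validator functions here are hypothetical placeholders):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical validators: one using AVX2, one portable. */
    bool validate_utf8_simd(const char *buf, size_t len);
    bool validate_utf8_scalar(const char *buf, size_t len);

    bool validate_utf8(const char *buf, size_t len) {
        /* __builtin_cpu_supports consults CPUID (GCC/Clang, x86 only);
           fall back to the portable path if the extension is absent. */
        if (__builtin_cpu_supports("avx2"))
            return validate_utf8_simd(buf, len);
        return validate_utf8_scalar(buf, len);
    }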
Oh duh, thanks. (Checking less than zero, read it wrong.)
I think it would be faster to OR all of the string's bytes together and only check the high (8th) bit at the end, though. On Skylake that would cut it to 0.33 cycles per 16 bytes (HSW: 1 per 16).
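Something like this, I assume (an SSE2 sketch of the idea, not anyone's actual code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 */

    /* OR all 16-byte blocks together and only test the high bits once at
       the end; no branch per block, so the loop is pure POR throughput. */
    static bool is_ascii_sse2(const uint8_t *buf, size_t len) {
        __m128i acc = _mm_setzero_si128();
        size_t i = 0;
        for (; i + 16 <= len; i += 16)
            acc = _mm_or_si128(acc, _mm_loadu_si128((const __m128i *)(buf + i)));
        uint8_t tail = 0;
        for (; i < len; i++)          /* scalar tail */
            tail |= buf[i];
        /* movemask extracts the top bit of each byte; nonzero => non-ASCII */
        return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
    }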
Depends on your input. If non-ASCII strings are frequent and likely to contain a non-ASCII character fairly close to the start of the string, then it makes sense to short circuit.
Fast checking is really useful in things like HTTP/SIP parsing. Rust should expose such a function as well, seeing as its strings must be UTF-8 validated. Though it's even faster if you can just avoid UTF-8 strings and work only on a few known ASCII bytes, that means you might push garbage further down the line.
I meant Rust should have a SIMD optimised version that assumes mostly ASCII. I'm guessing there is a trade-off involved depending on the content of the string.
The linked implementation assumes mostly ASCII. It doesn't use SIMD. SIMD in Rust is a work in progress - the next version (1.27) will stabilize x86-64-specific SIMD intrinsics. There's an rfc (https://github.com/rust-lang/rfcs/pull/2366) for portable SIMD types.
SIMD support in Rust was only recently accepted and is still being implemented, so this currently relies on the auto-vectorisation abilities of the compiler (it might get revisited soon-ish, I guess).
As for the assumption of mostly-ascii, the validation function has a "striding" fast path for ascii which checks 2 words at a time (so 128 bits per iteration on 64b platforms) until it finds a non-ascii byte: https://doc.rust-lang.org/src/core/str/mod.rs.html#1541
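Transcribed to C, the idea is roughly this (a sketch of the technique, not the actual Rust source): read two machine words per iteration and bail out to per-byte validation as soon as any high bit is set.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* High bit of every byte in a 64-bit word. */
    #define HIGH_BITS UINT64_C(0x8080808080808080)

    /* Returns the number of leading bytes that are plain ASCII, scanning
       two words (16 bytes on 64-bit targets) per iteration. The caller
       would continue with full UTF-8 validation from the returned offset. */
    static size_t ascii_prefix_len(const uint8_t *buf, size_t len) {
        size_t i = 0;
        while (i + 2 * sizeof(uint64_t) <= len) {
            uint64_t a, b;
            memcpy(&a, buf + i, sizeof a);                     /* unaligned-safe */
            memcpy(&b, buf + i + sizeof(uint64_t), sizeof b);
            if ((a | b) & HIGH_BITS)
                break;               /* non-ASCII somewhere in this stride */
            i += 2 * sizeof(uint64_t);
        }
        while (i < len && buf[i] < 0x80)
            i++;
        return i;
    }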
I have a vague feeling there’s an even faster path for probably-ASCII out there, but I can’t immediately recall where and am going to bed. Doubtless someone else will answer before I get up.
The core team will be amenable to replacing this algorithm with something faster presuming it’s still correct.
There is probably a trade-off depending on the content of the string, right? So the API probably needs a general-purpose and a "this should be all ASCII" version?
I don't know if it really makes a lot of sense to have such a specialized version of a method in the standard library. Effectively, from_str and from_str_mostlyascii would be functionally identical except for a small performance difference depending on the encoding of the string data, which wouldn't matter in 99% of programs.
Having that as a 3rd party crate would make complete sense however.
If there's a canonical, obvious, and low maintenance way to do something that there's a common need for, then there's no reason it shouldn't be in std. If it needs to be hacked at and redesigned every few years, then yeah, leave it out.
That's not really the Rust way, for instance things like timekeeping and random number generation are delegated to external crates. Given how trivial it is to add dependencies thanks to cargo it's not really an issue in practice, even if I do think that they go a bit overboard at times (time really ought to be in std IMO, especially since there's already a rather useless std::time module).
There are actually a few different parts to the code, and they work somewhat independently.
The first is to classify each byte by type, using the highest 4-5 bits. The types are:
* ASCII
* continuation
* initial byte of a 2-byte, 3-byte, or 4-byte sequence
Given these types, the multibyte sequences are checked for proper lengths, ie each length indicator has to be followed by that exact number of continuation bytes until the next non-continuation. We do this by putting the following byte count where each initial is found, zeroing the rest, and carrying counts right with shift-and-subtract, saturated to 0. This just creates a descending sequence after each initial, eg 4 -> 3,2,1, which should reach zero right at the next initial or "first" byte (ascii or multi). The vector of nonzero following bytes xored with the vector of first bytes should then be all 1's, ie there should be no gaps or overlaps.
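The same invariant, unrolled into a plain scalar loop just to make it concrete (this only illustrates what the check enforces; the real code computes it 16 bytes at a time with shifts and saturating subtracts):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 2/3/4 for lead bytes, 1 for ASCII, 0 for continuation bytes. */
    static int initial_count(uint8_t b) {
        if (b < 0x80) return 1;          /* ASCII                   */
        if (b < 0xC0) return 0;          /* continuation (10xxxxxx) */
        if (b < 0xE0) return 2;          /* 110xxxxx                */
        if (b < 0xF0) return 3;          /* 1110xxxx                */
        return 4;                        /* 11110xxx                */
    }

    static bool lengths_ok(const uint8_t *buf, size_t len) {
        int carry = 0;                   /* count carried from the previous byte */
        for (size_t i = 0; i < len; i++) {
            int init = initial_count(buf[i]);
            if (carry > 1) {
                /* Still inside a sequence: must be a continuation byte. */
                if (init != 0) return false;   /* gap: sequence cut short     */
                carry -= 1;
            } else {
                /* Previous sequence just ended: must start a new one. */
                if (init == 0) return false;   /* overlap: stray continuation */
                carry = init;
            }
        }
        return carry <= 1;               /* must not end mid-sequence */
    }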
The next task is to check for overlongs. These are just bitfields in the first or second bytes which should be nonzero, and they can be checked at least two ways. One is masking and checking for any nonzero bits, based on a lookup table of masks. Another is comparison, based on the idea that any 1 bits in the bitfield of interest will make the byte larger than all 0's (the higher-order bits are fixed). Both of these methods assign checks on a per-length basis, using a lookup table from the highest 4 bits. Longer sequences also check the second byte in similar fashion.
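In scalar form, the comparison flavour of that check, dispatched per sequence length, looks something like this (again just a sketch of the rule being enforced, not the vector code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if the sequence starting with b0 (second byte b1) is
       not an overlong encoding. Other error classes (bad continuations,
       surrogates, > U+10FFFF) are handled by the other checks. */
    static bool not_overlong(uint8_t b0, uint8_t b1) {
        if (b0 < 0xC0) return true;                      /* ASCII / continuation */
        if (b0 < 0xE0) return b0 >= 0xC2;                /* 2-byte: C0/C1 overlong */
        if (b0 < 0xF0) return b0 != 0xE0 || b1 >= 0xA0;  /* 3-byte */
        return b0 != 0xF0 || b1 >= 0x90;                 /* 4-byte */
    }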
In the end we basically have bitmaps of each type, and we check that they don't intersect and that their union covers the range. The code is a bit complicated by carrying over the previous range so that sequences that span registers will still get checked.
I was too lazy to look up what these sequences represented, but I did a quick test with the Euro symbol example from Wikipedia, and it did indeed reject the overlong sequence.
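For reference, the bytes involved: the correct three-byte encoding of the euro sign U+20AC is E2 82 AC, and the overlong four-byte form from the Wikipedia example is F0 82 82 AC. A minimal harness for that kind of test (validate_utf8() is a hypothetical stand-in for whichever validator is being checked):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical entry point; substitute the validator under test. */
    bool validate_utf8(const uint8_t *buf, size_t len);

    int main(void) {
        const uint8_t correct[]  = { 0xE2, 0x82, 0xAC };       /* U+20AC, valid    */
        const uint8_t overlong[] = { 0xF0, 0x82, 0x82, 0xAC }; /* overlong: reject */
        printf("correct : %d\n", validate_utf8(correct,  sizeof correct));   /* expect 1 */
        printf("overlong: %d\n", validate_utf8(overlong, sizeof overlong));  /* expect 0 */
        return 0;
    }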
> Should strings be represented as codepoint arrays in memory?
No. I think the implementor of Factor's unicode support originally did that but it turns out to not be useful:
* it blows up memory usage for ASCII and BMP (4 bytes per codepoint versus 1~3)
* this also has impact on CPU caches, lowering the average amount of data you can fit in your cache and work on
* it requires a complete conversion of incoming ASCII and UTF-8 data (which only gets more and more common as time goes on) rather than just a validation
* and because Unicode itself is variable-length (combining codepoints) it's not actually helpful when you're trying to properly manipulate unicode data
The only "advantage" of representing strings as codepoint arrays is that you get O(1) access to codepoints which is a terrible idea you should not encourage.
UTF-32 internal encoding makes some manipulations very slightly easier, but not enough to matter in the long run, and it encourages bad habits. If you don't need the O(1) access thing for backwards compatibility reasons, don't bother.
To limit combinatorial explosion. Without combining codepoints, you need a version of each base symbol for each combination of modifiers. And while Latin scripts usually limit themselves to one diacritic per base[0], other scripts (Hebrew or Arabic come to mind) can pile up multiple diacritics.
[0] not that that's always the case; the Navajo alphabet can have both an ogonek and an acute on the same Latin base character
That would require an array of 32 bit integers (basically UTF-32). It wastes a lot of space for most characters and doesn't help that much with indexing since user perceived characters can still be composed of many codepoints. Different languages take different approaches.
Python 3.3+, I believe, looks at the string and chooses ASCII/UCS-2/UCS-4 based on the character that needs the most bytes [1]. Elixir uses UTF-8 for strings (but you can get the codepoints easily), whereas Erlang uses the list-of-Unicode-ints style.
If by "arrays of codepoints" you mean "each element of the array is a single codepoint", then you get O(1) indexing regarding codepoints, but may use up to 4x the memory as the variable-length encoding.
People routinely read text files larger than 1 KB into memory. HTTP responses are often larger than 1 KB. Serialized data (JSON, XML, etc.) is often larger than 1 KB.
I’m currently working with a 2 MB string, representing 150k words, with a structure that is a packed trie stored as an array, for fast access to Unicode scalars in Swift.
Yes, it gives you flexibility in using UTF-8 and UTF-16 (which, despite being a shitty format, you WILL have to support), and overlong sequences also don't matter.
How exactly do you use a fraction of a cycle? I would assume that it would be fractional on average, but then again I can't exactly speak to how instructions map to cycles and whether it's solely an integer mapping.
> I can't exactly speak to how instructions map to cycles and whether it's solely an integer mapping
1. Standard ALUs on modern processors handle 64 bits at a time. So right there, you're 8x faster on a per-byte basis.
2. He's using vectorized operations, so he can work with 128-bits, 256-bits (or potentially even 512-bits on high-end processors like Skylake-X). So 16x, 32x, or 64x at a time per operation.
3. Modern processors are super-scalar with multiple ALUs and multiple execution ports. I forget what the precise theoretical limit is, but Intel AND AMD machines can execute something like 4 or 5 operations per clock, depending on circumstances.
That assumes that all operations have been decoded into micro-ops (uops), that they fit inside the uop cache (think of the uop cache as an L0 cache: beyond even the L1 cache), that they line up perfectly with the available execution ports, that your data has no dependencies, and a whole host of other conditions. But it's theoretically possible.
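Back of the envelope, that's where a figure like "0.33 cycles per 16 bytes" comes from: if the vector OR itself is the only bottleneck (Skylake can issue three of them per cycle), then

    3 ORs/cycle x 16 bytes/OR = 48 bytes/cycle
    => 16 bytes cost 16/48 ≈ 0.33 cycles, i.e. about 0.02 cycles per byte

So "a fraction of a cycle" is an amortized throughput number, not a single instruction finishing mid-cycle.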
---------
In practice, your code will be limited by memory (even L1 cache is slower than the CPU), by decoding speed (not everything fits in the uop cache, and only loops really benefit from it), and by dependencies (a = x + 1; a = a + 2: the 2nd instruction depends on the 1st one executing first, so the two can't be done in parallel / superscalar).
The CPU is pretty good at trying to figure out how to optimally reorder and out-of-order execute your code. But that means that predicting the performance of the CPU is incredibly difficult: you have to keep in mind all of the parts of the CPU while thinking of the assembly code.
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
I happened to come across this decoder again recently in Niels Lohmann's JSON library for C++:
https://github.com/nlohmann/json
I see that this is mentioned in a previous post:
https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-...
One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.