> One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.
I believe that's what checkContinuation() is doing, based on its use of the "counts" parameters. I don't understand how it works, but I don't see any other reason for count_nibbles() to compute the '->count' member.
The counts are actually there to find the lengths of the runs of continuation bytes, and to mark them so we can look for overlaps or gaps between the end of one code point and the start of the next.
With such a high percentage of text on the web in UTF-8, and so many uses for variable-length integers, I hope we'll see single instructions for reading and writing Unicode code points to/from UTF-8, as well as for reading and writing 128-bit variable-length integers (preferably prefix-encoded[0] rather than LEB128).
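For concreteness, here is a minimal sketch of the two varint styles being compared (just an illustration of the encodings, not any particular ISA proposal; the function names are made up):

    #include <stddef.h>
    #include <stdint.h>

    /* LEB128: 7 payload bits per byte, high bit set on every byte except
       the last, so a decoder has to examine each byte to find the length. */
    static size_t leb128_encode(uint64_t v, uint8_t out[10]) {
        size_t n = 0;
        do {
            uint8_t b = v & 0x7F;
            v >>= 7;
            out[n++] = b | (v ? 0x80 : 0x00);
        } while (v);
        return n;
    }

    /* Prefix-coded: the length is announced up front, UTF-8-style, so the
       decoder learns it from the first byte alone. Sketch covers values up
       to 14 bits only, to keep it short. */
    static size_t prefix_encode(uint32_t v, uint8_t out[2]) {
        if (v < 0x80) {                     /* 0xxxxxxx          */
            out[0] = (uint8_t)v;
            return 1;
        }
        if (v < 0x4000) {                   /* 10xxxxxx xxxxxxxx */
            out[0] = 0x80 | (uint8_t)(v >> 8);
            out[1] = (uint8_t)v;
            return 2;
        }
        return 0;                           /* longer forms omitted */
    }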
A while back, I read that the Chinese Loongson processors implemented a (subset?) of the MIPS instruction set with added instructions for Unicode handling, but that's all I've heard of processors with Unicode-accelerating instructions, and I'm not sure which encoding(s) was/were accelerated.
> With such a high percentage of text on the web in UTF-8, and many uses for variable-length integers, I hope that we'd see single instructions for reading and writing Unicode codepoints to/from UTF-8
If we had taken this attitude with every new technology, we'd now have a large number of useless instructions. At one time it probably seemed like a good idea to have custom instructions for parsing XML, and people really did build custom instructions for interpreting JVM bytecode.
The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way) should make it clear that CPU designers don't really care about longevity; all that matters is current performance characteristics.
Not to mention that I think it's fair to say that UTF-8 will outlast AES (it's actually older than AES). After all, AES was only standardised in 2001 -- and AES-NI was added when it was 7 years old. Unicode was standardised in 1991, and UTF-8 followed in 1993. So if we have AES instructions, we should already have UTF-8 instructions.
> The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way)
Say what? Unless you mean something other than ‘elliptic curve’ by ‘EC,’ that doesn’t make sense to me: AES & EC are completely different kinds of thing, the former being symmetric & the latter being asymmetric. Indeed, it’s quite common to use an EC-derived secret as the key for AES-encrypted material.
> Plus, there are already several instructions on x86 for string manipulation.
Also SHA1, AES, CRC32, and other specialized functions that may have less staying power than UTF-8 (SHA1 in particular, but also AES: being a block cipher, it has some nuisances that mean it is not always used in new algorithms).
What irks me is that the CRC32 instruction in the x86 architecture uses a different polynomial than many of the CRC32 implementations in the wild, which makes the instruction unusable for them...
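To make the mismatch concrete: the SSE4.2 instruction computes CRC-32C (Castagnoli, reflected polynomial constant 0x82F63B78), while zlib/PNG/Ethernet-style CRC-32 uses the reflected polynomial 0xEDB88320, so the two disagree on the same input. A small sketch (needs -msse4.2; the bitwise version is the textbook one, not tuned):

    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>   /* SSE4.2 _mm_crc32_* intrinsics */

    /* Classic (zlib/Ethernet) CRC-32, reflected polynomial 0xEDB88320. */
    static uint32_t crc32_ieee(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            crc ^= p[i];
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    /* The hardware instruction computes CRC-32C (polynomial 0x82F63B78),
       which will NOT match crc32_ieee() for the same bytes. */
    static uint32_t crc32c_hw(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++)
            crc = _mm_crc32_u8(crc, p[i]);
        return ~crc;
    }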
It had been essentially dead for a while even before AArch64. This is slightly obscured by the fact that there's a version of Jazelle that's basically a no-op mode, which traps on each bytecode.
Thanks to CPUID and analogous hardware capability registers on other platforms, outdated instruction sets can be deprecated and removed when there is no longer market demand for them.
Motorola removed 64-bit MUL/DIV as well as the transcendental floating-point instructions (FCOS, FSIN, ...) in the 68060, ARM removed the Jazelle instructions, and AMD removed the 3DNow! instructions.
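Which is also why portable code tends to probe at runtime rather than assume an extension exists. A sketch using GCC/Clang's __builtin_cpu_supports (the two validator functions here are hypothetical placeholders):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical validators: one using AVX2, one portable. */
    bool validate_utf8_simd(const char *buf, size_t len);
    bool validate_utf8_scalar(const char *buf, size_t len);

    bool validate_utf8(const char *buf, size_t len) {
        /* __builtin_cpu_supports consults CPUID (GCC/Clang, x86 only);
           fall back to the portable path if the extension is absent. */
        if (__builtin_cpu_supports("avx2"))
            return validate_utf8_simd(buf, len);
        return validate_utf8_scalar(buf, len);
    }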
Oh duh, thanks. (Checking less than zero, read it wrong.)
I think it would be faster to OR all of the string's bytes together and only check the high (8th) bit at the end, though. On Skylake that would cut it to 0.33 cycles per 16 bytes (HSW: 1 per 16).
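Something like this, I assume (an SSE2 sketch of the idea, not anyone's actual code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 */

    /* OR all 16-byte blocks together and only test the high bits once at
       the end; no branch per block, so the loop is pure POR throughput. */
    static bool is_ascii_sse2(const uint8_t *buf, size_t len) {
        __m128i acc = _mm_setzero_si128();
        size_t i = 0;
        for (; i + 16 <= len; i += 16)
            acc = _mm_or_si128(acc, _mm_loadu_si128((const __m128i *)(buf + i)));
        uint8_t tail = 0;
        for (; i < len; i++)          /* scalar tail */
            tail |= buf[i];
        /* movemask extracts the top bit of each byte; nonzero => non-ASCII */
        return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
    }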
Depends on your input. If non-ASCII strings are frequent and likely to contain a non-ASCII character fairly close to the start of the string, then it makes sense to short circuit.
Fast checking is really useful in things like HTTP/SIP parsing. Rust should expose such a function as well, seeing as its strings must be UTF-8 validated. Though it's even faster if you can just avoid UTF-8 strings and work only on a few known ASCII bytes, that means you might push garbage further down the line.
I meant Rust should have a SIMD optimised version that assumes mostly ASCII. I'm guessing there is a trade-off involved depending on the content of the string.
The linked implementation assumes mostly ASCII. It doesn't use SIMD. SIMD in Rust is a work in progress - the next version (1.27) will stabilize x86-64-specific SIMD intrinsics. There's an rfc (https://github.com/rust-lang/rfcs/pull/2366) for portable SIMD types.
SIMD support in Rust was only recently accepted and is still being implemented, so this currently relies on the auto-vectorisation abilities of the compiler (it might get revisited soon-ish, I guess).
As for the assumption of mostly-ascii, the validation function has a "striding" fast path for ascii which checks 2 words at a time (so 128 bits per iteration on 64b platforms) until it finds a non-ascii byte: https://doc.rust-lang.org/src/core/str/mod.rs.html#1541
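Transcribed to C, the idea is roughly this (a sketch of the technique, not the actual Rust source): read two machine words per iteration and bail out to per-byte validation as soon as any high bit is set.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* High bit of every byte in a 64-bit word. */
    #define HIGH_BITS UINT64_C(0x8080808080808080)

    /* Returns the number of leading bytes that are plain ASCII, scanning
       two words (16 bytes on 64-bit targets) per iteration. The caller
       would continue with full UTF-8 validation from the returned offset. */
    static size_t ascii_prefix_len(const uint8_t *buf, size_t len) {
        size_t i = 0;
        while (i + 2 * sizeof(uint64_t) <= len) {
            uint64_t a, b;
            memcpy(&a, buf + i, sizeof a);                     /* unaligned-safe */
            memcpy(&b, buf + i + sizeof(uint64_t), sizeof b);
            if ((a | b) & HIGH_BITS)
                break;               /* non-ASCII somewhere in this stride */
            i += 2 * sizeof(uint64_t);
        }
        while (i < len && buf[i] < 0x80)
            i++;
        return i;
    }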
I have a vague feeling there’s an even faster path for probably-ASCII out there, but I can’t immediately recall where and am going to bed. Doubtless someone else will answer before I get up.
The core team will be amenable to replacing this algorithm with something faster presuming it’s still correct.
There is probably a trade-off depending on the content of the string, right? So the API probably needs a general-purpose and a "this should be all ASCII" version?
I don't know if it really makes a lot of sense to have such a specialized version of a method in the standard library. Effectively, from_str and from_str_mostlyascii would be functionally identical except for a small performance difference depending on the encoding of the string data, which wouldn't matter in 99% of programs.
Having that as a 3rd party crate would make complete sense however.
If there's a canonical, obvious, and low maintenance way to do something that there's a common need for, then there's no reason it shouldn't be in std. If it needs to be hacked at and redesigned every few years, then yeah, leave it out.
That's not really the Rust way, for instance things like timekeeping and random number generation are delegated to external crates. Given how trivial it is to add dependencies thanks to cargo it's not really an issue in practice, even if I do think that they go a bit overboard at times (time really ought to be in std IMO, especially since there's already a rather useless std::time module).
There are actually a few different parts to the code, and they work somewhat independently.
The first is to classify each byte by type, using the highest 4-5 bits. The types are:
* ASCII
* continuation
* initial byte of a 2-byte, 3-byte, or 4-byte sequence
Given these types, the multibyte sequences are checked for proper lengths, ie each length indicator has to be followed by that exact number of continuation bytes until the next non-continuation. We do this by putting the following byte count where each initial is found, zeroing the rest, and carrying counts right with shift-and-subtract, saturated to 0. This just creates a descending sequence after each initial, eg 4 -> 3,2,1, which should reach zero right at the next initial or "first" byte (ascii or multi). The vector of nonzero following bytes xored with the vector of first bytes should then be all 1's, ie there should be no gaps or overlaps.
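The same invariant, unrolled into a plain scalar loop just to make it concrete (this only illustrates what the check enforces; the real code computes it 16 bytes at a time with shifts and saturating subtracts):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 2/3/4 for lead bytes, 1 for ASCII, 0 for continuation bytes. */
    static int initial_count(uint8_t b) {
        if (b < 0x80) return 1;          /* ASCII                   */
        if (b < 0xC0) return 0;          /* continuation (10xxxxxx) */
        if (b < 0xE0) return 2;          /* 110xxxxx                */
        if (b < 0xF0) return 3;          /* 1110xxxx                */
        return 4;                        /* 11110xxx                */
    }

    static bool lengths_ok(const uint8_t *buf, size_t len) {
        int carry = 0;                   /* count carried from the previous byte */
        for (size_t i = 0; i < len; i++) {
            int init = initial_count(buf[i]);
            if (carry > 1) {
                /* Still inside a sequence: must be a continuation byte. */
                if (init != 0) return false;   /* gap: sequence cut short     */
                carry -= 1;
            } else {
                /* Previous sequence just ended: must start a new one. */
                if (init == 0) return false;   /* overlap: stray continuation */
                carry = init;
            }
        }
        return carry <= 1;               /* must not end mid-sequence */
    }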
The next task is to check for overlongs. These are just bitfields in the first or second bytes which should be nonzero, and they can be checked at least two ways. One is masking and checking for any nonzero bits, based on a lookup table of masks. Another is comparison, based on the idea that any 1 bits in the bitfield of interest will make the byte larger than all 0's (the higher-order bits are fixed). Both of these methods assign checks on a per-length basis, using a lookup table from the highest 4 bits. Longer sequences also check the second byte in similar fashion.
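In scalar form, the comparison flavour of that check, dispatched per sequence length, looks something like this (again just a sketch of the rule being enforced, not the vector code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if the sequence starting with b0 (second byte b1) is
       not an overlong encoding. Other error classes (bad continuations,
       surrogates, > U+10FFFF) are handled by the other checks. */
    static bool not_overlong(uint8_t b0, uint8_t b1) {
        if (b0 < 0xC0) return true;                      /* ASCII / continuation */
        if (b0 < 0xE0) return b0 >= 0xC2;                /* 2-byte: C0/C1 overlong */
        if (b0 < 0xF0) return b0 != 0xE0 || b1 >= 0xA0;  /* 3-byte */
        return b0 != 0xF0 || b1 >= 0x90;                 /* 4-byte */
    }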
In the end we basically have bitmaps of each type, and we check that they don't intersect and that their union covers the range. The code is a bit complicated by carrying over the previous range so that sequences that span registers will still get checked.
I was too lazy to look up what these sequences represented, but I did a quick test with the Euro symbol example from Wikipedia, and it did indeed reject the overlong sequence.
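For reference, the bytes involved: the correct three-byte encoding of the euro sign U+20AC is E2 82 AC, and the overlong four-byte form from the Wikipedia example is F0 82 82 AC. A minimal harness for that kind of test (validate_utf8() is a hypothetical stand-in for whichever validator is being checked):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical entry point; substitute the validator under test. */
    bool validate_utf8(const uint8_t *buf, size_t len);

    int main(void) {
        const uint8_t correct[]  = { 0xE2, 0x82, 0xAC };       /* U+20AC, valid    */
        const uint8_t overlong[] = { 0xF0, 0x82, 0x82, 0xAC }; /* overlong: reject */
        printf("correct : %d\n", validate_utf8(correct,  sizeof correct));   /* expect 1 */
        printf("overlong: %d\n", validate_utf8(overlong, sizeof overlong));  /* expect 0 */
        return 0;
    }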
> Should strings be represented as codepoint arrays in memory?
No. I think the implementor of Factor's unicode support originally did that but it turns out to not be useful:
* it blows up memory usage for ASCII and BMP (4 bytes per codepoint versus 1~3)
* this also has impact on CPU caches, lowering the average amount of data you can fit in your cache and work on
* it requires a complete conversion of incoming ASCII and UTF-8 data (which only gets more and more common as time goes on) rather than just a validation
* and because Unicode itself is variable-length (combining codepoints) it's not actually helpful when you're trying to properly manipulate unicode data
The only "advantage" of representing strings as codepoint arrays is that you get O(1) access to codepoints which is a terrible idea you should not encourage.
UTF-32 internal encoding makes some manipulations very slightly easier, but not enough to matter in the long run, and it encourages bad habits. If you don't need the O(1) access thing for backwards compatibility reasons, don't bother.
To limit combinatorial explosion. Without combining codepoints, you need a version of each base symbol for each combination of modifiers. And while Latin scripts usually limit themselves to one diacritic per base[0], other scripts (Hebrew or Arabic come to mind) can pile up multiple diacritics.
[0] not that that's always the case; the Navajo alphabet can have both an ogonek and an acute on the same Latin base character
That would require an array of 32 bit integers (basically UTF-32). It wastes a lot of space for most characters and doesn't help that much with indexing since user perceived characters can still be composed of many codepoints. Different languages take different approaches.
Python 3.3+, I believe, looks at the string and chooses ASCII/UCS-2/UCS-4 based on the character that needs the most bytes [1]. Elixir uses UTF-8 for strings (but you can get the codepoints easily), whereas Erlang uses the list-of-Unicode-ints style.
If by "arrays of codepoints" you mean "each element of the array is a single codepoint", then you get O(1) indexing regarding codepoints, but may use up to 4x the memory as the variable-length encoding.
People routinely read text files larger than 1 KB into memory. HTTP responses are often larger than 1 KB. Serialized data (JSON, XML, etc.) is often larger than 1 KB.
I’m currently working with a 2 MB string, representing 150k words, with a structure that is a packed trie stored as an array, for fast access to Unicode scalars in Swift.
Yes, it gives you flexibility in using UTF-8 and UTF-16 (which, despite being a shitty format, you WILL have to support), and overlong sequences also don't matter.
How exactly do you use a fraction of a cycle? I would assume that it would be fractional on average, but then again I can't exactly speak to how instructions map to cycles and whether it's solely an integer mapping.
> I can't exactly speak to how instructions map to cycles and whether it's solely an integer mapping
1. Standard ALUs on modern processors handle 64 bits at a time. So right there, you're 8x faster on a per-byte basis.
2. He's using vectorized operations, so he can work with 128-bits, 256-bits (or potentially even 512-bits on high-end processors like Skylake-X). So 16x, 32x, or 64x at a time per operation.
3. Modern processors are super-scalar with multiple ALUs and multiple execution ports. I forget what the precise theoretical limit is, but Intel AND AMD machines can execute something like 4 or 5 operations per clock, depending on circumstances.
That assumes that all operations have been decoded into micro-ops (uops), that they fit inside the uop cache (think of the uop cache as an L0 cache: beyond even the L1 cache), that they line up perfectly with the available execution ports, that your data has no dependencies, and a whole host of other conditions. But it's theoretically possible.
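Back of the envelope, that's where a figure like "0.33 cycles per 16 bytes" comes from: if the vector OR itself is the only bottleneck (Skylake can issue three of them per cycle), then

    3 ORs/cycle x 16 bytes/OR = 48 bytes/cycle
    => 16 bytes cost 16/48 ≈ 0.33 cycles, i.e. about 0.02 cycles per byte

So "a fraction of a cycle" is an amortized throughput number, not a single instruction finishing mid-cycle.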
---------
In practice, your code will be limited by memory (even L1 cache is slower than the CPU), by decoding speed (not everything fits in the uop cache, and only loops really benefit from it), and by dependencies (a = x + 1; a = a + 2: the 2nd instruction depends on the 1st one executing first, so the two can't be done in parallel / superscalar).
The CPU is pretty good at trying to figure out how to optimally reorder and out-of-order execute your code. But that means that predicting the performance of the CPU is incredibly difficult: you have to keep in mind all of the parts of the CPU while thinking of the assembly code.
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
I happened to come across this decoder again recently in Niels Lohmann's JSON library for C++:
https://github.com/nlohmann/json
I see that this is mentioned in a previous post:
https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-...
One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.