Escaping strings faster with AVX-512 (lemire.me)
124 points by ibobev on Sept 14, 2022 | 60 comments


It is really unfortunate that AVX-512 has been such a mess when it comes to support from Intel on their newer hardware platforms. I think Dr. Lemire has gone quite a ways in demonstrating that there's a lot of potential for software to see performance gains with this type of platform, but I think we'll need further improvements in developer experience before it's more commonplace.

The other way to look at it is that quite frankly most business services just don't hit a performance barrier where you're sitting and waiting on a string operation. You're usually waiting on some sort of I/O.


The Ryzen 7000 cores will support AVX512 so soon enough consumers will finally be able to properly benefit from its broad availability.

I/O wait is difficult to model. On an NVMe drive over PCIe 5, I doubt you'll find yourself waiting on I/O all that often, especially if you're reading a lot of data. Decompression tools, for example, are often written in such a way as to keep the CPU pipeline fed. They still produce lots of output, but while some other core deals with that, the performance-optimized core can keep crunching numbers.

You're probably not storing hundreds of megabytes of JSON because there are much better formats for such data sets, but text algorithms like these make an impact regardless.

Loading CSV files, for example, can be a huge CPU bottleneck. I've had to deal with CSV datasets of over a gigabyte and all libraries were kind of terrible at it. The data source also had a JSON version that contained fewer fields for some reason, I guess to save space? Either way, an optimized escape code path would've probably saved seconds on each run of the data analysis, and more if the string split had been better optimized as well. The data was already loaded into memory beforehand, so the only I/O would've been exchanging data between RAM and the CPU cache.


>The Ryzen 7000 cores will support AVX512

Anandtech:

"Critically, however, AMD is diverging from Intel in one important aspect: whereas Intel built a true, 512-bit wide SIMD machine for executing AVX-512 instructions, AMD did not. Instead, AMD will be executing these instructions over two cycles. This means AMD’s implementation still benefits from all of the additional instructions, register file space, and other technical improvements that came as part of AVX-512, but they won’t gain the innate doubling in SIMD throughput."


That comment from Anandtech contains guesses that are unlikely to be true, e.g. "executing these instructions over two cycles", and it shows ignorance about how AVX-512 is implemented in the Intel CPUs.

What is really known from the AMD disclosure is that Zen 4 has the same execution resources as Zen 3; only the load/store units are improved, with their bandwidth perhaps increased to match that of all recent Intel CPUs.

Zen 3 has 4 AVX pipelines which can execute four 256-bit instructions per cycle, but some more complex instructions, e.g. multiplications or FMAs, can be executed by at most 2 pipelines.

There are 2 possible ways to execute a 512-bit instruction in such pipelines, either the instruction can occupy a pipeline for 2 cycles, or it can occupy 2 pipelines for 1 cycle.

The throughput is the same, but the latency of the operation is different. The simpler and better way is to occupy 2 pipelines for 1 cycle. This is also how most AVX-512 instructions are implemented in all Intel CPUs. The exceptions are FMA/FMUL, for which a few models of Intel CPUs have a second 512-bit pipeline, and some other instructions for which one of the three 256-bit pipelines has been extended by 256 bits, allowing the Intel CPU to do two 512-bit instructions per cycle even though it can do only three 256-bit instructions per cycle.

The Intel CPUs can do two 512-bit instructions per cycle, except for a few instructions like FMA/FMUL, which can be done only one per cycle on the cheaper CPUs but two per cycle on most Xeon Gold, all Xeon Platinum, and the Xeon W models with AVX-512.

The AMD Zen 4 is certain to have the same 512-bit throughput per clock cycle (two 512-bit instructions per cycle, of which only 1 can be FMA/FMUL) as all Intel CPUs, with the exception of the models with two 512-bit FMA units, which will have double throughput only for FMA/FMUL.


Thank you for the explanation.


But isn't Intel AVX-512 such a power hog that it will clock down the CPU? So the throughput loss might not be that bad if AMD can maintain the clock speed.


> I've had to deal with CSV datasets of over a gigabyte and all libraries were kind of terrible at it

Have you tried this one? https://lib.rs/crates/csvroll

It calls AVX2 intrinsics directly for SIMD (unfortunately - it's an old library in Rust time; Rust nowadays has great support for portable SIMD)


Portable SIMD is not that useful for string parsing as far as I know - tbl is different from pshufb in ways that usually make tbl cost an additional operation for text parsing, and moving things across lanes in AVX2 or AVX-512 can be fiddly (unless you just want vcompress or a masked store, in which case AVX-512 has you covered)

Edit: After looking over the docs, it's not clear that Rust's portable SIMD exposes tbl or pshufb *at all*, which would basically prevent it from being used for string parsing.
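
To illustrate why pshufb/tbl matter so much for parsing, here's a sketch of the usual nibble-lookup trick (my own illustration, SSE for concreteness, not code from any particular parser): two 16-entry tables classify 16 bytes at once, and a byte belongs to a class only when the lookups on its low and high nibble agree.

    #include <tmmintrin.h> // SSSE3, for _mm_shuffle_epi8 (pshufb)
    #include <stdint.h>

    // Bit 0 marks '"' (0x22), bit 1 marks '\' (0x5C); everything else maps
    // to 0. The AND of the two lookups is the per-byte class.
    static inline __m128i classify_quote_backslash(__m128i v) {
        const __m128i lo_tbl = _mm_setr_epi8(0, 0, 1, 0, 0, 0, 0, 0,
                                             0, 0, 0, 0, 2, 0, 0, 0);
        const __m128i hi_tbl = _mm_setr_epi8(0, 0, 1, 0, 0, 2, 0, 0,
                                             0, 0, 0, 0, 0, 0, 0, 0);
        const __m128i nib = _mm_set1_epi8(0x0F);
        __m128i lo = _mm_shuffle_epi8(lo_tbl, _mm_and_si128(v, nib));
        __m128i hi = _mm_shuffle_epi8(hi_tbl,
                         _mm_and_si128(_mm_srli_epi16(v, 4), nib));
        return _mm_and_si128(lo, hi); // 1 where '"', 2 where '\', 0 elsewhere
    }

NEON's tbl can do the same table lookup, but as mentioned above it tends to need an extra operation to get the same effect.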


Parsing JSON takes more time than IO


All the object allocation, pointer chasing, and general indirection is what makes JSON parsing slow, particularly building up tree structures. Bump allocators and the like don't help much, at least for the parsing--most malloc implementations already do fast, O(1) allocation for small objects, and have for decades. (Arenas are useful for making deallocation O(1), though.)

If you care about raw JSON throughput, use Ragel or something similar to build a state machine that directly parses JSON into a flat, native data structure. Now you have zero allocations without even needing to shim malloc/free. AVX-512 would still be at least as useful, but it's a much more difficult problem to leverage SIMD in a parser generator than in a simple string escape routine or behind a more abstract interface like a regex library.
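
To make the "flat, native data structure" point concrete, here's a hand-rolled sketch (not Ragel output, and the grammar is deliberately tiny: arrays of non-negative integers, no whitespace) that parses straight into a caller-provided buffer with zero allocations:

    #include <stddef.h>

    // Parses e.g. "[10,2,33]" into `out`; returns the count, or -1 on
    // malformed input or overflow of the caller's buffer.
    long parse_int_array(const char *p, long *out, size_t cap) {
        enum { EXPECT_FIRST, EXPECT_VALUE, IN_NUM } state = EXPECT_FIRST;
        size_t n = 0;
        long cur = 0;
        if (*p++ != '[') return -1;
        for (;; p++) {
            char c = *p;
            switch (state) {
            case EXPECT_FIRST:                 // value, or ']' (empty array)
                if (c == ']') return (long)n;
                /* fall through */
            case EXPECT_VALUE:                 // a digit must follow ','
                if (c < '0' || c > '9') return -1;
                cur = c - '0';
                state = IN_NUM;
                break;
            case IN_NUM:                       // accumulate, then dispatch
                if (c >= '0' && c <= '9') { cur = cur * 10 + (c - '0'); break; }
                if (n == cap) return -1;
                out[n++] = cur;
                if (c == ',') { state = EXPECT_VALUE; break; }
                if (c == ']') return (long)n;
                return -1;
            }
        }
    }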

Quite a few language environments these days provide in-language JSON deserializers, but they're still significantly slower than they could be even when they deserialize to flat data structures. The macro languages and internal compiler intrinsics used to accomplish this are the worst possible environments for development. Lisp-like languages aren't really an exception as they tend to trade easier in-language transforms for a steeper climb when it comes to generating optimized native code for the transform.


I didn't care about speed. I was merely saying that "waiting on I/O" is false, and I didn't want to go as far as saying it hasn't been true for 15+ years


Is there any serialization format that is more "friendly" to memory allocators while still being human-readable and -writable?


It's not about the format per se, but more about the fact that you're parsing an unknown/fully general structure in that format. APIs like Rust's serde can help to avoid excessive allocation when you have JSON in a known schema.


If a file is so large that the processor spends a lot of time parsing, then the file is too large to be conveniently edited by a person.

For large files it is best to use a binary format that can be read quickly without parsing or allocation. https://rkyv.org/ is an example.

Being 'friendly' is not why JSON is popular. JSON is popular because the decoder is included in the web browser.


Thanks, and good point about dataset size.

I appreciate the thorough "shootout" benchmarks provided by the authors as well!


Text isn't readable without software. Why should we expect binary data formats to be?


simdjson directly parses JSON into flat simdjson tape, which you can just use directly if you care to
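
For reference, a minimal usage sketch (assuming the DOM front end with exceptions enabled; the tape sits underneath dom::element):

    #include "simdjson.h"
    #include <iostream>
    using namespace simdjson;

    int main() {
        dom::parser parser;
        dom::element doc = parser.parse(R"({"x":1.5,"tags":["a","b"]})"_padded);
        double x = doc["x"];           // reads straight off the tape
        dom::array tags = doc["tags"]; // still just a view into the tape
        std::cout << x << " " << tags.size() << "\n";
        for (dom::element tag : tags)  // no per-node tree is materialized
            std::cout << tag << "\n";
    }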


This claims 3 GB/s, saturating a fully queued NVMe read. It's not clear if they construct data or just parse it, but you have to construct data with any format. Other libs also do a great job. Also, I/O latency should be accounted for, because average JSONs usually fit into one SSD block, afaiu.

https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-t...

(Edit: just realized this is the same site as subj, heh)


Yep. I think the fastest NVMes were a tad above 3 GB/s, and network I/O is certainly faster. It always irks me when something that stopped being true 15-20 years ago is cited today


"Are you actually IO bound? I seriously doubt it."

https://twitter.com/rygorous/status/1423778960739950595


IO bound doesn't necessarily mean waiting for GBs to transfer from the disk. It can mean reading meeelions of tiny files and seeking back and forth randomly. Doesn't matter if the files are in the cache or not. When someone says IO bound, it often means CPU bound but in the kernel or in some low level code.

Most apps and games could open instantly, unless you are doing something like heavy map generation or hitting the network. But it would require structuring the app specifically for it and there is often no incentive to do so.


> It can mean reading meeelions of tiny files and seeking back and forth randomly.

Tiny files are fine if you fetch them from NVMe and keep the queues fed, i.e. you need to issue multiple concurrent reads, not do it sequentially. Not being on windows helps too.


That seems specific to games.

Business logic is quite a bit more likely to just be waiting on IO, usually fetching from the DB or sending/receiving over the network. This is because there's a lot of business logic where those things are _everything it does_.


I think the point of the twitter rant was addressing a sentiment of "Oh, it's IO bound, guess I can't do anything about it." When there is usually something that can be done about it.


Business logic with tons of branches is indeed not amenable to SIMD, nor is that where the compute is.

In a world with 400 Gb/s NICs, being IO bound takes some doing. How much SW manages that without SIMD, even on multiple cores?


> In a world with 400 Gb/s NICs, being IO bound takes some doing.

depends on what you call I/O. From the perspective of the caller, that's the entire stack starting at an HTTP library call or SQL driver call. If those can't use the hardware to their fullest (or can't the way you're using them), your process becomes I/O bound from its perspective.


Fair point :) I'm assuming the stack is doing what it can to avoid being the bottleneck. On a system level, I'd still call what you're describing CPU-bound, as in: unable to use all potential IO capacity.


Until adoption of io_uring picks up (looking at you, RHEL), I'm not sure it's even really possible to hit the bandwidth of modern NVMe drives. It takes a lot of parallelism to do, and eventually context switch overhead kills you.
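
To make "a lot of parallelism" concrete, here's a hedged sketch with liburing: submit a batch of reads before reaping any completions, since the queue depth is what keeps an NVMe drive busy (error handling mostly elided; contiguous block offsets assumed for illustration):

    #include <liburing.h>

    // Submit `count` reads of `block` bytes each in one go, then reap them.
    int read_batch(int fd, char **bufs, unsigned count, unsigned block) {
        struct io_uring ring;
        if (io_uring_queue_init(count, &ring, 0) < 0) return -1;
        for (unsigned i = 0; i < count; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], block, (__u64)i * block);
        }
        io_uring_submit(&ring);               // one syscall for all reads
        for (unsigned i = 0; i < count; i++) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
            // ... cqe->res is the byte count (or -errno) for one read ...
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }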


FWIW Windows has had something basically equivalent (IO completion ports) for quite some time. It's indeed helpful; I remember that early SSDs already required high queue depths.


It doesn't matter if you have 400 Tb/s, you still can't get around the speed of light. You're still going to be waiting for a round-trip or a few.


OK, if latency is a concern, is it possible to parallelize requests or increase batch sizes to maximize utilization and throughput?


> In a world with 400 Gb/s NICs, being IO bound takes some doing.

Amen. CPUs stopped improving significantly a decade or more ago. I/O never stopped.


Perhaps something can be memory bound instead of CPU or IO bound?


If you do your string operations faster, you can get more jobs from the "got I/O" state to the "waiting on I/O" state per unit time!


I'm envious of someone whose job allows them to work on micro-optimisation stuff like this. I often find myself sucked into a Compiler Explorer hole playing around with writing C++ code this way and that way to see what changes. I must have spent hours looking at various ways of lowercasing a string.

But it's not really justifiable from my employer's point of view, there are more important things for me to do.

Also, handcrafting SIMD code isn't really viable either since I predict there'll be a point where someone on my team wants to use a Mac, and so those routines will have to be written twice.


ascii tolower is like 5 instructions I think. You're computing this expression: v + ((v >= 'A') & (v <= 'Z') & 32). Unlike many string operations, portable SIMD libraries should be able to do exactly the right thing here, and writing one routine for AVX2 and another for NEON would be pretty easy if you don't want to adopt a portable SIMD library.

Edit: Oh, you can do it in 4 if you use an addition or subtraction by a constant to move the range you're testing for to the edge of the range of the type (in this case, [-128, 127] for i8 since there is no unsigned compare operator). The expression for this approach is v + ((v + 63 < -102) & 32).
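
For the curious, here's the 4-op version as a sketch with AVX2 intrinsics (my own illustration; a NEON twin would be nearly line-for-line):

    #include <immintrin.h>
    #include <stddef.h>

    // In-place ASCII tolower, 32 bytes per iteration, computing
    // v + ((v + 63 < -102) & 32) per byte as described above.
    void ascii_tolower(char *s, size_t n) {
        const __m256i bias = _mm256_set1_epi8(63);
        const __m256i bound = _mm256_set1_epi8(-102);
        const __m256i delta = _mm256_set1_epi8(32);
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
            // shift 'A'..'Z' to the bottom of the signed i8 range...
            __m256i t = _mm256_add_epi8(v, bias);
            // ...so one signed compare tests the whole range
            __m256i is_upper = _mm256_cmpgt_epi8(bound, t);
            __m256i add = _mm256_and_si256(is_upper, delta);
            _mm256_storeu_si256((__m256i *)(s + i), _mm256_add_epi8(v, add));
        }
        for (; i < n; i++) // scalar tail
            if (s[i] >= 'A' && s[i] <= 'Z') s[i] += 32;
    }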


Yes, I'm aware of the math, but there are other aspects like 'does this form of the code help auto-vectorisation?', 'does it work across both gcc and clang?', 'what about optimisation levels', 'does my hand-rolled version beat the compiler's optimisation of the naive code?'. Etc. It's a lot of fun.


Oh. I'm skipping all of the fun by using SIMD intrinsics, I guess.

You could go to https://highload.fun/ to waste many evenings on this sort of thing.


Good news, you can use portable intrinsics (github.com/google/highway) and write your code only once. (Disclosure: I am the main author.)
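
For example, a rough sketch of the tolower discussed above (static-dispatch path only; real code would add foreach_target.h for runtime dispatch):

    #include <cstddef>
    #include <cstdint>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // In-place ASCII tolower written once, compiled for the chosen target.
    void AsciiToLower(uint8_t* s, size_t n) {
        const hn::ScalableTag<uint8_t> d;
        const size_t N = hn::Lanes(d);
        size_t i = 0;
        for (; i + N <= n; i += N) {
            auto v = hn::LoadU(d, s + i);
            auto is_upper = hn::And(hn::Ge(v, hn::Set(d, uint8_t('A'))),
                                    hn::Le(v, hn::Set(d, uint8_t('Z'))));
            auto lowered = hn::Add(v, hn::Set(d, uint8_t(32)));
            hn::StoreU(hn::IfThenElse(is_upper, lowered, v), d, s + i);
        }
        for (; i < n; ++i) // scalar tail
            if (s[i] >= 'A' && s[i] <= 'Z') s[i] += 32;
    }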


There's a copy of the loop used in the escape function inside the avx512_escape function [0]. Is it needed or just a copy-and-paste mistake? (I know nothing about vector instructions)

0: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...


It is there to process the last <32 elements. The vectorized loop processes up to 32 elements per iteration. The iteration does not happen if there are fewer than 32 elements left, because it wants to load 32 bytes as input. This is very typical in vectorized loops: process N elements per iteration, with a second loop that handles the tail of <N elements.


Though I am curious about why he didn’t switch to using predicated instructions for the tail loop. I’ve switched to that pattern when writing AVX512.
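
For reference, the pattern looks something like this (my own sketch, not the article's code; needs AVX-512BW and BMI2):

    #include <immintrin.h>
    #include <stddef.h>

    // Count '"' bytes; the <64-byte tail is one masked iteration
    // instead of a scalar loop.
    size_t count_quotes(const char *src, size_t n) {
        const __m512i quote = _mm512_set1_epi8('"');
        size_t count = 0, i = 0;
        for (; i + 64 <= n; i += 64) {
            __m512i v = _mm512_loadu_si512(src + i);
            count += _mm_popcnt_u64(_mm512_cmpeq_epi8_mask(v, quote));
        }
        if (i < n) {
            __mmask64 m = _bzhi_u64(~0ULL, n - i);           // low n-i bits set
            __m512i v = _mm512_maskz_loadu_epi8(m, src + i); // rest read as 0
            count += _mm_popcnt_u64(_mm512_cmpeq_epi8_mask(v, quote));
        }
        return count;
    }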


'Tail handling' in general is an annoying aspect of simd. Masks are great, but no panacea--in particular, if you unroll, then you cannot take care of the tail with a single masked instruction. There are various solutions to this. I favour overlapping accesses, where that's feasible (following a great deal of evangelism from mateusz guzik); a colleague uses a variant of duff's device; you can also just generate multiple masks.

I would expect the linked code is just intended as a quick poc, so it does not bother to be optimal.
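
Here's a sketch of the overlapping variant on an in-place map (replacing tabs with spaces), where redoing bytes is harmless:

    #include <immintrin.h>
    #include <stddef.h>

    // The main loop does full 64-byte steps; the tail is one more full-width
    // iteration over the *last* 64 bytes, overlapping already-done work.
    void tabs_to_spaces(char *s, size_t n) {
        const __m512i tab = _mm512_set1_epi8('\t');
        const __m512i sp = _mm512_set1_epi8(' ');
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            __m512i v = _mm512_loadu_si512(s + i);
            _mm512_mask_storeu_epi8(s + i, _mm512_cmpeq_epi8_mask(v, tab), sp);
        }
        if (i < n) {
            if (n >= 64) { // overlapped final vector
                __m512i v = _mm512_loadu_si512(s + n - 64);
                _mm512_mask_storeu_epi8(s + n - 64,
                                        _mm512_cmpeq_epi8_mask(v, tab), sp);
            } else {       // buffer shorter than one vector
                for (; i < n; i++)
                    if (s[i] == '\t') s[i] = ' ';
            }
        }
    }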


:) For anyone interested, here's a brief discussion of tail handling: https://github.com/google/highway#strip-mining-loops

(Overlapping is indeed cool where it works - idempotent operations.)


> idempotent

Roughly speaking, while you do need idempotence for reductions, you do not need it for maps. (Of course, plenty of things don't look quite like reductions or maps--or they look like both at once--and each is unique.)


> 'Tail handling' in general is an annoying aspect of simd.

Only for "classical" SIMD; ARM SVE2 shows a way to solve this issue cleanly. Not sure where you can use SVE2 though..


Not SVE2, but lemire has a post about using SVE on Amazon Graviton 2 processors: https://lemire.me/blog/2022/07/14/filtering-numbers-faster-w...


Correction, Graviton 3.


Thank you!


If you do this, just make sure to offer runtime fallbacks via FMV/etc for those of us with ancient hardware.

https://wiki.debian.org/InstructionSelection


I can't understand Step 4:

    I copy these bytes with the quotes and backslash characters.
Can anybody explain what this means?


Looks like a mistake in the article, it should probably say:

    I compare these bytes with the quotes and backslash characters.
Which would correspond to the two _mm512_cmpeq_epi8_mask calls in the source code.
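
In other words, a rough sketch of that step (not the article's exact code) is: for each block, build one bit per byte and OR the two classes together:

    #include <immintrin.h>
    #include <stdint.h>

    // One bit per input byte: set where the byte needs a backslash.
    static inline uint64_t needs_escape(__m512i in) {
        __mmask64 q = _mm512_cmpeq_epi8_mask(in, _mm512_set1_epi8('"'));
        __mmask64 b = _mm512_cmpeq_epi8_mask(in, _mm512_set1_epi8('\\'));
        return (uint64_t)(q | b);
    }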


Thanks, that makes 100x more sense!


So is CISC becoming great again? Should Apple feel threatened?


I'd argue the cisc vs risc dichotomy hasn't really been true in ages. Among other things, does a cisc-y frontend over risc-ier microcode count as a cisc or a risc cpu? I also don't think Apple's chips are particularly risc-y in the first place. I believe they have plenty of special purpose silicon.


> does a cisc-y frontend over risc-ier microcode count as a cisc or a risc cpu?

I can't believe that some are still asking the question; the answer is in the name: CISC = Complex Instruction Set, RISC = Reduced Instruction Set, so if it has a CISC-y frontend, the whole thing is a CISC CPU.

> I also don't think Apple's chips are particularly risc-y in the first place. I believe they have plenty of special purpose silicon.

I disagree: their CPUs use the ARM-64 ISA, which is quite RISC; it's just that they don't provide only a CPU but a whole system on the chip.


I think you're too hung up on the name. My memory's a little fuzzy, and I'm no expert here, but back in the day the question was one of CPU architecture and the path forward. RISC put forth the idea that reducing the instruction set would open up more room in your silicon/design budget to produce faster CPUs, compared to ever more elaborate instructions that hoped to shrink total instruction counts.

It was always about architecture, not (at least mostly) about interface. I'm not sure how much this dichotomy really plays out in the present; I think the architectural lessons have been learned and applied everywhere.

That's how I was taught it, at least.


The internet is so bad that when I see titles like these I know the article will not be true at all. The great thing is I can see lemire next to the title and I absolutely know not only did he walk the walk but he can write about it in an understandable way. I upvoted before reading and I haven't yet been disappointed. Edit: Done, it's nice and short and I want to reread the code example. I'm not familiar enough with AVX-512


That's a weird way of saying you appreciate Daniel's writing and rigor in programming.


Yeah I know, I'm saying only Daniel (and Carmack) can pull off titles like that without people thinking they're not telling the whole truth. I still remember John's tweet: "I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?". A nerd said "Either he's crazy or this is an unusual situation"; it was neither, and Carmack followed it up with https://superuser.com/a/419167



