Can you elaborate on the text search example? I genuinely have problems trying t...

pbsd · on Nov 19, 2014

Think regular expression JIT compilation, for instance, in the spirit of http://swtch.com/~rsc/regexp/regexp-x86.c.txt

vardump · on Nov 19, 2014

Regexp JIT was already mentioned. It's notable that regexp JIT technique is being used by practically all high performance regexp libraries already. Such as these:

* Perl and PCRE: http://sljit.sourceforge.net/pcre.html

* Java: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pa...

* .NET CLR: http://msdn.microsoft.com/en-us/library/gg578045(v=vs.110).a...

And a lot of others.

For text search in particular, you could also take advantage of SSE 4.2 string instructions, but still run on older CPUs. http://en.wikipedia.org/wiki/SSE4#SSE4.2

Similar story with AVX2, you have 256-bits wide registers. Soon (Intel Skylake in 2015) there will be AVX512, 512-bits wide registers with byte level processing instructions. Being able to process 64 bytes in one instruction, with ILP [1] potential of two instructions per clock cycle or more, can provide an order of magnitude performance advantage.

You can also optimize away unnecessary code for that particular search. No need to have those ignore case, etc. flags. Or for example you could to not include Unicode related logic, if you know ahead of time that normalization etc. won't be necessary for this particular text search case. This can have particularly high savings if you're branch predictor buffer limited already, by reducing the number of branches [2].

If the memory access patterns are not sequential (= predictable by the CPU), you could insert prefetch instructions at CPU model appropriate places to ensure data is going to be in L1 cache in time before use.

If you know the data is going to be searched only once, you could give a hint to the CPU that you're streaming the data. CPU then can optimize memory access patterns and minimize L1/L2 cache evictions, because it knows this data should not be stored in cache. In other words, non-temporal (= streaming) memory loads and stores. Like http://www.felixcloutier.com/x86/MOVNTDQA.html.

You could do profile guided optimization at runtime. Or just try random variations and pick the fastest for that particular combination of parameters and hardware without recompiling anything. Different CPU models have a lot of variation [3].

And a lot of other things. If the data sets are large, ability to adapt to a particular problem at runtime can have a huge payoff.

[1]: Instruction level parallelism.

[2]: A branch can mean if-statements, ?-ternary operator, boolean logic ("||", "&&" etc.), switch statements, and so on. Every branch in currently executing loop can potentially need an entry in the CPU branch predictor. If branch predictor buffer entries run out, this might cause CPU to mispredict that branch every time. The cost of a mispredicted branch is very high. On Intel Ivy Bridge processor, a single branch misprediction costs 14 clock cycles or the time to theoretically execute up to 4*14=56 instructions, practically about 15-30! Slightly related links, LLVM CPU scheduler definitions:

http://llvm.org/klaus/llvm/blob/release_33/lib/Target/X86/X8...

[3]: About CPU model variation, see for example this: http://www.agner.org/optimize/blog/read.php?i=285