CAT is indeed a good thing to look at. But there are some important caveats: 1) unless you have a very small number of cores, it's not possible to reserve a cache slice for every running program (some slices are shared for things like DDIO); 2) it's still not possible to pin specific data in the cache, because any collision will evict it; 3) the slices are kinda big, so it's hard to be properly fine-grained.
Basically, CAT just prevents other processes from stealing all the cache. It does that by reserving ways (in the cache-associativity sense of the word).
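To make "reserving ways" concrete, here's a toy Python sketch (assuming the 11-way L3 of the SKX/CLX parts discussed downthread; the particular 2/9 split is made up):

    # CAT capacity bitmasks select ways, not addresses: each class of
    # service gets a contiguous run of bits, one bit per way.
    WAYS = 11                  # e.g. an SKX/CLX L3 (see downthread)
    full = (1 << WAYS) - 1     # 0x7ff: all 11 ways
    noisy = 0b00000000011      # 2 ways left for everything else
    quiet = full ^ noisy       # 0x7fc: 9 ways held back for the hot path
    assert noisy & quiet == 0  # disjoint masks -> truly exclusive ways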
1) Fully agreed, but most HFT apps (with the exception of really simple ones like market data feed handlers, which can easily fit their working set into L2 anyway) will be the only thing running on a host.
2) Mutual cache eviction via hash collisions is solvable with a number of tricks (although those methods are not easy and often wasteful). The "DDIO slice" issue used to be a problem back when Intel used a ring topology for the LLC; these days it's built as a mesh, which minimizes the effect.
3) CAT doesn't recognize threads or processes. A COS (class of service) assigns cache ways to CPU cores.
Recent microarchitectures like SKX or CLX have an 11-way L3. What often happens is that 1-2 ways get assigned to cpu0 for non-latency-critical workloads, while the rest go to latency-sensitive, isolated cores, usually running a single user-space thread each.
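A minimal sketch of that layout via the Linux resctrl filesystem (the group name, core list, and exact masks here are hypothetical; an 11-way L3 means an 11-bit capacity mask, and masks must be contiguous):

    import os

    RESCTRL = "/sys/fs/resctrl"  # requires: mount -t resctrl resctrl /sys/fs/resctrl

    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    # Shrink the default group to the bottom 2 ways (bits 0-1) -- this is
    # what cpu0 and every unassigned core get. "0" is cache domain (socket) 0.
    write(os.path.join(RESCTRL, "schemata"), "L3:0=003\n")

    # Hypothetical group for the latency-critical cores: the remaining 9 ways.
    grp = os.path.join(RESCTRL, "hft")
    os.makedirs(grp, exist_ok=True)
    write(os.path.join(grp, "schemata"), "L3:0=7fc\n")  # bits 2-10
    write(os.path.join(grp, "cpus_list"), "2-10\n")     # hypothetical isolated cores

On a multi-socket box each schemata line carries one mask per cache domain (e.g. L3:0=7fc;1=7fc).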
2) Agreed about the solvability and difficulty of avoiding cache collisions.
DDIO must write its data somewhere in the L3 cache, and it ends up in the shareable slice. So either you're okay with sharing your cache, or you can't use those slices if you want exclusive access for your processes. That was my point.
3) CAT does not recognize processes, but resctrl does. Feels like we're nitpicking here...
On your last point: agreed, that gives you 9-ish usable ways, which is not very much depending on the number of cores. That was the point I was trying to make.
3) resctrl just uses COS under the hood, so the same limitation applies.
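For what it's worth, both halves of this are visible in the resctrl filesystem; a hedged sketch (the group name and PID are hypothetical, the paths are the standard resctrl ones):

    # The hardware limit: how many classes of service (CLOSIDs) L3 CAT
    # exposes. Every resctrl group must map onto one of these.
    with open("/sys/fs/resctrl/info/L3/num_closids") as f:
        print("hardware COS available:", f.read().strip())

    # The process tracking: attaching a PID to a group makes the kernel
    # load that group's CLOSID into IA32_PQR_ASSOC when its threads run.
    with open("/sys/fs/resctrl/hft/tasks", "w") as f:  # hypothetical group
        f.write("12345\n")                             # hypothetical PID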
> Yeah, that gives you 9-ish usable ways, which is not very much. Again, that was my point
That's 9 ways you can use exclusively for your latency-sensitive workloads, which is MUCH better than letting all of that LLC get thrashed by non-critical processes/threads. Typically, after applying such partitioning, we've observed a 15-20% speed-up in our apps.
In my area, shaving off a few micros that way is a huge deal, and definitely worth the couple of minutes it takes to implement.