Can Wachy be used to trace kernel code as well? And would you need to do anything specific to get that to work? (How would Wachy find the kernel sources?)
I was thinking this would be pretty cool to use for debugging device drivers on embedded devices especially w.r.t latency.
I'm afraid that's not supported today. But it looks like kprobes do support offsets within a function, so it should be possible to get it to work. That does sound pretty cool, please open an issue for this if you're interested! It might be as simple as (effectively) s/uprobe/kprobe/g, in which case I can try to get that working.
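For what it's worth, the kernel's kprobe_events tracefs interface does already accept a symbol-plus-offset, so the plumbing exists on the kernel side. Here's a rough sketch in Rust (not how wachy is wired up internally; the function name "do_sys_open", the offset, and the probe name are placeholders) of registering such a probe:

use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Register a kprobe at an offset inside a kernel function via tracefs.
    // A real tool would resolve the offset from debug info, the way wachy
    // already does for userspace binaries with uprobes.
    let mut events = OpenOptions::new()
        .append(true)
        .open("/sys/kernel/tracing/kprobe_events")?;
    writeln!(events, "p:wachy_test do_sys_open+8")?;

    // Enable the probe; hits then show up in /sys/kernel/tracing/trace.
    std::fs::write("/sys/kernel/tracing/events/kprobes/wachy_test/enable", "1")?;
    Ok(())
}

(Needs root, and the offset has to land on an instruction boundary, which is exactly the part a tool like wachy would handle for you.)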
It does! But you still have to actually "use" it, meaning that it has to appear as a field in the definition, and you have to pass a PhantomData value whenever you create new instances of your data type.
In Haskell, you can omit these fields entirely, and achieve the same thing just by annotating the function.
For example, in Haskell, we can have
data Const a b = Const a
whereas in Rust, it would be:
use std::marker::PhantomData;

struct Const<A, B> {
    konst: A,
    // does not exist at run-time
    discard: PhantomData<B>,
}
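And just to make the "pass a PhantomData value" part concrete, here's a small sketch of what construction looks like, continuing the Rust snippet above (make_const and the u32/String choice are purely illustrative):

fn make_const<A, B>(value: A) -> Const<A, B> {
    Const {
        konst: value,
        // zero-sized, so it costs nothing at run-time, but it still has to be written out
        discard: PhantomData,
    }
}

fn main() {
    // B has to be named explicitly somewhere, since no field value determines it.
    let c: Const<u32, String> = make_const(7);
    println!("{}", c.konst);
}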
Type parameter variance is inferred from usage (e.g. covariant for normal fields, contravariant for function arguments) and without a usage there's no way to infer it.
"AMD’s primary advertised improvement here is the use of a TAGE predictor, although it is only used for non-L1 fetches. This might not sound too impressive: AMD is still using a hashed perceptron prefetch engine for L1 fetches, which is going to be as many fetches as possible, but the TAGE L2 branch predictor uses additional tagging to enable longer branch histories for better prediction pathways. This becomes more important for the L2 prefetches and beyond, with the hashed perceptron preferred for short prefetches in the L1 based on power."
I found this paragraph confusing. Is it talking about data prefetchers (which would make sense because of the mention of short prefetches) or branch predictors (which would make sense because of the mention of TAGE and perceptron)?
A little of both. My understanding of the above paragraph is that the L1 predictor is trying to predict which code-containing cache lines need to stay loaded in L1, and which can be released to L2, by determining which branches from L1 cache-lines to L1 cache-lines are likely to be taken in the near future. Since L1 cache lines are so small, the types of jumps that can even be analyzed successfully have very short jump distances—i.e. either jumps within the same code cache-line, or to its immediate neighbours. The L1 predictor doesn’t bother to guess the behaviour of jumps that would move the code-pointer more than one full cache-line in distance.
Or, to put that another way, this reads to me like the probabilistic equivalent of a compiler doing dead code elimination on unconnected basic blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when no recently-visited L1 cache line branch-predicts into them.
I was also confused by this, but my reading is that it is entirely about branch prediction, not about caching. In that context, L1 and L2 simply refer to "first" and "second" level branch prediction strategies, and are not related to the L1 and L2 caches (in the same way that the L1 and L2 BTB and the L1 and L2 TLB are not related to the L1 and L2 caches).
The way this works is that there is a fast predictor (L1) that can make a prediction every cycle, or at worst every two cycles, which initially steers the front end. At the same time, the slow (L2) predictor is also working on a prediction, but it takes longer: either it is throughput-limited (e.g., one prediction every 4 cycles) or it has a long latency (e.g., it takes 4 cycles from the last update to produce a new one). If the slow predictor ends up disagreeing with the fast one, the front end is "re-steered", i.e., repointed to the new path predicted by the slow predictor.
This re-steer costs only a few cycles, so it is much better than a full branch misprediction: the wrong-path instructions haven't started executing yet, so it is possible the bubble is entirely hidden, especially if IPC isn't close to the maximum (as it usually is not).
Just a guess though - performance counter events indicate that Intel may use a similar fast/slow mechanism.
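To make the fast/slow ("overriding") idea concrete, here's a toy cost model in Rust; all the names and cycle counts are invented round numbers, not real Zen figures:

// Toy model of a two-level overriding predictor. The fast (L1) predictor
// steers fetch immediately; the slow (L2) predictor checks it a few cycles
// later and can override it.
const RESTEER_BUBBLE: u32 = 4;      // slow predictor overrides the fast one
const MISPREDICT_PENALTY: u32 = 16; // full flush when execute proves both wrong

fn front_end_cost(fast_pred: bool, slow_pred: bool, actual: bool) -> u32 {
    let mut cost = 0;
    if slow_pred != fast_pred {
        // Front end is re-steered onto the slow predictor's path; no wrong-path
        // instructions have executed yet, so this bubble may be hidden.
        cost += RESTEER_BUBBLE;
    }
    if slow_pred != actual {
        // The final (slow) prediction was still wrong: pay the full penalty.
        cost += MISPREDICT_PENALTY;
    }
    cost
}

fn main() {
    println!("both agree and are right: {}", front_end_cost(true, true, true));   // 0
    println!("re-steer saves it:        {}", front_end_cost(false, true, true));  // 4
    println!("both wrong:               {}", front_end_cost(true, true, false));  // 16
}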
The existing PDF of the book has a couple of rendering issues, but if you install BlueBottle, it includes Oberon System 3 as a demo application, with the original book in Oberon rich text format (not to be confused with RTF).
Yes, but for the previous few generations of transistor manufacturing, transistors' power consumption has not scaled down as well as their size. This is known as the failure of Dennard Scaling[0].
But I was under the impression that Linux isn't suitable as a real-time operating system (RTOS)[1] (which Microsoft may very well not require for their IoT systems). So the Linux Foundation providing an alternative free kernel meeting RTOS requirements makes perfect sense.
I'm not as clear on the history, but was Linux ever pitched as capable of being a real-time OS? I don't think so. The hard requirements of real-time generally lead to very different designs than general-purpose operating systems.
They mentioned in the article that the full cache of each die is available. Additionally, EPYC uses the same dies used in Ryzen. I'd look at earlier articles for Ryzen to determine latencies within a single die.
So for whatever cores are enabled on each die, you get the L1/L2 caches for each core as per the Ryzen launch. Additionally, you get all of the shared L3 cache, irrespective of the number of cores disabled per core complex. This pattern follows across all four dies in each socket.
"Each Epyc has 64KB and 32KB of L1 instruction and data cache, respectively, versus 32KB for both in the Broadwell family, and 512KB of L2 cache versus 256KB. AMD says Epyc matches the Broadwells in L2 and L2 TLB latencies, and has roughly half the L3 latency of Intel's counterparts."