Can Wachy be used to trace kernel code as well? And would you need to do anything specific to get that to work? (How would Wachy find the kernel sources?)
I was thinking this would be pretty cool to use for debugging device drivers on embedded devices especially w.r.t latency.
I'm afraid that's not supported today. But it looks like kprobes do support offsets within a function, so it should be possible to get it to work. That does sound pretty cool, please open an issue for this if you're interested! It might be as simple as (effectively) s/uprobe/kprobe/g, in which case I can try to get that working.
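For what it's worth, the kernel's kprobe_events tracefs interface does already accept a symbol-plus-offset, so the plumbing exists on the kernel side. Here's a rough sketch in Rust (not how wachy is wired up internally; the function name "do_sys_open", the offset, and the probe name are placeholders) of registering such a probe:

use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Register a kprobe at an offset inside a kernel function via tracefs.
    // A real tool would resolve the offset from debug info, the way wachy
    // already does for userspace binaries with uprobes.
    let mut events = OpenOptions::new()
        .append(true)
        .open("/sys/kernel/tracing/kprobe_events")?;
    writeln!(events, "p:wachy_test do_sys_open+8")?;

    // Enable the probe; hits then show up in /sys/kernel/tracing/trace.
    std::fs::write("/sys/kernel/tracing/events/kprobes/wachy_test/enable", "1")?;
    Ok(())
}

(Needs root, and the offset has to land on an instruction boundary, which is exactly the part a tool like wachy would handle for you.)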
It does! But you still have to actually "use" it, meaning that it has to appear as a field in the definition, and you have to pass a PhantomData value whenever you create new instances of your data type.
In Haskell, you can omit these fields entirely, and achieve the same thing just by annotating the function.
For example, in Haskell, we can have
data Const a b = Const a
whereas in Rust, it would be:
use std::marker::PhantomData;

struct Const<A, B> {
    konst: A,
    // does not exist at run-time
    discard: PhantomData<B>,
}
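And just to make the "pass a PhantomData value" part concrete, here's a small sketch of what construction looks like, continuing the Rust snippet above (make_const and the u32/String choice are purely illustrative):

fn make_const<A, B>(value: A) -> Const<A, B> {
    Const {
        konst: value,
        // zero-sized, so it costs nothing at run-time, but it still has to be written out
        discard: PhantomData,
    }
}

fn main() {
    // B has to be named explicitly somewhere, since no field value determines it.
    let c: Const<u32, String> = make_const(7);
    println!("{}", c.konst);
}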
Type parameter variance is inferred from usage (e.g. covariant for normal fields, contravariant for function arguments) and without a usage there's no way to infer it.
"AMD’s primary advertised improvement here is the use of a TAGE predictor, although it is only used for non-L1 fetches. This might not sound too impressive: AMD is still using a hashed perceptron prefetch engine for L1 fetches, which is going to be as many fetches as possible, but the TAGE L2 branch predictor uses additional tagging to enable longer branch histories for better prediction pathways. This becomes more important for the L2 prefetches and beyond, with the hashed perceptron preferred for short prefetches in the L1 based on power."
I found this paragraph confusing. Is it talking about data prefetchers (which would make sense because of the mention of short prefetches) or branch predictors (which would make sense because of the mention of TAGE and perceptron)?
A little of both. My understanding of the above paragraph is that the L1 predictor is trying to predict which code-containing cache lines need to stay loaded in L1, and which can be released to L2, by determining which branches from L1 cache-lines to L1 cache-lines are likely to be taken in the near future. Since L1 cache lines are so small, the types of jumps that can even be analyzed successfully have very short jump distances—i.e. either jumps within the same code cache-line, or to its immediate neighbours. The L1 predictor doesn’t bother to guess the behaviour of jumps that would move the code-pointer more than one full cache-line in distance.
Or, to put that another way, this reads to me like the probabilistic equivalent of a compiler doing dead code elimination on unconnected basic blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when no recently-visited L1 cache line branch-predicts into them.
I was also confused by this, but my reading is that it is entirely about branch prediction, not about caching. In that context, L1 and L2 simply refer to "first" and "second" level branch prediction strategies, and are not related to the L1 and L2 caches (in the same way that the L1 and L2 BTB and the L1 and L2 TLB are not related to the L1 and L2 caches).
The way this works is that there is a fast predictor (L1) that can make a prediction every cycle, or at worst every two cycles, which initially steers the front end. At the same time, the slow (L2) predictor is also working on a prediction, but it takes longer: either it is throughput-limited (e.g., one prediction every 4 cycles) or it has a long latency (e.g., it takes 4 cycles from the last update to produce a new one). If the slow predictor ends up disagreeing with the fast one, the front end is "re-steered", i.e., repointed to the new path predicted by the slow predictor.
This re-steer costs only a few cycles, so it is much better than a full branch misprediction: the wrong-path instructions haven't started executing yet, so it is possible the bubble is entirely hidden, especially if IPC isn't close to the maximum (as it usually is not).
Just a guess though - performance counter events indicate that Intel may use a similar fast/slow mechanism.
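To make the fast/slow ("overriding") idea concrete, here's a toy cost model in Rust; all the names and cycle counts are invented round numbers, not real Zen figures:

// Toy model of a two-level overriding predictor. The fast (L1) predictor
// steers fetch immediately; the slow (L2) predictor checks it a few cycles
// later and can override it.
const RESTEER_BUBBLE: u32 = 4;      // slow predictor overrides the fast one
const MISPREDICT_PENALTY: u32 = 16; // full flush when execute proves both wrong

fn front_end_cost(fast_pred: bool, slow_pred: bool, actual: bool) -> u32 {
    let mut cost = 0;
    if slow_pred != fast_pred {
        // Front end is re-steered onto the slow predictor's path; no wrong-path
        // instructions have executed yet, so this bubble may be hidden.
        cost += RESTEER_BUBBLE;
    }
    if slow_pred != actual {
        // The final (slow) prediction was still wrong: pay the full penalty.
        cost += MISPREDICT_PENALTY;
    }
    cost
}

fn main() {
    println!("both agree and are right: {}", front_end_cost(true, true, true));   // 0
    println!("re-steer saves it:        {}", front_end_cost(false, true, true));  // 4
    println!("both wrong:               {}", front_end_cost(true, true, false));  // 16
}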
The existing PDF of the book has a couple of rendering issues, but if you install BlueBottle, it includes Oberon System 3 as a demo application, with the original book in Oberon rich text format (not to be confused with RTF).
Yes, but for the previous few generations of transistor manufacturing, transistors' power consumption has not scaled down as well as their size. This is known as the failure of Dennard Scaling[0].
But I was under the impression that Linux isn't suitable as a real-time operating system (RTOS)[1] (which Microsoft may very well not require for their IoT systems). So the Linux Foundation providing an alternative free kernel meeting RTOS requirements makes perfect sense.
I'm not as clear on the history, but was Linux ever pitched as capable of being a real-time OS? I don't think so. The hard requirements of real-time generally lead to very different designs than general-purpose operating systems.
They mentioned in the article that the full cache of each die is available. Additionally, EPYC uses the same dies used in Ryzen. I'd look at earlier articles for Ryzen to determine latencies within a single die.
So for whatever cores are enabled on each die, you get the L1/L2 caches for each core as per the Ryzen launch. Additionally, you get all of the shared L3 cache, irrespective of the number of cores disabled per core complex. This pattern follows across all four dies in each socket.
"Each Epyc has 64KB and 32KB of L1 instruction and data cache, respectively, versus 32KB for both in the Broadwell family, and 512KB of L2 cache versus 256KB. AMD says Epyc matches the Broadwells in L2 and L2 TLB latencies, and has roughly half the L3 latency of Intel's counterparts."