Are there simulators that chip developers use to get an idea of what performance will be for certain workloads prior to creating an engineering sample? Or how does this work?
Absolutely! Chip designers have several tools to do this.
First, they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can before laying out a single transistor. These models can run code just like a real hardware device, albeit slowly.
Once the chip is designed, Verilog simulators can generate the exact logical output of the circuit, which can be used to measure performance on a workload. However, this method is even slower than the first!
For larger workloads and higher speed, they use extraordinarily expensive FPGA-based platforms called emulators. These allow circuits to be run at speeds in the MHz range before ever being sent to a fab. Booting an OS, running a complex multicore workload with shared memory: they can measure almost any workload. But this method is not available until late in the design phase, and the boxes themselves are too expensive to be deployed very widely.
The software models are the most useful for estimating performance, as long as they are written early and well :)
> they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can
How does this work? Do they model at the transistor level, or at the level of logical functions, or..? I'm particularly curious how this can estimate performance if it's anything higher-level than a direct transistor-for-transistor, layout-aware, emulation.
I'd be really interested in learning more if there's anything you could share, please. I can find info about chip design software and languages like Verilog (as you mention) but not this sort of modeling.
The idea is to write a C++ model that produces cycle-accurate outputs of the branch predictor, core pipeline, queues, memory latency, cache hierarchy, prefetch behaviour, etc. Transistor-level accuracy isn't needed as long as the resulting cycle timings are identical or near identical. The improvement in workload runtime compared to a Verilog simulation is precisely because they aren't trying to model every transistor, but just the important parameters which affect performance.
Let's take a simple example: Instead of modeling a 64-bit adder in all its gory transistor level detail, you can just have the model return the correct data after 1 "cycle" or whatever your ALU latency is. As long as that cycle latency is the same as the real hardware, you'll get an accurate performance number.
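To make that concrete, here's a toy sketch of the idea in C++. Everything here (the op set, the latencies) is invented for illustration, not taken from any real design: the model computes the architecturally correct result and simply charges each operation the number of cycles the real hardware would take.

```cpp
// Toy timing model: account for latency instead of modeling gates.
// All names and latencies are made up for illustration.
#include <cstdint>
#include <iostream>
#include <vector>

enum class Op { Add, Mul, Load };

struct Instr {
    Op op;
    uint64_t a, b;
};

// Latency table standing in for "whatever your ALU latency is".
int latency(Op op) {
    switch (op) {
        case Op::Add:  return 1;  // 1-cycle ALU
        case Op::Mul:  return 3;  // pipelined multiplier
        case Op::Load: return 4;  // L1 hit latency
    }
    return 1;
}

int main() {
    std::vector<Instr> program = {
        {Op::Add, 1, 2}, {Op::Mul, 3, 4}, {Op::Load, 0x1000, 0}};

    uint64_t cycles = 0;
    for (const Instr& i : program) {
        // Functionally, just produce the right answer...
        uint64_t result = (i.op == Op::Add) ? i.a + i.b
                        : (i.op == Op::Mul) ? i.a * i.b
                        : 0;  // pretend memory read
        (void)result;
        // ...and charge however many cycles the real hardware would take.
        cycles += latency(i.op);
    }
    std::cout << "estimated cycles: " << cycles << "\n";
}
```

A real model tracks far more (pipeline overlap, queues, cache state), but the principle is the same: correct data plus correct cycle counts, not transistors.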
What's particularly useful about these models is they enable much easier and faster state space exploration to see how a circuit would perform, well before going ahead with the Verilog implementation, which relatively speaking can take circuit designers ages. "How much faster would my CPU be if it had a 20% larger register file" can be answered in a day or two before getting a circuit designer to go try and implement such a thing.
If you want an open source example, take a look at the gem5 project (https://www.gem5.org). It's not quite as sophisticated as the proprietary models used in industry, but it's widely used in academia and open source hardware design and is a great place to start.
A good example is one of the classic Computer Architecture class assignments, which is to simulate a cache. The way that looks is you have a trace file of memory accesses and you "simulate it" by parsing that file and stepping through the actions the cache would take, i.e.: "OK, this block would be put in the cache. This next access was a hit, this next access was a miss, etc." Then you just count those actions and estimate the performance by tallying it all up.
That's the behavioral model part and IRL they do basically the same thing to decide what behavior they actually want the hardware to do.
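For anyone curious, here's a rough sketch of what that assignment looks like: a direct-mapped cache with made-up sizes, fed by a hard-coded trace instead of a parsed file, counting hits and misses.

```cpp
// Sketch of the classic assignment: a direct-mapped cache fed by an address
// trace, tallying hits and misses. Sizes and the trace are arbitrary examples.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const uint64_t kLineBytes = 64;    // block (cache line) size
    const uint64_t kNumLines  = 1024;  // 64 KiB direct-mapped cache

    struct Line { bool valid = false; uint64_t tag = 0; };
    std::vector<Line> cache(kNumLines);

    // In the real assignment this would be parsed from a trace file.
    std::vector<uint64_t> trace = {0x1000, 0x1004, 0x2000, 0x1008, 0x2004};

    uint64_t hits = 0, misses = 0;
    for (uint64_t addr : trace) {
        uint64_t block = addr / kLineBytes;
        uint64_t index = block % kNumLines;
        uint64_t tag   = block / kNumLines;

        if (cache[index].valid && cache[index].tag == tag) {
            ++hits;                      // "this next access was a hit"
        } else {
            ++misses;                    // "this next access was a miss"
            cache[index] = {true, tag};  // put this block in the cache
        }
    }
    std::cout << hits << " hits, " << misses << " misses\n";
}
```

Swap in associativity, a replacement policy, and per-access latencies and you're most of the way to a usable cache performance model.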
The next step is the circuit-level model, done in Verilog, which actually simulates the logic gates and does involve viewing a signal at every clock cycle.
There are a few specialized languages for hardware description; Verilog is common, as is VHDL. A good starting point is the Wikipedia page about hardware description languages [1]. This is a slow-moving area, so even old resources should be useful. I only encountered HDLs during my university years, and that's longer ago than I care to remember. I recall we did something with MIPS (back then we did everything close to the hardware on MIPS) and used a book by O'Reilly, something something Systems Design or so. Couldn't find it, probably wrong name, but I found this [2], maybe useful?
Basically you model all the elements of a chip (queues, memory, ALUs, etc.) and how much time they take. You use a virtual clock so your simulation can run at a different pace from real time.
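A minimal sketch of that virtual-clock idea, with an invented memory model that "returns" data a fixed number of cycles after a request (the latency is made up):

```cpp
// The simulation keeps its own cycle counter; nothing is tied to wall-clock
// time, so it runs at whatever pace the host machine allows.
#include <cstdint>
#include <iostream>

struct MemoryModel {
    static constexpr uint64_t kLatency = 100;  // pretend DRAM latency in cycles
    bool busy = false;
    uint64_t ready_at = 0;

    void request(uint64_t now) { busy = true; ready_at = now + kLatency; }
    bool tick(uint64_t now) {  // returns true on the cycle the data comes back
        if (busy && now >= ready_at) { busy = false; return true; }
        return false;
    }
};

int main() {
    MemoryModel mem;
    uint64_t completed = 0;

    // The virtual clock: one loop iteration per simulated cycle.
    for (uint64_t cycle = 0; cycle < 1000; ++cycle) {
        if (!mem.busy) mem.request(cycle);  // issue a new load when idle
        if (mem.tick(cycle)) ++completed;
    }
    std::cout << "completed " << completed << " loads in 1000 virtual cycles\n";
}
```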
>they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can before laying out a single transistor.
Usually SystemVerilog instead of C++, but it has C++ interfaces.
Ideally they know exactly how it will perform: Every part of the chip, including the caches, memory controller, and DRAM is implemented in a cycle accurate simulator. There are often multiple versions of that simulator, one written in C/C++ that matches the overall structure of the eventual hardware, and then simulations of the actual RTL (hardware source code, networks of gates).
The C-model and RTL model outputs are often also compared with each other as a correctness validation step, as they should ideally never diverge. (ie, implement twice, by two teams, and cross-check the results).
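Conceptually that cross-check is just a lock-step comparison. Here's a hedged sketch; `CModel` and `RtlSim` are placeholders, where in practice one would be the hand-written C++ simulator and the other a wrapper around the Verilog simulation (e.g. via a co-simulation interface):

```cpp
// Run both models in lock step and flag the first point where they diverge.
#include <cstdint>
#include <iostream>

struct ArchState {
    uint64_t pc = 0;
    uint64_t regs[32] = {};
    bool operator==(const ArchState& o) const {
        if (pc != o.pc) return false;
        for (int i = 0; i < 32; ++i)
            if (regs[i] != o.regs[i]) return false;
        return true;
    }
};

// Placeholder models that each "retire" one instruction per step().
struct CModel { ArchState s; ArchState step() { s.pc += 4; return s; } };
struct RtlSim { ArchState s; ArchState step() { s.pc += 4; return s; } };

int main() {
    CModel ref;
    RtlSim dut;
    for (uint64_t i = 0; i < 1000; ++i) {
        if (!(ref.step() == dut.step())) {
            std::cerr << "divergence at instruction " << i << "\n";
            return 1;  // a real flow would dump both states for debugging
        }
    }
    std::cout << "models agree over 1000 instructions\n";
}
```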
Those simulations are terrifically slow for larger chips, so there is a surprisingly small number of workloads that can be run through them in reasonable time. There tend to be even more simulator implementations that sacrifice perfect performance emulation for 'good enough' performance correlation (which is when surprises can happen). Being able to come up with a non-exact simulator that perf-correlates with real hardware is an art in itself.
Are the C simulators hand-crafted each time by the chip designer? It seems like the kind of thing that needs to be custom built, but I'm wondering if there is a common toolset used, or platform?
The performance team usually thinks in terms of cycles. At runtime the frequency varies depending on various factors as you said, but this is mostly ignored.
Those aren't for quantifying performance, they are for developing the firmware/OS/software stack to run on the platform.
In other words, they are a slightly more accurate version of something like QEMU, although I guess I should point out they can generate traces that can be fed into tools that model HW perf, e.g. gem5.
SystemC is a C++-based modeling library often used to model chip designs. Before tapeout, most of the chip design has been fully simulated and emulated many, many times, whether functionally or cycle by cycle (i.e. cycle-accurate simulation).
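For a flavour of what SystemC looks like, here's a minimal sketch, assuming the Accellera SystemC library is installed and linked: a combinational adder module plus a tiny testbench in `sc_main`.

```cpp
// Minimal SystemC example: a combinational 32-bit adder and a testbench.
#include <systemc.h>

SC_MODULE(Adder) {
    sc_in<sc_uint<32>>  a, b;
    sc_out<sc_uint<32>> sum;

    void compute() { sum.write(a.read() + b.read()); }

    SC_CTOR(Adder) {
        SC_METHOD(compute);   // re-evaluate whenever an input changes
        sensitive << a << b;
    }
};

int sc_main(int, char*[]) {
    sc_signal<sc_uint<32>> a, b, sum;
    Adder adder("adder");
    adder.a(a);
    adder.b(b);
    adder.sum(sum);

    a.write(40);
    b.write(2);
    sc_start(1, SC_NS);       // advance simulated time so the adder evaluates
    std::cout << "sum = " << sum.read() << std::endl;
    return 0;
}
```

SystemC also layers TLM-2.0 on top of this for faster, transaction-level models when cycle accuracy isn't needed.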
Sounds like the V2 is about as wide in issue width as Apple's M1/M2 (8 MOPs) but not nearly as deep (~300 versus over 600). Can ARM actually keep such a wide architecture busy?
I’d love to see a performance per dollar article on these machines. Is there anything out there? My guess is they’ll be efficient in terms of cost to buy and cost to run compared to the competition and I wonder if it’s worth trying out?
I've been running a few sites on Hetzner Cloud's ARM machines, and at a guess I'd say performance on the 4-core ARM is extremely close to the 4-core Intel/AMD, at a quarter of the cost. I've not seen any issues with them at all either. The bonus is that, being far more energy efficient, they're theoretically greener too. I wonder what the carbon savings would be if every server switched to ARM?
The ARM Cortex-A7xx and Neoverse N cores are intended to be comparable to the Intel E-cores (Atom cores, like in Alder Lake N, such as Intel N100, or in the small cores of Raptor Lake) and to the AMD compact cores (like in Bergamo or future mobile CPUs). These cores are optimized for low area and low power consumption, with the expectation that a good throughput can be obtained by using a large number of cores.
The ARM Cortex-X and Neoverse V cores are intended to compete with the Intel P-cores (like in Sapphire Rapids or the big cores of Raptor Lake) and with the AMD normal cores. These cores are optimized for high single-thread performance and for workloads where low latency is important.
The ARM Cortex-A5xx cores are much smaller and slower than any Intel or AMD cores.
I think the Cortex-X series cores are the ones starting to make their way into laptops and the like (Cortex-X4 is the latest). These are Arm's "flagship" cores.
If by fair you mean the same cost to produce on the same process node, then an Arm V class core should be the same as an AMD compact core, roughly. The Arm N cores are a lot smaller, closer to Intel's E cores. Full sized Intel P cores or AMD's non-C Ryzens are closer to an Apple M core.
While Apple M runs Arm instructions, it is not any of the cores designed by Arm, like the Cortex-A7xx series or the X series. While those cores ship in chips from other manufacturers (e.g. MediaTek, Qualcomm, or even Google's Tensor), the actual core design is from ARM, but it is tweaked (usually things like cache sizes can be adjusted) and integrated with other supporting hardware by the manufacturer. Apple's cores are completely custom, with no input from ARM the company.
So I guess that is the reason why AWS went with V1 and 5 nm. V2 isn't as attractive in terms of die space, node, and power usage. Graviton could trade for more V1 cores instead of top single-core V2 performance.
I am hoping that post-IPO ARM will announce the Cortex-X5, N3, and V3. Also waiting to see if the Apple A17 will gain another double-digit percentage IPC improvement. Personally I don't think that will happen.