Arm’s Neoverse V2 (chipsandcheese.com)
102 points by matt_d on Sept 11, 2023 | 54 comments


Are there simulators that chip developers use to get an idea of what performance will be for certain workloads prior to creating an engineering sample? Or how does this work?


Absolutely! Chip designers have several tools to do this.

First, they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can before laying out a single transistor. These models can run code just like a real hardware device, albeit slowly.

Once the chip is designed, Verilog simulators can generate the exact logical output of the circuit, which can be used to measure performance on a workload. However, this method is even slower than the first!

For larger workloads and higher speed, they use extraordinarily expensive FPGA-based platforms called emulators. These allow circuits to run at speeds in the MHz range before ever being sent to a fab: booting an OS, running a complex multicore workload with shared memory, almost any workload can be measured. But this method is not available until late in the design phase, and the boxes themselves are too expensive to be deployed very widely.

The software models are the most useful for estimating performance, as long as they are written early and well :)


> they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can

How does this work? Do they model at the transistor level, or at the level of logical functions, or..? I'm particularly curious how this can estimate performance if it's anything higher-level than a direct transistor-for-transistor, layout-aware, emulation.

I'd be really interested in learning more if there's anything you could share, please. I can find info about chip design software and languages like Verilog (as you mention) but not this sort of modeling.


The idea is to write a C++ model that produces cycle-accurate outputs of the branch predictor, core pipeline, queues, memory latency, cache hierarchy, prefetch behaviour, etc. Transistor-level accuracy isn't needed as long as the resulting cycle timings are identical or near identical. The improvement in workload runtime compared to a Verilog simulation is precisely because they aren't trying to model every transistor, just the important parameters which affect performance.

Let's take a simple example: Instead of modeling a 64-bit adder in all its gory transistor level detail, you can just have the model return the correct data after 1 "cycle" or whatever your ALU latency is. As long as that cycle latency is the same as the real hardware, you'll get an accurate performance number.
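To make that concrete, here is a toy sketch of the idea in C++ (entirely made up for illustration, not from any real simulator): the adder's logic is trivially "a + b", and the only thing modeled carefully is when the result becomes visible.

    #include <cstdint>
    #include <cstdio>

    // Assumed latency; in a real model this comes from the design spec.
    constexpr uint64_t ALU_LATENCY = 1;

    struct AddOp {
        uint64_t a, b;
        uint64_t issue_cycle;  // virtual cycle the op entered the ALU
        uint64_t result() const { return a + b; }  // functionally exact, no gates modeled
        bool done(uint64_t now) const { return now >= issue_cycle + ALU_LATENCY; }
    };

    int main() {
        uint64_t cycle = 0;
        AddOp op{40, 2, cycle};
        while (!op.done(cycle)) ++cycle;  // advance the virtual clock
        std::printf("result=%llu ready at cycle %llu\n",
                    (unsigned long long)op.result(), (unsigned long long)cycle);
    }

As long as ALU_LATENCY matches the real pipeline, the cycle counts the model reports line up with hardware even though no gates are being simulated.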

What's particularly useful about these models is they enable much easier and faster state space exploration to see how a circuit would perform, well before going ahead with the Verilog implementation, which relatively speaking can take circuit designers ages. "How much faster would my CPU be if it had a 20% larger register file" can be answered in a day or two before getting a circuit designer to go try and implement such a thing.

If you want an open source example, take a look at the gem5 project (https://www.gem5.org). It's not quite as sophisticated as the proprietary models used in industry, but it's used widely in academia and open source hardware design and is a great place to start.


This was really interesting to learn. Thank you!


When I was in chip design about 15 years ago, we did transaction level modeling (TLM) using SystemC. Not sure if it’s still a thing these days.

https://en.m.wikipedia.org/wiki/Transaction-level_modeling


A good example is one of the classic Computer Architecture class assignments, which is to simulate a cache. The way that looks is you have a trace of memory accesses, and you "simulate it" by parsing that file and recording the actions that would be taken, i.e. "this block gets put in the cache, this next access is a hit, this next access is a miss, etc." Then you just count those actions and estimate the performance by tallying it all up.
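For a flavour of what that assignment ends up looking like, here's a toy direct-mapped version in C++ (the cache geometry, the one-address-per-line trace format, and the hit/miss costs at the end are all assumptions for illustration):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        constexpr int LINE_BYTES = 64;    // assumed line size
        constexpr int NUM_SETS   = 1024;  // 64 KiB, direct-mapped (assumed)
        std::vector<uint64_t> tags(NUM_SETS, UINT64_MAX);  // UINT64_MAX = empty line

        uint64_t hits = 0, misses = 0, addr;
        while (std::cin >> std::hex >> addr) {   // one hex address per trace line
            uint64_t block = addr / LINE_BYTES;
            uint64_t set   = block % NUM_SETS;
            uint64_t tag   = block / NUM_SETS;
            if (tags[set] == tag) ++hits;        // block already resident
            else { ++misses; tags[set] = tag; }  // miss: fill the line
        }
        std::cout << "hits=" << hits << " misses=" << misses << "\n";
        // Tally into a performance estimate, assuming e.g. 4-cycle hits and 100-cycle misses:
        std::cout << "approx cycles=" << 4 * hits + 100 * misses << "\n";
    }

Associativity, replacement policies, and multiple cache levels are just more bookkeeping on top of the same hit/miss tallying.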

That's the behavioral model part and IRL they do basically the same thing to decide what behavior they actually want the hardware to do.

The next step is the circuit-level model, done in Verilog, which actually simulates the logic gates and does involve viewing every signal at every clock cycle.


There are a few specialized languages for hardware description; Verilog is common, as is VHDL. A good starting point is the Wikipedia page about hardware description languages [1]. This is a slow-moving area, so even old resources should be useful. I only encountered HDLs during my university years, and that's longer ago than I care to remember. I recall we did something with MIPS (back then we did everything close to the hardware on MIPS) and used a book by O'Reilly, something something Systems Design or so. Couldn't find it, probably the wrong name, but I found this [2], maybe useful?

[1] https://en.wikipedia.org/wiki/Hardware_description_language [2] https://freecomputerbooks.com/langVHDLBooks.html


The following is a pretty good overview:

"A Survey of Computer Architecture Simulation Techniques and Tools" - IEEE Access 2019 - Ayaz Akram, Lina Sawalha - https://ieeexplore.ieee.org/document/8718630

For more see also: https://github.com/MattPD/cpplinks/blob/master/comparch.md#e...


Basically you model all the elements of a chip (queues, memory, ALUs, etc.) and how much time they take. You use a virtual clock, so your simulation model can run at a different pace than the real hardware.
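A minimal sketch of that virtual-clock idea (the component and the numbers are invented purely for illustration):

    #include <cstdint>
    #include <cstdio>
    #include <deque>

    struct Queue {  // e.g. an issue queue that drains one entry per cycle
        std::deque<int> entries;
        void tick() { if (!entries.empty()) entries.pop_front(); }
        bool empty() const { return entries.empty(); }
    };

    int main() {
        Queue q;
        for (int i = 0; i < 10; ++i) q.entries.push_back(i);  // 10 pending ops

        uint64_t cycle = 0;  // the virtual clock: unrelated to wall-clock time
        while (!q.empty()) {
            q.tick();        // every modeled component does one cycle of work
            ++cycle;
        }
        std::printf("drained in %llu cycles\n", (unsigned long long)cycle);
    }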


>they create detailed software models (usually in C++) of their chips to estimate performance as closely as they can before laying out a single transistor.

Usually SystemVerilog instead of C++, but it has C++ interfaces.

https://en.wikipedia.org/wiki/SystemVerilog


Ideally they know exactly how it will perform: every part of the chip, including the caches, memory controller, and DRAM, is implemented in a cycle-accurate simulator. There are often multiple versions of that simulator: one written in C/C++ that matches the overall structure of the eventual hardware, and then simulations of the actual RTL (the hardware source code, networks of gates).

The C-model and RTL model outputs are often also compared with each other as a correctness validation step, as they should ideally never diverge (i.e., implement twice, by two teams, and cross-check the results).
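In toy form, that cross-checking amounts to running the two models in lockstep and flagging the first divergence (everything below is invented for illustration; real flows compare far richer state, such as retired-instruction traces):

    #include <cstdint>
    #include <cstdio>

    // Two independently written models of the same (trivial) block: a counter
    // that increments every cycle. Stand-ins for the C-model and the RTL sim.
    struct CModel { uint64_t v = 0; void step() { ++v; }       uint64_t state() const { return v; } };
    struct RtlSim { uint64_t v = 0; void step() { v = v + 1; } uint64_t state() const { return v; } };

    int main() {
        CModel c;
        RtlSim r;
        for (uint64_t cycle = 0; cycle < 1000; ++cycle) {
            c.step();
            r.step();
            if (c.state() != r.state()) {  // should never happen
                std::printf("divergence at cycle %llu\n", (unsigned long long)cycle);
                return 1;
            }
        }
        std::printf("models agree for 1000 cycles\n");
    }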

Those simulations are terrifically slow for larger chips, so there is a surprisingly small number of workloads that can be run through them in reasonable time. So there tend to be even more simulator implementations that sacrifice perfect performance emulation for 'good enough' performance correlation (when surprises can happen). Being able to come up with a non-exact simulator that perf-correlates with real hardware is an art in itself.


Are the C simulators hand crafted each time by the chip designer? It seems like the kind of thing that needs custom built but I’m wondering if there is a common toolset used, or platform?


For production chips the simulators are usually completely custom. In academia people tend to modify existing simulators like SimpleScalar or Gem5.


Many thanks for the insight


The "exact" part is not exactly true for modern computer processors, where thermal and power constraints are a problem, right?


The performance team usually thinks in terms of cycles. At runtime the frequency varies depending on various factors as you said, but this is mostly ignored.



Those aren't for quantifying performance, they are for developing the firmware/OS/software stack to run on the platform.

In other words, they are a slightly more accurate version of something like QEMU, although I guess I should point out they can generate traces that can be fed into tools to model HW perf, e.g. gem5.


Second sentence in link: "They allow full control over the simulation, including profiling, debug and trace."


SystemC is a C++-based modeling library often used to model chip designs. Before tapeout, most of the chip design has been fully simulated and emulated many, many times, be it functionally or cycle by cycle (i.e. cycle-accurate simulation).


That just gave me PTSD. SystemC is terrible.


It's C++17 for me; I don't feel much difference at all between C++ and SystemC.


Sounds like the V2 is about as wide in issue width as Apple's M1/2 (8 MOPs) but not nearly as deep (~300 versus over 600). Can ARM actually keep such a wide architecture busy?


FYI, depth in this context generally refers to the number of pipeline stages. I assume you’re talking about the ROB size?


Maybe it’s something I picked up from AnandTech. Yes, I meant the ROB size.


I didn't know NVIDIA makes server ARM chips. Their Tegra processor was promising, especially the GPU part. I hope they bring it back.


Tegra hasn't gone away. A new version just came out called the Orin.


It does not seem like what Tegra used to be; it looks like it's for automotive only: https://wikimovel.com/index.php/Nvidia_Tegra_Orin


We might see a new Tegra chip in next year's Nintendo Switch 2.


It’s more than capable for general-purpose use; you don’t have to use the lockstep if you don’t want it.


Tegra chips used to be in consumer electronics like tablets and mobile phones; I don't see recent chips used in consumer electronics.


It can be used however you want. There's no difference compared to something like the Shield.


There are more cost-effective chips for those.


Unfortunately at $2,000 Orin is more than overpriced for general-purpose use.


Has anyone here tried Orin?

Any idea about the cost and performance?


It powers the Nintendo Switch.


Whither SVE? Will trying to use 256-bit vectors still result in the sad trombone condition register getting set?


It's 4x128-bit units now.

TBH this makes sense, as pretty much all the ARM code in the wild will be using NEON.


If you could get your hands on a Fujitsu A64FX you could get some really wide SVE vectors, but those aren't really supposed to be for consumers.


I’d love to see a performance per dollar article on these machines. Is there anything out there? My guess is they’ll be efficient in terms of cost to buy and cost to run compared to the competition and I wonder if it’s worth trying out?


I've been running a few sites on Hetzner Cloud's ARM machines, and at a guess I'd say performance on the 4-core ARM is extremely close to the 4-core Intel/AMD, and it's a quarter of the cost. I've not seen any issues with them at all either. The bonus is that, being far more energy efficient, they're theoretically greener too. I wonder what the carbon savings would be if every server switched to ARM?


I'm still not sure which ARM cores are the "most fair" to compare to laptop/desktop x86oids and the Apple M series; the N2, the A710, or something else?


The ARM Cortex-A7xx and Neoverse N cores are intended to be comparable to the Intel E-cores (Atom cores, like in Alder Lake N, such as Intel N100, or in the small cores of Raptor Lake) and to the AMD compact cores (like in Bergamo or future mobile CPUs). These cores are optimized for low area and low power consumption, with the expectation that a good throughput can be obtained by using a large number of cores.

The ARM Cortex-X and Neoverse V cores are intended to compete with the Intel P-cores (like in Sapphire Rapids or the big cores of Raptor Lake) and with the AMD normal cores. These cores are optimized for high single-thread performance and for workloads where low latency is important.

The ARM Cortex-A5xx cores are much smaller and slower than any Intel or AMD cores.


I think the Cortex-X series cores are the ones starting to make their way into laptops and the like (Cortex-X4 is the latest). These are Arm's "flagship" cores.


If by fair you mean the same cost to produce on the same process node, then an Arm V class core should be the same as an AMD compact core, roughly. The Arm N cores are a lot smaller, closer to Intel's E cores. Full sized Intel P cores or AMD's non-C Ryzens are closer to an Apple M core.


The irony, given that the Apple M series is an ARM core...


While Apple M runs Arm instructions, it is not any of the cores designed by Arm like the Cortex-A7xx series or the X series. While those chips come in packages from other manufacturers (i.e. MediaTek, Qualcomm, or even Google's Tensor), the actual core design is from ARM, but is tweaked (usually things like cache sizes can be adjusted) and integrated with other supporting hardware by the manufacturer. Apple's cores are actually completely custom, with no input from ARM the company.


> While Apple M runs Arm instructions, it is not any of the cores designed by Arm like the Cortex-A7xx series or the X series

How can we be sure about this since Apple does not disclose any details about their chips?

> Apple's cores are actually completely custom, with no input from ARM the company.

Apple is one of the founders of ARM, with the other two being Acorn Computers and VLSI Technology.


ARM as in "designed by ARM Ltd.", not ARM as in "uses the AArch64 ISA".


So I guess that is the reason why AWS went with V1 and 5nm. V2 isn't quite as attractive in terms of die space, node, and power usage. Graviton could trade for more V1 cores instead of top single-core V2 performance.

I am hoping that post ARM IPO we will have Cortex-X5, N3, and V3 announcements. Also waiting to see if the Apple A17 will gain another double-digit percentage IPC improvement. Personally I don't think that will happen.


V2 probably wasn't available when Graviton 3 was developed.


It doesn't sound like a long time but Graviton 3 happened almost 2 years ago now.


The CMN-700 12x12 mesh seems interesting to me; I think this very much needs to be explored more. I hope somebody builds one!



