AMD EPYC 97x4 “Bergamo” CPUs: 128 Zen 4c CPU Cores for Servers, Shipping Now (anandtech.com)
159 points by ksec on June 24, 2023 | 123 comments


"Consistent x86 ISA" in that slide is clearly a dig at Intel. AVX-512 is still not present in the chips with E-cores.


Last I checked, there aren't any Xeons with E-cores. However, you're right: the most direct competitor to Bergamo seems to be Sierra Forest, a chip that will have nothing but E-cores (and no AVX-512).


The Intel "Sierra Forest" Xeon CPUs with E-cores are announced for 2024.

Even though they are branded Atom rather than Xeon, there already are a lot of Intel server CPUs with E-cores, e.g. Denverton Refresh, Snow Ridge, Parker Ridge and the recently launched Arizona Beach.


A chip with hundreds or thousands of E cores and AVX512 may be competitive with GPUs for AI.


Imagine if that chip also had HBM memory and fit into a PCIe slot, or maybe had direct-attached high-bandwidth networking.

Now imagine Intel had that product and canceled it in 2017, and you will be living in reality: https://en.wikipedia.org/wiki/Xeon_Phi


They’re fools not to revive that or something like it.

Not only would something like 1024 cores with AVX512 be competitive with GPUs, it would have the added advantage of being waaay more versatile and easier to program.


That versatility comes at a cost though - nothing in silicon is free. It clearly couldn't compete with GPUs at their strengths, and couldn't find its own niche, which is why it was cancelled.

And I'm not sure it'll be that much easier to program - "few large cores" and "GPU waves" are now pretty well supported in the stack with mature tooling - trying to insert a new stack between the two would likely be pretty difficult, as it doesn't really fit well with either established paradigm, so it likely needs something new to show its benefits.


Did you ever try to use the Xeon Phi? It was hideous to optimize for.

The cores were in order. If you've never written code for in-order cores, you may not appreciate exactly how heavily you're leaning on out-of-order execution to save your bacon. Things you're used to just working suddenly don't.

The cores had a 2D layout, but Intel refused to tell you what it was, and in practice it was impossible to optimize for. You ended up with variable (as in different on every node) latency for memory access, which made programming painful again (and add to this the fact that the cores are in order, so you really can't work around the latency).

Then there was the software stack that was just odd in some ways. I was researching I/O throughput, and things that just work on any other CPU (on any modern ISA) just didn't behave in a sane way.

So yeah, I was not at all sad when Intel cancelled it.


I actually don't think that would be easier to program. The current thread abstraction won't be efficient with 1024 cores. OpenCL is much better suited for that. But then you might as well run it on a GPU.


Work stealing makes it arguably much easier to program than CUDA or OpenCL.


It is not easy to program Xeon Phi if you want to get performance. It is easy to run any application on it, but it is also easy to run it slower than a cheaper Xeon with a much lower core count.

It was a big experiment, and it failed. Some of the brightest people have tried to push it to the limit. In my circle, I have never heard anyone miss Xeon Phi since it was discontinued.


I won a Xeon Phi programming manual at an Intel talk; I can assure you that even though it's easy to program, it is absolutely not easy to optimise. The amount of compiler+profiler+cilk+… tooling Intel had to provide was impressive.


What you are talking about literally came and went. You can probably track one down, buy it and have it in your computer in a few days.


> A chip with hundreds or thousands of E cores and AVX512 may be competitive with GPUs for AI.

A chip with hundreds or thousands of E-cores should be ideal for any cloud provider to offer vCPUs with a high markup. For example, a company like Hetzner sells 1-vCPU services for around 4€/month. Even if they don't overprovision and allocate 1 core per vCPU, a single chip can potentially earn them 500€/month, which puts the break-even point at around 2 years.


Wait till you see the memory bandwidth that one-thousandth of a CPU gets you.


> AVX-512 is still not present in the chips with E-cores.

Why would they put a dig at Intel’s consumer chips in a slide deck for their server parts? The Intel Xeons don’t have E-cores. This doesn’t make any sense unless I’m missing something.

Also, you know AMD has recent consumer chips without AVX-512, right?


>The Intel Xeons don’t have E-cores.

Xeons with E-cores are planned: https://en.wikipedia.org/wiki/Sierra_Forest

And even if they weren't, it would still be a cheap PR win.

>Also, you know AMD has recent consumer chips without AVX-512, right?

All the current generation processors (Zen4) support AVX-512, both mobile and desktop. It may be confusing because AMD's numbering scheme is intentionally misleading and they sell previous gen chips with new model numbers.


There are many kinds of Intel server CPUs with E-cores, with up to 24 E-cores, which are branded Atom, not Xeon, and which are less known, because they are mainly intended to be integrated in various telecommunication systems, e.g. in mobile phone infrastructure.

Intel server CPUs with E-cores that will use the Xeon brand, because they will have many more cores (6 times more, i.e. 144 vs. 24) than the current models, are announced for 2024.


They don't have a lot to brag about there. I'm on an expensive Thinkpad with an AMD Ryzen 7 PRO 6850H, less than a year old... no AVX-512. Chip was released Apr 2022, with AVX & AVX2 but no AVX-512. (Also completely garbage unstable and unreliable driver & firmware support from both AMD and Lenovo, but that's another story).

I've been messing with SIMD optimizations recently in a data structure library I wrote, and as tantalizing as AVX-512 is, it'll be years before it can be used in production software at any real scale. Introduced in 2017, and still totally unusable in the wild.


> Also completely garbage unstable and unreliable driver & firmware support from both AMD and Lenovo, but that's another story

as a counterpoint, i'm on a P14s Gen 2 Thinkpad (Ryzen 7 PRO 5850u) and have zero issues on EndeavourOS (Arch, KDE/Plasma). most of my time is in VSCode, SublimeMerge, Chrome, MPV.

i did have to rip out the realtek (or mediatek?) M.2 wifi card and swap in an intel one tho.


Lenovo straight up broke the USB-C ports support for external displays (and other things) in the latest BIOS update for my machine -- for both Windows and Linux -- and there's no way to roll back and there's not a peep from them on any kind of attempt to fix or even acknowledgement of a problem.

Requires a hardware ("pinhole", not power cycle) reset every time I want to switch between external monitor use and laptop only. My bug report/forum post here: https://forums.lenovo.com/t5/ThinkPad-Z-series-Laptops/Exter... -- other people are affected.

Last year they pushed out a BIOS update which caused the fan to run 100% of the time. Then a "fix" which caused it to overheat. And in-between there somewhere something was pushed that caused me to have to reformat the whole thing.

Lid-open-to-awake in Linux used to simply not work 75% of the time, required a hard reset. Now it works, but there's a 5-10 second pause before display turns on (Windows or Linux) vs almost instant on the work issued laptops I've had recently (MBP M1 and some HP Z series thing, running Linux)

Waste of cash. Bought this thing to be my dev workstation when I took a contract job last summer, wish I'd never done that. Software quality / support at Lenovo is a real problem.


My ThinkPad tablet has had issues after a bad partial firmware update. It won't stay on past 30 min. It is really souring my former respect for them. I understand it is partly TPM issues, but now I have basically a semi-brick. I was able to move the whole OS to a VM. I'll try Linux and see if it was a Windows-BIOS interaction bug. And if that doesn't fix it, I'll wait six more months and see if it gets a working update. But my former respect for Lenovo is now gone.


> Lenovo straight up broke the USB-C ports support (...)

According to AMD, the AMD Ryzen 7 PRO 6850H does not have USB type-C support.

https://www.amd.com/en/product/11621


What does that even mean? Clearly the laptop itself has USB-C support (it has 2 USB4 Type-C ports and 1 USB-C 3.2 port).


A couple of T14s here (one Debian, one Windows), both with the 4800U; all solid. I haven't tried updating the BIOS.


If you're using an LLVM-compiled language targeting a new AVX-512-enabled architecture, you're already using AVX-512 instructions to vastly accelerate your programs via autovectorization.


Hardly. Autovectorization is rarely performed as a percentage of all the cases where it’s theoretically possible, not reliably done (no guarantees - even if it AVs today, the same code may not AV tomorrow), and not something you want your actual hot path to rely on. AV is nice for “incidental” speedups to general code. No serious library touting SIMD acceleration will consider AV sufficient for the core hot loop.


> Autovectorization is rarely performed as a percentage of all the cases where it’s theoretically possible,

This is not my experience. Do you have some justification for this statement? I've found myself far more impressed by autovectorization than disappointed by it. I've found that code most people think is autovectorizable actually violates scalar contracts. But if you write clear code whose scalar implementation won't introduce UB if autovectorized, the compiler is really good.

Here's my go-to example. Agner Fog's VCL is a well-respected library. It has vectorized versions of all sorts of useful mathematical functions. For fun, I rewrote his `exp(x)` function using his exact algorithm, but in scalar code, and it autovectorized, and benchmarks the same.

https://godbolt.org/z/TK637PTh7
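
If you just want the flavor without clicking through - this is not the exp() code from the link, just a trivial scalar loop of the kind that vectorizes cleanly with gcc/clang at -O3 and a wide -march:

    #include <cstddef>

    // No aliasing, no UB in the scalar semantics, so the compiler is free to
    // turn the body into packed adds/multiplies (ymm or zmm, depending on its
    // preferred vector width) without any intrinsics.
    void saxpy(float a, const float* __restrict x, float* __restrict y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }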


Nobody distributes binaries compiled for AVX-512 for obvious reasons. The only times you'll use it are for source-based distros like Gentoo, and programs that do run time feature detection, but they are relatively rare and won't be using autovectorisation.


Runtime feature detection need not be rare nor hard, it's a few dozen lines of boilerplate. You can even write your code just once: see https://github.com/google/highway#examples.
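
The single-target version of the example on that page looks roughly like this (from memory, so treat it as a sketch; the full example adds foreach_target.h and HWY_DYNAMIC_DISPATCH so the same source compiles and selects per-CPU variants at runtime):

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // x[i] = mul[i] * x[i] + add[i]; size is assumed to be a multiple of the
    // lane count. Lanes(d) adapts to whatever vector width the target has.
    void MulAddLoop(const float* HWY_RESTRICT mul, const float* HWY_RESTRICT add,
                    const std::size_t size, float* HWY_RESTRICT x) {
      const hn::ScalableTag<float> d;
      for (std::size_t i = 0; i < size; i += hn::Lanes(d)) {
        const auto m = hn::Load(d, mul + i);
        const auto a = hn::Load(d, add + i);
        hn::Store(hn::MulAdd(m, hn::Load(d, x + i), a), d, x + i);
      }
    }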


That's still hard, given the tiny benefit it would give to 99% of code.

Maybe modern tooling (e.g. Cargo) will lower the barrier so that it becomes less rare, but in C++ it's definitely not worth the effort for the vast vast vast majority of projects.


Wow, I find "vast vast vast" hard to believe. If we don't have loops where a lot of CPU time is spent, why even write in C++?

I admittedly focus more on libraries, but quite a large number of them are already vectorized. I would venture that a sizable fraction of CPU time, even in 'normal' non-HPC context, uses SIMD indirectly. Think image/video decompression, browser handshakes/rendering, image editing, etc.


> If we don't have loops where a lot of CPU time is spent, why even write in C++?

Because it isn't just hot or autovectorisable loops that are faster in C++; everything is faster. Function calls, member accesses, arithmetic, etc. Even loops are normally not very hot and not autovectorisable.

You're right that things like audio/image processing, compression etc. benefits from SIMD but that is in the 1%. Those are libraries that have already been written. The vast vast majority of people are not writing audio codecs or whatever.


> Those are libraries that have already been written.

Written using SIMD.

> The vast vast majority of people are not writing audio codecs or whatever.

Or HPC, or finance, or json parsing, or PDE solving, or gaming....

All off these (and more) benefit from AVX512. Why are you going so far out of your way to be dismissive of this?


Unfortunately Rust is way behind on getting SIMD intrinsics out there. Anything AVX512 is nightly (unstable) only still.


My point is you can't really distribute a binary which targets it or rely on its effects. It's not common enough yet.

And autovectorization is nice, but explicit SIMD intrinsics use usually wins if done right.


Why not distribute a binary that uses AVX-512 if present, and falls back to AVX2 or SSE4 if not? It can be quite easy, see the example earlier in this thread.
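
And even without a library, the hand-rolled version is short. A sketch (gcc/clang; the builtin and the target attribute are real, the kernel itself is just a toy):

    #include <cstddef>

    // Same loop built three times; the attribute lets one file hold all variants.
    __attribute__((target("avx512f"))) static void scale_avx512(float* v, std::size_t n, float k) {
        for (std::size_t i = 0; i < n; ++i) v[i] *= k;
    }
    __attribute__((target("avx2"))) static void scale_avx2(float* v, std::size_t n, float k) {
        for (std::size_t i = 0; i < n; ++i) v[i] *= k;
    }
    static void scale_baseline(float* v, std::size_t n, float k) {
        for (std::size_t i = 0; i < n; ++i) v[i] *= k;
    }

    // Pick the best variant at runtime, once the CPU is known.
    void scale(float* v, std::size_t n, float k) {
        if (__builtin_cpu_supports("avx512f")) return scale_avx512(v, n, k);
        if (__builtin_cpu_supports("avx2"))    return scale_avx2(v, n, k);
        return scale_baseline(v, n, k);
    }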


Yes, well, in my case I don't have a machine to even test AVX512 on, which sorta speaks to the problems inherent with AVX512, doesn't it -- buy a high-end laptop in 2023 and still don't have AVX512 on it. Which is what led to my original comment here.

If I was getting paid $$ for this work, sure, I'd rent cloud instances or hardware to do that development. But it presents a dilemma for open source work.

Anyways, it's all griping. We'll either eventually all get AVX512, or it will die and some other more common wide vector extension will take its place, or we'll all be having the same gripe about NEON or RISC-V V extensions 10 years from now.

It's just frustrating, because some of the nice toys in AVX-512, in particular the nice support for bitmasking, would make the code I'm writing much nicer & faster.
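
For anyone wondering what I mean by the bitmasking: a sketch of the kind of helper AVX-512 makes trivial (hypothetical function, needs AVX512BW, i.e. -mavx512bw):

    #include <immintrin.h>
    #include <cstdint>

    // Compare 64 bytes at once against a needle and get the result back as a
    // plain 64-bit mask, ready for tzcnt/popcnt.
    std::uint64_t match_mask64(const void* p, std::uint8_t needle) {
        const __m512i chunk = _mm512_loadu_si512(p);
        const __m512i splat = _mm512_set1_epi8((char)needle);
        return _mm512_cmpeq_epi8_mask(chunk, splat);  // __mmask64: 1 bit per match
    }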


Godbolt will run simple programs for you. And you can choose whatever compiler you want to use very easily. Just make sure you enable 'Output... > Run the compiled output'. I think there's some randomness about what cloud instance your code gets run on, because sometimes it fails due to not having all the necessary instructions.

> Anyways, it's all griping. We'll either eventually all get AVX512, or it will die and some other more common wide vector extension will take its place, or we'll all be having the same gripe about NEON or RISC-V V extensions 10 years from now.

I think we'll get a unified instruction set. Part of AVX512's difficulty is that it's not just AVX512 or not. It's AVX512F and/or AVX512BW and/or AVX512VL ...x10


I understand it's much more convenient to develop on the machine that has it all.

Further to the Godbolt suggestion, one band-aid is that Highway fairly efficiently emulates some of the fancier AVX-512 instructions such as CompressStore. You could then develop on AVX2, then rent 1 VCPU-hour to build and verify it indeed works on AVX-512.

As for the subsets of AVX-512, we defined them into groups matching Skylake and Icelake; that works pretty well. Zen4 would also support the Icelake features, but it gets its own target so that we can special-case/avoid the microcoded and super-slow CompressStore there.


What do you think the future of C++ is with vectorization? I keep thinking that the language is too far down the rabbit hole of scalar guarantees to ever allow for us to only rely on autovectorization. Maybe we can add some language features like "false path is defined behavior if statements" and bit-sized bools. But then we'd need people to totally change the way they code. Maybe we need a new, non-scalar successor language? Maybe libraries can take care of it?

I've been using Highway for a couple weeks (I'm the guy who is writing the unroller feature). Highway is more limited in its breadth than raw intrinsics. I've noticed a few instances so far where you had to make a judgement call, and decided not to have Highway expose certain features (like 32 bit indexing into 64 bit type scatter/gather). And with more things that x86/ARM throw at us, the harder I think it becomes for a library to be the solution. From what I've seen of std::simd, I don't see how that possibly can be the solution. What do you think?
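
For concreteness, the std::simd I'm referring to is the Parallelism TS std::experimental::simd; a minimal loop with it looks roughly like this (from memory, assuming libstdc++'s <experimental/simd>):

    #include <experimental/simd>
    #include <cstddef>

    namespace stdx = std::experimental;

    // Scale a buffer in place at the native SIMD width, with a scalar tail.
    void scale(float* v, std::size_t n, float k) {
        using V = stdx::native_simd<float>;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size()) {
            V x;
            x.copy_from(v + i, stdx::element_aligned);  // element-aligned load
            x *= k;                                     // broadcast multiply
            x.copy_to(v + i, stdx::element_aligned);
        }
        for (; i < n; ++i) v[i] *= k;                   // leftovers
    }

Vertical arithmetic like that is covered; it's once you need the less straightforward ops that you end up back at intrinsics anyway.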


Oh, hi :) I like your unroller idea, it can simplify user code and is quite general.

Agree about autovectorization. It is not even a true programming model, because we have only limited ability to influence results.

Also agree std::simd is far too limited (something like 50 ops, mostly the straightforward ones, vs >200 for Highway), and difficult to change/extend within the ISO process.

It is very difficult to get widespread traction with a new language, even given LLVM. Mojo, Carbon, Zig are also already potentially helpful.

I do believe a library approach (and in particular Highway, because considerable effort is required to maintain support for so many targets/compiler versions and AFAICS nothing else properly supports RISC-V and SVE) is the way to go for the next 5 years. Major compiler update cycles are something like 1.5-2 years and I don't think there will be a fundamental shift anytime soon towards RL, for example. After those 5 years, the future remains to be written :)

As to missing features: we are happy to add ops whenever there is sizable benefit for some app, and it doesn't hurt other targets. For mixed-type gather, x86 is the only platform that does this, so encouraging its use would pessimize other platforms. And I think apps can easily promote/demote their indices to match the data size. But always happy to discuss via Github issues :)


Highway looks nice. I'm in Rust these days, and SIMD support there is remarkably terrible in comparison. And I'm not nearly close enough to being an expert on the subject to contribute anything upstream to improve that. AVX512 is available only if you compile against nightly still.

https://github.com/arduano/simdeez looks like it's trying to fit into this space, fairly promising.


:) Thanks for the pointer. Glad to see simdeez now has a maintainer again. I believe its (mostly) 'abstracting SIMD width' feature is important and helps nudge people in a good direction.


I played with it last night and it does not seem to support many integer intrinsics (e.g. a wrapping of _mm_set1_epi8 and _mm_cmpeq_epi8 seems not to be present), so isn't useful for my purpose. :-(
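
For context, the pattern I'm after is just this (shown with the raw SSE2 intrinsics; the Rust core::arch equivalents have the same names):

    #include <emmintrin.h>
    #include <cstdint>

    // Splat a byte, compare 16 lanes at once, collapse to a bitmask.
    std::uint32_t match_mask16(const void* p, std::uint8_t needle) {
        const __m128i chunk = _mm_loadu_si128(static_cast<const __m128i*>(p));
        const __m128i eq = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)needle));
        return (std::uint32_t)_mm_movemask_epi8(eq);  // 1 bit per matching byte
    }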



I was talking about simdeez, not highway -- but I did actually find tip of tree on simdeez worked for me.


Ahh. Okay. Good luck. It's a fun rabbit hole.


Yes but his point still stands.


He said "Introduced in 2017, and still totally unusable in the wild." I understand his frustrations with his own hardware, but he's using lower-end consumer hardware. We could quibble about what it means to be "in the wild," but I operate in a compute-heavy space where companies buy good hardware, and they all make effective use of AVX-512 instructions.


I love thinkpads as much as the next person, but pulling in mobile cpus when the topic is massively multicore chips is extremely off topic at best.


Parent comment is the one who brought consumer CPUs into the discussion


With a 2U, 4-node, dual-socket chassis and 128-core CPUs, you get up to 1024 cores or 2048 vCPUs per server machine. We can now all enjoy our own cloud.


The amount of scientific work that could be done with such a machine would be staggering, especially if it can be local. I use 2-10k cores on a daily basis; having a dedicated machine with 1k would make debugging large-scale problems much faster.


> 2k-10k cores on a daily basis

could you tell more please? sorry, not in the loop on what a typical “scientific workflow” is like


Large-scale simulations of physical systems tend to have a variety of length/time scales that require vast numbers of CPUs to model in a timely fashion. This could be anywhere from a few hundred CPUs to tens of thousands. Fluid dynamics, solid mechanics, fluid-structure interactions, electrohydrodynamics, etc. simulation tools can all benefit from a larger number of highly interconnected CPUs; we can simply resolve engineering problems better, and understand them better, with more computational resources.


what language/frameworks orchestrate work across 10k cores? is that like, 10k threads? 10k processes?


It is not a single language. C, C++, and Fortran are the most commonly used. The communication happens using one of many implementations of the Message Passing Interface [0] (MPI). MPI essentially spawns child processes and sets up communication between nodes using whichever interconnect is available (usually InfiniBand).

[0] https://en.m.wikipedia.org/wiki/Message_Passing_Interface
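
For a feel of the model, a toy MPI program (C API; every rank is its own process, often on a different node, so `mpirun -n 10000 ./a.out` launches ten thousand of them):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // who am I
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many of us

        double local = 1.0 / (rank + 1);        // stand-in for this rank's piece of the domain
        double total = 0.0;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) std::printf("%d ranks, sum = %f\n", size, total);
        MPI_Finalize();
        return 0;
    }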


While melting the wires in your walls. How much juice would that pull when under full load?


3 kW


^ roughly how much a clothes dryer uses, for comparison.

A lot of power, but nothing really abnormal. Most rooms in a house won't be wired for it but that's about all. You probably wouldn't want this much thermal output in a normal room anyway, not to mention fan noise.


It could also double as a clothes dryer (put a cabinet with clothing racks on next to the server rack, pipe the hot air into it).

Honey, can you queue up a large simulation? Got a full load to dry again.


In most parts of Europe the standard room socket is rated for >= 3.6 kW.


There are no 4P motherboards for AMD EPYC processors; they're all 2P because of the number of interconnects per server CPU.


The OP was confusingly suggesting a 2U box with four distinct computers inside (4 node), each with two sockets. These types of systems do exist, and can help with volume density, but then you have to work a lot harder with regards to power and heat density.

I don't know what things look like now, but I recall hearing many stories a decade ago about datacenters running out of power and cooling when the volume was closer to 1/3rd full.


I understand your happiness, but using such a vague word as "cloud" is doing a disservice to this setup


It's fascinating how they managed to reclaim so much die area by dropping the cache and optimising for low (~3GHz) frequency.


I think we will be seeing this more and more (less cache on the die OR stacked cache on top of the main chip instead of in it), since SRAM scaling is now near a halting point [1] (no more improvements), which means the fixed cost of cache is going up with every new node.

[1] https://semiwiki.com/forum/index.php?threads/tsmc-officially...


It's mostly the cache area rather than the frequency, right? Cache just takes a huge amount of space on the chip.


From what I've read, the L1/L2 cache per core is the same, and the L3 cache per chiplet is the same, but the core count per chiplet is doubled and the overall chiplet size is about the same (it's a little bigger, I think).

So, the L3 cache didn't get smaller (in area or bytes), there's just less of it per core. L1/L2 is relatively small, but they did use techniques to make it smaller at the expense of performance.

I think the big difference really is the reductions in buffers, etc, needed to get a design that scales to the moon. This is likely a major factor for Apple's M series too. Apple computers are never getting the thermal design needed to clock to 5Ghz, so setting the design target much lower means a smaller core, better power efficiency, lower heat, etc. The same thing applies here: you're not running your dense servers at 5Ghz, there's just not enough capability to deliver power and remove heat; so a design with a more realistic target speed can be smaller and more efficient.


Semianalysis seems to indicate the core space itself is 35% smaller. Maybe that's process tweaks related to clock rate? But I don't think we know for absolute certain it really is the same core with the same execution units & same everything. Even though a ton of the listed stats are exactly the same.


Something tells me that Teams still won't run properly using this.


If you want to see something really amazing, look at the CPU usage to display one high frame rate animated gif posted in a slack channel on a modern quad core 11/12/13th gen Intel CPU laptop. Chat clients were not meant to take 70% of available CPU. You can literally hear the fans speed up.

Teams is bad, yeah, but does not have an entire monopoly on being terribly inefficient.


> modern

> quad core

pick one


Intel Core i3 processors are still quad-core, even in the latest generation. Not everyone has an i7.


Modern would probably refer to generation, not core count.

However, I wish to all 6+ core Intel laptop owners a very happy 65 decibel exhaust fan when opening your 3rd chrome tab.


Is everyone on HN running their computers in some parallel universe where apps are running 10X slower than my computer?


No, but on the other hand, I do have some glimmer of hope that eventually someone at Microsoft will see these kinds of comments and do something to fix that abomination. I have no doubt Microsoft employs many very talented people, but none of them seem to be working on the user facing products which make my corporate laptop run so poorly (Teams or OneDrive).


They've transitioned the consumer client from Electron to Edge WebView, Angular to React, and the same for the enterprise application appears to be in public preview and slated for general availability later this year. Still web tech but it's supposed to address some of the memory and performance problems.

So I think they've known how unpleasant Teams is!

https://www.microsoft.com/en-us/microsoft-365/blog/2023/03/2...


T2 is much worse than version 1, I'm sorry to say, because I really really want to like Teams since I'm forced to use it for work.


Don't tell me that, I want to believe in the dream. :/


Highlighting the 9 second startup time does not bolster confidence.


Yes, appreciate the transparency, but it's kind of embarrassing "22 second launch time" was considered acceptable for so long, and "more than twice as fast" now means "almost a 10 second launch time".


Yeah it is something but as an app that usually runs at startup, I'd just be pleased if it can snap a little faster and be less janky. Fingers crossed.


To be fair, it has already improved a lot. They are doing something whether they see your comment or not.


Teams is hot garbage, it feels like it was designed to be terrible at everything it does. Everything feels laggy in it compared to Slack/Zoom.


You're kidding, right? I mean, Teams is not great, but compared to the UI rubbish of Zoom and Slack?

Take, for example, the whole business of screensharing: if you share a single screen in Zoom you get black blocks for all the Zoom windows. Yes, it makes sense that I don't share the window with participants, but at least let me freaking close it. Similarly, moving the controls from the bottom to the top of the screen reliably confuses early (and even more advanced) users.

And don't get me started on quoting text or including math in slack and what is the whole threads section?!


It feels like a quarter of the time screenshare doesn't even work in Teams, and it's noticeably choppier when it does work.

Code block support in Teams is inconsistent, and it seems to add extra spaces when copying.

Quoting & mentions also suck. The channels, chats, and activity screens are always a mess; they feel like a UX band-aid.

Private channel limits are garbage.

Search and discovery sucks.

There's inconsistent options for the calendar between Outlook & Teams. Don't you dare expect functional compatibility between MS products. Scheduling assistant sucks, sometimes confuses itself when people are busy.

The UX seems to rely on a myriad of nested menus & modals that are XHR backed, so each one lags. Even just typing text feels slow. It's as fluid as molasses.


> and it seems to add additional spaces when copying

It regularly adds _fake spaces_ to code blocks. People paste working code/SQL and find all the spaces replaced with mysterious invisible UTF-8 characters that only look like spaces.


Not according to their own advertising: https://www.youtube.com/watch?v=CT7nnXej2K4


I think the problem is that using well optimised programs (Sublime text for example) on modern computers feels unbelievably fast.

Using something like Teams by comparison feels like molasses.


This doesn’t add any more value to the conversation than Reddit’s “but will it run Crysis?” comments.


Ok then. Steering away from Redditisms, let's make this into a deep reflection upon the state of modern application software that's consuming hardware resources faster than hardware can improve. We have these incredible CPUs yet our machines offer barely the same experience they did twenty years ago and even succeed in getting strictly worse as in Teams case. The newer the software, the more wasteful implementation we get, our performance budget spilled and absorbed by layers upon layers of abstractions. Of all this immense computing power, I can only feel the difference when compiling stuff and other specialized activities.


it didn't but it made me smile :)


I'm extremely curious if this could run Crysis with software rendering, actually...


360W per CPU is less than 3W per core, not bad.


Cores won't draw much while blocking on a memory access. If the core count vs core utilization trade-off is a tie, the napkin math expectation would be same throughput, same power draw of cores, less power draw of cache for less cache.


We're going to be using a lot of these in a new hardware SKU at my employer sooner or later (it's in the works). We software nerds get to figure out the scaling problems on these big CPUs that don't manifest on the smaller 18- and 24-core CPUs we're using today.


We'll approach it like a network-on-a-chip: 128 servers with an extremely fast network. Each server will be a process. One process will be something like ingress, another like a message queue and the rest will be workers. Chances are the chip doesn't need or want to be saturated, so you rotate the cores using Linux magic of some kind.

Put that on a napkin and give it to an unsuspecting intern.


Certainly the closer you can get to shared-nothing, the better! Some of the difficulty is we have a legacy stack that is not implemented that way. (You also need to account for significant kernel CPU use, like hardware interrupt handlers (e.g., NIC rx) running on some cores.)


> You also need to account for significant kernel CPU use, like hardware interrupt handlers (e.g., NIC rx) running on some cores.

In a lot of loads, it doesn't matter that much, because the application uses way more CPU than packet handling... But if it does matter, you really want to get things lined up as much as possible. Each CPU core gets one NIC queue pinned and one application thread pinned and the connections correctly mapped so the fast path never communicates to other cores. I'm not tuned into current NICs though, I don't know if you can get 128 queues on NICs now. If you have two sockets, then you also have the fun of Non-Uniform PCI-E Access...

When I was working on this (for a tcp mode HAProxy install), the NICs did 16 queues, and I had dual 12 or 14-core CPUs, so socket 0 got all its cores busy, and socket 1 just got a couple work threads and was mostly idle. Single socket, power of 2 cores is a lot better to optimize, but I had existing hardware to reuse.
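
The application side of that pinning is small; a Linux-specific sketch (the NIC queue side is handled separately, e.g. via ethtool and the IRQ affinity files under /proc/irq/):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Restrict the calling thread to a single core so it stays next to the NIC
    // queue (and cache) it is paired with. Returns 0 on success.
    int pin_current_thread(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }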


Re: speed, you'd likely get orders of magnitude better aggregate bandwidth out of a real network of 128 servers instead, though latency would be lower on the single chip.


Anyone have some napkin math on how many QPS something like this can handle??


Somewhere between 1 and 10e7


The theoretical floor is probably zero, not one. I agree with the rest of the range :-).


Where can we buy?


I wonder why AMD decided to name chips after Italian cities...


Geographical names can't be trademarked, so they're a safe choice. Before that they used names of various rivers around the world. Intel does something similar, but they make up fictional names based on real places. Nvidia uses names of famous scientists.


Probably they selected a theme with a bunch of names to draw from, so that the series would be consistent without running into any trademark issues. Same reason Intel picks random words and appends "Lake".


I wonder if Zen 4c is going to trickle down to the consumer market. There are multiple possibilities for CPUs with mixed chiplets but who knows if they really make sense outside the server world. It's exciting


There was a very detailed post with die shots etc on semi analysis a few weeks back https://www.semianalysis.com/p/zen-4c-amds-response-to-hyper...

In the subscriber only section it was mentioned there will be some Zen5 consumer parts using Zen5c as the equivalent of current Intel E-cores.


I think it is unfair to the semianalysis team to share information from the subscriber only section.


Nothing unfair honestly.


Why? The article has a paywall about half way through.


Because for them, subscribers are low-volume. The subscriber-only section is generally pretty small and shares a couple of key tips or viewpoints.

If there was nothing wrong with it OP wouldn't feel the need to use a throwaway account.


I made a throwaway as I can't remember my login and wanted to highlight the SemiAnalysis article due to the amazing detail there.

There is a lot more detail in the article on the possible 4c/5c uses, it's been speculated in a lot of contexts for the hybrid configurations (even by AMD directly) so don't think it's surprising to mention this at all.


A 24-core Ryzen with an 8-core Zen 4 + stacked 3D SRAM (V-Cache) chiplet and a 16-core Zen 4c chiplet might be quite interesting. There are probably already a few in AMD's labs ;)


Yeah, that would destroy the 14900K.


Wow, do CPUs have that capability now? To actually destroy other CPUs? How does that work? Do they get the "Annihilate.Competition" instructions set in an update when they detect a Intel machine in the network?/s


Have you never heard of the Intel C compiler's cripple AMD function?

https://www.agner.org/forum/viewtopic.php?t=6


Now when can we get the same through Threadripper? I'd love to use my desktop as both my main gaming rig as well as my productivity rig.


Zen 4c isn't good for desktop or workstation apps where you occasionally need single-thread performance. Threadripper Pro 7000 will be Genoa.


If you're okay with "only" 64 cores you can do this now. It's my current rig, used for both work and play.



