Last I checked, there aren't any Xeons with E-cores. However, you're right: the most direct competitor to Bergamo seems to be Sierra Forest, a chip that will have nothing but E-cores (and no AVX-512).
The Intel "Sierra Forest" Xeon CPUs with E-cores are announced for 2024.
Even though they are branded Atom rather than Xeon, there already are a lot of Intel server CPUs with E-cores, e.g. Denverton Refresh, Snow Ridge, Parker Ridge, and the recently launched Arizona Beach.
They’re fools not to revive that or something like it.
Not only would something like 1024 cores with AVX512 be competitive with GPUs, it would have the added advantage of being waaay more versatile and easier to program.
That versatility comes at a cost though - nothing in silicon is free. It clearly couldn't compete with GPUs in their strengths, and couldn't find its own niche, which is why it was cancelled.
And I'm not sure it'll be that much easier to program - "few large cores" and "GPU waves" are now pretty well supported in the stack with mature tooling. Trying to insert a new stack between the two would likely be pretty difficult, as it doesn't really fit well with either established paradigm, so it likely needs something new to show its benefits.
Did you ever try to use the Xeon Phi? It was hideous to optimize for.
The cores were in order. If you've never written code for in-order cores, you may not appreciate exactly how heavily you're leaning on out-of-order execution to save your bacon. Things you're used to just working suddenly don't.
The cores had a 2D layout, but Intel refused to tell you what it was, and in practice it was impossible to optimize for. You ended up with variable (as in different on every node) latency for memory access that made programming painful again (and add to that the fact that the cores are in order, so you really can't work around the latency).
Then there was the software stack, which was just odd in some ways. I was researching I/O throughput, and things that just work on any other CPU (on any modern ISA) just didn't behave in a sane way.
So yeah, I was not at all sad when Intel cancelled it.
I actually don't think that would be easier to program. The current thread abstraction won't be efficient with 1024 cores. OpenCL is much better suited for that. But then you might as well run it on a GPU.
It is not easy to program Xeon Phi if you want to get performance. It is easy to run any application on it, but it is also easy to run it slower than a cheaper Xeon with a much lower core count.
It was a big experiment, and it failed. Some of the brightest people tried to push it to the limit. In my circle, I have never heard anyone say they miss Xeon Phi now that it is discontinued.
I won a Xeon Phi programming manual at an Intel talk; I can assure you that even though it's easy to program, it is absolutely not easy to optimise. The amount of compiler+profiler+cilk+… tooling Intel had to provide was impressive.
> A chip with hundreds or thousands of E cores and AVX512 may be competitive with GPUs for AI.
A chip with hundreds or thousands of E-cores should be ideal for any cloud provider to offer vCPUs with a high markup. For example, a company like Hetzner sells 1 vCPU services for around 4€/month. Even if they don't overprovision and allocate 1 core per vCPU, a single 128-core chip can potentially earn them around 500€/month (128 × 4€ ≈ 512€), which puts the break-even point at around 2 years.
> AVX-512 is still not present in the chips with E-cores.
Why would they put a dig at Intel’s consumer chips in a slide deck for their server parts? The Intel Xeons don’t have E-cores. This doesn’t make any sense unless I’m missing something.
Also, you know AMD has recent consumer chips without AVX-512, right?
And even if they weren't, it would still be a cheap PR win.
>Also, you know AMD has recent consumer chips without AVX-512, right?
All the current generation processors (Zen4) support AVX-512, both mobile and desktop. It may be confusing because AMD's numbering scheme is intentionally misleading and they sell previous gen chips with new model numbers.
There are many kinds of Intel server CPUs with E-cores, with up to 24 E-cores, which are branded Atom, not Xeon, and which are less known, because they are mainly intended to be integrated in various telecommunication systems, e.g. in mobile phone infrastructure.
Intel server CPUs with E-cores that will use the Xeon brand, because they will have many more cores than the current models (six times more, i.e. 144 vs. 24), are announced for 2024.
They don't have a lot to brag about there. I'm on an expensive Thinkpad with an AMD Ryzen 7 PRO 6850H, less than a year old... no AVX512. Chip was released Apr 2022, with AVX&AVX2 but no 512. (Also completely garbage unstable and unreliable driver & firmware support from both AMD and Lenovo, but that's another story).
I've been messing with SIMD optimizations recently in a data structure library I wrote, and as tantalizing as AVX-512 is, it'll be years before it can be used in production software on a real scale. Introduced in 2017, and still totally unusable in the wild.
> Also completely garbage unstable and unreliable driver & firmware support from both AMD and Lenovo, but that's another story
As a counterpoint, I'm on a P14s Gen 2 ThinkPad (Ryzen 7 PRO 5850U) and have zero issues on EndeavourOS (Arch, KDE/Plasma). Most of my time is in VSCode, SublimeMerge, Chrome, MPV.
I did have to rip out the Realtek (or MediaTek?) M.2 wifi card and swap in an Intel one, though.
Lenovo straight up broke the USB-C ports support for external displays (and other things) in the latest BIOS update for my machine -- for both Windows and Linux -- and there's no way to roll back and there's not a peep from them on any kind of attempt to fix or even acknowledgement of a problem.
Requires a hardware ("pinhole" not power cycle) reset every time I want to switch between external monitor use and laptop only. My bug report/forum post here: https://forums.lenovo.com/t5/ThinkPad-Z-series-Laptops/Exter... -- other people are effected.
Last year they pushed out a BIOS update which caused the fan to run 100% of the time. Then a "fix" which caused it to overheat. And in-between there somewhere something was pushed that caused me to have to reformat the whole thing.
Lid-open-to-awake in Linux used to simply not work 75% of the time and required a hard reset. Now it works, but there's a 5-10 second pause before the display turns on (Windows or Linux) vs. almost instant on the work-issued laptops I've had recently (MBP M1 and some HP Z series thing, running Linux).
Waste of cash. Bought this thing to be my dev workstation when I took a contract job last summer, wish I'd never done that. Software quality / support at Lenovo is a real problem.
My ThinkPad Tablet has had issues after a bad partial firmware update. It won't stay on past 30 min. It is really souring my former respect for them. I understand it is partly TPM issues, but now I have basically a semi-brick. I was able to move the whole OS to a VM. I'll try Linux and see if it was a Windows-BIOS interaction bug. And if that doesn't fix it, I'll wait six more months and see if it gets a working update. But my former respect for Lenovo is now gone.
If you're using an LLVM-compiled language targeting a new AVX-512-enabled architecture, you're already using AVX-512 instructions to vastly accelerate your programs via autovectorization.
Hardly. Autovectorization is rarely performed as a percentage of all the cases where it’s theoretically possible, not reliably done (no guarantees - even if it AVs today, the same code may not AV tomorrow), and not something you want your actual hot path to rely on. AV is nice for “incidental” speedups to general code. No serious library touting SIMD acceleration will consider AV sufficient for the core hot loop.
> Autovectorization is rarely performed as a percentage of all the cases where it’s theoretically possible,
This is not my experience. Do you have some justification for this statement? I've found myself far more impressed by autovectorization than disappointed by it. I've found that code most people think is autovectorizable actually violates scalar contracts. But if you write clear code whose scalar implementation won't introduce UB if autovectorized, the compiler is really good.
Here's my go-to example. Agner Fog's VCL is a well-respected library. It has vectorized versions of all sorts of useful mathematical functions. For fun, I rewrote his `exp(x)` function using his exact algorithm, but in scalar code, and it autovectorized and benchmarks the same.
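As a rough illustration of the pattern (this is not the VCL algorithm or the parent's actual code, just a low-order polynomial approximation of exp for small x), a cleanly written scalar loop like the following typically autovectorizes with gcc/clang at -O3 and an AVX2/AVX-512 -march setting, because there are no branches, no cross-iteration dependencies, and plain array indexing:

    #include <cstddef>

    // Illustrative only: Horner evaluation of 1 + x + x^2/2! + ... + x^5/5!.
    static inline float exp_poly(float x) {
      float r = 1.0f / 120.0f;
      r = r * x + 1.0f / 24.0f;
      r = r * x + 1.0f / 6.0f;
      r = r * x + 0.5f;
      r = r * x + 1.0f;
      r = r * x + 1.0f;
      return r;
    }

    // Pure per-element arithmetic: the compiler can turn this into one SIMD loop.
    void exp_array(const float* in, float* out, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) {
        out[i] = exp_poly(in[i]);
      }
    }

Checking the generated assembly (e.g. on Godbolt) is still worthwhile, since whether this vectorizes depends on the compiler flags and version.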
Nobody distributes binaries compiled for AVX-512 for obvious reasons. The only times you'll use it are for source-based distros like Gentoo, and programs that do run time feature detection, but they are relatively rare and won't be using autovectorisation.
Runtime feature detection need not be rare nor hard, it's a few dozen lines of boilerplate. You can even write your code just once: see https://github.com/google/highway#examples.
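For a sense of what that boilerplate looks like without Highway's foreach_target machinery (which additionally compiles the same source once per target for you), here's a minimal sketch using the GCC/Clang builtins. The kernel names and the trivial loop are hypothetical placeholders, not anyone's real code:

    #include <cstddef>

    // Per-target kernels. __attribute__((target(...))) lets one translation unit
    // hold all of them; only the one selected at runtime is ever executed.
    __attribute__((target("avx512f")))
    void scale_avx512(float* d, std::size_t n, float f) { for (std::size_t i = 0; i < n; ++i) d[i] *= f; }

    __attribute__((target("avx2")))
    void scale_avx2(float* d, std::size_t n, float f) { for (std::size_t i = 0; i < n; ++i) d[i] *= f; }

    void scale_scalar(float* d, std::size_t n, float f) { for (std::size_t i = 0; i < n; ++i) d[i] *= f; }

    using ScaleFn = void (*)(float*, std::size_t, float);

    // Resolve once at startup using the CPU-feature builtins, then call through
    // the chosen pointer everywhere.
    ScaleFn pick_scale() {
      __builtin_cpu_init();
      if (__builtin_cpu_supports("avx512f")) return scale_avx512;
      if (__builtin_cpu_supports("avx2"))    return scale_avx2;
      return scale_scalar;
    }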
That's still hard, given the tiny benefit it would give to 99% of code.
Maybe modern tooling (e.g. Cargo) will lower the barrier so that it becomes less rare, but in C++ it's definitely not worth the effort for the vast vast vast majority of projects.
Wow, I find "vast vast vast" hard to believe.
If we don't have loops where a lot of CPU time is spent, why even write in C++?
I admittedly focus more on libraries, but quite a large number of them are already vectorized. I would venture that a sizable fraction of CPU time, even in 'normal' non-HPC context, uses SIMD indirectly. Think image/video decompression, browser handshakes/rendering, image editing, etc.
> If we don't have loops where a lot of CPU time is spent, why even write in C++?
Because it isn't just hot or autovectorisable loops that are faster in C++; everything is faster. Function calls, member accesses, arithmetic, etc. Even loops are normally not very hot and not autovectorisable.
You're right that things like audio/image processing, compression, etc. benefit from SIMD, but that is in the 1%. Those are libraries that have already been written. The vast vast majority of people are not writing audio codecs or whatever.
Why not distribute a binary that uses AVX-512 if present, and falls back to AVX2 or SSE4 if not? It can be quite easy, see the example earlier in this thread.
Yes, well, in my case I don't have a machine to even test AVX512 on, which sorta speaks to the problems inherent with AVX512, doesn't it -- buy a high-end laptop in 2023 and still don't have AVX512 on it. Which is what led to my original comment here.
If I was getting paid $$ for this work, sure, I'd rent cloud instances or hardware to do that development. But it presents a dilemma for open source work.
Anyways, it's all griping. We'll either eventually all get AVX512, or it will die and some other more common wide vector extension will take its place, or we'll all be having the same gripe about NEON or RISC-V V extensions 10 years from now.
It's just frustrating that some of the nice toys in AVX-512, in particular the nice support for bitmasking, would make the code I'm writing much nicer & faster.
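For readers unfamiliar with what that bitmasking looks like, here's a small, self-contained illustration (not the parent's code): an AVX-512 compare yields a plain 16-bit mask (__mmask16) instead of a vector of 0/-1 lanes, so "how many lanes matched" is just a popcount. Compile with e.g. -march=skylake-avx512:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Count the int32 elements greater than a threshold using AVX-512F mask registers.
    std::size_t count_greater(const std::int32_t* data, std::size_t n,
                              std::int32_t threshold) {
      const __m512i thr = _mm512_set1_epi32(threshold);
      std::size_t count = 0, i = 0;
      for (; i + 16 <= n; i += 16) {
        const __m512i v = _mm512_loadu_si512(data + i);
        const __mmask16 m = _mm512_cmpgt_epi32_mask(v, thr);  // 1 bit per lane
        count += static_cast<std::size_t>(_mm_popcnt_u32(m));
      }
      for (; i < n; ++i) count += (data[i] > threshold);       // scalar tail
      return count;
    }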
Godbolt will run simple programs for you. And you can choose whatever compiler you want to use very easily. Just make sure you enable 'Output... > Run the compiled output'. I think there's some randomness about what cloud instance your code gets run on, because sometimes it fails due to not having all the necessary instructions.
> Anyways, it's all griping. We'll either eventually all get AVX512, or it will die and some other more common wide vector extension will take its place, or we'll all be having the same gripe about NEON or RISC-V V extensions 10 years from now.
I think we'll get a unified instruction set. Part of AVX512's difficulty is that it's not just AVX512 or not. It's AVX512F and/or AVX512BW and/or AVX512VL ...x10
I understand it's much more convenient to develop on the machine that has it all.
Further to the Godbolt suggestion, one band-aid is that Highway fairly efficiently emulates some of the fancier AVX-512 instructions such as CompressStore. You could then develop on AVX2, then rent 1 VCPU-hour to build and verify it indeed works on AVX-512.
As for the subsets of AVX-512, we defined groups matching Skylake and Icelake; that works pretty well. Zen4 would also support the Icelake features, but it gets its own target so that we can special-case/avoid the microcoded and super-slow CompressStore there.
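For context, portable use of the CompressStore op mentioned above looks roughly like the sketch below. This is written from memory of Highway's documentation, so the exact op names/signatures may differ slightly; treat it as an illustration, not Highway's documented example. On AVX-512 the op maps to the native compress instruction; on AVX2 it is emulated:

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // Copy only the positive floats from `in` to `out`, returning how many were written.
    std::size_t CopyPositive(const float* in, std::size_t n, float* out) {
      const hn::ScalableTag<float> d;   // full native vector of float
      const std::size_t N = hn::Lanes(d);
      std::size_t written = 0, i = 0;
      for (; i + N <= n; i += N) {
        const auto v = hn::LoadU(d, in + i);
        const auto keep = hn::Gt(v, hn::Zero(d));
        written += hn::CompressStore(v, keep, d, out + written);
      }
      for (; i < n; ++i) {              // scalar tail
        if (in[i] > 0.0f) out[written++] = in[i];
      }
      return written;
    }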
What do you think the future of C++ is with vectorization? I keep thinking that the language is too far down the rabbit hole of scalar guarantees to ever allow for us to only rely on autovectorization. Maybe we can add some language features like "false path is defined behavior if statements" and bit-sized bools. But then we'd need people to totally change the way they code. Maybe we need a new, non-scalar successor language? Maybe libraries can take care of it?
I've been using Highway for a couple weeks (I'm the guy who is writing the unroller feature). Highway is more limited in its breadth than raw intrinsics. I've noticed a few instances so far where you had to make a judgement call, and decided not to have Highway expose certain features (like 32 bit indexing into 64 bit type scatter/gather). And with more things that x86/ARM throw at us, the harder I think it becomes for a library to be the solution. From what I've seen of std::simd, I don't see how that possibly can be the solution. What do you think?
Oh, hi :) I like your unroller idea, it can simplify user code and is quite general.
Agree about autovectorization. It is not even a true programming model, because we have only limited ability to influence results.
Also agree std::simd is far too limited (something like 50 ops, mostly the straightforward ones, vs >200 for Highway), and difficult to change/extend within the ISO process.
It is very difficult to get widespread traction with a new language, even given LLVM. Mojo, Carbon, and Zig are also already potentially helpful here.
I do believe a library approach (and in particular Highway, because considerable effort is required to maintain support for so many targets/compiler versions and AFAICS nothing else properly supports RISC-V and SVE) is the way to go for the next 5 years. Major compiler update cycles are something like 1.5-2 years and I don't think there will be a fundamental shift anytime soon towards RL, for example. After those 5 years, the future remains to be written :)
As to missing features: we are happy to add ops whenever there is sizable benefit for some app, and it doesn't hurt other targets. For mixed-type gather, x86 is the only platform that does this, so encouraging its use would pessimize other platforms. And I think apps can easily promote/demote their indices to match the data size. But always happy to discuss via Github issues :)
Highway looks nice. I'm in Rust these days, and SIMD support there is remarkably terrible in comparison. And I'm not nearly close enough to being an expert on the subject to contribute anything upstream to improve that. AVX512 is available only if you compile against nightly still.
:) Thanks for the pointer. Glad to see simdeez now has a maintainer again. I believe its (mostly) 'abstracting SIMD width' feature is important and helps nudge people in a good direction.
I played with it last night and it does not seem to support many integer intrinsics (e.g. a wrapping of _mm_set1_epi8 and _mm_cmpeq_epi8 seems not to be present), so it isn't useful for my purpose. :-(
He said "Introduced in 2017, and still totally unusable in the wild." I understand his frustrations with his own hardware, but he's using lower end consumer hardware. We could quibble about what it means to be "in the wild," but I operate in a compute heavy space where companies buy good hardware, and they all make effective use of AVX512 instruction.
The amount of scientific work that could be done with such a machine would be staggering, especially if it can be local. I use 2-10k cores on a daily basis; having a dedicated machine with 1k would make debugging large-scale problems much faster.
Large-scale simulations of physical systems tend to have a variety of length/time scales that require vast numbers of CPUs to model in a timely fashion. This could be anywhere from a few hundred CPUs to tens of thousands. Fluid dynamics, solid mechanics, fluid-structure interactions, electrohydrodynamics, etc. simulation tools can all benefit from a larger number of highly interconnected CPUs; we can simply resolve engineering problems better, and understand them better, with more computational resources.
It is not a single language; C, C++, and Fortran are the most commonly used. The communication happens using one of many implementations of the Message Passing Interface [0] (MPI). MPI essentially spawns processes on the nodes and builds a communication ring between them using whichever interconnect is available (usually InfiniBand).
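To make the model concrete, a minimal sketch (a toy example, not any particular simulation code): mpirun launches one copy of the program per process, each copy learns its rank, and they communicate only through MPI_* calls, with the interconnect handled underneath by the MPI library:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);

      int rank = 0, size = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?
      MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many processes total?

      // Each rank contributes one value; rank 0 receives the sum.
      double local = static_cast<double>(rank), total = 0.0;
      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0) std::printf("%d ranks, sum of ranks = %g\n", size, total);

      MPI_Finalize();
      return 0;
    }

Run with something like "mpirun -np 128 ./a.out"; the same binary scales from a workstation to thousands of cores.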
^ roughly how much a clothes dryer uses, for comparison.
A lot of power, but nothing really abnormal. Most rooms in a house won't be wired for it but that's about all. You probably wouldn't want this much thermal output in a normal room anyway, not to mention fan noise.
The OP was confusingly suggesting a 2U box with four distinct computers inside (4 node), each with two sockets. These types of systems do exist, and can help with volume density, but then you have to work a lot harder with regards to power and heat density.
I don't know what things look like now, but I recall hearing many stories a decade ago about datacenters running out of power and cooling when the volume was closer to 1/3rd full.
I think we will be seeing this more and more (less cache on the die, OR stacked cache on top of the main chip instead of in it), since SRAM scaling is now near a halting point [1] (no more improvements), which means the fixed cost of cache is going up with every new node.
From what I've read, the L1/L2 cache per core is the same, and the L3 cache per chiplet is the same, but the number of cores per chiplet is doubled and the overall chiplet size is about the same (it's a little bigger, I think).
So the L3 cache didn't get smaller (in area or bytes), there's just less of it per core. L1/L2 is relatively small, but they did use techniques to make it smaller at the expense of performance.
I think the big difference really is the reductions in buffers, etc, needed to get a design that scales to the moon. This is likely a major factor for Apple's M series too. Apple computers are never getting the thermal design needed to clock to 5 GHz, so setting the design target much lower means a smaller core, better power efficiency, lower heat, etc. The same thing applies here: you're not running your dense servers at 5 GHz, there's just not enough capability to deliver power and remove heat; so a design with a more realistic target speed can be smaller and more efficient.
Semianalysis seems to indicate the core space itself is 35% smaller. Maybe that's process tweaks related to clock rate? But I don't think we know for absolute certain it really is the same core with the same execution units & same everything. Even though a ton of the listed stats are exactly the same.
If you want to see something really amazing, look at the CPU usage to display one high frame rate animated gif posted in a slack channel on a modern quad core 11/12/13th gen Intel CPU laptop. Chat clients were not meant to take 70% of available CPU. You can literally hear the fans speed up.
Teams is bad, yeah, but does not have an entire monopoly on being terribly inefficient.
No, but on the other hand, I do have some glimmer of hope that eventually someone at Microsoft will see these kinds of comments and do something to fix that abomination. I have no doubt Microsoft employs many very talented people, but none of them seem to be working on the user facing products which make my corporate laptop run so poorly (Teams or OneDrive).
They've transitioned the consumer client from Electron to Edge WebView, Angular to React, and the same for the enterprise application appears to be in public preview and slated for general availability later this year. Still web tech but it's supposed to address some of the memory and performance problems.
Yes, appreciate the transparency, but it's kind of embarrassing "22 second launch time" was considered acceptable for so long, and "more than twice as fast" now means "almost a 10 second launch time".
Yeah it is something but as an app that usually runs at startup, I'd just be pleased if it can snap a little faster and be less janky. Fingers crossed.
You're kidding, right? I mean, Teams is not great, but compared to the UI rubbish of Zoom and Slack?
Take for example the whole business of screen sharing: if you share a single screen in Zoom you get black blocks for all the Zoom windows. Yes, it makes sense that I don't share the window with participants, but at least let me freaking close it. Similarly, moving the controls from the bottom to the top of the screen reliably confuses early (and even more advanced) users.
And don't get me started on quoting text or including math in Slack, and what is the whole threads section?!
It feels like a quarter of the time screen sharing doesn't even work in Teams, and it's noticeably choppier when it does work.
Code block support in Teams is inconsistent, and it seems to add extra spaces when copying.
Quoting & mentions also suck. The channels, chats, and activity screens are always a mess and feel like a UX band-aid.
Private channel limits are garbage.
Search and discovery sucks.
There are inconsistent options for the calendar between Outlook & Teams. Don't you dare expect functional compatibility between MS products. The scheduling assistant sucks, and sometimes confuses itself about when people are busy.
The UX seems to rely on a myriad of nested menus & modals that are XHR backed, so each one lags. Even just typing text feels slow. It's as fluid as molasses.
> and it seems to add additional spaces when copying
It regularly adds _fake spaces_ to code blocks. People paste working code/SQL out of it and find all the spaces replaced with mysterious UTF-8 characters that look like spaces but aren't.
Ok then. Steering away from Redditisms, let's make this into a deep reflection upon the state of modern application software that's consuming hardware resources faster than hardware can improve. We have these incredible CPUs yet our machines offer barely the same experience they did twenty years ago and even succeed in getting strictly worse as in Teams case. The newer the software, the more wasteful implementation we get, our performance budget spilled and absorbed by layers upon layers of abstractions. Of all this immense computing power, I can only feel the difference when compiling stuff and other specialized activities.
Cores won't draw much while blocking on a memory access. If the core count vs core utilization trade-off is a tie, the napkin math expectation would be same throughput, same power draw of cores, less power draw of cache for less cache.
We're going to be using a lot of these in a new hardware sku at my employer sooner or later (it's in the works). We software nerds get to figure out the scaling problems on these big CPUs that don't manifest on the smaller 18 and 24 core CPUs we're using today.
We'll approach it like a network-on-a-chip: 128 servers with an extremely fast network. Each server will be a process. One process will be something like ingress, another like a message queue and the rest will be workers. Chances are the chip doesn't need or want to be saturated, so you rotate the cores using Linux magic of some kind.
Put that on a napkin and give it to an unsuspecting intern.
Certainly the closer you can get to shared-nothing, the better! Some of the difficulty is we have a legacy stack that is not implemented that way. (You also need to account for significant kernel CPU use, like hardware interrupt handlers (e.g., NIC rx) running on some cores.)
> You also need to account for significant kernel CPU use, like hardware interrupt handlers (e.g., NIC rx) running on some cores.
In a lot of loads, it doesn't matter that much, because the application uses way more CPU than packet handling... But if it does matter, you really want to get things lined up as much as possible. Each CPU core gets one NIC queue pinned and one application thread pinned and the connections correctly mapped so the fast path never communicates to other cores. I'm not tuned into current NICs though, I don't know if you can get 128 queues on NICs now. If you have two sockets, then you also have the fun of Non-Uniform PCI-E Access...
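For the "one application thread pinned per core" half of that scheme, a minimal Linux sketch looks like the following (the NIC-queue/IRQ side is configured separately, e.g. via ethtool and /proc/irq, and isn't shown; the worker body is a placeholder). Uses the glibc-specific pthread_setaffinity_np; compile with g++ -pthread:

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>
    #include <cstdio>

    // Pin the calling thread to one CPU core.
    void pin_current_thread(int core) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
      const unsigned cores = std::thread::hardware_concurrency();
      std::vector<std::thread> workers;
      for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back([c] {
          pin_current_thread(static_cast<int>(c));
          // ... worker loop handling the connections mapped to this core ...
          std::printf("worker pinned to core %u\n", c);
        });
      }
      for (auto& t : workers) t.join();
      return 0;
    }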
When I was working on this (for a tcp mode HAProxy install), the NICs did 16 queues, and I had dual 12 or 14-core CPUs, so socket 0 got all its cores busy, and socket 1 just got a couple work threads and was mostly idle. Single socket, power of 2 cores is a lot better to optimize, but I had existing hardware to reuse.
Geographical names aren't copyrighted, so it's a safe choice. Before that they used names of various rivers around the world. Intel does something similar, but they make up fictional names based on real places. Nvidia uses names of famous scientists.
Probably selected a concept that has a bunch of names to draw from, so that the series would be consistent but also would not result in any trademark issues. Same reason Intel picks random words and appends "lake".
I wonder if Zen 4c is going to trickle down to the consumer market. There are multiple possibilities for CPUs with mixed chiplets but who knows if they really make sense outside the server world. It's exciting
I made a throwaway as I can't remember my login and wanted to highlight the SemiAnalysis article due to the amazing detail there.
There is a lot more detail in the article on the possible 4c/5c uses, it's been speculated in a lot of contexts for the hybrid configurations (even by AMD directly) so don't think it's surprising to mention this at all.
Wow, do CPUs have that capability now? To actually destroy other CPUs? How does that work? Do they get the "Annihilate.Competition" instruction set in an update when they detect an Intel machine on the network? /s