
66 threads per core looks more like a barrel processor than anything else. We shouldn't expect those threads to be very fast, but we can assume that, if the processor has enough work, it will be doing something useful most of the time (rather than waiting for memory).


I don’t know if Intel has the appetite to attempt another barrel processor.

The primary weakness of barrel processors is human; only a handful of people grok how to design codes that really exploit their potential. They look deceptively familiar at a code level, because normal code will run okay, but won’t perform well unless you do things that look extremely odd to someone that has only written code for CPUs. It is a weird type of architecture to design data structures and algorithms for and there isn’t a lot of literature on algorithm design for barrel processors.

I love barrel processors and have designed codes for a few different such architectures, starting with the old Tera systems, and became quite good at it. In the hands of someone that knows what they are doing I believe they can be more computationally efficient than just about any other architecture given a similar silicon budget for general purpose computing. However, the reality is that writing efficient code for a barrel processor requires carrying a much more complex model in your head than the equivalent code on a CPU; the economics favors architectures like CPUs where an average engineer can deliver adequate efficiency. At this point, I’ve given up on the idea that I’ll ever see a mainstream barrel processor, despite their strengths from a pure computational efficiency standpoint.


With the advent of GPU coding and the number of people now exposed to it, there is a chance that enough people are willing and able to try an unfamiliar architecture, provided it gives them an advantage, that this just might be viable now.


What kind of advantage do barrel processors give a regular programmer (especially for things that can't be done efficiently with a CPU/GPU)?


Much closer communication lines between the two, and thread-level guarantees that would be hard to match otherwise. A barrel processor will 'hit' its threads far more frequently on tasks that would be hard to adapt to a GPU (for instance, tasks with more branching).

Though CPU/GPU combinations are slowly moving in that direction anyway. Incidentally, if this sort of thing really interests you: when I was playing around with machine learning, my trick to see how efficiently my code was using the GPU was really simple. First I ran a graphics benchmark that maxed out the GPU and measured power consumption. Then I did the same for just the CPU. Afterwards, while running my own code, I'd compare its power draw against the maxima obtained during the benchmarks; this gave a pretty good indication of whether or not I had made some large mistake, and showed a nice and steady increase with every optimization step.
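
For anyone who wants to reproduce that trick, here is a rough sketch of the measurement side in C++, assuming a Linux box with an NVIDIA GPU where nvidia-smi reports power.draw; the 250 W "benchmark peak" below is a placeholder you would measure yourself with your own saturating benchmark:

    #include <cstdio>

    // Read the current GPU board power draw in watts by shelling out to
    // nvidia-smi (POSIX popen). Other vendors expose similar counters
    // through their own tools.
    static double gpu_power_watts() {
        FILE *p = popen("nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits", "r");
        if (!p) return -1.0;
        double watts = -1.0;
        if (fscanf(p, "%lf", &watts) != 1) watts = -1.0;
        pclose(p);
        return watts;
    }

    int main() {
        // Step 1 (done separately): record peak draw while a known
        // GPU-saturating benchmark runs. Step 2: record draw while your
        // own code runs. The ratio is a rough utilization proxy.
        const double benchmark_peak_watts = 250.0;  // placeholder, measure this yourself
        double during_my_code = gpu_power_watts();
        if (during_my_code > 0)
            printf("utilization proxy: %.0f%%\n", 100.0 * during_my_code / benchmark_peak_watts);
        return 0;
    }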


Could you not also see an impact on run times? If algorithm A draws 100W and runs in 10 seconds, and algo B draws 200W… shouldn't it run in well under 10 seconds to warrant being called a better algorithm?


Not really, it's an apples-to-oranges comparison. If you ran the same distributed algorithm on a single core you wouldn't see the same speed improvements. These chips are ridiculously efficient because they need far fewer gates to accomplish the same purpose if you have programmed to them specifically. Just like you couldn't simulate a GPU for the same power budget on a CPU. The loss of generality is more than made up for by the increase in efficiency. This is very similar in that respect, but with the caveat that a GPU is even more specialized and so even more efficient.

Imagine what kind of performance you could get out of hardware that is task specific. That's why for instance crypto mining went through a very rapid set of iterations: CPU->GPU->ASIC in a matter of a few years with an extremely brief blip of programmable hardware somewhere in there as well (FPGA based miners, approximately 2013).

Any loss of generality can be traded for efficiency, and vice versa; the question is whether or not it is economically feasible, and there are different points on that line that have resulted in marketable (and profitable) products. But there are also plenty of wrecks.


1. Thread Switching:
   • Hardware Mechanisms: Specific circuitry used for rapid thread context switching.
   • Context Preservation: How the state of a thread is saved/restored during switches.
   • Overhead: The time and resources required for a thread switch.
2. Memory Latency and Hiding:
   • Prefetching Strategies: Techniques to load data before it's actually needed.
   • Cache Optimization: Adjusting cache behavior for optimal thread performance.
   • Memory Access Patterns: Sequences that reduce wait times and contention.
3. Parallelism:
   • Dependency Analysis: Identifying dependencies that might hinder parallel execution.
   • Fine vs. Coarse Parallelism: Granularity of tasks that can be parallelized.
   • Task Partitioning: Dividing tasks effectively among available threads.
4. Multithreading Models:
   • Static vs. Dynamic Multithreading: Predetermined vs. on-the-fly thread allocation.
   • Simultaneous Multithreading (SMT): Running multiple threads on one core.
   • Hardware Threads vs. Software Threads: Distinguishing threads at the CPU vs. OS level.
5. Data Structures:
   • Lock-free Data Structures: Structures designed for concurrent access without locks.
   • Thread-local Storage: Memory that's specific to a thread.
   • Efficient Queue Designs: Optimizing data structures like queues for thread communication.
6. Synchronization Mechanisms:
   • Barrier Synchronization: Ensuring threads reach a point before proceeding.
   • Atomic Operations: Operations that complete without interruption.
   • Locks: Techniques to avoid common pitfalls like deadlocks and contention.
7. Stall Causes and Mitigation:
   • Instruction Dependencies: Managing data and control dependencies.
   • IO-bound vs. CPU-bound: Balancing IO and CPU workloads.
   • Handling Page Faults: Strategies for minimal disruption during memory page issues.
8. Instruction Pipelining:
   • Pipeline Hazards: Situations that can disrupt the smooth flow in a pipeline.
   • Out-of-Order Execution: Executing instructions out of their original order for efficiency.
   • Branch Prediction: Guessing the outcome of conditional operations to prepare.

Those are some of the complications related to barrel processing.
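
To make one of those items concrete, here is a minimal sketch in plain C++ (nothing barrel-specific, just a commodity multicore) of the "atomic operations / lock-free" point: 64 threads, the same count as the slow threads on one of these cores, update a shared counter without a lock and no update is lost:

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        std::atomic<long> counter{0};
        std::vector<std::thread> workers;
        // 64 threads, mirroring the 64 slow threads of one PIUMA core.
        for (int t = 0; t < 64; ++t)
            workers.emplace_back([&counter] {
                for (int i = 0; i < 100000; ++i)
                    counter.fetch_add(1, std::memory_order_relaxed);  // hardware atomic, no lock
            });
        for (auto &w : workers) w.join();
        printf("%ld\n", counter.load());  // always 6400000
        return 0;
    }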


People aren't really doing GPU coding; they're calling CUDA libraries. As with other Intel projects, this needs dev support to survive, and Intel has not been good at providing it. Consider the fate of the Xeon Phi and Omnipath.


Huh, that is a solid point. How big a fraction would you estimate is actually capable of doing real GPU coding?


I don't think I have a broad enough perspective of the industry to assess that. In my corner of the world (scientific computing) it's probably three quarters simple use of accelerators and one quarter actual architecture based on the GPU itself.


There are two single-threaded cores, but the four multithreaded cores are indeed Tera-style barrel processors. The Tera paper is even cited directly in the PIUMA white paper: https://arxiv.org/abs/2010.06277

edit: s/eight/four


Unfortunate name. Anyone familiar with the ways PUMA was misapplied in the past might reflexively read that as "Peeeuuw!ma".


What are the kinds of challenges to write things efficiently?

From skimming Wikipedia, it looks like a big challenge is cache pollution. Is it possible that the hit to cache locality is what inhibits uptake? After all, most threads in the OS are sitting idle doing nothing, which means you’re penalized for any “hot code” that’s largely serial (ie typically you have a small number of “hot” applications unless your problem is embarrassingly parallel)


> What are the kinds of challenges to write things efficiently?

The challenge is mostly that you have to create enough fine-grained parallelism, and that per-thread performance is relatively low. Amdahl's law is in full effect here: a sequential part is going to bite you hard. That's why each die on this chip has two sequential-performance cores.
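
A back-of-the-envelope sketch of how hard Amdahl's law bites at this thread count; the parallel fractions are just example values:

    #include <cstdio>

    // Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction and n the number of threads.
    int main() {
        const int n = 64;  // slow threads per core
        const double fractions[] = {0.90, 0.99, 0.999};
        for (double p : fractions)
            printf("p = %.3f -> speedup = %.1fx on %d threads\n",
                   p, 1.0 / ((1.0 - p) + p / n), n);
        return 0;
    }

Even with 99% of the work parallelized you only get about a 39x speedup out of 64 threads, which is exactly why those sequential-performance cores are there.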

The graph problems this processor is designed to handle have plenty of parallelism and most of the time those threads will be waiting for (uncached) 8B DRAM accesses.

> From skimming Wikipedia, it looks like a big challenge is cache pollution

This processor has tiny caches and the programmer decides which accesses are cached. In practice, you cache the thread's stack and do all the large graph accesses un-cached, letting the barrel processor hide the latency. There are very fast scratchpads on this thing for when you do need to exploit locality.
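
A sketch of what that programming style looks like for a graph workload; uncached_load64 below is a made-up stand-in for whatever uncached-load primitive the real ISA exposes (it falls back to an ordinary load here so the example still runs on a normal CPU):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical wrapper: on PIUMA-style hardware this would be a load that
    // bypasses the cache and lets the barrel scheduler run other threads
    // while it waits. Here it is just a plain load.
    static inline uint64_t uncached_load64(const uint64_t *addr) { return *addr; }

    // Sum the degrees of one vertex's neighbours. The loop counter and the
    // running sum are small, cacheable, per-thread (stack) state; the big
    // adjacency and degree arrays are the random, low-locality 8B accesses
    // you would leave uncached.
    uint64_t neighbour_degree_sum(const uint64_t *adj, const uint64_t *degree,
                                  size_t first, size_t count) {
        uint64_t sum = 0;
        for (size_t i = 0; i < count; ++i) {
            uint64_t v = uncached_load64(&adj[first + i]);
            sum += uncached_load64(&degree[v]);
        }
        return sum;
    }

    int main() {
        std::vector<uint64_t> adj = {1, 2, 3}, degree = {0, 4, 5, 6};
        printf("%llu\n", (unsigned long long)neighbour_degree_sum(adj.data(), degree.data(), 0, 3));
        return 0;
    }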


Is it then similar to how utilizing a GPU fully is much harder than utilizing a CPU fully?


In some aspects, yes. Arguably, it's much harder to max out GPUs because they have the added difficulty of scheduling and executing by blocks of threads. If not all N threads in a block have their input data ready, or if they branch differently, some execution units sit idle.
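
A toy model of that effect in plain C++ (not real GPU code): 32 "lanes" execute in lockstep, and a data-dependent branch forces two passes with half the lanes masked off each time, so half of the issued execution slots do nothing:

    #include <array>
    #include <cstdio>

    int main() {
        constexpr int WARP = 32;
        std::array<int, WARP> data{}, out{};
        for (int i = 0; i < WARP; ++i) data[i] = i;

        int wasted_slots = 0;

        // Lockstep model of: if (x % 2 == 0) out = x * 2; else out = x + 1;
        // Pass 1: even lanes active, odd lanes masked off.
        for (int i = 0; i < WARP; ++i) {
            if (data[i] % 2 == 0) out[i] = data[i] * 2;
            else ++wasted_slots;   // lane occupies an execution slot but does nothing
        }
        // Pass 2: odd lanes active, even lanes masked off.
        for (int i = 0; i < WARP; ++i) {
            if (data[i] % 2 != 0) out[i] = data[i] + 1;
            else ++wasted_slots;
        }

        printf("wasted execution slots: %d of %d\n", wasted_slots, 2 * WARP);  // 32 of 64
        return 0;
    }

A barrel processor that interleaves independent scalar threads never pays this particular cost, which is the branching advantage mentioned further up the thread.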


It is not hard at all to fully utilize a GPU with a problem that maps well onto that type of architecture. It's impossible to fully utilize a GPU with a problem that does not.

A barrel processor would make branching more efficient than on a GPU, at the cost of throughput. The set of problems that are economically interesting and that would strongly profit from that is rather small, hence these processors remain niche. Conversely, the incentive to have problems that map well to a GPU is higher, because they are cheap and ubiquitous.


> The primary weakness of barrel processors is human; only a handful of people grok how to design codes that really exploit their potential. They look deceptively familiar at a code level, because normal code will run okay, but won’t perform well unless you do things that look extremely odd to someone that has only written code for CPUs.

Nowadays, apps that run straight on baremetal servers are the exception instead of the norm. Some cloud applications favour tech stacks based on high level languages designed to run single-threaded processes on interpreters that abstract all basic data structures, and their bottleneck is IO throughput instead of CPU.

Even if this processor is not ideal for all applications, it might just be the ideal tool for cloud applications that need to handle tons of connections and stay idling while waiting for IO operations to go through.


The new Mojo programming language is attempting to deal with such issues by raising the level of abstraction. The compiler and runtime environment are supposed to automatically optimize for major changes in hardware architecture. In practice, though, I don't know how well that could work for barrel processors.

https://www.modular.com/mojo


But how does it compare to GPUs for compute tasks (e.g. CUDA)? Both performance and difficulty wise?


Using OpenMP is often enough "to really exploit their potential" and "to deliver adequate efficiency."

Which is what an "average engineer" would use.


If Intel was very bullish on AI, they might be tempted to design arbitrarily complicated architectures and just trust that GPT-6 will be able to write code for it.

This was a bet that completely failed for Itanium, but maybe this time...


Is GPT-6 the proverbial magic wand that makes all your dreams come true? Because you just hand waved away all the complexity just by name dropping it. Let's get back to Earth for one second...


I think the hand-waving was the point. It was a dig at Intel, accusing them of not having a solid plan.


How could GPT-(arbitrarily large number) learn to write code for an architecture for which there is no training data?


By grokking multi threaded programming of course.


From reading the documentation in its context window.


There's an obscene amount of chess literature, yet ChatGPT is mediocre at chess and makes elementary mistakes (invalid moves).

This would work only if ChatGPT 6 is AGI, but then it doesn't make much sense to label it as ChatGPT.


If an AGI is a GPT and you can chat with it why wouldn't you call it ChatGPT?


Yeah, but that's a big IF.

If the actual requirement for this is AGI, then it's clearer to just say AGI instead of ChatGPT which might or might not ever become AGI.


Well... This is a research project, and they aren't betting the farm on it. It's not even x86-compatible.

And then, it'd require a lot of software rewriting, because we are not used to writing for hundreds of threads, since a context switch is a very expensive operation on modern CPUs. On this CPU a context switch is fast and happens on any operation that makes the CPU wait for memory access; therefore, thinking in terms of hundreds of threads pays off. But, again, this will need some clever OS design to hide everything under an API existing programs can recognize.

It may even be that nothing comes out of it except another lesson on how not to build a computer.


Why do we need x86 compatibility? x86 chips are already fast.

Existing programs don't need to see these things at all; they can just act like microservices, and we could load them with an image and call them as black boxes.

I would think a lot more work would be done by various accelerator cards by now.


They probably aren't insanely bullish. GPT is a revolution but it's a revolution because it can do "easy" general tasks for the first time, which makes it the first actually useful general purpose AI.

But it still can't really design things. It's probably about as far away from being able to design a complex CPU architecture as GPT-4 is from Eliza.


Nvidia released a toolkit to assist chip design a few months ago:

https://techmonitor.ai/technology/ai-and-automation/nvidia-a...

Google has been working on AI's to optimize code:

https://blog.research.google/2022/07/mlgo-machine-learning-f...

> But it still can't really design things.

What do you really mean by "really" in that sentence?

In any case, the claim you responded to was not that the chips would be designed from the ground up by AI, only that AI will enable us to run code on top of chips that are even more complex than current chips.


The Nvidia toolkit you linked is for place & route, which is a much easier task than designing the actual architecture and microarchitecture. It's similar to laying out a PCB.

The Google research you linked is using AI to make better decisions about when to apply optimisations. For example when do you inline a function? Which register do you spill? Again this is a very simple (conceptually) low level task. Nothing like writing a compiler for example.

> What do you really mean by "really" in that sentence?

I mean successfully create complex new designs from scratch.


I really wouldn't say it's an easier task at all. It's so intense it's essentially impossible for a human to do from scratch. But it is "just" an optimization problem which is different from designing an architecture.


I doubt Intel leadership is that naive.


And a bet they didn't even know they were making with the iAPX 432. The Ada compiler was awful optimization-wise, but that came out in an external research paper too late to save it.


So the Mill is going to be released then, right? Right?

Not too psyched to use some proprietary black-box compiler, though.


They have been so careful not to leak anything that there is very little interest in it. I completely forgot about them and, if they never launch, I doubt anyone would notice.


Why not just have GPT-6 design the chip to start with?


Would you call that "general purpose" still? If the code needs to be so special to leverage the benefits of that design?


In a way, yes - it's no different from scalar CPUs from back when they didn't need to wait for memory. GPUs are specifically designed to operate on vectors.


64 of the 66 threads are slow threads, where each group of 16 threads shares one set of execution units, and all 64 threads share a scratchpad memory and the caches.

This part of each core is very similar to existing GPUs.

What is different in this experimental Intel CPU, and unlike any previous GPU or CPU, is that each core, besides the GPU-like part, also includes 2 very fast threads, with out-of-order execution and a much higher clock frequency than the slow threads. Each of the 2 fast threads has its own non-shared execution units.

Taken separately, the 2 fast threads and the 64 slow threads are very similar to older CPUs or GPUs, but their combination into a single core with shared scratchpad memory and cache memories is novel.


> Taken separately, the 2 fast threads and the 64 slow threads are very similar to older CPUs or GPUs, but their combination into a single core with shared scratchpad memory and cache memories is novel.

Getting some Cell[1] vibes from that, except in reverse I guess.

[1]: https://en.wikipedia.org/wiki/Cell_(processor)


I'm far from a CPU or architecture expert but the way you describe it this CPU reminds me a bit of the Cell from IBM, Sony, and Toshiba. Though, I don't remember if the SPEs had any sort of shared memory in the Cell.


While there are some similarities with the Sony Cell, the differences are very significant.

The PPE of the Cell was a rather weak CPU, meant for control functions, not for computational tasks.

Here the 2 fast threads are clearly meant to execute all the tasks that cannot be parallelized, so they are very fast; according to Intel they are eight times faster than the slow threads, so the 2 fast threads (2 × 8 = 16 slow-thread equivalents out of 16 + 64 = 80) concentrate 20% of the processing capability of a core, with only 80% provided by the other 64 threads.

It can be assumed that the power consumption of the 2 fast threads is much higher than that of the slow threads. It is likely that the 2 fast threads alone consume about the same power as all the other 64 threads, so they will be used at full speed only for non-parallelizable tasks.

The second big difference is that in the Cell the communication between the PPE and the many SPEs was awkward, while here it is trivial, as all the threads of a core share the cache memories and the scratchpad memory.


The SPEs only had individual scratchpad memory that was divorced from the traditional memory hierarchy. You needed to explicitly transfer memory in and out.


So this is a processor where you would have 97% of the threads doing some I/O like task? But that can't be disk I/O, so that would leave networking?


DRAM is the new I/O. So yes, this is designed to handle 97% of the threads doing constant bad-locality DRAM accesses.


And with this, DRAM access becomes the new asynchronous IO.


Neat insight.

The protocols for HPC are so amorphous that they bubbled up into the lowest common denominator: a completely software-defined, asynchronous global workspace.


I think generally the threads are spending a lot of time waiting on memory. It can take >100 cycles to get something from RAM, so you could have all your threads try to read a pointer and still have computation to spare until the first read comes back from memory.

It could be that, e.g., 97% of your threads are looking things up in big hashtables (e.g. computing a big join for a database query) or binary-searching big arrays, rather than doing ‘some I/O task’.
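
As a concrete sketch of that kind of workload (plain C++ on a normal multicore, just to show the access pattern): every probe into the big sorted array is a chain of cache-missing reads, and on a barrel processor each miss would simply hand the pipeline to one of the other threads:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <thread>
    #include <vector>

    int main() {
        const size_t n = 1 << 24;                  // ~128 MB of uint64_t, far larger than cache
        std::vector<uint64_t> sorted(n);
        for (size_t i = 0; i < n; ++i) sorted[i] = 2 * i;

        auto worker = [&](unsigned seed, size_t queries, size_t *hits) {
            std::mt19937_64 rng(seed);
            size_t h = 0;
            for (size_t q = 0; q < queries; ++q)   // each search is ~24 dependent, poorly cached reads
                h += std::binary_search(sorted.begin(), sorted.end(), rng() % (2 * n));
            *hits = h;
        };

        std::vector<std::thread> threads;
        std::vector<size_t> hits(64, 0);
        for (unsigned t = 0; t < 64; ++t)          // 64 threads, as on one of these cores
            threads.emplace_back(worker, t, size_t(100000), &hits[t]);
        for (auto &t : threads) t.join();

        size_t total = 0;
        for (size_t h : hits) total += h;
        printf("hits: %zu\n", total);
        return 0;
    }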


Let's say that I can get something from RAM in 100 cycles. But if I have 60 threads all trying to do something with RAM, I can't do 60 RAM accesses in that 100 cycles, can I? Somebody's going to have to wait, aren't they?


This would work really well with Rambus-style async memory if it ever got out from under the giant pile of patents.

The 'plus' side here is that that condition gets handled gracefully, but yes, you can certainly end up in a situation where memory transactions per second is the bottleneck.

It's likely more advantageous to have a lot of memory controllers and DDR interfaces here than a lot of banks on the same bus, but that's a real cost and pin-count issue.

The MTA 'solved' this by fully dissociating the memory from the CPU with a fabric.

Maybe you could do the same with CXL today.


I'm not exactly sure what you mean. RAM allows multiple reads to be in flight at once, but I guess it won't be clocked as fast as the CPU. So you'll have to do some computation in some threads instead of reads. Peak performance will have a mix of some threads waiting on RAM and others doing actual work.
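
To put a number on "multiple reads in flight", here is a back-of-the-envelope Little's-law sketch; the bandwidth and latency figures are generic example values (roughly one DDR4-3200 channel and a typical DRAM load latency), not specs of this chip:

    #include <cstdio>

    // Little's law: requests_in_flight = bandwidth * latency / request_size.
    // Keeping a memory channel busy needs many concurrent accesses, which is
    // exactly what a large pool of stalled-but-ready threads provides.
    int main() {
        const double bandwidth = 25.6e9;   // bytes/s, ~one DDR4-3200 channel (example value)
        const double latency   = 100e-9;   // seconds, ~typical DRAM load latency (example value)
        const double req_bytes = 8.0;      // the 8-byte accesses discussed above
        printf("concurrent 8-byte requests to keep one channel busy: %.0f\n",
               bandwidth * latency / req_bytes);   // ~320
        return 0;
    }

So yes, some threads will queue behind others, but the point of the design is that the latency of each individual access is hidden rather than eliminated, and more memory controllers raise the ceiling.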


This processor is out of my league, but do you have any idea how a program would use that optimally? How do you code for that?


> But that can't be disk I/O, so that would leave networking?

Networking is a huge part of cloud applications, and network connections take orders of magnitude longer to go through than disk access.

There are components of any cloud architecture which are dedicated exclusively to handling networking. Reverse proxies, ingress controllers, API gateways, message broker handlers, etc etc etc. Even function-as-a-service tasks heavily favour listening and reacting to network calls.

I dare say that pure horsepower is no longer driving demand for servers. The ability to shove as many processes and threads as possible onto a single CPU is by far the thing that cloud providers and on-prem companies seek.


Reminds me of Tera, the original SMT. 128 threads per core in 1990!

https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...

(I’d put up money that the DARPA project that funded this work is from the same lineage of TLA interest that got Tera enough money to buy Cray!)


Following the history here: DARPA also funded work in ~2005 on a project called Monarch, which I saw presented at a Google tech talk back then. I believe the name refers to the "butterfly" architectures of very scalable interconnects of lightweight processing units.

Bill Dally was working on the networking side (maybe equiv to the photonics interconnect here) and almost got it up and running at BBN (search for "monarch") http://franki66.free.fr/Principles%20and%20Practices%20of%20...

Here's some refs for the chip https://slideplayer.com/slide/7739558/ https://viterbischool.usc.edu/news/2007/03/monarch-system-on...

6 main RISC processors with 96 ALUs for the lightweight IO/compute processes


Their multithreaded cores are a similar design, yes. It does not do the XMT's in-hardware full/empty-bit memory access system, though.


Feels a bit like async on the programming language side. Just replace "waiting for memory" with "waiting for IO" and you're almost there.


> Just replace "waiting for memory" with "waiting for IO" and you're almost there.

You can see the latter in your code, you can’t—on a modern superscalar—see the former. Is the proposed architecture any different?


Haven't seen the ISA, but it's not insane to imagine one where explicit data cache manipulation instructions are a requirement (i.e. a memory access outside the cache would fault). I think it's even helpful to make those operations explicit when you are writing high-performance code. On a processor like this, any operation that loads a cache line not already present should trigger a switch to the next runnable thread (threads also being entities exposed to the ISA).

Also, I'm not even sure it'd be too painful to program (in assembly, at least). It'd be perhaps inconvenient to derive those ops from C code, but Rust, with its explicit ownership, may have a better hand here.


I don’t think it necessarily implies a barrel processor. More like a higher count for SMT which could be due to the higher CPU performance relative to the CPU <> memory bandwidth. While the slow fetches occur, the system could execute more instructions for other threads in parallel.


How useful would it be as a GPU?


Not very; this is the polar opposite of a GPU's design.

It does share the latency-hiding-by-parallelism design, but GPUs do that scheduling at a pretty coarse granularity (viz. the warp). The barrel processors on this thing round-robin between threads on each instruction.

GPUs are designed for dense compute: lots of predictable data accesses and control flow, high arithmetic intensity FLOPS.

In contrast, this is designed for lots of data-dependent unpredictable accesses at the 4-8B granularity with little to no FLOPS.


These particular chips? They seem more targeted at HPC work (and price point).

This sort of architecture? I wouldn't be surprised if current GPUs were doing something similar.

If you think about executing a shader program, you are typically running that same code over a bunch of data. You can map that to multiple threads.

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...

https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-...


To me it looks like the opposite: the processor is very fast, and exports itself as 66 threads so as not to spend virtually all of its time waiting for external circuits.



