
64 of the 66 threads are slow threads where each group of 16 threads shares one set of execution units, and all 64 threads share a scratchpad memory and the caches.

This part of each core is very similar to existing GPUs.

What is different in this experimental Intel CPU, unlike any previous GPU or CPU, is that each core, besides the GPU-like part, also includes 2 very fast threads with out-of-order execution and a much higher clock frequency than the slow threads. Each of the 2 fast threads has its own non-shared execution units.

Taken separately, the 2 fast threads and the 64 slow threads are very similar to older CPUs or GPUs, but their combination into a single core with shared scratchpad memory and cache memories is novel.



> Taken separately, the 2 fast threads and the 64 slow threads are very similar to older CPUs or GPUs, but their combination into a single core with shared scratchpad memory and cache memories is novel.

Getting some Cell[1] vibes from that, except in reverse I guess.

[1]: https://en.wikipedia.org/wiki/Cell_(processor)


I'm far from a CPU or architecture expert, but the way you describe it, this CPU reminds me a bit of the Cell from IBM, Sony, and Toshiba. Though I don't remember if the SPEs had any sort of shared memory in the Cell.


While there are some similarities with the Sony Cell, the differences are very significant.

The PPE of the Cell was a rather weak CPU, meant for control functions, not for computational tasks.

Here the 2 fast threads are clearly meant to execute all the tasks that cannot be parallelized, so they are very fast. According to Intel they are eight times faster than the slow threads, so the 2 fast threads concentrate 20% of the processing capability of a core (2 × 8 = 16 slow-thread equivalents out of 16 + 64 = 80 in total), with the remaining 80% provided by the other 64 threads.

It can be assumed that the power consumption of the 2 fast threads is much higher than that of the slow threads. It is likely that the 2 fast threads alone consume about the same power as all the other 64 threads, so they will be used at full speed only for non-parallelizable tasks.

The second big difference is that in the Cell the communication between the PPE and the many SPEs was awkward, while here it is trivial, as all the threads of a core share the cache memories and the scratchpad memory.
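
To illustrate what that buys you, here is a conceptual C++ sketch (ordinary std::thread code on a conventional machine, not Intel's actual programming model; how work would be pinned onto the fast vs. slow threads is assumed and not shown). Because all threads of a core see the same caches and scratchpad, a fast thread can hand work to the slow threads through plain shared memory, with none of the explicit DMA the Cell SPEs required:

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Shared work array: ordinary memory visible to every thread.
    // On Cell, the PPE would have to DMA each chunk into an SPE's local
    // store; here the "slow threads" simply load it through the shared
    // caches/scratchpad.
    static std::vector<int> work(1 << 20, 1);
    static std::atomic<size_t> next_item{0};
    static std::atomic<long> total{0};

    void slow_thread_loop() {                  // would run on a slow thread
        long local = 0;
        for (;;) {
            size_t i = next_item.fetch_add(1, std::memory_order_relaxed);
            if (i >= work.size()) break;
            local += work[i];                  // ordinary load, no DMA
        }
        total.fetch_add(local, std::memory_order_relaxed);
    }

    int main() {
        // The "fast thread" (main here) sets up the work and farms it out.
        std::vector<std::thread> workers;
        for (int t = 0; t < 64; ++t)           // one worker per slow thread
            workers.emplace_back(slow_thread_loop);
        for (auto& w : workers) w.join();
        std::printf("sum = %ld\n", total.load());
        return 0;
    }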


The SPEs only had individual scratchpad memory that was divorced from the traditional memory hierarchy. You needed to explicitly transfer data in and out.


So this is a processor where you would have 97% of the threads doing some I/O-like task? But that can't be disk I/O, so that would leave networking?


DRAM is the new I/O. So yes, this is designed to handle 97% of the threads doing constant bad-locality DRAM accesses.


And with this, DRAM access becomes the new asynchronous IO.
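
To make that concrete, here is a hand-rolled sketch in plain C++ using the GCC/Clang __builtin_prefetch builtin (nothing specific to this Intel part, and the batch depth is an assumption). Even on a conventional core you can treat a cache-missing load a bit like an async request: issue it early for a whole batch, keep working, and only consume the result later. Hardware like this does the same overlap with spare threads instead of prefetches:

    #include <cstddef>
    #include <cstdint>

    // Sum table[idx[i]] for random indices. Instead of stalling 100+ cycles
    // on every miss, issue the loads for a whole batch up front (the "async
    // submit") and then consume them (the "completion"). __builtin_prefetch
    // is a GCC/Clang builtin hint; the hardware is free to ignore it.
    uint64_t sum_batched(const uint64_t* table, const uint32_t* idx, size_t n) {
        constexpr size_t kBatch = 16;          // assumed depth; tune per machine
        uint64_t sum = 0;
        size_t i = 0;
        for (; i + kBatch <= n; i += kBatch) {
            for (size_t j = 0; j < kBatch; ++j)        // submit
                __builtin_prefetch(&table[idx[i + j]]);
            for (size_t j = 0; j < kBatch; ++j)        // complete
                sum += table[idx[i + j]];
        }
        for (; i < n; ++i)                             // tail
            sum += table[idx[i]];
        return sum;
    }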


Neat insight

The protocols for HPC are so amorphous that they bubbled up into the lowest common denominator: a completely software-defined async global workspace.


I think generally the threads are spending a lot of time waiting on memory. It can take >100 cycles to get something from RAM, so you could have all your threads try to read a pointer and still have computation to spare until the first read comes back from memory.

It could be that, e.g., 97% of your threads are looking things up in big hashtables (e.g. computing a big join for a database query) or binary-searching big arrays, rather than doing ‘some I/O task’.
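
For concreteness, the shape of such a workload might look like the sketch below (plain C++ with std::thread as a stand-in; the table size, iteration count and thread count are made-up illustrative numbers, nothing from the Intel design). Nearly every probe misses the caches, so each thread spends most of its time stalled on DRAM, and a core that keeps 64 such threads resident can overlap all of those waits:

    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Each worker does cache-unfriendly probes into a big table. Almost
    // every access is a DRAM miss, so each thread is mostly "waiting on
    // memory"; a many-threaded core overlaps those waits across threads.
    static std::vector<uint64_t> table(1ull << 26);   // ~512 MB of 64-bit slots

    void probe_worker(uint64_t seed, uint64_t* out) {
        uint64_t x = seed, sum = 0;
        for (int i = 0; i < 1000000; ++i) {
            x = x * 6364136223846793005ull + 1442695040888963407ull;  // LCG step
            sum += table[x % table.size()];           // likely a DRAM miss
        }
        *out = sum;
    }

    int main() {
        constexpr int kThreads = 64;                  // mirrors the 64 slow threads
        std::vector<uint64_t> results(kThreads);
        std::vector<std::thread> threads;
        for (int t = 0; t < kThreads; ++t)
            threads.emplace_back(probe_worker, t + 1, &results[t]);
        for (auto& th : threads) th.join();
        std::printf("done: %llu\n", (unsigned long long)results[0]);
        return 0;
    }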


Let's say that I can get something from RAM in 100 cycles. But if I have 60 threads all trying to do something with RAM, I can't do 60 RAM accesses in that 100 cycles, can I? Somebody's going to have to wait, aren't they?


this would work really well with rambus-style async memory if it ever got out from under the giant pile of patents

the 'plus' side here is that that condition gets handled gracefully, but yes, certainly you can end up in a situation where memory transactions per second is the bottleneck.

it's likely more advantageous to have a lot of memory controllers and ddr interfaces here than a lot of banks on the same bus. but that's a real cost and pin issue.

the mta 'solved' this by fully dissociating the memory from the cpu with a fabric

maybe you could do the same with cxl today


I’m not exactly sure what you mean. RAM allows multiple reads to be in flight at once but I guess won’t be clocked as fast as the CPU. So you’ll have to do some computation in some threads instead of reads. Peak performance will have a mix of some threads waiting on RAM and others doing actual work.


This processor is out of my league, but do you have any idea how a program would use that optimally? How do you code for that?


> But that can't be disk I/O, so that would leave networking?

Networking is a huge part of cloud applications, and network connections take orders of magnitude longer to complete than disk accesses.

There are components of any cloud architecture which are dedicated exclusively to handling networking. Reverse proxies, ingress controllers, API gateways, message broker handlers, etc etc etc. Even function-as-a-service tasks heavily favour listening and reacting to network calls.

I dare say that pure horsepower is no longer driving demand for servers. The ability to shove as many processes and threads as possible onto a single CPU is by far the thing that cloud providers and on-prem companies seek.



