
You do understand that current hardware exists to support C, right?


What aspect of currently popular CPU instruction sets ‘exists to support C’?


Strong sequential consistency is a big one. Most architectures that have tried to diverge from this for performance reasons run into trouble with the way people like to write C code (but will not have trouble with languages actually built for concurrency).
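
To make that concrete, here is a minimal sketch (my own, assuming two threads and no atomics, the way a lot of existing C is written) of an idiom that works as intended only under a sufficiently strong memory model:

    /* Producer/consumer via a plain flag. Under sequential consistency, a
       consumer that sees ready == 1 also sees msg == 42. On a weakly
       ordered machine (and per C11, where this is a data race) the
       accesses may be reordered, so the consumer can observe the flag
       before the payload unless atomics or barriers are used. */
    int msg;
    int ready;

    void producer(void) {
        msg = 42;
        ready = 1;          /* nothing here orders these two stores */
    }

    int consumer(void) {
        while (!ready)      /* spin until the flag appears to be set */
            ;
        return msg;         /* may not be 42 on a weak memory model */
    }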

Arguably the scalar focus of CPUs is also to make them more suited for C-like languages. Now, attempts to do radically different things (like Itanium) failed for various reasons, in Itanium's case at least partially because it was hard to write compilers good enough to exploit its VLIW design. It's up in the air whether a different high-level language would have made those compilers feasible.

It's not like current CPUs are completely crippled by having to mostly run C programs, and that we'd have 10x as many FLOPS if only most software was in Haskell, but there are certainly trade-offs that have been made.

It is interesting to look at DSPs and GPU architectures, as examples of performance-oriented machines that have not been constrained by mostly running legacy C code. My own experience is mostly with GPUs, and I wouldn't say the PTX-level CUDA architecture is too different from C. It's a scalar-oriented programming model, carefully designed so it can be transparently vectorised. This approach won out over AMD's old explicitly VLIW-oriented architecture, and most GPU vendors are now also using the NVIDIA-style design (I think NVIDIA calls it SIMT). From a programming-experience POV, the main differences between CUDA programming and C programming (apart from the massive parallelism) are manual control over the memory hierarchy instead of a deep cache hierarchy, and a really weak memory model.

Oh, and of course, when we say "CPUs are built for C", we really mean the huge family of shared-state imperative scalar languages that C belongs to. I don't think C has any really unique limitations or features that have to be catered to.


> Now, attempts to do radically different things (like Itanium) failed for various reasons, in Itanium's case at least partially because it was hard to write compilers good enough to exploit its VLIW design. It's up in the air whether a different high-level language would have made those compilers feasible.

My day job involves supporting systems on Itanium: the Intel C compiler on Itanium is actually pretty good... now. We'd all have a different opinion of Itanium if it had been released with something half as good as what we've got now.

I'm sure you can have a compiler for any language that really makes VLIW shine. But it would take a lot of work, and you'd have to do that work early. Really early. Honestly, if any chip maker decided to do a clean-sheet VLIW processor and did compiler work side-by-side while they were designing it, I'd bet it would perform really well.


Thank you for an interesting comment - it seems to imply that Intel have markedly improved the Itanium compiler since they discontinued Itanium, which is interesting!

I guess any new architecture needs to be substantially better than existing out-of-order, superscalar implementations to justify a change, and we are still seeing more transistors thrown at existing architectures each year, generating some performance gains.

I wonder whether, if/when this stops, we will see a revisiting of the VLIW approach.


I doubt it. Another major disadvantage of VLIW is instruction density: if the compiler cannot fill all the instruction slots, the padding wastes cache, fetch bandwidth, etc.


Didn't later Itanium CPU microarchitectures internally shift to a much more classic design so they could work around the compiler issues?


I've never heard of that. If true, that might be a large hole in my theory :)


To quote David Kanter at Realworldtech

> Poulson is fundamentally different and much more akin to traditional RISC or CISC microprocessors. Instructions, rather than explicitly parallel bundles, are dynamically scheduled and executed. Dependencies are resolved by flushing bad results and replaying instructions; no more global stalls. There is even a minimal degree of out-of-order execution – a profound repudiation of some of the underlying assumptions behind Itanium.

https://www.realworldtech.com/poulson/

Given the large number of security problems OoO has caused, there is a chance that we may revisit the experiment in the future with a less rigid attitude and greater success.


Thanks, that’s a really great article. It does change my views significantly.


> in Itanium's case at least partially because it was hard to write compilers good enough to exploit its VLIW design

This is half true. The other half is that OOO execution does all the pipelining a "good enough" compiler would do, except that it does so dynamically at runtime, benefiting from just-in-time profiling information. Way back in the day OOO was considered too expensive; nowadays everybody uses it.


AIUI it's not pipelining but executing out of order that's where the big win comes from: it allows some hiding of e.g. memory fetch latency. Since data may or may not be in cache, it's apparently impossible for the compiler to know this ahead of time, so the scheduling has to be done dynamically (but I disclaim being any kind of expert in this).
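
A minimal sketch (mine, assuming an ordinary cached memory hierarchy) of why this is inherently dynamic: whether each load hits or misses depends on data and machine state the compiler cannot see.

    /* Independent loads: an out-of-order core can keep several cache
       misses in flight at once, overlapping their latency. */
    long sum_array(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Dependent loads: each address comes from the previous load, so the
       core mostly waits on memory. No static schedule helps here, because
       the compiler cannot know which of these loads will miss the cache. */
    struct node { struct node *next; long val; };

    long sum_list(const struct node *p) {
        long s = 0;
        for (; p; p = p->next)
            s += p->val;
        return s;
    }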


That is very true, thanks for clarifying. It is one of the main reasons it is hard to build a 'good enough' VLIW compiler, though I haven't paid attention to the field in >10 years. OOO = 'out of order' :)


OOO parallelism runs hard into limits too (see the heroic measures current cores go to just to increase IPC a teensy bit). Parallelism-friendly languages have the potential to break through this barrier given parallelism-friendly processors. (And they do in GPU land...)


A shocking amount, but in many cases it's also what doesn't exist, or isn't optimized. C is designed to be a lowest-common-denominator language.

So: the whole flat memory model, large register machines, a single stack register. When you look at all the things people think are "crufty" about x86, it's usually through the lens of "modern" computing. Things like BCD, fixed point, capabilities, segmentation, call gates, all the odd 68000 addressing modes, etc. Many of those things were well supported in other environments but ended up hindering, or being unused by, C compilers.

On the other side you have things like the inc/dec instructions, which influenced the idea of the unary ++ and -- rather than the longer, more generic forms. So while the latency of inc is possibly the same as add, it still has a single-byte encoding (on 32-bit x86, anyway; the short forms became REX prefixes in x86-64).
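
As a small illustration (a sketch of the typical mapping rather than a guarantee; compiler output varies):

    /* The unary form maps naturally onto a dedicated increment
       instruction. On 32-bit x86, incrementing a register is the
       one-byte  inc  instruction (opcode 0x40+r); in x86-64 those
       encodings became REX prefixes, so compilers use the longer
       inc or add forms instead. */
    void bump(int *p) {
        ++*p;    /* often a single inc (or add) on the memory operand */
    }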


Address generation units basically do C array indexing and pointer arithmetic, e.g. a[i], p + i, where a is a pointer to elements of a particular size.

https://en.wikipedia.org/wiki/Address_generation_unit

In C, something like a[i] means *(a + i), i.e. a load from the address:

    (char *)a + i * sizeof(*a)
And I think the AGU will do a lot more than that for "free", e.g. a[i+1] or a[2*k+1], though I don't know the details.
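
For instance, a minimal sketch (x86-64 assumed; exact output depends on the compiler):

    /* The whole computation  base + index*scale + displacement  fits a
       single x86-64 addressing mode, so the load below is typically one
       instruction, e.g.   mov eax, [rdi + rsi*4 + 4]
       with the effective address produced by the AGU. */
    int next_element(const int *a, long i) {
        return a[i + 1];
    }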

> By having address calculations handled by separate circuitry that operates in parallel with the rest of the CPU, the number of CPU cycles required for executing various machine instructions can be reduced, bringing performance improvements.


Here's a (doubly-indirected) example: https://news.ycombinator.com/item?id=24813376


And what about that has anything to do with C specifically? Every useful programming language requires that cause precede effect, and every architecture that allows load-store reordering has memory barrier instructions. Specifically, where would code written in C require the compiler to generate one of these instructions, where code hand-written for the processor's native instruction set would not?


It matches C's semantics exactly, to the point where ARM chose specific acquire/release instructions to match the "sequential consistency for data-race-free programs" model without requiring any global barriers or unnecessarily strong guarantees, while still allowing reordering.

(I should note that I believe this is actually C++'s memory model that C is using as well, and perhaps some other languages have adopted it too.)
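
A minimal sketch of that mapping (assuming C11 atomics and a typical AArch64 compiler):

    #include <stdatomic.h>

    _Atomic int flag;
    int data;

    void publish(void) {
        data = 123;
        /* release store: on AArch64 this typically lowers to a single
           stlr instruction, with no separate dmb barrier */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    int consume(void) {
        /* acquire load: typically a single ldar instruction */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;
        return data;        /* ordered after the acquire load */
    }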


Yep. They have a compiler to bring it down to the metal so IDK what you're saying.

--- EDIT ---

@saagarjha, as I'm being slowposted by HN, here's my response via edit:

OK, sure! You need some agreed semantics for that, at the low level. But the hardware guys aren't likely to add actors in the silicon. And they presumably don't intend to support e.g. hardware-level malloc, nor hardware-level general expression evaluation[0], nor hardware-level function calling complete with full argument handling, nor fopen, nor much more.

BTW "The metal which largely respects C's semantics?" C semantics were modelled after real machinery, which is why C has variables which can be assigned to, and arrays which follow very closely actual memory layout, and pointers which are for the hardware's address handling. If the C designers could follow theory rather than hardware, well, look at lisp.

[0] IIRC the PDPs had polynomial evaluation in hardware.


The metal which largely respects C's semantics? For example, here are some instructions that exist to match C's atomics model: https://developer.arm.com/documentation/den0024/a/Memory-Ord...


I've done work on a proprietary embedded RTOS that had high-level versions of those barriers at least a decade before the C atomics model was standardized (and compiles them to the closest barrier supported by the target architecture).

I suspect that the OS and Architecture communities have known about one-way barriers for a very long time, and they were only recently added to the Arm architecture because people only recently started making Arm CPUs that benefit from them. And that seems like a more likely explanation than them having been plucked from the C standard.

Moreover, one-way barriers are useful regardless of what language you're using.


Note that I am specifically pointing to those exact barriers, and not "any old barriers". C's memory orderings don't really lower down to a single instruction on any other platform that I'm aware of because of subtle differences in semantics.


[0] Close, it was the VAX-11 and is the poster child for CISC madness.



