Why has CPU frequency ceased to grow? (2014) (intel.com)
235 points by Osiris on Feb 21, 2018 | 291 comments


That was a bit misleading in some ways. First, in pipelining you'll typically measure how long a pipeline stage is in FO4s, which is to say the delay required for one transistor to drive 4 other transistors of the same width. Intel will typically design its pipeline stages to have 16 FO4s of delay. IBM is more aggressive and will try to work it down to 10. But of those 10, 2 are there for the latches you added to create the stage and 2 are there to account for the fact that a clock edge doesn't arrive everywhere at exactly the same time. So if you take one of those 16 FO4 Intel stages and cut it in half, you won't have two 8 FO4 stages but two 10 FO4 stages. And since those latch transistors take up space and energy, you've got some severe diminishing returns problems.
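
To spell out the arithmetic (a back-of-the-envelope sketch using only the numbers above, nothing more):

    package main

    import "fmt"

    func main() {
        // Rough numbers from the text above: a 16 FO4 stage where 4 FO4
        // are fixed overhead (2 for the latches, 2 for clock uncertainty).
        const overhead = 4.0
        const stage = 16.0
        useful := stage - overhead // 12 FO4 of actual logic

        // Split the logic in half; each new stage pays the full overhead again.
        split := useful/2 + overhead // 6 + 4 = 10 FO4, not 8

        fmt.Printf("old stage: %.0f FO4, new stages: %.0f FO4 each\n", stage, split)
        fmt.Printf("frequency gain: %.2fx instead of 2x\n", stage/split)
    }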

One thing that's changed as transistors have gotten smaller is that leakage has become more of a problem. You used to just worry about active switching power, but now you have to balance the higher voltages and lower thresholds that switch your transistors quickly against the leakage power they generate.

And finally, velocity saturation is more of a problem on shorter channels, making current rise more linearly with gate voltage rather than quadratically.


Good points. One thing I would like to emphasize is the issue with clocks not arriving everywhere at the same time. Balancing the clock tree over a chip gets harder and harder.

But the clock setup and hold times also get shorter and shorter as the clock frequency goes up. The clock signal will have jitter. The end result is that less and less of the clock period is usable to sample the signal into the register.

And this in turn puts a strain on how well balanced the logic between the registers is, so that all signals can traverse the logic paths through the gates and stabilize in time to be sampled.

To add to the complexity, as we move down the geometries, the difference in performance between individual transistors becomes relatively larger. One reason for this is that oxide layers consist of (on average) fewer and fewer molecules. When the layer was made up of 100 molecules, 101 or 102 didn't really make much of a difference. But when the average is 4 molecules, one more or less has a huge impact on performance.

So controlling variance (clock tree balance, jitter in clock generation, imbalances between paths and variance in chip production) becomes ever more problematic and important.
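
To make that concrete with a toy timing budget (every number below is a made-up but plausible illustration, not from any real process):

    package main

    import "fmt"

    func main() {
        // Illustrative assumptions only, all values in picoseconds.
        const (
            period = 250.0 // a 4 GHz clock
            clkToQ = 25.0  // register clock-to-output delay
            setup  = 20.0  // register setup time
            skew   = 15.0  // clock doesn't arrive everywhere at once
            jitter = 10.0  // cycle-to-cycle variation in the clock edge
        )
        logicBudget := period - clkToQ - setup - skew - jitter
        fmt.Printf("of a %.0f ps cycle, only %.0f ps (%.0f%%) is left for logic\n",
            period, logicBudget, 100*logicBudget/period)
    }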


How do you know all this stuff?


Got my master's in it, before ending up in sensors then robotics instead. And a continued interest, I guess.

Here are some free relevant courses. You might have to go back and take the pre-reqs.

https://ocw.mit.edu/courses/electrical-engineering-and-compu...

https://ocw.mit.edu/courses/electrical-engineering-and-compu...


How does one get into robotics? I have not looked much but none of my local schools seem to have "robotics".

I tinker with electronics and make some remote controlled robots for fun (internet controlled, live video with multi-user input, sort of crowd controlled). I am now trying to teach myself about Kalman filters and control theory, and I want to build more autonomous robots.

But any info on getting into robotics for a day job would be nice.


Well, my path was finding the motion control work on giant dish radars really satisfying, then doing well in an interview because I could speak fluently about Kalman filters. But really you should be able to be useful on a robotics team if you have good programming, electronics, or mechanical engineering skills and then learn more on the job. Learn one of those deeply and ideally a few things about the other two as well.


There are other ways to get into any field besides studying it in school, but if you are going to go to school anyway and want to study something directly relevant to robotics, how about a mechatronics or a controls engineering program?


Thank you for this! I have been looking for IC design MOOCs for a while.


In case you're wondering, FO = Fan Out.


I don't think anyone uses U/LVT transistors in low geometries; the leakage would be a nightmare.


I know a lot of people using LVT transistors in 28 and 16/14nm processes, including relatively low power (mobile and embedded) designs. I personally have used LVT variant SRAM blocks for both our 28nm and 16nm designs, and ULVT cells manually placed for critical path for Neo's FPU for our 28nm chip.


I should've rephrased: I don't know of anyone* who uses ULVTs exclusively, to answer the parent's point about using ULVTs to increase speed.

* Okay, I know of some people, but their design is different.


I'm not surprised, though I don't have a good sense of what the exact numbers are.


Leakage diff @ 125 degrees celsius is about one order of magnitude.


Anywhere where I can read more on this?


I took a class that went over this in depth like 3-4 years ago. Basically the message was that serial performance is saturating, and the only way to get speed improvements in the future is going to be by exploiting parallelism. However, most programmers, and programming languages, remain stuck in a serial-by-default paradigm. I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward. I find the apparent stagnation extremely depressing.


While Rust doesn't promote a specific model for parallelism, its stronger compiler helps a lot with that.

https://doc.rust-lang.org/beta/nomicon/concurrency.html

https://doc.rust-lang.org/book/second-edition/ch16-01-thread...

I don't think fixing C(++) can give what Rust can give, because Rust has a clean start with these strong guarantees built-in while for C(++) it would always be an addon. Defaults are powerful.


A really powerful outworking of it is seen in the Rayon crate, where you can change a sequential iterator into a parallel iterator just by adding the crate, importing the trait and changing `iter` to `par_iter`. If it’s not thread-safe to do, then it won’t compile. (That’s the big difference from C++.) If it is, it will, and it’ll be smart about how it runs, spreading the load across all available cores pretty much optimally, or not bothering with multiple threads if it’s not going to be worth it (e.g. single-threaded, or only one item in the iterator). And all that with close enough to no overhead.

Basically, Rayon makes data parallelism really easy in a way that few if any other languages do. I’d love to have an equivalent in Python or Node, but it’s just not possible to achieve such a thing in most languages—even if you ignore the thread safety aspect.

Parallelism hasn’t seen a great deal of use until it’s urgently needed, because it’s hard to get right in most environments, and you normally need to substantially refactor code to make it happen. My hope is that with the likes of Rayon, parallelism can be a much more natural thing that people that care even a little about performance will just do, because it’s so easy to do.

https://crates.io/crates/rayon


> Parallelism hasn’t seen a great deal of use until it’s urgently needed

This is the main reason why, initially, Windows Store APIs were all async.

Microsoft learned that when developers can choose between both models, by default most chose synch models.


>Basically, Rayon makes data parallelism really easy in a way that few if any other languages do.

Syntax-wise, there's OpenMP which can turn a for-loop into a parallelized for-loop (independently scheduled iterations) with just some syntactic sugar on top of the loop.

OpenMP has support for at least C++ and Fortran, and is not hard to use.

I wonder how Rayon compares to OpenMP.


FWIW, you can also achieve similar behavior in Clojure via pmap, Reducers, etc.



To me, the "If it’s not thread-safe to do, then it won’t compile" part sounds rather more notable.


That's true of Haskell's parallelism facilities. It's also true for STM. You can still get stuck or have race conditions with the other concurrency abstractions, but it's less common than I encountered elsewhere. I didn't have problems resulting from mutable shared state that was aliased across threads and wasn't supposed to be; it's always explicit.


Another good crate to look at for this is Actix, a Rust actor framework:

https://github.com/actix/actix


C++ has parallel algorithms built in (http://en.cppreference.com/w/cpp/algorithm).

Parallelism is complicated though, and easy parallelism pretty much requires a functional style. Things like Haskell's Accelerate library (https://www.stackage.org/package/accelerate) seem the ideal way forward to me.


Should be noted that so far only Visual Studio (partially) implements C++17 parallel algorithms, although there are several third-party implementations.[1]

Apples to oranges comparison of course, since Rayon isn't [planned to become?] part of the Rust standard library.

[1] http://www.bfilipek.com/2017/08/cpp17-details-parallel.html#...


If you're including third-party libraries, you have things like ArrayFire and Intel TBB.


Doesn't Go also make it relatively easy to write parallel code?


Not especially. It has goroutines, C# has Tasks, C++ has green thread libraries. Added onto this are channels, which are basically thread-safe queues; other languages have those too. (I am aware that there are differences between what features goroutines and Tasks provide.)

Go's implementation of these things is nice, neat, and all included out of the box though which is nice.

In general you can make parallelism easy, or efficient, rarely both, at least not in a way that can solve problems generally.

edit: I should add Go does also come with a data-race detection tool which can be very useful. Not sure any other language includes that out of the box!
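
For example, `go run -race` on a deliberately racy toy program like this one will report the conflicting accesses and their stack traces (the same flag works with `go test -race`):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        counter := 0 // shared and deliberately unsynchronized
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    counter++ // data race: two goroutines write without a lock
                }
            }()
        }
        wg.Wait()
        fmt.Println(counter) // result varies; the race detector flags the writes
    }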


I think Go does make it relatively easy to write parallel code in comparison to most other frequently used languages.

> Not sure any other language includes that out of the box!

Rust's compiler does it by default, at compile time. ;-) Go's race detector is never wrong, but it may miss things. On the other hand, Rust's compiler is also neither wrong nor does it miss things, except under one circumstance: someone, somewhere, wrote `unsafe` code and committed a bug inside that block.

ThreadSanitizer is also a thing: https://clang.llvm.org/docs/ThreadSanitizer.html


Except in other languages it's not part of the core, it's an external library. So in the end it's easy to write "parallel" code.


In C#, Task is part of the language, thread safe queues are part of the core libs.

Still not quite as tightly integrated as Go though.


I believe you're thinking of concurrency, which go handles pretty well with goroutines.


Go can use multiple cores. Goroutines will run in parallel (assuming GOMAXPROCS > 1.)


[flagged]


As someone who frequently posts about my personally-excellent experiences with Rust, what makes you suspect it's astroturf rather than just turf? You think Mozilla is paying people to post about Rust using puppet accounts? C'mon.


Think about it. Mozilla Rust -> Godzilla (C)Rust -> "Godlike" giant lizard crust -> God is radiant -> light -> illumination -> Illuminati lizard people from beneath the Earth's flat crust are funding paid protestors to shill Rust, it's really the only explanation that makes sense.

(The person you're replying to is using the transparent internet argument tactic of trying to cast doubt upon genuine enthusiasm by implying some vague sinister motive, conveniently without bothering to articulate what that motive might even possibly be. Let's all recognize bad-faith arguments, downvote, and move along.)


It does seem a little odd that every comment about Rust gushes about its progress and stability without discussing any downsides. One would be similarly suspicious if, e.g., Perl6 were discussed this way.


Try it out and write the critique, then. The posts are positive because Rust is genuinely achieving rapid progress and stability... but certainly not without flaws.

I've experienced plenty of frustration, albeit outweighed by the massive benefits for my use case. And there are whole problem domains to which it's just not suited. I see these mentioned pretty frequently.


For one, there is no consensus on how to do parallelism and concurrency in Rust, since none of the libraries are mature.


> none of the libraries are mature.

Rayon, the Rust library getting most of the discussion in this thread, is now 1.0: https://github.com/rayon-rs/rayon/blob/master/RELEASES.md .


Don't most people opt for something called Tokyo? I don't do Rust much, but even I have heard of it by now...


Tokio for IO concurrency, Rayon for data parallelism.


Exactly! Rust doesn't solve any parallelism problems. It's no better than established languages like C# or F#, and probably worse.


I'd consider handling data races gracefully a pretty large step forward in doing parallel data computation.


>It does seem a little odd that every comment about Rust gushes about its progress and stability without discussing any downsides.

Doesn't seem odd at all.

First, a lot of people commenting about Rust are enthusiastic recent adopters that haven't seen much of the language, including any real ugly sides yet.

Second, it's not entirely true, almost all Rust threads mention the steep learning curve, the slow compiler, and other issues such as the variadic generics (or lack thereof to be precise).

Third, we have seen the same "early adopter enthusiasts seeing it all rosy" cycle for RoR, Go, Node, and Mongo; nothing out of the ordinary here.


That's what happens when you have a new hip language/framework that has a lot of hype.

You see the same with Rust, Elixir, VSCode, Elm, Purescript, ReasonML, etc.

It's not astroturfing, it's just a lot of beginners that are overly excited to proselytize their new discovery.


I don't think it's fair to claim these people are all beginners. More likely it's a phenomenon where mature languages have been around long enough that you already know their strong and weak points, but when you look at a language in development you only see its potential.


I mean, it's not just true for new languages. For Haskell, you see beginners claim that "it's a perfect language and can do no wrong", while experts in the language will acknowledge its faults, including: long compile times, the downsides of laziness, etc.


I'd say exploiting parallelism is not the only way at all. Parallelism is only one way to compute differently. Specialization of hardware to specific workloads will explode in the coming years as we can't rely on Moore's law anymore. This will happen on RISC-V, IMHO.

We already have these:

* Rendering, medium precision mathematics: GPU

* Low precision mathematics: TPU

* Software Defined Networking: Microsoft is deploying FPGAs, AWS has its own hardware

We could have:

* Databases: Projections, hashing, sorting in hardware.

* Dynamic runtimes: Hardware implemented memory models, HW assisted GC, code caches and user-level interrupts for the JITs. Here is the J extension RISC-V working group: [1]

etc.

Also, why not have the usual hot paths in Node.js|Spring Framework|Django directly etched into hardware? HW http header parsing surely could bring benefit to them all.

----

Of course language and programmers will have to adapt, but in a lot of cases the runtimes will take care of it automatically.

[1] https://groups.google.com/a/groups.riscv.org/forum/#!msg/hw-...


Currently working on software for RISC-V and I can definitely see some opportunities here. One thing that's becoming apparent though, is that some level of abstraction needs to be available in order for many of these things to become useful. RISC-V has a very exciting Vector processing extension, which is intended to replace packed SIMD in most cases, but some software systems assume packed SIMD is the only way to get more FP performance.

For example, WebAssembly specifically exposes SIMD primitives, which means that it may be necessary to work backwards from those SIMD primitives to make use of a true vector machine.

I think many people simply underestimate the cost of adopting a new programming model.

> Also, why not have the usual hot paths in Node.js|Spring Framework|Django directly etched into hardware? HW http header parsing surely could bring benefit to them all.

Well, in all the listed cases here, the CPU is not the bottleneck on throughput. As far as I can tell, the problem with HTTP is not that headers take too long to parse, it's that memory is still too slow, and context switches cost us precious time. The problem with Node is not that the hardware doesn't adequately model the semantics, it's that dynamic, weak typing makes it hard for any system (software or hardware) to understand what type things are.

Update: The J extension seems interesting, and I've read some research (not thoroughly) recently showing considerable power and time savings from hardware GC primitives. I'm excited to see what goes on in that committee.


> I'm excited to see what goes on in that committee.

I am too. So far, this post on the general RISC-V mailing list and a few videos online talk about it. I'd love to have some other sources of information on their progress.

Also, I've heard that the RISC-V foundation is actively seeking collaboration for Java. There is some work on having RISC-V backends in HotSpot and JikesRVM, but so far it is limited to interpretation IIRC. The fact that Oracle is not jumping at it and pouring hundreds of millions into it is beyond me.


> There is some work on having RISC-V backends in HotSpot and JikesRVM, but so far it is limited to interpretation IIRC.

Well, JikesRVM is a proper JIT. Palmer Dabbelt from SiFive has worked on a HotSpot port before (for a different platform). I'm currently working on a V8 port. The availability of platform software and language environments is obviously of paramount importance, since it'll shape the remaining first impressions of the architecture.

> The fact that Oracle is not jumping at it and pouring hundred of millions into it is beyond me.

Well, it's a lot of work, and they have their SPARC investment to continue.


Still, if you have only two floats to add and make a decision based on the result, a GPU or a TPU will not help you.

Hashing, maybe sorting, blitting and some math could be performed at the on-module DRAM controller level even, without data crossing over the slow DDR4 bus or mangling the CPU caches.

I'd love, in fact, to explore such an architecture in a simulator. What would happen to CPU performance if, say, hashes could be computed without reading the data, memory could be cleared without zeroes hitting the bus, or some SIMD operations could be conducted in the memory itself?

Edit: clarify the processing could be done on the module side of the memory bus.


Are you aware of previous work on processor-in-memory architecture research? They considered things like this, but I think it stalled out due to semiconductor process and economic practicality.

This article mostly mentions a UC Berkeley effort: https://en.wikipedia.org/wiki/Computational_RAM

This page mentions others: http://www.ai.mit.edu/projects/aries/course/notes/pim.html


> memory be cleared without zeroes

In some specific cases, such as when the Linux kernel maps memory into your process, this is exactly what happens. When you write the page, it faults and clears it on demand; but I don't think there would be a considerable benefit to doing this at a finer granularity.


Ideally, you would prefer to postpone the actual writing to memory until the moment the cached version is evicted from the on-chip caches. When you write that data to RAM, it'll take a lot of CPU cycles during which even external memory fetches will end up being delayed. And if you have to write all zeros for that page when all you are using are the first 10 bytes, it's a lot of cycles going to waste. Being sure the memory was actually zeroed out when you commit those bytes to DDR could save the memory buses a relatively large amount of time.


There is a great paper from Jeff Dean on using the TPUs for more traditional CS functions.

https://research.google.com/pubs/pub46518.html The Case for Learned Index Structures - Research at Google


GPUs and Google's TPUs are only capable of certain, albeit very important, aspects of numerical mathematics. There are other areas of mathematics that have numerical aspects, and thus precision concerns, but aren't linear algebra.


> Basically the message was that serial performance is saturating, and the only way to get speed improvements in the future is going to be by exploiting parallelism.

People have been warning us about this for about 10 years now, but I still don't see those 64-core CPUs I was promised anywhere.

If we had the amount of parallelism we were told we were going to get, we could give every app its own core. OSes could even consider disabling context switches altogether for the majority of apps. Instead, we're left complaining about Electron apps like it matters.

That said, I'm not sure what stagnation you refer to. There's a reason languages like Rust, Elixir/Erlang and Go are getting popular. My PHP app could handle hundreds of concurrent connections on a single machine, my Elixir app handles hundreds of thousands. Yet, processors didn't get 1000x faster (and Elixir isn't even a particularly fast language). This is the opposite of stagnation, it's progress.


They have been here for a while, just for a very specific market.

Threadripper and EPYC exist now though. With 32 and 64 logical cores respectively.


Sure, there's been high end niche products for anything forever. You could buy a 64 core computer in 1990 (they'd call it a supercomputer, but same thing).

The people telling us we had to hurry up and change our code to use parallel processing predicted a significantly faster increase in the number of cores on commodity hardware. Instead, CPUs stopped getting faster and hardly gained more parallelism.


You could buy a 64 processor machine in 1990, but it wouldn’t have been a 64 core machine in the sense we’re talking about — a single socket system. This isn’t some trivial distinction either, as the whole memory architecture is very different indeed for the two scenarios.


Even in single-socket configurations, Threadripper and EPYC are still both NUMA architectures - the cores are split across two or four dies, each of which has its own memory controller and memory attached to it, with requests from a core to memory on another die going via an interconnect.


> those 64-core CPUs I was promised

The people promising that were crackpots and no one really called them out on it, so that meme got repeated everywhere despite being wrong. Processor vendors can't release a new processor that runs existing apps slower because no one would buy it (not counting monopolistic tactics). And since many existing apps are single-threaded, that means new processors have to at least maintain the same single-threaded performance which means keeping brainiac cores which means you can only afford 6-8 of them. (And arguably there are 64 weak cores in your CPU; they're just in the IGP and you have to program them with OpenCL.)

Go/Rust/Elixir are not so much progress IMO as undoing the negative progress of writing large-scale software in scripting languages.


There are plenty of 64 cores servers. Yet, multi-core architectures seem to have hit a wall caused by slow memory access.

I'm still waiting for the massively parallel NUMA machine in a chip, but there are many manufacturing problems keeping those away.


And hit a wall for power/density.

https://en.wikipedia.org/wiki/Dark_silicon


If you can find an Nvidia GTX 1080, that has 2560 cores.


They're playing games with the term 'core'.

In CPU terms, a 1080 is 40 cores, each with a 64 way vector unit.


I'm surprised they don't sell it as a 40 core, 64-dimensional processor.


Without branching (or with limited branching, last I checked it masked and re-ran instead). A core implies both instruction and data operations, rather than SIMD behavior.


Where "core" ~= vector lane


https://golang.org/

In case you don't know, Golang goroutines are a marvel of parallelism. They are coroutines which are dispatched onto a few OS threads. So you can use 100% of a multi-core CPU and yet spawn, say, 10K of those light threads without worrying about context switches PLUS have them all run concurrently. I've found that Golang is one of those rare languages, like Lisp, that actually change the way you think about programming. Makes you feel really more powerful.

If you don't know the language, I suggest running the following and watching your CPU activity and memory (or any metric):

  package main

  import "time"

  func main() {
    for i := 0; i < 10000; i++ {
      go func() {
        for {
          time.Sleep(time.Second)
        }
      }()
    }
    select {} // block main so the goroutines keep running
  }


In my experience the facilities to throw tasks into a scheduler that will run them in parallel was never the hard thing to accomplish (regardless of the language: some may have built-in capabilities, others may have syntactic sugar provided by a lib, but at the end of the day, most systems provide some kind of runTask(f) method).

What's really hard is to break down a problem into parallelizable chunks, figure out as much independent work as possible to reduce the touchpoints, and coordinate all those tasks such that they keep the CPU as busy as possible and as a whole finish as early as possible.

Beyond this "parallel breakdown design", it's the little touchpoints with shared data structures and synchronization that create the difficulty of implementation, and I haven't seen any language or system that does magic there.


> throw tasks into a scheduler that will run them in parallel was never the hard thing to accomplish

Common mistake number 1. Goroutines are running in parallel AND concurrently - they are coroutines (common mistake number 2 is to think they are only coroutines). I suggest not underestimating that, it's the big deal. Additionally, it's important to note that goroutines yield on sleeps (any kind of sleeps / waits, like disk reads, network requests, channel writes/reads, etc), and while it may sound like a detail, it's a wonder of CPU control. Due to that, there is also a rare elegance to the way Go solves sharing data via channels.

> What's really hard is to break down a problem into parallelizable chunks, figure out as much independent work as possible to reduce the touchpoints, and coordinate all those tasks such that they keep the CPU as busy as possible and as a whole finish as early as possible.

That's exactly what Go is a wonder for due to the combination of parallelism, concurrency, yield-on-sleep and channels.
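
A minimal sketch of what I mean (a made-up example, not from any real codebase): hand the work to goroutines over channels instead of having them all mutate shared state. Channel sends and receives are yield points, so blocked goroutines cost next to nothing.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        jobs := make(chan int)
        results := make(chan int)
        var wg sync.WaitGroup

        // A few worker goroutines; waiting on `jobs` is a yield point,
        // so idle workers don't occupy a hardware thread.
        for w := 0; w < 4; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for n := range jobs {
                    results <- n * n
                }
            }()
        }

        // Feed the workers, then shut everything down in order.
        go func() {
            for i := 1; i <= 10; i++ {
                jobs <- i
            }
            close(jobs)
            wg.Wait()
            close(results)
        }()

        sum := 0
        for r := range results {
            sum += r
        }
        fmt.Println(sum) // sum of squares 1..10 = 385
    }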


And yet, Go's standard map doesn't allow concurrent access. They recently added a concurrent map feature but it's probably easier to add a lock to your code than refactoring it with the new map type. I would've preferred if they had introduced a map type that can be used exactly as the standard one (i.e. without function calls). Calling functions via go func() is really easy but handling data between goroutines can still cause headaches and is something that could be improved.
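
The lock-based version is at least short; a sketch of what I mean (hypothetical type and names):

    package main

    import (
        "fmt"
        "sync"
    )

    // counts wraps a plain map with a mutex; every access has to go through
    // these methods, which is exactly the boilerplate I'd like the language to hide.
    type counts struct {
        mu sync.Mutex
        m  map[string]int
    }

    func (c *counts) inc(key string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.m[key]++
    }

    func (c *counts) get(key string) int {
        c.mu.Lock()
        defer c.mu.Unlock()
        return c.m[key]
    }

    func main() {
        c := &counts{m: make(map[string]int)}
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                c.inc("hits")
            }()
        }
        wg.Wait()
        fmt.Println(c.get("hits")) // 100
    }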


If you squint a little, you'll notice that go is pretty big on not hiding complexity. There are obvious exceptions to this such as the garbage collector, and heap/stack control.

I'm unsure if the intent of not including a map function was due to this, however with a for loop such as

    for x := range ch {
       slice = append(slice, x)
    }

it is immediately obvious there are allocations happening in the background.

The fact that map access, and slice access is not thread safe means there is no trickery going on in the background. The fact that it is not threadsafe means I don't need to worry about a lock if I only write to a map when it's created. Sure the compiler could take care of this - they have the race detector after all, but the compile speed is one of the design goals of go. I really like being able to compile in less than 1 second.

`sync.Map` however calls a function which implies there is more going on in the background.

If you follow the Go mantra of "don't communicate by sharing memory; share memory by communicating", handling data becomes a whole lot easier. It does still allow you to share memory in case you do need the extra speed.

Like most things, all designs are a matter of trade offs. Sacrifice one thing for another. There are other languages that provide the functionality you desire - but I understand the frustration when one thing is 90% what you want.

Go obviously has room for improvement, and perhaps a native threadsafe map is one of those areas.


Well, Go isn't really a high-level language of the kind that maps can be seen as a core feature of. The idea behind Go, I believe, is to focus on providing innovative system-level features.


Go is not a language for "system-level features" in the first place. Its garbage collector largely precludes it from that in any modern context.

That it can't do something effectively does not mean that it shouldn't or didn't mean to. (And users can't really fix it, because hey, no generics. Sigh.)


> Go is not a language for "system-level features" in the first place. Its garbage collector largely precludes it from that in any modern context.

I'm pretty sure I disagree with that largely because Go is low-level enough so that the GC is easily handled.

> And users can't really fix it, because hey, no generics. Sigh.

Users can fix it by writing languages on top of Go (particularly dynamic ones).


> golang is one of those rare languages

Not sure how rare, .NET does that as well. Below is your sample translated to C#. It requires C# 7.1 because of async Main, but the rest of the stuff has been available for many years, since 2010.

    static async Task routine()
    {
        await Task.Delay(TimeSpan.FromSeconds(1));
    }

    static Task Main()
    {
        return Task.WhenAll(Enumerable.Range(0, 1000).Select(i => routine()));
    }


It's unclear to me if these parallel tasks are also coroutines (there seems to be a call to yield but it's unclear what it really does), and if they yield on any sleeps, which is a key feature of goroutines.


> if these parallel tasks are also coroutines

I don’t have hands-on experience with golang but based on what I know about it yes, they are.

> there seems to be a call to yield but its unclear what it really does

You mean “await”?

It waits for the completion of whatever is on the right side of “await”. If the result is already available, it just continues. If the result is not yet available (e.g. Task.Delay creates a task that will complete in some moment in the future), the control goes away from the async method to the scheduler. The scheduler can then run some other task on the same OS thread. When the result of that operation becomes available (in the sample code, when 1 second delay passes), the scheduler resumes execution of that async method, on the statement after the await.

> if they yield on any sleeps

No, not any sleep. You can call Thread.Sleep() which will put the whole OS thread to sleep. It’s up to programmers to avoid calling blocking APIs from their async methods, i.e. use await Task.Delay() instead of Thread.Sleep(), await stream.ReadAsync() instead of stream.Read(), and so on.


>marvel of parallelism

If Go's m:n is blowing your mind, you should check out Erlang. Spawning a million "processes" is not a big deal.


Cool, thanks for the suggestion.


How is that an example of an improvement over C++? C++ has had `#pragma omp parallel for` for two decades.


I'm not familiar with OMP, so please forgive me if I'm incorrect - but OMP from my quick search appears to be thread based, while goroutines are much lighter weight than threads. They have their own scheduler - which like most things is both good and bad depending on what you want them for. If you use them as intended, the internal scheduler is a great design decision. This means I can happily spawn 10,000 without concern.

Additionally, combining them with the power of channels makes quick work of many tasks. Channels of course can be implemented in C++ too, but having the compiler take care of it for you with additional tools such as the race detector is very handy.

For a large set of problems, they are very nice to work with.


You can do the same in C++ on Windows with PPL, UNIX/Windows with Intel TBB, or any of the fiber/co-routine libraries.

Then there is the ongoing work to add async/await patterns into C++20.


As a matter of fact, you can do the same in assembly if you want to.

Notice how you can always have that answer when it comes to programming languages: "you can do the same in X". The point is that the way it's done in Go is awfully handy.


I hardly see a difference from

    go func() { }()
and

    Task.Run(() => { })
With the benefit that in the latter example, the runtime allows me to customise how scheduling is done.


Are those Tasks coroutines? If yes, do they yield on any sleeps?

These aspects are key to goroutines.


Yes they do, they are the building blocks for async/await, get a thread allocated from a thread pool when running, and you can control how the scheduling takes place, by providing your own scheduler implementation.


That is pretty cool, C++ has come a long way since I used it last about 15 years ago.

It seems like both languages are equally capable here, with C++ having more power and foot guns when required as usual.


The code snippet example I wrote was actually .NET with TPL.

C++ with PPL on Windows, would be

    task<T> handle = create_task([] { /* ... */ });
And with standard C++

    future<T> handle = std::async(std::launch::async, [] { /* ... */ });
In both cases, with C++20 it will be possible to co_await handle, which you can already play with on clang and VC++.


Every OpenMP implementation I know of uses a thread pool, and dispatches parallel work to it.

This is probably exactly the same behavior as goroutines.


OMP is only good for CPU bound code. You try to do IO inside OMP parallel section, and you’ll put the whole OS thread to sleep. The OS kernel will likely reschedule some other thread on that hardware thread, but that rescheduling is an expensive process.

Goroutines and .net tasks allow a nice mix of CPU bound and IO bound code. While a goroutine/task is waiting for IO or something else to complete (timer in this example), the runtime will immediately use the hardware thread for some other task, without OS involved.


Your example will not run in parallel. The go runtime will schedule your goroutines concurrently, but they will be run by a single OS thread, and consequently on a single CPU core.

Once you execute truly on multiple CPU cores (by increasing GOMAXPROCS), you'll be having the same kind of race conditions in Go as in any other imperative language (inb4 Rust Evangelism Strike Force saying "except Rust").


> Once you execute truly on multiple CPU cores (by increasing GOMAXPROCS), you'll be having the same kind of race conditions in Go as in any other imperative language (inb4 Rust Evangelism Strike Force saying "except Rust").

Wrong. GOMAXPROCS defaults to the number of logical CPUs, IIRC since version 1.5. For example, I have four cores with 2 threads each, so goroutines will be executed on up to 8 threads unless I set GOMAXPROCS to something else or the application explicitly changes it using the runtime package.

And sure, you'll really have the same problems, but IMO channels and goroutines minimize the friction of implementing thread safe programs using CSP. GP seems a bit optimistic, agreed, but I think that there is at least some substance to the idea that go makes it easier to correctly utilize multiple cores.
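
You can check what your own runtime is actually using; `runtime.GOMAXPROCS(0)` reports the current value without changing it:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        fmt.Println("logical CPUs:", runtime.NumCPU())
        fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
    }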


GOMAXPROCS has defaulted to # of cores since 1.5.


Wrong. Goroutines are not simply coroutines.

GOMAXPROCS defaults to number of cores.


Prove it.

Prove it by replacing Sleep in your example with some number crunching, and show how it scales with the number of cores in your CPU.


https://imgur.com/a/DNpw3

Running the following code.

https://play.golang.org/p/k_rRxNAyb0i

I can assure you that go runs across all processors by default.

https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3...
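
For anyone who would rather not click through, a rough sketch of the same kind of demonstration (my own toy version, not the code behind the links): one CPU-bound goroutine per logical CPU, and with the default GOMAXPROCS every core loads up.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        n := runtime.NumCPU()
        var wg sync.WaitGroup
        results := make([]float64, n)

        for i := 0; i < n; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                // Pure number crunching, no sleeps.
                x := 0.0
                for j := 0; j < 200000000; j++ {
                    x += float64(j % 7)
                }
                results[i] = x // each goroutine writes its own slot, no race
            }(i)
        }
        wg.Wait()
        fmt.Println("ran on up to", n, "CPUs, sample result:", results[0])
    }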


That's not what he's saying. He knows go can use all processors if GOMAXPROCS is set correctly, the argument seems to be that there will be race conditions just like any other threading, which seems pretty self evident to me: yes, multi-threaded code can have concurrency issues, film at 11...



Dude you are wrong about this. Please stop.


Except Rust!


I wonder if the future will be massively parallel. When CUDA and OpenCL came out I thought that future processors would have more and more cores, with core counts doubling on a Moore's law cadence. The problem is that GPUs don't have error correcting codes, so you cannot really run application code on a GPU.

The problem with parallelism is that C-like languages don't fit well; only functional languages do. If you want to use multi-threading you have to forget about state and only work with input/output paradigms. For OSes it might mean a deep re-design, but I don't really know.

A possible design would be a small but very fast CPU that only takes care or scheduling and task control, and another chip with many cores that deal with payloads and user software.

AMD had some kind of hybrid chip that was planned to do both graphics and general-purpose tasks, but it was abandoned.

Going parallel would require to change both hardware and software, and by software I mean stateless.


These are very good points! I would generalize "functional" to declarative though:

Logic programming languages like Prolog and Mercury are also much more amenable to parallelization than C-like languages.

In fact, different Prolog clauses could in principle be executed in parallel without changing the declarative meaning of the program, at least as long as you stay in the so-called pure subset of the language which imposes certain restrictions on the code.


> The problem is that GPU don't have error correcting codes

.. eh? What are you referring to here? Mainstream CPUs don't have error correction either, unless you're talking about ECC on the higher-end ones.


Actually, mainstream CPUs do use error detection and correction for internal operations and memory.

https://en.wikipedia.org/wiki/Machine-check_exception

Realise that if you only protected RAM with ECC then you're leaving a lot of data vulnerable in caches and registers, so those need parity bits too (as well as lots of other error checks on CPU operations). And anyway, CPU errors are common due to bad power supplies and overclockers. And you don't want to add a whole lot of design effort to create a marginally cheaper-to-produce version of the CPU which doesn't do any error checking.

I've seen a lot of MCEs on non-ECC CPUs :(

IANAEE


I always like to think of FP dataflow pipelines as doing digital circuits modeling.


The first GPU models were basically stateless (via pixel and vertex shaders with texture input and outputs), but this was incredibly inefficient for many GPGPU tasks, so compute shaders and CUDA have ways to load from and store to arrays. The memory model is a bit funky, but I’m not sure how going back to functional is viable for GPU programming.


Intel Larrabee also got cancelled.


Kind of, it became Xeon Phi.


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward.

There has. GPU programming is exactly that. CPU-heavy tasks (games, bitcoin mining, machine learning) have already migrated.


Unfortunately it requires more than just a change of language, it requires a change in mode of thinking by developers. People are very used to reasoning in terms of "do X, then Y, then Z" or "compute a value X then do A or B on the basis of that". In order to achieve automatic parallelism you need e.g. a type+proof system that can determine that X/Y/Z are independent, or a system that can partially execute both A and B then retire the branch not taken - without invoking security bugs!


I think it is easier to understand by those of us that also had electronic design as part of the engineering degree.


This issue was discussed in my Occam class ~28 years ago. Occam itself is an example of a concurrent/"parallel-by-default" programming language intended for a Transputer hardware environment (now retargetable to x86 etc.), but it's not the easiest of languages to learn:

https://en.wikipedia.org/wiki/Occam_(programming_language)

https://en.wikipedia.org/wiki/Transputer

https://en.wikipedia.org/wiki/KRoC


Sequential programming is the default because most people think sequentially by default.

Even though claims of multi-tasking etc persist, the truth is good parallel programmers are a rare thing.

Many ordinary programmers already get into hot water when they use two threads and access data where a semaphore might be needed.

In addition, many algorithms are sequential, so parallelizing them is tricky or gives you no true reward due to cross-thread communication. Add to that the OO software structures that subtly encourage sequential programming.

I think the future in parallel programming is actually hiding the parallel programming completely - accept the fact that most humans are not made for it, allow experts to unlock the ability to override that behavior - let compilers go as far as they can and live with the results.

It will suffer the same fate as functional programming. Really useful, but never dominant, due to the limitations of the humans applying it.


I think part of the reason why many programmers haven't wanted to deal with parallel execution is because concurrency is not easy to handle. It has several pitfalls and can be painful to debug. Also, it needs proactive effort to implement, so as long as it's not required, devs just stick to serial execution.

Now, with helpful systems like no-side-effect functional languages and reactive stream frameworks, a lot of gory detail can be abstracted away. I think this has recently led to more parallel-by-default software development.


Most software is fast enough when written in a naive sequential style. For the parts that parallelize well and matter, there are already decently mature ways of using all cores. Languages like Go, Rust, and Erlang make it fairly easy to write concurrent programs.


> However, most programmers, and programming languages, remain stuck in a serial-by-default paradigm

You still have to decide upon the unit of work that is going to be sent to a different thread/core/processor/NUMA node/whatever. The different units of work that are distributed should not share state; one really doesn't want to be sharing a lot of state between different processors, because synchronizing the processors' memory caches across NUMA nodes is extremely slow.

I guess it is really hard to break up both the program and data and decide upon the optimal granularity of the work units, it is not something that can be easily done behind the scenes - human intervention is still required.


I agree, programming languages have not caught up yet.

The right kind of language looks at serial program formulations and based on flow-analysis automatically identifies parallelizable fragments that are large enough to benefit from multicore, then schedules these fragments e.g. by using work-stealing in a system of green threads, i.e., mapping green threads to OS cores as efficiently as possible. Something like that.

In a good parallel language there need to be many immutable constructs by default, exception handling is tricky, and ordinary flow control needs to be compatible by default with parallel evaluation. The languages I've seen such as Parasail are not yet production ready.

Making the programmer control parallelism can be okay, like in Go and Ada, but in the end it should be automatic.

Edit: The problem is also that finding a neat way of solving the problem academically does not readily translate into an efficient implementation, so much that I wonder whether green threads are actually worth it over OS threads. In most languages/VMs they aren't but Go seems to be an exception.


One reason is that our thinking process is inherently serial and we do really badly at multitasking naturally, at least for tasks involving deliberate thought. And our programs are almost an extension of our way of thinking, so we will have to push the boundaries of how we think and model systems in our mind before we can build excellent parallel programming languages. Not that it isn't being done, but the weight of the "serial" legacy is long...


It's not just our thought process; there are upper limits to the gains from increased parallelism.

https://en.wikipedia.org/wiki/Amdahl%27s_law
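
Plugging a few numbers into Amdahl's formula makes the ceiling concrete (the 95% parallel fraction below is just an example input):

    package main

    import "fmt"

    // amdahl returns the overall speedup for a program whose parallel
    // fraction p is spread across n cores (Amdahl's law).
    func amdahl(p, n float64) float64 {
        return 1 / ((1 - p) + p/n)
    }

    func main() {
        for _, n := range []float64{2, 8, 64, 1024} {
            // Even a program that is 95% parallelizable tops out near 20x.
            fmt.Printf("95%% parallel, %4.0f cores: %5.1fx speedup\n", n, amdahl(0.95, n))
        }
    }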


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward.

CUDA. OpenCL. Vulkan Compute Shaders. DirectCompute. C++ AMP. AMD's ROCm. Intel's SPMD. Khronos SYCL.


This argument made sense 10 years ago. We've had dual cores for more than a decade, and CPU speeds stopped growing years ago. If you still can't think in threads and their primitives, or require crutches to handle multithreaded situations, then the problem is with you.


Beyond the issues of languages and mindset, many problem domains parallelize poorly. Too many important phenomena fundamentally involve feedback loops evolving over time, and when that happens you can't just compute f(t) and f(t + 1) on different cores. At that point, throwing more cores at the problem might let you make the model bigger, but will very quickly hit a wall in terms of making it faster.


A few attempts have already been done, the issue is with developer adoption, not lack of trying.

One of the best examples was StarLisp for the Connection Machine.


starlisp required that you rephrased your problem to be a large vector, with support for turning off logical cpus, very much like programming a modern gpu. if your problem could be recast that way it was pretty nice. however the generalized scatter/gather was so expensive, it had to be used very sparingly. you had to use the grey coded nearest neighbor hypercube network as much as possible.

the *really* cool language from Hillis and Steele was cmlisp, but I don't know how far they got, they never released anything.


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system

There has. It's called a GPU. Things like OpenCL and CUDA are the new languages.


Yeah, the tricky bit is it's not just a new language but also laying out your data in a new way that's compatible with vectorization/etc. Modern OO/etc techniques love to litter pointers to random places in memory at nearly every step.

It's partly what made the PS3 so hard to write for, the SPUs only have 256kb of directly addressable memory, everything else is DMA'd. That said when you had your code+data fitting in 256kb it screamed everywhere else as well since you fit in L1+L2 cache neatly.


But it would be much better if a single language + hardware system emerged, rather than a multitude of mutually incompatible hardware systems and languages.

With the current fragmented and sometimes proprietary forest of programming platforms, an application needs to be quite specialized to warrant investment in GPU compute outside the original niche of graphics acceleration. There are other giant problems too, after you get over rewriting your application for numerous different platforms - atrocious quality of GPU drivers causing OS crashes for users, lack of any common way to debug GPU code, the colourful quality of compilers, the wildly different performance characteristics of different platforms necessitating per-platform algorithm changes, etc...

Consider what a minority of applications bother to even put in the work to exploit large amounts of CPU parallelism, which is vastly easier. There is after all >10x parallelism available on a typical PC CPU, after you count cores, threads and SIMD lanes.


> But it would be much better if a single language + hardware system emerged

There will, but it takes time. The world of scalar hardware in the 1970's was no less fragmented. Honestly most of the incompatibilities in the SIMD world at this point are bugs and not fundamental problems. The vector world has settled on a broad architecture at this point for most things.


Maybe. I feel it's at least equally likely that the app dev perceived ROI on GPGPU will get worse due to the relative slowdown in GPU advances and increasing parallel programming productivity on CPUs, and an attractive GPGPU platform won't emerge before it's irrelevant.


With just Chrome and IntelliJ running on Ubuntu, I have 281 processes and kernel workers running (as reported by ps). Just stitching the apps together for display on your screen requires 3-4 different processes (and a GPU). That's why 4 logical cores is a bare minimum these days for a desktop, even if no individual application takes advantage of more than a single core.

On the server side, when we deploy a Node.js web service on AWS we start one instance per logical core, for 4-64 processes all independently serving connections.

It seems the process has become the new thread, the smallest unit you should design for. So today's workloads actually make pretty good use of all those cores. Unless you're doing high performance computing and need to squeeze every last drop of performance, processes are a straightforward way to parallelize.


Having 281 processes is not a biggie. Having 281 processes that can actually use a piece of the CPUs is quite a different thing.

I think we should, really, start thinking about such things. Maybe prefixing instructions with the execution unit that should handle them (and overflow back to the first one in a circle if we have more EU's in software than the actual hardware provides), separating dependencies within code flow in a more explicit way and, at the same time, not bothering with creating threads.


One option is that you write in a high-level language where the top-level control code is single threaded, but you call APIs that perform multi-threaded operations seamlessly. The prototypical example of this is Python with numpy/blas (or with deep learning libraries like TensorFlow).


Chapel[0] and Fortress[1] come to mind. Wikipedia also has a list of parallel programming languages[2] (although it seems to play somewhat fast and loose with the definition of "parallel programming language").

[0]: https://en.wikipedia.org/wiki/Chapel_(programming_language)

[1]: https://en.wikipedia.org/wiki/Fortress_(programming_language...

[2]: https://en.wikipedia.org/wiki/List_of_concurrent_and_paralle...


I think JavaScript is nice because it's async in nature. Concurrency is hard so it's nice to deal with it using a simple language, so that everything besides the business logic is abstracted. Yes you do not get the same performance, but CPU cores are relatively cheap compared to engineer salaries.


The notion that cpu cores are cheap compared to engineer salaries only scales so far.

Scaling upwards, your opinion on that changes when a single engineer’s service is running on 10k machines.

At the other end of the spectrum, if you're developing high-performance applications for small systems (desktops, laptops, mobile), your workload isn't going to look like tens of thousands of concurrent independent requests, so the approach of getting parallelism by deploying multiple copies of the application no longer works.


> but CPU cores are relatively cheap compared to engineer salaries

I often experienced that this backfired. Single machines are still constrained in their power, and while it's easy to spin up additional VMs in the cloud, scaling a program properly to run on dozens of machines takes a lot of work. It can be faster to develop a program that is really efficient and can solve the problem on one machine than to develop quickly only to then spend the time scaling it out to a large fleet of servers.


I kinda regret adding the note about performance, because switching to a lower level language used to yield orders of magnitude more performance, but optimizers have evolved and today there's not much difference. Sometimes the higher level language will be even faster because of optimizations. And bad performance is often not to blame on the language, instead blame the programmer or more likely the business people as they think it's great to waste resources as it gives them an excuse to charge more.


> used to yield orders of magnitude more performance, but optimizers have evolved and today there's not much difference.

If this were true, you'd expect to see a lot more native Python and the like.


Python is interpreted so super optimizing compilers, SIMD auto vectorization and other recent goodies necessary to get that performance won't work.


Python is not always interpreted. And when it isn't, it's still slow.


Materials other than silicon can support higher clock rates.

From reading the open literature and advertisements by foundry companies, I think you could make a 6502-equivalent processor in indium phosphide with 64 KB of static RAM that clocks at 30 GHz. With a more refined process you might push 90 GHz and a much more complex processor.

Yes, InP is more expensive than Silicon but part of that is the low volume that InP parts are made in. Advances in Silicon are getting much more expensive, and one InP microprocessor could do the work of ten Silicon-based cores so you can save on die area without the "race to the bottom" in size.

The main issue with high clocks is fast access to memory, probably you would need an optical interface to off-chip RAM, also I don't know what the InP equivalent of DRAM is. (Something like Optane?)


1. Compound semiconductors often suffer from p-type and n-type conduction problems. Plus they're ridiculously expensive and power hungry. They also often lack a native oxide.*

2. The problem today in CPUs is not really clock speed but much more the memory access latency; Optane is much slower than DRAM and has much lower endurance.

*Even though silicon HKMG transistors use high k gate dielectrics now, they still use a silicon dioxide interfacial layer.


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward.

there has and it's called labview, although by hardware you may have meant processors. labview has many quirks, but it surprisingly gets many things right, even in futuristic ways. when i move back to text-based languages it's always a jolt primarily due to the serial nature of them, even those that support asynchronous computation. it's really hard to recalibrate to having to assign things to temporary variables and the like. and the lower dimensions of a text file compared to a higher dimensional canvas is something that sticks out as a limiting factor in supporting parallel by default.

> I find the apparent stagnation extremely depressing.

agreed.


Golang's web server, by default, serves every request in a new goroutine (thread of execution), making it parallel by default with no effort on the user's behalf.

Obviously this problem domain is easily parallelized, but it's nice to see parallelism be the de facto standard when possible and reasonable to do so.
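
A minimal illustration (toy example, handler made up): the standard library calls the handler in a fresh goroutine for every incoming request, so concurrent requests are served in parallel with no extra code.

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // net/http spawns a new goroutine per request behind the scenes.
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from", r.URL.Path)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }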


Just to be clear, goroutines aren't threads (although they are multiplexed onto threads), and they're only pre-empted at certain points in the go runtime (function calls are the big ones.)

If your request spins in a for-loop doing lots of work without function calls, other goroutines on the same thread won't get a chance to run, and you'll be limited to GOMAXPROCS simultaneous requests. In practice this never really happens though.


While modern languages' support for parallelism is adequate, tools are still lacking IMHO. I avoid parallelism when it's not necessary, because debugging all these race conditions, deadlocks, synchronization issues etc. is a nightmare.


That is mostly an issue with the "avoid IDE" crowd.

While IDE tooling can still be improved, the parallel debugging tools in .NET and Java eco-systems are already quite good.

On VS, I can have at any given moment a graphical snapshot on how all threads and tasks are interacting with each other, or just execute some of the threads.

It doesn't solve everything, but it makes it easier than a typical gdb session.


.NET has one of the best ecosystems overall, so it's more an exception than the rule (can't comment on Java as I don't work with it). As I'm mostly in low-level, embedded systems programming, you are still stuck with gdb, Valgrind and other primitive tools there, because if there's some legacy "IDE" at all, it's most often just some half-assed Eclipse plugin.


At least looking at their product sites, Microchip and Green Hills seem to have quite good tooling; then again, I don't have embedded experience on modern systems beyond mobile devices.


Does Microchip even produce multi-core MCUs? I haven't seen one, though I worked mostly with the Cortex-A series so I might have missed something.


As I said, I don't have much experience in real embedded domain outside mobile devices (iOS, Android, UWP), but aren't Cortex-A5 supposed to handle up to 4 cores?


Yes, so it looks like Microchip produces multi-core MCUs after all, though as I've mentioned, I haven't encountered them and can't comment about quality of their tools.


I guess it's not going to help a lot except in special cases. See for example what Linus had to say about many cores and the long discussion after.

https://www.realworldtech.com/forum/?threadid=146066&curpost...


New languages are coming up; it will just take long adoption cycles, given that the library and surrounding tooling ecosystem has to become mature enough to go with them.

Prominent examples being Go and Perl 6.

Perl 6 especially, given how audacious the project is. There are performance issues with it currently, though; from what I hear they are working to fix them soon.


I remember someone posting some Perl 6 code and C code that did the same thing. The Perl 6 code was shorter, easier to understand, more correct (Unicode), and was reported to be faster for what they were doing.

There are things which are slower, but since it is a higher-level language it may be easier to try multiple algorithms, one of which may be significantly faster. It also has many useful features included, which can be optimized in ways that aren't recommended for user code (writing the algorithm in NQP). There is also a code specializer (spesh) and a JIT.

Basically for many things it can be fast enough. Also if you profile your code and find something that is egregiously slow you should report it. Many times such things get optimized quickly.


There are many languages that handle parallelism well, but not all problems really need parallelism in the program itself.

For example, for web programming, you can throw several machines (or processes) running your serial program at the problem and have it run in parallel for all practical purposes.


There are limits to how much parallelism will improve things; see Amdahl's law.
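
For reference, the law itself is just one formula: if a fraction p of the work can be parallelized over N workers, the best possible speedup is

    S(N) = 1 / ((1 - p) + p / N)

so even with p = 0.95 and unlimited workers the speedup tops out at 1 / 0.05 = 20x.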


Sure, but let's start moving towards parallelism first. Maybe a few decades from now we would be close to breaking those limits.


That law is a fact of life. It's a form of diminishing returns; there's only so much you can do.



Haskell kind of qualifies. You can't have race conditions when everything is immutable.

Of course this makes some other things a bit more difficult.


Seastar framework


Honest question: Is that name a pun on C* (https://en.wikipedia.org/wiki/C*)?


Microsoft has developed Parallel LINQ, which is part of the .NET Framework and works like a charm.


Is it depressing or is it a kind of signal we should listen to and try to interpret ?


Look at VHDL or Verilog programming for FPGAs. It's effectively exactly what you're talking about.


shaders?


In fact, there is! Erlang is extremely parallel. Processes are lightweight, share nothing, and run on as many CPUs as you have.


And also its derivatives, like Elixir.


Elixir is amazing! The original author, José Valim, took a brilliant approach: Take a 30 year-old battle-tested highly-parallel VM, and build a modern language on top of it. Syntax is inspired by the good parts of Ruby (very clean), but nothing comes close in terms of the ease of parallelism... it's so natural, and insanely fast.

The unit tests are what convinced me this will be the next big thing. Beautifully clear syntax, succinct tests, and most importantly: Parallel out of the box. Hundreds of unit tests run instantaneously. Ruby TDD setups run tests that changed with maybe 1-2s lag... Elixir runs all the tests so fast that, at the beginning, I wasn't sure the tests were running.


I can't shake the feeling that this parallelism thing will be nothing but a wild goose chase, because nature seems to be highly serial in all but the most macroscopic of senses.


How can you say that? Almost every part of every organism functions simultaneously all the time. At the community scale, organisms cooperate and communicate continuously in real time. My eldest daughter is going through an ant obsession phase at the moment; their communities, and how they coordinate hundreds of individuals simultaneously, are amazing. Nature and natural selection are the ultimate optimizers - if there is any way for an organism or species to extract even the tiniest survival or reproductive advantage, nature ruthlessly optimises for it. Parallel function and behaviour is one of its primary tools.


All of your cells and neurons operate in parallel.


And yet you only have a single train of thought [1] – or multiple interleaved trains, but that's concurrency, not parallelism.

[1] Plus a couple of background processes, like breathing, that execute in parallel.


That's a huge simplification not pertaining to the fundamental nature of the brain. Maybe it is in the nature of our consciousness (which in itself is probably only an abstract concept) to perceive the processes it emerges from as a sequence of single trains of thought rather than a chaotic, continuous consolidation of processes both internal to the brain/body and outside of it.

You could look at society as a whole and see the zeitgeist as a singular "train of thought", but you'd probably still recognize humans as individual agents. I think we have a bias towards thinking of ourselves as the ultimate individuals, neither recognizing the processes within us (like those of our cells) or the processes beyond us (like those of a group of people, animals, plants etc.) as having a similar nature. This is probably a genetically advantageous trait.


> That's a huge simplification not pertaining to the fundamental nature of the brain.

I disagree. Conscious thinking is a crucial process that's inherently single-threaded, even though it runs on highly parallel hardware (the billions of neurons).

However, I must admit that my point doesn't necessarily contradict pjc's point, and I don't really agree with digi_owl's claim that parallelism is a "wild goose chase".


> I disagree. Conscious thinking is a crucial process that's inherently single-threaded, even though it runs on highly parallel hardware (the billions of neurons).

I don't even know what to make of that. What do you mean by "single-threaded" if at the same time you recognize that it "runs on highly parallel hardware"? If you actually mean that our consciousness emerges from a purely sequential process that happens to go on in a highly parallel system, no, that's clearly wrong. Experience, the fundamental basis of consciousness, actuates many parts of the brain at the same time. They process this information largely independently in different ways, and sometimes those processes result in a clear "train of thought" but most of the time they do not. You can not reason about the inherent nature of our consciousness in terms of trains of thought if you recognize any subjectivity to our experience that exists without reasoning or language. That's a matter of definition, of course, and without agreeing on a precise definition it's probably no use talking about what is inherent about it.

If that's not what you mean, CPU execution models are probably not a very helpful metaphor to explain your idea. The clearly defined layer of abstraction that separates a fully pipelined CPU design built with simultaneously operating logic gates from "single threaded" programs being executed on it doesn't exist in brains. I guess that's what bugs me most about this type of discussion on HN. It seems developers are very fond of taking their (admittedly versatile) hammers and hammer away at anything they can think of, for better or for worse.


> If you actually mean that our consciousness [...]

I'm not talking about consciousness (as in qualia or subjective experience), but about conscious thinking as in train of thought or intentionally thinking about something. For me, this process itself feels very sequential.


"there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it"

That's roughly what Intel tried with Itanium. I don't know if whatever barriers they hit are still barriers today.


Not really - Itanium tried to move the instruction-level parallelism and the burden of re-ordering execution to the compiler. You could only execute the maximum of six instructions per cycle if there were no data dependencies. So it only really works for certain kinds of algorithmic, data-heavy execution; that's why VLIW is only found in DSP architectures today.

Good SE answer here: https://softwareengineering.stackexchange.com/questions/2793... especially the ones focusing on cache misses.


"Itanium tried to move the instruction-level parallelism and the burden of re-ordering execution to the compiler"

That sounds a lot like "parallel-by-default C++" kind of language + hardware system to exploit it"

Just swapping "compiler" for language. They didn't succeed, but they did try

Edit: helping me understand where I'm off might be more helpful than a downvote. Does swapping "compiler" for "language" not represent what Intel was trying to do?


With the web and requests the default would look to be parallel for most developers. The container runtime (Ruby, Python, Java) just abstracts away the parallelism on different cores.


Seems like an incredibly long-winded way of saying 'To go faster you either need to split up each instruction into lots of parts or increase the voltage for the transistors. We've split the instructions as much as we can, and power consumption is proportional to voltage cubed, so it's not a scalable plan.'
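
The back-of-the-envelope version of that scaling argument, ignoring leakage (the f ∝ V step is only a rough approximation over the usable operating range):

    P_dynamic ≈ C · V² · f    (switched capacitance C, supply voltage V, clock frequency f)
    f_max     ∝ V             (roughly, since higher voltage lets transistors switch faster)
    =>  P     ∝ V³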


More importantly than mere power consumption, we don't have a way to remove the waste heat generated. Dennard scaling (like Moore's law, but for power consumption per transistor) ended about 10 years ago, at exactly the same time clock speeds stopped improving. There are actually a few computers out there that run at around 10 GHz, but they all have impractical cooling systems.

If there ever were a return to exponential scaling, we would very soon run into the Landauer limit.


For those who are curious, see the Wikipedia article on Landauer's Principle [1]

I hadn't heard of this before, but it sounds like we are a long way off from reaching the limit; as the article states, modern computers use millions of times more energy than what Landauer's Principle implies is the lowest possible amount.

1: https://en.wikipedia.org/wiki/Landauer%27s_principle
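
For a sense of scale, the limit at room temperature works out to roughly

    E_min = k·T·ln 2 ≈ 1.38×10⁻²³ J/K × 300 K × 0.693 ≈ 2.9×10⁻²¹ J per erased bit

so even a chip erasing 10¹⁸ bits per second at that ideal efficiency would dissipate only about 3 mW.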


> If there ever were a return to exponential scaling, we would very soon run into the Landauer limit.

No we wouldn't. We're around ~10,000X off and would run into thermal danger zones long before.


BUT... the Landauer limit is overoptimistic, because unless your computer runs at absolute zero you need to keep redundant copies of each bit for error correction. Transistors are implicitly error correcting in the sense that each bit is represented by a current of a few thousand electrons.

Factoring in the redundancy requirement we are likely only off by somewhere between 100x and 1000x. If there were ever a return to exponential technological improvement, we would run out of road after a few years.


That's my point though, the bottleneck is not going to be Landauer's Limit.


Re: all the programming replies.

Preface, I'm not a programmer, I'm a hardware guy.

It's all well and good to make sure your programs and future programs can run in a parallel fashion, but there is a big hole in that, and it's the operating system's methods of handling cores and threads.

Let's use folding@home as an example. Very multithreaded. Now let's use, at first, the Ryzen 1800X as the hardware we'll run it on. We have 8 physical cores, arranged as two core complexes (CCXs) of four cores each, and each CCX has its own level 3 cache. As you use your system while you are also folding, even in the newest Linux kernel, data and instructions might get evicted and bounced around and take latency hits and thus performance hits. Nothing really locks the work to the cores or threads taking locality into account. You can adjust this with htop and pin each folding thread manually.

Beyond AMD, even Intel has similar issues still with the 8700k. Hell, in general just efficient multithreading seems like a tough compromise for OS development. "Users" want things to be smooth upon interaction, so you have preemption. Work wants to get done but it also wants to be a good citizen to the rest of the system.

Developers are going to have to learn about, and keep up to date with, much more than a fancy new language. You're going to have to learn each new CPU inside and out, and how each OS treats it.


How did the BeOS designers make the BeOS so good at multiprocessing? I remember how well the operating system scaled with more than one CPU.


Applications scale, not operating systems.

There was nothing magic in BeOS, simply multithreading that seemed novel at the time in consumer-level hardware.

Hate to say it, but there was no magic, just engineering that everyone now has.


Probably because it ran on PowerPC. If I remember my systems design class from 20 years ago, RISC makes implementing or scaling multiprocessing easier.


Intel processors essentially are RISC now internally. They just expose more complex instructions as a higher level API.


They are not RISC internally. In fact, I read an interview with an Intel engineer saying that an Intel CPU has >10,000 uops. That's not in any sense reduced!

And it doesn't make sense to apply the term "RISC" to internal CPU design anyway.


It ran pretty well on Intel also. And it originally ran on AT&T Hobbit, whatever the hell that was.


Very well put! Not an expert but, as far as I know, parallel OS development definitely seems to be a frequent blind spot in the systems literature...


The traditional OS is becoming increasingly irrelevant in these days of virtualisation. I predict that future high-performance systems will be built with unikernels.


It's not a blind spot; concurrency and parallelism are at the core of every university-level operating systems course. There are mountains of white papers discussing every aspect you can imagine. It's simply that all existing OSs are good enough.

Applications scale, not operating systems.

The problem is now at the application/algorithm level, not the OS level.


Notably, single-thread performance of code that is not friendly to vectorization has not stagnated despite stagnant clock frequency. Indeed, SPECint performance continues to grow exponentially, albeit more slowly since ~2004.

https://www.karlrupp.net/2018/02/42-years-of-microprocessor-...


> Indeed, SPECint performance continues to grow exponentially, albeit more slowly since ~2004

Interesting chart! However that looks like moving goalposts. If you allow functions to still be exponentials when the rate repeatedly diminishes, any monotonic function can be an exponential: f(x) = x is an "exponential" with diminishing growth rate log(x). That curve looks approximately logistic to me, as most technologies usually do.

Actually I believe speculative execution theoretically allows increasing linearly single-threaded performance for exponentially more multi-threaded performance -- so a transistor price Moore's law (if not power/density) should allow a continued linear single-threaded growth. The problem is that without power (inverse) scaling, this would also cost exponentially more power. It seems the slight increase in power visible in that chart could account for a fraction of increased single-threaded performance (other factors would be improved efficiency and architectural gains); expect those factors to also stall soon (which would fulfill the "logistic prophecy").


Depends on the code: you can write ASM that's as fast on an old P4 as on a modern i7. Just access random RAM locations and modern CPUs suck.


Increasing CPU clock speed also does not reduce DRAM latency.


We are talking about the sum of latencies. The CPU needs to do something AND you need to fetch from DRAM. Modern CPUs have increased the worst-case overhead of fetching from DRAM to decrease the average case.

That's usually a good trade-off until someone wants to make your CPU look terrible.
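
A rough illustration, assuming a typical miss-to-DRAM latency of ~100 ns (the exact figure varies by platform):

    100 ns miss at 3 GHz  ->  ~300 core cycles stalled
    100 ns miss at 6 GHz  ->  ~600 core cycles stalled (the same wall-clock time)

With a random access pattern nearly every access is a miss, so the clock rate barely matters for that code.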


My senior design course focused on asynchronous (clock-less) cryptography circuits; after learning of these, I looked into asynchronous general purpose processors, and learned that ARM actually designed an asynchronous processor back in the 2000's [0].

While they've never quite taken off (the extra gates decrease speed and are harder to manufacture), with the recent side-channel attacks on processor pipelines I've been hopeful that I would see something pop up. Imagine a world where our processors run without a clock!

[0] https://www.eetimes.com/document.asp?doc_id=1299083


Wouldn't asynchronous CPUs increase the amount of side-channels? AFAIK in an asynchronous circuit every aspect of the calculation may affect the time it takes to complete.


The attacks I'm aware of use processor timings to measure the effects their programs are having. However, because an asynchronous circuit doesn't complete tasks in a reliable cycle, you can't measure how the pipeline is being affected by your program in the same way. You could find new ways to force the processor to act in a reliable manner that you could measure, but that may change quite randomly from processor to processor, or even from moment to moment, depending on external environmental factors.


Nope. In fact EMI and differential power analysis based approaches are actually heavily mitigated in asynchronous, as is bus snooping.


One of my CS lecturers worked on this: https://www.cl.cam.ac.uk/~rja14/Papers/micromicro2003.pdf

It turns out that the advantages are not so great as might appear, and the situation gets worse as DRAM delay dominates. Also logic designers are a conservative bunch and getting everyone to replace the industry standard design toolchain is a big ask.


That was Steve Furber's AMULET (https://en.wikipedia.org/wiki/AMULET_microprocessor)

He was one of my lecturers at uni; a shame not much became of AMULET though.


Shameless plug, we at Vathys run asynchronous: https://youtu.be/4nSn0JhZX18


Very cool :)


Thanks!


I understand that companies heavily invested in silicon would like to portray it that way, but CPU frequencies haven't ceased to grow.

DARPA manufactured a THz transistor made of InP back in 2014.

Silicon isn't the only semiconductor in nature, and others are actively being researched.

Also, "when you increase the frequency you increase the power" (which is their argument) doesn't explain why they can't increase the frequencies. That was always the case, even back in the 1960s.

What they actually need to explain is why they can't make silicon more power-efficient anymore. All the toy-physics arguments mentioned there (such approximations/linearizations work only for a very limited range of frequencies, if they work at all, meaning the scaling relations aren't universal like they're trying to portray, and the coefficients they ignore aren't constant across voltage, frequency, materials, etc. either; you almost never get such simple and universal answers in condensed matter physics, even for much simpler problems) could have been made 50 years ago as well, but silicon CPU frequencies did go up.


Transistor switching frequency has almost nothing to do with processor clock speeds which are almost entirely limited by wire RC delay. You will get increased drive currents by switching to higher mobility materials but the performance improvement over strained silicon isn't that large.


That's just a plumbing problem which can be solved by lowering temperature or using a different material with lower resistivity. Yes, it'll probably cost more, but it's a problem that can readily be solved.

But if your switching frequency is slow, it doesn't matter if you use a superconductor for wires. It is the switching frequency that truly determines the limits for gate times, which in turn determines how fast your CPU is.

For the record, SiGe is also very promising in terms of switching speeds. There were experiments which showed near-THz frequencies.


>That's just a plumbing problem which can be solved by lowering temperature or using a different material with lower resistivity. Yes, it'll probably cost more, but it's a problem that can readily be solved.

No it's not. What is this magical material with ultra low resistance? And how do you plan to reduce capacitance?

Btw, manufacturing terahertz speed transistors is very difficult. There are Mott FETs which will switch at 10 terahertz, but they're incredibly hard to manufacture and very power hungry.


There's nothing magical about it. As has been pointed out, chips with superconductors have been a thing for a long time (at least in experimental physics), and resistivity is a (nonlinear) function of temperature, so lowering the temperature almost always lowers the resistance; you can achieve this to an extent even with non-"magical" materials, without going to cryogenic temperatures. You don't have to reduce the capacitance; you'll be fine as long as you can lower the resistivity.

Do you have any references for FETs that switch at 10THz? I've never heard of it and I'm interested in the physics of it.


I can’t speak on their usability in circuits. However, materials that show properties characteristic of zero resistance exist. I’ve used them at work before.

https://en.wikipedia.org/wiki/Superconductivity


And how do you propose to cool every chip to cryogenic temperatures?

Not to mention the manufacturing challenge of integrating superconductors into chips (I think InP would be the easiest candidate, and that's saying something...)


In some cases, because it's not necessary. Take the Gen 1 Google TPU. It uses a 700 MHz clock rate but processes 65,536 things at the same time, with very simple instructions.

Here is a great paper comparing it to chips with far higher clock rates.

https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

Now what will be interesting is whether this new architecture can be used for more traditional CS functions.

I love this paper from Jeff Dean on using TPUs instead of a CPU to replace a B-tree, for example.

https://research.google.com/pubs/pub46518.html

This also solves our multi-threading issue: basically it's done in a multi-threaded manner from the ground up.

We get a round peg for a round hole.


Let me take the opportunity to plug my favorite computer architecture course, from ETH, which was posted here recently.

https://www.youtube.com/playlist?list=PL5Q2soXY2Zi9OhoVQBXYF...

https://safari.ethz.ch/architecture/doku.php


I find their explanation of the pipelining issue slightly confusing, probably because they tried to simplify it to the extreme:

>One could object to this and note that due to shorter clock ticks, the small steps will be executed faster, so the average speed will be greater. However, the following diagram shows that this is not the case.

Said diagram shows that the two-clock-tick step locks the pipeline; that is, you can't execute the first clock tick of the next instruction if you're still running the second part of the previous one. When would this be the case? Isn't the entire point of pipelining to divide a function into smaller steps that can be run in parallel? If you can split "step 3" across two clock cycles, couldn't you effectively subdivide it into two steps that could run in parallel?

I suppose that eventually you run across the issue that adding additional pipeline stages increases the logic size, which in turn causes it to run slower, or something like that. I wish the document were a little more specific; after all, it doesn't hesitate to throw in the physical formulas for power dissipation in the second part, so clearly it's not afraid to dig into technical details.


> If you can split "step 3" across two clock cycles, couldn't you effectively subdivide it into two steps that could run in parallel?

There is a difference between splitting "step 3" across two clock cycles and splitting "step 3" into two separate steps. The underlying assumption here is that "step 3" is indivisible. E.g. say "step 3" was memory access and the latency for that is 500 picoseconds, it's not like you can just split it into two steps and make it load faster.
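
Putting rough numbers on that, with the 500 ps step assumed above:

    2 GHz clock (500 ps period): the step fits in one cycle of its stage
    4 GHz clock (250 ps period): the step now occupies two cycles of the same stage

Either way that stage can only accept one new operation every 500 ps, so the faster clock buys nothing on this path.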


> But remember, wrong overclocking can harm not only processor but you as well.

Don't date robots!


They fail to mention that dividing instructions into more pipeline stages means greater branch penalty which in turn prompts reckless speculative execution schemes to compensate.


The rule of thumb in chip design: your chip's clock is only as fast as your slowest logic path. Complex logic circuitry slows down the achievable clock rate significantly. You can make your logic gates switch faster, thus allowing longer signal paths, but it has a huge energy trade-off, as the article states.

Multi-core design now seems to compensate for slower clock rates, but it also has its trade-offs. It makes software more complex. In the case of the CISC architecture Intel established, it's a huge trade-off, since CISC is supposed to make its processors easier to program, as opposed to RISC. I don't think that CISC is a good choice when it comes to massive parallelism.

But, since chip design is so expensive and is considered state-of-the-art high tech, we'll need to deal with everything that chip makers throw at us. Or do we?


I don't think that CISC is a good choice when it comes to massive parallelism.

On the contrary, I think the increased code density (reduced fetch bandwidth --- very important for multiple cores) and greater semantic information of CISC instructions is crucial for parallelism. Large operations can be broken up into individually scheduled uops inside the core, and those uops can then be parallelised, without the equivalent of fetching all those uops from memory as might occur in a classic RISC.

In fact, even modern ARMs use this uop-based "instruction splitting" in their microarchitecture.


If performance gets significantly limited by instruction cache misses, then yes, code density can become important, as you point out. However, there are two things to keep in mind:

1) Most actual RISC ISAs have compact modes, typically mixing 16-bit instructions with the regular 32-bit ones. That's Thumb-2, microMIPS, the RISC-V compressed (C) extension, and others for embedded CPUs (ARC, Andes, ...). Their code density is competitive with (and sometimes better than) x86. So with practical RISC implementations, code density is not a factor in RISC vs. CISC;

2) There's a big outlier, if I remember correctly: ARM in 64-bit mode dropped Thumb-2 support. They certainly know how to do a compact mode, and they decided not to bother. So I guess the I-cache limitation is maybe not such a problem in real life? I don't have the data, but I trust ARM to take benchmarking seriously, particularly for an ISA that also targets server chips.


The RISC/CISC "tradeoff" is mostly a non-issue at the higher end of processor design: everything is now a hybrid. You have ARM64 with its SIMD and floating point extensions that hardly qualifies as "reduced" on one side, and Intel systems that have a suspiciously RISC-like internal architecture fed by decoder of the "legacy" CISC instruction set.

It still matters at the small end, which is why Cortex-M exists.

> Or do we?

A startup can design its own chips, but good luck getting anyone to use it.


It worked for P.A. Semi (eventually).


Challenge Accepted


The article doesn't answer the question at the fundamental level. The closest it gets is this: "Increased frequency depends heavily on the current level of technology and advances cannot move beyond these physical limitations."

Certainly, Moore's law is just an observation and cannot go on forever. Would it be fair to say that we've simply reached the point where we can no longer "keep up" with Moore's observation because the technology is getting harder, and not because we've actually reached any limit of physics?


You're skipping over the main point of the article, which is that it's hard to increase the clock rate because the clock is limited by the slowest step that has to complete in one tick.

And the main method of making an instruction faster is by splitting it, but all instructions have now already been split as much as is possible, while still having them operate correctly.


But this shouldn’t be true for a superpipelined processor, right?

Or to put it another way: let's say phase 3 contains several important instructions that cannot be reduced to a length of less than 1.7 clock ticks. If that pipeline stalls, you have other pipelines that won't.

Or you go crazy and put in two copies of the slow path of phase 3, with one taking the even ticks and the other the odd ones.


Indeed, that's what I looked for too. The current transistor gate pitch is approaching atomic sizes, which results in electrons leaking through by quantum tunneling. I was hoping to understand more about how speed of light and quantum behavior is preventing further progress. Article only mentions transistor switching speed.


Because the subject of the article is explicitly about clock speed and not about making transistors smaller.


Until we have another material that could replace silicon. Not sure if we will see this happen in the next ten to twenty years.


The 60 mV/dec limit applies to all materials, not just silicon. Beating it will require a fundamentally different type of operation. Agree that changes in materials will require massive investment and learning.
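
For context, that limit comes from the thermal (Boltzmann) distribution of carriers rather than from any property of silicon:

    SS_min = (kT/q) · ln 10 ≈ 25.9 mV × 2.30 ≈ 60 mV per decade of drain current, at T = 300 K

which is why beating it takes a different switching mechanism, not just a different channel material.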


There are transistor designs that can subvert the subthreshold swing limit, see NC-FETs and TFETs.


Sure, but many materials have much higher electron and hole mobility than silicon has.


He's talking about the Boltzmann limit of MOSFETs


Carbon nanotube transistors would be great if we could get them to form reliably instead of having 10% of the transistors fail to form.


Diamond transistors have been demonstrated in the lab in the 90's already.


Graphene. The material of the future.


TeraHertz CPUs would be nice.


Crank them up a little more and they will glow due to reaching visible light territory :)


Perhaps leading to little pixel-sized photovoltaic cores that absorb light on one side, do a bit of computation at 400-800 THz and emit light on the other side, merging the display with the CPU. I've heard that VR goggles with 8K resolution per eye would be approaching the limit of the human eye's resolving power, which would be about 66 million cores.


Crank them up a little more and they will go into deep ultraviolet. Problem solved.


There is no problem here, marketing folks would love it :)


So would gamers.

"Nice LEDs bro" "Nah that's my CPU"


Exponential growth can't be infinite. Who knew! Mind blown!


I've been looking at Haskell, Rust and Go to help with parallelism, but decided to go with a lesser-known language: Pony. I haven't used actors a lot so far, but it looks really promising.


Rust has an actor framework among other great projects in its ecosystem:

https://github.com/actix/actix


Particularly given you haven't used actors, what advantages does Pony give you over Haskell?


Right now I find Pony more approachable than Haskell. Maybe because I never fully grokked monads (probably my fault for not persevering enough); I always had problems composing monads.

Also, Pony's promise of garbage collection that runs concurrently with program execution is very appealing, since I want to write low-latency server code.


Funny to read the comment section of the Russian source article. People argue about the need for multicore processors in a desktop PC. Especially funny when I'm reading those comments from a 16-core machine.


>temperature

Using a proper thermal interface material in their CPUs would be a start...

When you can decrease the temps of Intel CPUs by 20°C with delidding, the heat argument seems rather contrived.


"Only you can prevent overclocking fires!"

Nicely written. Seems like the intended headline was "why it's bad to overclock", though!


It's a matter of cost and cooling, really: the IBM z13 runs at 5 GHz, the z14 at 5.2 GHz.


No, it's a matter of economics. Those zXX chips are not cheap, nor is what they interface to.


There is a new relevant episode of changelog's podcast that talks about CPU advancements.

https://changelog.com/podcast/284


Cost/benefit, AFAIK -- higher clock speeds require advanced cooling, use more power, etc., and it's been possible to get more speed at lower cost by increasing the transistor count instead.


Are there any CPUs out there with FPGAs tacked on that are available to the hobbyists/gamer/build your own PC crowd?


There are some hobbyists using the 28nm Xilinx Zynq, a hardened circa-2009-cell-phone dual-core ARM with on-die FPGA. One popular board is the https://www.crowdsupply.com/krtkl/snickerdoodle


Not really; the market for combined processor/FPGAs is basically the opposite: FPGAs running a soft-core processor. Same effect, though.


Asynchronous CPUs or bust.

Or, better yet, wave pipelining...


It's there if you want it, but liquid helium doesn't come cheap.


"CPU manufactures will not allow a meltdown to happen."

No one in the office understands why i'm laughing....


The spectre of meltdown is haunting CPU manufactures.


Great, now I have to clean all that tea off of my display. ;-P


And I thought that humor was verboten around here... Does HN have an exception in the book that applies to Intel?


I don't think humour as such is frowned upon on HN - if I was to attempt to write down the unwritten, I would say that posts that are just jokes tend to go down badly, but jokes that make a point or serious posts that are written with some wit are generally accepted.


The guidelines do not mention humor as such. It is often frowned upon, but I think the delicious irony in this case warrants an exception. ;-)


> But there are also strong concerns that the increased frequency will raise the CPU temperature so much that it will cause an actual physical melt down. Note that many CPU manufactures will not allow a meltdown to happen


At least they have a sense of humor.

Edit: Oh, 2014. The sweet irony.


Many CPU manufacturers will not, but Intel will.


(2014)


Thanks! Updated.


Guys, I'll be honest: I found it really odd that the article didn't talk about the speed of light and die size constraints. (c / 4 GHz = 7.49 cm only; if you double that frequency, you have half that distance in which to put components between any two clock ticks.)

But there are limits to my hubris - this is on intel.com, so I'm going to go with "I'm the one missing something". Are the speed of light and the number of transistors you can put in that path (due to die size) just not practical constraints? Neither is mentioned.


The key factors in integrated circuit delay are more to do with capacitance; in order to change a gate's transistor from off to on, the driving gate has to charge the capacitance of the driving wire and the driven gate. Making features closer together increases their mutual capacitance.

(source: worked on this for a chip design software company. The delay approximation was based entirely around R/L/C modelling and had no terms for the speed of light per se. If I remember rightly it was calculated in integer pico-meters; I definitely remember it emitting an error message if you had more than 2cm of wire in any one net!)
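
For reference, the usual first-order delay model behind that kind of R/C approach (these are the standard step-response constants, not anything specific to that tool):

    t_50% ≈ 0.69 · R_drv · C_load     (lumped RC: driver resistance charging the total load capacitance)
    t_50% ≈ 0.38 · R_wire · C_wire    (distributed RC wire)

Neither term involves the speed of light; for on-chip nets the RC time constant dominates the electromagnetic time of flight.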


I.e. the problem is the state of the processor, which needs to be erased. So we could make a stack of stateless processors which readily accept fresh data, because they would only need to charge capacitors, not discharge them, and would then be discharged after use. Kind of a multicore design, but with each core used for only 1/n of the time at, e.g., 1 THz. Unlike in a parallel system, sequential calculation would work faster in such a setup.


.. what? No, the capacitance issue is inherent in any kind of electrical signal sent over wires.


discharging and charging capacitance is generally pretty symmetric. I don't think what you're suggesting would provide much benefit.


But heating and cooling are not symmetrical. We can heat a processor much faster than we can cool it. So, if we need to cool a processor 10x faster than we can, just use 10x more processors and switch between them in order, to allow each to cool after being used in overclocked mode. Using this simple technique, the frequency could be raised by a few GHz, which is important for serial computations.


To amplify what pcj50 said: The speed of light is a very real constraint. It's just not the constraint that people are hitting, because there are other constraints you hit first (capacitance, heating, etc).

I recall reading clear back in the 1970s that IBM mainframes were trying to do a dual processor setup. This wasn't multiple cores on one die, this was separate physical boxes. And they were having trouble because they wanted them to operate in sync (in the sense of presenting one image to the OS and applications), but they were more than a foot apart, and they were operating on sub-nanosecond timescales. For them, the speed of light was definitely a constraint. Even if they got around all the electrical stuff, the speed of light still put a limit on how "in sync" those two CPUs could be.



