Why has CPU frequency ceased to grow? (2014) (intel.com)
235 points by Osiris on Feb 21, 2018 | 291 comments


That was a bit misleading in some ways. First, in pipelining you'll typically measure how long a pipeline stage is in FO4s, which is to say the delay required for one transistor to drive 4 other transistors of the same width. Intel will typically design its pipeline stages to have 16 FO4s of delay. IBM is more aggressive and will try to work it down to 10. But of those 10, 2 are there for the latches you added to create the stage and 2 are there to account for the fact that a clock edge doesn't arrive everywhere at exactly the same time. So if you take one of those 16 FO4 Intel stages and cut it in half, you won't have two 8 FO4 stages but two 10 FO4 stages. And since those latch transistors take up space and energy, you've got some severe diminishing returns problems.
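
To spell out the arithmetic (a back-of-the-envelope sketch using only the numbers above, nothing more):

    package main

    import "fmt"

    func main() {
        // Rough numbers from the text above: a 16 FO4 stage where 4 FO4
        // are fixed overhead (2 for the latches, 2 for clock uncertainty).
        const overhead = 4.0
        const stage = 16.0
        useful := stage - overhead // 12 FO4 of actual logic

        // Split the logic in half; each new stage pays the full overhead again.
        split := useful/2 + overhead // 6 + 4 = 10 FO4, not 8

        fmt.Printf("old stage: %.0f FO4, new stages: %.0f FO4 each\n", stage, split)
        fmt.Printf("frequency gain: %.2fx instead of 2x\n", stage/split)
    }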

One thing that's changed as transistors have gotten smaller is that leakage has become more of a problem. You used to just worry about active switching power, but now you have to balance the higher voltages and lower thresholds that switch your transistors quickly against the leakage power they generate.

And finally, velocity saturation is more of a problem on shorter channels, making current rise more linearly with gate voltage rather than quadratically.


Good points. One thing I would like to emphasize is the issue with clocks not arriving everywhere at the same time. Balancing the clock tree over a chip gets harder and harder.

But the clock setup and hold times also get shorter and shorter as the clock frequency goes up. The clock signal will have jitter. The end result is that less and less of the clock period is usable to sample the signal into the register.

And this in turn puts a strain on how well balanced the logic between the registers is, so that all signals can traverse the logic paths through the gates and stabilize in time to be sampled.

To add to the complexity, as we move down the geometries, the difference in performance between individual transistors becomes relatively larger. One reason for this is that oxide layers consist of (on average) fewer and fewer molecules. When the layer was made up of 100 molecules, 101 or 102 didn't really make much of a difference. But when the average is 4 molecules, one more or less has a huge impact on performance.

So controlling variance (clock tree balance, jitter in clock generation, imbalances between paths and variance in chip production) becomes ever more problematic and important.
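
To make that concrete with a toy timing budget (every number below is a made-up but plausible illustration, not from any real process):

    package main

    import "fmt"

    func main() {
        // Illustrative assumptions only, all values in picoseconds.
        const (
            period = 250.0 // a 4 GHz clock
            clkToQ = 25.0  // register clock-to-output delay
            setup  = 20.0  // register setup time
            skew   = 15.0  // clock doesn't arrive everywhere at once
            jitter = 10.0  // cycle-to-cycle variation in the clock edge
        )
        logicBudget := period - clkToQ - setup - skew - jitter
        fmt.Printf("of a %.0f ps cycle, only %.0f ps (%.0f%%) is left for logic\n",
            period, logicBudget, 100*logicBudget/period)
    }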


How do you know all this stuff?


Got my master's in it, before ending up in sensors then robotics instead. And a continued interest, I guess.

Here are some free relevant courses. You might have to go back and take the pre-reqs.

https://ocw.mit.edu/courses/electrical-engineering-and-compu...

https://ocw.mit.edu/courses/electrical-engineering-and-compu...


How does one get into robotics? I have not looked much but none of my local schools seem to have "robotics".

I tinker with electronics and make some remote controlled robots for fun (internet controlled, live video with multi-user input, sort of crowd controlled). I am now trying to teach myself about Kalman filters and control theory, and I want to build more autonomous robots.

But any info on getting into robotics for a day job would be nice.


Well, my path was finding the motion control work on giant dish radars really satisfying, then doing well in an interview because I could speak fluently about Kalman filters. But really you should be able to be useful on a robotics team if you have good programming, electronics, or mechanical engineering skills and then learn more on the job. Learn one of those deeply and ideally a few things about the other two as well.


There are other ways to get into any field besides studying it in school, but if you are going to go to school anyway and want to study something directly relevant to robotics, how about a mechatronics or a controls engineering program?


Thank you for this! I have been looking for IC design MOOCs for a while.


In case you're wondering, FO = Fan Out.


I don't think anyone uses U/LVT transistors in low geometries; the leakage would be a nightmare.


I know a lot of people using LVT transistors in 28 and 16/14nm processes, including relatively low power (mobile and embedded) designs. I personally have used LVT variant SRAM blocks for both our 28nm and 16nm designs, and ULVT cells manually placed for critical path for Neo's FPU for our 28nm chip.


I should've rephrased: I don't know of anyone* who uses ULVTs exclusively, to answer the parent's point about using ULVTs to increase speed.

* Okay, I know of some people, but their design is different.


I'm not surprised, though I don't have a good sense of what the exact numbers are.


Leakage diff @ 125 degrees celsius is about one order of magnitude.


Anywhere where I can read more on this?


I took a class that went over this in depth like 3-4 years ago. Basically the message was that serial performance is saturating, and the only way to get speed improvements in the future is going to be by exploiting parallelism. However, most programmers, and programming languages, remain stuck in a serial-by-default paradigm. I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward. I find the apparent stagnation extremely depressing.


While Rust doesn't promote a specific model for parallelism, its stronger compiler helps a lot with that.

https://doc.rust-lang.org/beta/nomicon/concurrency.html

https://doc.rust-lang.org/book/second-edition/ch16-01-thread...

I don't think fixing C(++) can give what Rust can give, because Rust has a clean start with these strong guarantees built-in while for C(++) it would always be an addon. Defaults are powerful.


A really powerful outworking of it is seen in the Rayon crate, where you can change a sequential iterator into a parallel iterator just by adding the crate, importing the trait and changing `iter` to `par_iter`. If it’s not thread-safe to do, then it won’t compile. (That’s the big difference from C++.) If it is, it will, and it’ll be smart about how it runs, spreading the load across all available cores pretty much optimally, or not bothering with multiple threads if it’s not going to be worth it (e.g. single-threaded, or only one item in the iterator). And all that with close enough to no overhead.

Basically, Rayon makes data parallelism really easy in a way that few if any other languages do. I’d love to have an equivalent in Python or Node, but it’s just not possible to achieve such a thing in most languages—even if you ignore the thread safety aspect.

Parallelism hasn’t seen a great deal of use until it’s urgently needed, because it’s hard to get right in most environments, and you normally need to substantially refactor code to make it happen. My hope is that with the likes of Rayon, parallelism can be a much more natural thing that people that care even a little about performance will just do, because it’s so easy to do.

https://crates.io/crates/rayon


> Parallelism hasn’t seen a great deal of use until it’s urgently needed

This is the main reason why, initially, Windows Store APIs were all async.

Microsoft learned that when developers can choose between both models, by default most chose synch models.


>Basically, Rayon makes data parallelism really easy in a way that few if any other languages do.

Syntax-wise, there's OpenMP which can turn a for-loop into a parallelized for-loop (independently scheduled iterations) with just some syntactic sugar on top of the loop.

OpenMP has support for at least C++ and Fortran, and is not hard to use.

I wonder how Rayon compares to OpenMP.


FWIW, you can also achieve similar behavior in Clojure via pmap, Reducers, etc.



To me, the "If it’s not thread-safe to do, then it won’t compile" part sounds rather more notable.


That's true of Haskell's parallelism facilities. It's also true for STM. You can still get stuck or have race conditions with the other concurrency abstractions, but it's less common than I encountered elsewhere. I didn't have problems resulting from mutable shared state that was aliased across threads and wasn't supposed to be; it's always explicit.


Another good crate to look at for this is Actix, a Rust actor framework:

https://github.com/actix/actix


C++ has parallel algorithms built in (http://en.cppreference.com/w/cpp/algorithm).

Parallelism is complicated though, and easy parallelism pretty much requires a functional style. Things like Haskell's Accelerate library (https://www.stackage.org/package/accelerate) seem the ideal way forward to me.


Should be noted that so far only Visual Studio (partially) implements C++17 parallel algorithms, although there are several third-party implementations.[1]

Apples to oranges comparison of course, since Rayon isn't [planned to become?] part of the Rust standard library.

[1] http://www.bfilipek.com/2017/08/cpp17-details-parallel.html#...


If you're including third-party libraries, you have things like ArrayFire and Intel TBB.


Doesn't Go also make it relatively easy to write parallel code?


Not especially. It has goroutines, C# has Tasks, C++ has green thread libraries. Added onto this are channels, which are basically thread-safe queues; other languages have those too. (I am aware that there are differences between what features goroutines and Tasks provide.)

Go's implementation of these things is nice, neat, and all included out of the box though which is nice.

In general you can make parallelism easy, or efficient, rarely both, at least not in a way that can solve problems generally.

edit: I should add Go does also come with a data-race detection tool which can be very useful. Not sure any other language includes that out of the box!
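
For example, `go run -race` on a deliberately racy toy program like this one will report the conflicting accesses and their stack traces (the same flag works with `go test -race`):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        counter := 0 // shared and deliberately unsynchronized
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    counter++ // data race: two goroutines write without a lock
                }
            }()
        }
        wg.Wait()
        fmt.Println(counter) // result varies; the race detector flags the writes
    }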


I think Go does make it relatively easy to write parallel code in comparison to most other frequently used languages.

> Not sure any other language includes that out of the box!

Rust's compiler does it by default, at compile time. ;-) Go's race detector is never wrong, but it may miss things. On the other hand, Rust's compiler is also neither wrong nor does it miss things, except under one circumstance: someone, somewhere, wrote `unsafe` code and committed a bug inside that block.

ThreadSanitizer is also a thing: https://clang.llvm.org/docs/ThreadSanitizer.html


Except in other languages it's not part of the core, it's an external library. So in the end it's easy to write "parallel" code.


In C#, Task is part of the language, thread safe queues are part of the core libs.

Still not quite as tightly integrated as Go though.


I believe you're thinking of concurrency, which go handles pretty well with goroutines.


Go can use multiple cores. Goroutines will run in parallel (assuming GOMAXPROCS > 1.)


[flagged]


As someone who frequently posts about my personally-excellent experiences with Rust, what makes you suspect it's astroturf rather than just turf? You think Mozilla is paying people to post about Rust using puppet accounts? C'mon.


Think about it. Mozilla Rust -> Godzilla (C)Rust -> "Godlike" giant lizard crust -> God is radiant -> light -> illumination -> Illuminati lizard people from beneath the Earth's flat crust are funding paid protestors to shill Rust, it's really the only explanation that makes sense.

(The person you're replying to is using the transparent internet argument tactic of trying to cast doubt upon genuine enthusiasm by implying some vague sinister motive, conveniently without bothering to articulate what that motive might even possibly be. Let's all recognize bad-faith arguments, downvote, and move along.)


It does seem a little odd that every comment about Rust gushes about its progress and stability without discussing any downsides. One would be similarly suspicious if, e.g., Perl6 were discussed this way.


Try it out and write the critique, then. The posts are positive because Rust is genuinely achieving rapid progress and stability... but certainly not without flaws.

I've experienced plenty of frustration, albeit outweighed by the massive benefits for my use case. And there are whole problem domains to which it's just not suited. I see these mentioned pretty frequently.


For one, there is no consensus on how to do parallelism and concurrency in Rust, since none of the libraries are mature.


> none of the libraries are mature.

Rayon, the Rust library getting most of the discussion in this thread, is now 1.0: https://github.com/rayon-rs/rayon/blob/master/RELEASES.md .


Don't most people opt for something called Tokyo? I don't do Rust much, but even I have heard of it by now...


Tokio for IO concurrency, Rayon for data parallelism.


Exactly! Rust doesn't solve any parallelism problems. It's no better than established languages like C# or F#, and probably worse.


I'd consider handling data races gracefully a pretty large step forward in doing parallel data computation.


>It does seem a little odd that every comment about Rust gushes about its progress and stability without discussing any downsides.

Doesn't seem odd at all.

First, a lot of people commenting about Rust are enthusiastic recent adopters that haven't seen much of the language, including any real ugly sides yet.

Second, it's not entirely true, almost all Rust threads mention the steep learning curve, the slow compiler, and other issues such as the variadic generics (or lack thereof to be precise).

Third, we have seen the same "early adopter enthusiasts seeing it all rosy" cycle for RoR, Go, Node, and Mongo; nothing out of the ordinary here.


That's what happens when you have a new hip language/framework that has a lot of hype.

You see the same with Rust, Elixir, VSCode, Elm, Purescript, ReasonML, etc.

It's not astroturfing, it's just a lot of beginners that are overly excited to proselytize their new discovery.


I don't think it's fair to claim these people are all beginners. More likely it's a phenomenon where mature languages have been around long enough that you already know their strong and weak points, but when you look at a language in development you only see its potential.


I mean, it's not just true for new languages. For Haskell, you see beginners claim that "it's a perfect language and can do no wrong", while experts in the language will acknowledge its faults, including: long compile times, the downsides of laziness, etc.


I'd say exploiting parallelism is not the only way at all. Parallelism is only one way to compute differently. Specialization of hardware to specific workloads will explode in the coming years as we can't rely on Moore's law anymore. This will happen on RISC-V, IMHO.

We already have these:

* Rendering, medium precision mathematics: GPU

* Low precision mathematics: TPU

* Software Defined Networking: Microsoft is deploying FPGAs, AWS has its own hardware

We could have:

* Databases: Projections, hashing, sorting in hardware.

* Dynamic runtimes: Hardware implemented memory models, HW assisted GC, code caches and user-level interrupts for the JITs. Here is the J extension RISC-V working group: [1]

etc.

Also, why not have the usual hot paths in Node.js|Spring Framework|Django directly etched into hardware? HW http header parsing surely could bring benefit to them all.

----

Of course language and programmers will have to adapt, but in a lot of cases the runtimes will take care of it automatically.

[1] https://groups.google.com/a/groups.riscv.org/forum/#!msg/hw-...


Currently working on software for RISC-V and I can definitely see some opportunities here. One thing that's becoming apparent though, is that some level of abstraction needs to be available in order for many of these things to become useful. RISC-V has a very exciting Vector processing extension, which is intended to replace packed SIMD in most cases, but some software systems assume packed SIMD is the only way to get more FP performance.

For example, WebAssembly specifically exposes SIMD primitives, which means that it may be necessary to work backwards from those SIMD primitives to make use of a true vector machine.

I think many people simply underestimate the cost of adopting a new programming model.

> Also, why not have the usual hot paths in Node.js|Spring Framework|Django directly etched into hardware? HW http header parsing surely could bring benefit to them all.

Well, in all the listed cases here, the CPU is not the bottleneck on throughput. As far as I can tell, the problem with HTTP is not that headers take too long to parse, it's that memory is still too slow, and context switches cost us precious time. The problem with Node is not that the hardware doesn't adequately model the semantics, it's that dynamic, weak typing makes it hard for any system (software or hardware) to understand what type things are.

Update: The J extension seems interesting, and I've read some research (not thoroughly) recently showing considerable power and time savings from hardware GC primitives. I'm excited to see what goes on in that committee.


> I'm excited to see what goes on in that committee.

I am too. So far, this post on the general RISC-V mailing list and a few videos online talk about it. I'd love to have some other sources of information on their progress.

Also, I've heard that the RISC-V foundation is actively seeking collaboration for Java. There is some work on having RISC-V backends in HotSpot and JikesRVM, but so far it is limited to interpretation IIRC. The fact that Oracle is not jumping at it and pouring hundreds of millions into it is beyond me.


> There is some work on having RISC-V backends in HotSpot and JikesRVM, but so far it is limited to interpretation IIRC.

Well, JikesRVM is a proper JIT. Palmer Dabbelt from SiFive has worked on a HotSpot port before (for a different platform). I'm currently working on a V8 port. The availability of platform software and language environments is obviously of paramount importance, since it'll shape the remaining first impressions of the architecture.

> The fact that Oracle is not jumping at it and pouring hundred of millions into it is beyond me.

Well, it's a lot of work, and they have their SPARC investment to continue.


Still, if you have only two floats to add and make a decision based on the result, a GPU or a TPU will not help you.

Hashing, maybe sorting, blitting and some math could be performed at the on-module DRAM controller level even, without data crossing over the slow DDR4 bus or mangling the CPU caches.

I'd love, in fact, to explore such an architecture in a simulator. What would happen to CPU performance if, say, hashes could be computed without reading the data, memory could be cleared without zeroes hitting the bus, or some SIMD operations could be conducted in the memory itself?

Edit: clarify the processing could be done on the module side of the memory bus.


Are you aware of previous work on processor-in-memory architecture research? They considered things like this, but I think it stalled out due to semiconductor process and economic practicality.

This article mostly mentions a UC Berkeley effort: https://en.wikipedia.org/wiki/Computational_RAM

This page mentions others: http://www.ai.mit.edu/projects/aries/course/notes/pim.html


> memory be cleared without zeroes

In some specific cases, such as when the Linux kernel maps memory into your process, this is exactly what happens. When you write the page, it faults and clears it on demand; but I don't think there would be a considerable benefit to doing this at a finer granularity.


Ideally, you would prefer to postpone the actual writing to memory until the moment the cached version is evicted from the on-chip caches. When you write that data to RAM, it'll take a lot of CPU cycles during which even external memory fetches will end up being delayed. And if you have to write all zeros for that page when all you are using are the first 10 bytes, it's a lot of cycles going to waste. Being sure the memory was actually zeroed out when you commit those bytes to DDR could save the memory buses a relatively large amount of time.


There is a great paper from Jeff Dean on using the TPUs for more traditional CS functions.

https://research.google.com/pubs/pub46518.html The Case for Learned Index Structures - Research at Google


GPUs and Google's TPUs are only capable of certain, albeit very important, aspects of numerical mathematics. There are other areas of mathematics that have numerical aspects, and thus precision concerns, but aren't linear algebra.


> Basically the message was that serial performance is saturating, and the only way to get speed improvements in the future is going to be by exploiting parallelism.

People have been warning us about this for about 10 years now, but I still don't see those 64-core CPUs I was promised anywhere.

If we had the amount of parallelism we were told we were going to get, we could give every app its own core. OSes could even consider disabling context switches altogether for the majority of apps. Instead, we're left complaining about Electron apps like it matters.

That said, I'm not sure what stagnation you refer to. There's a reason languages like Rust, Elixir/Erlang and Go are getting popular. My PHP app could handle hundreds of concurrent connections on a single machine, my Elixir app handles hundreds of thousands. Yet, processors didn't get 1000x faster (and Elixir isn't even a particularly fast language). This is the opposite of stagnation, it's progress.


They have been here for a while, just for a very specific market.

Threadripper and EPYC exist now though. With 32 and 64 logical cores respectively.


Sure, there's been high end niche products for anything forever. You could buy a 64 core computer in 1990 (they'd call it a supercomputer, but same thing).

The people telling us we had to hurry up and change our code to use parallel processing predicted a significantly faster increase in the number of cores on commodity hardware. Instead, CPUs stopped getting faster and hardly gained more parallelism.


You could buy a 64 processor machine in 1990, but it wouldn’t have been a 64 core machine in the sense we’re talking about — a single socket system. This isn’t some trivial distinction either, as the whole memory architecture is very different indeed for the two scenarios.


Even in single-socket configurations, Threadripper and EPYC are still both NUMA architectures - the cores are split across two or four dies, each of which has its own memory controller and memory attached to it, with requests from a core to memory on another die going via an interconnect.


> those 64-core CPUs I was promised

The people promising that were crackpots and no one really called them out on it, so that meme got repeated everywhere despite being wrong. Processor vendors can't release a new processor that runs existing apps slower because no one would buy it (not counting monopolistic tactics). And since many existing apps are single-threaded, that means new processors have to at least maintain the same single-threaded performance which means keeping brainiac cores which means you can only afford 6-8 of them. (And arguably there are 64 weak cores in your CPU; they're just in the IGP and you have to program them with OpenCL.)

Go/Rust/Elixir are not so much progress IMO as undoing the negative progress of writing large-scale software in scripting languages.


There are plenty of 64 cores servers. Yet, multi-core architectures seem to have hit a wall caused by slow memory access.

I'm still waiting for the massively parallel NUMA machine in a chip, but there are many manufacturing problems keeping those away.


And hit a wall for power/density.

https://en.wikipedia.org/wiki/Dark_silicon


If you can find an Nvidia GTX 1080, that has 2560 cores.


They're playing games with the term 'core'.

In CPU terms, a 1080 is 40 cores, each with a 64 way vector unit.


I'm surprised they don't sell it as a 40 core, 64-dimensional processor.


Without branching (or with limited branching, last I checked it masked and re-ran instead). A core implies both instruction and data operations, rather than SIMD behavior.


Where "core" ~= vector lane


https://golang.org/

In case you don't know, Golang goroutines are a marvel of parallelism. They are coroutines which are dispatched onto a few OS threads. So you can use 100% of a multi-core CPU and yet spawn, say, 10K of those light threads without worrying about context switches PLUS have them all run concurrently. I've found that Golang is one of those rare languages, like Lisp, that actually change the way you think about programming. Makes you feel really more powerful.

If you don't know the language, I suggest running the following and watching your CPU activity and memory (or any metric):

  package main

  import "time"

  func main() {
    for i := 0; i < 10000; i++ {
      go func() {
        for {
          time.Sleep(time.Second)
        }
      }()
    }
    select {} // block main so the goroutines keep running
  }


In my experience the facilities to throw tasks into a scheduler that will run them in parallel was never the hard thing to accomplish (regardless of the language: some may have built-in capabilities, others may have syntactic sugar provided by a lib, but at the end of the day, most systems provide some kind of runTask(f) method).

What's really hard is to break down a problem into parallelizable chunks, figure out as much independent work as possible to reduce the touchpoints, and coordinate all those tasks such that they keep the CPU as busy as possible and as a whole finish as early as possible.

Beyond this "parallel breakdown design", it's the little touchpoints with shared data structures and synchronization that create the difficulty of implementation, and I haven't seen any language or system that does magic there.


> throw tasks into a scheduler that will run them in parallel was never the hard thing to accomplish

Common mistake number 1. Goroutines are running in parallel AND concurrently - they are coroutines (common mistake number 2 is to think they are only coroutines). I suggest not underestimating that, it's the big deal. Additionally, it's important to note that goroutines yield on sleeps (any kind of sleeps / waits, like disk reads, network requests, channel writes/reads, etc), and while it may sound like a detail, it's a wonder of CPU control. Due to that, there is also a rare elegance to the way Go solves sharing data via channels.

> What's really hard is to break down a problem into parallelizable chunks, figure out as much independent work as possible to reduce the touchpoints, and coordinate all those tasks such that they keep the CPU as busy as possible and as a whole finish as early as possible.

That's exactly what Go is a wonder for due to the combination of parallelism, concurrency, yield-on-sleep and channels.
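
A minimal sketch of what I mean (a made-up example, not from any real codebase): hand the work to goroutines over channels instead of having them all mutate shared state. Channel sends and receives are yield points, so blocked goroutines cost next to nothing.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        jobs := make(chan int)
        results := make(chan int)
        var wg sync.WaitGroup

        // A few worker goroutines; waiting on `jobs` is a yield point,
        // so idle workers don't occupy a hardware thread.
        for w := 0; w < 4; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for n := range jobs {
                    results <- n * n
                }
            }()
        }

        // Feed the workers, then shut everything down in order.
        go func() {
            for i := 1; i <= 10; i++ {
                jobs <- i
            }
            close(jobs)
            wg.Wait()
            close(results)
        }()

        sum := 0
        for r := range results {
            sum += r
        }
        fmt.Println(sum) // sum of squares 1..10 = 385
    }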


And yet, Go's standard map doesn't allow concurrent access. They recently added a concurrent map feature but it's probably easier to add a lock to your code than refactoring it with the new map type. I would've preferred if they had introduced a map type that can be used exactly as the standard one (i.e. without function calls). Calling functions via go func() is really easy but handling data between goroutines can still cause headaches and is something that could be improved.
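
The lock-based version is at least short; a sketch of what I mean (hypothetical type and names):

    package main

    import (
        "fmt"
        "sync"
    )

    // counts wraps a plain map with a mutex; every access has to go through
    // these methods, which is exactly the boilerplate I'd like the language to hide.
    type counts struct {
        mu sync.Mutex
        m  map[string]int
    }

    func (c *counts) inc(key string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.m[key]++
    }

    func (c *counts) get(key string) int {
        c.mu.Lock()
        defer c.mu.Unlock()
        return c.m[key]
    }

    func main() {
        c := &counts{m: make(map[string]int)}
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                c.inc("hits")
            }()
        }
        wg.Wait()
        fmt.Println(c.get("hits")) // 100
    }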


If you squint a little, you'll notice that go is pretty big on not hiding complexity. There are obvious exceptions to this such as the garbage collector, and heap/stack control.

I'm unsure if the intent of not including a map function was due to this, however with a for loop such as

    for x := range ch {
       slice = append(slice, x)
    }

it is immediately obvious there are allocations happening in the background.

The fact that map access, and slice access is not thread safe means there is no trickery going on in the background. The fact that it is not threadsafe means I don't need to worry about a lock if I only write to a map when it's created. Sure the compiler could take care of this - they have the race detector after all, but the compile speed is one of the design goals of go. I really like being able to compile in less than 1 second.

`sync.Map` however calls a function which implies there is more going on in the background.

If you follow the Go mantra of "don't communicate by sharing memory; share memory by communicating", handling data becomes a whole lot easier. It does still allow you to share memory in case you do need the extra speed.

Like most things, all designs are a matter of trade offs. Sacrifice one thing for another. There are other languages that provide the functionality you desire - but I understand the frustration when one thing is 90% what you want.

Go obviously has room for improvement, and perhaps a native threadsafe map is one of those areas.


Well, Go isn't really a high-level language of the kind that maps can be seen as a core feature of. The idea behind Go, I believe, is to focus on providing innovative system-level features.


Go is not a language for "system-level features" in the first place. Its garbage collector largely precludes it from that in any modern context.

That it can't do something effectively does not mean that it shouldn't or didn't mean to. (And users can't really fix it, because hey, no generics. Sigh.)


> Go is not a language for "system-level features" in the first place. Its garbage collector largely precludes it from that in any modern context.

I'm pretty sure I disagree with that largely because Go is low-level enough so that the GC is easily handled.

> And users can't really fix it, because hey, no generics. Sigh.

Users can fix it by writing languages on top of Go (particularly dynamic ones).


> golang is one of those rare languages

Not sure how rare, .NET does that as well. Below is your sample translated to C#. It requires C# 7.1 because of async Main, but the rest of the stuff has been available for many years, since 2010.

    static async Task routine()
    {
        await Task.Delay(TimeSpan.FromSeconds(1));
    }

    static Task Main()
    {
        return Task.WhenAll(Enumerable.Range(0, 1000).Select(i => routine()));
    }


It's unclear to me if these parallel tasks are also coroutines (there seems to be a call to yield but it's unclear what it really does), and if they yield on any sleeps, which is a key feature of goroutines.


> if these parallel tasks are also coroutines

I don’t have hands-on experience with golang but based on what I know about it yes, they are.

> there seems to be a call to yield but its unclear what it really does

You mean “await”?

It waits for the completion of whatever is on the right side of “await”. If the result is already available, it just continues. If the result is not yet available (e.g. Task.Delay creates a task that will complete in some moment in the future), the control goes away from the async method to the scheduler. The scheduler can then run some other task on the same OS thread. When the result of that operation becomes available (in the sample code, when 1 second delay passes), the scheduler resumes execution of that async method, on the statement after the await.

> if they yield on any sleeps

No, not any sleep. You can call Thread.Sleep() which will put the whole OS thread to sleep. It’s up to programmers to avoid calling blocking APIs from their async methods, i.e. use await Task.Delay() instead of Thread.Sleep(), await stream.ReadAsync() instead of stream.Read(), and so on.


>marvel of parallelism

If Go's m:n is blowing your mind, you should check out Erlang. Spawning a million "processes" is not a big deal.


Cool, thanks for the suggestion.


How is that an example of an improvement over C++? C++ has had `#pragma omp parallel for` for two decades.


I'm not familiar with OMP, so please forgive me if I'm incorrect - but OMP from my quick search appears to be thread based, while goroutines are much lighter weight than threads. They have their own scheduler - which like most things is both good and bad depending on what you want them for. If you use them as intended, the internal scheduler is a great design decision. This means I can happily spawn 10,000 without concern.

Additionally, combining them with the power of channels makes quick work of many tasks. Channels of course can be implemented in C++ too, but having the compiler take care of it for you with additional tools such as the race detector is very handy.

For a large set of problems, they are very nice to work with.


You can do the same in C++ on Windows with PPL, UNIX/Windows with Intel TBB, or any of the fiber/co-routine libraries.

Then there is the ongoing work to add async/await patterns into C++20.


As a matter of fact, you can do the same in assembly if you want to.

Notice how you can always have that answer when it comes to programming languages: "you can do the same in X". The point is that the way it's done in Go is awfully handy.


I hardly see a difference from

    go func() { }()
and

    Task.Run(() => { })
With the benefit that in the latter example, the runtime allows me to customise how scheduling is done.


Are those Tasks coroutines? If yes, do they yield on any sleeps?

These aspects are key to goroutines.


Yes they do, they are the building blocks for async/await, get a thread allocated from a thread pool when running, and you can control how the scheduling takes place, by providing your own scheduler implementation.


That is pretty cool, C++ has come a long way since I used it last about 15 years ago.

It seems like both languages are equally capable here, with C++ having more power and foot guns when required as usual.


The code snippet example I wrote was actually .NET with TPL.

C++ with PPL on Windows, would be

    task<T> handle = create_task([] { /* ... */ });
And with standard C++

    future<T> handle = std::async(std::launch::async, [] { /* ... */ });
In both cases, with C++20 it will be possible to co_await handle, which you can already play with on clang and VC++.


Every OpenMP implementation I know of uses a thread pool, and dispatches parallel work to it.

This is probably exactly the same behavior as goroutines.


OMP is only good for CPU bound code. You try to do IO inside OMP parallel section, and you’ll put the whole OS thread to sleep. The OS kernel will likely reschedule some other thread on that hardware thread, but that rescheduling is an expensive process.

Goroutines and .net tasks allow a nice mix of CPU bound and IO bound code. While a goroutine/task is waiting for IO or something else to complete (timer in this example), the runtime will immediately use the hardware thread for some other task, without OS involved.


Your example will not run in parallel. The go runtime will schedule your goroutines concurrently, but they will be run by a single OS thread, and consequently on a single CPU core.

Once you execute truly on multiple CPU cores (by increasing GOMAXPROCS), you'll be having the same kind of race conditions in Go as in any other imperative language (inb4 Rust Evangelism Strike Force saying "except Rust").


> Once you execute truly on multiple CPU cores (by increasing GOMAXPROCS), you'll be having the same kind of race conditions in Go as in any other imperative language (inb4 Rust Evangelism Strike Force saying "except Rust").

Wrong. GOMAXPROCS defaults to the number of logical CPUs, IIRC since version 1.5. For example, I have four cores with 2 threads each, so goroutines will be executed on up to 8 threads unless I set GOMAXPROCS to something else or the application explicitly changes it using the runtime package.

And sure, you'll really have the same problems, but IMO channels and goroutines minimize the friction of implementing thread safe programs using CSP. GP seems a bit optimistic, agreed, but I think that there is at least some substance to the idea that go makes it easier to correctly utilize multiple cores.
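
You can check what your own runtime is actually using; `runtime.GOMAXPROCS(0)` reports the current value without changing it:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        fmt.Println("logical CPUs:", runtime.NumCPU())
        fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
    }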


GOMAXPROCS has defaulted to # of cores since 1.5.


Wrong. Goroutines are not simply coroutines.

GOMAXPROCS defaults to number of cores.


Prove it.

Prove it by replacing Sleep in your example with some number crunching, and show how it scales with the number of cores in your CPU.


https://imgur.com/a/DNpw3

Running the following code.

https://play.golang.org/p/k_rRxNAyb0i

I can assure you that go runs across all processors by default.

https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3...
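
For anyone who would rather not click through, a rough sketch of the same kind of demonstration (my own toy version, not the code behind the links): one CPU-bound goroutine per logical CPU, and with the default GOMAXPROCS every core loads up.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        n := runtime.NumCPU()
        var wg sync.WaitGroup
        results := make([]float64, n)

        for i := 0; i < n; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                // Pure number crunching, no sleeps.
                x := 0.0
                for j := 0; j < 200000000; j++ {
                    x += float64(j % 7)
                }
                results[i] = x // each goroutine writes its own slot, no race
            }(i)
        }
        wg.Wait()
        fmt.Println("ran on up to", n, "CPUs, sample result:", results[0])
    }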


That's not what he's saying. He knows go can use all processors if GOMAXPROCS is set correctly, the argument seems to be that there will be race conditions just like any other threading, which seems pretty self evident to me: yes, multi-threaded code can have concurrency issues, film at 11...



Dude you are wrong about this. Please stop.


Except Rust!


I wonder if the future will be massively parallel. When CUDA and OpenCL came out I thought that future processors would have more and more cores, with core counts doubling on a Moore's law cadence. The problem is that GPUs don't have error correcting codes, so you cannot really run application code on a GPU.

The problem with parallelism is that C-like languages don't fit well; only functional languages do. If you want to use multi-threading you have to forget about state and only work with input/output paradigms. For OSes it might mean a deep re-design, but I don't really know.

A possible design would be a small but very fast CPU that only takes care or scheduling and task control, and another chip with many cores that deal with payloads and user software.

AMD had some kind of hybrid chip that was planned to do both graphics and general-purpose tasks, but it was abandoned.

Going parallel would require to change both hardware and software, and by software I mean stateless.


These are very good points! I would generalize "functional" to declarative though:

Logic programming languages like Prolog and Mercury are also much more amenable to parallelization than C-like languages.

In fact, different Prolog clauses could in principle be executed in parallel without changing the declarative meaning of the program, at least as long as you stay in the so-called pure subset of the language which imposes certain restrictions on the code.


> The problem is that GPU don't have error correcting codes

.. eh? What are you referring to here? Mainstream CPUs don't have error correction either, unless you're talking about ECC on the higher-end ones.


Actually, mainstream CPUs do use error detection and correction for internal operations and memory.

https://en.wikipedia.org/wiki/Machine-check_exception

Realise that if you only protected RAM with ECC then you're leaving a lot of data vulnerable in caches and registers, so those need parity bits too (as well as lots of other error checks on CPU operations). And anyway, CPU errors are common due to bad power supplies and overclockers. And you don't want to add a whole lot of design effort to create a marginally cheaper-to-produce version of the CPU which doesn't do any error checking.

I've seen a lot of MCEs on non-ECC CPUs :(

IANAEE


I always like to think of FP dataflow pipelines as doing digital circuits modeling.


The first GPU models were basically stateless (via pixel and vertex shaders with texture input and outputs), but this was incredibly inefficient for many GPGPU tasks, so compute shaders and CUDA have ways to load from and store to arrays. The memory model is a bit funky, but I’m not sure how going back to functional is viable for GPU programming.


Intel Larrabee also got cancelled.


Kind of, it became Xeon Phi.


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward.

There has. GPU programming is exactly that. CPU-heavy tasks (games, bitcoin mining, machine learning) have already migrated.


Unfortunately it requires more than just a change of language, it requires a change in mode of thinking by developers. People are very used to reasoning in terms of "do X, then Y, then Z" or "compute a value X then do A or B on the basis of that". In order to achieve automatic parallelism you need e.g. a type+proof system that can determine that X/Y/Z are independent, or a system that can partially execute both A and B then retire the branch not taken - without invoking security bugs!


I think it is easier to understand by those of us that also had electronic design as part of the engineering degree.


This issue was discussed in my Occam class ~28 years ago. Occam itself is an example of a concurrent/"parallel-by-default" programming language intended for a Transputer hardware environment (now retargetable to x86 etc.), but it's not the easiest of languages to learn:

https://en.wikipedia.org/wiki/Occam_(programming_language)

https://en.wikipedia.org/wiki/Transputer

https://en.wikipedia.org/wiki/KRoC


Sequential programming is the default because most people think sequentially by default.

Even though claims of multi-tasking etc persist, the truth is good parallel programmers are a rare thing.

Many ordinary programmers already get into hot water when they use two threads and access data where a semaphore might be needed.

In addition, many algorithms are sequential, so parallelizing them is tricky or gives you no true reward due to cross-thread communication. Add to that the OO software structures that subtly encourage sequential programming.

I think the future in parallel programming is actually hiding the parallel programming completely - accept the fact that most humans are not made for it, allow experts to unlock the ability to override that behavior - let compilers go as far as they can and live with the results.

It will suffer the same fate as functional programming. Really useful, but never dominant, due to the limitations of the humans applying it.


I think part of the reason why many programmers haven't wanted to deal with parallel execution is because concurrency is not easy to handle. It has several pitfalls and can be painful to debug. Also, it needs proactive effort to implement, so as long as it's not required, devs just stick to serial execution.

Now, with helpful systems like no-side-effect functional languages and reactive stream frameworks, a lot of gory detail can be abstracted away. I think this has recently led to more parallel-by-default software development.


Most software is fast enough when written in a naive sequential style. For the parts that parallelize well and matter, there are already decently mature ways of using all cores. Languages like Go, Rust, and Erlang make it fairly easy to write concurrent programs.


> However, most programmers, and programming languages, remain stuck in a serial-by-default paradigm

You still have to decide upon the unit of work that is going to be sent to a different thread/core/processor/NUMA node/whatever. The different units of work that are distributed should not share state; one really doesn't want to be sharing a lot of state between different processors, because synchronizing the processors' memory caches across NUMA nodes is extremely slow.

I guess it is really hard to break up both the program and data and decide upon the optimal granularity of the work units, it is not something that can be easily done behind the scenes - human intervention is still required.


I agree, programming languages have not caught up yet.

The right kind of language looks at serial program formulations and based on flow-analysis automatically identifies parallelizable fragments that are large enough to benefit from multicore, then schedules these fragments e.g. by using work-stealing in a system of green threads, i.e., mapping green threads to OS cores as efficiently as possible. Something like that.

In a good parallel language there need to be many immutable constructs by default, exception handling is tricky, and ordinary flow control needs to be compatible by default with parallel evaluation. The languages I've seen such as Parasail are not yet production ready.

Making the programmer control parallelism can be okay, like in Go and Ada, but in the end it should be automatic.

Edit: The problem is also that finding a neat way of solving the problem academically does not readily translate into an efficient implementation, so much that I wonder whether green threads are actually worth it over OS threads. In most languages/VMs they aren't but Go seems to be an exception.


One reason is that our thinking process is inherently serial and we do really badly at multitasking naturally, at least for tasks involving deliberate thought. And our programs are almost an extension of our way of thinking, so we will have to push the boundaries of how we think and model systems in our mind before we can build excellent parallel programming languages. Not that it isn't being done, but the weight of the "serial" legacy is long...


It's not just our thought process; there are upper limits to the gains from increased parallelism.

https://en.wikipedia.org/wiki/Amdahl%27s_law
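
Plugging a few numbers into Amdahl's formula makes the ceiling concrete (the 95% parallel fraction below is just an example input):

    package main

    import "fmt"

    // amdahl returns the overall speedup for a program whose parallel
    // fraction p is spread across n cores (Amdahl's law).
    func amdahl(p, n float64) float64 {
        return 1 / ((1 - p) + p/n)
    }

    func main() {
        for _, n := range []float64{2, 8, 64, 1024} {
            // Even a program that is 95% parallelizable tops out near 20x.
            fmt.Printf("95%% parallel, %4.0f cores: %5.1fx speedup\n", n, amdahl(0.95, n))
        }
    }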


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward.

CUDA. OpenCL. Vulkan Compute Shaders. DirectCompute. C++ AMP. AMD's ROCm. Intel's SPMD. Khronos SYCL.


This argument made sense 10 years ago. We've had dual cores for more than a decade, and CPU speeds stopped growing years ago. If you still can't think in threads and their primitives, or require crutches to handle multithreaded situations, then the problem is with you.


Beyond the issues of languages and mindset, many problem domains parallelize poorly. Too many important phenomena fundamentally involve feedback loops evolving over time, and when that happens you can't just compute f(t) and f(t + 1) on different cores. At that point, throwing more cores at the problem might let you make the model bigger, but will very quickly hit a wall in terms of making it faster.


A few attempts have already been done, the issue is with developer adoption, not lack of trying.

One of the best examples was StarLisp for the Connection Machine.


starlisp required that you rephrased your problem to be a large vector, with support for turning off logical cpus, very much like programming a modern gpu. if your problem could be recast that way it was pretty nice. however the generalized scatter/gather was so expensive, it had to be used very sparingly. you had to use the grey coded nearest neighbor hypercube network as much as possible.

the *really* cool language from Hillis and Steele was cmlisp, but I don't know how far they got, they never released anything.


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system

There has. It's called a GPU. Things like OpenCL and CUDA are the new languages.


Yeah, the tricky bit is it's not just a new language but also laying out your data in a new way that's compatible with vectorization/etc. Modern OO/etc techniques love to litter pointers to random places in memory at nearly every step.

It's partly what made the PS3 so hard to write for, the SPUs only have 256kb of directly addressable memory, everything else is DMA'd. That said when you had your code+data fitting in 256kb it screamed everywhere else as well since you fit in L1+L2 cache neatly.


But it would be much better if a single language + hardware system emerged, rather than a multitude of mutually incompatible hardware systems and languages.

With the current fragmented and sometimes proprietary forest of programming platforms, an application needs to be quite specialized to warrant investment in GPU compute outside the original niche of graphics acceleration. There are other giant problems too, after you get over rewriting your application for numerous different platforms - atrocious quality of GPU drivers causing OS crashes for users, lack of any common way to debug GPU code, the colourful quality of compilers, the wildly different performance characteristics of different platforms necessitating per-platform algorithm changes, etc...

Consider what a minority of applications bother to even put in the work to exploit large amounts of CPU parallelism, which is vastly easier. There is after all >10x parallelism available on a typical PC CPU, after you count cores, threads and SIMD lanes.


> But it would be much better if a single language + hardware system emerged

There will, but it takes time. The world of scalar hardware in the 1970's was no less fragmented. Honestly most of the incompatibilities in the SIMD world at this point are bugs and not fundamental problems. The vector world has settled on a broad architecture at this point for most things.


Maybe. I feel it's at least equally likely that the app dev perceived ROI on GPGPU will get worse due to the relative slowdown in GPU advances and increasing parallel programming productivity on CPUs, and an attractive GPGPU platform won't emerge before it's irrelevant.


With just Chrome and IntelliJ running on Ubuntu, I have 281 processes and kernel workers running (as reported by ps). Just stitching the apps together for display on your screen requires 3-4 different processes (and a GPU). That's why 4 logical cores is a bare minimum these days for a desktop, even if no individual application takes advantage of more than a single core.

On the server side, when we deploy a Node.js web service on AWS we start one instance per logical core, for 4-64 processes all independently serving connections.

It seems the process has become the new thread, the smallest unit you should design for. So today's workloads actually make pretty good use of all those cores. Unless you're doing high performance computing and need to squeeze every last drop of performance, processes are a straightforward way to parallelize.


Having 281 processes is not a biggie. Having 281 processes that can actually use a piece of the CPUs is quite a different thing.

I think we should, really, start thinking about such things. Maybe prefixing instructions with the execution unit that should handle them (and overflow back to the first one in a circle if we have more EU's in software than the actual hardware provides), separating dependencies within code flow in a more explicit way and, at the same time, not bothering with creating threads.


One option is that you write in a high-level language where the top-level control code is single threaded, but you call APIs that perform multi-threaded operations seamlessly. The prototypical example of this is Python with numpy/blas (or with deep learning libraries like TensorFlow).


Chapel[0] and Fortress[1] come to mind. Wikipedia also has a list of parallel programming languages[2] (although it seems to play somewhat fast and loose with the definition of "parallel programming language").

[0]: https://en.wikipedia.org/wiki/Chapel_(programming_language)

[1]: https://en.wikipedia.org/wiki/Fortress_(programming_language...

[2]: https://en.wikipedia.org/wiki/List_of_concurrent_and_paralle...


I think JavaScript is nice because it's async in nature. Concurrency is hard so it's nice to deal with it using a simple language, so that everything besides the business logic is abstracted. Yes you do not get the same performance, but CPU cores are relatively cheap compared to engineer salaries.


The notion that cpu cores are cheap compared to engineer salaries only scales so far.

Scaling upwards, your opinion on that changes when a single engineer’s service is running on 10k machines.

At the other end of the spectrum, if you're developing high-performance applications for small systems (desktops, laptops, mobile), your workload isn't going to look like tens of thousands of concurrent independent requests, so the approach of getting parallelism by deploying multiple copies of the application no longer works.


> but CPU cores are relatively cheap compared to engineer salaries

I often experienced that this backfired. Single machines are still constrained in their power, and while it's easy to spin up additional VMs in the cloud, scaling a program properly to run on dozens of machines takes a lot of work. It can be faster to develop a program that is really efficient and can solve the problem on one machine than to develop quickly only to then spend the time scaling it out to a large fleet of servers.


I kinda regret adding the note about performance, because switching to a lower level language used to yield orders of magnitude more performance, but optimizers have evolved and today there's not much difference. Sometimes the higher level language will be even faster because of optimizations. And bad performance is often not to blame on the language, instead blame the programmer or more likely the business people as they think it's great to waste resources as it gives them an excuse to charge more.


> used to yield orders of magnitude more performance, but optimizers have evolved and today there's not much difference.

If this were true, you'd expect to see a lot more native Python and the like.


Python is interpreted so super optimizing compilers, SIMD auto vectorization and other recent goodies necessary to get that performance won't work.


Python is not always interpreted. And when it isn't, it's still slow.


Materials other than silicon can support higher clock rates.

From reading the open literature and advertisements by foundry companies, I think you could make a 6502-equivalent processor in indium phosphide with 64 KB of static RAM that clocks at 30 GHz. With a more refined process you might push 90 GHz and a much more complex processor.

Yes, InP is more expensive than Silicon but part of that is the low volume that InP parts are made in. Advances in Silicon are getting much more expensive, and one InP microprocessor could do the work of ten Silicon-based cores so you can save on die area without the "race to the bottom" in size.

The main issue with high clocks is fast access to memory, probably you would need an optical interface to off-chip RAM, also I don't know what the InP equivalent of DRAM is. (Something like Optane?)


1. Compound semiconductors often suffer from p-type and n-type conduction problems. Plus they're ridiculously expensive and power hungry. They also often lack a native oxide.*

2. The problem today in CPUs is not really clock speed but much more the memory access latency; Optane is much slower than DRAM and has much lower endurance.

*Even though silicon HKMG transistors use high k gate dielectrics now, they still use a silicon dioxide interfacial layer.


> I'm surprised that there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it to keep things going forward.

there has and it's called labview, although by hardware you may have meant processors. labview has many quirks, but it surprisingly gets many things right, even in futuristic ways. when i move back to text-based languages it's always a jolt primarily due to the serial nature of them, even those that support asynchronous computation. it's really hard to recalibrate to having to assign things to temporary variables and the like. and the lower dimensions of a text file compared to a higher dimensional canvas is something that sticks out as a limiting factor in supporting parallel by default.

> I find the apparent stagnation extremely depressing.

agreed.


Golang's web server, by default, serves every request in a new goroutine (thread of execution), making it parallel by default with no effort on the user's behalf.

Obviously this problem domain is easily parallelized, but it's nice to see parallelism be the de facto standard when possible and reasonable to do so.
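
A minimal illustration (toy example, handler made up): the standard library calls the handler in a fresh goroutine for every incoming request, so concurrent requests are served in parallel with no extra code.

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // net/http spawns a new goroutine per request behind the scenes.
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from", r.URL.Path)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }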


Just to be clear, goroutines aren't threads (although they are multiplexed onto threads), and they're only pre-empted at certain points in the go runtime (function calls are the big ones.)

If your request spins in a for-loop doing lots of work without function calls, other goroutines on the same thread won't get a chance to run, and you'll be limited to GOMAXPROCS simultaneous requests. In practice this never really happens though.


While modern languages' support for parallelism is adequate, tools are still lacking IMHO. I avoid parallelism when it's not necessary, because debugging all these race conditions, deadlocks, synchronization issues etc. is a nightmare.


That is mostly an issue with the "avoid IDE" crowd.

While IDE tooling can still be improved, the parallel debugging tools in .NET and Java eco-systems are already quite good.

On VS, I can have at any given moment a graphical snapshot on how all threads and tasks are interacting with each other, or just execute some of the threads.

It doesn't solve everything, but it makes it easier than a typical gdb session.


.NET has one of the best ecosystems overall, so it's more an exception than the rule (can't comment on Java as I don't work with it). As I'm mostly in low-level, embedded systems programming, you are still stuck with gdb, Valgrind and other primitive tools there, because if there's some legacy "IDE" at all, it's most often just some half-assed Eclipse plugin.


At least looking at their product sites, Microchip and Green Hills seem to have quite good tooling; then again, I don't have embedded experience on modern systems beyond mobile devices.


Does Microchip even produce multi-core MCUs? I haven't seen one, though I worked mostly with the Cortex-A series so I might have missed something.


As I said, I don't have much experience in real embedded domain outside mobile devices (iOS, Android, UWP), but aren't Cortex-A5 supposed to handle up to 4 cores?


Yes, so it looks like Microchip produces multi-core MCUs after all, though as I've mentioned, I haven't encountered them and can't comment about quality of their tools.


I guess it's not going to help a lot except in special cases. See for example what Linus had to say about many cores and the long discussion after.

https://www.realworldtech.com/forum/?threadid=146066&curpost...


New languages are coming up; it will just take long adoption cycles, given that the library and surrounding tooling ecosystem has to become mature enough to go with them.

Prominent examples being Go and Perl 6.

Perl 6 especially, given how audacious the project is. There are performance issues with it currently, though; from what I hear they are working to fix them soon.


I remember someone posting some Perl 6 code and C code that did the same thing. The Perl 6 code was shorter, easier to understand, more correct (Unicode), and was reported to be faster for what they were doing.

There are things which are slower, but since it is a higher-level language it may be easier to try multiple algorithms, one of which may be significantly faster. It also has many useful features included, which can be optimized in ways that aren't recommended for user code (writing the algorithm in NQP). There is also a code specializer (spesh) and a JIT.

Basically for many things it can be fast enough. Also if you profile your code and find something that is egregiously slow you should report it. Many times such things get optimized quickly.


There are many languages that handle parallelism well, but not all problems really need parallelism in the program itself.

For example, for web programming, you can throw several machines (or processes) running your serial program at the problem and have it run in parallel for all practical purposes.


There are limits to how much parallelism will improve things; see Amdahl's law.
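
For reference, the law itself is just one formula: if a fraction p of the work can be parallelized over N workers, the best possible speedup is

    S(N) = 1 / ((1 - p) + p / N)

so even with p = 0.95 and unlimited workers the speedup tops out at 1 / 0.05 = 20x.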


Sure, but let's start moving towards parallelism first. Maybe a few decades from now we would be close to breaking those limits.


That law is a fact of life. It's a form of diminishing returns; there's only so much you can do.



Haskell kind of qualifies. You can't have race conditions when everything is immutable.

Of course this makes some other things a bit more difficult.


Seastar framework


Honest question: Is that name a pun on C* (https://en.wikipedia.org/wiki/C*)?


Microsoft has developed Parallel LINQ, which is part of the .NET Framework and works like a charm.


Is it depressing or is it a kind of signal we should listen to and try to interpret ?


Look at VHDL or Verilog programming for FPGAs. It's effectively exactly what you're talking about.


shaders?


In fact, there is! Erlang is extremely parallel. Processes are lightweight, share nothing, and run on as many CPUs as you have.


And also its derivatives, like Elixir.


Elixir is amazing! The original author, José Valim, took a brilliant approach: Take a 30 year-old battle-tested highly-parallel VM, and build a modern language on top of it. Syntax is inspired by the good parts of Ruby (very clean), but nothing comes close in terms of the ease of parallelism... it's so natural, and insanely fast.

The unit tests are what convinced me this will be the next big thing. Beautifully clear syntax, succinct tests, and most importantly: Parallel out of the box. Hundreds of unit tests run instantaneously. Ruby TDD setups run tests that changed with maybe 1-2s lag... Elixir runs all the tests so fast that, at the beginning, I wasn't sure the tests were running.


I can't shake the feeling that this parallelism thing will be nothing but a wild goose chase, because nature seems to be highly serial in all but the most macroscopic of senses.


How can you say that? Almost every part of every organism functions simultaneously all the time. At the community scale, organisms cooperate and communicate continuously in real time. My eldest daughter is going through an ant obsession phase at the moment; their communities, and how they coordinate hundreds of individuals simultaneously, are amazing. Nature and natural selection are the ultimate optimizers - if there is any way for an organism or species to extract even the tiniest survival or reproductive advantage, nature ruthlessly optimises for it. Parallel function and behaviour is one of its primary tools.


All of your cells and neurons operate in parallel.


And yet you only have a single train of thought [1] – or multiple interleaved trains, but that's concurrency, not parallelism.

[1] Plus a couple of background processes, like breathing, that execute in parallel.


That's a huge simplification not pertaining to the fundamental nature of the brain. Maybe it is in the nature of our consciousness (which in itself is probably only an abstract concept) to perceive the processes it emerges from as a sequence of single trains of thought rather than a chaotic, continuous consolidation of processes both internal to the brain/body and outside of it.

You could look at society as a whole and see the zeitgeist as a singular "train of thought", but you'd probably still recognize humans as individual agents. I think we have a bias towards thinking of ourselves as the ultimate individuals, neither recognizing the processes within us (like those of our cells) or the processes beyond us (like those of a group of people, animals, plants etc.) as having a similar nature. This is probably a genetically advantageous trait.


> That's a huge simplification not pertaining to the fundamental nature of the brain.

I disagree. Conscious thinking is a crucial process that's inherently single-threaded, even though it runs on highly parallel hardware (the billions of neurons).

However, I must admit that my point doesn't necessarily contradict pjc's point, and I don't really agree with digi_owl's claim that parallelism is a "wild goose chase".


> I disagree. Conscious thinking is a crucial process that's inherently single-threaded, even though it runs on highly parallel hardware (the billions of neurons).

I don't even know what to make of that. What do you mean by "single-threaded" if at the same time you recognize that it "runs on highly parallel hardware"? If you actually mean that our consciousness emerges from a purely sequential process that happens to go on in a highly parallel system, no, that's clearly wrong. Experience, the fundamental basis of consciousness, actuates many parts of the brain at the same time. They process this information largely independently in different ways, and sometimes those processes result in a clear "train of thought" but most of the time they do not. You can not reason about the inherent nature of our consciousness in terms of trains of thought if you recognize any subjectivity to our experience that exists without reasoning or language. That's a matter of definition, of course, and without agreeing on a precise definition it's probably no use talking about what is inherent about it.

If that's not what you mean, CPU execution models are probably not a very helpful metaphor to explain your idea. The clearly defined layer of abstraction that separates a fully pipelined CPU design built with simultaneously operating logic gates from "single threaded" programs being executed on it doesn't exist in brains. I guess that's what bugs me most about this type of discussion on HN. It seems developers are very fond of taking their (admittedly versatile) hammers and hammer away at anything they can think of, for better or for worse.


> If you actually mean that our consciousness [...]

I'm not talking about consciousness (as in qualia or subjective experience), but about conscious thinking as in train of thought or intentionally thinking about something. For me, this process itself feels very sequential.


"there hasn't emerged a "parallel-by-default C++" kind of language + hardware system to exploit it"

That's roughly what Intel tried with Itanium. I don't know if whatever barriers they hit are still barriers today.


Not really - Itanium tried to move the instruction-level parallelism and the burden of re-ordering execution to the compiler. You could only execute the maximum of six instructions per cycle if there were no data dependencies. So it only really works for certain kinds of algorithmic, data-heavy execution; that's why VLIW is only found in DSP architectures today.

Good SE answer here: https://softwareengineering.stackexchange.com/questions/2793... especially the ones focusing on cache misses.


"Itanium tried to move the instruction-level parallelism and the burden of re-ordering execution to the compiler"

That sounds a lot like "parallel-by-default C++" kind of language + hardware system to exploit it"

Just swapping "compiler" for language. They didn't succeed, but they did try

Edit: helping me understand where I'm off might be more helpful than a downvote. Does swapping "compiler" for "language" not represent what Intel was trying to do?


With the web and requests the default would look to be parallel for most developers. The container runtime (Ruby, Python, Java) just abstracts away the parallelism on different cores.


Seems like an incredibly long-winded way of saying 'To go faster you either need to split up each instruction into lots of parts or increase the voltage for the transistors. We've split the instructions as much as we can, and power consumption is proportional to voltage cubed, so it's not a scalable plan.'
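
The back-of-the-envelope version of that scaling argument, ignoring leakage (the f ∝ V step is only a rough approximation over the usable operating range):

    P_dynamic ≈ C · V² · f    (switched capacitance C, supply voltage V, clock frequency f)
    f_max     ∝ V             (roughly, since higher voltage lets transistors switch faster)
    =>  P     ∝ V³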


More importantly than mere power consumption, we don't have a way to remove the waste heat generated. Dennard scaling (like Moore's law, but for power consumption per transistor) ended about 10 years ago, at exactly the same time clock speeds stopped improving. There are actually a few computers out there that run at around 10 GHz, but they all have impractical cooling systems.

If there ever were a return to exponential scaling, we would very soon run into the Landauer limit.


For those who are curious, see the Wikipedia article on Landauer's Principle [1]

I hadn't heard of this before, but it sounds like we are a long way off from reaching the limit; as the article states, modern computers use millions of times more energy than what Landauer's Principle implies is the lowest possible amount.

1: https://en.wikipedia.org/wiki/Landauer%27s_principle
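
For a sense of scale, the limit at room temperature works out to roughly

    E_min = k·T·ln 2 ≈ 1.38×10⁻²³ J/K × 300 K × 0.693 ≈ 2.9×10⁻²¹ J per erased bit

so even a chip erasing 10¹⁸ bits per second at that ideal efficiency would dissipate only about 3 mW.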


> If there ever were a return to exponential scaling, we would very soon run into the Landauer limit.

No we wouldn't. We're around ~10,000X off and would run into thermal danger zones long before.


BUT... the Landauer limit is overoptimistic, because unless your computer runs at absolute zero you need to keep redundant copies of each bit for error correction. Transistors are implicitly error correcting in the sense that each bit is represented by a current of a few thousand electrons.

Factoring in the redundancy requirement we are likely only off by somewhere between 100x and 1000x. If there were ever a return to exponential technological improvement, we would run out of road after a few years.


That's my point though, the bottleneck is not going to be Landauer's Limit.


Re: all the programming replies.

Preface, I'm not a programmer, I'm a hardware guy.

It's all well and good to make sure your programs and future programs can run in a parallel fashion, but there is a big hole in that, and it's the operating system's methods of handling cores and threads.

Let's use folding@home as an example. Very multithreaded. Now let's use, at first, the Ryzen 1800X as the hardware we'll run it on. We have 8 physical cores, arranged as two core complexes (CCXs) of four cores each, and each CCX has its own level 3 cache. As you use your system while you are also folding, even in the newest Linux kernel, data and instructions might get evicted and bounced around and take latency hits and thus performance hits. Nothing really locks the work to the cores or threads taking locality into account. You can adjust this with htop and pin each folding thread manually.

Beyond AMD, even Intel has similar issues still with the 8700k. Hell, in general just efficient multithreading seems like a tough compromise for OS development. "Users" want things to be smooth upon interaction, so you have preemption. Work wants to get done but it also wants to be a good citizen to the rest of the system.

Developers are going to have to learn about, and keep up to date with, much more than a fancy new language. You're going to have to learn each new CPU inside and out, and how each OS treats it.


How did the BeOS designers make the BeOS so good at multiprocessing? I remember how well the operating system scaled with more than one CPU.


Applications scale, not operating systems.

There was nothing magic in BeOS, simply multithreading that seemed novel at the time in consumer-level hardware.

Hate to say it, but there was no magic, just engineering that everyone now has.


Probably because it ran on PowerPC. If I remember my systems design class from 20 years ago, RISC makes implementing or scaling multiprocessing easier.


Intel processors essentially are RISC now internally. They just expose more complex instructions as a higher level API.


They are not RISC internally. In fact, I read an interview with an Intel engineer saying that an Intel CPU has >10,000 uops. That's not in any sense reduced!

And it doesn't make sense to apply the term "RISC" to internal CPU design anyway.


It ran pretty well on Intel also. And it originally ran on AT&T Hobbit, whatever the hell that was.


Very well put! Not an expert but, as far as I know, parallel OS development definitely seems to be a frequent blind spot in the systems literature...


The traditional OS is becoming increasingly irrelevant in these days of virtualisation. I predict that future high-performance systems will be built with unikernels.


It's not a blind spot; concurrency and parallelism are at the core of every university-level operating systems course. There are mountains of white papers discussing every aspect you can imagine. It's simply that all existing OSs are good enough.

Applications scale, not operating systems.

The problem is now at the application/algorithm level, not the OS level.


Notably, single-thread performance of code that is not friendly to vectorization has not stagnated despite stagnant clock frequency. Indeed, SPECint performance continues to grow exponentially, albeit more slowly since ~2004.

https://www.karlrupp.net/2018/02/42-years-of-microprocessor-...


> Indeed, SPECint performance continues to grow exponentially, albeit more slowly since ~2004

Interesting chart! However that looks like moving goalposts. If you allow functions to still be exponentials when the rate repeatedly diminishes, any monotonic function can be an exponential: f(x) = x is an "exponential" with diminishing growth rate log(x). That curve looks approximately logistic to me, as most technologies usually do.

Actually I believe speculative execution theoretically allows increasing linearly single-threaded performance for exponentially more multi-threaded performance -- so a transistor price Moore's law (if not power/density) should allow a continued linear single-threaded growth. The problem is that without power (inverse) scaling, this would also cost exponentially more power. It seems the slight increase in power visible in that chart could account for a fraction of increased single-threaded performance (other factors would be improved efficiency and architectural gains); expect those factors to also stall soon (which would fulfill the "logistic prophecy").


Depends on the code: you can write ASM that's as fast on an old P4 as on a modern i7. Just access random RAM locations and modern CPUs suck.


Increasing CPU clock speed also does not reduce DRAM latency.


We are talking about the sum of latencies. The CPU needs to do something AND you need to fetch from DRAM. Modern CPUs have increased the worst-case overhead of fetching from DRAM to decrease the average case.

That's usually a good trade-off until someone wants to make your CPU look terrible.
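
A rough illustration, assuming a typical miss-to-DRAM latency of ~100 ns (the exact figure varies by platform):

    100 ns miss at 3 GHz  ->  ~300 core cycles stalled
    100 ns miss at 6 GHz  ->  ~600 core cycles stalled (the same wall-clock time)

With a random access pattern nearly every access is a miss, so the clock rate barely matters for that code.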


My senior design course focused on asynchronous (clock-less) cryptography circuits; after learning of these, I looked into asynchronous general purpose processors, and learned that ARM actually designed an asynchronous processor back in the 2000's [0].

While they've never quite taken off (the extra gates decrease speed and are harder to manufacture), with the recent side-channel attacks on processor pipelines I've been hopeful that I would see something pop up. Imagine a world where our processors run without a clock!

[0] https://www.eetimes.com/document.asp?doc_id=1299083


Wouldn't asynchronous CPUs increase the amount of side-channels? AFAIK in an asynchronous circuit every aspect of the calculation may affect the time it takes to complete.


The attacks I'm aware of use processor timings to measure the effects their programs are having. However, because an asynchronous circuit doesn't complete tasks in a reliable cycle, you can't measure how the pipeline is being affected by your program in the same way. You could find new ways to force the processor to act in a reliable manner that you could measure, but that may change quite randomly from processor to processor, or even from moment to moment, depending on external environmental factors.


Nope. In fact EMI and differential power analysis based approaches are actually heavily mitigated in asynchronous, as is bus snooping.


One of my CS lecturers worked on this: https://www.cl.cam.ac.uk/~rja14/Papers/micromicro2003.pdf

It turns out that the advantages are not so great as might appear, and the situation gets worse as DRAM delay dominates. Also logic designers are a conservative bunch and getting everyone to replace the industry standard design toolchain is a big ask.


That was Steve Furber's AMULET (https://en.wikipedia.org/wiki/AMULET_microprocessor)

He was one of my lecturers at uni; a shame not much became of AMULET though.


Shameless plug, we at Vathys run asynchronous: https://youtu.be/4nSn0JhZX18


Very cool :)


Thanks!


I understand that companies heavily invested in silicon would like to portray it that way, but CPU frequencies haven't ceased to grow.

DARPA manufactured a THz transistor made of InP back in 2014.

Silicon isn't the only semiconductor in nature, and others are actively being researched.

Also, "when you increase the frequency you increase the power" (which is their argument) doesn't explain why they can't increase the frequencies. That was always the case, even back in the 1960s.

What they actually need to explain is why they can't make silicon more power-efficient anymore. All the toy-physics arguments mentioned there (such approximations/linearizations work only for a very limited range of frequencies, if they work at all, meaning the scaling relations aren't universal like they're trying to portray, and the coefficients they ignore aren't constant across voltage, frequency, materials, etc. either; you almost never get such simple and universal answers in condensed matter physics, even for much simpler problems) could have been made 50 years ago as well, but silicon CPU frequencies did go up.


Transistor switching frequency has almost nothing to do with processor clock speeds which are almost entirely limited by wire RC delay. You will get increased drive currents by switching to higher mobility materials but the performance improvement over strained silicon isn't that large.


That's just a plumbing problem which can be solved by lowering temperature or using a different material with lower resistivity. Yes, it'll probably cost more, but it's a problem that can readily be solved.

But if your switching frequency is slow, it doesn't matter if you use a superconductor for wires. It is the switching frequency that truly determines the limits for gate times, which in turn determines how fast your CPU is.

For the record, SiGe is also very promising in terms of switching speeds. There were experiments which showed near-THz frequencies.


>That's just a plumbing problem which can be solved by lowering temperature or using a different material with lower resistivity. Yes, it'll probably cost more, but it's a problem that can readily be solved.

No it's not. What is this magical material with ultra low resistance? And how do you plan to reduce capacitance?

Btw, manufacturing terahertz speed transistors is very difficult. There are Mott FETs which will switch at 10 terahertz, but they're incredibly hard to manufacture and very power hungry.


There's nothing magical about it. As has been pointed out, chips with superconductors have been a thing for a long time (at least in experimental physics), and resistivity is a (nonlinear) function of temperature, so lowering the temperature almost always lowers the resistance; you can achieve this to an extent even with non-"magical" materials, without going to cryogenic temperatures. You don't have to reduce the capacitance; you'll be fine as long as you can lower the resistivity.

Do you have any references for FETs that switch at 10THz? I've never heard of it and I'm interested in the physics of it.


I can’t speak on their usability in circuits. However, materials that show properties characteristic of zero resistance exist. I’ve used them at work before.

https://en.wikipedia.org/wiki/Superconductivity


And how do you propose to cool every chip to cryogenic temperatures?

Not to mention the manufacturing challenge of integrating superconductors into chips (I think InP would be the easiest candidate, and that's saying something...)


In some cases, because it's not necessary. Take the Gen 1 Google TPU. It uses a 700 MHz clock rate but processes 65,536 things at the same time, with very simple instructions.

Here is a great paper comparing it to chips with far higher clock rates.

https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

Now what will be interesting is whether this new architecture can be used for more traditional CS functions.

I love this paper from Jeff Dean on using TPUs instead of a CPU to replace a B-tree, for example.

https://research.google.com/pubs/pub46518.html

This also solves our multi-threading issue: basically it's done in a multi-threaded manner from the ground up.

We get a round peg for a round hole.


Let me take the opportunity to plug my favorite computer architecture course, from ETH, which was posted here recently.

https://www.youtube.com/playlist?list=PL5Q2soXY2Zi9OhoVQBXYF...

https://safari.ethz.ch/architecture/doku.php


I find their explanation of the pipelining issue slightly confusing, probably because they tried to simplify it to the extreme:

>One could object to this and note that due to shorter clock ticks, the small steps will be executed faster, so the average speed will be greater. However, the following diagram shows that this is not the case.

Said diagram shows that the two-clock-tick step locks the pipeline; that is, you can't execute the first clock tick of the next instruction if you're still running the second part of the previous one. When would this be the case? Isn't the entire point of pipelining to divide a function into smaller steps that can be run in parallel? If you can split "step 3" across two clock cycles, couldn't you effectively subdivide it into two steps that could run in parallel?

I suppose that eventually you run across the issue that adding additional pipeline stages increases the logic size, which in turn causes it to run slower, or something like that. I wish the document were a little more specific; after all, it doesn't hesitate to throw in the physical formulas for power dissipation in the second part, so clearly it's not afraid to dig into technical details.


> If you can split "step 3" across two clock cycles, couldn't you effectively subdivide it into two steps that could run in parallel?

There is a difference between splitting "step 3" across two clock cycles and splitting "step 3" into two separate steps. The underlying assumption here is that "step 3" is indivisible. E.g. say "step 3" was memory access and the latency for that is 500 picoseconds, it's not like you can just split it into two steps and make it load faster.
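
Putting rough numbers on that, with the 500 ps step assumed above:

    2 GHz clock (500 ps period): the step fits in one cycle of its stage
    4 GHz clock (250 ps period): the step now occupies two cycles of the same stage

Either way that stage can only accept one new operation every 500 ps, so the faster clock buys nothing on this path.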


> But remember, wrong overclocking can harm not only processor but you as well.

Don't date robots!


They fail to mention that dividing instructions into more pipeline stages means greater branch penalty which in turn prompts reckless speculative execution schemes to compensate.


The rule of thumb in chip design: your chip's clock is only as fast as your slowest logic path. Complex logic circuitry slows down the achievable clock rate significantly. You can make your logic gates switch faster, thus allowing longer signal paths, but it has a huge energy trade-off, as the article states.

Multi-core design now seems to compensate for slower clock rates, but it also has its trade-offs. It makes software more complex. In the case of the CISC architecture Intel established, it's a huge trade-off, since CISC is supposed to make its processors easier to program, as opposed to RISC. I don't think that CISC is a good choice when it comes to massive parallelism.

But, since chip design is so expensive and is considered state-of-the-art high tech, we'll need to deal with everything that chip makers throw at us. Or do we?


I don't think that CISC is a good choice when it comes to massive parallelism.

On the contrary, I think the increased code density (reduced fetch bandwidth --- very important for multiple cores) and greater semantic information of CISC instructions is crucial for parallelism. Large operations can be broken up into individually scheduled uops inside the core, and those uops can then be parallelised, without the equivalent of fetching all those uops from memory as might occur in a classic RISC.

In fact, even modern ARMs use this uop-based "instruction splitting" in their microarchitecture.


If performance gets significantly limited by instruction cache misses, then yes, code density can become important, as you point out. However, there are two things to keep in mind:

1) Most actual RISC ISAs have compact modes, typically mixing 16-bit instructions with the regular 32-bit ones. That's Thumb-2, microMIPS, the RISC-V compressed (C) extension, and others for embedded CPUs (ARC, Andes, ...). Their code density is competitive with (and sometimes better than) x86. So with practical RISC implementations, code density is not a factor in RISC vs. CISC;

2) There's a big outlier, if I remember correctly: ARM in 64-bit mode dropped Thumb-2 support. They certainly know how to do a compact mode, and they decided not to bother. So I guess the I-cache limitation is maybe not such a problem in real life? I don't have the data, but I trust ARM to take benchmarking seriously, particularly for an ISA that also targets server chips.


The RISC/CISC "tradeoff" is mostly a non-issue at the higher end of processor design: everything is now a hybrid. You have ARM64 with its SIMD and floating point extensions that hardly qualifies as "reduced" on one side, and Intel systems that have a suspiciously RISC-like internal architecture fed by decoder of the "legacy" CISC instruction set.

It still matters at the small end, which is why Cortex-M exists.

> Or do we?

A startup can design its own chips, but good luck getting anyone to use it.


It worked for P.A. Semi (eventually).


Challenge Accepted


The article doesn't answer the question at the fundamental level. The closest it gets is this: "Increased frequency depends heavily on the current level of technology and advances cannot move beyond these physical limitations."

Certainly, Moore's law is just an observation and cannot go on forever. Would it be fair to say that we've simply reached the point where we can no longer "keep up" with Moore's observation because the technology is getting harder, and not because we've actually reached any limit of physics?


You're skipping over the main point of the article, which is that it's hard to increase the clock rate because the clock is limited by the slowest step that has to complete in one tick.

And the main method of making an instruction faster is by splitting it, but all instructions have now already been split as much as is possible, while still having them operate correctly.


But this shouldn’t be true for a superpipelined processor, right?

Or to put it another way: let's say phase 3 contains several important instructions that cannot be reduced to a length of less than 1.7 clock ticks. If that pipeline stalls, you have other pipelines that won't.

Or you go crazy and put in two copies of the slow path of phase 3, with one taking the even ticks and the other the odd ones.


Indeed, that's what I looked for too. The current transistor gate pitch is approaching atomic sizes, which results in electrons leaking through by quantum tunneling. I was hoping to understand more about how speed of light and quantum behavior is preventing further progress. Article only mentions transistor switching speed.


Because the subject of the article is explicitly about clock speed and not about making transistors smaller.


Until we have another material that could replace silicon. Not sure if we will see this happen in the next ten to twenty years.


The 60 mV/dec limit applies to all materials, not just silicon. Beating it will require a fundamentally different type of operation. Agree that changes in materials will require massive investment and learning.
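
For context, that limit comes from the thermal (Boltzmann) distribution of carriers rather than from any property of silicon:

    SS_min = (kT/q) · ln 10 ≈ 25.9 mV × 2.30 ≈ 60 mV per decade of drain current, at T = 300 K

which is why beating it takes a different switching mechanism, not just a different channel material.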


There are transistor designs that can subvert the subthreshold swing limit, see NC-FETs and TFETs.


Sure, but many materials have much higher electron and hole mobility than silicon has.


He's talking about the Boltzmann limit of MOSFETs


Carbon nanotube transistors would be great if we could get them to form reliably instead of having 10% of the transistors fail to form.


Diamond transistors have been demonstrated in the lab in the 90's already.


Graphene. The material of the future.


TeraHertz CPUs would be nice.


Crank them up a little more and they will glow due to reaching visible light territory :)


Perhaps leading to little pixel-sized photovoltaic cores that absorb light on one side, do a bit of computation at 400-800 THz and emit light on the other side, merging the display with the CPU. I've heard that VR goggles with 8K resolution per eye would be approaching the limit of the human eye's resolving power, which would be about 66 million cores.


Crank them up a little more and they will go into deep ultraviolet. Problem solved.


There is no problem here, marketing folks would love it :)


So would gamers.

"Nice LEDs bro" "Nah that's my CPU"


Exponential growth can't be infinite. Who knew! Mind blown!


I've been looking at Haskell, Rust and Go to help with parallelism, but decided to go with a lesser-known language: Pony. I haven't used actors a lot so far, but it looks really promising.


Rust has an actor framework among other great projects in its ecosystem:

https://github.com/actix/actix


Particularly given you haven't used actors, what advantages does Pony give you over Haskell?


Right now I find Pony more approachable than Haskell. Maybe because I never fully grokked monads (probably my fault for not persevering enough); I always had problems composing monads.

Also, Pony's promise of garbage collection that runs concurrently with program execution is very appealing, since I want to write low-latency server code.


Funny to read the comment section of the Russian source article. People argue about the need for multicore processors in a desktop PC. Especially funny when I'm reading those comments from a 16-core machine.


>temperature

Using a proper thermal interface material in their CPUs would be a start...

When you can decrease the temps of Intel CPUs by 20°C with delidding, the heat argument seems rather contrived.


"Only you can prevent overclocking fires!"

Nicely written. Seems like the intended headline was "why it's bad to overclock", though!


It's a matter of cost and cooling, really: the IBM z13 runs at 5 GHz, the z14 at 5.2 GHz.


No, it's a matter of economics. Those zXX chips are not cheap, nor is what they interface to.


There is a new relevant episode of changelog's podcast that talks about CPU advancements.

https://changelog.com/podcast/284


Cost/benefit, AFAIK -- higher clock speeds require advanced cooling, use more power, etc., and it's been possible to get more speed at lower cost by increasing the transistor count instead.


Are there any CPUs out there with FPGAs tacked on that are available to the hobbyists/gamer/build your own PC crowd?


There are some hobbyists using the 28nm Xilinx Zynq, a hardened circa-2009-cell-phone dual-core ARM with on-die FPGA. One popular board is the https://www.crowdsupply.com/krtkl/snickerdoodle


Not really; the market for combined processor/FPGAs is basically the opposite: FPGAs running a soft-core processor. Same effect, though.


Asynchronous CPUs or bust.

Or, better yet, wave pipelining...


It's there if you want it, but liquid helium doesn't come cheap.


"CPU manufactures will not allow a meltdown to happen."

No one in the office understands why i'm laughing....


The spectre of meltdown is haunting CPU manufactures.


Great, now I have to clean all that tea off of my display. ;-P


And I thought that humor was verboten around here... Does HN have an exception in the book that applies to Intel?


I don't think humour as such is frowned upon on HN - if I was to attempt to write down the unwritten, I would say that posts that are just jokes tend to go down badly, but jokes that make a point or serious posts that are written with some wit are generally accepted.


The guidelines do not mention humor as such. It is often frowned upon, but I think the delicious irony in this case warrants an exception. ;-)


> But there are also strong concerns that the increased frequency will raise the CPU temperature so much that it will cause an actual physical melt down. Note that many CPU manufactures will not allow a meltdown to happen


At least they have a sense of humor.

Edit: Oh, 2014. The sweet irony.


Many CPU manufacturers will not, but Intel will.


(2014)


Thanks! Updated.


Guys, I'll be honest: I found it really odd that the article didn't talk about the speed of light and die size constraints. (c / 4 GHz = 7.49 cm only; if you double that frequency, you have half that distance in which to put components between any two clock ticks.)

But there are limits to my hubris - this is on intel.com, so I'm going to go with "I'm the one missing something". Are the speed of light and the number of transistors you can put in that path (due to die size) just not practical constraints? Neither is mentioned.


The key factors in integrated circuit delay are more to do with capacitance; in order to change a gate's transistor from off to on, the driving gate has to charge the capacitance of the driving wire and the driven gate. Making features closer together increases their mutual capacitance.

(source: worked on this for a chip design software company. The delay approximation was based entirely around R/L/C modelling and had no terms for the speed of light per se. If I remember rightly it was calculated in integer pico-meters; I definitely remember it emitting an error message if you had more than 2cm of wire in any one net!)
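
For reference, the usual first-order delay model behind that kind of R/C approach (these are the standard step-response constants, not anything specific to that tool):

    t_50% ≈ 0.69 · R_drv · C_load     (lumped RC: driver resistance charging the total load capacitance)
    t_50% ≈ 0.38 · R_wire · C_wire    (distributed RC wire)

Neither term involves the speed of light; for on-chip nets the RC time constant dominates the electromagnetic time of flight.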


I.e. the problem is the state of the processor, which needs to be erased. So we could make a stack of stateless processors which readily accept fresh data, because they would only need to charge capacitors, not discharge them, and would then be discharged after use. Kind of a multicore design, but with each core used for only 1/n of the time at, e.g., 1 THz. Unlike in a parallel system, sequential calculation would work faster in such a setup.


.. what? No, the capacitance issue is inherent in any kind of electrical signal sent over wires.


discharging and charging capacitance is generally pretty symmetric. I don't think what you're suggesting would provide much benefit.


But heating and cooling are not symmetrical. We can heat a processor much faster than we can cool it. So, if we need to cool a processor 10x faster than we can, just use 10x more processors and switch between them in order, to allow each to cool after being used in overclocked mode. Using this simple technique, the frequency could be raised by a few GHz, which is important for serial computations.


To amplify what pcj50 said: The speed of light is a very real constraint. It's just not the constraint that people are hitting, because there are other constraints you hit first (capacitance, heating, etc).

I recall reading clear back in the 1970s that IBM mainframes were trying to do a dual processor setup. This wasn't multiple cores on one die, this was separate physical boxes. And they were having trouble because they wanted them to operate in sync (in the sense of presenting one image to the OS and applications), but they were more than a foot apart, and they were operating on sub-nanosecond timescales. For them, the speed of light was definitely a constraint. Even if they got around all the electrical stuff, the speed of light still put a limit on how "in sync" those two CPUs could be.



