GCC already has something similar: the parallel mode [1]. It is based on the Multi-Core STL (MCSTL) developed at Karlsruhe University. Some publications are also available at [2]. As far as I know, this already works quite well.
If you have a finite number of totally independent, load-balanced operations to run in parallel, then you don't really have a parallelism problem in the first place.
We need libraries to help us where the operations are not independent and not balanced.
Except this is an implementation of a proposed standard parallel for loop. The main purpose of a standard is to codify old-hat technologies so people don't have to reinvent the wheel anymore.
But this is not another #pragma omp parallel for. It's a parallel implementation of STL algorithms like std::sort or std::nth_element etc. - you can easily replace your sequential calls with multithreaded versions. Like the top post, I would recommend having a look at GNU's parallel mode (aka MCSTL, Multi Core STL). Peter Sanders recognized the potential of such an implementation very early; his group published the first version of MCSTL in 2006: http://algo2.iti.kit.edu/singler/mcstl/
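To make the "replace your sequential call" idea concrete, here is a minimal sketch of what a multithreaded sort with a std::sort-like call site could look like. The helper name two_way_parallel_sort is hypothetical and this is not how MCSTL or the proposed parallel STL is implemented; it only splits the work in two, using std::async from the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Hypothetical helper, not part of MCSTL or any standard proposal:
// sorts the two halves of v concurrently, then merges them.
template <typename T>
void two_way_parallel_sort(std::vector<T>& v) {
    auto mid = v.begin() + v.size() / 2;
    // Sort the left half on a separate thread...
    auto left = std::async(std::launch::async,
                           [&v, mid] { std::sort(v.begin(), mid); });
    // ...while the calling thread sorts the right half.
    std::sort(mid, v.end());
    left.get();  // wait for the left half before merging
    // Merge the two sorted halves in place.
    std::inplace_merge(v.begin(), mid, v.end());
    assert(std::is_sorted(v.begin(), v.end()));
}
```

At the call site, std::sort(v.begin(), v.end()) becomes two_way_parallel_sort(v). A real library would pick the number of chunks from the available cores and fall back to the sequential algorithm for small inputs.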
Sadly, it looks like it didn't get continued by the GNU folks after they integrated it. It still exists, though.
Spot on. There either needs to be some kind of OS level control (Grand Central), or tweaks through the environment, like OpenMP - where you set in advance how many threads are to be used by the process.
I think Microsoft's PPL had something where it would've cooperated with the OS, but things did not work out as expected and it wasn't delivered. Or I could be completely wrong, some links here:
You're thinking of C++'s "executor", which is currently being discussed for addition to the standard. The parallel STL will use the executors once they're added to a technical specification. There's agreement on how this integration will work, and that it's the right thing to do, but executors don't have a fully agreed-upon API yet.
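As a rough illustration of the "tweak through the environment" approach, a library could read its thread budget the way OpenMP users set OMP_NUM_THREADS. The sketch below is an assumption-laden example, not what any OpenMP runtime actually does: the function names are hypothetical, and it only accepts a single positive integer (the real variable may also hold a comma-separated list for nested parallelism, which this treats as invalid).

```cpp
#include <cassert>
#include <cstdlib>
#include <thread>

// Hypothetical helper: parse a thread-count string such as the value of
// OMP_NUM_THREADS. Returns fallback if text is null or is not a single
// positive integer.
unsigned parse_thread_count(const char* text, unsigned fallback) {
    assert(fallback > 0);
    if (text == nullptr) return fallback;
    char* end = nullptr;
    long n = std::strtol(text, &end, 10);
    // Reject empty input, trailing characters, and non-positive values.
    if (end == text || *end != '\0' || n <= 0) return fallback;
    return static_cast<unsigned>(n);
}

// Hypothetical helper: thread budget for this process, taken from the
// environment if set, otherwise from the hardware.
unsigned configured_thread_count() {
    unsigned hw = std::thread::hardware_concurrency();  // may report 0
    if (hw == 0) hw = 1;
    return parse_thread_count(std::getenv("OMP_NUM_THREADS"), hw);
}
```

Running a program with OMP_NUM_THREADS=4 set would then make configured_thread_count() return 4, while leaving the variable unset falls back to the hardware's reported concurrency.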
We (the C++ standards committee) discussed these things further this week :-)
I was going to suggest GCD as well. IIRC its FreeBSD implementation is great and its Linux port is adequate. But yes, queues, producers, consumers, and OS level parallelism are the way to go!
It's just we see so many parallel collections, parallel streaming, parallel map, parallel for-loop efforts, and they always rely on the problem being embarrassingly parallel in the first place!
You are clearly familiar with Intel's Threading Building Blocks. Why don't you read up on the submitted implementation and answer your own question here for everyone's benefit?
[1] https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mod...
[2] http://algo2.iti.kit.edu/singler/mcstl/