I'm going to remain skeptical until they publish more details. The report[0] linked from their homepage seems to contain more/better information, but it's not much.
What seems pretty clear is that they use a proprietary ISA and have developed dynamic binary translators for ARM and x86 code. It's not clear if it's a VLIW architecture or if the ISA has any other properties required by the hardware.
From this report it sounds like they're essentially doing thread-level speculation in hardware.
Based on the linked article I would have been tempted to think they have reconfigurable pipelines, but based on the report I'm somewhat sure it was just a misunderstanding.
I generally think it's a bad sign when a company implements relatively well known concepts from research, avoids the standard terminology, and pitches its creation as something 100% new and original. In any case, there's no need to rush to judgment; I guess they'll publish better technical information in time.
Came here to say exactly the same thing. I get that you could create a fabric of hard cores which could be combined into a complete execution pipeline, and I love the idea of imagining something like an FPGA with integer execution units rather than complex logic blocks as building blocks. But it feels like the Lego Technic kit that builds a motorcycle: sure, you can build anything you want with the parts, but the motorcycle is the only thing that makes sense. In the same way I wonder whether a programmable fabric layered over a series of chip building blocks wouldn't resolve down to a 'best' (or 'least bad') solution and nothing else, at which point why not just add a metal mask and make the chip non-programmable?
The article didn't make very clear the differences between Soft Machines and some now quite old work by Michael J. Flynn on universal host machines, and some work by Kemal Ebcioglu on using very long instruction words (VLIW) on code for nearly any instruction set.
Last I heard, Ebcioglu's execution timing simulations were getting a 9:1 speedup on IBM's 370 code via 24-way VLIW.
My favorite old idea was to find and offer some programming language constructs that, in their implementation, could make good use of multiple threads without the programmer having to consider multiple threads.
I would have been blown away if the article made that connection. Heck, I'm blown away that you made the connection.
Just to expand a bit on your apropos points: Michael Flynn is the originator of the SISD/SIMD/MISD/MIMD computer architecture classifications. He also originated the notion of Directly Executed Languages (DELs):
"A Directly Executed Language (DEL) is the interface between the output of a higher level computer language translator and the input to the interpretive process of a computer. New developments in computer technology (especially fast read-write control storage) allow 'soft' computer architectures in which a range of flexible 'machine' or DEL's can be introduced. These DEL's are in many respects unlike conventional instruction sets, especially in format flexibility and specified operations."
(As an undergrad I was fascinated by the work, which was introduced to me by Bob Wedig and Gus Uht at CMU ECE, as well as their extensions for automated concurrency detection and the unit which kept track of which instructions had and had not executed: the Advanced Execution Matrix.)
Flynn also was IBM's guy on the 360/91 -- 8 double word instruction cache, 16-way interleaved memory, 60 ns cycle time. He was also a prof at Johns Hopkins but left before I got there; I'd been hoping to take a course from him.
In reality VLIW turned out worse (see Itanium as an example) than the currently favored superscalar, out-of-order execution with register renaming. Both techniques try to take advantage of the instruction-level parallelism inherent in pretty much all code.
The reason for the failure is that VLIW is stuck with whatever parallelism the compiler can find statically, while OoO finds it dynamically at runtime. In the end the only real drawback of OoO versus VLIW is the chip area and the increased power consumption required for the scheduling logic.
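To make the ILP point concrete, here is a small hypothetical C illustration (nothing from Soft Machines): the first computation is a serial dependency chain that neither a VLIW compiler nor an OoO core can overlap, while the second is a set of independent operations that can issue in the same cycle, scheduled statically by a VLIW compiler or dynamically by OoO hardware with register renaming.

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 2, 3, 4};

        /* Serial dependency chain: each step needs the previous result,
           so neither a VLIW compiler nor an OoO core can overlap them. */
        int chain = ((a[0] * 3 + 1) * 3 + 1) * 3 + 1;

        /* Independent operations: no data dependencies between them, so
           they can be issued in the same cycle -- statically by a VLIW
           compiler, or dynamically by OoO hardware with renaming. */
        int p0 = a[0] * 3, p1 = a[1] * 3, p2 = a[2] * 3, p3 = a[3] * 3;

        printf("%d %d %d %d %d\n", chain, p0, p1, p2, p3);
        return 0;
    }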
You are correct that VLIW, e.g., as in Itanium, flopped, and out-of-order execution, speculative execution, branch prediction, etc. won out.
The 9:1 speedup I reported was directly from Ebcioglu: for a while he was in our AI group at Watson. I don't recall just what he did, but he may have had an intermediate step that, for a given program, analyzed the stream of 370 instructions and rewrote them for the 24-way VLIW.
An idea I had him consider was very long addresses. Who the heck really wants the addresses in main memory to be 0, 1, 2, ...? Of course, no one! That's why we have lots of work, from old link-edit relocation to collection classes, memory management (garbage collection), etc.
So, what do we actually do? Sure, darned near everything comes out of level 1, 2, 3 or so cache, which works with just a hash of the main memory address. Since we are going to hash the main memory addresses anyway, let's just do that: have a very long address, say 1024 bytes long, taken immediately from the source code. That is, just concatenate, say, address space name, program name, function name, class name, instance name, member name, key name (as in key-value pairs). That's the address. Then hash it.
Ebcioglu checked how fast the execution logic could be, and it seemed okay.
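A minimal sketch of that scheme, just to make it concrete: build the "address" by concatenating symbolic names straight from the source program, then hash it down to a slot index, much as a cache hashes a memory address. The name components and the FNV-1a hash below are my own illustrative choices, not anything Ebcioglu or Soft Machines published.

    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 4096  /* pretend this is the number of cache sets */

    /* FNV-1a: a simple, well known string hash applied to each component. */
    static uint64_t hash_components(const char *parts[], size_t n) {
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = 0; i < n; i++) {
            for (const char *p = parts[i]; *p; p++) {
                h ^= (uint8_t)*p;
                h *= 1099511628211ULL;
            }
            h ^= '/';               /* separator between name components */
            h *= 1099511628211ULL;
        }
        return h;
    }

    int main(void) {
        /* The "very long address": address space, program, function, class,
           instance, member, key -- all taken straight from the source code. */
        const char *addr[] = {
            "my_address_space", "payroll", "compute_tax",
            "Employee", "emp_0042", "salary", "2014-10"
        };
        uint64_t h = hash_components(addr, sizeof addr / sizeof addr[0]);
        printf("slot = %llu\n", (unsigned long long)(h % SLOTS));
        return 0;
    }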
I know; I know; I left out a lot and need a better explanation! I'm concentrating on my software for my project and have gotten away from such low-level hardware issues; heck, I don't even remember if we hash the real or the virtual address! But the OP was a little light on a lot of history maybe relevant to the work of Soft Machines.
Those of you wanting to know more about this may be interested in Cliff Click's Crash Course in Modern Hardware.[1] It does a pretty good job of explaining how pipelined, superscalar, OoO CPUs came to be.
Well, if programs are structured to run simple instructions over lots of data, there are huge speedups to be had on current x86 processors: memory latency is no longer the bottleneck and SIMD instructions can be used much more often, especially once unnecessary heap allocations have been taken out.
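A small sketch of that style (the field names and sizes are just an example): keep the hot data in flat, contiguous arrays instead of per-element heap allocations, and express the work as a plain loop. Laid out like this, the loop streams through memory predictably, and compilers such as GCC or Clang can usually auto-vectorize it with SSE/AVX at -O3.

    #include <stddef.h>
    #include <stdio.h>

    #define N 100000

    /* Structure-of-arrays layout: each field is its own contiguous array. */
    static float pos_x[N], vel_x[N];

    /* Independent iterations over contiguous data: SIMD-friendly. */
    static void integrate(float dt) {
        for (size_t i = 0; i < N; i++)
            pos_x[i] += vel_x[i] * dt;
    }

    int main(void) {
        for (size_t i = 0; i < N; i++) { pos_x[i] = 0.0f; vel_x[i] = 1.0f; }
        integrate(0.016f);
        printf("%f\n", pos_x[0]);
        return 0;
    }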
"Industry skepticism is likely. The notion of abstracting software from chip hardware has been tried by companies such as Transmeta, a startup born in the mid-1990s that labored for years in secret on technology based on translating computing instructions in novel ways. The startup ultimately failed."
The idea of breaking a single thread into parallel parts automatically at runtime has been thrown around in academia for a while -- anyone interested should search for "dynamic multithreading" and "thread-level speculation". So industry is going to be (rightly) skeptical. But if they've managed to build something that actually works, and on real code (not just toy benchmarks), this is a huge accomplishment.
Modern CPUs do this on a very limited level by basically looking at the instructions and executing in parallel those which have no dependencies on each other.
The thing I'm wondering about most is how on earth they can get so many instructions per clock with a short pipeline. Without knowing the details of how they compiled the SPEC benchmark it's really hard to say. Who knows, maybe they cheated and ran parts of the benchmark in parallel on their CPU and not on the others, with "It's the natural way for this chip!" as an excuse.
What superscalar and out-of-order processors are exploiting is instruction level parallelism, while their technology seems to use thread level speculation.
> Without knowing the details of how they compiled the SPEC benchmark it's really hard to say. Who knows, maybe they cheated and ran parts of the benchmark in parallel on their CPU and not on the others
That's most likely the case, but I wouldn't consider it cheating. As long as from the software perspective only a single thread is running (and SPEC CPU 2000 and 2006 are single threaded), I think it's fair game. The whole point of their project is to expose parallelism without requiring the programmer / execution environment to explicitly support it.
If the software is compiled into a normal single-threaded program, then what else is there left to exploit except instruction-level parallelism?
And if you can compile it to work with two threads, then we already have Hyper-Threading to take advantage of that, even with a single core.
Their [Soft Machines'] latest patent is basically about an OoO method in overdrive; it exploits only instruction-level parallelism. And based on that, their claim that their pipeline would be short is not really valid.
Thread-level speculation is exploiting ILP. In some contexts it's compiler-assisted but in this context I would imagine it is fully microarchitectural (i.e., in hardware/translation firmware, running a single instruction stream of user code, invisible to the user). Dynamic multithreading (Akkary 1998?) did this by splitting the thread at predictable points like function calls/returns and (IIRC) backward branches. So it's still ILP within a single thread, but at a much longer distance than what an OoO scheduling window provides.
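As a rough software-level analogue of that spawn-at-a-call idea (real TLS hardware does this with register and memory versioning; nothing below is Akkary's or Soft Machines' actual mechanism, and all names are illustrative), the control flow looks roughly like this: predict, run the continuation speculatively on another thread, then validate and either commit or squash.

    #include <pthread.h>
    #include <stdio.h>

    static int predicted_ret = 0;   /* value prediction: guess the call returns 0 */
    static int speculative_result;  /* result buffered by the speculative thread  */

    /* The "continuation": the code after the call. Kept a pure function of the
       return value so the speculative result is trivial to buffer and validate. */
    static int continuation(int ret) { return ret * 2 + 7; }

    static void *speculative_thread(void *arg) {
        (void)arg;
        speculative_result = continuation(predicted_ret);  /* run ahead on the guess */
        return NULL;
    }

    static int slow_call(void) {    /* the work the non-speculative thread does */
        return 0;                   /* happens to match the prediction */
    }

    int main(void) {
        pthread_t spec;
        pthread_create(&spec, NULL, speculative_thread, NULL);  /* spawn at the call */

        int actual = slow_call();   /* non-speculative execution of the call */
        pthread_join(spec, NULL);

        int result;
        if (actual == predicted_ret)
            result = speculative_result;   /* prediction held: commit */
        else
            result = continuation(actual); /* misspeculation: squash and re-execute */

        printf("%d\n", result);
        return 0;
    }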
Not sure if the article did not go into enough detail, or if I am not technical enough in this field. But would this be able to bypass Python's and other interpreters' GIL? I.e., is "VISC's secret sauce" to hook into the raw machine-code instruction stream, determine whether an instruction changes anything in other cached/executing instructions, and if not, pipe it to another core? Thus when Python issues instructions, the chip could immediately determine whether they can be piped to multiple cores.
Can someone help my ignorance? I understand the appeal of automatic parallelization, but what is the advantage to creating your own chip? It seems to me that this is a translation that could be done in software either at runtime or at compile time.
Trying to launch a line of processors, even without any of the translation magic, seems like a very difficult venture all by itself.
I'd really love to know what those might be. Unfortunately Soft Machines don't give any details.
EDIT:
Jackpot! Google patent search to the rescue. Based on their patents from the last few years (the latest one was published in March 2014), one can get an understanding of what the fuss is all about. I'll have to read those through today.
Wow. This is like the holy grail of processor performance optimization. It's strange that thousands of PhDs in academia have not been able to solve this, yet 250 engineers can. If this is true, it will drastically alter the course of research in computing. I'd expect to see "low-hanging fruit" optimizations in ISCA, MICRO, ASPLOS, HPCA and the like in the coming years. Again, assuming this technology delivers.
Multithreading shouldn't be my job as a programmer. I love the idea. If this takes off and becomes standard, it's going to save us from dealing with multithreading issues.
[0] http://www.softmachines.com/wp-content/uploads/2014/10/MPR-1...