I'm going to remain skeptical until they publish more details. The report[0] linked from their homepage seems to contain more/better information, but it's not much.
What seems pretty clear is that they use a proprietary ISA and have developed dynamic binary translators for ARM and x86 code. It's not clear if it's a VLIW architecture or if the ISA has any other properties required by the hardware.
From this report it sounds like they're essentially doing thread-level speculation in hardware.
Based on the linked article I would have been tempted to think they have reconfigurable pipelines, but based on the report I'm somewhat sure it was just a misunderstanding.
I generally think it's a bad sign when a company implements relatively well known concepts from research, avoids the standard terminology, and pitches its creation as something 100% new and original. In any case, there's no need to rush to judgment; I guess they'll publish better technical information in time.
Came here to say exactly the same thing. I get that you could create a fabric of hard cores which could be combined into a complete execution pipeline, and I love the idea of imagining something like an FPGA with integer execution units rather than complex logic blocks as building blocks. But it feels like the Lego Technic kit that builds a motorcycle: sure, you can build anything you want with the parts, but the motorcycle is the only thing that makes sense. In the same way I wonder whether a programmable fabric layered over a series of chip building blocks wouldn't resolve down to a 'best' (or 'least bad') solution and nothing else, at which point why not just add a metal mask and make the chip non-programmable?
The article didn't make very clear the differences between Soft Machines and some now quite old work by Michael J. Flynn on universal host machines, and some work by Kemal Ebcioglu on using very long instruction words (VLIW) on code for nearly any instruction set.
Last I heard, Ebcioglu's execution timing simulations were getting a 9:1 speedup on IBM's 370 code via 24-way VLIW.
My favorite old idea was to find and offer some programming language constructs that, in their implementation, could make good use of multiple threads without the programmer having to consider multiple threads.
I would have been blown away if the article made that connection. Heck, I'm blown away that you made the connection.
Just to expand a bit on your apropos points: Michael Flynn is the originator of the SISD/SIMD/MISD/MIMD computer architecture classifications. He also originated the notion of Directly Executed Languages (DELs):
"A Directly Executed Language (DEL) is the interface between the output of a higher level computer language translator and the input to the interpretive process of a computer. New developments in computer technology (especially fast read-write control storage) allow 'soft' computer architectures in which a range of flexible 'machine' or DEL's can be introduced. These DEL's are in many respects unlike conventional instruction sets, especially in format flexibility and specified operations."
(As an undergrad I was fascinated by the work, which was introduced to me by Bob Wedig and Gus Uht at CMU ECE, as well as their extensions for automated concurrency detection and the unit which kept track of which instructions had and had not executed: the Advanced Execution Matrix.)
Flynn also was IBM's guy on the 360/91 -- 8 double word instruction cache, 16-way interleaved memory, 60 ns cycle time. He was also a prof at Johns Hopkins but left before I got there; I'd been hoping to take a course from him.
In reality VLIW turned out worse (see Itanium as an example) than the currently favored superscalar, out-of-order execution with register renaming. Both techniques try to take advantage of the instruction-level parallelism inherent in pretty much all code.
The reason for the failure is that VLIW is stuck with whatever parallelism the compiler can find statically, while OoO finds it dynamically at runtime. In the end the only real drawback of OoO versus VLIW is the chip area and the increased power consumption required for the scheduling logic.
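To make the ILP point concrete, here is a small hypothetical C illustration (nothing from Soft Machines): the first computation is a serial dependency chain that neither a VLIW compiler nor an OoO core can overlap, while the second is a set of independent operations that can issue in the same cycle, scheduled statically by a VLIW compiler or dynamically by OoO hardware with register renaming.

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 2, 3, 4};

        /* Serial dependency chain: each step needs the previous result,
           so neither a VLIW compiler nor an OoO core can overlap them. */
        int chain = ((a[0] * 3 + 1) * 3 + 1) * 3 + 1;

        /* Independent operations: no data dependencies between them, so
           they can be issued in the same cycle -- statically by a VLIW
           compiler, or dynamically by OoO hardware with renaming. */
        int p0 = a[0] * 3, p1 = a[1] * 3, p2 = a[2] * 3, p3 = a[3] * 3;

        printf("%d %d %d %d %d\n", chain, p0, p1, p2, p3);
        return 0;
    }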
You are correct that VLIW, e.g., as in Itanium, flopped, and out-of-order execution, speculative execution, branch prediction, etc. won out.
The 9:1 speedup I reported was directly from Ebcioglu: for a while he was in our AI group at Watson. I don't recall just what he did, but he may have had an intermediate step that, for a given program, analyzed the stream of 370 instructions and rewrote them for the 24-way VLIW.
An idea I had him consider was very long addresses. Who the heck really wants the addresses in main memory to be 0, 1, 2, ...? Of course, no one! That's why we have lots of work, from old link-edit relocation to collection classes, memory management (garbage collection), etc.
So, what do we actually do? Sure, darned near everything comes out of level 1, 2, 3 or so cache, which works with just a hash of the main memory address. Since we are going to hash the main memory addresses anyway, let's just do that: have a very long address, say 1024 bytes long, taken immediately from the source code. That is, just concatenate, say, address space name, program name, function name, class name, instance name, member name, key name (as in key-value pairs). That's the address. Then hash it.
Ebcioglu checked how fast the execution logic could be, and it seemed okay.
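A minimal sketch of that scheme, just to make it concrete: build the "address" by concatenating symbolic names straight from the source program, then hash it down to a slot index, much as a cache hashes a memory address. The name components and the FNV-1a hash below are my own illustrative choices, not anything Ebcioglu or Soft Machines published.

    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 4096  /* pretend this is the number of cache sets */

    /* FNV-1a: a simple, well known string hash applied to each component. */
    static uint64_t hash_components(const char *parts[], size_t n) {
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = 0; i < n; i++) {
            for (const char *p = parts[i]; *p; p++) {
                h ^= (uint8_t)*p;
                h *= 1099511628211ULL;
            }
            h ^= '/';               /* separator between name components */
            h *= 1099511628211ULL;
        }
        return h;
    }

    int main(void) {
        /* The "very long address": address space, program, function, class,
           instance, member, key -- all taken straight from the source code. */
        const char *addr[] = {
            "my_address_space", "payroll", "compute_tax",
            "Employee", "emp_0042", "salary", "2014-10"
        };
        uint64_t h = hash_components(addr, sizeof addr / sizeof addr[0]);
        printf("slot = %llu\n", (unsigned long long)(h % SLOTS));
        return 0;
    }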
I know; I know; I left out a lot and need a better explanation! I'm concentrating on my software for my project and have gotten away from such low-level hardware issues; heck, I don't even remember if we hash the real or the virtual address! But the OP was a little light on a lot of history maybe relevant to the work of Soft Machines.
Those of you wanting to know more about this may be interested in Cliff Click's Crash Course in Modern Hardware.[1] It does a pretty good job of explaining how pipelined, superscalar, OoO CPUs came to be.
Well, if programs are structured to run simple instructions over lots of data, there are huge speedups to be had on current x86 processors: memory latency is no longer the bottleneck and SIMD instructions can be used much more often, especially once unnecessary heap allocations have been taken out.
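A small sketch of that style (the field names and sizes are just an example): keep the hot data in flat, contiguous arrays instead of per-element heap allocations, and express the work as a plain loop. Laid out like this, the loop streams through memory predictably, and compilers such as GCC or Clang can usually auto-vectorize it with SSE/AVX at -O3.

    #include <stddef.h>
    #include <stdio.h>

    #define N 100000

    /* Structure-of-arrays layout: each field is its own contiguous array. */
    static float pos_x[N], vel_x[N];

    /* Independent iterations over contiguous data: SIMD-friendly. */
    static void integrate(float dt) {
        for (size_t i = 0; i < N; i++)
            pos_x[i] += vel_x[i] * dt;
    }

    int main(void) {
        for (size_t i = 0; i < N; i++) { pos_x[i] = 0.0f; vel_x[i] = 1.0f; }
        integrate(0.016f);
        printf("%f\n", pos_x[0]);
        return 0;
    }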
"Industry skepticism is likely. The notion of abstracting software from chip hardware has been tried by companies such as Transmeta, a startup born in the mid-1990s that labored for years in secret on technology based on translating computing instructions in novel ways. The startup ultimately failed."
The idea of breaking a single thread into parallel parts automatically at runtime has been thrown around in academia for a while -- anyone interested should search for "dynamic multithreading" and "thread-level speculation". So industry is going to be (rightly) skeptical. But if they've managed to build something that actually works, and on real code (not just toy benchmarks), this is a huge accomplishment.
Modern CPUs do this on a very limited level by basically looking at the instructions and executing in parallel those which have no dependencies on each other.
The thing I'm wondering about most is how on earth they can get so many instructions per clock with a short pipeline. Without knowing the details of how they compiled the SPEC benchmark it's really hard to say. Who knows, maybe they cheated and ran parts of the benchmark in parallel on their CPU and not on the others, with "It's the natural way for this chip!" as an excuse.
What superscalar and out-of-order processors are exploiting is instruction level parallelism, while their technology seems to use thread level speculation.
> Without knowing the details of how they compiled the SPEC benchmark it's really hard to say. Who knows, maybe they cheated and ran parts of the benchmark in parallel on their CPU and not on the others
That's most likely the case, but I wouldn't consider it cheating. As long as from the software perspective only a single thread is running (and SPEC CPU 2000 and 2006 are single threaded), I think it's fair game. The whole point of their project is to expose parallelism without requiring the programmer / execution environment to explicitly support it.
If the software is compiled into a normal single-threaded program, then what else is there left to exploit except instruction-level parallelism?
And if you can compile it to work with two threads, then we already have Hyper-Threading to take advantage of that, even with a single core.
Their [Soft Machines'] latest patent is basically about an OoO method in overdrive; it exploits only instruction-level parallelism. And based on that, their claim that their pipeline would be short is not really valid.
Thread-level speculation is exploiting ILP. In some contexts it's compiler-assisted but in this context I would imagine it is fully microarchitectural (i.e., in hardware/translation firmware, running a single instruction stream of user code, invisible to the user). Dynamic multithreading (Akkary 1998?) did this by splitting the thread at predictable points like function calls/returns and (IIRC) backward branches. So it's still ILP within a single thread, but at a much longer distance than what an OoO scheduling window provides.
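As a rough software-level analogue of that spawn-at-a-call idea (real TLS hardware does this with register and memory versioning; nothing below is Akkary's or Soft Machines' actual mechanism, and all names are illustrative), the control flow looks roughly like this: predict, run the continuation speculatively on another thread, then validate and either commit or squash.

    #include <pthread.h>
    #include <stdio.h>

    static int predicted_ret = 0;   /* value prediction: guess the call returns 0 */
    static int speculative_result;  /* result buffered by the speculative thread  */

    /* The "continuation": the code after the call. Kept a pure function of the
       return value so the speculative result is trivial to buffer and validate. */
    static int continuation(int ret) { return ret * 2 + 7; }

    static void *speculative_thread(void *arg) {
        (void)arg;
        speculative_result = continuation(predicted_ret);  /* run ahead on the guess */
        return NULL;
    }

    static int slow_call(void) {    /* the work the non-speculative thread does */
        return 0;                   /* happens to match the prediction */
    }

    int main(void) {
        pthread_t spec;
        pthread_create(&spec, NULL, speculative_thread, NULL);  /* spawn at the call */

        int actual = slow_call();   /* non-speculative execution of the call */
        pthread_join(spec, NULL);

        int result;
        if (actual == predicted_ret)
            result = speculative_result;   /* prediction held: commit */
        else
            result = continuation(actual); /* misspeculation: squash and re-execute */

        printf("%d\n", result);
        return 0;
    }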
Not sure if the article did not go into enough detail, or if I am not technical enough in this field. But would this be able to bypass Python's and other interpreters' GIL? I.e., is "VISC's secret sauce" to hook into the raw machine-code instruction stream, determine whether an instruction changes anything in other cached/executing instructions, and if not, pipe it to another core? Thus when Python issues instructions, the chip could immediately determine whether they can be piped to multiple cores.
Can someone help my ignorance? I understand the appeal of automatic parallelization, but what is the advantage to creating your own chip? It seems to me that this is a translation that could be done in software either at runtime or at compile time.
Trying to launch a line of processors, even without any of the translation magic, seems like a very difficult venture all by itself.
I'd really love to know what those might be. Unfortunately Soft Machines don't give any details.
EDIT:
Jackpot! Google patent search to the rescue. Based on their patents from the last few years (the latest one was published in March 2014), one can get an understanding of what the fuss is all about. I'll have to read those through today.
Wow. This is like the holy grail of processor performance optimization. It's strange that thousands of PhDs in academia have not been able to solve this, yet 250 engineers can. If this is true, it will drastically alter the course of research in computing. I'd expect to see "low-hanging fruit" optimizations in ISCA, MICRO, ASPLOS, HPCA and the like in the coming years. Again, assuming this technology delivers.
Multithreading shouldn't be my job as a programmer. I love the idea. If this takes off and becomes standard, it's going to save us from dealing with multithreading issues.
[0] http://www.softmachines.com/wp-content/uploads/2014/10/MPR-1...