
I have yet to go through this talk but I have been waiting for this for a while. Mill is possibly the most interesting thing in all of computing right now.



I really feel like a brainless fanboy too, I'm watching everything about this company. Not being an expert in CPU design, I know I lack the knowledge to criticize what I see, I'm just slurping marketing material straight from the bullshit hose, but I can't help myself, I love it.


One of the things about Mill that feels reassuring is that the information is being presented slowly, step by step to people who have good reason to give a shit.

If it was purely marketing, they'd have a kickstarter up and be promising performance figures without a billion caveats.


Hate to say it, but there's nothing new under the sun when it comes to marketing. Everyone tries everything and modifies based on how well it works. Drip-feeding information slowly can be just as much a tactic as going for the quick money.

At the end of the day, the problem with the Mill is that there's no implemented hardware. There are no real-world benchmarks to be had, and the world of hardware design is full of seemingly slam-dunk ideas which have giant problems in actual implementation.

This isn't to disparage the Mill or its team, but the content and results are what matter, not the style of the message. Anyone can style a message.


I see this posted every time Mill gets linked. Could someone simply explain the significance of the Mill architecture?


(Mill team, so biased; thanks for the chance to pitch ;)

We are a DSP that can run general purpose code.

Traditionally, to run general purpose code fast you needed an out-of-order superscalar architecture, as all the x86 and RISC cores are these days.

DSPs have substantially better performance and substantially better efficiency, but have traditionally been ineffective at executing general purpose code (such as the web browser you are using to read this).

The Mill is a synergy of lots of small breakthroughs that together deliver significant improvements to general purpose single threaded code.

It's been held that cores have stopped getting faster. We're faster.

And we have similar, as-yet-unfiled improvements for multicore too.


Cheers for the pitch. He wasn't kidding about it being fascinating. Could probably lose that synergy, though. Can't wait for some more real competition in the architecture space.

Are you being funded by any major silicon giants, or do you have any backing at all?

How many years do you think it will be before you reach some sort of manufacturing, or are you still in the "when it's done" phase?


No funding by giants; not a public company. The SEC rules prevent us from talking further (we're too busy to go to jail for breaking the securities regs), but if you are interested in the business side of the company then you can sign up at MillComputing.com/investor-list; it's a low-traffic mailing list where we announce opportunities.

In heavy semiconductor you don't really move out of "when it's done" until the FPGA proof-of-principle is working. That's over a year plus "when it's done" :-)


Normally I am a cynical old bastard, but if that is true then you really are the most important thing happening in hardware right now (or possibly HP's The Machine, if that isn't vaporware).

One crucial question: are you compatible with x86?


The Mill is not compatible with x86, but the goal is to not require more than a recompile.


The Mill certainly raises a lot of interesting code generation and optimization issues. I'm sure there's plenty of scope for figuring out good optimization strategies, as a lot seems to depend on the ability of the compiler to make good choices about instruction scheduling and belt slot allocation. Sure, that's the case for traditional architectures as well, but there's more prior art there too. There may also be ingenious algorithms which work better on the Mill architecture specifically. I'd love to know if there's any theory on the hardness of allocating positions on the belt, compared to traditional register allocation.


Actually, as it's co-designed by a compiler writer (read the bio we paste with the talks:

> Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.

), it's actually designed to be easy to write a compiler for. It's the polar opposite of the "sufficiently smart compiler syndrome" :)

I keep suggesting we do a "sufficiently dumb compiler syndrome" talk, but it'd contain nothing novel; the art is well established by all the VLIW machines that have come before.


It's actually rather easy to allocate belt positions, because the belt is a FIFO and allocates them itself :-)

However, the scheduler must track lifetimes and make sure that nothing still live falls off the end. This is also (not quite so) easy to do, because the scheduler knows exactly what is the belt behavior of each operation it schedules (the Mill is exposed-pipeline), so it is sufficient to symbolically execute a candidate schedule to know if it is feasible.

If not, the scheduler inserts a spill/fill pair at appropriate places and reschedules. This is guaranteed to terminate; in practice the schedule is usually immediately feasible, and rescheduling rarely takes more than one iteration.

The operation scheduling itself is the standard time-reversed tableau scheduler used in VLIWs, probably 40 years old at this point.
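
For the curious, here's a minimal sketch of that feasibility check in Python (illustrative names only, not the Mill toolchain's; it assumes every operation drops exactly one result):

    BELT_LEN = 8  # assumed belt length; Mill family members vary

    def first_infeasible(schedule, last_use):
        # schedule: result names in issue order, one drop per operation
        # last_use: result name -> index of the last operation reading it
        # Returns (value, step) for the first live value pushed off the
        # belt, or None if the candidate schedule is feasible as-is.
        positions = {}                        # value -> belt position (0 = newest)
        for step, result in enumerate(schedule):
            for v in positions:
                positions[v] += 1             # each drop ages the older values
                if positions[v] >= BELT_LEN and last_use.get(v, -1) > step:
                    return v, step            # spill here, fill before the next use
            positions[result] = 0             # the new result lands at the front
        return None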


That would be awesome, if accomplished. It will not be easy, though. My experience with TI DSPs and the POWER6 (the most recent major in-order processor) taught me that we are currently a long way from that with existing compilers. Even x86-64<->POWER required some platform-specific code for performance-sensitive blocks.


I wonder why, if the IP is solid and promising, the company is not outright bought out by ARM or Intel.


"Promising" != "validated"; also, Not Invented Here. Intel and ARM are their instruction sets, to a large extent.


Intel will probably just wait until they're profitable, and then threaten a patent infringement case.


Lately I've been looking into having CPUs run untrusted machine code (e.g. Google's Native Client). Hope you don't mind a completely off-topic question: is it anywhere on your team's radar to have provisions for that? So far it's been essentially chance whether the architecture has useful features to make this happen (e.g. x86 segment register abuse).


It's very much on our radar. The team have a lot of experience with capability systems, and although the Mill is not a capability-based architecture, it is much finer-grained than mainstream architectures such as x86. There is a security talk:

http://millcomputing.com/topic/security/


Awesome, great to hear and thank you for the link!


Can you say anything about where you are on the path towards silicon implementation or is it all still under wraps?


We are working towards the FPGA. It's very early days, but the HW team is very experienced.


Wow. Finally.

I've seen the Mill pitched occasionally and have always asked the same question: are your results from simulation or from an FPGA? Now I know the answer.

The most suspicious thing in the Mill is that belt thing. To produce operands for N operations you need N*3 (2 for reads, one for write) ports of RAM. For even two operations that means 6 ports. No FPGA allows that out of the box. Given that, you have to implement it in registers and logic, wasting FPGA resources.

(AFAIK, silicon fabs also do not have such RAM blocks; you have to build them yourself, either from registers and logic (which makes them slow) or from transistors (which makes the development process slow). This is THE source of the relative slowness of the Itanium and of the Elbrus from Russia.)

If you want advice, go for Tabula. You'll need many R/W ports per block of RAM, and they seem to have those (12-port RAM blocks). Maybe then your design won't be as slow as I think it will be.


If you watch the Belt talk on our site, and know how a modern OOO machine works on the inside, then you will recognize that the Belt is a forwarding network, sometimes also called a bypass. There is no RAM, "S" or otherwise, no general registers, and no ports.

Bypasses are nothing new; what is novel is how we are able to handle three times as many data paths as other machines, the fact that the bypass is exposed to the program rather than being hidden behind the register metaphor, and that the program model is a single-assignment FIFO. See those talks for more.

There is no better way to speed up registers and SRAM than by having none at all.
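
For those who haven't watched the Belt talk yet, here is a toy model of the programmer-visible behavior (a sketch only; the real belt is a forwarding network, not a storage structure, and can drop several results per cycle):

    from collections import deque

    class Belt:
        # Toy single-assignment FIFO: results drop in at the front and
        # operands are named by position; the oldest value falls off.
        def __init__(self, length=8):
            self.slots = deque(maxlen=length)   # slots[0] is the newest value

        def drop(self, value):
            self.slots.appendleft(value)        # values are never overwritten

        def b(self, pos):
            return self.slots[pos]              # positions, not registers

    belt = Belt()
    belt.drop(2)                      # b0 = 2
    belt.drop(3)                      # b0 = 3, b1 = 2
    belt.drop(belt.b(0) + belt.b(1))  # "add b0, b1": the sum becomes the new b0
    assert belt.b(0) == 5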


Okay. Your model looks like a TTA, a machine that is built on bypasses.

They are not new and they can be used to create very efficient chips (in terms of operations/watt) for some fixed functions (specifically, FFTs of size 2^N).

But they are 1) not fast in terms of raw performance on general purpose tasks, 2) not fast in terms of operating frequency, and most important 3) prone to stall when presented with non-deterministic delays, like access to RAM.

You can add whatever functional units you like to a TTA design, including content-addressable memory disguised as a FIFO. A TTA design with such a device will be identical to what you've described above.

I don't think you will improve performance very much with this trick.

PS

"not fast for general purpose tasks" - in some benchmarks TTA architectures executed gcc 10+ times slower than general purpose CPU with same frequency.

"not fast in operating frequency" - TTA requires crossbar, which is slow in 2D. You cannot make it fast.

"prone to stall" - you have to stop complete pipeline for a cache miss, otherwise you'll have divergence in execution.


Thanks for commenting Ivan. I'm excited for the future and hope to see more material produced by you. The original belt video absolutely blew my mind and I feel that Mill has the opportunity to really revolutionise the entire industry.


> you need N*3 (2 for reads, one for write) ports of RAM

What? This isn't the 1980s, that's not how you register file. Realistically a "register" these days is an abstract concept, a label attached to a value somewhere in the pipeline. It's the job of instruction decode to keep track of exactly where to load it from.


Yes, renaming exists in modern cores, but where do you think the physical values are stored? Answer: there's still a "physical register file". The renamer maps logical register slots to PRF entries. This is well-covered in academic comparch literature from the 90s onward (and is how it's done in real modern OoO cores too, see e.g. articles on Intel Sandy Bridge).

(disclaimer: I'm rebutting the general point about microarchitectures and register files/SRAMs, but haven't studied the Mill in any depth...)
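
To make that concrete, a toy model of renaming (illustrative only, not any particular core's design):

    class Renamer:
        # Architectural registers are labels; each write allocates a fresh
        # physical register file (PRF) entry, and reads follow the newest
        # mapping via the register alias table (RAT).
        def __init__(self, prf_size=64):
            self.free_list = list(range(prf_size))  # unallocated PRF entries
            self.rat = {}                           # arch reg -> phys reg

        def write(self, arch_reg):
            phys = self.free_list.pop()   # a real core also reclaims the old
            self.rat[arch_reg] = phys     # entry once no reader needs it
            return phys

        def read(self, arch_reg):
            return self.rat[arch_reg]

    r = Renamer()
    r.write("r1")               # the first write of r1 claims one PRF entry
    p2 = r.write("r1")          # a second write gets a fresh one: no WAW hazard
    assert r.read("r1") == p2   # readers are steered to the newest mapping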


If I understand correctly from what they've said so far, there is no RAM/register file with N*3 ports, it just reuses the outputs of the functional units.

That's also nowhere near the source of slowness in the Itanium.


Okay.

The reuse of output from functional units was done in TTA CPUs (Transport Triggered Architectures). Guess how they fared, given that you've probably never heard of them.

Guess also how fast or slow they are compared to regular OOO CPUs.

You will be right if you guess that they are not that good in terms of raw performance and they are not that fast in terms of operating frequency.

They are not fast in either way precisely because they use a crossbar as the operand delivery network. They can also have FIFOs as the switching network, or as another functional unit dedicated to spreading information, but most often that is not used.


You don't put a RAM block in the CPU pipeline. RAM is much slower, thus you keep it several subsystems away.


"RAM" here means SRAM (static RAM). That just means "array of storage elements (latches) connected by bitlines and wordlines", which is way more efficient than "random latches we scattered throughout the chip". SRAMs are used extensively for indexed storage such as physical register files, queues, predictor arrays, etc in modern microarchitectures.


Anyway, unless you have a very small array, even the addressing is enough to make it slow. And arrays were already big at the time I programmed FPGAs; I can only imagine they are much larger now.


Very impressive sounding work. If the Mill came out today, it would be the first time I considered buying a computer component just because I was interested in it.

If you can answer this, what is the business plan for the Mill? Who do you expect to buy it when it comes out? Has any company expressed interest in using it?


Ivan talks business models in this hackaday interview:

http://hackaday.com/2013/11/18/interview-new-mill-cpu-archit...

Hope this helps!


> And we have similar, as-yet-unfiled improvements for multicore too.

Any chance you have a solution to the cache dilemma? Unified caches are brilliant for simplifying implementations but can often lead to stalling.


Mill multicore has fully sequentially consistent cache coherency; there are no barrier operations. Sorry, how it's done is still NYF (Not Yet Filed). We expect a talk on the subject this fall.


> And we have similar, as-yet-unfiled improvements for multicore too.

Goodbye mutexes?


Mill uses optimistic concurrency, similar to the IBM and Intel versions. From that you can build mutexes if you are willing to put up with the drawbacks of locking.
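
How the Mill does it is NYF, but for readers new to the idea, here is a rough software analogue of optimistic concurrency: a seqlock-style version counter where readers retry instead of blocking. This is a sketch with memory ordering hand-waved, not the Mill's mechanism:

    import threading

    class OptimisticCell:
        def __init__(self, value=0):
            self._version = 0                # even: stable; odd: write in progress
            self._value = value
            self._writers = threading.Lock() # stands in for an atomic commit

        def read(self):
            while True:                      # optimistic: take no lock, retry on conflict
                v0 = self._version
                value = self._value
                if v0 % 2 == 0 and self._version == v0:
                    return value             # no writer overlapped the read

        def update(self, fn):
            with self._writers:
                self._version += 1           # announce a write in progress
                self._value = fn(self._value)
                self._version += 1           # commit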


That's very interesting! What are your thoughts on RISC-V?


RISC-V looks nice.

It avoids raising exceptions wherever possible, which I like EXTREMELY. This saves space, allows for faster hardware, and makes the life of the systems/compiler programmer easier.

It should be praised for that alone.
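
One concrete example of that choice: integer division by zero does not trap in RISC-V; the spec just defines the results. A rough model of the semantics:

    # RISC-V DIV/REM never raise on divide-by-zero, per the base ISA spec.
    def riscv_div(a, b):
        if b == 0:
            return -1                        # all bits set; no trap
        q = abs(a) // abs(b)                 # truncate toward zero (unlike Python's //)
        return q if (a >= 0) == (b >= 0) else -q

    def riscv_rem(a, b):
        if b == 0:
            return a                         # remainder of x/0 is x; no trap
        return a - b * riscv_div(a, b)

    assert riscv_div(7, 0) == -1 and riscv_rem(7, 0) == 7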


We've got Mill team members in this thread, but anyway here's the short, short version:

It's about using as much of the die area of the CPU chip as possible for actual computation, rather than for supporting an instruction set with an outdated design.

The programmer's model of computation today has little correspondence to the realities of current semiconductor process technology in terms of what's fast, and what's easy to implement in hardware. The Mill is a bottom-up redesign that takes into account many of the design constraints with current technology, and attempts to design a good architecture that can maximize actual computational throughput.



