
I have yet to go through this talk but I have been waiting for this for a while. Mill is possibly the most interesting thing in all of computing right now.



I really feel like a brainless fanboy too, I'm watching everything about this company. Not being an expert in CPU design, I know I lack the knowledge to criticize what I see, I'm just slurping marketing material straight from the bullshit hose, but I can't help myself, I love it.


One of the things about Mill that feels reassuring is that the information is being presented slowly, step by step to people who have good reason to give a shit.

If it was purely marketing, they'd have a kickstarter up and be promising performance figures without a billion caveats.


Hate to say it, but there's nothing new under the sun when it comes to marketing. Everyone tries everything and modifies based on how well it works. Drip-feeding information slowly can be just as much a tactic as going for the quick money.

At the end of the day, the problem with the Mill is that there's no implemented hardware. There are no real-world benchmarks to be had, and the world of hardware design is full of seemingly slam-dunk ideas which have giant problems in actual implementation.

This isn't to disparage the Mill or its team, but the content and results are what matter, not the style of the message. Anyone can style a message.


I see this posted every time Mill gets linked. Could someone simply explain the significance of the Mill architecture?


(Mill team, so biased; thanks for the chance to pitch ;)

We are a DSP that can run general purpose code.

Traditionally, to run general purpose code fast you needed an out-of-order superscalar architecture, as all the x86 and RISC cores are these days.

DSPs have substantially better performance and substantially better efficiency, but have traditionally been ineffective at executing general purpose code (such as the web browser you are using to read this).

The Mill is a synergy of lots of small breakthroughs that together deliver significant improvements to general purpose single threaded code.

It's been held that cores have stopped getting faster. We're faster.

And we have similar, as-yet-unfiled improvements for multicore too.


Cheers for the pitch. He wasn't kidding about it being fascinating. Could probably lose that synergy, though. Can't wait for some more real competition in the architecture space.

Are you being funded by any major silicon giants, or do you have any backing at all?

How many years do you think it will be before you reach some sort of manufacturing, or are you still in the "when it's done" phase?


No funding by giants; not a public company. The SEC rules prevent us from talking further (we're too busy to go to jail for breaking the securities regs), but if you are interested in the business side of the company then you can sign up at MillComputing.com/investor-list; it's a low-traffic mailing list where we announce opportunities.

In heavy semiconductor you don't really move out of "when it's done" until the FPGA proof-of-principle is working. That's over a year plus "when it's done" :-)


Normally I am a cynical old bastard, but if that is true then you really are the most important thing happening in hardware right now (or possibly HP's The Machine, if that isn't vaporware).

One crucial question: are you compatible with x86?


The Mill is not compatible with x86, but the goal is to not require more than a recompile.


The Mill certainly raises a lot of interesting code generation and optimization issues. I'm sure there's plenty of scope for figuring out good optimization strategies, as a lot seems to depend on the ability of the compiler to make good choices about instruction scheduling and belt slot allocation. Sure, that's the case for traditional architectures as well, but there's more prior art there too. There may also be ingenious algorithms which work better on the Mill architecture specifically. I'd love to know if there's any theory on the hardness of allocating positions on the belt, compared to traditional register allocation.


Actually, as it's co-designed by a compiler writer (read the bio we paste with the talks:

> Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.

), it's actually designed to be easy to write a compiler for. It's the polar opposite of the "sufficiently smart compiler syndrome" :)

I keep suggesting we do a "sufficiently dumb compiler syndrome" talk, but it'd contain nothing novel; the art is well established by all the VLIW machines that have come before.


It's actually rather easy to allocate belt positions, because the belt is a FIFO and allocates them itself :-)

However, the scheduler must track lifetimes and make sure that nothing still live falls off the end. This is also (not quite so) easy to do, because the scheduler knows exactly what is the belt behavior of each operation it schedules (the Mill is exposed-pipeline), so it is sufficient to symbolically execute a candidate schedule to know if it is feasible.

If not, the scheduler inserts a spill/fill pair at appropriate places and reschedules. This is guaranteed to terminate; in practice the schedule is usually immediately feasible, and rescheduling rarely takes more than one iteration.

The operation scheduling itself is the standard time-reversed tableau scheduler used in VLIWs, probably 40 years old at this point.
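
For the curious, here's a minimal sketch of that feasibility check in Python (illustrative names only, not the Mill toolchain's; it assumes every operation drops exactly one result):

    BELT_LEN = 8  # assumed belt length; Mill family members vary

    def first_infeasible(schedule, last_use):
        # schedule: result names in issue order, one drop per operation
        # last_use: result name -> index of the last operation reading it
        # Returns (value, step) for the first live value pushed off the
        # belt, or None if the candidate schedule is feasible as-is.
        positions = {}                        # value -> belt position (0 = newest)
        for step, result in enumerate(schedule):
            for v in positions:
                positions[v] += 1             # each drop ages the older values
                if positions[v] >= BELT_LEN and last_use.get(v, -1) > step:
                    return v, step            # spill here, fill before the next use
            positions[result] = 0             # the new result lands at the front
        return None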


That would be awesome, if accomplished. It will not be easy, though. My experience with TI DSPs and the POWER6 (the most recent major in-order processor) taught me that we are currently a long way from that with existing compilers. Even x86-64<->POWER required some platform-specific code for performance-sensitive blocks.


I wonder why, if the IP is solid and promising, the company is not outright bought out by ARM or Intel.


"Promising" != "validated"; also, Not Invented Here. Intel and ARM are their instruction sets, to a large extent.


Intel will probably just wait until they're profitable, and then threaten a patent infringement case.


Lately I've been looking into having CPUs run untrusted machine code (e.g. Google's Native Client). Hope you don't mind a completely off-topic question: is it anywhere on your team's radar to have provisions for that? So far it's been essentially chance whether the architecture has useful features to make this happen (e.g. x86 segment register abuse).


It's very much on our radar. The team have a lot of experience with capability systems, and although the Mill is not a capability-based architecture, it is much finer-grained than mainstream architectures such as x86. There is a security talk:

http://millcomputing.com/topic/security/


Awesome, great to hear and thank you for the link!


Can you say anything about where you are on the path towards silicon implementation or is it all still under wraps?


We are working towards the FPGA. It's very early days, but the HW team is very experienced.


Wow. Finally.

I've seen the Mill pitched occasionally and have always asked the same question: are your results from simulation or from an FPGA? Now I know the answer.

The most suspicious thing in the Mill is that belt thing. To produce operands for N operations you need N*3 (2 for reads, one for write) ports of RAM. For even two operations that means 6 ports. No FPGA allows that out of the box. Given that, you have to implement it in registers and logic, wasting FPGA resources.

(AFAIK, silicon fabs also do not have such RAM blocks; you have to build them yourself, either from registers and logic (which makes them slow) or from transistors (which makes the development process slow). This is THE source of the relative slowness of the Itanium and of the Elbrus from Russia.)

If you want advice, go for Tabula. You'll need many R/W ports per block of RAM, and they seem to have those (12-port RAM blocks). Maybe then your design won't be as slow as I think it will be.


If you watch the Belt talk on our site, and know how a modern OOO machine works on the inside, then you will recognize that the Belt is a forwarding network, sometimes also called a bypass. There is no RAM, "S" or otherwise, no general registers, and no ports.

Bypasses are nothing new; what is novel is how we are able to handle three times as many data paths as other machines, the fact that the bypass is exposed to the program rather than being hidden behind the register metaphor, and that the program model is a single-assignment FIFO. See those talks for more.

There is no better way to speed up registers and SRAM than by having none at all.
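
For those who haven't watched the Belt talk yet, here is a toy model of the programmer-visible behavior (a sketch only; the real belt is a forwarding network, not a storage structure, and can drop several results per cycle):

    from collections import deque

    class Belt:
        # Toy single-assignment FIFO: results drop in at the front and
        # operands are named by position; the oldest value falls off.
        def __init__(self, length=8):
            self.slots = deque(maxlen=length)   # slots[0] is the newest value

        def drop(self, value):
            self.slots.appendleft(value)        # values are never overwritten

        def b(self, pos):
            return self.slots[pos]              # positions, not registers

    belt = Belt()
    belt.drop(2)                      # b0 = 2
    belt.drop(3)                      # b0 = 3, b1 = 2
    belt.drop(belt.b(0) + belt.b(1))  # "add b0, b1": the sum becomes the new b0
    assert belt.b(0) == 5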


Okay. Your model looks like a TTA, a machine that is built on bypasses.

They are not new and they can be used to create very efficient chips (in terms of operations/watt) for some fixed functions (specifically, FFTs of size 2^N).

But they are 1) not fast in terms of raw performance on general purpose tasks, 2) not fast in terms of operating frequency, and most important 3) prone to stall when presented with non-deterministic delays, like access to RAM.

You can add whatever functional units you like to a TTA design, including content-addressable memory disguised as a FIFO. A TTA design with such a device will be identical to what you've described above.

I don't think you will improve performance very much with this trick.

PS

"not fast for general purpose tasks" - in some benchmarks TTA architectures executed gcc 10+ times slower than general purpose CPU with same frequency.

"not fast in operating frequency" - TTA requires crossbar, which is slow in 2D. You cannot make it fast.

"prone to stall" - you have to stop complete pipeline for a cache miss, otherwise you'll have divergence in execution.


Thanks for commenting Ivan. I'm excited for the future and hope to see more material produced by you. The original belt video absolutely blew my mind and I feel that Mill has the opportunity to really revolutionise the entire industry.


> you need N*3 (2 for reads, one for write) ports of RAM

What? This isn't the 1980s, that's not how you register file. Realistically a "register" these days is an abstract concept, a label attached to a value somewhere in the pipeline. It's the job of instruction decode to keep track of exactly where to load it from.


Yes, renaming exists in modern cores, but where do you think the physical values are stored? Answer: there's still a "physical register file". The renamer maps logical register slots to PRF entries. This is well-covered in academic comparch literature from the 90s onward (and is how it's done in real modern OoO cores too, see e.g. articles on Intel Sandy Bridge).

(disclaimer: I'm rebutting the general point about microarchitectures and register files/SRAMs, but haven't studied the Mill in any depth...)
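
To make that concrete, a toy model of renaming (illustrative only, not any particular core's design):

    class Renamer:
        # Architectural registers are labels; each write allocates a fresh
        # physical register file (PRF) entry, and reads follow the newest
        # mapping via the register alias table (RAT).
        def __init__(self, prf_size=64):
            self.free_list = list(range(prf_size))  # unallocated PRF entries
            self.rat = {}                           # arch reg -> phys reg

        def write(self, arch_reg):
            phys = self.free_list.pop()   # a real core also reclaims the old
            self.rat[arch_reg] = phys     # entry once no reader needs it
            return phys

        def read(self, arch_reg):
            return self.rat[arch_reg]

    r = Renamer()
    r.write("r1")               # the first write of r1 claims one PRF entry
    p2 = r.write("r1")          # a second write gets a fresh one: no WAW hazard
    assert r.read("r1") == p2   # readers are steered to the newest mapping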


If I understand correctly from what they've said so far, there is no RAM/register file with N*3 ports, it just reuses the outputs of the functional units.

That's also nowhere near the source of slowness in the Itanium.


Okay.

The reuse of output from functional units was done in TTA CPUs (Transport Triggered Architectures). Guess how they fared, given that you've probably never heard of them.

Guess also how fast or slow they are compared to regular OOO CPUs.

You will be right if you guess that they are not that good in terms of raw performance and they are not that fast in terms of operating frequency.

They are not fast in either way precisely because they use a crossbar as the operand delivery network. They can also have FIFOs as the switching network, or as another functional unit dedicated to spreading information, but most often that is not used.


You don't put a RAM block in the CPU pipeline. RAM is much slower, thus you keep it several subsystems away.


"RAM" here means SRAM (static RAM). That just means "array of storage elements (latches) connected by bitlines and wordlines", which is way more efficient than "random latches we scattered throughout the chip". SRAMs are used extensively for indexed storage such as physical register files, queues, predictor arrays, etc in modern microarchitectures.


Anyway, unless you have a very small array, even the addressing is enough to make it slow. And arrays were already big at the time I programmed FPGAs; I can only imagine they are much larger now.


Very impressive sounding work. If the Mill came out today, it would be the first time I considered buying a computer component just because I was interested in it.

If you can answer this, what is the business plan for the Mill? Who do you expect to buy it when it comes out? Has any company expressed interest in using it?


Ivan talks business models in this hackaday interview:

http://hackaday.com/2013/11/18/interview-new-mill-cpu-archit...

Hope this helps!


> And we have similar, as-yet-unfiled improvements for multicore too.

Any chance you have a solution to the cache dilemma? Unified caches are brilliant for simplifying implementations but can often lead to stalling.


Mill multicore has fully sequentially consistent cache coherency; there are no barrier operations. Sorry, how it's done is still NYF (Not Yet Filed). We expect a talk on the subject this fall.


> And we have similar, as-yet-unfiled improvements for multicore too.

Goodbye mutexes?


Mill uses optimistic concurrency, similar to the IBM and Intel versions. From that you can build mutexes if you are willing to put up with the drawbacks of locking.
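
How the Mill does it is NYF, but for readers new to the idea, here is a rough software analogue of optimistic concurrency: a seqlock-style version counter where readers retry instead of blocking. This is a sketch with memory ordering hand-waved, not the Mill's mechanism:

    import threading

    class OptimisticCell:
        def __init__(self, value=0):
            self._version = 0                # even: stable; odd: write in progress
            self._value = value
            self._writers = threading.Lock() # stands in for an atomic commit

        def read(self):
            while True:                      # optimistic: take no lock, retry on conflict
                v0 = self._version
                value = self._value
                if v0 % 2 == 0 and self._version == v0:
                    return value             # no writer overlapped the read

        def update(self, fn):
            with self._writers:
                self._version += 1           # announce a write in progress
                self._value = fn(self._value)
                self._version += 1           # commit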


That's very interesting! What are your thoughts on RISC-V?


RISC-V looks nice.

It avoids raising exceptions wherever possible, which I like EXTREMELY. This saves space, allows for faster hardware, and makes the life of the systems/compiler programmer easier.

It should be praised for that alone.
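
One concrete example of that choice: integer division by zero does not trap in RISC-V; the spec just defines the results. A rough model of the semantics:

    # RISC-V DIV/REM never raise on divide-by-zero, per the base ISA spec.
    def riscv_div(a, b):
        if b == 0:
            return -1                        # all bits set; no trap
        q = abs(a) // abs(b)                 # truncate toward zero (unlike Python's //)
        return q if (a >= 0) == (b >= 0) else -q

    def riscv_rem(a, b):
        if b == 0:
            return a                         # remainder of x/0 is x; no trap
        return a - b * riscv_div(a, b)

    assert riscv_div(7, 0) == -1 and riscv_rem(7, 0) == 7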


We've got Mill team members in this thread, but anyway here's the short, short version:

It's about using as much of the die area of the CPU chip as possible for actual computation, rather than for supporting an instruction set with an outdated design.

The programmer's model of computation today has little correspondence to the realities of current semiconductor process technology in terms of what's fast, and what's easy to implement in hardware. The Mill is a bottom-up redesign that takes into account many of the design constraints with current technology, and attempts to design a good architecture that can maximize actual computational throughput.



