You can indeed, and should, assume there is a heavy JIT component to it.
At the same time, it is important to note that this is geared for already highly parallel code.
In other words, while the JIT can in principle be applied to all code, the nature of accelerated HW is that it makes sense mainly where embarrassingly parallel workloads are involved.
Having said that, NextSilicon != GPU, so it takes a different approach to accelerating said parallel code.
Yeah, it's an unfortunate overlap.
The Mill-Core, in NextSilicon terminology, is the software-defined "configuration" of the chip, so to speak: it represents the swaths of the application deemed worthy of acceleration, as expressed on the custom HW.
So really the Mill-Core is, in a way, the expression of the customer's code.
So, a Systolic Array[1] spiced up with a pinch of control flow and a side of compiler cleverness? At least that's the impression I get from the servethehome article linked upthread. I wasn't able to find non-marketing, better-than-sliced-bread technical details from 3 minutes of poking at your website.
I can see why systolic arrays come to mind, but this is different.
While there are indeed many ALUs connected to each other in a systolic array and in a data-flow chip, data-flow is usually more flexible (at a cost of complexity) and the ALUs can be thought of as residing on some shared fabric.
Systolic arrays often (always?) have a predefined communication pattern and are often used in problems where data that passes through them is also retained in some shape or form.
For NextSilicon, the ALUs are reconfigured and rewired to express the application (or parts of it) on the parallel data-flow accelerator.
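To make the contrast concrete, here's a toy Python sketch of the fixed, wavefront-style communication pattern of a systolic-array matrix multiply (purely illustrative of the classic design, not NextSilicon's hardware): each PE holds one output cell and only ever sees operands passed by its neighbours, whereas a data-flow fabric would rewire those connections per application.

```python
# Toy output-stationary systolic array: each PE(i, j) holds one output
# accumulator, and one "wavefront" of operands streams past per cycle.
def systolic_matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]           # one accumulator per PE
    for k in range(m):                          # k-th wavefront (cycle)
        for i in range(n):                      # row operands flow "right"
            for j in range(p):                  # column operands flow "down"
                acc[i][j] += a[i][k] * b[k][j]  # fixed neighbour-to-neighbour hop
    return acc

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19, 22], [43, 50]]
```

The point of the sketch: the loop structure (who talks to whom, and when) is baked in, which is exactly the "predefined communication pattern" mentioned above.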
My understanding is no, if I understand what people mean by systolic arrays.
GreenArray processors are complete computers with their own memory and running their own software. The GA144 chip has 144 independently programmable computers with 64 words of memory each. You program each of them, including external I/O and routing between them, and then you run the chip as a cluster of computers.
Text on the front page of the NS website* leads me to think you have a fancy compiler: "Intelligent software-defined hardware acceleration". Sounds like Cerebras to my non-expert ears.
Thank you for writing the obvious.
Instruction byte count is 100% the wrong metric here.
Instruction count (given reasonable decoding/timing constraints) is the thing to optimize for, and indeed variable-length encoding is very bad.
Instruction byte count matters quite a lot when you're buying ROM in volume. And today, the main commercial battleground for RISC-V is in the microcontroller space, where people care about these things.
For those of us without the expertise, could you elaborate on why that is?
On the one hand we have byte count, with its obvious effect on cache space used.
But to those of us who don't know, why is instruction count so important?
There's macro-op fusion, which admittedly would burn transistors that could be used for other things. Could you elaborate why it's not sufficient?
And then there's the fact that modern x86 does the opposite of macro-op fusion, by actually splitting CISC instructions up into micro-ops. Why would it be so bad if they were more micro-ops to start with, given that Intel chooses to do this anyway?
For those not understanding the context of the parent's comment, this HN post originally linked to @damageboy's https://twitter.com/damageboy/status/1194751035136450560 tweet showing a 20% performance hit, but was later changed by mods to link to the phoronix.com article.
Fault? He is getting free publicity to the point that he is even on the front page of HN (not that he cares about this specifically). Show me the last time that happened with a French book up for a prize.
Would be interesting to see how xsv compares to miller (https://johnkerl.org/miller/doc/index.html) in terms of perf; this tool arrives exactly as I am about to munge 1TB of gzipped CSV files.
Unfortunately, the main operation I need is not supported by xsv...