Nvidia CEO Introduces Nvidia Ampere Architecture, Nvidia A100 GPU (nvidia.com)
453 points by bcaulfield on May 14, 2020 | 333 comments



I'm confused. Is there any relationship between the recent Ampere Arm64 servers (https://news.ycombinator.com/item?id=22475036) and Nvidia's "Ampere Architecture", or is it just a case of them using the same name?


I don't like that people downvoted you for asking a question.

Whether or not someone thinks the question is stupid doesn't mean a downvote is warranted (nor an upvote; answer the question and move on).

To answer, though: it's just a coincidence. As you might already know, Nvidia names its microarchitectures after famous scientists (especially in the field of electricity).

* Volta (Alessandro Volta, inventor of the electric battery)

* Tesla (Nikola Tesla, inventor/designer of A/C current)

* Maxwell (James Clerk Maxwell, founder of electromagnetic radiation)

* Pascal (Blaise Pascal, lots of science around "pressure", arguably his work led to the creation of vacuum tubes used in early computers)

Ampere (from André-Marie Ampère, who lent his name to the unit of current, the "amp") is just another electrical scientist's name.

Coincidentally, a new company founded in 2017 decided it was a good name for them too, hence the confusion.


Tesla is famous in the US for inventing polyphase AC and induction motors, but this is really one of those stories where a bunch of people invented the same thing at nearly the same time due to a precipitating shared understanding.

Note that Tesla's designs were IIRC two-phase which is largely inferior to three-phase. The push for three-phase and associated designs and inventions (three phase transformers on a single core etc.) came from outside the US.


> but this is really one of those stories where a bunch of people invented the same thing at nearly the same time due to a precipitating shared understanding.

This is by far the dominant case of invention. Truly independent work is incredibly rare.


The thing that's a little different about Tesla, in the US at least, is that he is so incredibly fetishized by, e.g., high-profile idiots: https://theoatmeal.com/comics/tesla, or conspiracy nutjobs like the International Tesla Institute (http://teslatech.info/ttevents/prgframe.htm http://tesla.org/tesla_fair_abq.htm) publishing and promoting Tesla-related conspiracies and hawking investments in snake-oil technology like Rand Cam engines, VMSK, and all manner of "over-unity" machines.


I've always enjoyed the oatmeal's comics. What is your reason for referring to him as an idiot?


Actually, at first I thought "idiot" was a bit harsh... but then I reread what I had forgotten about that post. If he's juvenile enough to promote the vandalism of Wikipedia over a completely fantastical, out-of-context reinterpretation of history just to label Thomas Edison a "douchebag", well then it is fair to label him an idiot on a limited forum.

Also, there's no accounting for taste, but I find even his non-serious comics puerile and not terribly funny - pretty much one step above Taboola chum, or Jim Davis for millennials. Most of the "humor" and the overplayed hook is simply describing everyday things with odd adjectives, i.e. hair cave = vagina, saliva = evil mouth juice, wow. There's probably a term for this trope... Anyway, it gets clicks on Facebook.


And yet we grant patents so liberally, giving a windfall to the first person who files.


Remember the long view - patents cause people to hurry to publish and share their ideas publicly. Why shouldn't they be granted liberally?

In a few years, the temporary monopoly falls away and the benefit passes to everyone.

I think they should work to make them even cheaper and easier to file.


20 years is a long time. For some fields it is perfectly reasonable, but 20-year patents on many recent CS inventions would have significantly bottlenecked the development of the industry - look at how much mess was created by the JPEG patents, for example, and similar problems have existed for every other not-explicitly-libre A/V codec.


While we're on the subject of patents and Nvidia, their patent on using quasi Monte Carlo in rendering is allowing them to hold basically the whole path tracing world hostage, e.g. possibly forcing people to use CUDA who might otherwise have used OpenCL.

They didn't even invent the numerical methods themselves (pure mathematics worked out elsewhere, long ago); they were just first to file for a particular application.


There's a strong adverse selection effect, though. Because you need to publish to be granted a patent but can sue whenever anyone infringes (whether willful or not), the incentive is to patent obvious approaches that don't work well and hold the best approach that you're actually using as a trade secret. That way, anyone attempting to replicate you likely ends up in a patent minefield, yet you don't give away the keys to the castle in a patent where you have to detect infringement yourself.


I believe "a few years" is actually 20 years, though. I hadn't thought of patents from this perspective, but 20 years is still a long time (and a large chunk of your working years) to benefit from something.


Indeed. I found it quite enlightening to go through this list: https://en.wikipedia.org/wiki/List_of_multiple_discoveries


Ironically, that's what they put on the PCB. Those yellow things on both sides of the chip package are actual micro transformers made of some material with very tricky magnetic properties.

I wonder, was the name Ampere a reference to its titanic current consumption?

The only reason to put those on the PCB would be to provide currents above 1 kA.


You're looking at an SXM module, so that PCB is the whole thing. TDP is 400 W. I assume these run at a sub-1 V core voltage due to their relatively low clocks, so yeah, you are looking at a core supply current that may well exceed 500 A at full load and has to be provided by those VRMs crammed on that board.
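As a rough back-of-the-envelope check (assuming a 0.8 V core voltage, which is just a guess): I = P / V = 400 W / 0.8 V = 500 A.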

Worth pointing out that that's not really new. Gaming cards have been running at about a volt for a bunch of years now and all of those draw 250+ W, so the currents are rather substantial.


If rebuilding a grid today, would going to DC power be advantageous to AC?


What has changed between then and now? High voltage DC also works well for long distance transmission but is only preferable to AC at even higher voltages.


HVDC works for linking grids and for very long distance transmission between few stations. It does not work well at all for distribution. Never did. Never will. Sorrynotsorry.


And even before that there was:

* GeForce (for Andrea Geforce, the first to use the color electric green)

* Riva (for Jose Riva, the discoverer that you can use TNT to generate electricity.)


Wow, Riva, that brand name brings back memories. Almost as magical as 3dfx and Voodoo (which NVIDIA ultimately acquired). Truly game-changing hardware.

Thanks Scott, Gary and Ross (founders of 3dfx); your cards gave me and my friends thousands of hours of enjoyment.


I bought a Voodoo as my first 3D card and was so disappointed that it did not support my motherboard (it needed PCI 2.1, I had PCI 2.0), so I had to return to the shop and exchange it for a TNT. :(


Wow, I didn't know that. Wasn't it named in a contest by gamers?

So what does "Radeon" mean?

Edit: Yes, according to Wiki. So we may never know whether GeForce was actually dedicated to the person. I vaguely remember reading in PC Gamer at the time that this was not the case at all.


Radeon comes from ATI (later acquired by AMD) so probably doesn't follow the same naming scheme as Nvidia does.


Who is Andrea Geforce? I can't find any information about him/her.


Wikipedia tells the following about the Geforce name:

> The "GeForce" name originated from a contest held by Nvidia in early 1999 called "Name That Chip". The company called out to the public to name the successor to the RIVA TNT2 line of graphics boards. There were over 12,000 entries received and 7 winners received a RIVA TNT2 Ultra graphics card as a reward.[2][3]

- https://web.archive.org/web/20000608011648/http://www.nvidia...

- https://tweakers.net/nieuws/1967/nVidia-Name-that-chip-conte...

So I'm not sure the origin is actually "Andrea Geforce"; I certainly cannot find any sources that confirm that.


Because he was not a famous scientist, like Alexander Ball, inventor of round inflatable objects.


you just need to create its wikipedia entry and it will exist soon.


well done


Tesla was also the name of a Czechoslovak electronics company, known among other things for their electron microscopes. Even though this Tesla is long gone, it has a lasting legacy here in Brno due to the many electron microscope manufacturers (Delong, FEI, Thermo Fisher, etc.) being present and often linking their origin, or many of their employees, to the old Tesla company.


Ha, my cousin works at one of those electron microscope companies. I thought it was some one-off company, maybe saving costs thanks to the good graduates coming out of Brno's technical university; good to know there is more to it!


One of the old Tesla company buildings is now the Brno Museum of Technology, and they have (among many others) an exhibition about the history of electron microscopy in Brno. One of the machines that is part of the exhibition (a huge experimental electron beam lithography machine) is pretty much bolted in place in the same spot where the old Tesla company built it all those years ago. There are even some "legends" about how they built it so well that it still holds a vacuum inside today.

Another legend says that they once could not get a new model of electron microscope working before showing it at an international exposition. The image was always blurry. They ran out of time, so they just took it to the exposition to fix it there. But it worked flawlessly there! Turns out this was due to the tram trolley lines running next to the company building interfering with the electron optics. :)


And a museum in Belgrade.


Thank you for calling out the downvote issue. I've refrained from asking questions for this exact reason.

People should be encouraged to ask questions, even if from a position of lesser knowledge of the matter at hand. Those who do answer are probably helping not only the person asking the question, but also those who have the same question in mind and don't ask it.

How many others refrain from enriching the dialogue for the same reasons?


Commenting about voting is discouraged in the HN guidelines.

Also, questions that could be answered with a quick internet search don't make good discussion.


Thanks for reminding me as I had forgotten. Still it seems a bit weird not to be able to discuss. Again, thanks for the reminder (I suspect others who read this thread may also be reminded).

You are correct in saying some questions are better than others. But again that's no reason to downvote as it actively discourages people from participation. I would think the best course would be to ignore and move on as the original comment on this thread mentioned. Downvoting can be a hostile action. Ignoring is neutral.

As per the guidelines themselves: >Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.

Case in point, my comment was downvoted, even though the subsequent comment was an illuminating reminder for myself and possibly others. Those downvotes seem somewhat arbitrary at best, hostile at worst.

I've no more to say, so I will refrain from any further comments on voting dynamics.


Pascal also invented a few mechanical calculators [0], which might be considered, in a way, precursors to modern computers.

[0] https://en.wikipedia.org/wiki/Pascal%27s_calculator


Also, that's not a dumb question! That's a very specific name, and two products coming out in a window of time both using the same name is enough to cause some confusion. I found the question and answer to be very useful.


I was actually surprised that NVIDIA went through with it, it seems like a straightforward trademark case. Two types of computer processor that share a marketing name.

Obviously if you are "in the know" they are not really the same type of processor, but it is closer than you usually see companies get with their trademarks.


Pedantic, but electromagnetic radiation's "founding" precedes Mr. Maxwell. :)


Not just electricity, but basic physics. Don't forget the Kepler architecture, named after the astronomer Johannes Kepler, and the Fermi architecture, named after the nuclear physicist Enrico Fermi (although the same physics is important in semiconductors). Although there's also an odd exception, which is the Turing architecture.


One possible reason to downvote is that the question was the top comment for this post, but it does not directly discuss the new technology introduced in the post. I don't know how HN ranking works, but I'd assume that the top comment to a front-page topping article has already collected a sizable share of upvotes.


I came to the comments with the exact same question


> Whether or not someone thinks the question is stupid doesn't mean a downvote is warranted (nor an upvote; answer the question and move on).

The type of questions being downvoted sometimes makes me question sanity. In general I believe that downvoting is probably one of the most poisonous things invented in newsgroups.


Downvoting is toxic, yeah, but look at Twitter for what happens when it's not a thing: it's just as bad, if not worse, in the other direction - comments have almost no coherency and trolls are incredibly rampant. On sites with downvoting like Reddit the problem is somewhat less, but the hivemind effect takes root. There really aren't many good solutions right now.


How about moderators and the ability to ban? I understand it might not be economically sound.


I'm sure Nvidia is paying a nice licensing fee to Ampere for the registered trademark. Otherwise it'd be sheer stupidity.


I don’t know if you’re being sarcastic, but trademarking a famous scientist’s name seems legally dubious at best. Is that allowed? I.e. Tesla the car company and nvidia Tesla’s GPUs; is there a gentleman’s agreement between these two or is such a name simply not trademarkable?


In general a trademark is registered for a specific class (~industry). As long as the trademarks are for different classes and there is no risk of confusion, multiple companies/products can have the same name. For example Delta is the name of an airline, a computer company, a faucet company and a family of orbital rockets. There's little risk anybody would confuse these four, and they don't belong to the same class, so they can coexist just fine.

The US is quite generous in what you can trademark. In the EU for example you can't (as easily) trademark common words. But a quick search shows about 50 active EU trademarks on the word Tesla.


My understanding (IANAL) is that "Tesla" is trademarked by the company. [1] Note that doesn't mean no one else can ever use that name for anything. It does mean I'd probably get a cease and desist letter if I tried to go into the car business under the name Tesla. But I can probably get a trademark in other areas where there isn't a serious possibility of confusion. How close I'm willing to get to that line probably depends on how badly I want the name and how conservative the lawyers are.

[1] https://www.lexology.com/library/detail.aspx?g=e4eaa83d-47dc...


trademarking a famous scientist’s name seems legally dubious at best

It's pretty common - the Apple Newton(TM) and Salesforce Einstein(TM) spring to mind.


Not sure why this is being downvoted because it's a good point.

Ampere, Ampere Semiconductor and Ampere Computing are all registered trademarks of the computing company in the class NVidia would need, and they'd have a good case that this is causing confusion (they'd probably cite this thread).

It'll be interesting to see what happens.


Same name, but Ampere the company probably came first (2017).

They're both named after the famous mathematician and physicist. https://en.wikipedia.org/wiki/Andr%C3%A9-Marie_Amp%C3%A8re


Ampere has been on NVidia's roadmap for the better part of a decade.


> Ampere has been on NVidia's roadmap for the better part of a decade.

Not publicly, at least, since rumors of the "Ampere" naming surfaced around late 2017. https://www.kitguru.net/components/graphic-cards/matthew-wil...


Ah, I thought I remembered seeing the name on their roadmap earlier than that, but I must have been mistaken. In any case I think it's safe to say the name wasn't ripped off from anyone, given the names of previous products and the timing of those rumors.


The ARM server chip is from a company named Ampere that has nothing to do with Nvidia.


It's important to remember the half-precision tensor core misrepresentations, where the claimed 8x improvement over fp32 on ImageNet with tensor cores (V100) was actually only 1.2-2x [1,2]. Furthermore, there are major precision issues with network architectures like variational autoencoders and many others.

We use V100s for Richardson-Lucy like deconvolutions for example, where we have near-exact photon counts up to 10,000 per pixel. fp32 is sufficient, tf32 is not.
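To make the precision point concrete (just a sketch, assuming TF32 behaves like a float32 whose mantissa is rounded to 10 explicit bits; actual tensor core rounding may differ): near 10,000 the spacing between representable 10-bit-mantissa values is 2^(13-10) = 8, so exact integer counts get rounded away, while fp32's 23-bit mantissa keeps every integer below 2^24 exact.

    import numpy as np

    # Crude emulation: round a float32's bit pattern to `keep_bits` explicit
    # mantissa bits (positive values only; actual TF32 hardware rounding may
    # differ -- this is purely an illustration).
    def reduce_mantissa(x, keep_bits):
        drop = 23 - keep_bits
        if drop <= 0:
            return np.float32(x)
        u = np.array([x], dtype=np.float32).view(np.uint32)
        u = (u + np.uint32(1 << (drop - 1))) & ~np.uint32((1 << drop) - 1)
        return u.view(np.float32)[0]

    count = 10001.0
    print(np.float32(count))           # 10001.0 -- fp32 stores small integer counts exactly
    print(reduce_mantissa(count, 10))  # 10000.0 -- tf32-like precision loses the last counts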

V100 claimed 15 teraflops of FP32, A100 claims 19.5 teraflops. For most pytorch/tensorflow workflows out there, where FP32 dominates, this works out to roughly a 30% improvement over the last generation, which is reasonable and typical. FP64 does get a nice boost, though.

[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v... [2] https://www.pugetsystems.com/labs/hpc/TensorFlow-Performance...


I am not that much into ML - I've just fiddled with it a bit. Is tf32 = fp16?


Not quite, but close. “tf32” is 18 bits, but with the same 10 bits of exponent that fp32 has. It’s the range of fp32 with the precision of fp16. It’s a shame to see such unoriginality in new number representations. I’d much rather see Posit hardware acceleration: https://web.stanford.edu/class/ee380/Abstracts/170201-slides...


TF32 is 19 bits, not 18 bits. There's an additional bit for sign.

https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-prec...


No, tf32 has the same size exponent field as fp32 (8 bits) but the mantissa size of fp16 (10 bits).
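For reference, the bit budgets work out to:

    fp32: 1 sign + 8 exponent + 23 mantissa = 32 bits
    fp16: 1 sign + 5 exponent + 10 mantissa = 16 bits
    tf32: 1 sign + 8 exponent + 10 mantissa = 19 bits (fp32 range, roughly fp16 precision)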


Why is fp32 sufficient but not tf32 for that task?


Probably even more closed than ever. They tend to become more and more restrictive with every new hardware generation. I wonder what happened to the open-source announcement they promised a while back.


Yeah. That's been my general problem with adopting NVidia for anything. They make good hardware, but there's a lot of lock-in, and not a lot of transparency. That introduces business risk.

I'm not in a position where I need GPGPU, but if there wasn't that risk, and generally there were mature, open standards, I'd definitely use it. The major breakpoint would be when libraries like Numpy do it natively, and better yet, when Python can fork out list comprehensions to a GPU. I think at that point, the flood gates will open up, and NVidia's marketshare will explode from specialized applications to everywhere.

Intel stumbled into it by accident, but got it right with x86. Define an open(ish) standard, and produce superior chips to that standard. Without AMD, Cyrix, Via, and the other knock-offs, there would be no Intel at this point.

Intel keeps getting it right with numerical libraries. They're open. They work well. They work on AMD. But because Intel is building them, Intel has that slight bit of advantage. If Intel's open libraries are even 5% better on Intel, that's a huge market edge.


> They make good hardware, but there's a lot of lock-in, and not a lot of transparency.

This sounds like you'd like NVIDIA to open-source all their software. I see this type of request a lot, but I don't see it happening.

NVIDIA's main competitive advantage over AMD and Intel is its software stack. AMD could release a 2x powerful GPGPU tomorrow for half the price and most current NVIDIA users wouldn't care because what good is that if you can't program it? AMD's software offering is just poor; sure, they open-source everything, but they don't make any software worth buying.

ARM and Intel make great software (the Intel MKL, Intel SVML, ... libraries, the icc, ifort, ... compilers), and they don't open-source any of that either, for the same reasons as NVIDIA.

Intel and NVIDIA employ a lot of people to develop their software stacks. These people probably aren't very cheap. AMD's strategy is to save a lot of money on software development, maybe hoping that the open-source community or Intel and NVIDIA will do it for free.

I also see these requests that Intel and NVIDIA should open-source everything together with the explanation that "I need this because I want to buy AMD stuff". That, right there, is the reason why they don't do it.

You want to know why NVIDIA has 99% of the cloud GPGPU hardware market and AMD 1%? If you think $10,000 for a V100 is expensive, do the math on how much an AMD MI50 costs: $5,000 for the hardware, and then a team of X >$100k engineers (how much do you think AI GPGPU engineers cost?) working for N years just to play catch-up on the part of the software stack that NVIDIA gives you with a V100 for free. That gets multiple millions of dollars more expensive really quickly.
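To put rough, purely illustrative numbers on that: 5 engineers x $150k/year x 3 years = $2.25M in salaries alone, before you've replicated more than a sliver of the CUDA library stack.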


> AMD could release a 2x powerful GPGPU tomorrow for half the price and most current NVIDIA users wouldn't care because what good is that if you can't program it?

Correction: Nobody will be able to use the AMD hardware (outside of computer graphics) because everybody has been locked-in with CUDA on Nvidia. They cannot change even if they want to: it is pure madness to reprogram an entire GPGPU software stack every 2 years just to change your hardware provider.

And I think it will remain like that until NVidia get sued for anti-trust.

> ARM and Intel make great software [..] don't open-source any of that either, for the same reasons as NVIDIA.

That's propaganda and it's wrong.

Intel and ARM contribute a lot to OSS. Most of the software they release nowadays is open source. This includes compiler support, drivers, libraries and entire dev environments: mkl-dnn, TBB, BLIS, ISPC, "One", mbedTLS... ARM even has an entire foundation just to contribute to OSS (https://www.linaro.org/).

Compared to that, NVidia does close to nothing.

There is no justification for NVidia's attitude towards OSS. It reminds me of Microsoft in its darkest days.

The only excuse I can see to this attitude is greed.

I hope at least they do not contaminate Mellanox with their toxic policies. Mellanox has been an example of a successful open-source contributor/company (up to now) with OpenFabrics (https://www.openfabrics.org/). It would be dramatic if this disappeared.


AMD doesn't even have GPGPU software for some of their cards. I have an RX 5700 XT and I can't use it for anything but gaming because ROCm doesn't support Navi cards, a whole year after their release.


As a 5700 owner, I agree.

It gets even worse. There was recently a regression in the 5.4, 5.5, and 5.6 kernels that hit me hard for a week or so on Manjaro last month. The system would just decide to lock up or restart. I thought the graphics card had died when it happened once on Windows. It's working fine now - and these drivers have been out for 10 months.

Even worse, AMD has locked down the releases of some of their 'GPUOpen' software.

https://www.phoronix.com/scan.php?page=news_item&px=Radeon-R...

https://www.phoronix.com/scan.php?page=news_item&px=GPUOpen-...

I did not expect the second one to be open source; just not on their GPUOpen website.

I did expect the first one to 'stay' open source. Not to be made proprietary on their 'GPUOpen' website.

I am definitely keeping an eye on Intel graphics now.


I think at this point AMD wants anything compute to concentrate on CDNA, with graphics remaining on RDNA.


>> ARM and Intel make great software [..] don't open-source any of that either, for the same reasons as NVIDIA.

> That's propaganda and it's wrong.

Very convenient of you to have omitted what was in the square brackets:

> Intel MKL, Intel SVML, ... libraries, icc, ifort, ... compiler

Show me the open source MKL, Intel SVML, icc and ifort.

Some (all?) of it may be free, but it's not open source.


I don't necessarily have an opinion either way in this discussion, but I wanted to point out that Intel's latest MKL library does seem to be developed as an open-source project: https://github.com/oneapi-src/oneMKL


> Nobody will be able to use the AMD hardware (outside of computer graphics) because everybody has been locked-in with CUDA on Nvidia.

But numpy can be ported. So can pytorch.

I don't think the lock-in is that big of an issue. GPUs do only simple things, but do them fast.


> GPUs do only simple things, but do them fast

GPUs are immensely complex systems. Look at an API like Vulkan, plus its shading language, and tell me again it's simple. And that's a low-level interface.

Now add to that the enormous amount of software effort that goes into implementing efficient libraries like cuBLAS, cuDNN, etc. There's a reason other vendors have struggled to compete with NVidia.

Disclaimer: currently employed at NVidia.


Part of Nvidia's advantage comes from building the hardware and software side by side. No one was seriously tackling GPGPU until Nvidia created CUDA, and if you look at the rest of the graphics stack, Nvidia is the one driving the big innovations.

GPUs are sufficiently specialized in both interface and problem domain that GPU enhanced software is unlikely to appear without a large vendor driving development, and it would be tough for that vendor to fund application development if there is no lock in on the chips.

Which leads to the real question: what business model would enable GPU/AI software development without hardware lock-in? Game development has found a viable business by charging game publishers.


Would you agree that your observations somewhat imply that a competitive free market is not a fit for all governable domains (and don't mistake governable for government there, we're still talking about the shepherding of innovation)?


Early tech investments are risky, but if your competition has tech 10 years more advanced than yours, there is probably no amount of money that would allow you to catch up, surpass them, and make enough profit to recover the investment, mainly because you can't buy time, your competitor won't stop innovating, they are making a profit and you aren't, etc.

So to me the main realization here is that in tech, if one competitor ends up with tech that's 10 years more advanced than the competition, it is basically a divergence-type of phenomenon. It isn't worth it for the competition to even invest in trying to catch up, and you end up with a monopoly.


This is a good callout: unlike in manufacturing, the supply chain for large software projects is almost universally vertically integrated. While it's possible to make a kit car that at least some people would buy, most of the big tech companies have reached the point of requiring hundreds of engineers working for years to compete.

The caveat is that time has shown the monopolies tend to decay over time for various reasons; the tech world is littered with companies that grew too confident in their monopoly:

- Cisco

- Microsoft Windows

- IBM

etc.


The problem with vertically integrated technology is that if a huge advancement appears at the lowest level of the stack, one that would require a re-implementation of the whole stack, a new startup building things from scratch can overthrow a large competitor, which would need to "throw" their stack away, or evolve it without breaking backward compatibility, etc.

Once you have put a lot of money into a product, it is very hard to start a new one from scratch and let the old one die.


I think you would need to take a fine-tooth comb to the definitions here. I could see a few different options emerging for non-Nvidia software, including:

- Cloud providers wishing to provide lower CapEx solutions in exchange for increased OpEx and margin.

- Large Nvidia customers forming a foundation to shepherd open implementations of common technology components.

From a free market perspective both forms of transaction would be viable and incentivized, but neither option necessarily leads to an open implementation.


I have been stating a similar thing about GPUs for a very long time.

The GPU hardware is ( comparatively ) simple.

It is the software that sets GPU vendors apart. For Gaming, that is Drivers. For Compute that is CUDA.

On a relative scale, getting a decent GPU design may have a difficulty of 1, getting decent drivers to work well on all existing software is 10, and getting the whole ecosystem around your drivers / CUDA + hardware is likely in the range of 50 to 100.

As far as I can tell, under Jensen's leadership, the chance of AMD or even Intel shaking up Nvidia's grasp on this domain is practically zero in the foreseeable future.

That is speaking as an AMD shareholder who really wants AMD to compete.


> But numpy can be ported. So can pytorch.

Having AMD or Intel port everything that has been developed in CUDA themselves, like was done for PyTorch, is not sustainable and will always lag behind.

It can only help to create a monopoly in the long term.


As HIP continues to implement more of CUDA, I think we'll see more developers doing it themselves once the barrier to porting is smaller. AMD has a lot of work to do, and I don't know whether they'll succeed or not, but IMO they have the right strategy.


Intel don't release BLIS, though there is some Intel contribution. Substitute libxsmm, which originally beat MKL.


> Correction: Nobody will be able to use the AMD hardware (outside of computer graphics) because everybody has been locked-in with CUDA on Nvidia.

NVIDIA open-sourced their CUDA implementation to the LLVM project 5 years ago, which is why clang can compile CUDA today, and why Intel and PGI have clang forks compiling CUDA to multi-threaded and vectorized x86-64 using OpenMP.

That you can't compile CUDA to AMD GPUs isn't NVIDIA's fault, it's AMD, for deciding to pursue OpenCL first, then HSA, and now HIP.


> Do you work for AMD

I do not. And I use NVidia hardware regularly for GPGPU. But I hate fanboyism.

> NVIDIA open-sourced their CUDA implementation to the LLVM project 5 years ago

Correction: Google developed an internal CUDA implementation for their own needs based on LLVM, which Nvidia barely supported for their own needs afterwards.

Nothing is "stable" nor "branded" in this work.... Consequently, 99% of public Open Source CUDA-using software still compile ONLY with the CUDA proprietary toolchain ONLY on NVidia hardware. And this is not going to change anything soon.

> one from PGI, that compile CUDA to multi-threaded x86-64 code using OpenMP.

The PGI compiler is proprietary and now the property of NVidia. It was previously proprietary and independent, but mainly used for its GPGPU capability through OpenACC. OpenACC backend targets directly the nvidiaptx (proprietary) format. Nothing related to CUDA.

> Intel being the main vendor pushing for a parallel STL in the C++ standard

That's wrong again.

Most of the work done for the parallel STL and by the C++ committee originates from work on HPX by the STELLAR Group (http://stellar-group.org/libraries/hpx/).

They are pretty smart people and deserve at least respect and credit for what they have done.

More information from Hartmut Kaiser (very nice guy, btw) here (https://www.youtube.com/watch?v=6Z3_qaFYF84).

They were the precursors of the idea of parallel "algorithms" in the STL, and the concept of "execution policy" you have in C++17 comes from them.

In defense of Intel (and to my knowledge), they provided the first OSS compiler implementation for it.


> But I hate fanboyism.

"The only excuse I can see to this attitude is greed" sounds pretty fanboyish to me. :-)

I've never understood why Microsoft, or Adobe, or Autodesk, or Synopsys, or Cadence or any other pure software company is allowed to charge as much as the market will bear for their products, often more per year than Nvidia's hardware, but when a company makes software that runs on dedicated hardware, it's called greed. I don't think it's an exaggeration when I say that, for many laptops with a Microsoft Office 365 license, you pay more over the lifetime of the laptop for the software license than for the hardware itself. And it's definitely true for most workstation software.

When you use Photoshop for your creative work, you lock your design IP to Adobe's Creative Suite. When you use CUDA to create your own compute IP, you lock yourself to Nvidia's hardware.

In both cases, you're going to pay an external party. In both cases, you decide that this money provides enough value to be worth paying for.


> Correction: Google developed an internal CUDA implementation for their own needs based on LLVM, which Nvidia barely supported for their own needs afterwards.

This is wildly inaccurate.

While Google did develop a PTX backend for LLVM, the student who worked on that as part of a GSoC later got hired by NVIDIA, and ended up contributing the current NVPTX backend that clang uses today. The PTX backend that Google contributed was removed some time later.

> Nothing is "stable" nor "branded" in this work.

This is false. The NV part of the backend name (NVPTX) literally brands this backend as NVIDIA's PTX backend, in strong contrast with the other PTX backend that LLVM used to have (it actually had both for a while).

> OpenACC backend targets directly the nvidiaptx (proprietary) format.

This is false. Source: I've used the PGI compiler on some Fortran code, and you can mix OpenACC with CUDA Fortran just fine, and compile to x86-64 using OpenMP to just target x86 CPUs. No NVIDIA hardware involved.

> That's wrong again.
>
> Most of the work done for the parallel STL and by the C++ committee originates from work on HPX by the STELLAR Group

This is also wildly inaccurate. The Parallel STL work actually originated with the GCC parallel STL, the Intel TBB, and NVIDIA Thrust libraries [0]. The author of Thrust was the Editor of the Parallelism TS, and is the chair of the Parallelism SG. The members of the STELLAR group that worked on HPX started collaborating more actively with ISO once they started working at NVIDIA after their PhDs. One of them chairs the C++ Library Evolution working group. The Concurrency working group is also chaired by NVIDIA (by the other NVIDIA author of the original Parallelism TS).

AMD is nowhere to be found in this type of work.

[0] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n372...


> While Google did develop a PTX backend for LLVM, the student who worked on that as part of a GSoC later got hired by NVIDIA, and ended up contributing the current NVPTX backend that clang uses today.

You more or less reformulated what I said. It might one day become used behind a proprietary, rebranded NVidia blob, but the fact is that today close to nobody uses it for production in the wild, and it is not even officially supported.

> This is false. The NV part of the backend name (NVPTX) literally brands this backend as NVIDIA's PTX backend.

It does not mean it's stable or used. I do not know of a single major piece of GPGPU software in existence that ever used it in an official distribution. Like I said.

> CUDA Fortran just fine

CUDA fortran, yes you said it, CUDA fortran. The rest is OpenACC.

> The Parallel STL work actually originated with the GCC parallel STL, the Intel TBB, and NVIDIA Thrust libraries

My apologies for that. I was unaware of this prior work.

> AMD is nowhere to be found in this type of work.

I do not think I ever said anything about AMD.


> CUDA fortran, yes you said it, CUDA fortran. The rest is OpenACC.

You can also mix C, OpenACC, and CUDA C, and compile to x86-64. So I'm really not sure about what point you are trying to make here.

You were claiming that OpenACC and CUDA only run on Nvidia's hardware, yet I suppose you now agree that this isn't true.

I do agree that PGI is still Nvidia-owned, but there are other compilers that do what PGI does.


> You were claiming that OpenACC and CUDA only run on Nvidia's hardware, yet I suppose you now agree that this isn't true.

I do not think I ever said that OpenACC runs only on NVidia hardware. However, I still affirm that CUDA runs only on NVidia hardware, yes. For anything else, it is based on code converters at best.


> That you can't compile CUDA to AMD GPUs isn't NVIDIA's fault, it's AMD, for deciding to pursue OpenCL first, then HSA, and now HIP.

Using a branded and patented competing proprietary technology and copying its API for your own implementation is madness that will for sure land you in front of a court.

It seems that even Google understood that the hard way (https://en.wikipedia.org/wiki/Google_v._Oracle_America)


How come? There are CUDA C++ and CUDA C toolchains available under an MIT license, large parts of which are contributed by NVIDIA.

How can they sue you for using something that they give you with a license that says "we allow you to do whatever you want with it" ?


the MIT license doesn't have an express patent grant. If Nvidia has a patent on some technology used by the open source code, they could sue you for patent infringement if you use it in a way that displeases them. What they can't do is sue you for copyright infringement.


Google v Oracle is still unsettled.

Most other legal precedent was that it was fine to clone an API.


> Most other legal precedent was that it was fine to clone an API.

CUDA is more than an API. It is a technology under copyright and very likely patented too. Even the API itself contains multiple references to "CUDA" in function calls and variable names.


None of that protects it from being cloned under previous 9th Circuit precedent, except maybe patents, but I'm not aware of any patents that'd protect against another CUDA implementation.


>Intel and PGI have clang forks compiling CUDA to multi-threaded and vectorized x86-64 using OpenMP.

Where are these forks?



For PGI, all PGI compilers can do this; just pick x86-64 as the target. There are also other forks online (just search for LLVM, CUDA, x86 as keywords); some university groups have their forks on GitHub, where they compile CUDA to x86.


People who are into RISC-V and other side projects/open stacks obviously have not worked on mission critical problems.

When you have a jet engine hoisted up on a test rig and something fails in your DSP library, you don't hesitate to call Matlab engineering support and get help within the next 30 minutes. Try that with some Python library. People give Matlab a lot of flak for being closed source, but there is a reason they exist. Not for building a stupid toy project, but for real things where big $$$ is on the line. Python is also used in production everywhere, but if your application is a niche one, using a git-cloned PyVISA library to connect to some DSP hardware is not very "production ready". You need solid deps.

Don't get me wrong - open source software runs in prod all the time - PostgreSQL/Linux, etc. The smaller the application domain (specific DSP libraries or analysis stacks for wind turbines and such), the lower the availability of high quality open source software (and support).

My point is that reality hits you hard when it is anything where a lot of $$$ or people's time depends on it. Don't blame their engineers for using closed-source tools.


> People who are into RISC-V and other side projects/open stacks obviously have not worked on mission critical problems.

"People who are into RISC-V" nowadays includes folks like Chris Lattner, who has worked on more mission-critical problems than most everyone here.


Yes, and not all of them were turned into gold. I don't have any hopes on Swift for Tensorflow.


It would suffice for NVIDIA to open-source enough specifications and perhaps some subset of core software to enable others to build high quality open source (or even proprietary) software that targets NVIDIA's architecture. They can't hire every programmer in the world; if other programmers can build high-performance software that takes advantage of their platform, that increases the value of their hardware.

Your comparison to Intel isn't valid: most software that runs on Intel processors isn't built with icc, and customers have a choice: they can use icc, gcc, clang, or a number of other compilers. The NVIDIA world isn't equivalent.


Anyone is free to target PTX and do their own compiler on top.

In fact, given that it has been there since version 3, there are compilers available for almost all major programming languages, including managed ones.

OpenCL, meanwhile, is a C world; almost no one cares about the C++ extensions, and even fewer vendors care about SPIR-V.

Also the community doesn't seem to be bothered that for a long time, the only SYCL implementation was a commercial one from CodePlay, trying to extend their compilers outside the console market.


> the community doesn't seem to be bothered that for a long time, the only SYCL implementation was a commercial one

Bothered has nothing to do with it. Implementing low level toolchains generally seems to require both a gargantuan effort and an incredible depth of knowledge. If it didn't, I think tooling and languages in general would be significantly better across the board.

What am I supposed to do, implement a SYCL compiler on my own? Forget it - I'll just keep writing GLSL compute shaders or OpenCL kernels until someone with lots of resources is able to foot the initial bill for a fully functional and open source implementation.


Which is why CUDA won: most researchers can't be bothered to keep writing C-based shaders with printf debugging.


This is wrong - triSYCL is roughly the same age as ComputeCpp, and hipSYCL is only slightly younger. There has been a lot of academic interest in SYCL, but as with any new technology (especially niche technologies) it's always going to take time to get people on board.

Also, from a quick look at your profile, you seem to have quite a lot of comments criticizing or commenting on CodePlay. Do you have some sort of relationship or animosity with them?


I wish all the luck to CodePlay, the more success the better for them.

They are well appreciated among game developers, given their background.

My problem is how Khronos sells their APIs, leaves everyone alone to create their own patched SDKs, and then acts surprised that commercial APIs end up winning the hearts of the majority.

The situation has hardly changed since I did my thesis with OpenGL in the late 90s, porting a particle visualization engine from NeXTSTEP to Windows.

Nothing that compares with CUDA, Metal, DirectX, LibGNMX, NVN tooling.

Hence my reference to CodePlay, as for a very long time their SDK was the only productive way to use SYCL.

Khronos likes to oversell the ecosystem, and the issues and disparities across OEMs usually tend to be "forgotten" in their marketing materials.


Rust has a PTX backend.


This has literally been a back-and-forth argument since a 100-point post on Slashdot was a groundbreaking event. I don't see it changing any time soon - honestly, if anything, on tech forums this argument frequently overshadows just how well NVIDIA is doing.


It is just like game forums as well.

The culture here and on those forums couldn't be further apart.


>> NVIDIA's main competitive advantage over AMD and Intel is its software stack. AMD could release a 2x powerful GPGPU tomorrow for half the price and most current NVIDIA users wouldn't care because what good is that if you can't program it?

I always wonder why it is so hard for AMD to develop a true competitor to CUDA, but for AMD hardware. Not trying to solve GPGPU programming through open standards like OpenCL - just copy the concept of CUDA wholesale. They could still build it on top of LLVM etc. and release the whole thing as open source, but have the freedom to not deal with design-by-committee frameworks like OpenCL, so they can ensure focus on GPU programming and nothing else, and only on those platforms where the majority of the demand is. There is not much wrong with OpenCL; it's just not nearly as good/capable/easy to use as CUDA if all you are interested in is GPGPU programming.

AMD is a big company with a lot of revenue, especially recently, so why would it be so hard to have a team working full-time on creating a direct CUDA knock-off ASAP?


Two thoughts that come to mind:

1. AMD has struggled, in the past and even today, to be profitable with their GPUs. That makes it difficult to entice an army of knowledgeable devs without consistent cash flow. Granted, the tide is turning with their profitable CPU business, and their equity has shot up.

2. More importantly I think that, being the underdog, AMD has to have a cheaper, open solution to compete. Why would a customer choose to go with AMD’s nascent and proprietary stack over Nvidia’s well established and nearly ubiquitous proprietary stack?

To be clear, I don’t think the problems are insurmountable. AMD won a couple HPC deals recently which should afford them the opportunity to build up their software and invest in a competitive hardware solution.


To be fair, Nvidia has open sourced some key libraries lately. See cutlass and cufftdx.


> Intel keeps getting it right with numerical libraries. They're open. They work well. They work on AMD.

What Intel numerical libraries are you thinking of? When I think of Intel numerical libraries, the first that comes to mind is MKL. MKL is neither open-source nor does it work well on AMD without some fragile hacks [0].

[0] https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AM...


Well, OP didn't say MKL works well on AMD. But you can at least run it on a non-Intel CPU. Compare CUDA.


The nvidia pgi compiler compiles CUDA to multi-core x86-64. There are other third-party compilers for CUDA->x86-64 (one LLVM-based one from Intel).

There is a "library replacement" for CUDA from AMD called HIP, that you can use to map CUDA programs to ROCm. But... it doesn't work very well.

NVIDIA also open-sourced CUDA support for Clang and LLVM. So anybody can extend clang to map CUDA to any hardware supported by LLVM, including SPIRV. The only company that would benefit from doing this would be AMD, but AMD doesn't have many LLVM contributors.

Intel drives clang and LLVM development for x86_64, paying a lot of people to work on that.


It sounds like people want nvidia to write drivers for AMD.

This criticism makes even less sense when any bystander could implement CUDA support on AMD by connecting open source software.


> any bystander

You aren't seriously implying than any bystander is capable of extending LLVM to map CUDA to SPIR-V? What percentage of present day gainfully employed software engineers do you suppose even has the background knowledge? How many hours do you suppose the work would require?


If LLVM has a SPIRV backend, probably very little. For a proof of concept, a bachelor CS thesis would probably do.

Clang already has a CUDA parser, and all the code to lower CUDA specific constructs to LLVM-IR, some of which are specific for the PTX backend. If you try to compile CUDA code for a different target, like SPIRV, you'll probably get some errors saying that some of the LLVM-IR instructions generated by clang are not available in that backend, and you'll need to generate the proper SPIRV calls in clang instead.

It's probably a great beginner task to get started with clang and LLVM. You don't need to worry about the C++ frontend side of things because that's already done, and you can focus on understanding the LLVM-IR and how to emit it from clang when you already have a proper AST.


FWIW, there already exists LLVM to SPIR-V compiler: https://github.com/KhronosGroup/SPIRV-LLVM-Translator

Alas, this only supports SPIR-V up to 1.1.


Late response I know, but I would say anyone who needs that feature could learn to do it, at least if they are on Hacker News. Maybe bystander isn't the most accurate term, but certainly anyone with criticism could take the gauntlet.

LLVM is very well documented and so are these standards. The open source community is also huge and full of talented contributors and more are always welcome to join. I think there's a reason why Linux and GitHub exist.

So in short, if it's a question of motivation and it's something you need, then become motivated to make it happen. That's more likely to happen than convincing a company to invest in supporting a competitor.


CUDA appears to have come out well before even OpenCL. I don't see why there would be an expectation that Nvidia would design their framework to work on a competitor's product.


CUDA was also a response to ATI's own proprietary effort, which they eventually gave up on.


ATI came out with CTM, which was just an assembler. CUDA was released a month or so after that. It was a full C compiler and already had a pretty large set of examples and library functions.

I downloaded CUDA about the day it was released, and used it for real some months later when I bought an 8600 GT GPU.

To call CUDA a response to CTM is too much praise for Nvidia, because it suggests that their response included cobbling a compiler and SDK in just a month. :-)


Not on ARM, or POWER, you can't. Why you'd want to run it on AMD, I don't understand. I don't know what fraction of peak BLIS and OpenBLAS get, but it will be high.


> The major breakpoint would be when libraries like Numpy do it natively

That already happened [0]. NVIDIA has a 1:1 replacement for Numpy called CuPy that does this and is what powers their RAPIDS framework (which is a 1:1 replacement for Pandas that runs on GPUs).

Some people were complaining in [0] about CuPy reproducing numpy's bugs..

[0]: https://news.ycombinator.com/item?id=22830201
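For anyone who hasn't tried it, a minimal sketch of what the drop-in usage looks like (assuming a CUDA-capable GPU and the cupy package installed; numpy API coverage isn't 100%):

    import numpy as np
    import cupy as cp  # assumes a CUDA-capable GPU and cupy installed

    x_cpu = np.random.rand(1024, 1024).astype(np.float32)
    x_gpu = cp.asarray(x_cpu)         # copy the host array into GPU memory

    y_gpu = cp.fft.fft2(x_gpu) * 2.0  # same API surface as numpy.fft
    y_cpu = cp.asnumpy(y_gpu)         # copy the result back to the host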


Pretty much everyone these days uses a library for driving the GPU calculations. And they tend to either support multiple hardware targets directly (TensorFlow) or have API-compatible replacements (CuPy/NumPy).

So the lock-in risk here is that you might have to run your stuff on CPU if future NVIDIA GPUs are too overpriced.

I mean, they are super expensive. But there's nothing that comes close to their cuBLAS library in terms of performance. So unless AMD ponies up and hires GPU algorithm engineers, NVIDIA will win simply due to their superior driver software.

I once had to optimize a CPU matrix multiplication algorithm. 10 days of work for a 2x speedup. Now imagine doing that for every one of the thousands of functions in the Blas library...
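To make the gap concrete, here's a rough sketch (timings are machine-dependent, and the naive loop is pure Python, so the gap is exaggerated compared to hand-tuned C, but the point stands: a production BLAS does an enormous amount of work for you):

    import time
    import numpy as np

    n = 128
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    def naive_matmul(a, b):
        # textbook triple loop: no blocking, no vectorization, no threads
        c = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += a[i, k] * b[k, j]
                c[i, j] = s
        return c

    t0 = time.perf_counter(); c1 = naive_matmul(a, b); t1 = time.perf_counter()
    t2 = time.perf_counter(); c2 = a @ b; t3 = time.perf_counter()
    print(f"naive: {t1 - t0:.3f}s  BLAS (a @ b): {t3 - t2:.6f}s")
    print(np.allclose(c1, c2))  # same result, wildly different cost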


Yeah, I think most people don't quite appreciate the difficulty and cost of optimizing for hardware and continually maintaining that through hardware cycles. By keeping things closed source, Nvidia products have the advantages of both being easier to onboard, due to simpler abstractions, and faster technical progress, because there is less pushback from myriad parties when big inconvenient changes need to happen at lower levels for hardware performance reasons - kind of like if instead of x86 we had settled on LLVM.


That is a very good metaphor :)

Actually, I wonder why we went with un-compilable Java bytecode and JIT instead of advancing projects like gcj.


I'm not surprised that expecting to beat implementations of the basic Goto strategy for BLAS didn't turn out well. BLIS only needs a single, pared-down GEMM kernel for level 3, and maybe one for TRSM. (It doesn't currently have GPU support, but I think there was an implementation mentioned in an old paper.)


NVIDIA has no ethical or moral responsibility to give their competitors the benefit of software they have paid to develop in-house. It is probably a safe bet that you yourself do not develop your projects under the Affero GPL, and so on some level you agree with this.

What you see as "ecosystem lock-in" is properly viewed as software that you pay a premium for as part of your purchase price, above and beyond the pricing of the competitor's hardware. NVIDIA costs more than AMD because they have to employ people to write all that software, and you are "buying" that software when you purchase NVIDIA's product.

Analogously - Amiga has no moral responsibility to let you run AmigaOS on anything except their hardware. This sort of "hardware exclusivity" used to be very common and widely accepted. Today, Apple has no moral responsibility to let you run OS X on anything except their hardware (the existence of underground hackintoshing is irrelevant here). The software is part of what you are buying when you buy the product.


It's not a safe bet. I've built projects under the AGPL, and made plenty of money doing it. There are places where open is good business, and there are places where proprietary is good business, and there's everything in between. AGPL was nice since I could be open, which had a huge market advantage, but release code which my competitors would /never/ take advantage of. It had, quite literally, zero downsides, and a lot of upsides.

There are projects where I do 100% proprietary too, and a mix. It's a business decision. It's not as stupid as proprietary=profit and open=charity. It's a business calculation in every case.


Given that I hardly saw any clone vendors other than AMD, I really doubt that they had any influence on Intel's market share.

What worked out was IBM not being able to prevent PC clones, but given the wide adoption of laptops, tablets and phones that hardly matters nowadays.


Intel was forced to license to AMD for government contracts. There's a super-complex story there I won't get into.

There were a few clone vendors aside from AMD. None were ever a serious threat, and AMD itself didn't become more than a bottom-feeder until after maybe 15 years. But their existence did drive a lot of adoption.

And yes, I did oversimplify. MS-DOS, IBM not being able to prevent clones, and so on, all really played together here as part of the same story.


That's why Intel realized that fab technology was the true differentiator.

The only way to outcompete in a sea of clones is to secure exclusive access to a valuable resource they can't get.

Intel with fabs. Dell with lean supply chains. The surviving hard drive and memory companies with scale.

I think IBM and Sun show what happens when you try to fight a stand-up brawl in a commodity space.


>That's why Intel realized that fab technology was the true differentiator.

But now the situation is completely reversed. Intel has faced all kinds of problems, costs, and delays due ultimately to the fact that they made a bad choice on their chip architecture but were forced to make it work because they invested so much in the fab.

What TSMC is fabbing for nvidia is working out really well, and if it was not nvidia could walk away without being stuck with billions of dollars of fab facilities they have to own forever.

edit: reversed is the wrong choice of words. It IS all about the fab, but Intel could not/did not accept that maybe someone else had the key differentiator now.


I think it's the other way around. The architecture was being limited by their fabs' ability to yield large chips, and in the absence of any CPU perf pressure from AMD, the natural push would lean more towards increasing graphics performance in order to push more pixels. As in, I think Intel probably had the same yield issues as everyone else at ~10-32 nm, but only Intel had the high-margin, small-chip volume to make it profitable to ramp - until Apple and TSMC happened.


The architecture is definitely far ahead of anyone else's - look at Intel chips still being competitive despite manufacturing being a generation behind and with 1/6th the cache per core.

I'm an AMD shareholder and my biggest fear is Intel figuring out their manufacturing.


Maybe not recently, but in the years that cemented Intel dominance, there were many clones on the market.


Yeah, but never in an amount that was actually meaningful, only by a couple of non branded PC OEMs.


I think the parent only means that if Intel somehow shut down, decided to radically pivot, or closed everything, then because it is somewhat open you would still have alternatives, and neither your code, your product, nor your company would face insurmountable hardship or die because of it.


Jax is an implementation of numpy on GPU.


Have you heard about CuPy?


Yes. I won't bet my business on a one-vendor solution with a medium-sized community which might disappear at some point.

If CuPy supported NVidia and AMD, and was folded into Numpy, I'd buy the biggest, beefiest GPU I could find overnight.


What technology would you bet your business on then?

Today, you can write numpy code, and that runs on pretty much all CPUs from all vendors, with different levels of quality.

A one-line change allows you to run all the numpy code you write on Nvidia GPUs, which, at least today, are probably the only GPUs you want to buy anyway.
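
For what it's worth, here's a minimal sketch of what that swap can look like in practice (hypothetical example; it assumes CuPy is installed and an NVIDIA GPU with working CUDA drivers):

    # Minimal sketch (hypothetical): the same array code on CPU and GPU.
    # Assumes `pip install cupy` and an NVIDIA GPU with CUDA drivers.
    import numpy as np
    import cupy as cp

    def mean_identity(xp, n=1_000_000):
        # xp is either numpy or cupy; the body is identical for both
        x = xp.random.rand(n)
        return float((xp.sin(x) ** 2 + xp.cos(x) ** 2).mean())

    print(mean_identity(np))  # runs on the CPU
    print(mean_identity(cp))  # same code, runs on the GPU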

In practice, you would probably be also running your whole software stack on CPUs, at least for debugging purposes. So if you change your mind about using nvidia hardware at some point, you can just revert that one line change and go back to exclusively targeting CPUs. Or who knows, maybe some other GPU vendors might provide their numpy implementation by then, and you can just go from CuPy to ROCmPy or similar.

Either way, if you are building a numpy stack today, I don't see what you lose by using CuPy when running your products on hardware for which it's available.


shrug I'll bet my business on waiting an extra 15 minutes for analytics code to run.

Seriously. There's little most businesses really need that I couldn't do on a nice 486 running at 33MHz. Now, if a $5000 workstation gives even a 5% improvement to employee productivity, that's an obvious business decision. That doesn't mean it's necessary for a business to work. So dropping $1000 on an NVidia graphics card, if things ran faster and there were no additional cost, would be a no-brainer.

There are additional costs, though.

And no, you can't just go back from faster to slower. Try running Ubuntu 20.04 on the 486 -- it won't go. Over time, code fills up resources available. If I could take a 2x performance hit, it'd be fine. But GPUs are orders-of-magnitude faster.


Please, show us how to train Alexa or BERT on a 486. That'll definitely win you the Turing Award and the Gordon Bell Prize, and probably the Nobel Peace Prize for all those power savings!


Please show me a business (aside from Amazon, obviously) who needs Alexa.

Most businesses need a word processor, a spreadsheet, and some kind of database for managing employees and inventory. A 486 does that just fine.

Most businesses derive additional value from having more, but that's always an ROI calculation. ROI has two pieces: return, and investment. Basic business analytics (regressions, hard-coded rules, and similar) have high return on low investment. Successively more complex models typically come at exponentially growing cost in exchange for diminishing returns. At some point, there's a breakpoint, but that breakpoint varies for each business.

If the goal is to limit GPGPU to businesses whose core value-add is ML (the ones building things like Alexa), NVidia has done an amazing job. If the goal is to have GPGPU as common as x86, NVidia has failed.


> Please show me a business (aside from Amazon, obviously) who needs Alexa.

I'll bite.

Have you ever been getting a haircut, and the hairdresser had to stop to pick up the phone to make an appointment?

Have you ever gone to pick up a pizza at a small pizzeria and noticed that, of the four employees, three are making pizzas and one spends 99% of their time on the phone?

Every single business that you've ever used in your life would be better off with an Alexa that can handle the 99% most common user interactions.

In fact, even small pizzerias and hair salons nowadays use third-party online booking systems with chat bots. Larger companies are able to turn a 200-person call center into a 20-person operation by just using an Alexa to at least identify customers and resolve the most common questions.


CuPy has experimental support for ROCm.


Awesome! I did not know that.


The large majority of researchers and businesses getting into NVidia products don't seem to find it that relevant; what matters is which tools, GPU programming languages and hardware they are able to get their hands on.


It's irrelevant to researchers. Research operates on rapid cycles: prototype, publish, move on.

It does impact businesses. It doesn't prevent adoption for e.g. deep learning, but I haven't seen e.g. GPU-based databases reach broad adoption, or many other places where MIMD/SIMD would reduce costs or improve performance. Using classical hardware is clearly cheaper than the business risk and engineering time of relying on a proprietary, closed hardware solution.

I'm at the edge, where my workloads don't require GPU, but could benefit from it. This sort of thing factors into decision-making. I dabble in GPU, but never beyond prototypes, for those reasons.

I think this is one of the reasons why these devices haven't reached widespread marketshare. Most computers sold have an integrated chipset. People buying NVidia GPUs are researchers (who don't care), deep learning applications (who don't have a choice), and gamers. There have been predictions for two decades that GPU-style SIMD and MIMD architectures would displace the centrality of the CPU.

Technically, it makes sense. If I type a list comprehension in Python, it would run at higher speed and lower power on a SIMD or MIMD platform.
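
To make that concrete, a hedged sketch of the kind of rewrite I have in mind (illustrative only; the second form is what maps naturally onto data-parallel hardware):

    import numpy as np

    data = np.random.rand(10_000_000)

    # Scalar list comprehension: one element at a time, interpreter-bound.
    squares = [x * x for x in data]

    # Vectorized equivalent: a single data-parallel operation over the array,
    # the shape of work SIMD/MIMD hardware is built for.
    squares_vec = data * data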

I think the reason that hasn't happened is because x86 and x64 are open and widely-supported. NVidia is a walled garden, and is only practical for markets NVidia explicitly targets.

This story plays out over and over. Business people push for closed. Eventually, open comes along, and wipes it out. Sometimes, as with x86 or the iPhone, that leads to increased profits. Sometimes, as with Wikipedia, that kills businesses.


NVidia is not to blame if the competition is stuck using C and printf debugging for compute shaders, and cannot make up its mind about which bytecode to support for heterogeneous GPGPU programming.

The situation is so bad that OpenCL 1.2 got promoted to OpenCL 3.0 and SYCL is now backend-independent, while HIP only works on Linux.

As for Python, guess who is at the forefront of GPU programming with Python:

https://www.nvidia.cn/gtc/session-catalog/?search=python

41 results, including CUDA based JIT improvements.

Meanwhile, at IWOCL & SYCLcon 2020,

https://www.iwocl.org/iwocl-2020/conference-program

2 sessions, where it is mentioned that PyFR might need OpenCL 3.0 extensions going forward.

So if competition is not able to provide, most just get to buy NVidia.


> NVidia is not to blame if the competition is stuck using C, printf debugging for computing shaders, cannot make their minds about which bytecode to support for heterogenous GPGPU programming.

I don't buy this argument.

Nvidia made close to no effort to support OpenCL and promoted their own technology, CUDA. Even in 2020, OpenCL support for Nvidia hardware is close to nonexistent.

When the main actor in the market does not support a technology, why the hell would you use it or even develop its ecosystem (SYCL)?

> The situation is so bad that OpenCL 1.2 got promoted to OpenCL 3.0 and SYCL is now backend-independent.

OpenCL 3.0 was presented by an Nvidia official (https://khr.io/ocl3slidedeck) who heads the working group on it. A company that made close to zero effort to support OpenCL 2.0 gets to revert the spec to 1.2. Astonishing, right?


OpenCL also isn't supported on Android, where Google pushes RenderScript, their own C99 dialect, instead, yet I don't see any uprising against Google.

If the 139 member companies listed here (taking NVidia out) aren't able to provide the same quality of hardware, programming-language and ecosystem improvements as NVidia, and they vote for an NVidia employee as chairman, then they deserve what they get.

https://www.khronos.org/members/list

It is so easy to find a villain instead of acknowledging failure.


OpenCL used to be supported on Android (but not required by Google). Currently, Vulkan _is_ required by Google[1] and is probably the future path to GPU compute.

[1] https://www.androidpolice.com/2019/05/07/vulkan-1-1-will-be-...


OpenCL is only supported via hacks that install shared libraries onto one's own device.

Vulkan, even if optional until Android 10, has been supported by the SDK since version 7, which is something OpenCL never had.

No serious Android developer would make their life even harder than it already is with the official APIs, by making use of an API that requires device owners to manually install libraries via ADB.


Hmm, is the OpenCL situation really that fringe on Android? E.g., this OpenCL info app[1] description says "Even though OpenCL™ isn't part of the Android platform, it's available on many recent devices. On Android it's usually used as a back-end for other frameworks like Renderscript. Some manufacturers are providing SDKs for developers to use OpenCL™ on Android."

In addition to the mentioned PowerVR and Intel platforms, there seems to be Android OpenCL support for Mali and Adreno GPUs as well.

[1] https://apkpure.com/opencl%E2%84%A2-info/com.xh.openclinfo


Yes it is; it's not exposed via the SDK, regardless of how RenderScript ends up being compiled to machine code.

You won't find any reference to OpenCL here,

https://source.android.com/

https://developer.android.com/

So while those SDKs do exist, they aren't for application developers, rather for OEMs themselves.

You as an application developer have zero control over what GPUs your customers might be using; there is no way to control it in Android manifests, only to specify which APIs are expected, and those, again, don't include OpenCL.

https://developer.android.com/guide/topics/manifest/manifest...

So if you as an application developer want to use said SDKs, it's only for your own device, most likely rooted; don't expect to sell applications on the store using OpenCL.


Yep, you of course can't rely on OpenCL support; you have to provide a fallback. But are there any issues with including the SDK-supplied OpenCL ICDs with the APK, for use with validated GPUs?


Yes, that is not how Android works.

You are not allowed to ship drivers like that, and since version 7 there is kernel validation of stable NDK APIs.

So yeah just because there is an OpenCL SDK for Mali, doesn't mean that a random Android device with Mali will have the drivers or kernel support in place, because that isn't something that Android requires.

Google has collaborated with Adobe in porting their OpenCL shaders to Vulkan.

If they actually cared, they would have made OpenCL available instead.


I see, thanks for the explanation. Oh well, let's see where things go with Vulkan compute.


> I don't buy this argument.

I don't think NVidia's half-hearted support for OpenCL has anything to do with the argument at all.

It's a really good point that other tool vendors haven't stepped up and provided modern development tools for OpenCL (or any of AMD's various attempts), while NVidia has great dev tools.


It's not about blame. It's about getting from an ecosystem where GPGPU is used only for deep learning, bitcoin mining, video encoding, and similar niche applications to one where I can fluidly use MIMD to speed up my JavaScript and Python with first-class language constructs to support that.

If that happens:

1) We'll get back on some kind of curve where computer performance starts increasing again.

2) The GPU will become more important than the CPU, and the market will explode.

Until that happens, the GPU market will be for gamers, video editors, and machine learning nerds.

I don't much care who does that, or why it hasn't happened.


It is easy to talk about the proprietary practices of NVidia, yet none of the other GPGPU device makers on Khronos were able to offer a better experience.

So Khronos has 140 members, about 10 of them producing hardware, and they can't provide a proper developer experience, with software that looks like EE toolchains of the 90's.

The market has already exploded, and CUDA has won.


You're missing the point, and taking everything as an attack on NVidia. It's not an attack on NVidia, and it doesn't have anything to do with NVidia versus Khronos. It's clear you've got enough baggage there that I won't go there (not like I was trying to go there in the first place).

But the market hasn't exploded. Most computers have built-in Intel graphics, and most apps can't make use of GPU. NVidia won the battle with AMD, but lost the battle with Intel. GPUs are still for gamers, deep learning applications, bitcoin miners, video editors, and a few other niche applications.

Given that CPUs aren't increasing in speed, and GPGPU is failing to make in-roads, for most workloads, computers are only marginally faster than they were a decade or two ago. If GPGPU made inroads into general computing, we'd still be on a Moore's Law curve, but that hasn't happened.

That's the problem.


Not at all; my complaint is the poor service Khronos keeps providing by pushing their half-baked APIs and expecting OEMs to pick up the tooling part.

What ends up happening is that OEMs, traditionally coming from the EE and embedded mindset, almost never provide any tooling worth using.

This is why platform APIs always end up winning the hearts of most developers who aren't into the FOSS mindset.

Interesting that you mention Intel; they would rather have you using oneAPI or ISPC instead of pure Khronos APIs.

And Intel keeps failing at their GPU story anyway.


> but I haven't seen e.g. GPU-based databases reach broad adoption

NVIDIA announced today that Apache Spark 3.0 is built on RAPIDS and showed some benchmarks claiming same perf at 1/5 the cost.

Not your classical database application though.


No, they have built a GPU-accelerated XGBoost library.

Spark 3.0 is in preview stage already and there is no mention of RAPIDS anywhere in the code.


There's definitely a chicken-and-egg problem of the current user and developer base of GPGPU apps being a very tolerant bunch. "It's just a flesh wound," they say about a lot of things that are prohibitive to normal app developers. Maybe it's the natural order of things, or maybe the "incompatible proprietary C++ dialects with different kinds of crashy drivers on each OS" approach will be sufficiently unpalatable to some future generation of programmers.


GPU based databases haven't reached broad adoption because sending things over the PCIe link is a huge waste of time if you can avoid it. Working around this with custom designs like NVLink/NVSwitch is ridiculously expensive (and why a DGX costs a gajillion dollars), and there is simply not enough volume to subsidize it. They are largely analytics focused, because the parallel hardware can obviously map onto primitives like sequential scan and filters relatively easily. Furthermore, data sizes are not small. Thus the architectures tend to emphasize things like in-memory (VRAM) workloads that get scaled horizontally via RDMA (or RoCE, whatever people are doing these days), which is expensive and limited. Major businesses (i.e. people with money, who Nvidia are targeting) already pay for proprietary databases, regularly, every day. That's not the barrier. All of the actual true secret sauce is in the hardware design, and you can't replicate that. You're always at Nvidia's mercy to design solutions to their customers' needs. (And frankly, they've done that pretty well, I think.)

Sure, you can pay almost $10,000 per Tesla V100 (which aren't going to become magically cheaper, all of a sudden), and buy 8x of them. That's a 256GB working set, for the price of like, $70k USD. It might make sense for some things. For everyone else? Pay $30,000 for a single server, run something like ClickHouse, and you'll have a better overall TCO for a vast majority of workloads. It'll saturate every NVMe drive and all the RAM (terabytes) you can give it, and will scale out too. It's got nothing to do with openness and everything to do with system architecture. You can replicate all of this with whatever AMD has and it won't make a single bit of difference in the market.

I don't like the fact Nvidia keeps their software closed either (and in fact it was a motivating reason for replacing my old GTX in my headless server with a Radeon Pro card recently), but the problems you're talking about are not ones of openness.

> If I type a list comprehension in Python, it would run at higher speed and lower power on a SIMD or MIMD platform.

I think you vastly underestimate the complexity of these platforms and how to extract performance from them, if you think it's as simple as your list comprehension going faster now and you hang up your coat and you're done. Sure, when you're experimenting, that 5x raw wall clock time improvement is nice, and you don't think about whether or not you could have done it with comparable hardware under a different cost profile (5x faster is good, but 5x longer wall clock than the GPU but 15x lower power is a winner). But when you're paying millions of dollars for these systems, it's not a matter of "how to make this thing faster", it's "how do I utilize the resources I have, so 85% of this $300,000 machine isn't sitting idle". This thinking is what drives the design of the overall system, and that's much more complicated.


I don't underestimate the complexity. But I do claim that the complexity can and should be hidden behind programming language constructs. I've worked both on the design of MIMD hardware, back when I was a graduate student, and on programming languages. These aren't easy problems, but they are solvable.

The reason for openness isn't abstract. I don't think NVidia will solve these problems alone. NVidia can make really good tools for a few specific domains, but generalizing to how we apply this to JavaScript, databases, or Python interpreters requires an open community approach. It requires a lot of people experimenting and dabbling.

It's kind of like Nokia and friends thinking they could solve the problem of building phone apps alone. When Apple launched the iPhone, and there was a community pushing things forward, we were in a whole new world of progress.

I would argue NVidia underestimates both the potential and the complexity if they think they can go it (relatively) alone, come up with the right programming constructs, and provide the right set of tools for programmers to consume.


Except there is a community, a CUDA community, and judging from GTC sessions, a very big one.

Ironically, this walled garden, as you put it, has produced more programming languages and tooling for GPGPU programming than the open, design-by-committee conglomerate at Khronos, which kept pushing its C mantra until it was too late, has been able to achieve together against a single company.


Yes. To have openness work, you need to execute well too.


I'm not making a claim about the necessity of experimentation. I spent years (including at a paid job) doing programming language work, and I also design hardware these days in my spare time, so I'm not against that. I'm specifically addressing the claim that "GPU databases haven't taken off because of lack of open source CUDA" or whatnot. Database tech is one of the most R&D-heavy engineering subfields; almost all major innovations come from it. The points I made are not coming from thin air, they're the result of people (engineers) doing a lot of experimentation and coming to similar conclusions for many years. You don't need open source designs to prove this, by the way, you only need to do basic napkin math about the characteristics of the system, and how data moves around, to come to similar conclusions. You need a correct (and I hate this word) synergy of hardware, software, and programming model to do it. A programming language, a new model, does not change the theoretical bandwidth of PCIe 3.0, or the fact you have a memory hierarchy to optimize for best performance. Just having one and none of the others, or having lopsided characteristics, isn't sufficient, and innovations across the stack are one of the major things people are reaching for, in order to differentiate themselves.

That said, I agree and would love to see less crappy programming models here. As a PL geek, I have numerous reasons why I think that's necessary. It really needs to be easier to compose sets of small languages, and design them -- one for designing streaming systems, one for latency sensitive ones. They need to model the memory hierarchy available to us (a huge thing most do not do, and vital to system performance.) I'd love this. But it doesn't undermine anything I said earlier about why things are the way they are, today. No amount of fancy programming languages is going to change the fact a $10,000 Supermicro server is more cost effective than $70,000 worth of V100s for 90% of OLAP workloads you'd want a database for. Engineers design accordingly.

There is also the problem of needing huge amounts of capital, where most of this work can only be done by exceedingly well funded groups with deep ties to hardware divisions in question. The future of hardware innovation comes from billion dollar companies, because only they can sustain it, not plucky engineers. Sure, for us, CUDA being open source would be awesome. But you don't really need open source drivers when you're working directly with the vendor on your requirements and you pay them millions for support and you just use Linux for everything. You just let them solve it and move on. The engineering world is designed this way (both by engineers, and by capitalists), because it is how we make money from it in a capitalist society!

> I would argue NVidia underestimates both the potential and the complexity if they think they can go it (relatively) alone, come up with the right programming constructs, and provide the right set of tools for programmers to consume.

Nope. Nvidia understands that they alone may not hit a global optimum or whatever in all these fields. I suspect given that they have entire divisions of highly skilled engineers dedicated to programming tools -- they understand it better than either of us. But what they also understand is that their software stack is a differentiator for them, because it actually works (the competitors don't) and it makes them money to keep it that way. You're confusing a technical problem with one of politics and vision -- a categorical mistake about their priorities and where they lie. I don't want to sound crass, but people saying "I would argue that I, the sole, lone gun engineer, understand their business and future and everything way better than they do" is typical of engineers, and it is almost always a categorical mistake to think so.

Nvidia fully understands that maybe some nebulous benefit might come to them by open sourcing things, maybe years down the line. They understand plucky researchers can do amazing things, sometimes. But they understand much better that keeping it closed makes them money and keeps them distinct from their competitors in the short term. If you think this is a contradiction, or seemingly short sighted: don't worry, because you are correct, it is. What is more "surprising" is recognizing that all of capitalist society is built on these sorts of contradictions. I'm afraid we're all going to have to get used to waiting for FOSS nvidia drivers/CUDA.

EDIT: I'll also say that if this changes from their "major open source announcement" they were going to do at GTC, I'll eat my hat. I'm not expecting much from Nvidia in terms of open source, but I'd happily be proven wrong. But broadly I think my general point stands, which is that thinking about it from the POV of "open source drivers are the limitation" isn't really the right way to think about it.


Yes and no.

I wouldn't do mid/low storage tiers in a GPU b/c indeed, drinking through a straw. When it's all I/O, even the insane GPU bandwidth still assumes enough compute to go with it. A couple of GPU vendors pitch themselves as GPU DBs, and that's tough positioning when the assumption is all the data lives in the DB. From what I can tell, that only works for < TB in practice, and ideally < 10GB with few concurrent users.

But if you're doing a lot of Spark/Impala/Druid style compute, where storage is probably separate anyways (parquets in HDFS/S3 -> ...) and there is increasingly math to go along with it (analytics, ML, neural nets, data viz, ...), different story. Now that stuff like regex is pretty easy with RAPIDS, instead of doing pandas -> spark or pandas -> rapids, I try to start with cudf to begin with. (But definitely still not quite there.) We partner a bunch with BlazingSQL here, and they've always been chasing the out-of-core story. A couple of the lesser-known GPU 'DB's do as well, such as FastData focusing explicitly on replacing spark/flint wrt both batch & streaming.

A few trends you may want to reexamine the #s on:

-- CPU perf/watt (~= perf/$) vs GPU perf/watt (~= perf/$), especially in cloud over last 10 years: GPU is steadily dropping while CPU isn't

-- CPU-era Spark and friends are increasingly bound by network I/O, while GPU boxes go for thicker. You can also do Spark on a thicker box, but at that point, might as well go shared GPU and keep it there (RAPIDS)

-- Nvidia & cloud providers have been pushing on direct-to-gpu and direct gpu<>gpu, including at commodity levels. Mellanox used to be a problem there, and now they control them. My guess is the bigger challenge in ~2yr will be rewriting RAPIDS for streaming & serverless & more transparent multi-GPU; the HW is hard but seems more predictable and much better staffed.

GPU isn't an end-all, but when a lot of CPU data libs are going data parallel / columnar, and Nvidia is improving more than Intel for perf/watt (= perf/$), the choice between multicore x SIMD vs GPU keeps tilting in Nvidia's favor.


> There is also the problem of needing huge amounts of capital, where most of this work can only be done by exceedingly well funded groups with deep ties to hardware divisions in question. The future of hardware innovation comes from billion dollar companies, because only they can sustain it, not plucky engineers. Sure, for us, CUDA being open source would be awesome. But you don't really need open source drivers when you're working directly with the vendor on your requirements and you pay them millions for support and you just use Linux for everything.

I think the exact same argument could be made for mainframes and microcomputers before we standardized on x86. RISC architectures were cheaper and faster in the eighties and nineties than CISC, but x86 cleaned up because it was standard and had an ecosystem. NVidia is limiting its ecosystem to everyone who needs HPC, where the ecosystem should be everyone (no qualifier). All computers could benefit from a massively parallel MIMD co-processor.

> But what they also understand is that their software stack is a differentiator for them, because it actually works (the competitors don't) and it makes them money to keep it that way.

And I think Symbian made the same argument before being steamrolled by iOS and Android. And I've seen the same argument made by business folks at several businesses I've worked at.

By the way, "open" doesn't mean it's not okay to keep some pieces proprietary. NVidia can keep their differentiator by keeping key algorithms proprietary, while making the architecture open, and developing a common set of cross-platform APIs to target that architecture. For example, a cell phone maker can open source most of their OS, but keep pieces like the fancy ML integrated into their photography app (and other similar pieces) proprietary.

> Nvidia fully understands that maybe some nebulous benefit might come to them by open sourcing things, maybe years down the line.

I think you hit the nail on the head here. The benefits of open feel nebulous; it's a long-tail effect and difficult to quantify. It also takes time. On the other hand, the benefits of proprietary are short-term and easy to quantify. Wrong business decisions get made all the time. Indeed, bad business decisions sometimes get made where everyone can tell it's the wrong decision -- it's just org structures are set up to make those decisions. I think this isn't me claiming to be brilliant or smarter than NVidia so much as NVidia failing in the same exact way many organizations fail, by the design of the org structures.

> They understand plucky researchers can do amazing things, sometimes.

It's actually not just about amazing things. It's about a long tail of dumb stuff too. My phone has a few apps better than Google could build. It has dozens of apps Google chose not to build. Most of the stuff I want to do isn't big enough to ever show up on NVidia's radar, but there are a lot of people like me. Symbian didn't make a piano tuner app. It's not hard to make one. I have one, though.

Of course, there are brilliant pieces too. I have some VR/AR apps on my phone which Google would need to invest a lot of capital to make.

> EDIT: I'll also say that if this changes from their "major open source announcement" they were going to do at GTC, I'll eat my hat. I'm not expecting much from Nvidia in terms of open source, but I'd happily be proven wrong. But broadly I think my general point stands, which is that thinking about it from the POV of "open source drivers are the limitation" isn't really the right way to think about it.

I'm not holding my breath for NVidia to change. But I do hope at some point, we'll see a nice, open MIMD architecture which gives me that nice 10-100x speedup for parallel workloads. I actually couldn't care less about whether that speed-up is 50x or 100x (which is where NVidia's deep R&D advantage lies). That matters for bitcoin mining or deep learning. For the long tail I'm talking about, the baseline speedup is plenty good enough. The cleverness doesn't come from pushing extra CPU cycles out, but in APIs, ecosystem building, openness, standardization, etc. That stuff is a different kind of hard.


Hypothetically, from an ISA perspective, why couldn't Intel and AMD extend x86-64 more fully with SIMD / MIMD instructions? (as in, way more fully than MMX / SSE / AVX)

Naive question, because I literally don't know the link between CPU instruction stream and GPGPU instruction stream.

But it seems like there would be an opportunity to seize the higher (open) ground at the ISA level, and then force Nvidia to implement its own support for that standard.

With the point of being able to run identical code across CPU / CPU-with-embedded-GPU / CPU + GPU.

Understand we're talking about mind-boggling levels of complexity here, but it feels like the CPU shops ceded the role of graphical ISA to Nvidia & Microsoft (DirectX).


GPUs have gone far beyond just SIMD these days. To effectively program a GPU, you need to program it like a GPU, not a CPU. In particular, while most people are aware that GPUs don't like branching at a high level, branching can actually be fine as long as each block (small group of processors in the GPU) takes the same branch. Block 1 taking the branch while block 2 does not will have little impact on performance. Additionally, the memory hierarchy is completely different for GPUs, with blocks sharing cache and a huge number of registers per core while having very little memory for a typical stack.
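
To illustrate the branching point, here's a hedged sketch (using CuPy's RawKernel purely as a convenient way to launch a CUDA kernel from Python; the kernel and names are made up for illustration):

    import cupy as cp

    # Neighboring threads take different branches, so each warp executes
    # both paths serially (divergence). Branching on blockIdx.x % 2 instead
    # would be uniform within each block and essentially free.
    divergent = cp.RawKernel(r'''
    extern "C" __global__
    void divergent(const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0) { y[i] = x[i] * 2.0f; }
        else            { y[i] = x[i] + 1.0f; }
    }
    ''', 'divergent')

    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.empty_like(x)
    divergent((n // 256,), (256,), (x, y, cp.int32(n)))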

Sure, treating a GPU as a SIMD blackbox may work for many problems as a suitable abstraction, but in doing so you also overlook many of its capabilities. x86-64 can emulate many of the SIMD aspects without too much trouble, but the aspects like huge number of processors with many registers is not something that is able to be reproduced without a large number of tradeoffs.

The only way that I see it as being possible to have a true CPU/GPU hybrid that is effective would be to basically have two separate chips for the GPU and the CPU, maybe multiple chips. I think the reason why such a product has not really taken off is because at that point there really isn't a point over using it versus a separate GPU and CPU. Maybe if hardware designers figured out how to greatly improve CPU to GPU communication in such a setup over having the motherboard in-between it might be worth it. CPU to GPU communication is a bottleneck for many applications.


Learning about SIMT helped my understanding of the differences.


That is what Intel tried to do with Larrabee, and it failed spectacularly.


Intel actually failed at least twice at this. More recently with Larrabee and the Xeon Phi saga that got us AVX-512, and maybe Intel Xe saves it.

But in the late 80s and early 90s they actually had the i860 and its variants: https://en.wikipedia.org/wiki/Intel_i860

which contained a graphics processor! That wasn't a particular success but it landed us the MMX ISA on x86.

Surprisingly enough, Intel then went on to make similar mistakes with Itanium, and then with Larrabee. But both things landed us features on x86.


I wouldn't say they failed. I'd say they gave up on it before product maturity.


I'd say they aimed for the wrong market (graphics processing, where they were competing against very specialized and experienced competitors) and failed to partner.

Maybe Intel ~1998 could have solo-launched a new architecture, but the only way they'd get uptake now is something in cooperation with AMD.

And maybe the AMD partnership bridge is burnt from previous shenanigans, but it seems like both AMD and Intel would have incentive in more tightly coupling graphics compatibility to CPU ISA, vs Nvidia designing their own.

That said, in that hypothetical reality, Nvidia wouldn't have been able to innovate and execute nearly as fast as they have.

As one of my Comp-E professors once quipped, "If a structural engineer ever tells you programming close to processors is easy, ask them how they'd like their job if the physical properties of lumber changed every 2 years."


> I'd say they aimed for the wrong market (graphics processing, where they were competing against very specialized and experienced competitors)

Also, the game they were using to validate the performance of their hardware (Doom, I think?) ended up being so different from other software in how it used the GPU that their optimizations didn't really transfer.


They pivoted it to HPC (the Xeon Phi product line), produced a couple of generations of products, but that didn't really pan out either so they cancelled it.

I suppose some of the "DNA" lives on in AVX-512..


> why couldn't Intel and AMD extend x86-64 more fully with SIMD / MIMD instructions

I think there is a latency vs bandwidth trade-off: a CPU favors lower latency and a GPU higher bandwidth, and you can't achieve both with a single chip.


I guess this is fundamentally a homogeneous vs heterogeneous ISA question. I.e. is your ISA intended to operate one chip, or multiple cooperative chips / complexes?


Does it make sense for a hardware ISA to express cooperation between chips? I would think a HW ISA is meant to control its local microarchitecture. I could see a virtual ISA or compiler IR built with a multi-chip view.


GPUs are all a single chip atm.


> Technically, it makes sense. If I type a list comprehension in Python, it would run at higher speed and lower power on a SIMD or MIMD platform.

That assumes the operations, data types, and data arrangement are such that they are vectorizable. In databases at least this kind of optimization is typically done by hand because there are so many constraints. You can't just turn on a flag in the code and make it happen automatically. [1]

[1] https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf


> I haven't seen e.g. GPU-based databases reach broad adoption

Because it's very questionable whether GPU-based databases are generally valuable. GPUs accelerate compute, not all of the other things that databases do and often GPUs are not cost effective.

> I dabble in GPU, but never beyond prototypes, for those reasons

And I bet if you dabbled a little further you still wouldn't use it because it isn't cost-effective outside of intensive-compute applications.

> Technically, it makes sense. If I type a list comprehension in Python, it would run at higher speed and lower power on a SIMD or MIMD platform.

This is a very questionable claim. There are tradeoffs to these things (e.g. clock-speed, startup time, etc) and your list comprehensions are probably not compute heavy enough that the tradeoffs are worth it.


You're making a lot of lousy assumptions. As a few points of reference:

* My list comprehensions run over gigabytes of data (but sometimes 3 orders of magnitude bigger or smaller). Stream processing of big data. It's not deep learning, but it's slow and potentially deeply parallel. It would move to MIMD trivially, and SIMD with just a little bit of work.

* There are programming languages which support models almost exactly like this for data processing. Sun Labs Fortress comes to mind as an early example. This would generalize to a lot of contexts -- much smaller than you're giving credit for.

* Most of the issues, like startup times, are implementation-specific, rather than fundamental, and could be mitigated for much smaller data too. You do need to wrap your head around changes in programming paradigms to make that work. There is some overhead for latency (you'll probably never do well moving a list with 10 items to a GPU), but most of those aren't where programs are performance-bound.

* Many database operations map very well to MIMD.

* You're making deep assumptions that you're talking to an idiot. That doesn't make you look smart or right, or lead to a constructive discussion.


If this is the workload you are looking at, then you really should look at CuPy.

I know you discard this because you don't like the NVidia dependency, but it's not much different from switching a BLAS library or using Intel's numerically optimised Numpy distribution. Your code remains the same; you just change the import and get magic speed.

If you still refuse to look at it, then perhaps consider cuBLAS[1], which you can switch out for any other BLAS library (e.g. [2]). It's one thing that AMD actually has bothered to do; they have versions available for CPUs[3] and GPUs[4].

[1] https://developer.nvidia.com/cublas

[2] https://towardsdatascience.com/is-your-numpy-optimized-for-s...

[3] https://developer.amd.com/amd-aocl/blas-library/

[4] https://rocmdocs.amd.com/en/latest/ROCm_Tools/rocblas.html


You're right about my assumptions, I apologize. Without knowing specifically what you do, I can't say if GPUs would make sense.

But I don't agree that many workloads could be moved to GPUs cost effectively. It's very hard to feed work in fast enough to keep the GPU busy enough to be cost effective given the limited amount of GPU memory you have to work with.


Well, there's a question of whether the GPU has to be fed fast enough. My CPU is nominally rated at around 150-200 gigaflops. My GPU is rated at about 5 teraflops. That's about a 30x speed difference (and there are obviously faster GPUs out there). That's enough to move me from compute-bound to IO-bound and make things a lot faster. Once I'm IO-bound, I'll obviously see no more performance increase, but I figure I'll get a good 5-10x before I get there.

Right now, code runs anywhere from a few seconds to overnight, depending on what I'm doing. I'll also mention I'm working on many projects, so that's not overhead I incur every day, just once in a while.

Moving to GPU would move that to running anywhere from more-or-less instantly to an hour, I figure, based on similar very back-of-the-envelope benchmarks and guesstimates. That's totally worth dropping $1000 on a new GPU, if that's all it took and things worked out-of-the-box. It'd pay for itself in a few programmer hours.

On the other hand, that's totally not worth weeks of programmer / dev-ops time for switching to a proprietary tool chain. An alternative there is to wait for my computation, or to optimize my code. Both of those seem cheaper than maintaining a GPGPU workflow, where GPGPU is right now. If GPGPU came batteries-included in Ubuntu+numpy, it'd be an entirely different story.


Agree. The reason GPUs are not widely adopted in certain areas is not because of open sourced or not, but because it is not cost effective. GPUs are optimized for throughput, not latency.


I suspect some of it is driven by trying to keep the gaming/ai and desktop/server markets from overlapping. Market segmentation. If it were more open, that would be harder.


I wanted to do GPU PCI passthrough in a VM (run Linux host, then for gaming run a Windows VM with the GPU passed through to get good performance). Nvidia disabled this for their consumer GPUs; the Nvidia drivers in the Windows VM will block this from working. It was a purely software thing; there was no reason for this aside from nvidia wanting companies to pay more for the Quadro/etc. GPUs.

In addition to that, there's always the proprietary blob running in my kernel.

So a few months ago I bought an AMD card for my new computer.


GPU passthrough is also doable pretty easily on NVidia nowadays. See here: https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVM...

/r/VFIO on Reddit is also pretty helpful.

That being said, I fully support you buying and using AMD. But no need to throw out perfectly fine hardware in case you still have NVidia lying around.


> GPU passthrough is also doable pretty easily on NVidia nowadays.

By actively working against Nvidia who could break it again at any time if they wanted to:

> Starting with QEMU 2.5.0 and libvirt 1.3.3, the vendor_id for the hypervisor can be spoofed, which is enough to fool the Nvidia drivers into loading anyway.

If you already have Nvidia, fine, but to me this reads as a strong reason to not buy Nvidia if you can help it.


Oh, don't get me wrong, I surely don't want to encourage you to buy Nvidia.

To be fair here, AMD also has some gripes with VFIO: Namely, the reset bug on Navi (which I personally didn't experience, but read about quite a few times) and disabled vGPU support on their smaller cards, which is, as far as I know, only a software solution and not really something that would steal their business customers either.

Still, I'm rooting for AMD, if only for the fact that they're the reason it doesn't take six CPU generations any more to have a 50% performance bump.


Do you have a link for their open source announcement?


In my experience, Jensen Huang's keynotes are unprofessional in the best possible way.

I remember thinking during an entire GTC presentation "Wait, this guy is the CEO?"

He seemed like an excited engineer who happened to stumble onto stage.


Love his informal style. This video from him presenting at Stanford about the beginnings of Nvidia is so awesome exactly for that reason: "...You talk about that for about 6 months. The big event for the day would be 'Hey, where do you guys want to go for lunch?'" Beautiful when you pair it with the fact that they built a company worth many billions of dollars.

https://youtu.be/yU3GUHDf0mk


I thought he was a nerdy businessman trying to be cool, and it was/is cringey, but years ago I saw him talking with TSMC's founder at the Computer History Museum, and since then I've thought highly of the guy and really appreciated him. There's a dose of genuineness behind him and you can tell.


I think that's as accurate of a description as I can imagine.

Most CEOs are nerdy businessmen trying to be cool. Jensen comes across more like a cool nerd trying to be business-y.


Did you see his video before the keynote? I had a good chuckle.

https://www.youtube.com/watch?v=So7TNRhIYJ8


That’s a fancy kitchen.


And he sure has a LOT of really nice spatulas!

His Spatula City Frequent Buyer Card must have a lot of stamps on it.

https://www.youtube.com/watch?v=2XbCWmY0eqY


Those kids bouncing up and down in the back seat with no seatbelts on looks like madness.


It's like the episode of Mad Men where the kids are playing spaceman: https://youtu.be/2XbCWmY0eqY


I'd rather watch him than those fake conversations on stage done by professional CEOs, or those guys in fancy suits who feel like someone from a TV sales channel.


> He seemed like an excited engineer

Is this a bad thing? I personally avoid presentations by CEOs of big corporations, as they're usually a bingo card of all the trendy buzzwords that in the end have no meaning.


Yes and no.

When a kludged together demo fails and there's an awkward moment, sometimes a little more prep might be nice.

But on the other hand, I feel better about a company focused on doing actual work rather than polishing demos.


Tech demos are notorious for failing on stage. I can’t even think of a well known CEO who hasn’t had one happen. I don’t think this has a thing to do with Jensen’s style.


Especially when he runs over his keynote by like an hour. It's frustrating (is he going to finish before I have to leave for this meeting?) and also really fun listening to a CEO who's so excited about what they've built that he completely loses track of time.


Saw him recently for the first time and really enjoyed his enthusiasm and "real-ness". At one point during the (long) talk, he opined that he really wished he hadn't drank so much right beforehand (i.e., full bladder). Ha.


He is an amazing salesman, though.


Better breakdown of this architecture compared to previous architectures here:

https://www.anandtech.com/show/15801/nvidia-announces-ampere...



If the demonstrated speed ups translate to real world performance, then I’m truly blown away. Looks like Nvidia will be holding onto the AI crown a while longer.

The only thing I wonder is how difficult is it to take advantage of some of the new arch features, such as TF32 format or sparsity tensor ops.


Have you looked at Apex.Amp? TF32 sounds to be along similar lines, and the PyTorch usage is a breeze.


The TensorFloat-32 results look really impressive, but yikes that is not a good name. "TensorFloat" is extremely confusable with "TensorFlow," and it would really more accurately be called a 19 bit format.


'brain floating point' is also bad, but no one cares because it's just bfloat16. If this becomes popular, it will just be tf32 or tfloat32 or something.


I mostly agree. But bfloat16 at least uses 16 bits. tfloat32 is a 19 bit format and that just seems like a terrible naming decision to me.


I never knew what the “b” in “bfloat” was in all these new DL chips… until today. Man that’s bad.


Disclosure: I work on Google Cloud.

I wouldn't worry about it. Looks like the Anandtech article [1] doesn't either :)

> bfloat16, a format popularized by Intel

[1] https://www.anandtech.com/show/15801/nvidia-announces-ampere...


It's (Google) Brain float


It seems like the TF32 format is similar to BF16 but with 3 more precision bits (in other words, it is FP32 with 13 low-order bits dropped instead of 16).

The number of full adders in an FP multiplier scales with the square of the mantissa width, so if the number of mantissa bits stays the same, they can reuse the existing units.

Since Nvidia already has existing FP16 units on die, using those units for TF32 calculations probably doesn't cost too much additional die area.
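
As a rough way to see what dropping those 13 mantissa bits does, here's a hedged sketch that rounds FP32 values toward TF32-like precision by zeroing the low mantissa bits (illustrative only; it truncates rather than rounds, and ignores the real hardware's rounding behavior):

    import numpy as np

    def drop_mantissa_bits(x, bits_dropped=13):
        """Zero out the low mantissa bits of float32 values.
        Dropping 13 of FP32's 23 mantissa bits leaves 10, i.e. TF32-like."""
        x = np.asarray(x, dtype=np.float32)
        mask = np.uint32((0xFFFFFFFF >> bits_dropped) << bits_dropped)
        return (x.view(np.uint32) & mask).view(np.float32)

    x = np.float32(1.2345678)
    print(x, drop_mantissa_bits(x))      # TF32-like: ~3 decimal digits survive
    print(x, drop_mantissa_bits(x, 16))  # BF16-like: even coarser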


Oof. Double-precision is only 2.5x better, which is less impressive than 20x for float.

I still haven't found anything more cost-effective for double precision data processing (cost + dev time) than a rack full of used Xeons...


The double-precision number probably best represents the generational improvement.

The 20x 32-bit floating point improvement is probably achieved by comparing full FP32 calculations on the previous generation against TF32 calculations on Ampere. This would not be an apples-to-apples comparison, as the TF32 result is less precise.

That said, it is probably not terribly important for deep learning at least, given the success of BF16.


Actually it looks like the double precision number in general GPU usage only went up 25% (Volta did 7.8 TFLOPS). To get the 2.5x number, you need to use FP64 in conjunction with TensorCores, which then gets you 19.5 TFLOPS.

Considering how big the die is (826mm^2 @ TSMC 7nm) and how many transistors there are, they really must have beefed up the TensorCores much more than the general compute units.


Wow. TF32 is only 19 bits. That's some dubious marketing.


I'm curious: why exactly do you need double precision? Not dismissing, just wondering what kind of application needs it.


Physics simulations. There's a rule of thumb that to get an n-bit accurate result after a long chain of calculations, intermediate results should be stored with 2n bits. Often using the full dynamic range of a float is necessary because the magnitude of different physical phenomena varies so wildly.

I guess people do store intermediate results in floats in order to take advantage of GPU acceleration. However, once you do that, you have to be careful in your programming and pay a lot more attention to underflow, overflow, and numerical accuracy. People who write scientific software usually also aren't experts in numerical analysis. Even if the code you write reliably works with floats, the library you use might not be. It's just a huge pain to make sure everything is accurate.
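
A tiny hedged illustration of that rule of thumb (contrived, just to show the effect of accumulator precision on a long chain of additions):

    import numpy as np

    x = np.full(10_000_000, 0.1, dtype=np.float32)

    # Sequential accumulation in float32: once the running sum is large
    # relative to each term, the low bits of every addition are lost.
    print(np.cumsum(x)[-1])

    # Same data, accumulated in float64: stays very close to 1e6.
    print(np.cumsum(x.astype(np.float64))[-1])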


After moving from theoretical high energy physics to data science, I'm really happy I don't have to care about numerical precision in my computations.

The problem is that numerical errors when solving partial differential equations not only propagate but increase in magnitude during the propagation. If you are not careful you will end up with a 100% wrong answer at the end of a big computation.


I've always argued that if you are getting close to having to worry about underflow, overflow etc. then you have an ill-conditioned problem and just increasing the size of your intermediate results won't help you a huge amount because you need more precision from your inputs. There are very few fields where you need more than the 7 decimal digits afforded by floats. Maybe the only exceptions are in astrophysics.


When solving linear systems of equations (which arise pretty much everywhere) Krylov subspace methods are usually quite effective because the Krylov subspace is orthonormal.

If your floating-point precision isn't high enough, you'll end up instead with a "subspace" that isn't spanned by orthogonal vectors, and the consequences of this are pretty drastic (requiring n-times the number of iterations to solve the system, you'll never find a solution, etc.). These all happen even if your system of equations has a good condition number, you just need to make the precision low enough.

This is why most people use double precision, and some people use quad precision. For many systems, e.g., if you are using CG as your solver, quad precision can cut the number of iterations by a large factor (2x-4x). These problems are still bandwidth bound, and using quad precision doubles your memory bandwidth requirements, but if it reduces the number of iterations by 4x, you just halved your time to solution.
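
A hedged toy example of that orthogonality loss (classical Gram-Schmidt on an ill-conditioned basis; purely illustrative, and real solvers would use modified Gram-Schmidt or Householder):

    import numpy as np

    def gram_schmidt(A):
        """Classical Gram-Schmidt; returns Q whose columns should be orthonormal."""
        Q = np.zeros_like(A)
        for j in range(A.shape[1]):
            v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
            Q[:, j] = v / np.linalg.norm(v)
        return Q

    def orthogonality_error(A):
        Q = gram_schmidt(A)
        return np.linalg.norm(Q.T @ Q - np.eye(A.shape[1]))

    rng = np.random.default_rng(0)
    # Columns spanning ~4 orders of magnitude make the basis ill-conditioned.
    A = rng.standard_normal((200, 20)) @ np.diag(np.logspace(0, -4, 20))

    print(orthogonality_error(A.astype(np.float32)))  # noticeably far from zero
    print(orthogonality_error(A.astype(np.float64)))  # orders of magnitude smaller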


Lattice QCD, especially near the physical point, has poorly-conditioned matrices that one wishes to solve Mx=b for. The state-of-the-art is that the [sparse] matrices can be as large as 4×3×(128^3 × 192) ~ 5e9 on a side. It's not so rare to find legitimately difficult problems in hard sciences.


Double precision has been the mainstay of scientific computing for decades, and no, it's not because all those scientists are dumb.


For pretty much any ODE solver you want double. All of CFD (fluid sims) and FEM (mechanics) use double. It can also be a big help in graphics. Basically, if you're doing fine meshing, doubles will help you. There's a balance with your mesh size as well. Your life is a whole lot easier and you'll get better answers if you just use double. Ask literally any computational {physicist,scientist,engineer}. We've all tried without double. Trust me, I'd love to have lower memory constraints.


That argument would be wrong. Double can be a big help, for example when performing solid modelling operations on triangle meshes. Exact arithmetic would be best, but it is often too slow; double is often good enough, while float isn't.


I thought for this the standard practice is fixed-point. Requires more planning and mental gymnastics but usually much faster and gives you full control of the precision.

Maybe this has changed from my DSP days.
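
For anyone who hasn't worked with it, a hedged sketch of what fixed-point arithmetic looks like in plain Python (a Q16.16-style format; purely illustrative):

    # Q16.16 fixed point: store value * 2**16 as a plain integer.
    FRAC_BITS = 16
    SCALE = 1 << FRAC_BITS

    def to_fixed(x): return int(round(x * SCALE))
    def to_float(q): return q / SCALE
    def fx_add(a, b): return a + b                 # same scale: plain integer add
    def fx_mul(a, b): return (a * b) >> FRAC_BITS  # rescale after the multiply

    a, b = to_fixed(3.25), to_fixed(0.5)
    print(to_float(fx_add(a, b)))  # 3.75
    print(to_float(fx_mul(a, b)))  # 1.625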


I am confused. If you used to work in DSP then you know the standard practice involves MATLAB, where the default is double precision. This is also the default datatype in NumPy. Most engineers don't like working in fixed-point unless they have to for other reasons, like moving it onto an FPGA or something.


The person you're responding to probably worked with DSP chips, which are generally not floating-point. e.g. Motorola 56000, TigerSHARC, Blackfin.


Why not? Because of interop with other libs/tools? It's not _that_ hard, but I can see the problem if whole workflow isn't like that.


You know a cool thing you can do to help fixed point interop between operations in a complex system where you absolutely don't want to accidentally overflow anywhere? You can tack on some bits to the number to control an overall scale of the fixed point number. Let's call it an exponent ;)


Agreed. Let's standardize it :))


Most of it was FPGA work, but in some cases DSP processors. Even when you have floating point support, it is much slower than using fixed point. As for MATLAB, I modeled plenty of filters in it for fixed-point math.


Not OP, but like many other people I use linear algebra accelerators for scientific computing (in my case, simulating and controlling quantum-mechanical systems). We do need the precision if we want the solutions of our ODEs to converge.


Titan V100's are pretty good for double precision; they need almost the same operational intensity as Xeons to be fully utilized.


This thing is mainly targeted at AI workloads I assume, so double precision isn’t so interesting.


I always thought that a real time scalable architecture would be beneficial. It's refreshing to see someone working on it, and exciting to see that it's nVidia. I always pictured a CPU with variable bit-width. Like a 256-bit ALU that could partition itself down into 16 or 32 bit ALU's as the workload allowed.


That's been around since MMX and AltiVec. It took a while for GPUs to adopt subword SIMD though.


SIMD works great for doing the same thing to multiple pieces of data, but it doesn't do the scaling up that I described.

I'm no chip engineer, so maybe what I'm envisioning isn't possible. In essence, instead of making 4x 64-bit cores you make 128x 2-bit cores and then some architecture on the die to select groups of cores to build a processor of the required size, execute some instructions with that processor, and then disassemble the processor back into a pool of resources.

So SIMD might be able to calculate two 16-bit sums on a 32 bit processor in one cycle, but the hypothetical CPU I'm describing will be able to calculate a single 128 bit sum and eight 16 bit sums in one cycle, at the same time.


What you're describing is basically a modern FPGA[1]. You can wire it up as you want at runtime, and they can contain specialized hardware like hardware multipliers and fast local memory to accelerate certain workloads.

[1]: https://en.wikipedia.org/wiki/Field-programmable_gate_array


Wasn't that how early Crays worked, with variable bit-width floats?


For those in the industry:

When a new generation like this is released, will a typical AI company replace the current GPUs? Is there a chance to acquire the older versions for private use or is it too early for that?


> will a typical AI company replace the current GPUs?

(Not speaking for everyone of course)

No, we keep the current GPUs as well; the ones we throw away are several generations old (like K80s), so not very interesting.

For private use, those GPUs are really not worth it. Once their used price reaches an acceptable level, you can be sure that the new general public GPUs will outperform them.

For example, a Titan RTX is probably still half the price of a V100, while only being ~20% slower.

Edit: Actually a Titan RTX is about the same price as a P100 now, while being much better.


A fun quirk of GPU pricing economics is that on Google Cloud Platform, the relatively recent T4 (which is on Turing and has FP16 support) is cheaper than the ancient K80s. https://cloud.google.com/compute/gpus-pricing


That's not a great comparison at all though. The K80 is a general purpose chip while the T4 is explicitly marketed as an inference chip. The K80 has more ram (super important for batch sizes during training), can access that ram faster (480 GB/s versus 320 GB/s), and is an overall more powerful chip than the T4 is.


Those metrics are for both GPUs on a board; the GCP K80 only uses one GPU, so those performance metrics in theory would be halved (notably, the GCP K80 has 12 GB VRAM vs. T4's 16 GB VRAM), and it's still more expensive than a T4.


I don't know any specifics about GCP but in general for datacenter design, heat dissipation and energy costs are really important. When you shrink the process width, you get incredible savings in power (saves $$) -> less heat dissipation per flop (saves $$ because you don't have to remove that heat).

That, and I suppose there could be a premium for older setups of any kind if, for whatever reason, there are licensing agreements negotiated for certain hardware (or for certain numbers of cores).


7 times V100 performance for BERT. That is insane!


Mind you, this is an optimised version of BERT, from what I could see in their blog post.


The network topology is the same, no modules are removed, etc. Of course it will be optimized to run on Ampere.


I'll link to the mention of the optimised versions [0], but that's not what I mean!

Say earlier model XYZ trained 4 epochs per hour, and BERT trained 2 epochs per hour. Now on a single card if you can train optimised BERT for 4 epochs per hour, that doesn't necessarily mean the same card will handle XYZ at 8 epochs per hour.

It's a technological achievement nonetheless, but the fact that it was heavily optimised for the new architecture, possibly beyond an extent infeasible for non-Nvidia developers, still has to be considered.

[0] https://nvidianews.nvidia.com/news/nvidia-achieves-breakthro...


Hmm, I was expecting 3080 ti.


I believe the RTX 3000 series is expected to be announced in August. They usually announce their GeForce cards a few months after they announce the Quadro ones.


Really? That's not too long to wait. Hopefully it won't get delayed by Covid-19.


5 petaflops in a DGX? That alone would put one of those babies into the top 50 of the TOP500, and a SuperPOD would make no. 1, no? Well, if it could do that performance on Linpack/Rmax.


No, the A100 has a 19.5 TFLOPS theoretical peak for SGEMM[1]; real-world benchmarks will likely achieve ~93% of that, so the DGX A100 will be about 145 TFLOPS of FP32 SGEMM performance, or 0.145 FP32 PFLOPS. Maybe around 72 FP64 TFLOPS. FP64 is what the TOP500 benchmarks count.[2]

The 5 "petaflops" number is a creatively constructed marketing number based on FP16 TensorCore "flops", sparse matrix calculations, and then multiplying by 8x for some reason. They basically take the 19.5 FP32 TFLOPS number and multiply it by 32x to get to the claimed 624 "TFLOPS" for a single A100. 8 * 624 = 5 "petaflops". I see they get 2x by actually using FP16 instead of FP32, 2x from counting sparse matrix ops as dense ops, and 8x from somewhere else that I can't account for.

[1] https://devblogs.nvidia.com/nvidia-ampere-architecture-in-de...

[2] https://www.top500.org/resources/frequently-asked-questions/


> 8x from somewhere else that I have no idea.

8 GPUs in the box.


No, I already multiplied 624 TFLOPS / GPU * 8 GPU = 4992 TFLOPS (the 5 petaflops number).

I'm saying that you are still missing another 8x on the way from 19.5 TFLOPS / GPU to 624 TFLOPS / GPU. 19.5 (base FP32 theoretical peak performance) * 2 (FP16 instead of FP32) * 2 (counting sparse matrix ops as dense ops) * 8 (unknown) = 624 TFLOPS.


FP16 tensorcore = 312 tflops

x 2 (counting sparse as dense) = 624 tflops

x 8 GPUs = 5 "pflops"

The missing 8x you are looking for is just because tensor-core math is much faster than their normal FMA path.
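
Putting the whole chain of factors in one place (numbers as discussed above):

    # Reconstructing the "5 petaflops" figure from the factors in this thread.
    fp32_tflops        = 19.5   # base FP32 peak per A100
    fp16_vs_fp32       = 2      # FP16 instead of FP32
    tensorcore_speedup = 8      # TensorCore path vs. the normal FMA path
    sparsity_factor    = 2      # structured-sparse ops counted as dense
    gpus_per_dgx       = 8

    per_gpu = fp32_tflops * fp16_vs_fp32 * tensorcore_speedup * sparsity_factor
    print(per_gpu)                  # 624.0 "TFLOPS" per A100
    print(per_gpu * gpus_per_dgx)   # 4992.0, i.e. the marketed ~5 "PFLOPS"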


Top500 is generally based on double precision (64 bit).


A previous generation superPOD (with V100 GPU's) is currently at #20. Watch this space...


Yeah, use this with PCIe 4.0 and add a dense PCIe fabric.

Broadcom https://www.broadcom.com/products/pcie-switches-bridges/expr... and Microchip https://www.microchip.com/wwwproducts/en/PM42100 have 98/100 lane PCIe4 fabric switches to offer.


Less than that in double precision. If my math is right, they manage around 0.4 petaflops. Their previous-gen SuperPOD was #22. If they do indeed add four of the new ones that will still put it at #1, with something like 230 petaflops.


I'm mostly just amazed by that... what is it, that ivory thing above/around his stove?

Is this a thing newly rich Americans have in their kitchens?


No one is talking about it, but the mention of a focus on "datacenter computing" (in the video) with CUDA is interesting. In retrospect, a GPU is basically running a map/reduce-type workflow, so it's not that crazy after all.

Is this new or is CUDA already being used as a distributed language?


cuDF has supported Dask for distributed processing for a while now, maybe a year or two?
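
Roughly, a minimal sketch of what that looks like (the file pattern and column names are placeholders for illustration):

    import dask_cudf

    # Read CSVs across GPU workers, then aggregate on the GPUs.
    df = dask_cudf.read_csv("data-*.csv")
    result = df.groupby("key")["value"].mean().compute()
    print(result)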


Nvidia is one of the major GPU suppliers (along with AMD and others), and compute needs are growing in autonomous vehicles, robotics, etc. Looking at future demand, will there be more demand for CPUs (or less, in favor of GPUs), for GPUs, or for some other, more favorable kind of processing unit?

Google announced TPUs a few years back as one example. So I'm curious which market segments are likely to grow, and which processors would be needed to meet that increased demand.


For dedicated neural network use cases like vehicles, robots, and datacenters, dedicated accelerator chips are going to be huge. Tesla's been shipping custom silicon for a while, I'd be surprised if Waymo cars didn't have a TPU in them, and there's like a dozen companies making AI accelerators but as far as I know, none of them are for sale yet.


In general computing, FPGAs will likely see a rise in popularity (latest mac pro offers an FPGA card as an option). In task specific computing, GPUs and ASICs will likely lead.

Tesla replaced their Nvidia GPUs over a year ago with a custom chip (I believe they hired a chip designer away from Apple to develop their own silicon).


Extending the tensor Ops to FP64 is an interesting, if not surprising, design choice. Are there many applications sure to leverage this capability? Aside from HPL, of course.


I suspect that this is specifically targeted at the HPC market, as the lack of FP64 has always been a hindrance to HPC deployment.

You have to remember that the HPC market is $35B today. HPE makes $3B a year alone from that, maybe more with the Cray acquisition.

So it's no surprise that NVIDIA wants to position itself in that market.

Plus, you have to look at the long game with the MLX acquisition (a heavy player in the HPC market) and Cumulus. I wouldn't be surprised to see NVIDIA trying to bypass Intel/AMD completely and offer a direct-to-interconnect device rather than the hybrid CPU + GPU box.


Remember too that AMD and Intel won the contracts for Aurora, Frontier, and El Cap (the three exascale machines for the DOE). I imagine MLX is a big part of getting the next contracts as well, seeing that a lot of these projects are IO-bound, not compute-bound. If you can bring supercomputing-like abilities to datacenters or AI labs, that'd be a huge advantage. If you could easily split a huge model across 64 GPUs and train as if it were on a single node, that would change the space.


> Remember too that AMD and Intel won the contracts for Aurora, Frontier, and El Cap (the three exascale machines for the DOE).

This isn't very surprising I think. AMD cards often have higher FLOPs than NVIDIA's, and I can imagine that they run HPL really well.

I can't wait to try these systems. I want to see what the OpenMP performance there looks like for normal applications.


Lately they have also been working quite a bit with POWER; quite a few of the big HPC systems run on the POWER/Mellanox/NVIDIA combo. So that could also be something they look into more, instead of going with x86.


Is there a video or something?




The numbers for their SATURNV supercomputer are either untrue or absolutely staggering. 4.6 exaflops? #1 on the Top 500 list of supercomputers just barely passed 200 petaflops at peak performance. If you add up the entire list you only get 1.65 exaflops. And LINPACK isn't usually network-bound. How can this possibly be true?


They're counting TF32, which is a 19-bit format, and comparing it to FP64.


It's tensor Ops, and so the precision might be even lower than single. It could be BF16, for example.


Yeah, next-generation national supercomputers are "pre-exascale" supercomputers. A theoretical 4.6 exaflops today is an unbelievable jump. It would be interesting to know what HPL performance it can sustain.


AI only perf vs general perf.


Does anyone know what kind of GPU support for Spark is mentioned in this announcement? Are they talking about existing XGBoost acceleration, or is it something general-purpose?


This goes beyond the existing Spark+XGBoost GPU acceleration to include ETL, Spark SQL, etc. Coming for Spark 3. Full details here: https://www.nvidia.com/en-us/deep-learning-ai/solutions/data...


At first glance, it seems general purpose:

https://www.zdnet.com/article/nvidia-and-databricks-announce...

"based on the open source RAPIDS suite of software libraries" […] "will allow developers to take their Spark code and, without modification, run it on GPUs instead of CPUs"

https://rapids.ai/
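
For the curious, wiring the plugin into PySpark looks roughly like this (config keys from memory; the exact keys and required jars may differ by version, so check NVIDIA's docs):

    from pyspark.sql import SparkSession

    # Rough sketch of enabling the RAPIDS accelerator plugin; existing
    # DataFrame/SQL code is unchanged, the plugin substitutes GPU plans.
    spark = (
        SparkSession.builder
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
        .config("spark.rapids.sql.enabled", "true")
        .getOrCreate()
    )

    # Unmodified Spark code, now eligible to run on GPUs.
    spark.range(0, 1_000_000).selectExpr("sum(id)").show()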


They show a fictional render of the main chip surrounded by four large gold coated leadless packages. What are those supposed to be?


VRMs.


Probably a heatsink


How does RAPIDS relate to GOAI and Arrow? It seems like the same technology keeps getting a name change...?


GoAi was an effort to get GPU developers on the same page and working together to build an ecosystem for analytics on GPUs.

RAPIDS is a project that was born out of GoAi to bring that ecosystem to Python.

It is built on Apache Arrow (although in GPU memory), and includes many of the original GoAi members, like my team (BlazingSQL), and others such as Anaconda, Nvidia, and many, MANY others.
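
A small sketch of that Arrow relationship (illustration only):

    import cudf

    # A cuDF DataFrame lives in GPU memory but speaks Arrow: it round-trips
    # to/from a host-side pyarrow.Table.
    gdf = cudf.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
    table = gdf.to_arrow()                   # pyarrow.Table on the host
    back = cudf.DataFrame.from_arrow(table)  # back into GPU memory
    print(back)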


Apache Arrow is used in RAPIDS.


I know this isn't a traditional supercomputer and these aren't LINPACK benchmarks, but I am still blown away by the 5 petaflops figure, considering where we were just 10 or 15 years ago.


Have to admit again, Huang is an amazing salesman


Did anyone notice that PCB? 50 pounds, 30k components, 1M drill holes, and 1 km of traces. Pretty much blew my mind.


Still waiting for Wave Computing to ship.


Well considering they are filing for bankruptcy you are going to be waiting a while...


Speaking of vapor, still waiting for rex computing to ship...


Me too; sadly we had silicon that ran great (we had better performance per watt for FP32 and FP64 on a 28nm process vs. the A100's GFLOPS/watt on 7nm), but we were targeting a market with 3 customers that didn't want to work with a startup, plus investors that were opposed to us going into the "risky" and "unproven" AI space. I still have 183 of the chips in my closet waiting to see the light of day :/


Source?


A very simple google search would suffice, but here: https://www.eetimes.com/wave-computing-files-for-chapter-11-...


> DGX-A100 - The First HPC System With 140 Peta-OPs Compute Shipping Now For $199,000

Crazy pricing for a chip that basically started as a game solution. Anybody actually seen these Nvidia DGX systems in the wild?


Yes, we use them (with V100, not A100, of course). Mind you, the full box is more than just the 16 GPUs.


Yup- they're cheaper than renting from AWS.


We are working with 20 of them at a customer....


Can you talk about what sort of things you're doing, in general terms? Does the new generation DGX look like a worthwhile upgrade for you?


Large-scale deep learning. I don't make the decision on upgrades, but if I were buying new DGX-1s, I would buy these new ones. Then again, I wouldn't buy a DGX-1 in the first place: it is an appliance with software nobody wants. Buy a commodity server with V100s and NVLink for 66% of the price, like HPE or somebody else sells.


Stealing Epic Games/Unreal Engine's thunder.

If Nvidia was a human, they'd be the type to propose at someone else's wedding.


So wait: the GTC keynote, which was scheduled weeks/months ago, happens to follow news that Epic dropped with absolutely no warning, and apparently that counts as stealing thunder. Is that really what you are saying?


Nvidia probably collaborated on the Epic demo, but in any case the two are aimed at entirely different sets of customers. Nvidia's news cycle is data center updates in May and consumer/gamer updates in September.


Epic demo was running on PS5 so no NVIDIA involvement there.


Oh yes, I forgot they've entirely lost this console generation.


Well, not entirely (and it depends on what you count as this console generation): the Nintendo Switch is based on an Nvidia SoC. Although that's probably not the biggest money maker.


And the HPC exaflop supercomputers as well.



