Aiter: AI Tensor Engine for ROCm (amd.com)
179 points by hochmartinez 5 months ago | 88 comments


I just want to remind everyone that the El Capitan, Frontier, and LUMI supercomputers are powered by AMD Instinct cards.

El Capitan is #1 in TOP500. Frontier is #2, LUMI is #8.

ROCm development is probably mainly driven by the needs of these supercomputers' users currently.

So, we're seeing the tip of the iceberg.

Also ROCm packages continue to land on Debian, so there's more than meets the eye.

Note: Search "AMD Instinct" at https://top500.org/lists/top500/list/2024/11/. There are way more systems.


> ROCm packages continue to land on Debian, so there's more than meets the eye

I've been volunteering with Debian to help package ROCm for four years now, but today it officially became my full-time job. AMA.


Congrats on the job! It's exciting to see developments in CUDA competitors.

One of the issues I've had with ROCm is the not-so-great support for consumer GPUs, specifically the RX 7XXX series. Do you think there is any chance it will improve in the future?


I'm not sure. What were your problems with the RX 7XXX series?


Not the GP, but I have an RX 7700S running Ubuntu and I cannot for the life of me get ROCm to play nice with my GPU. I've tried all sorts of env vars, but I keep getting seg faults when I try to run PyTorch, or it just ends up running on my CPU.


The RX 7700S is gfx1102. Please see my reply in the thread on the RX 7600, as it is applicable to you too. https://news.ycombinator.com/item?id=43465281


Who do you work for? And is packaging ROCm for Debian really a full-time job, or is it just a part of your job?

As messy as ROCm's packaging is, I can't imagine spending all day every day trying to fix it.


I work for AMD. To be clear, my new job is about integrating ROCm into the distribution, not just about shipping ROCm packages that can run on Debian.

I'll be doing things like creating new packages in main, helping to get support for the HIP language embedded into existing dpkg tooling, helping to get GPU architecture awareness integrated into the Debian CI infrastructure, helping to enable ROCm support in other libraries and applications packaged for Debian, and ensuring that everything in Debian is successfully imported into the Ubuntu universe repositories.

Integrating HIP support into Debian so that it feels as natural as C or C++ and 'just works' across dozens of GPUs is a job for more than one person. That is why I'm glad there have been so many volunteers in the community stepping forward to help with various pieces.


> I work for AMD


from his profile, if anyone is looking for where that came from


I have no questions, but congrats! It's great to hear good things like this as both an HPC admin, and a Debian user of 20+ years.

Man, I'm old. :)


Congrats!


Could you please tell AMD that it is a major competitive advantage for Nvidia that they keep releasing driver updates for cards many, many years after release, and even very old cards still get current drivers.

AMD, it seems, just drops your card from the current releases within a few years. That makes me favor Nvidia.


The only driver I'm aware of is the AMDGPU driver in the Linux kernel. It is updated with every release of Linux and is used for all modern AMD GPUs. I find that the drivers generally work well. My complaints are more about the user space libraries.

The good news is that I have at least one AMD GPU of each architecture from Vega to RDNA 3 / CDNA 2 on the Debian ROCm CI. Debian Trixie has packages built and tested for every modern discrete AMD GPU from Vega to RDNA 3 / CDNA 2. (I'd have liked to include RDNA 4 / CDNA 3, but the effort was quite resource constrained and the packages are a bit old. I'm hoping to improve upon that going forward, but Trixie is already in feature freeze so it will have to wait for the next release.)

I personally own much of the equipment for the Debian ROCm CI and I can promise I will continue testing new releases on old hardware for a very long time.



What's the plan for the AMD NPUs, such as the one in the Framework Desktop (https://frame.work/desktop)?


The driver for AMD's XDNA NPU landed in Linux 6.14 [1]. However, the Xilinx AI runtime still needs to be packaged. That may take some time. The NPU runtime stack is based on the Xilinx AI toolchain, which is not yet as mature as the ROCm stack. There are a few related packages in Debian, but AMD and Debian both have a lot of work to do to get support for the NPU integrated into the distribution. I probably won't directly be doing the packaging of the runtime, but I've been helping to nudge the process along.

It's perhaps worth mentioning that Framework has directly supported Debian in providing access to hardware with AMD NPUs and iGPUs. I'm typing this message on one of two Framework 13 laptops that they donated to support Debian in this effort. I will be using it both for testing gfx1103 support on Debian and for testing the NPU packages when they become available. Framework also generously offered to provide one of those desktop systems you linked for the Debian ROCm CI [2]. It would also be used as a CI worker for the NPU runtime libraries once those are packaged.

[1]: https://www.phoronix.com/review/linux-614-features

[2]: https://ci.rocm.debian.net/


The machine I'm writing this comment on is running a Radeon RX 550, with the open source AMDGPU driver that ships with the mainline kernel.

OS is Debian Trixie (Testing). No secret sauce. Install & go. Everything is working perfectly.


It's not just about drivers in isolation, but about what features those drivers and cards support. Support for older compute APIs on AMD cards gets dropped in newer drivers, and newer APIs aren't supported on older cards. With Nvidia, CUDA has been supported continuously for probably 15 years now, while in the AMD world you've been expected to throw out all your old code and port it to a new API every 3 years.


Do supercomputers mostly run in FP64? At FP8 an H100 hits 2 petaflops, and with only 1000 of them you've got more compute power than El Capitan (in raw FLOP count).


Disclosure: I'm an HPC admin who developed a materials simulation framework for my Ph.D.

Simulations run in FP64, and you have to, since you're already approximating things with numerical algorithms (analytic solutions to many problems are impossible anyway). Even if you could do things in FP8, transferring everything to the GPU is not trivially possible.

A simulation contains tons of different algorithms, and not all of them can be modeled effectively as a set of matrix operations. Also, moving kernels in and out of the GPU is not an instant affair, and moving data to the GPU is always more expensive.

Modern GPUs have GPUDirect and multiple DMA engines, but they require hardcore coding and knowing what you're doing if you're not solving popular problems with established libraries and so on.

Plus, if you prefer not to be vendor locked, at least one of the vendors artificially limits the performance you can get from their cards.

On the other hand, all of the prominent linear algebra libraries squeeze out the CPUs you have relatively easily, and you don't need matrices and vectors to get that performance from CPUs anyway.

Lastly, I want to touch on the fact that parallelizing such problems is not always trivial even on CPUs. When you go multi-node via MPI, things get fun. Getting GPUs into that mix is somewhat of a madness if you're not prepared.


It hits 2 petaflops on the tensor cores at FP8. If you want GPGPU, that plummets to 134 teraflops (for FP16, though).


El Capitan can also do FP8. HPC generally requires double precision, but people are trying to make low precision work.


I'm particularly fond of the Ozaki scheme https://arxiv.org/html/2306.11975v4 and its recent refinements. Hopefully it trickles down to standard HPC libraries soon.
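
Very loosely, the flavour of these schemes is to split each FP64 operand into lower-precision pieces, multiply the pieces on the fast low-precision units, and accumulate the partial products at higher precision. A toy two-way split in NumPy, just to show the shape of the idea (this is not the actual Ozaki splitting, which slices the mantissa so that the partial products stay exact):

    import numpy as np

    def split_fp64(x):
        hi = x.astype(np.float32).astype(np.float64)  # leading part of each mantissa
        lo = x - hi                                   # residual
        return hi, lo

    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 256))
    B = rng.standard_normal((256, 256))

    A_hi, A_lo = split_fp64(A)
    B_hi, B_lo = split_fp64(B)

    # Each of these partial products is what would be pushed to the fast
    # low-precision units; here they are plain FP64 matmuls for illustration.
    approx = A_hi @ B_hi + A_hi @ B_lo + A_lo @ B_hi + A_lo @ B_lo

    print(np.max(np.abs(approx - A @ B)))  # tiny here; the real scheme controls the error rigorously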


They support their workstation cards pretty poorly, though. I have a Radeon VII Pro and it's already deprecated in ROCm, and it's not even 3 years old. They could really learn a lesson from Nvidia, which supports old cards going far back and supports every card, not just a few hand-picked business models.


> ROCm development is probably mainly driven by the needs of these supercomputers' users currently.

Seems like a problem since AMD wants to go after AI capex?


The AI capex is being invested into things that are, effectively, supercomputers.


Supercomputers have very different needs. They want 64-bit floating point, which nobody has been focusing on for a while.


While FP64 is indeed important for supercomputers, the largest supercomputers have a great deal in common with AI infrastructure.

For example, high-bandwidth, low-latency interconnects supporting GPU-direct network messaging and IO are important.

High memory bandwidth is also quite important.

Debugging and performance profiling at scale also commonly uses similar tools.


No they do not, because supercomputers have different partitions to cater to different needs. For example, half of a supercomputer's nodes might lack a GPU, to cater to users who really need FP64 on the CPU, and the other half will have GPUs for the users who need them. They are served from different queues, so their jobs do not block each other.

OTOH, if you think nobody is focusing on FP64, look at the YoY performance gains for high-precision floating point on both CPUs and GPUs. You'll be surprised.


If I understand correctly, this library provides some Torch kernels customized for AMD hardware. Why haven't they just upstreamed them to PyTorch for better adoption? Also, they seem to demo usage with Torch's default eager execution mode and not Torch JIT/TorchScript. Is this library compatible with TorchScript?


I think a lot of stuff will get upstreamed eventually. PyTorch just moves slower, and since it's a stable library, I think it cannot rapidly adopt something like fused MoE until the dust has settled a little and it's clear what the API would look like long-term.

I think it’s ok that stuff is tried first in Torch extensions. That’s how Flash Attention started after all and the same is true for newer kernels in CUDA-land (fused MoE, MLA, Marlin, etc.).

With regards to TorchScript, that’s really legacy - torch.compile is where it’s at. This post seems to suggest that the kernels work with torch.compile: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
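
For anyone who hasn't made the switch, here are the two APIs side by side on a toy module (plain PyTorch, nothing aiter-specific):

    import torch

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(4096, 4096)

        def forward(self, x):
            return torch.relu(self.fc(x))

    model = MLP()
    scripted = torch.jit.script(model)   # TorchScript: the legacy path
    compiled = torch.compile(model)      # the current recommendation (inductor backend)
    out = compiled(torch.randn(8, 4096)) # first call triggers compilation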


I really do not understand why they can't just work with the existing OSS developers who are pulling their hair out trying to make AMD devices work, instead of doing it this way. It's like Mozilla with their questionable decisions.


There are a lot of OSS developers; I doubt AMD has the resources to do that. And realistically they don't need to. I wandered over to watch some George Hotz videos the other day, and it looked like the AMD driver situation has improved to the point where specialist AMD access isn't needed to debug any more. Which is a huge change, and very exciting for me personally, because it means I might be able to jump back to an AMD card and ditch the mess that is Nvidia on Linux.

In theory they might not even need to be involved in optimising compute kernels; there is probably some PhD student who'll do the work because they want to be a kernel-optimising specialist. In practice, a few strategic applications of paid talent are all they really need. Everyone wants to diversify off Nvidia, so there is a lot of interest in supporting AMD if they are willing to push out firmware that multiplies matrices without crashing. Which has been a weird sticking point for AMD for a surprisingly long time.


There's only one PyTorch though, and it's what people are using for ML nowadays.

Back in the day you had to optimize your card for Quake and do everything to make it run well. Now you have to do that for PyTorch.


> Back in the day you had to optimize your card for Quake...

That is exactly the attitude that left AMD out in the cold during the AI revolution; they learned a lot of stupid lessons about optimising for specific games and present-day use cases instead of trying to implement general capabilities to a higher standard, like Nvidia did with CUDA. They ended up a decade away from a multi-trillion dollar market.

PyTorch might be special. I wouldn't be at all surprised if AMD does have a dedicated engineer working on PyTorch. But their problem to date hasn't been their engagement with PyTorch, but rather that literally nobody could make PyTorch work on AMD cards, which had buggy and terrible support for GPGPU work. If they fixed that, some random might do the work without their involvement, because a lot of people want to see that happen.


Now that the required task is known, though, it doesn't really matter. If AMD understands that, they should have no problem putting engineers on making PyTorch work well.

Considering its importance, it shouldn't be one engineer. It should be 50+.


I think they have been taken over by exactly the same people leading the AI hype. Funny how in this article they are a) not advertising clearly what they are doing, b) solving a small subset of problems in a way no one asked for (I think most people just want ROCm to work at all...), and c) just adding to a complex product without any consideration of actually integrating with its environment.

I guess it's vibecoding "AI"...


> solving a small subset of problems in a way no one asked for

What do you mean? Having ROCm fused MoE and MLA kernels as a counterpart to kernels for CUDA is very useful. AMD needs to provide this if they want to keep AMD accelerators competitive with new models.


Should the matrix multiplication at the core of this not be in a core library? Why are generic layers intermixed with LLM-specific kernels when the generic layers are duplicating functionality in torch?

Upstreaming that might actually help researchers doing new stuff, versus the narrow demographic of people speeding up LLMs on MI300Xs.


They are imitating Nvidia's TensorRT with AITER. Basically AMD wants to have "CUDA, but not CUDA".


They'd like to have CUDA, period, but are legally barred from it.


> They are imitating Nvidia's TensorRT

Do you know what the RT in TensorRT stands for? Hint: AITER has nothing to do with TensorRT.


> I think most people just want ROCm to work at all

I think most people don't want to have to think about vendor lock-in related bullshit. Most people just want their model to run on whatever hardware they happen to have available, don't want to have to worry about whether or not future hardware purchases will be compatible, and don't want to have to rewrite everything in a different framework.

Most people fundamentally don't care about ROCm or CUDA or OneAPI or whatever else beyond a means to an end.


Which of Mozilla's questionable decisions are you referring to?


> Why haven't they just upstreamed them to PyTorch for better adoption?

They don't seem to care, or don't understand how to get broader adoption.

For some reason AMD's management is dead set on targeting only the high-end part of the market. Like, for example, look at this blog post. Which model are they testing? DeepSeek R1, the 671B behemoth that no normal person can run. Or look at any of their tutorials/docs and see which GPUs they support - it's always either unobtanium-grade enterprise GPUs or high-end workstation cards that no one buys. And if your strategy is to target only the super-rich entities, then a little jank in the software isn't really all that punishing - if you can afford to drop a few million on GPUs, then you can also afford to hire someone to spend a few weeks getting AMD's software to work, tuning it by tweaking the two dozen environment variables they seem to like so much, etc.


> For some reason AMD's management is dead set on targeting only the high-end part of the market.

Because those people are dropping $100 billion on GPU clusters and individuals are not.


Yes, but researchers use PyTorch, and those researchers end up being the end users of the GPU clusters.

NVIDIA GPUs sell so well because they work with what researchers actually use.


Oh, I definitely think they should upstream to PyTorch; I'm just saying that doing the usual "why doesn't AMD think of the gamers^W^W^W^W^W local model users" is not going to sway their policies.


That would make the kernels the PyTorch Foundation's problem, and they would have to set up CI infrastructure around AMD GPUs to maintain these kernels. For whatever reason, AMD really wants to keep everything in-house even though that has been a losing strategy so far.


I'm not a Python expert, but this feels very odd to me (both the *init* construction and the return [tgemm.mm](http://tgemm.mm/)(input, self.weight, self.bias, None, None) call, which looks like markdown to me):

    from aiter.tuned_gemm import tgemm
    import torch
    
    class LinearLayer(torch.nn.Module):
     def **init**(self, in_features, out_features):
      super(LinearLayer, self).**init**()
      self.weight = torch.nn.Parameter(torch.randn(out_features, in_features).cuda())
      self.bias = torch.nn.Parameter(torch.randn(out_features).cuda())
    
     def forward(self, input):
      input = input.cuda()
      return [tgemm.mm](http://tgemm.mm/)(input, self.weight, self.bias, None, None)


I was puzzling over the code wondering why they .cuda() everything like that when I realised that that was only the beginning of the weirdness.

I'm assuming the scrambled annotations were due to some odd chain of things the code went through on the way to becoming a post.

Maybe they did it as a parable about the problems of having many layers of abstraction causing processes with unintended consequences?


Yeah, this is AMD in a nutshell: a bunch of fluffy descriptions, and then the only concrete example would clearly never run.

EDIT: They fixed the code pretty quickly


Yep, the syntax highlighting / doc hyperlinking clearly broke there (or, less charitably, whatever LLM produced that prose had a moment).

It's __init__, of course.
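
For reference, de-mangling the snippet gives something like the following; the call signature is kept exactly as the post shows it, so whether those trailing None arguments are the right defaults is going by the article alone:

    from aiter.tuned_gemm import tgemm
    import torch

    class LinearLayer(torch.nn.Module):
        def __init__(self, in_features, out_features):
            super(LinearLayer, self).__init__()
            # .cuda() here targets the ROCm/HIP device; PyTorch reuses the "cuda" name
            self.weight = torch.nn.Parameter(torch.randn(out_features, in_features).cuda())
            self.bias = torch.nn.Parameter(torch.randn(out_features).cuda())

        def forward(self, input):
            input = input.cuda()
            return tgemm.mm(input, self.weight, self.bias, None, None)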


Also, why is it calling .cuda() to move tensors to a CUDA device? I suppose this is because it's based on HIP - which comes with its own set of problems, but that's ROCm for the masses, I guess.

Also, tgemm.mm has to be a torch module (at first I thought this was some low-level library they now have a preview of, because there is a ROCm torch already...), which is evident from the table just before the summary. That table also smells like they are mostly focused on inference...

EDIT: it seems the official ROCm torch is also based on HIP.
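
On the .cuda() point: the ROCm builds of PyTorch expose the HIP backend under the existing "cuda" device name so that code written for Nvidia runs unmodified. A quick way to tell which backend a given build is using:

    import torch

    print(torch.cuda.is_available())       # True on a working ROCm build, despite the name
    print(torch.version.hip)               # a HIP/ROCm version string on ROCm builds, None on CUDA builds
    print(torch.version.cuda)              # the reverse: None on ROCm builds
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # e.g. reports the Radeon/Instinct card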


So to do an efficient MM on AMD, you need to find every MM in the PyTorch model and replace it with a call to this library? Seems like something that should've been fixed years ago.

Also, I assume Nvidia does the same thing, but it is still hilarious that this is how it works.

https://github.com/ROCm/aiter/blob/main/aiter/configs/bf16_t...
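
In practice the swap is usually automated by walking the module tree rather than editing the model by hand. A rough sketch, assuming a drop-in wrapper (here called TunedLinear, my own name) around the tgemm entry point shown in the blog post:

    import torch
    from aiter.tuned_gemm import tgemm  # the entry point shown in the blog post

    class TunedLinear(torch.nn.Module):
        """Hypothetical drop-in replacement for torch.nn.Linear backed by the tuned GEMM."""
        def __init__(self, linear: torch.nn.Linear):
            super().__init__()
            self.weight, self.bias = linear.weight, linear.bias

        def forward(self, x):
            return tgemm.mm(x, self.weight, self.bias, None, None)

    def swap_linears(module: torch.nn.Module):
        # Recursively replace every nn.Linear child with the tuned version.
        for name, child in module.named_children():
            if isinstance(child, torch.nn.Linear):
                setattr(module, name, TunedLinear(child))
            else:
                swap_linears(child)

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
    ).cuda()
    swap_linears(model)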


Still waiting for ROCm on my cheap Radeon RX 7600. Would be nice to play around with it a little. I know that this card is nothing fancy. There is a GitHub issue somewhere where they announced they would port it to consumer cards on Linux, but last time I checked (a few days ago) it still wasn't available.


I used ROCm on an RX 7600 a month after launch. Having no official support does not at all mean it doesn't work.


You should be able to make it think you have another card:

    export HSA_OVERRIDE_GFX_VERSION=10.3.0

The possible values are said to be:

    # gfx1030 = "10.3.0"
    # gfx900  = "9.0.0"
    # gfx906  = "9.0.6"
    # gfx908  = "9.0.8"
    # gfx90a  = "9.0.a"


Telling ROCm to pretend that your RDNA 3 GPU (gfx1102) is an RDNA 2 GPU (gfx1030) is not going to work. The ISAs are not backwards-compatible like that. You might get away with pretending your gfx1102 GPU is a gfx1100 GPU, but even that depends on the code that you're loading not using any gfx1100-specific features. I would generally recommend against using this override at all for RDNA 3 as those ISAs are all slightly different.

In any case, the possible values can be found in the LLVM documentation [1]. I would recommend looking closely at the notes for the generic ISAs, as they highlight the differences between the ISAs (which is important when you're loading code built for one ISA onto a GPU that implements a different ISA).

[1]: https://llvm.org/docs/AMDGPUUsage.html#processors


I forgot that there's an "11.0.0" as well. Perhaps others have been added since.


I believe the override for GP's 7600 is 1100 or 11.0.0 as GFX1030 is RDNA2 (6800 XT).


The 7900 models are all 1100, the 7800XT is 1101 and the 7600 is 1102.

See Shader ISA: https://www.techpowerup.com/gpu-specs/radeon-rx-7600-xt.c419...


Use the PyTorch Nightly build. The ROCm libraries themselves have been built for the RX 7600 (gfx1102) since ROCm 5.4/5.5, but PyTorch itself wasn't enabled until a few weeks ago. The RX 7600 is still not 'officially supported' on Linux, but I have an RX 7600 XT and I haven't encountered any issues in my (admittedly intermittent) use of the card in AI applications. You may, however, find the 8GB of VRAM in the non-XT version to be a limitation.


Wow, it sure sounds like a mess under there. They used 4 different languages?

Using one high-level language and assembly sounds fine, but four feels incoherent. Would love to know why this happened.

"This infrastructure is built upon a variety of underlying technologies, including Triton, CK (Compute Kernel), ASM (Assembly), and HIP (Heterogeneous Interface for Portability)."


That's not exactly unusual; for example, PyTorch has Python, C++, C, and CUDA.


Notice those are all (except arguably CUDA) very mainstream languages. All four of AMD's are niche. Upstreaming this into PyTorch would double the number of languages used. (Although HIP is very similar to CUDA.)


HIP is essentially the same as CUDA, CK is not a language but a library, and assembly is basically used in the Nvidia ecosystem as well, in the form of PTX.

There is absolutely nothing out of the ordinary here. Yes, it's multiple languages, but not any more or any different than what you'd use on an Nvidia platform (except obviously for the assembly part -- AMD's ISA is different from PTX, but that's to be expected).


I agree using both a high level and a low level language is normal, and yes using libraries is fine.

It's having both Triton and HIP in the same project which I find weird. It feels very fragmented to me to use two high level languages. Maybe it makes sense given Triton is easier to use but less fully featured, but it definitely didn't strike me as normal.

I would be interested to know whether NVIDIA uses more than CUDA and PTX/SASS to write cuDNN and cuBLAS.


I would argue that Triton is in fact higher-level than HIP. Plus, it is more specialised for specific use cases.


Well, if you're including ASM on AMD's list, you have to include it for CUDA too; people definitely embed PTX in their kernels. Triton is also gaining steam, so not too crazy. But yes, HIP and CK are rather obscure. In my limited time working w/ the AMD software stack this was a trend -- lots of little languages and abandoned toolchains, no unified strategy.


I believe that PyTorch already uses Triton; I recently tried to do torch.compile on a Windows machine and it did not work because the inductor backend relies on Triton.


Those aren't four different languages. CK and HIP are both just libraries.


HIP is AMD's equivalent of CUDA and is certainly a language.

But you are right CK is indeed a library, thanks for pointing that out.


Wait, did they get their own library name wrong? CK should be Composable Kernel; I can't find anything called "Compute Kernel" anywhere.


It does look like that, yes. It wasn't my error; the quote is copy-pasted verbatim from the article.


Really interesting. How does it compare to tinygrad's support for AMD GPUs?


Performance increased 100% on an MI300X running a large LLM.

On one hand, cool. On the other hand, wow, have they been leaving a lot of performance on the table.

How does the performance compare to NVidia now?


Anyone tried any of this on a few 7900 XTXs (or have familiarity with this hardware and platform)? I've just purchased 6 for some small-scale experimentation. I'm thinking for the next machine I'll use AMD Radeon PRO W7900s (to get 128 GB of VRAM per machine).


Just export HSA_OVERRIDE_GFX_VERSION=11.0.0 and things should mostly work. Off the top of my head, some of the fp8 types aren't supported but <shrug>


The RX 7900 XTX and Radeon PRO W7900 are already 11.0.0. That override is unnecessary.


Thanks -- I don't need everything to work, just enough to explore the platform and develop some realistic prototypes, which can then probably be moved to the Radeon PROs.


I run a large test suite (~30,000 tests) meant for the MI300 daily on my local 7900. I don't keep track of failures outside of a specific few tests that I'm interested in, but in general I get about 70-80% passing.


I have a 7900 GRE, which is the same except with less memory. I run Gemma 3, Llama 3.1, the QwQ models, and the DeepSeek distilled models using llama.cpp. They run fine; I especially like the new Gemma3-27b-Q6 (a 20 GB model), on which I get 2 tok/s.

I have also run Hunyuan3D-2 and generated 3D models. You do have to separate out the model generation and texture generation phases, but it works.

I run ComfyUI and bootleg GGUF models. This is all on Windows. Now even WSL2 works, so I am using Ubuntu 24.04 on Windows 11 to run Hunyuan3D-2.

For LLMs, llama.cpp native binaries are available. Everything just works out of the box.


We have a dual W7800 system in-house as our `gfx1100` rig. I'll try to install and run through the tests sometime this week.


Silly question perhaps, but is this a true CUDA equivalent? Why (not)?


This is equivalent to something like cuDNN, a CUDA library.

Aiter is a ROCm library.

ROCm is the thing that is like CUDA, but for AMD.


Why is everyone using the GPUs of this other company for AI?



