Bimos · 2025-02-27T02:58:17 1740625097

Maybe add Chimera as well?

isoprophlex · 2025-02-27T05:59:26 1740635966

It looks as if Chimera has marginally fewer bubbles than DualPipe?

danielhanchen · 2025-02-27T03:31:44 1740627104

Oh more nice pictures :)

Bimos · 2025-02-27T01:59:40 1740621580

I heard that it is possible to achieve better performance than cuBLAS using CUTLASS? I thought they chose the better one among cuBLAS and CUTLASS as baseline.

Bimos · 2025-02-26T02:06:57 1740535617

> FFMA SASS interleaving

> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess, yielding the current warp and let other warps work).

> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.

I would say it is really mind-blowing.
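To make the quoted idea concrete, here is a toy sketch of that kind of binary patch: walk a run of consecutive FFMA instructions and flip two control bits on every other one. Everything specific is an assumption for illustration only. The 16-byte instruction width, the `YIELD_BIT` and `REUSE_BIT` positions, and the caller-supplied `is_ffma` predicate are hypothetical placeholders, not the real Hopper SASS encoding and not DeepSeek's actual script.

```python
# Toy model of an "interleaved bit flip" over a compiled instruction stream.
# ASSUMPTIONS (hypothetical, not the real SASS encoding):
#   - each instruction is a 16-byte little-endian word
#   - YIELD_BIT / REUSE_BIT are made-up bit positions
INSTR_BYTES = 16
YIELD_BIT = 109   # hypothetical position of the yield control bit
REUSE_BIT = 122   # hypothetical position of a register-reuse bit


def patch_ffma_run(code: bytes, is_ffma) -> bytes:
    """Flip the yield and reuse bits on every second FFMA in each run.

    `is_ffma` is a caller-supplied predicate taking the instruction word
    as an int; how to recognise an FFMA is left to the caller.
    """
    out = bytearray(code)
    run_index = 0  # position inside the current run of consecutive FFMAs
    for off in range(0, len(code) - INSTR_BYTES + 1, INSTR_BYTES):
        word = int.from_bytes(code[off:off + INSTR_BYTES], "little")
        if not is_ffma(word):
            run_index = 0  # the run is broken by any other instruction
            continue
        if run_index % 2 == 1:
            word ^= 1 << YIELD_BIT  # flip the (assumed) yield bit
            word ^= 1 << REUSE_BIT  # flip the (assumed) reuse bit too
            out[off:off + INSTR_BYTES] = word.to_bytes(INSTR_BYTES, "little")
        run_index += 1
    return bytes(out)
```

A real tool would also have to locate the kernel's code section inside the binary and decode instructions properly; this only shows the interleaving pattern.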

blackeyeblitzar · 2025-02-26T02:39:27 1740537567

From what I read elsewhere, this is the type of typical performance optimization for matrix math you would see when performance is critical. It’s just not been applied yet to this specific problem by other AI players since it wasn’t a necessity for other companies. But eventually everyone would probably end up here regardless.

mitthrowaway2 · 2025-02-26T03:47:33 1740541653

How many people does it take to implement this? A 10% gain in performance could pay for a lot of people's salaries when your company is spending hundreds of millions on GPU clusters.

fulafel · 2025-02-26T05:10:41 1740546641

If you think about how many people looked and failed to find this optimization in the community's earlier performance efforts, you could argue for quite a big number.

rfoo · 2025-02-26T05:36:35 1740548195

Uh, three? I worked at $CORP where we had a three-person sub-team. They reverse engineered most of Volta's SASS instruction encoding and built a working SASS assembler (before the open source one, of course), with the ultimate goal of making GEMM / Conv faster. And they did it. Though it wasn't applied to a high-profile enough big picture, so nobody heard about it :>

If you don't believe me, previous open source SASS assemblers were mostly from universities; they surely didn't have that many people.

bjourne · 2025-02-26T09:36:05 1740562565

Did $CORP also release the implementation to make it trivial for others to replicate their work?

rfoo · 2025-02-26T09:42:16 1740562936

I think we did release some of the optimized kernels, but I don't think we released any with SASS black magic, at least not before I left. Already been sanctioned by BIS, better not annoy NVIDIA any further.

DannyBee · 2025-02-26T13:05:40 1740575140

Actually, a number of them did. Even Google did.

saagarjha · 2025-02-26T03:55:13 1740542113

I mean it’s not a significant change so one? But that isn’t to say anyone could do it.

rvz · 2025-02-26T07:00:12 1740553212

Just a reminder: this is the third of many open source releases that DeepSeek is willing to put out, and for them, finding optimizations like this when needed is a very low bar.

I guess since the majority here are blown away by the very low-level code involved, it tells me that they're likely not ready to use it or have been stuck on very high level tools that abstract this away.

randomNumber7 · 2025-02-26T07:18:28 1740554308

I'll tell you a secret. Most devs do something wrong when they start rolling their own linear algebra library. That's why people use LAPACK, BLAS, etc...

KeplerBoy · 2025-02-26T08:01:53 1740556913

The thing is, most people don't use LAPACK or BLAS. Most people are at higher levels of abstraction than torch.matmul.

rowanG077 · 2025-02-26T19:01:18 1740596478

Just a few highly skilled people.

Bimos · 2025-02-26T03:02:52 1740538972

I think most AI players rely on high-performance GEMM. But most people would be satisfied with CUTLASS or cuBLAS, and the others implement GEMM themselves, but don't necessarily use undocumented features?

creato · 2025-02-26T03:58:13 1740542293

Using undocumented features is not rare. People reverse engineered Apple's undocumented AMX instructions on their CPU, and I know people use undocumented/private extensions for several different kinds of GPUs.

Zacharias030 · 2025-02-26T04:40:09 1740544809

I've only seen it done by hedge funds so far. What were you referring to?

shaklee3 · 2025-02-26T05:54:20 1740549260

Scott Gray figured out this exact thing and more back in 2015 for Maxwell, and it's been written about many times since by other people.

ETH_start · 2025-02-26T02:16:36 1740536196

It is not literally mind-blowing.

tough · 2025-02-26T02:25:32 1740536732

I think he might mean hyperbolically figuratively so

dang · 2025-02-26T04:48:16 1740545296

Literally literally means not literally.

I love it when words turn into their opposites!

Bimos · 2025-02-26T03:09:13 1740539353

I edited it.

kneegerman · 2025-02-26T02:39:43 1740537583

orthogonally

Bimos · 2025-02-25T02:44:33 1740451473

The PTX instructions they talked about in the tech report should be pointing to the code here?

zardinality · 2025-02-25T04:32:22 1740457942

"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."
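As a rough illustration of the opt-out the quote mentions, a build script could map that setting to a compiler define along these lines. This is a minimal sketch that assumes `DISABLE_AGGRESSIVE_PTX_INSTRS` is read as an environment variable; the helper name `aggressive_ptx_flags` and the exact `-D` flag are hypothetical and not taken from the project's real `setup.py`.

```python
import os


def aggressive_ptx_flags(environ=None):
    """Return extra compiler flags for the (assumed) opt-out switch.

    ASSUMPTION: the setting is an environment variable and is forwarded
    to the compiler as a preprocessor define with the same name.
    """
    env = os.environ if environ is None else environ
    if env.get("DISABLE_AGGRESSIVE_PTX_INSTRS", "0") == "1":
        # The kernels could then fall back to a plain, documented load.
        return ["-DDISABLE_AGGRESSIVE_PTX_INSTRS"]
    return []


if __name__ == "__main__":
    print(aggressive_ptx_flags())
```

Under that assumption, building with the variable set to `1` would add the define, and the kernel source could pick a conservative load instruction behind an `#ifdef`.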

magicalhippo · 2025-02-25T06:25:04 1740464704

So non-coherent refers to bypassing cache coherency, i.e. not caring about what other units might have written to that address? And the L1/L2 modifiers are to avoid L1 thrashing, keeping the value in L2 only?

Or did I get that wrong?

ta988 · 2025-02-25T06:55:36 1740466536

My understanding of the L2 part is that it asks for a 256b prefetch (only available on some platforms, it seems), but they use vectors of at most four 32-bit signed ints, so I'm not sure why only the 256 would work, or whether the fact that it fetched the next 128 helps.
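For the sizes being compared here, a quick back-of-the-envelope check may help. It assumes the `B` in `.L2::256B` means bytes (a 256-byte prefetch-size hint, as the PTX ISA documentation describes it) and that the widest load is a vector of four 32-bit ints, as stated in the comment above; the variable names are just for illustration.

```python
# Size arithmetic only; nothing here touches a GPU.
VECTOR_LANES = 4           # four ints per vector load (from the comment)
LANE_BITS = 32             # each int is 32 bits
PREFETCH_HINT_BYTES = 256  # ASSUMPTION: ".L2::256B" is a 256-byte hint

vector_bits = VECTOR_LANES * LANE_BITS                   # bits per vector load
vector_bytes = vector_bits // 8                          # bytes per vector load
vectors_per_hint = PREFETCH_HINT_BYTES // vector_bytes   # loads covered by one hint

print(f"one vector load: {vector_bits} bits = {vector_bytes} bytes")
print(f"one prefetch hint spans {vectors_per_hint} such loads")
```

Under that reading, the hint would cover many neighbouring vector loads rather than just "the next 128", which might be where the benefit comes from.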

saagarjha · 2025-02-25T11:36:12 1740483372

Yeah that's about right

helloericsf · 2025-02-25T02:56:34 1740452194

this might help: https://x.com/main_horse/status/1894215779521794058/photo/1

Bimos · on Feb 19, 2025

How is it different from eliminating y?

Bimos · on Feb 14, 2025

As a Chinese person, I want to mention that people in other countries don't usually call it "PPT".

Bimos · on Feb 14, 2025

Let's see how the US reacts to this incident at the Munich meeting.

rasz · on Feb 15, 2025

This wouldn't happen if Russia had better access to state-of-the-art military-grade electronics - sanctions lifted!

readyplayernull · on Feb 14, 2025

Increase drone traffic controller headcount?

Bimos · on Jan 8, 2025

> If GPUs Are So Good

Who ever claimed that?

Bimos · on Dec 16, 2024

> In Shanghai, the Huangpu River divides old and new.

Pudong is new, but Puxi is not "the old one". It is mixed with old historical sites and modern buildings. It lives with its history, and it grows actively. It doesn't have to destroy all the old stuff, but it also doesn't have to slow down its pace for them. I believe it is the same for any city (e.g. in Europe) with history.

Bimos · on Oct 30, 2024

button^2