
I think a lot of this stuff will get upstreamed eventually. PyTorch just moves more slowly, and since it's a stable library it can't rapidly adopt something like fused MoE until the dust has settled a little and it's clear what the API should look like long-term.

I think it's fine that this stuff gets tried first in Torch extensions. That's how Flash Attention started, after all, and the same is true for newer kernels in CUDA-land (fused MoE, MLA, Marlin, etc.).

As for TorchScript, that's really legacy at this point; torch.compile is where it's at. This post seems to suggest that the kernels work with torch.compile: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
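
For anyone curious what "works with torch.compile" means in practice, here's a minimal sketch (my own toy example, not the AMD kernels): if an extension kernel is registered as a custom op with a fake impl for shape propagation, torch.compile can trace through it without graph breaks. The op name "myext::fused_silu_mul" and the silu-mul fusion are just placeholders I made up for illustration.

    # Sketch only: a hypothetical extension kernel exposed as a custom op
    # so it composes with torch.compile (requires PyTorch >= 2.4).
    import torch

    @torch.library.custom_op("myext::fused_silu_mul", mutates_args=())
    def fused_silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
        # Reference implementation; a real extension would dispatch to a
        # hand-written CUDA/HIP kernel here instead.
        return torch.nn.functional.silu(gate) * up

    @fused_silu_mul.register_fake
    def _(gate, up):
        # Shape/dtype propagation so torch.compile can trace the op
        # without actually running the kernel.
        return torch.empty_like(gate)

    @torch.compile
    def mlp_block(x, w_gate, w_up):
        return fused_silu_mul(x @ w_gate, x @ w_up)

    x = torch.randn(4, 64)
    w_gate = torch.randn(64, 256)
    w_up = torch.randn(64, 256)
    print(mlp_block(x, w_gate, w_up).shape)  # torch.Size([4, 256])

The point is just that the custom-op route gives the compiler enough metadata to treat the kernel as a black box, which is presumably what the ROCm post relies on.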
