Shouldn't the matrix multiplication at the core of this be in a core library? Why are generic layers intermixed with LLM-specific kernels when those generic layers duplicate functionality already in torch?
Upstreaming that might actually help researchers doing new work, versus the narrow demographic of people speeding up LLMs on MI300Xs.
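To illustrate the point, here's a rough sketch (not this project's actual code) of how a vendor GEMM could plug in behind torch instead of shipping a parallel stack of generic layers. The op name `my_vendor::gemm` and the fallback body are made up; a real backend would call its hand-tuned kernel inside the op:

```python
import torch

# Register the vendor kernel as a torch custom op so existing torch.nn
# modules keep working and only the kernel underneath changes.
@torch.library.custom_op("my_vendor::gemm", mutates_args=())
def vendor_gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Placeholder fallback; a real backend would dispatch to its tuned GEMM here.
    return a @ b

@vendor_gemm.register_fake
def _(a, b):
    # Shape/dtype propagation so torch.compile can trace through the op.
    return torch.empty(a.shape[0], b.shape[1], dtype=a.dtype, device=a.device)

x = torch.randn(128, 256)
w = torch.randn(256, 64)
print(vendor_gemm(x, w).shape)  # torch.Size([128, 64])
```

That way the generic layer code lives in one place (torch) and the device-specific speedups stay a swappable backend detail.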