
yes, and it's on a per-layer basis, I think!

So if the model has 16 transformer layers to go through on a forward pass, and at each layer it gets to pick one of 16 different experts, that's 16^16 (= 2^64, roughly 1.8 x 10^19) possible expert combinations per token!
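To sanity-check that arithmetic, here's a quick sketch. It assumes each layer routes a token independently and picks exactly one expert (top-1 routing); the function name `expert_paths` and the `top_k` parameter are just made up for illustration. If a model picks k experts per layer instead, the per-layer count becomes "16 choose k" rather than 16.

```python
from math import comb


def expert_paths(num_layers: int, num_experts: int, top_k: int = 1) -> int:
    """Count distinct routing paths for one token through the network.

    Assumption: every layer chooses an unordered set of `top_k` experts
    out of `num_experts`, independently of the other layers.
    """
    choices_per_layer = comb(num_experts, top_k)
    return choices_per_layer ** num_layers


# 16 layers, 16 experts, one expert per layer: 16**16, which equals 2**64.
print(expert_paths(16, 16))

# Same setup with a hypothetical top-2 router: (16 choose 2)**16.
print(expert_paths(16, 16, top_k=2))
```

Python ints are arbitrary precision, so the result doesn't overflow even though it's one past the largest unsigned 64-bit value.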


