
I assume they're referring to comments like this one: "GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference."

https://twitter.com/soumithchintala/status/16712671501017210... https://archive.li/rfFlW

I'm not sure which is the most canonical paper on mixture of experts, but here's one possibility:

https://arxiv.org/pdf/1701.06538.pdf
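
For anyone who hasn't read it: the core idea in that paper is a sparsely-gated layer where a small gating network routes each token to only the top-k of many expert feed-forward networks and mixes their outputs. Here's a minimal sketch of that idea, assuming PyTorch; the class name, sizes, and routing loop are illustrative, not taken from the paper or from any claims about GPT-4:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SparseMoE(nn.Module):
      """Toy sparsely-gated mixture-of-experts layer (top-k routing)."""
      def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
          super().__init__()
          # Each expert is an independent feed-forward network.
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                            nn.Linear(d_hidden, d_model))
              for _ in range(n_experts)
          )
          # The gating network scores every expert for every token.
          self.gate = nn.Linear(d_model, n_experts)
          self.top_k = top_k

      def forward(self, x):                      # x: (n_tokens, d_model)
          scores = self.gate(x)                  # (n_tokens, n_experts)
          top_vals, top_idx = scores.topk(self.top_k, dim=-1)
          weights = F.softmax(top_vals, dim=-1)  # renormalize over chosen experts
          out = torch.zeros_like(x)
          # Send each token only to its top-k experts and mix the outputs.
          for slot in range(self.top_k):
              for e, expert in enumerate(self.experts):
                  mask = top_idx[:, slot] == e
                  if mask.any():
                      out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
          return out

  tokens = torch.randn(16, 512)
  print(SparseMoE()(tokens).shape)   # torch.Size([16, 512])

The point of the sparsity is that compute per token scales with top_k, not with the total number of experts, which is why the parameter count can be huge while inference stays tractable.
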



I think when people refer to MoE these days they are actually generally referring to the Google GLaM paper.



