
The training sources and weights are public info. Less than 5% of the training data came from Wikipedia, and that slice covers many languages. English Wikipedia article text alone is ~22 GB when losslessly compressed, so it's no surprise it's not giving original articles back.

  CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange [2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
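A quick sanity check of the quoted mix (a minimal sketch; the percentages and language list are taken from the quote above, and the dictionary here is just an illustration, not the model's actual config):

```python
# Sampling proportions (percent) quoted for the training mix.
mix = {
    "CCNet": 67.0,
    "C4": 15.0,
    "GitHub": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "Stack Exchange": 2.0,
}

# The proportions should account for the full corpus.
total = sum(mix.values())
print(total)  # 100.0

# The Wikipedia slice is spread over 20 languages, so English
# Wikipedia is only a fraction of that 4.5%.
langs = "bg ca cs da de en es fr hr hu it nl pl pt ro ru sl sr sv uk".split()
print(len(langs))  # 20
```

So even before dividing by language, Wikipedia is under one twentieth of the training data, which supports the point about verbatim recall being unlikely.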


