
I don’t think they’re using picture-heavy books for LLM training, no?



Just because the LLMs are trained on text doesn't mean that images weren't a part of what they downloaded.

You clean up the data after you acquire it, not before.
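
In practice that cleanup is just a filtering pass over the dump after it lands on disk. A minimal sketch in Python of what "clean up after acquiring" could look like; the directory name and extension policy here are hypothetical, not anything the trainers have described:

    import os

    DUMP_DIR = "books_dump"  # hypothetical path to the downloaded dump
    TEXT_EXTS = {".txt", ".epub", ".html", ".pdf"}  # PDFs get text-extracted later

    def keep_for_text_training(name):
        # Keep formats that carry text; standalone images (.jpg, .png, ...) are dropped.
        return os.path.splitext(name)[1].lower() in TEXT_EXTS

    kept = [os.path.join(root, name)
            for root, _, names in os.walk(DUMP_DIR)
            for name in names
            if keep_for_text_training(name)]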


Even if they didn't use the illustrations (which isn't clear, given multimodal models), they'd still make use of the text in the books.


Presumably they didn't create the torrent


Whoever created it has a lot of spare hard disk space.


100TB is like 6 hard drives...


> 100TB is like 6 hard drives...

Discounted Seagates? /s


You can get recertified 18TB drives, but it's still a lot of disk space. I simply don't have that much data.
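
For the arithmetic: 100 TB / 18 TB per drive ≈ 5.6, so six 18TB drives would hold the whole thing.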


Yes they do, there are multimodal models.


I don't think they need to be selective. It's not like Meta can run out of storage.


For multi-modal models, why not? They would probably be some of the best data.


Sometimes the PDF of a book is big because the book's packed with important illustrations and charts - like a textbook or journal paper.

Other times a PDF of a book is big because someone scanned it and didn't have trustworthy OCR, so they figured distributing images of text at 1.5 MB per page was better than risking OCR errors.
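
The sizes add up quickly that way: e.g. a 400-page scan at 1.5 MB per page is already about 600 MB for one book, versus typically a few MB for a text-native PDF.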


Why not? Do you think that AI doesn't enjoy porn? /s



