
Wouldn't web scraping be possible by taking screenshots of the rendered pages and then reading them with OCR?


If you just want the text, there are other ways to do that. You could dump out document.body.innerText, for example. Here's how to do that with https://shot-scraper.datasette.io/en/stable/javascript.html

    shot-scraper javascript youtube.com 'document.body.innerText' -r
Output: https://gist.github.com/simonw/f497c90ca717006d0ee286ab086fb...

Or access the accessibility tree of the page using https://shot-scraper.datasette.io/en/stable/accessibility.ht...

    shot-scraper accessibility youtube.com
Output here: https://gist.github.com/simonw/5174380dcd8c979af02e3dd74051a...
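
For anyone who wants to post-process that output, here is a minimal Python sketch. It assumes the accessibility dump is a nested JSON object whose nodes carry "role", "name" and an optional "children" list; the sample tree below is invented for illustration, not real YouTube output.

```python
import json

def iter_named_nodes(node, depth=0):
    """Yield (depth, role, name) for every node that has a non-empty name."""
    name = node.get("name")
    if name:
        yield depth, node.get("role", ""), name
    for child in node.get("children", []):
        yield from iter_named_nodes(child, depth + 1)

# Hypothetical sample; the real tree for a page will be much larger.
SAMPLE_TREE = json.loads("""
{
  "role": "WebArea",
  "name": "Example video page",
  "children": [
    {"role": "heading", "name": "Example title", "level": 1},
    {"role": "link", "name": "Sign in"},
    {"role": "generic", "name": "", "children": [
      {"role": "text", "name": "1,234 views"}
    ]}
  ]
}
""")

named_nodes = list(iter_named_nodes(SAMPLE_TREE))
for depth, role, name in named_nodes:
    # Indent by depth so the nesting stays visible in the printout.
    print("  " * depth + f"{role}: {name}")
```

Keeping the role next to each name is what lets you tell a heading from a link, which plain innerText throws away.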


Of course, if the document is using the outline in unexpected ways, you'll run into trouble. Consider Facebook infamously splitting "Advertisement" into multiple spans to avoid tripping ad blockers.
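
To make the split-span problem concrete, here is a rough Python sketch using only the standard library. The markup is made up for illustration and is not Facebook's actual HTML; a page could also use hidden decoy characters or CSS reordering, which this would not handle.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect every text node in document order, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def joined_text(html):
    """Return the text of all nodes concatenated with no separator."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

# Hypothetical markup: one word split across four spans.
SPLIT_LABEL = (
    "<div><span>Adv</span><span>ert</span>"
    "<span>ise</span><span>ment</span></div>"
)

# A naive substring search on the raw markup misses the word.
found_in_markup = "Advertisement" in SPLIT_LABEL

# Joining the text nodes recovers it.
label = joined_text(SPLIT_LABEL)
print(found_in_markup, label)
```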


Although you'd imagine screenshots would be easy to OCR reliably, OCR isn't guaranteed to get everything correct.

It's not like you can rely on a dictionary to confirm you've correctly OCRed a post by "@4EyedJediO" - who knows if that's an O or a 0 at the end?
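
One way to at least surface that ambiguity is to expand each confusable character into its look-alikes. This is only a sketch in Python: the CONFUSABLE table is a hypothetical, partial list, and each candidate would still need checking against some other source such as the DOM or the profile URL.

```python
from itertools import product

# Hypothetical, partial table of glyphs that OCR commonly mixes up.
CONFUSABLE = {
    "O": "O0",
    "0": "O0",
    "l": "lI1",
    "I": "lI1",
    "1": "lI1",
}

def candidate_readings(text):
    """Return every string obtained by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in text]
    return ["".join(combo) for combo in product(*options)]

candidates = candidate_readings("@4EyedJediO")
print(candidates)
```

The number of candidates multiplies with every ambiguous character, so this only stays manageable for short strings like handles.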

And if you're OCRing the title and view count of a youtube video, for example, you've got to take the page layout into account because there's a recommendations sidebar full of other titles with different view counts.
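
A sketch of what taking the layout into account could look like in Python, assuming the OCR step gives you each word with the pixel position of its top-left corner. The OcrWord type, the coordinates and the region are all invented for illustration; real OCR engines report boxes in their own formats.

```python
from dataclasses import dataclass

@dataclass
class OcrWord:
    text: str
    x: int  # left edge of the word, in pixels
    y: int  # top edge of the word, in pixels

def words_in_region(words, left, top, right, bottom):
    """Keep only the words whose top-left corner falls inside the region."""
    return [w for w in words if left <= w.x < right and top <= w.y < bottom]

# Hypothetical OCR output for a 1280x1080 screenshot.
WORDS = [
    OcrWord("Main", 40, 620),
    OcrWord("video", 95, 620),
    OcrWord("1,234", 40, 660),
    OcrWord("views", 100, 660),
    OcrWord("Other", 980, 120),   # sidebar recommendation
    OcrWord("9,999", 980, 150),   # sidebar view count
    OcrWord("views", 1040, 150),
]

# Assumed main column: everything left of x=900.
MAIN_COLUMN = (0, 0, 900, 1080)

main_text = " ".join(w.text for w in words_in_region(WORDS, *MAIN_COLUMN))
print(main_text)
```

The region has to be worked out per site and per viewport size, which is part of why this approach is fragile.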


I guess you'd get better results if you knew the font the site uses (which in many cases you could figure out pretty quickly) or even just overrode every font with your own.


Yes, it's possible. We do this for TV shows.


Much of the content worth scraping isn't rendered on the screen.


Do you have any examples? I haven’t experienced this myself.


URL, images, stuff shown after you click on a button...


Probably very inefficient, as it would depend a lot on layout too.


As inefficient as parsing heap snapshots?


Much more


You'll be spending resources on LLMs like crazy. Possible but very messy IMO.


You don't need LLMs for OCR.


No, but maybe you want to do something with the OCR output.


OCR does not get you the names of the classes in a DOM.



