Wouldn't web scraping be possible by taking screenshots of the rendered pages an...

simonw · on Aug 8, 2023

If you just want the text, there are other ways to get it. For example, you could dump out document.body.innerText. Here's how to do that with shot-scraper: https://shot-scraper.datasette.io/en/stable/javascript.html

    shot-scraper javascript youtube.com 'document.body.innerText' -r

Output: https://gist.github.com/simonw/f497c90ca717006d0ee286ab086fb...

Or access the accessibility tree of the page using https://shot-scraper.datasette.io/en/stable/accessibility.ht...

    shot-scraper accessibility youtube.com

Output here: https://gist.github.com/simonw/5174380dcd8c979af02e3dd74051a...

lelandfe · on Aug 8, 2023

Of course, if the document is using the outline in unexpected ways, you'll run into trouble. Consider Facebook infamously splitting "Advertisement" into multiple spans to avoid tripping ad blockers.
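To illustrate the span-splitting problem, here is a minimal sketch using only the Python standard library. The markup in `SAMPLE` is hypothetical (it is not Facebook's actual HTML), and `visible_text` is a made-up helper name. It shows that a naive substring search over the raw HTML misses the word, while joining the text nodes recovers it.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the concatenated text nodes of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical markup: one word split across several spans.
SAMPLE = "<div><span>Adv</span><span>ert</span><span>ise</span><span>ment</span></div>"

print("Advertisement" in SAMPLE)                # False: raw HTML search misses it
print("Advertisement" in visible_text(SAMPLE))  # True: joined text nodes contain it
```

Real pages can go further than this, for example by interleaving hidden decoy characters, in which case joining text nodes alone would not be enough.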

michaelt · on Aug 8, 2023

Although you'd imagine screenshots would be easy to OCR reliably, OCR isn't guaranteed to get everything correct.

It's not like you can rely on a dictionary to confirm you've correctly OCRed a post by "@4EyedJediO" - who knows if that's an O or a 0 at the end?
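As a rough sketch of why this is hard to resolve after the fact, the snippet below enumerates every reading of a string that is consistent with a small table of look-alike characters. The table `CONFUSABLE_GROUPS` and the function name `candidate_readings` are assumptions made for illustration, not part of any OCR library.

```python
from itertools import product

# Assumed look-alike groups; a real OCR confusion table would be larger.
CONFUSABLE_GROUPS = ["O0", "Il1"]


def candidate_readings(text):
    """Return all strings obtainable by swapping look-alike characters."""
    options = []
    for ch in text:
        # Use the whole group if the character is confusable, else just itself.
        group = next((g for g in CONFUSABLE_GROUPS if ch in g), ch)
        options.append(group)
    return sorted("".join(combo) for combo in product(*options))


print(candidate_readings("4EyedJediO"))
# ['4EyedJedi0', '4EyedJediO']
```

Both candidates are plausible usernames, so nothing in the text itself tells you which one the screenshot showed.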

And if you're OCRing the title and view count of a youtube video, for example, you've got to take the page layout into account because there's a recommendations sidebar full of other titles with different view counts.

plorntus · on Aug 8, 2023

I guess you'd get better results if you knew the font the site uses (which in many cases you could figure out pretty quickly), or even just overrode every font with your own.

is_true · on Aug 8, 2023

Yes, it's possible. We do this for TV shows.

berkle4455 · on Aug 8, 2023

Much of the content worth scraping isn't rendered on the screen.

zffr · on Aug 8, 2023

Do you have any examples? I haven’t experienced this myself.

throwawayadvsec · on Aug 8, 2023

URL, images, stuff shown after you click on a button...

ekianjo · on Aug 8, 2023

Probably very inefficient, as it would depend a lot on layout too.

cush · on Aug 8, 2023

As inefficient as parsing heap snapshots?

brigadier132 · on Aug 8, 2023

Much more

spaniard89277 · on Aug 8, 2023

You'll be spending resources on LLMs like crazy. Possible but very messy IMO.

anamexis · on Aug 8, 2023

You don't need LLMs for OCR.

spaniard89277 · on Aug 8, 2023

No, but maybe you want to do something with the OCR output.

ekianjo · on Aug 8, 2023

OCR does not get you the names of the classes in a DOM.
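For contrast, here is a minimal sketch of what the markup gives you that pixels cannot. It uses only the Python standard library. The `SAMPLE` fragment and the `class_names` helper are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class name used in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def class_names(html):
    """Return the sorted, de-duplicated class names found in the fragment."""
    parser = ClassCollector()
    parser.feed(html)
    parser.close()
    return sorted(parser.classes)


# Hypothetical markup; the rendered screenshot would only show "1.2M views".
SAMPLE = '<div class="video-card featured"><span class="view-count">1.2M views</span></div>'

print(class_names(SAMPLE))
# ['featured', 'video-card', 'view-count']
```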