I only use an iFrame to crawl and scrape content (airovic.com)
287 points by natzar on Dec 26, 2019 | 49 comments


This is how the very first version of Selenium worked. The application under test was in an iframe, and the test controller was in the parent page. The Selenium "Remote Control" protocol was later added, where the controller would phone home to a listening web server for commands to relay to the iframe (basically, AJAX before it had a name). It all mostly worked for the most common test cases, but we abandoned this approach for similar reasons mentioned in the article -- the edge case limitations became more and more frustrating over time. Ultimately, we merged with the WebDriver project, which was implemented in a more native way, avoiding all the limitations of automation-via-iframe.


"It mostly worked for the most common test cases, but we abandoned this approach for similar reasons mentioned in the article -- the edge case limitations became more and more frustrating over time."

The article only mentions malformed URLs and browser run-time errors. Were there any other "edge case limitations" that became intolerable?


Some websites will detect the iframe and send nothing instead of the requested page. I happen to know from personal experience that YouTube is one of the sites that does this.


> YouTube is one of the sites that does this.

that's because they have a proper iframe-embed url: https://www.youtube.com/embed/quyj70RogxI (instead of https://www.youtube.com/watch?v=quyj70RogxI )


I believe karma still works this way.


Yeesh, everyone is so critical here. It's just a blog post about how somebody does occasional one-off scraping across multiple pages using browser devtools.

Yes of course injecting an iframe into a third-party site with devtools isn't going to replace Selenium. But it's a clever little hack in a pinch. No need to get upset.


Go back far enough in time (2004), and injecting an iframe into a third-party site is Selenium. I agree, using an iframe is a fun hack!


Wow, I just posted this, went to take a nap, got back and 84 points ¿?!

The site is working now; it was trying to load some scripts from localhost.

I was just trying to get some feedback and to check whether the document was interesting, because I had spent a lot of time on it.

From what I can see, Selenium does exactly the same thing, but I would still choose this iframe solution for small-to-medium projects.

It's a super small tool that does the job.

Please, let me know if you can fully use airovic.com


Won't the same origin policy kick in the moment I try to read the content of an iframe that isn't on the same origin as my website?

Or is this meant to run in the dev console on the target website? In that case, the iFrame and the Airovic website don't make sense (the Electron app mentioned does, sure, but it doesn't exist).


If you use an iframe, yes, that is the problem.

But the airovic.com tool uses an external script from the same server, which provides HTTP tunneling, so I can modify headers and display everything in the iframe because it is under the same domain.
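
Roughly, the tunnel is just a tiny proxy on the same server: the iframe points at something like /tunnel?url=..., and since the response now comes from our own origin, the target's blocking headers never reach the browser. A simplified sketch, not the actual implementation (it assumes Node/Express and Node 18+'s global fetch):

    const express = require('express');
    const app = express();

    app.get('/tunnel', async (req, res) => {
      const target = req.query.url;            // e.g. /tunnel?url=https://example.com/page
      const upstream = await fetch(target);    // Node 18+ global fetch (assumed)
      const body = await upstream.text();
      // We build a fresh response, so the target's X-Frame-Options / CSP
      // headers are simply never forwarded to the browser.
      res.set('Content-Type', upstream.headers.get('content-type') || 'text/html');
      res.send(body);
    });

    app.listen(3000);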

Thanks for your comment!


That was my thinking as well. You cannot access another website's content opened in an iframe via JavaScript at all.


I think you first visit the website, then you inject an iframe onto the page you're currently on, and then inside that iframe you can scrape any content on that website.

That's at least what it looked like from his examples.
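
Something like this, I think -- my rough reading of his examples, with the URL and selector as placeholders:

    // Run in the dev console while already on the target site,
    // so the injected iframe stays same-origin.
    const frame = document.createElement('iframe');
    frame.style.display = 'none';
    frame.onload = () => {
      const rows = frame.contentDocument.querySelectorAll('.item');  // placeholder selector
      console.log([...rows].map(r => r.textContent.trim()));
    };
    frame.src = '/some/other/page';   // another page on the same site
    document.body.appendChild(frame);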


Okay, the way I see this is that using headless tools like puppeteer or selenium is tedious; just trying to... er scrape my HN account's favorites (AFAIK no API) becomes a task when you have to automate login.

Just typing in your credentials and pressing the button is much easier than automating the task, so that's why the iframe is useful: you can interact with the content (without code).


The irony is that the iframe approach is exactly how the first version of Selenium worked. It's a cool hack, but we abandoned that approach over time because automating iframes couldn't cover all automation use cases.


Is login required to access the favorites page? Isn't the page public? E.g.,

https://news.ycombinator.com/favorites?id=pcr910303

If login were required, it's easy using tmux to automate logging in to HN with a text-only browser such as links and to save the desired text/HTML, etc. to a file. It takes me about 500 characters of script to non-interactively log in, grab some text, and log out.


Selenium allows you to load different browser profiles. So you create a profile manually, log in to the target site, and then every time you load the profile the cookies are already set and you're logged in. Works like a charm.
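
Roughly like this with the JavaScript bindings (a sketch; the profile path is an assumption -- point it at a directory you created by launching Chrome with --user-data-dir and logging in once):

    const { Builder } = require('selenium-webdriver');
    const chrome = require('selenium-webdriver/chrome');

    (async () => {
      const options = new chrome.Options();
      options.addArguments('--user-data-dir=/home/me/scraper-profile');  // assumed path
      const driver = await new Builder()
        .forBrowser('chrome')
        .setChromeOptions(options)
        .build();
      // The profile carries the session cookies, so you're already logged in.
      await driver.get('https://news.ycombinator.com/');
      await driver.quit();
    })();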


You can use non-headless puppeteer to manually log in (or perform any other manual actions, or use the dev console in the puppeteer-controlled browser) and only automate parts you want to automate.
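
A minimal sketch of that flow (the "logged in" selector and the favorites URL are assumptions/placeholders):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: false });
      const page = await browser.newPage();
      await page.goto('https://news.ycombinator.com/login');
      // Log in by hand in the visible window, then the script resumes.
      await page.waitForSelector('#logout', { timeout: 0 });  // assumed logout link id
      await page.goto('https://news.ycombinator.com/favorites?id=yourname');  // placeholder id
      // ...scrape here, or keep using the dev console in the same window...
      await browser.close();
    })();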


This sounds hilariously n00by because it's VB and Internet Explorer, but creating an Internet Explorer instance through VB in, say, Excel and then dumping data into Excel was great because I had full control over my IE instance.

Okay, I'll stop speaking now and revealing the fact I started my career as a data guy at a giant corporation instead of a software engineer.


Haha, loser! Looo sseeerr


I can do everything listed in benefits with puppeteer, while I can’t even make sense of what iframe is supposed to achieve here, or how it’s even gonna load (anyone with a shred of sense would set X-Frame-Options to SAMEORIGIN, subject to exceptions). The airovic.com site doesn’t work and hilariously attempts to load two seemingly important scripts from localhost...

I’m very confused about this submission, and even more confused about how it managed to almost top the front page.

Edit: Having read the code samples, it seems the code snippets are supposed to be run from the same origin in the dev console. A quick and dirty way to interactively scrape without navigation, I guess? Still not sure what the “all together: Airovic.com” is supposed to mean, and definitely more limited than puppeteer.

Edit2: To be fair to the author, they did say

> You cannot bypass their protections without using a HTTP Tunneling component.

Which I didn’t see until just now. This is a pretty big caveat, though; it should probably be more upfront...


He's traversing the site using the injected iframe. That is, there is no top-level navigation event, only an iframe navigation event. Then he's gathering information from the iframe hosted DOM and combining it in the context of the starting, main page.

I think it's kinda clever.
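
In other words, something like this (a sketch; the URLs and selector are placeholders):

    // The parent page keeps the state; only the iframe navigates.
    const pages = ['/search?page=1', '/search?page=2', '/search?page=3'];
    const results = [];
    const frame = document.createElement('iframe');
    frame.onload = () => {
      frame.contentDocument.querySelectorAll('.result-row').forEach(row => {
        results.push(row.textContent.trim());
      });
      const next = pages.shift();
      if (next) frame.src = next;   // iframe navigation only, no top-level navigation
      else console.log('done', results);
    };
    frame.src = pages.shift();
    document.body.appendChild(frame);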


Okay, I guess the scripts are supposed to run in the same origin, not from some third party, say airovic.com. I edited my original post.

It’s still more limited than puppeteer though.


Injecting an iframe into websites could trigger an assertion error, because the iframe isn't supposed to be there.


Hi, that's right. Some sites protect against HTML injection. Twitter protects their site very well, but if you use HTTP tunneling they can't do anything, since you can modify the X- headers.


Interesting, could you explain what you mean by http tunneling and how it can bypass protection against html injection?


Hi! I'm the submitter.

I just submitted the article to receive some feedback. I had been working a lot on the tool and the article, but needed to check whether I was onto something.

I fixed the errors already. It is working.


For this kind of task I usually create a private Firefox extension, which gives me access to extended browser capabilities and the ability to lift some security-related restrictions. I run it in a sandboxed browser, much like I would do with something like Selenium or Puppeteer, but I have many more options to hand-tune the automation.
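
The skeleton is tiny -- a rough sketch, with the match pattern and selector as placeholders:

    // manifest.json (shown as a comment for brevity):
    //   {
    //     "manifest_version": 2,
    //     "name": "my-scraper", "version": "0.1",
    //     "permissions": ["<all_urls>", "storage"],
    //     "content_scripts": [
    //       { "matches": ["*://example.com/*"], "js": ["scrape.js"] }
    //     ]
    //   }
    // scrape.js runs inside every matching page with extension privileges:
    const rows = document.querySelectorAll('.result-row');
    browser.storage.local.set({ scraped: [...rows].map(r => r.textContent.trim()) });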


Depending on the nature of the content being scraped, you can use the `sandbox` attribute on the iFrame to prevent scripts from running.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/if...

This was useful for a brief period when I ran a news aggregator that used iFrames to display content from other news websites. Adding the sandbox attribute prevented scripts, ads, modals, etc.
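
A quick sketch of that (placeholder URL):

    const frame = document.createElement('iframe');
    frame.src = 'https://example.com/article';
    // An empty sandbox grants nothing: no scripts, no forms, no popups, no plugins.
    frame.setAttribute('sandbox', '');
    document.body.appendChild(frame);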

For the purpose of scraping, unless you're always on the same domain (or running a proxy to add CORS), I don't see how an iFrame is better than either a web extension or a backend script using Puppeteer.


Is the only reason for the iframe that it makes it possible to keep state in the top frame while loading different pages?

Because otherwise - since you use the dev tools to inject the iframe - you don't really need the iframe. You can just run it as a "snippet" in Chromium or from the multi-line-code-editor in Firefox.

Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.


I think having this on a production website means you can crawl the web using your user's CPUs and network connections, making it harder for people to stop you from harvesting their data.


>Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.

Isn't this a solved problem in JavaScript land? Just use a compiler/minifier and your module-oriented JS code ends up in a single file as a build artifact.


Then every time you want to make a change to your code, you have to go to your original codebase, make the change, start the compiler, copy the output and paste it into your dev tools ...


> It would be much nicer if one could import modules.

Is there some reason ES modules wouldn't work here? Just a snippet that inserts a script tag with type=module.
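
Something like this (placeholder URL), assuming the page's CSP doesn't block the script's origin:

    const s = document.createElement('script');
    s.type = 'module';
    s.src = 'https://example.com/my-scraper.mjs';
    document.head.appendChild(s);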


Yes, the reason is that most sites these days serve a content security policy which only allows code from whitelisted origins.


         // (assumed setup, not shown in the quoted excerpt)
         var data = [];
         var $iframe = $('<iframe>').appendTo('body');
         $iframe.on('load', function () {
           $iframe.contents().find('.result-row').each(function () {
             data.push({
               title: $(this).find('.result-title').text(),
               img: $(this).find('img').attr("src"),
               price: $(this).find('.result-price:first').text()
             });
           });
         });
         // And everything starts running when you set the iframe's first target url
         $iframe.prop("src", "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa");

Looks like he wants output something like

         title: 
         img:
         price:
I tried reproducing this example without using JavaScript, instead using curl and sed. The output is

         image: 
         title:
         price:
I did not try to move "title:" above "image:" though I bet this could be done using the hold space. Nor did I format this as JSON though that would be easy to do.

         n=0;while true;do test $n -le 3000||break;
         curl https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa?s=$n|sed -n '
         /result-title hdrlnk/{s/.*\">/title: /;s/<.*//;/^title: /p;};
         /./{/result-meta/,/\/span/{/result-price/s/.*\">/price: /;s/<.*//;/price/p;};};
         /data-ids=\"/{s|1:[^,\">]*|https://images.craigslist.org/&_600x450.jpg|g;s/,/, /g;
         s/1://g;s/>//;s/.*data-ids=/image: /;/^image: /p;}'
         n=$((n+120));done


I've done something similar in Firefox with scratchpad. The main reason is simply convenience. I don't need to switch to a different workflow, I merely bring up scratchpad (I often already have a window open with some utility functions) and can start hacking away immediately.

Sadly scratchpad is going away soon. Fortunately the console now has a multiline mode, unfortunately it's not as convenient for this use.


Why not use something like proxycrawl? Controlling an iframe is slow and painful.


You can even inject a browser extension into Chrome with Selenium, or back Selenium with an upstream proxy. So why an iframe; what's the edge?
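
For example (a sketch; the extension path and proxy host are placeholders):

    const { Builder } = require('selenium-webdriver');
    const chrome = require('selenium-webdriver/chrome');
    const proxy = require('selenium-webdriver/proxy');

    (async () => {
      const options = new chrome.Options();
      options.addExtensions('/path/to/helper.crx');   // packed extension to inject
      const driver = await new Builder()
        .forBrowser('chrome')
        .setChromeOptions(options)
        .setProxy(proxy.manual({ http: 'proxy.example.com:8080' }))
        .build();
      await driver.get('https://example.com/');
      // ...drive the page with the extension and proxy both in play...
      await driver.quit();
    })();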


I don't understand why an iframe is a must here; why can't I just scrape the whole page directly? Still learning web scraping using scrapy.


I think the basic utility is that you keep your parent-frame JavaScript context.

Normally if you click a link with jQuery, you lose the current context after the next page loads.

Controlling the navigation inside an iframe is just more convenient.


Maybe I'm missing something obvious, but can anyone explain to me how this is better than using a tool like selenium for scraping? I guess this might be easier to quickly set up and play around with for one-off scraping?


I describe something very similar here: https://github.com/jawj/web-scraping-for-researchers


Working for a big tech company, stuff like this infuriates me.

It’s exactly why we’re currently pushing for the ability to disable developer tools, we want it added to Chrome and other browsers. I should be able to, as a web site owner, not allow any kind of developer tool usage.

Users do not own our product and have no right to go poking around like this!


You are talking about websites, not native apps. Make a native app.


Make an app then. You/your company made the choice to use web technologies and take advantage of their benefits.

Users own their computers and their browser (user agent) is for serving them. Not you.

You have no right to be telling a user's computer exactly what to do. Do that on your own servers.

The state of tracking and telemetry is insane enough already with chrome gearing up to cut the legs out from ad blocking.

Plus, even if you're lucky enough to get your wish with the browser, it doesn't affect anyone serious anyway. They will scrape outside of Chrome, as they already do and have always done.


> Users do not own our product and have no right to go poking around like this!

I feel just the same about websites and apps poking around in my stuff.


This sounds to me like « people who watch our movie shouldn’t be allowed to rewind! They don’t own the movie! »


You don't own my computer either and thus you don't get to choose what I can and can't do with it. The prospect of disabling dev tools just so your business can have it easier is laughable.


You are fighting against giants. I mean windmills.

I mean, that ship sailed long ago and your energy is best invested in something else.



