This is how the very first version of Selenium worked. The application under test was in an iframe, and the test controller was in the parent page. The Selenium "Remote Control" protocol was later added, where the controller would phone home to a listening web server for commands to relay to the iframe (basically, AJAX before it had a name). It mostly worked for the most common test cases, but we abandoned this approach for similar reasons mentioned in the article -- the edge case limitations became more and more frustrating over time. Ultimately, we merged with the WebDriver project, which was implemented in a more native way, avoiding all the limitations of automation-via-iframe.
"It mostly worked for the most common test cases, but we abandoned this approach for similar reasons mentioned in the article -- the edge case limitations became more and more frustrating over time."
The article only mentions malformed URLs and browser run-time errors. Were there any other "edge case limitations" that became intolerable?
Some websites will detect the iframe and send nothing instead of the requested page. I happen to know from personal experience that YouTube is one of the sites that does this.
Yeesh, everyone is so critical here. It's just a blog post about how somebody does occasional one-off scraping across multiple pages using browser devtools.
Yes of course injecting an iframe into a third-party site with devtools isn't going to replace Selenium. But it's a clever little hack in a pinch. No need to get upset.
Won't the same origin policy kick in the moment I try to read the content of an iframe that isn't on the same origin as my website?
Or is this meant to run in the dev console on the target website? In which case, the iframe and the Airovic website don't make sense (the Electron app mentioned does, sure, but it doesn't exist).
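(For anyone wondering what that restriction looks like in practice, here's a minimal sketch you can paste into any page's dev console; the URL is just a placeholder for some cross-origin site:)

    // Inject a cross-origin iframe from the dev console.
    const frame = document.createElement('iframe');
    frame.src = 'https://example.com/'; // placeholder: not the host page's origin
    document.body.appendChild(frame);

    frame.addEventListener('load', () => {
      try {
        // Same-origin policy: contentDocument is null for a cross-origin frame,
        // and touching contentWindow.document throws a SecurityError.
        console.log(frame.contentDocument);        // null
        console.log(frame.contentWindow.document); // throws
      } catch (err) {
        console.log('Blocked by the same-origin policy:', err.name);
      }
    });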
But the airovic.com tool uses an external script on the same server that provides the HTTP tunneling, so I can modify headers and display everything in the iframe because it is under the same domain.
I think you first visit the website, then you inject an iframe onto the page you're currently on, and then inside that iframe you can scrape any content on that website.
That's at least what it looked like from his examples.
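(Roughly, as I read it, something like this sketch -- the selectors and URLs are placeholders, and it only works while the iframe stays on the same origin as the page you injected it from:)

    // Run from the dev console of the site you want to scrape (same origin only).
    const frame = document.createElement('iframe');
    document.body.appendChild(frame);

    const results = []; // state lives in the top frame, surviving iframe navigations

    function scrapePage(url) {
      return new Promise((resolve) => {
        frame.addEventListener('load', () => {
          // Same origin, so the iframe's document is readable.
          frame.contentDocument.querySelectorAll('.result-row').forEach((row) => {
            results.push(row.textContent.trim());
          });
          resolve();
        }, { once: true });
        frame.src = url; // navigate the iframe, not the top-level page
      });
    }

    // Walk a few listing pages without ever losing the top frame's state.
    (async () => {
      for (const page of ['/search?p=1', '/search?p=2']) { // placeholder URLs
        await scrapePage(page);
      }
      console.log(results);
    })();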
Okay, the way I see this is that using headless tools like puppeteer or selenium is tedious; just trying to... er scrape my HN account's favorites (AFAIK no API) becomes a task when you have to automate login.
Just typing my credentials in and pressing the button is much easier than automating that step, which is why the iframe is useful: you can interact with the content directly (without code).
The irony is that the iframe approach is exactly how the first version of Selenium worked. It's a cool hack, but we abandoned that approach over time because automating iframes couldn't cover all automation use cases.
In the case where login was required, it's easy with tmux to automate logging in to HN with a text-only browser such as links and to save the desired text/HTML, etc. to a file. It takes me about 500 characters of script to non-interactively log in, grab some text, and log out.
Selenium allows you to load different browser profiles. So you create a profile manually, log in to the target site, and then every time you load the profile the cookies are set automatically and you're logged in. Works like a charm.
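(A minimal sketch of that pattern with selenium-webdriver for Node; the profile path and target URL are placeholders:)

    // npm install selenium-webdriver; chromedriver must be on your PATH.
    const { Builder } = require('selenium-webdriver');
    const chrome = require('selenium-webdriver/chrome');

    (async () => {
      // Point Chrome at a profile directory in which you've already logged in manually.
      const options = new chrome.Options().addArguments(
        '--user-data-dir=/home/me/scraper-profile' // placeholder path
      );

      const driver = await new Builder()
        .forBrowser('chrome')
        .setChromeOptions(options)
        .build();

      // Cookies from the saved profile are already set, so this loads logged in.
      await driver.get('https://news.ycombinator.com/');
      console.log(await driver.getTitle());

      await driver.quit();
    })();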
You can use non-headless puppeteer to manually log in (or perform any other manual actions, or use the dev console in the puppeteer-controlled browser) and only automate parts you want to automate.
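(A sketch of that workflow with puppeteer; the selectors and URLs are placeholders, and you do the login by hand in the window it opens:)

    // npm install puppeteer
    const puppeteer = require('puppeteer');

    (async () => {
      // headless: false opens a real window; userDataDir keeps the session between runs.
      const browser = await puppeteer.launch({
        headless: false,
        userDataDir: './puppeteer-profile', // placeholder
      });
      const page = await browser.newPage();
      await page.goto('https://news.ycombinator.com/login');

      // Log in by hand in the visible window, then let the script take over.
      await page.waitForSelector('a#logout', { timeout: 0 }); // placeholder selector

      await page.goto('https://news.ycombinator.com/favorites?id=yourname'); // placeholder
      const titles = await page.$$eval('.titleline a', (links) =>
        links.map((a) => a.textContent)
      );
      console.log(titles);

      await browser.close();
    })();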
This sounds hilariously n00by because it's VB and Internet Explorer, but creating an Internet Explorer instance through VB in, say, Excel and then dumping data into Excel was great because I had full control over my IE instance.
Okay, I'll stop speaking now and revealing the fact I started my career as a data guy at a giant corporation instead of a software engineer.
I can do everything listed in the benefits with puppeteer, while I can’t even make sense of what the iframe is supposed to achieve here, or how it’s even gonna load (anyone with a shred of sense would set X-Frame-Options to SAMEORIGIN, subject to exceptions). The airovic.com site doesn’t work and hilariously attempts to load two seemingly important scripts from localhost...
I’m very confused about this submission, and even more confused about how it managed to almost top the front page.
Edit: Having read the code samples, it seems the code snippets are supposed to be run from the same origin in the dev console. A quick and dirty way to interactively scrape without navigation, I guess? Still not sure what the “all together: Airovic.com” is supposed to mean, and definitely more limited than puppeteer.
Edit2: To be fair to the author, they did say
> You cannot bypass their protections without using a HTTP Tunneling component.
Which I didn’t see until just now. This is a pretty big caveat though, should probably be more upfront...
He's traversing the site using the injected iframe. That is, there is no top-level navigation event, only an iframe navigation event. Then he's gathering information from the iframe hosted DOM and combining it in the context of the starting, main page.
Hi, that's right. Some sites protect against HTML injection. Twitter protects their site very well, but if you use HTTP tunneling they can't do anything, as you can modify the X- headers.
For this kind of task I usually create a private Firefox extension, which gives me access to extended browser capabilities and the ability to lift some security-related restrictions. I run it in a sandboxed browser, much like I would do with something like Selenium or Puppeteer, but I have many more options to hand-tune the automation.
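(Roughly what that looks like as a WebExtension -- a background page with host permissions can fetch cross-origin pages that ordinary page scripts can't; the URL, selector, and permission below are placeholders:)

    // background.js of a private WebExtension (a background page, not an MV3 service worker).
    // The manifest needs a host permission for the target site, e.g.
    //   "permissions": ["https://example.com/*"]
    // so this fetch is not subject to page-level cross-origin restrictions.
    async function scrape() {
      const res = await fetch('https://example.com/some/listing'); // placeholder URL
      const html = await res.text();

      // Parse the fetched page without navigating anywhere.
      const doc = new DOMParser().parseFromString(html, 'text/html');
      const titles = [...doc.querySelectorAll('.result-title')] // placeholder selector
        .map((el) => el.textContent.trim());

      console.log(titles);
    }

    scrape();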
This was useful for a brief period when I ran a news aggregator that used iFrames to display content from other news websites. Adding the sandbox attribute prevented scripts, ads, modals, etc.
For the purpose of scraping, unless you're always on the same domain (or running a proxy to add CORS), I don't see how an iFrame is better than either a web extension or a backend script using Puppeteer.
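(For reference, the sandbox attribute mentioned above -- an empty value applies every restriction, and you add back only what you need:)

    // Recreating the sandboxed embed described above: an empty sandbox value
    // blocks scripts, popups, form submission, plugins, and so on.
    const frame = document.createElement('iframe');
    frame.src = 'https://other-news-site.example/article'; // placeholder URL
    frame.setAttribute('sandbox', ''); // all restrictions on
    // frame.setAttribute('sandbox', 'allow-same-origin'); // re-enable pieces selectively
    document.body.appendChild(frame);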
Is the only reason for the iframe so that it is possible to keep a state in the top frame while loading different pages?
Because otherwise - since you use the dev tools to inject the iframe - you don't really need the iframe. You can just run it as a "snippet" in Chromium or from the multi-line-code-editor in Firefox.
Both have the problem that it all has to be a single file. It would be much nicer if one could import modules.
I think having this on a production website means you can crawl the web using your users' CPUs and network connections, making it harder for people to stop you from harvesting their data.
Then every time you want to make a change to your code, you have to go to your original codebase, make the change, start the compiler, copy the output and paste it into your dev tools ...
    // $iframe is the injected iframe (wrapped in jQuery) created earlier in the snippet.
    var data = [];

    $iframe.on("load", function () {
      $iframe.contents().find('.result-row').each(function () {
        data.push({
          title: $(this).find('.result-title').text(),
          img: $(this).find('img').attr("src"),
          price: $(this).find('.result-price:first').text()
        });
      });
    });

    // And everything starts running when you set the first iframe's target url
    $iframe.prop("src", "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa");
Looks like he wants output something like
title:
img:
price:
I tried reproducing this example without using Javascript, instead using curl and sed. The output is
image:
title:
price:
I did not try to move "title:" above "image:" though I bet this could be done using the hold space.
Nor did I format this as JSON though that would be easy to do.
I've done something similar in Firefox with scratchpad. The main reason is simply convenience. I don't need to switch to a different workflow, I merely bring up scratchpad (I often already have a window open with some utility functions) and can start hacking away immediately.
Sadly scratchpad is going away soon. Fortunately the console now has a multiline mode, unfortunately it's not as convenient for this use.
Maybe I'm missing something obvious, but can anyone explain to me how this is better than using a tool like selenium for scraping? I guess this might be easier to quickly setup and play around with for one-off scraping?
Working for a big tech company, stuff like this infuriates me.
It’s exactly why we’re currently pushing for the ability to disable developer tools; we want it added to Chrome and other browsers. I should be able to, as a web site owner, not allow any kind of developer tool usage.
Users do not own our product and have no right to go poking around like this!
Make an app then. You/your company made the choice to use web technologies and take advantage of their benefits.
Users own their computers and their browser (user agent) is for serving them. Not you.
You have no right to be telling a users computer exactly what to do. Do that on your own servers.
The state of tracking and telemetry is insane enough already, with Chrome gearing up to cut the legs out from under ad blocking.
Plus, even if you're lucky enough to have your wishes with the browser, it doesn't affect anyone serious anyway. They will scrape outside of chrome, as they already do and have always done.
You don't own my computer either and thus you don't get to choose what I can and can't do with it. The prospect of disabling dev tools just so your business can have it easier is laughable.