Creating an open-source solution to the headaches of headless browsers (sourcesort.com)
252 points by mrskitch on Nov 25, 2019 | 98 comments



(Author here): lots of comments about why would someone pay for this. I think the answer is simple if you step out of your developer shoes.

There's a lot of complexity in managing and even building the thing from the start -- and then you have to support it. If you're working in a large org then there's a chance that you can just DIY it, however for small-medium businesses this isn't practical and it's a waste of time (their most precious resource).

I like to think of it as a managed database. Sure, you can freely download postgres in a container and you're up and going, however there are a lot more costs to it than just that. Having a fully managed database saves you enough time and other intangibles that it can be worth the cost. It just depends on your circumstances.


When I worked for a large tech company I spent a few months building a similar service for internal use. I can confirm, managing all of those instances and making it reliable can be a huge headache. I really wish using browserless was an option for us


[flagged]


It's probably those same engineers that use Dropbox instead of getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem.


Quality comment


The headache doesn't come from building out this 'simple' bare bones solution. It comes from making it play well with the larger ecosystem you're integrating it with, and committing to supporting it for other engineers. All of a sudden, your job description starts to change and you're working well out of your scope.

These managed solutions can provide value for large companies, especially ones that fall into a DIY trap.


You should totally rewrite this and open it under a license you like and post back here. Make sure it has a 1-click deploy to multi-cloud though - needs to be easy for anyone to launch it wherever they might be working.


They probably write stuff that they can't buy. Engineers cost money.


> what do engineers even do these days?

Solve problems with computers.


Heads up that the article's first link to your site is broken

The link is showing up as https://sourcesort.com/interview/browserless.io instead of https://browserless.io


Single founder/developer here and paying for browserless. I really don't have the time to spin up something similar myself.


Since you're reading this, a feature request: I would love it if you could put up a REST endpoint for extracting all the images (example code here [1]) on a web page, and more endpoints for extracting all the links, script addresses, etc.

I was trying to do that on Browserless but couldn't get the final file download to work (I adapted the Stack Overflow code linked below to put all the web page's images into a ZIP file and download that) - presently I'm running this on a Google Cloud Function, which is working but I'd rather outsource it to you, especially since the function chokes on large web pages (possibly it needs more RAM than the 2GB limit currently available in GCF?).

[1] https://stackoverflow.com/a/52542490


Just to follow up, you can use our /scrape API to do this:

  curl -X POST \
    https://chrome.browserless.io/scrape \
    -H 'Content-Type: application/json' \
    -d '{ "url": "https://reddit.com/", "elements": [{ "selector": "img" }] }'

This will get all the <img> tags on a page and return their attributes (which includes their sources). If you wanted to do scripts as well, just add a new object to the elements array with the "selector" of "script".
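
For anyone who would rather script this than use curl, here is a minimal Python sketch of the same request with both selectors ("img" and "script") in the elements array. The helper name build_scrape_request is made up for illustration, the send step is left commented out because it needs network access (and your account may need a token), and the response format is not shown since the comment above doesn't spell it out.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
SCRAPE_URL = "https://chrome.browserless.io/scrape"


def build_scrape_request(page_url, selectors):
    """Build (but do not send) a POST request for the /scrape endpoint.

    Hypothetical helper: one {"selector": ...} object per CSS selector.
    """
    payload = {
        "url": page_url,
        "elements": [{"selector": selector} for selector in selectors],
    }
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request("https://reddit.com/", ["img", "script"])

# To actually send it (requires network access):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```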


Noted ;)


Haven't read the article but I used to develop a kiosk application that used headless Chrome and holy fuck is that a complicated endeavor. It wouldn't be if browsers were content to simply be browsers, but Chrome has so many stupid value-adds and telemetry points that care+feeding of the Chrome (later Chromium) instances took up a substantial part of my job. At one point we toyed with a simple pyGTK app that loaded a webview but there were issues with that as well. The browser space really needs to return to its roots.


Am I reading your website wrong, or is your hosted solution cheaper than the on-prem license (360/yr vs 500)? I would think it would be much more expensive to run it for them, right?


Nope, you've got it right. We're increasing prices next, however your sentiment is still true. Once we have the enterprise-flavor out the inverse will be true.


Interesting. Unrelated question, is it easier to start a side business since you work for a remote company? (Had to do some stalking to find this out.) I imagine you might have more time to do these kinds of things because you presumably don't have to deal with a commute.


Commute was nearly 3 hours for me back when I had to do it. I'd say being remote certainly can help, but has its own drawbacks as well.

For instance, if you're already programming and working 8+ hours a day in your home, the last thing you want to do is more of it. One of the biggest things you'll want to do is get out more often since you're home a lot, and staying home to work on a side-project doesn't sound awfully appealing.


Enterprise tax. On-prem licenses are going to be used by large enterprises, who are more challenging to do business with, so you need to charge them more.


On-prem would have a greater support burden.


A solution to run non-headless Chrome browsers in the cloud would be welcome. We used Sikuli and now UI.Vision RPA for test automation, and have to manage the servers for it ourselves. If I could outsource this to a service like browserless, it would be convenient. (Applitools was an option I looked at)

The web app we are testing right now (SaaS for architects) makes heavy use of canvas elements and can't be tested headless.



So how do you test such applications?


Congrats on the product!

Just as a bit of feedback, on the linked site you write:

>Browserless is simply a tiny web-server that “productionalizes” all the stuff about headless browsers and their automation capabilities.

I find that both hard to parse (productionalizes? wat) and saying nothing. How about "a tiny web-server that helps you use headless browsers without the hassle of setting them up on your environment"


++ getting the wordage on _what_ it does and why you'd use it can be tough.


At two of my previous jobs this would have been a really great resource, and I can immediately think of a few hobby projects I've been thinking about that would be much easier with this. You're dead on too when you talk about the time investment being a big hurdle for companies- this is a great product and I'm excited that it's also open source!


Great work, mate! I can totally see why someone (e.g. a company) would pay for this. I've also featured Browserless on SaaSHub https://www.saashub.com/featured. Hopefully it's helpful to more people.


I've definitely come across your site multiple times, and I can say with absolute certainty it's helpful!


I looked closely at Browserless a while ago; for a 15-dev company it would be a good tool. I would have recommended it, but circumstances (particularly a lot of effort put into a container setup that could already run these kinds of tests) meant it made sense to keep everything working the same way in our k8s. But definitely this is a good offering for small businesses.

The only downside to using a lot of IaaS type stuff like this is that almost every month something goes down - Github or Azure or Appveyor etc, and this is another thing that could go down, even for a few minutes, and basically the CI is stuffed, or pingdom goes haywire. But this is a more general problem with cloudifying all the things, not with this particular service.


The startup I worked for a few years ago was an early customer of Browserless (one of the first maybe?). We had a large crawling system we were developing and used Browserless as a time saving way to scale up interfacing with headless chrome. Joel worked with us directly to address issues and bugs and freed us up to focus on our crawlers and not on managing distributed headless chrome.


Hey! Great to see you here! Were you with knotch?


Yup! I left about a year and a half ago. Congrats on the success of Browserless.


Obligatory link to the famous HN Dropbox comment: https://news.ycombinator.com/item?id=9224

When thinking about the viability of a business, developers commonly make the mistake of assuming that nobody would pay for something that's possible for them to code up themselves.

Sure, many of us write code all day and love it. But most people have other responsibilities, value their time highly, and (correctly) prefer to pay reasonable amounts of money so they can spend their time doing more important or profitable things.


I find this post kind of a snake-oil sell. It begins by explaining the problem it wants to solve (lack of a modern text/console based browser) and then goes straight into a very long history of why he did it (that's not at all useful for people wanting a tl;dr summary). The first thing I miss is how this thing compares to brow.sh and why I should pay for this instead of deploying my own brow.sh instance.


What's with the snarky comments? The guy built a successful business with happy customers. He didn't make it in 2% of his time; he built it over the years with experience from other projects and solved an issue people are willing to pay money for. Maintenance might take little time now, but that doesn't mean it's easy to solve a problem and consistently ship the solution. I honestly think most of you have a misunderstanding about serviced software and the demand for it. The fact that you can do it with a Docker script and some time doesn't mean it works properly for all use cases or that everyone can and should manage that infrastructure themselves. Some people happily pay for it so they don't have to worry whether their instance is still running and can call support instead of firing up a terminal and looking through obscure Stack Overflow answers. He even open sourced his entire business. Instead of throwing apples, try to learn something or at least appreciate the effort.


I felt similar, thanks for echoing the sentiment.


I recently learned (by trial and error) some of the headaches associated with running headless browsers at scale that Joel mentions here; wish I'd heard of this service earlier. I ended up finding other solutions to fill in the gaps: Puppeteer Cluster is one I'd recommend (https://github.com/thomasdondorf/puppeteer-cluster)

I especially like the "host it yourself" commercial license model, here; while automating browser _actions_ over a network works well enough, _detailed scraping_ over a network can quickly become inefficient (as many requests for elements or element attributes may incur individual round-trips). In some cases, colocating your browser instance with your scraping logic becomes a necessity.
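
To put rough numbers on the round-trip point, here is a back-of-the-envelope sketch. All the figures (500 elements, 3 attributes each, 40 ms per round-trip) are made up for illustration, and the function name is hypothetical; "batched" just means fetching everything in a single call instead of one call per attribute.

```python
def scrape_wait_seconds(elements, attrs_per_element, round_trip_ms, batched):
    """Estimate time spent waiting on the network while reading attributes
    from a browser on the other side of a network link."""
    # One round-trip in total if batched, otherwise one per attribute read.
    round_trips = 1 if batched else elements * attrs_per_element
    return round_trips * round_trip_ms / 1000


# Hypothetical workload: 500 elements, 3 attributes each, 40 ms round-trip.
one_call_per_attribute = scrape_wait_seconds(500, 3, 40, batched=False)
single_batched_call = scrape_wait_seconds(500, 3, 40, batched=True)

print(one_call_per_attribute, single_batched_call)
```

With those assumed numbers the unbatched version spends a full minute just waiting on the network, which is why colocating the browser with the scraping logic (or batching the reads) matters.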


We hear about puppeteer-cluster _a lot_, and we hear the same thing from folks (that it's great). browserless.io essentially does "clustering" at an infrastructure level, whereas puppeteer-cluster does it at the application level.

Both essentially solve the same problem, just in different ways.


I'm confused about the license[1]. It seems to not be actually open source. The Open Source Definition says[2]:

> The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

But this seems to be doing that exact restriction.

Additionally the license seems like it contains a loophole:

>If you are creating an open source application under a license compatible with the GNU GPL license v3, you may use browserless under the terms of the GPLv3.

If I make an open source application, I can use browserless under the terms of the GPLv3. That means I can redistribute browserless under the GPLv3. That means people can take the browserless code I redistribute and use that for commercial products (as long as they don't distribute a non-GPLv3 binary form of the commercial products containing browserless, because that would break the GPLv3).

[1] https://github.com/browserless/chrome#licensing

[2] https://opensource.org/osd#fields-of-endeavor


Just checked the GitHub license page (https://github.com/browserless/chrome/blob/master/LICENSE.md).

> This work is dual-licensed under GPL-3.0 OR the browserless commercial license. You can choose between one of them if you use this work.

So it's clearly GPLv3 (no loophole required), which AFAIK does allow closed source proprietary use within a company so long as the program isn't redistributed externally (perhaps the developer didn't understand that?). It seems that the licensing section in the readme should have its wording adjusted somewhat.

In fact, I think you're even in the clear to run a proprietary cloud service using GPLv3 code which is why the AGPL (among others) exists. Some recent drama (https://techcrunch.com/2019/05/30/lack-of-leadership-in-open...) for reference.

(Oddly, the header underneath that states "GPL-3.0-or-later" which is a bit inconsistent.)


I wrote something like this a couple months ago and thought about selling it. I ultimately decided that the price where you make it cost-prohibitive to mine cryptocurrency is too high for someone that just wants to render PDFs without dealing with the burstiness of running several copies of Chrome on their production infrastructure. I was also concerned about the underlying browser changing how things render when upgraded; I didn't want to run an outdated browser, but I also didn't want to tell users "hey we updated Chrome, better check the output of your batch job and make an emergency fix to your HTML".

How are you dealing with these issues?


How often do browsers break backwards compatibility? I don't think I can recall a single time I've had working code break due to a browser upgrade, with the exception of non-standard features (which is maybe what you're referring to, but then it's a known risk and you should already be aware of it).

Edit: And if a client really does need version X of Chrome, you could give them the ability to pay extra to pin the version indefinitely.


I work on Lighthouse (user of the Chrome DevTools protocol) and work with the DevTools team (obviously the main user of the protocol) - and when I first saw browserless I was blown away. So cool! Good job with your success.

What was the hardest part re: working with the protocol?


The protocol is pretty easy, I think coordinating the necessary “enable” calls is a bit cumbersome. Also the legacy JSON protocol is harder to support, but I understand why.

Hardest part is debugging crash issues and why they happened. You either just get a generic “Page crashed!” error (which I think is Puppeteer's handler message), or “browser disconnected!”. That and Chrome's logs are just crazy noisy, which I haven’t gotten a lot out of.

Those are probably the biggest, thanks for asking!
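
For readers who haven't touched the protocol directly, here is a small sketch of what the "enable" coordination looks like on the wire. It only builds the JSON messages and doesn't open a WebSocket; the command helper is hypothetical, and which domains you need to enable depends on which events you want, so treat the Page/Network/Runtime list as an example rather than a required set.

```python
import itertools
import json

# Each DevTools protocol command is a JSON message carrying a unique id,
# which the browser echoes back in the matching response.
_message_ids = itertools.count(1)


def command(method, params=None):
    """Hypothetical helper: build one protocol command message."""
    return {"id": next(_message_ids), "method": method, "params": params or {}}


# Events for a domain are only delivered after that domain is enabled,
# so the enable calls have to go out before the work that relies on them.
setup = [command(f"{domain}.enable") for domain in ("Page", "Network", "Runtime")]
navigate = command("Page.navigate", {"url": "https://example.com"})

for message in setup + [navigate]:
    print(json.dumps(message))
```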


> There's always been this thought that I've had, that advertising and "paid" attention is really in no one's best interest. You're likely to get users who really aren't going to get any value out of your software, so your churn increases, and you've also just paid for that user that's churned. These things get harder to tease out since it's almost impossible to ask "show me all the churned users this period acquired from advertising channels." Maybe that's possible, but you'd have to do a lot wrangling together to get it all working.

Correct me if I’m wrong, but isn’t precisely that kind of analytics simply table stakes for any modern crm/marketing/customer intelligence suite in 2019? It seems like that is absolutely a solved problem.


Yeah, it's a pretty contrived example on my part. My sentiment here is that there's so many inputs to modeling behavior, and trying to find signal in the noise, that at this scale your time is likely spent better elsewhere. Unless you have the revenue stream to do it and do it well, then the effort can be a time sink


How many variables do you need to track, though? $X spend, Y signups at $Z each, AA% churn, $BB retained MRR for an estimated $CC CLV based on an $X/Y*(1-AA%) CAC - why does it need to be much more complicated than that when you don’t have millions of users?

(Seriously though, I’m asking, not poking fun. You probably know more about this stuff than I do, having actually done it. It seems really simple to figure, to me. What am I missing?)
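
One way to read the formula above is "spend divided by the signups that stick around". Here is that reading as a tiny sketch; the function name and all the numbers ($900 spend, 40 signups, 25% churn) are invented for illustration, and other definitions of CAC are equally reasonable.

```python
def cac_per_retained_customer(spend, signups, churn_rate):
    """Acquisition cost per customer who did not churn.

    Interprets the comment's X/Y*(1-AA%) as X / (Y * (1 - AA)).
    """
    retained = signups * (1 - churn_rate)
    return spend / retained


# Hypothetical campaign: $900 spend, 40 signups, 25% of them churn.
cac = cac_per_retained_customer(spend=900, signups=40, churn_rate=0.25)
print(cac)
```

The arithmetic itself is this simple; the hard part is getting trustworthy inputs to feed it.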


It's more a matter of being messy rather than complicated. You're trying to track users who often visit your site multiple times before purchasing, coming from different sources, on different devices, and somehow tie that attribution to the purchase. None of the data inputs are consistent nor tied together. Analytics tells you one thing, your ad platform another, your payment processor yet another, and then there's your email marketing tool and your CRM. (And the serious tools for data monitoring and reporting have moved to focusing on enterprises...) You have to somehow factor in refunds, free trials, prorated billing, early cancellations. You have multiple ad campaigns running ad variations. Don't forget a/b testing.

Finally you see some data point that hints something might be working, but you know you have to account for all the other factors involved. Did I make any website edits that day? Did the ad network change their algorithm slightly? Was there a holiday affecting traffic? When did I insert that new ad again? Wait, I know I changed my ad bid at some point... Did I get an influx of traffic from another source? Was it just a fluke?

If you want good, real data, it's messy. And far from a solved problem.


Wow, you said this so much better than I could, thanks for chiming in


Stripe honestly does a lot of this for you. You can hack all that together with what you've got, the problem is that there's some seasonality to business as well as other market effects that make it tough to determine _why_ something is trending one way or the other.

For instance: we had a week last year where we had a flood of cancellations at once and there was nothing I could attribute it to. Looking back it was just a coincidence, however it consumed a lot of my time (writing emails and looking at analytics) to figure out why instead of just pushing ahead.

I was likely over-reacting, but I have noticed there's a lot of people spending a lot of time doing analytics and doing research on trends instead of just executing. And, especially early on, you should just be executing and not thinking too much about trends.


Does this handle something like Distil [0]? Or is that type of scraping not the focus of this product?

[0] https://www.distilnetworks.com/block-bot-detection/


No, it won't. However, Distil is not hard to work around if you automate a real browser in headful mode.


Could you point me to some reasonably straightforward ways to do that? Thanks!


I don't think there's any HOWTO posted online; I just worked it out by trial and error.

Use a real version of Chrome (not Chromium) and headful mode. Mask the navigator.webdriver property. Pace your requests and take care to use "good" IP addresses.

Keep in mind that as soon as Distil sees something obviously automated (like a headless browser) the source IP address is "burned" for some number of days.


Thanks!


I second this!


Looks interesting, however I can't view your webpage - I am getting this error: "The character encoding of the plain text document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the file needs to be declared in the transfer protocol or file needs to use a byte order mark as an encoding signature."


Interesting, I’ll take a look and see if our markup is encoded improperly.


Congrats on the continued success. It can stun people when they are reminded that the skills they sell to others can be applied for their own prosperity and I think you see that here.

If you’re a developer with a day job there has never been a better time to get started building and selling your own software.

It’s not glamorous but it is rewarding.


I spent ~4 hours on my 10th wedding anniversary debugging a production issue. It's not a fun thing to talk about, and doesn't get a lot of attention, but the truth of the matter is that when things are bad _they are bad_. I can see now why folks say that this isn't for everyone.


For a similar headless Chrome project launched around the same time, but with a price-per-api-request model, see https://www.prerender.cloud/ (PDFs, screenshots, pre-rendering). MRR is about the same.


Rendora is true FOSS, free and self-hosted with very lightweight usage of resources.

https://github.com/rendora/rendora


Last commit 12 months ago.

https://rendora.co/ Seems to be gone.


Probably worth mentioning that's $24k MRR, not how much it costs...


Still impressive for a 1-man show who says he spends just 1% of his time on it.


Editor here, my apologies. I did of course mean to communicate MRR and didn't really notice how the title might be confusing when I wrote it but I see it now. I'll make it clearer.


I'm using chrome-aws-lambda on Lambda and it works like a dream. Luckily for my use case I don't need images, fonts, etc.

There's also GCF for those on Google Cloud. I have used Browserless' trial and felt like the 2+ GB instances were kind of expensive because they require reservation, unlike Lambda where you get 400,000 GB-seconds and 1M requests per month for free.
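
To make the GB-seconds figure concrete, here is a quick sketch that converts the 400,000 GB-seconds quoted above into run time at a given memory size. It assumes the allowance is simply memory multiplied by duration and ignores the request-count limit; the function name is made up.

```python
FREE_GB_SECONDS = 400_000  # monthly free allowance quoted in the comment above


def free_run_seconds(memory_gb):
    """Seconds of execution covered by the free allowance at this memory size."""
    return FREE_GB_SECONDS / memory_gb


seconds_at_2gb = free_run_seconds(2)
hours_at_2gb = seconds_at_2gb / 3600

print(seconds_at_2gb, round(hours_at_2gb, 1))
```

So a 2 GB function gets on the order of 55 hours of run time a month before billing starts, under those assumptions.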


"How cool would it be if you could just fire up your browser, do the work you want it to, and press a button and now it just magically does that someplace for you without ever having to write code?"

Like, recording macros has been a thing forever, but how are you going to magically generalize them, without better-than-human AGI?


Thanks again for the questions. Please do email me if I happen to overlook anything: joel at browserless dot io


How is this different from apify (https://apify.com/)? apify seems to be able to do what browserless does and is also open source, which means you can self-host it freely.


Solo founder: I built a tool and it solves problems for people to the tune of $24k/month

Almost everyone: That's great! Well done.

HN commenters: Pffft.


That's not even close to a fair summary of this thread.


(there's a lot of pffft here because so many have been trying to do this with $0k MRR and no MVP)


That's great! Well done.


[flagged]


I saw two of your comments on this post. Regardless of your point, both of them are rude and do the opposite of what you want. Your comment history has plenty of other snarky, rude and unhelpful comments.

You are somewhat new so maybe you don't know that it's not really the way people communicate around here. Please check out the guidelines in the footer of the website.


It’s sorta weird that you feel the intensity of emotion about this.

Maybe examine your feelings, is it jealousy or envy or something?


How is GPL3 a “fake FOSS” project?


There is no such thing in either the GPL or FOSS as being GPL for noncommercial uses while at the same time invalidating the license for commercial or closed uses. This is outright gaslighting, not to mention that the entire project is literally a big nothing burger like I said before; it's even less substantial than paid "get Geo/IP" APIs, for instance, which need at least some operations effort despite becoming increasingly trivial in the serverless age.


What $288k/yr headache is there around `docker pull buildkite/puppeteer`?

https://hub.docker.com/r/buildkite/puppeteer


Sure, but there's a lot of other things you're looking over:

- We support selenium with the same version of Chrome as puppeteer's. Everything is versioned together.

- Queueing/Concurrency and notifications. Arguably done in other efforts, but works.

- Numerous other APIs built on top of puppeteer to do core cases. No need to write your own integration.

- Monitoring and more.

People do pay for this in order to not think about the burden of supporting it. Managed databases are a thing, even though they are freely available to download and run.


I'm a browserless user, and the headache for me was cost. You're right that it's dirt-simple to spin up a puppeteer service in docker, and that's what I had been doing previously. But, for my usage, I found it was cheaper to pay browserless than to run my own EC2 instances. Granted, I probably account for a very, very tiny fraction of that $24k/mo :)


I dove into the comments to ask the same question. I just did something similar to this with a few lines in my docker-compose.yml, which eventually turned into a few lines in a Helm chart. Why would I pay someone money for something as trivial as this?

Now, I won’t write another rant on the subject of Eternal September, and how our field is increasingly populated by charlatans who can’t do simple things on their own. But yes: that’s exactly how I feel.


Disclaimer: Neither I nor my colleagues nor anyone I'm aware of uses, has used, or ever will use said service.

Did you divide (time that took you to complete this trivial task + time to document it, version it, explain it to coworkers, maintain it next time chrome decides to break something) * (your $/h cost) / 0.00008 before doing that?


Here's a build vs buy calculator to save you from doing the maths: https://baremetrics.com/startup-calculator


This is a dangerous calculator. It misses the cost of deployment, integration, internal support, etc.

(This can be significantly larger for the purchased solution than an internal one as well)


Yep. Puppeteer is the state of the art wrt headless browsers. There's money to be made selling B2B wrappers of OS software though. Not to trivialize the author's work...


People who would rather pay $30/mo to have someone manage it than run it themselves.


That's a good deal if it takes more than about 10 minutes of my time a month to run it myself. Wanna venture a guess on how many years to ROI on time put into initial setup?

And fwiw I do run my own puppeteer cluster because it's economically advantageous at our scale.
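
The "$30/mo vs. about 10 minutes a month" comparison implies a particular hourly rate, which is easy to check. A small sketch, with hypothetical function names; plug in your own rate:

```python
def implied_hourly_rate(monthly_price, break_even_minutes):
    """Hourly rate at which this many minutes of work costs the monthly price."""
    return monthly_price * 60 / break_even_minutes


def break_even_minutes(monthly_price, hourly_rate):
    """Minutes of your own time per month that the monthly price would buy."""
    return monthly_price * 60 / hourly_rate


# Figures from the comments above: $30/mo and a 10-minute break-even.
rate = implied_hourly_rate(30, 10)

# Hypothetical alternative: at $60/hour the break-even moves to 30 minutes.
minutes_at_60_per_hour = break_even_minutes(30, 60)

print(rate, minutes_at_60_per_hour)
```

In other words the 10-minute figure assumes time valued at $180/hour; initial setup time and scale (as the comment notes) shift the answer either way.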



Yea, I’m thinking about making a company named after some kind of low hanging fruit and just building out a bunch of these trivial little use cases that keep popping up and putting them all behind a subscription model.

You’ll probably see me with an article in 6 months about how I’m making $150k/mo for about 2% of my time.


I have set a calendar item for 6 months from now. Looking forward to your blog post.


Feeling the pressure now.


Or you're just gonna leave a snarky comment, and others gonna build those services and get the money...


Scripting headless browsers for testing is an antipattern and should literally be avoided as much as humanly possible


Care to elaborate?


I have never had a good experience incorporating a headless browser test into a test suite. In literally every case it added so much complexity, suite run time and uncertainty that I realized it was better to just do the test via unit tests (which test the logic directly) and/or integration tests (which test the HTTP output of the controllers) if it was at all possible to rework the logic to operate in that fashion.

1) Increased load times and test run times due to browser complexity and memory consumption

2) Impossible to run concurrently without additional instances, each of which takes up massive memory

3) Tests are slow and often nondeterministic (literally THE WORST property a test suite can have), with many cases where things like "sleep()" delays are put in to circumvent some opaque browser latency issue, which is just gross

4) Even after suffering all of the above, you're still only testing ONE engine (say, WebKit) instead of all of the popular ones (Blink (Chrome), Gecko (Firefox), EdgeHTML (eh, Blink, I guess, now?), etc)

I did not enumerate all the disadvantages, but these should be enough to support my position. The number of browser driver driven tests in your test suite should be as close to "zero" as possible.

Does this discourage the use of SPAs? A lot, yes. But when necessary, I manage to do a separate frontend JS test suite via jsdom which does not require firing up a headless browser, and my build process runs both the frontend and backend test suites and only deploys if they both pass.
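
On point 3, for the browser tests that can't be avoided, the usual alternative to a fixed "sleep()" is to poll for a condition with a timeout. This is a generic sketch, not tied to any browser driver; wait_until is a hypothetical helper and the "page" here is simulated with a timer.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll condition() until it is truthy or timeout seconds have passed.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a slow page: it becomes "ready" about 0.2 seconds after start.
start = time.monotonic()
page_ready = lambda: time.monotonic() - start >= 0.2

became_ready = wait_until(page_ready, timeout=2.0)
print(became_ready)
```

This returns as soon as the condition holds instead of always paying the worst-case delay, though it doesn't remove the underlying nondeterminism.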


Thanks! That's interesting, because at work we're struggling with browser-specific regressions and were looking at headless browser testing to help solve that. I agree with all the drawbacks you listed (except #4, since some headless browser solutions let you use multiple browsers), but unit tests don't do anything to help with differences in browsers. Do you just do manual QA and hope for the best? Or is this not as big an issue for you as it is for our company? (We still have to support IE 11, so that's where the majority of browser-specific issues manifest.)


The “surface” of my sites is usually small enough to just visual-check manually. I get that it might be necessary in some cases - note that I did say “as close to zero as possible” and not simply “zero”. My criticism is that it is just such a Rube Goldberg-esque approach to testing something that its use should be minimized. It would be great if all browsers had to pass some standardized spec before being considered viable; the lack of one is the source of the cross-browser nondeterminism IMHO.



