Show HN: Ready-to-use API to convert any web page to PDF using headless Chrome (apify.com)
129 points by jancurn on Nov 2, 2017 | 64 comments



Saving a webpage to a PDF is literally one command line away:

chromium --headless --disable-gpu --print-to-pdf=google.pdf http://google.com/

What does Apify add in this case?


You can call the API from anywhere: from resource-constrained servers, from Docker containers that cannot run headless Chrome, from JavaScript on a website, etc. Also, we'll keep adding new features to this act to make it worth using, e.g. retries on failures, posting the file to some URL, etc.


This is exactly what I do, bundled into a nice function in my .zshrc:

    # usage: chromepdf output.pdf https://example.com/
    chromepdf() {
        chrome --headless --disable-gpu --print-to-pdf="$1" "$2"
    }


Is there any way to make Chrome wait until the page loads?


Looks like it prints automatically once Page.loadEventFired is triggered.

Alternatively, you can run Chrome headless with the remote debugging API (--remote-debugging-port=9222) and send a Page.printToPDF (https://chromedevtools.github.io/devtools-protocol/tot/Page/...) after some delay.
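
A rough sketch of that approach using the chrome-remote-interface NPM client (not necessarily what Apify does; the package choice and the fixed 2-second delay are just assumptions for illustration):

    import CDP from 'chrome-remote-interface';   // connects to chrome started with --remote-debugging-port=9222
    import {writeFileSync} from 'fs';

    (async () => {
        const client = await CDP();
        const {Page} = client;
        await Page.enable();
        await Page.navigate({url: 'https://www.google.com/'});
        await Page.loadEventFired();                                  // wait for the load event
        await new Promise((resolve) => setTimeout(resolve, 2000));    // extra delay for late JS rendering
        const {data} = await Page.printToPDF();                       // base64-encoded PDF
        writeFileSync('google.pdf', Buffer.from(data, 'base64'));
        await client.close();
    })();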


What do you use this for?


Save a webpage as a PDF


I did not mean to sound negative, but what is the use case of saving web pages as PDFs? I understand building the functionality into something else, but here it sounds like you manually type/paste in URLs on a regular basis.

Edit: I see now that I replied to the wrong comment. It was meant for the person who made an alias for it.


Also, if you can't access this command for whatever reason, another option is to open the print dialog in Chrome and set the destination to "Save as PDF", and it will work. You'll even get to see a preview. It's very useful for one-off saves where you want to consume a really long post offline in a PDF viewer.


Sure, but the idea is to do it programmatically.


Thanks, I didn't know that was a thing. I was checking out wkhtmltox earlier this week as well.

Looks like there are a lot of options available for this. I suspect Apify's using one of them.


Actually we're simply using Puppeteer - see the source code at the bottom of https://www.apify.com/jancurn/url-to-pdf


Any special reason why you use this over Chrome?


Puppeteer is a scripting library for Chrome, built by the Chrome team.
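
For reference, printing a page with it takes only a few lines (a minimal sketch assuming a reasonably recent Puppeteer release; option names have shifted slightly between versions):

    import puppeteer from 'puppeteer';

    (async () => {
        const browser = await puppeteer.launch();    // bundles and launches its own headless Chromium
        const page = await browser.newPage();
        await page.goto('https://news.ycombinator.com/', {waitUntil: 'networkidle0'});
        await page.pdf({path: 'page.pdf', format: 'A4', printBackground: true});
        await browser.close();
    })();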


There's also wkhtmltopdf, which more or less does the same thing using WebKit.


Are there any downsides to me using this to build my own archiver?


Many projects are already using this for archiving purposes. Check out https://github.com/pirate/bookmark-archiver

It's robust because the modern web is pretty much built for Chrome, although it can be resource-intensive if you're archiving many sites.


Converting to PDF is lossy; I wouldn't use this for archiving purposes. Normally I use Webrecorder for archiving.


Is there a way to add a delay to let JS render?


Now there is - I've just added the "sleepMillis" input option.


I've been working on something similar: https://www.prerender.cloud/docs/api

  // URL to screenshot
  service.prerender.cloud/screenshot/https://www.google.com/

  // URL to pdf
  service.prerender.cloud/pdf/https://www.google.com/

  // URL to html (prerender)
  service.prerender.cloud/https://www.google.com/


Heads up that I'm getting a "Too many requests for the month, sign up for an account at https://www.prerender.cloud/" when trying to go to any of those links.


Thank you - I had an overaggressive rate limiter for non-auth'd accounts; it's _improved_ now.


Nope. Still getting - Too many requests for the month, sign up for an account at https://www.prerender.cloud/


There are quite a few of these headless-Chrome-as-a-service offerings out there already; this is the third time I've seen one posted on HN.


By the way, is there an opposite service that converts PDFs into plain HTML for reading? I know about https://www.arxiv-vanity.com/papers/ but it only works on arXiv PDFs.



There's another Apify act that extracts text from a PDF using the pdf-text-extract NPM package - see https://www.apify.com/juansgaitan/pdf-scraping. If there's any library or tool that can convert PDF to HTML, it will only take a few minutes to set up such an API on Apify.


Here we go - another Apify act that uses pdf2htmlEX to convert PDF to HTML:

https://www.apify.com/jancurn/pdf-to-html


Off-topic, but Apify as a service looks really good. I was spinning up a dedicated VM on AWS with Docker installed only to get a simple web scraper running. Apify solves this elegantly and removes a significant pain point in my workflow.


Any info how this compares to commercial html to pdf renderers like PrinceXML?


In my experience, Prince is great for static HTML + CSS rendering, but its JavaScript engine is pretty lackluster -- I couldn't get it to work with rendering React components, for example. So it depends a lot on your use case and if you can server-side render everything. It's also pretty pricey[1] -- not that I mind paying for quality software but that sticker could be prohibitive for a lot of folks.

[1] https://www.princexml.com/purchase


I'm a developer at https://docraptor.com. We're an official Prince partner with a SaaS pricing model, but we've got a separate JavaScript engine for that very reason.


I would like to know that as well. We have been using wkhtmltopdf for around 2 years and have had many issues with incompatible CSS. I considered PrinceXML - it looked solid, but it's a bit expensive and also lacking some CSS support. I'm considering switching to Puppeteer / Chrome.


If you'd like specific features added, please let me know at jan@apify.com - I'm sure we'll figure it out.


My main concern is support for the CSS3 Paged Media Module.

I have been using PDFReactor for this reason, as wkhtmltopdf and WeasyPrint have had problems, as will any WebKit- or Chromium-based renderer, because those layout engines simply do not support paged printing.


Thanks, as a follow-up I found this comparison between PrinceXML and PDFReactor: https://www.print-css.rocks/tools.html


Also check out https://urlbox.io/. YC alum, super helpful.


I built Screen.rip, which also supports PDF generation. https://screen.rip/#pdf

Screen.rip gives you more control over the generated PDF beyond Puppeteer's options (for example, it can wait for certain elements to appear, inject CSS, or switch to the screen stylesheet instead of the print stylesheet).
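
For comparison, the corresponding page-level calls in raw Puppeteer look roughly like this (a sketch; the URL, selector, and injected CSS are hypothetical, and older releases call emulateMediaType emulateMedia):

    import puppeteer from 'puppeteer';

    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('https://example.com/report');
        await page.waitForSelector('#chart');                                  // wait for a specific element to appear
        await page.addStyleTag({content: 'nav, footer { display: none; }'});   // inject extra CSS
        await page.emulateMediaType('screen');                                 // use the screen stylesheet instead of print
        await page.pdf({path: 'report.pdf'});
        await browser.close();
    })();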


I love this service! For ease of adoption, I think you could allow pre-made scripts to be shared, so that non-technical users can easily set up workflows that go right into their email. For the technical folks, I think it would be great to have examples of things you can do with Apify that are a hassle to do with local headless Chrome.

Great job!


If you're interested in running your own personal Way-Back machine that uses Chrome headless for archiving (among other methods), check out Bookmark Archiver.

https://github.com/pirate/bookmark-archiver


We are not too happy with our EvoPDF license, so on that basis this is a good option. However, I do not think this allows adding headers, footers, page numbers, etc.


Is there a similar API around that accepts HTML instead of a URL? I've built one for my project, but I would prefer to delegate this to an external service.


Bear in mind that you’ll need to either embed all your resources, or only use CORS-enabled resources, or fake the origin for your HTML document so that it can access non-CORS-enabled resources on a particular domain.

Encoding your HTML as a data: URI might work for this service as-is (provided you use no non-CORS-enabled resources). Haven’t tried it.
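
A sketch of the data: URI route via Puppeteer (untested, as noted above; page.setContent() is the more direct alternative):

    import puppeteer from 'puppeteer';

    const html = '<html><body><h1>Hello</h1><p>All resources inlined or CORS-enabled.</p></body></html>';

    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('data:text/html;charset=utf-8,' + encodeURIComponent(html));
        // or, more directly: await page.setContent(html);
        await page.pdf({path: 'from-html.pdf'});
        await browser.close();
    })();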


We can easily add this feature. So you'd like to pass a single HTML file?


So something like wkhtmltopdf?


Question: does this create accessible PDFs? That would be a really nice _possible_ workaround for screen reader users having issues with a website.


I believe it's possible - by adding something like "scale: 1.5" to "pdfOptions", you might render an accessible PDF.


You could probably launch this service for free, and someone will probably create a Docker image and make it one-click.


As long as GPU support is not functional in headless mode, "any web page" is a misnomer. A large enough percentage of sites use GPU acceleration that headless mode is useless for them. This needs to be addressed by the Chrome team.


What do you mean by "use GPU acceleration"? Are you saying that a large percentage of sites use WebGL? Using GPU acceleration for web page rendering is just a browser performance optimization; browsers can render the same page without a GPU, only slower.


I'm more specifically talking about WebGL. We'd love to use headless Chrome at our company, but we can't. But even for things like CSS transforms, we do a lot of really heavy 3D work, and software emulation won't cut it.


CSS transforms work just fine without the GPU. We use them extensively for screenshot-testing our CSS-transform-based animations on https://oddslingers.com.


We are using a ton of them to create 3D heads up display camera overlays. They are too slow with GL software rendering.


Not sure what you mean? Usually people use either CSS 3D Transforms or WebGL, but not both.


What do you mean? We are using both right now. Why can't they both be used on the same page?


I think you're confusing WebGL (explicit GPU usage) and how GPU is used to implement CSS transforms (transparent, implicit GPU usage).


No, I'm not. We are using both a webgl context to render 3d objects on the screen using shaders, and 3D CSS transforms to render overlays on a video stream.


Ok you are. But most web sites are not.


Does this work if the page is behind a password/SSO wall?

And is it possible to print multiple Chrome tabs?

Printing pages to PDF is pretty straightforward. It's the above two issues where I've run into problems. Does anyone know of a good solution to the second one?


Assuming you want this done automatically, what's the advantage of 'printing' multiple tabs to PDF in a headless browser, over just sequentially loading and printing the pages you want done?
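
With Puppeteer that's just one browser and a loop - a rough sketch (the URL list is made up):

    import puppeteer from 'puppeteer';

    const urls = ['https://example.com/a', 'https://example.com/b'];   // whatever pages you want printed

    (async () => {
        const browser = await puppeteer.launch();
        for (const [i, url] of urls.entries()) {
            const page = await browser.newPage();                      // a fresh tab per URL
            await page.goto(url, {waitUntil: 'networkidle0'});
            await page.pdf({path: `page-${i}.pdf`});
            await page.close();
        }
        await browser.close();
    })();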


You can give https://screen.rip/#pdf a try. It supports capturing pages behind auth.


Not yet, but we can quite easily add these features. Just let me know at jan@apify.com what you would need.


Chrome headless does, just specify the --user-data-dir parameter to give it a profile to use (a profile where you're authenticated to the site you want to snapshot).

e.g.

chrome --disable-gpu --headless --user-data-dir=/Users/username/Library/Application\ Support/Google/Chrome/Default https://example.com/paywall/article.html

Alternatively, use puppeteer to script the auth process. https://github.com/GoogleChrome/puppeteer
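
A sketch of what scripting the login with Puppeteer might look like (the login URL and form selectors are hypothetical - adjust them to the real site):

    import puppeteer from 'puppeteer';

    (async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();

        // log in first (hypothetical selectors)
        await page.goto('https://example.com/login');
        await page.type('#username', process.env.SITE_USER || '');
        await page.type('#password', process.env.SITE_PASS || '');
        await Promise.all([
            page.waitForNavigation(),
            page.click('button[type=submit]'),
        ]);

        // then print the protected page
        await page.goto('https://example.com/paywall/article.html', {waitUntil: 'networkidle0'});
        await page.pdf({path: 'article.pdf'});
        await browser.close();
    })();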



