The hard part of PDF generation is support for complex scripts (Arabic, Indian languages, etc.), including embedding a font subset. On Windows this is usually accomplished using the Uniscribe library (which is not available on Linux). QuestPDF appears to be using HarfBuzz for this purpose. If that works well then this is a winner!
Gold standard? Even though serious bugs are not fixed [1] because "the code is too fragile to touch at this point"? Looks like Android uses HarfBuzz, if so it can't be that bad.
Oh man, .NET needed this. Did some PDF work for a .NET project last year and found the ecosystem to be somewhat light on PDF support. There's a few commercial options, but they're pricey.
It's basically a wrapper for wkhtmltopdf, but I develop an app that has probably generated a million +/- invoices/statements over the past 5 years with it, and it's been rock solid for me. It was a bit of a bear to get working the first time (not a ton of documentation that I could find at the time), but once it was working, it was easy to add or change documents and layouts.
As it uses wkhtmltopdf under the covers, it is an HTML->PDF tool, but I prefer that, at least for my use case.
Not sure there is a dotnet-core version, so that might be a problem for some.
I've been using Rotativa[1] for URL to PDF generation which is also a wrapper for wkhtmltopdf. They have a dotnet-core[2] version and also a SaaS[3] but it's worth mentioning that Azure PaaS supports wkhtmltopdf[4] so I just self-host.
Looking at QuestPDF's API docs, it doesn't look like they support URL / HTML to PDF generation. I think this would be a great addition especially given the age and issues with Rotativa and TuesPechkin on their public repos.
The various wkhtmltopdf wrappers (there are many for .NET) work, but note wkhtmltopdf itself (based on Qt 4/5) is abandoned and the general response for issues was always 'just hack the HTML until it sorta works'.
We're using Puppeteer with Chrome. It's easy to test, as basically it runs Chrome's Print to PDF. Headers and footers are difficult to work with and page breaks can be tricky, but it could work for a lot of layouts.
I don't know if it's due to "partnerships" but I never could understand why Microsoft didn't do better at supporting .NET Word & PDF tooling since .NET Core came out. The older versions I know at least had support for Word docs. Creating documents is a huge foundation of their company.
> the project had now been silently moved to GitHub Enterprise (likely in the short window @dnfadmin had owner access). The author states that projects in GitHub Enterprise can be entirely controlled by the owner of the account (the .NET Foundation). This transfer happened silently.
I think a lot of people have moved on and are using services or Puppeteer to generate their PDFs. I know we did, since we couldn't find a library that worked properly for our use case.
The tl;dr of using Puppeteer for this is "We run Chrome in a headless mode, load your page, and then print to PDF with it".
It makes me nervous having Chrome running on the server, even inside a container without root. Doubly so if the user is able to control any portion of the page being run by Chrome.
We used AWS Lambdas to execute the PDF renders and upload the result to S3 using a signed URL passed in from the request. Complete insulation from our own application process: all of the data is passed into the request, so the worst the user can do is add a malicious file to our S3 bucket.
This looks great. I am glad to still see good work being done in this space.
I had used https://gotenberg.dev/ on AWS in the past. Many of the options available at the time weren't usable in Azure outside of a VM because they needed GDI interfaces that were disabled for security reasons. Interested to see how it compares to that and the other options being floated at the time, like Puppeteer.
Good point! There's an open issue regarding that, and it seems to be due to the fact that under the hood, QuestPDF uses Skia, which itself lacks support for tagged PDFs: https://github.com/QuestPDF/QuestPDF/issues/193
This could be a SkiaSharp limitation. This thread made me interested in Skia and I started looking around their site and did a quick search for "tagged PDF" on their Milestone Release Notes.
If they mean the same thing by tagged PDF as what is being discussed in this thread, that page lists "Add new APIs to add attributes to document structure node when creating a tagged PDF.", in a milestone that could be as old as 2020 [0]
Are people really OK with writing code in C# for data presentation? How is it fine having to compile code when you want to change a font colour for example?
And just from looking at the examples you can see how quickly it becomes messy. It looks like something that would work fine for very simple document structures but would get messy really fast once you add any level of complexity. Imagine having to write code for a document with multiple levels of headings, tables and images with captions, parts of text that need emphasis, etc.
From my experience the best solutions for generating PDFs in .NET are, as mentioned here before, wkhtmltopdf wrappers. Take a Razor view, add some document properties (page size, margins, headers, footers, etc.) in code, and output to a PDF.
I would even argue that on desktop, using Office APIs to populate Word documents and output them to PDF is more efficient than this.
The concept of writing the presentation layer in the actual programming language isn't new. Flutter developed by Google is a great example.
It all depends on how you structure your code. Of course, you can create huge chunks of HTML+CSS that cover your entire document. Or, to improve maintainability, you can split that HTML into multiple components and compose them together.
Very similar rules can be applied to this library, where you split your code into smaller parts by using properly named methods. That gives you a better understanding of the structure and ways to traverse the implementation.
Furthermore, using a programming language gives you many benefits. You can rely on all language features like loops, conditions, methods, formatting, recursion, etc. You have access to IntelliSense, static analysis and refactoring tools.
In terms of changing the layout content, it would be equally difficult when using any markup language. After all, you don't want to parse a markup file every time you generate a PDF, for performance reasons.
HTML may be a good choice for relatively simple documents, where you don't care that much about splitting content between pages, etc. However, you will still fight performance problems related to running an entire web browser.
I recently started using this in production instead of Dink2PDF, generating 30K PDFs daily (in tens of minutes, with multiple threads). It works great, highly recommended.
Dink2PDF was crashing monthly in this scenario due to an internal unmanaged memory problem, so we had to replace it. Not to mention that HTML-to-PDF libraries are insecure, and dink is no exception: you can execute arbitrary code on the server. On top of that, you need to have a full browser engine in your app...
Crashing is a problem with wrappers for wkhtmltopdf library when they run concurrently (you need the right locks, some wrappers have this but not dink apparently?). However, process-based wkhtmltopdf wrappers are also available and don't have this problem at all (e.g. nreco. It's also not too difficult to create a wrapper for oneself).
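For illustration, a process-based wrapper can be as small as the sketch below. It assumes a `wkhtmltopdf` binary is on the PATH and relies on its convention of `-` meaning stdin for the input and stdout for the output; the function names `build_command` and `html_to_pdf` are made up for this example and are not taken from nreco or any other wrapper.

```python
import subprocess


def build_command(binary="wkhtmltopdf", page_size="A4"):
    """Build an argv that reads HTML from stdin and writes the PDF to stdout."""
    # The first "-" is the input (stdin), the second "-" is the output (stdout).
    return [binary, "--quiet", "--page-size", page_size, "-", "-"]


def html_to_pdf(html, binary="wkhtmltopdf", page_size="A4", timeout=60):
    """Render one document in a separate process and return the PDF bytes.

    Each call gets its own process, so concurrent callers share no
    in-process native state and need no lock around the renderer.
    """
    result = subprocess.run(
        build_command(binary, page_size),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # kill a hung render instead of blocking the caller
        check=True,       # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

The trade-off is one process start per document, which is the price paid for isolating crashes and memory corruption from the host application.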
We had a very funky case: the crash happened only a couple of times per month, with the web app doing daily work via a REST API and producing a very large collection of PDFs in a single API request, each having from a couple to 5k pages. It was very hard to troubleshoot and pinpoint the memory corruption problem (it took us several months; one doesn't expect this in a C# app). Finally we switched to QuestPDF. Surprisingly, performance was exactly the same as with dink; I see a 50% speed-up on textual PDFs in a recent changelog, so this might be a game changer (TBH performance is great even now).
BTW, we used dink for half a decade in public web apps with millions of users. Dink was used to create PDF reports and we never had a crash in that concurrent scenario. However, when we started doing typical non-web multithreading, this started to happen.
Does this allow you to use existing PDFs as "templates"? We do that a lot with PDFs. It allows end users to design in Adobe Acrobat and upload to our product. We can then inject dynamic data into placeholders at runtime. We do this for text and images.
It shouldn't be too difficult to add support for this. I authored a Go library which adds support for importing PDFs into a new PDF generator (either gofpdf or gopdf). It is around 2,500 lines of code: https://github.com/phpdave11/gofpdi
Not at the moment. We are currently using RadPdfProcessing [1] from Telerik ($) to do so. It is a processing library that allows creation, import and export of PDF documents from code.
Because styling semi-complicated PDFs with CSS is a layer of hell right above e-mails & old browsers. I say this as someone who enjoys CSS & in the .NET world has used this method over things like Crystal Reports (even before they dropped their .NET support).
I haven't found this at all. I've been rendering very complex HTML to PDF (complex SVG charts, headers, footers, etc.) and it's been fine. It's just a matter of getting the element heights/widths correct. Once you've got the basic page template done, it's not much effort at all to tweak as required.
There's a good matrix of feature support at https://print-css.rocks/lessons for all the things HTML-to-PDF engines can (and can't) do.
The CSS3 Paged Media spec was born broken on some fundamental things like counter resets, then effectively abandoned in 2013, so some complex print-specific requirements like fully customizable page numbering just don't happen without additional tooling. Accessible tagged PDFs are still a struggle, and I think only Weasyprint readily supports them among free or open-source options (and only since around September).
You're just pushing complexity around. Now you have to figure out how much content you can put on each page before you have to make a new one. The reason it seemed so easy is that you just deferred the hard part.
But let's use an entirely new, esoteric PDF-specific layout language instead and we won't call that the "hard part".
Actually to be fair it obviously depends on the situation. If you are churning out large volumes of PDFs then it might make sense to get the efficiency with a PDF-specific language. But HTML -> PDF definitely has its place too and is not as hard to work with as people are claiming.
Sometimes they might be. But even if you know the font size of the headers, the contents, and the bulleted lists, you still won't know where the paragraphs wrap, unless you also require a fixed-width font, and don't allow any of the word lengths to change. But then we probably wouldn't be calling it a "modern library for PDF generation".
The Paged Media spec on counters and counter-resets paints implementations into a corner. They can't both comply with the spec and implement page count resets on page breaks. This has been a known issue with the spec since 2013[1][2] and been a thorn in implementations since.[3]
I’ve had trouble with html -> pdf when it comes to tables. Most of the packages out there don’t have a way to say “if only the first few rows of the table will render before a page-break, then put the page-break before the table”. Or getting a table to not break at row X when that row has a cell that spans more than one row. Or getting table headings to repeat on the next page when a page-break mid-table is unavoidable.
These are all things that good page layout software can deal with easily.
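The first of those rules (break before the table when only a few rows would fit) can be sketched roughly as follows. This is an assumption about how a layout engine might decide, not any particular product's algorithm; `should_break_before_table` and `min_rows` are hypothetical names, and the heights are assumed to be already measured in the same unit.

```python
def should_break_before_table(remaining, header_height, row_heights, min_rows=3):
    """Return True when the header plus the first min_rows rows do not fit.

    remaining     -- vertical space left on the current page
    header_height -- measured height of the table's heading row
    row_heights   -- measured heights of the body rows, in order
    min_rows      -- fewest body rows allowed to sit alone above a break
    """
    # A table shorter than min_rows is simply kept together as a whole.
    needed = header_height + sum(row_heights[:min_rows])
    return needed > remaining
```

For example, with 50 units left, a 10-unit header and 15-unit rows, the first three rows need 55 units, so the table would start on the next page.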
HTML to pdf is also pretty slow and unreliable when you want a table of contents or an index with page numbers. On a fast machine, a 200 page PDF can take several minutes to generate. PrinceXML is the only software I’ve tried that does a good job of it. For very simple documents (no CSS, limited Unicode) HTMLDoc is pretty good and very fast.
Row count is not the only condition for how much can fit on a page.
Content matters for row height, as do fonts, styling, etc. Footers, too: the number of footnotes (and thus the height of the footer) can depend on the page content.
Maybe there's an image in there, or some sort of specially styled content that makes the line height larger than normal.
You have to be able to render the row to know the final dimensions, to be able to make a call whether it should go on that page or the next. If you don't, then you end up with one page actually rendering over into a second page.
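As a rough sketch of that point, pagination only becomes possible once every row has a measured height. The function below assumes the measuring has already happened and just does the greedy placement; `paginate` and its parameters are illustrative names, not part of any library mentioned here.

```python
def paginate(row_heights, page_height, footer_height=0):
    """Greedily assign row indices to pages using their measured heights."""
    usable = page_height - footer_height
    pages, current, used = [], [], 0
    for index, height in enumerate(row_heights):
        if height > usable:
            # A single row taller than a page needs splitting, which this sketch skips.
            raise ValueError(f"row {index} is taller than a page")
        if current and used + height > usable:
            # The row would spill over, so close this page and start a new one.
            pages.append(current)
            current, used = [], 0
        current.append(index)
        used += height
    if current:
        pages.append(current)
    return pages
```

With rows of heights 10, 10, 30 and 10 on a 40-unit page, this gives two pages holding rows 0-1 and rows 2-3. Reserving 5 units for a footer pushes the last row onto a third page, which is exactly the kind of dependency described above.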
I still find jasper reports the best in pdf generation. Jasper studio gives you okay design tools. Much better than hand coding. Jasper server means integration is as simple as a rest interface. The community edition seems to do everything I need.
A PDF file is a program for a virtual machine that draws characters. For instance, I believe fonts in PDF work like PostScript fonts, where (for left-to-right languages) each glyph in the font is actually a bytecode function that starts with the brush in the lower-left corner of where the glyph is to be drawn, draws the glyph, and leaves the brush at the lower-left corner of where the next glyph is to be drawn. I think it's somewhat similar to turtle graphics, if you're familiar with Logo programming or G-code if you've ever hand-coded a CNC mill. (PostScript is text instead of bytecode. PDF is an odd mix of a binary and text format, which helps explain why it has had so many parsing security vulnerabilities over the years.)
For common cases, it may be possible to basically decompile the PDF, modify the text, re-flow the text, and re-compile to bytecode. However, it's very complicated to do in the general case. (Note that in HTML, the browser determines how to best lay out the text, but with PDF, the PDF generator makes the layout decisions.)
Also, many PDF renderers will "compress" fonts by lazily building up an embedded font as glyphs are used in the document. These typically will assign "a" to the first glyph used, "b" to the second, etc., so if you decompile "This is some text", you'll see "abcd cd defg hgih". Some PDF generators will helpfully annotate the generated text with "backing text" metadata to help screen readers/copying-to-clipboard, but it's far from universal. So, you might need a database of hashes of all of the bytecode functions in a large number of fonts and/or some image-to-text software in order to reliably decompile the PDF.
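A toy model of that remapping, reproducing the example above: codes are handed out in order of first use, and the original text is only recoverable if the mapping is kept. This is a simplification for illustration (spaces are left untouched and more than 26 distinct glyphs are not handled); in real PDFs the "backing text" role is typically played by a ToUnicode CMap attached to the font.

```python
def subset_encode(text):
    """Assign codes 'a', 'b', ... to glyphs in order of first use."""
    mapping = {}   # original character -> code in the embedded subset font
    encoded = []
    for ch in text:
        if ch == " ":
            encoded.append(ch)  # keep spaces so word boundaries stay visible
            continue
        if ch not in mapping:
            mapping[ch] = chr(ord("a") + len(mapping))
        encoded.append(mapping[ch])
    return "".join(encoded), mapping


def decode_with_backing_text(encoded, mapping):
    """Recover the text using the mapping, i.e. the 'backing text' metadata."""
    reverse = {code: ch for ch, code in mapping.items()}
    return "".join(reverse.get(code, code) for code in encoded)


encoded, mapping = subset_encode("This is some text")
print(encoded)                                     # abcd cd defg hgih
print(decode_with_backing_text(encoded, mapping))  # This is some text
```

Without `mapping`, a naive extractor sees only the codes, which is the gibberish described above.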
If you're unable to copy text out of a PDF or you get gibberish when you copy text from the PDF, it's likely because the PDF lacks this "backing text" metadata (and in the gibberish case, likely a compressed embedded font). Some scanners will helpfully perform OCR to add this backing text metadata to the generated PDF.
Source: I did a small amount of work related to PDF analysis in Google's web search indexing pipeline over a decade ago. Most of my work was related to figuring out how JavaScript altered web page text, but I did learn just enough about PDF to be dangerous. At the time, Yahoo was Google's biggest competitor, and tons of their indexed PDFs had preview text that was this compressed font "abcd cd de..." garbage. Yahoo obviously naively decompiled the PDF and just trusted that "a" in the embedded font was a bytecode function that drew the glyph "a".
Maybe the letter you intend to add is not part of the subsetted font. Font subsetting is extremely common.
Maybe by coincidence all the letters are present. Then you'd have to deal with manually adjusting the spaces and reflowing the text. Reflowing the text can be done, but it is cumbersome. It's akin to fixing a bug in a program not by changing the source and recompiling, but by binary patching.
In contrast, it's much easier to delete some letters in the PDF and keep everything else in the same place. In fact I've had obvious PDFs that have a copyright notice on every page. Deleting that can be done with qpdf and just vim (basically deleting the Tj or TJ operators).
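The manual edit described above could be scripted along these lines. This is only a sketch under strong assumptions: the content stream has already been uncompressed into text, each text-showing operation sits on its own line, and the notice appears as a readable literal string (which it will not if the font was subsetted with remapped codes, or if the string is split up inside a `TJ` array). The name `drop_text_lines` is made up for this example.

```python
import re

# A line whose last operator is Tj (show a string) or TJ (show an array of strings).
TEXT_SHOW = re.compile(r"^.*\b(?:Tj|TJ)\s*$")


def drop_text_lines(stream, needle):
    """Remove text-showing lines that contain needle; keep everything else."""
    kept = []
    for line in stream.splitlines():
        if TEXT_SHOW.match(line) and needle in line:
            continue  # drop only the drawing of this string
        kept.append(line)
    return "\n".join(kept)


stream = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n0 -14 Td\n(Copyright ACME) Tj\nET"
print(drop_text_lines(stream, "Copyright"))
```

Positioning operators such as `Td` are left in place, so the remaining text stays where it was.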
This is fascinating. I recommend you read the PDF specification.
You can, using a tool like Adobe Acrobat. But a PDF is a fixed layout, where each line of text is a positioned box. So editing text will not cause reflow across lines.
Not knowing anything about PDF generation (but will need to soon), what can these libraries do that you can't do with something like a puppeteer web service and create PDFs with HTML/CSS?
Using HTML/CSS for PDFs really just isn't a good idea in my experience. It makes layout extremely cumbersome. If you just need to spit some data out onto a page, sure it works I guess. However, doing more complex page layout with an actual design element often introduces scenarios where a markup language just can't work.
Scale/performance. The interface is also straightforward to use. Puppeteer or any nonembedded process is just unnecessary hassle/overhead in a lot of cases.
It's an overhead but not a big one, at least for web applications, especially if they run as containers anyway. And then it really scales like crazy. Yes, this pdf generator may be faster at what it does, but a headless browser with paged media polyfill can do a lot more than this and uses html+css which are widely used standards.
Sure, but as others have said, how do you get column headers appearing on each page, put metadata into your documents, make elements properly selectable, etc.?
"Just run it as a container" is a bit of an industry cop-out for making stuff unnecessarily complex.
I put puppeteer into a serverless function and it worked well enough for low tens of thousands of PDFs a day. It's not fast, nor efficient, but it was reliable and surprisingly cheap. It was a definite improvement over the existing solution which was a terrible proprietary black box that was occasionally returning the wrong invoice, but that is not saying much. It was an easy drop-in replacement because we were already generating invoices in HTML, so we just sent them to the new PDF service instead.
Something like this is likely much more efficient than launching a whole browser for each PDF.
Tables spanning multiple pages are a major problem. They just don't work with the popular htmlpdf tool that everyone uses to power their tools. That is the use case I am interested in.
>> You are 250 lines of C# code away from creating a fully functional PDF invoice implementation.
As a web developer this hurt to read. This is a task which is just crying out for a markup language and a stylesheet, not hundreds of lines of declarative C# code.
Even the "complex example" in their documentation looks like the most basic of web pages.
Coming from the web dev space into backend on a project that heavily relies on PDF generation, I would say that something like a PDF often cannot be expressed with just markup and a stylesheet. There's a large difference between something like the web (which must be expressed with some fluidity of layout) and a very static document like a PDF. Page breaks, readability, print supply, watermarks, paging, etc. all have to be considered.
I've worked with PDF markup tools built on libraries like this for 20 years, both third-party and in-house custom. It usually takes 10 minutes to find out the markup doesn't support what is required for the task. With third-party tools you have to find a hack or drop it altogether. In-house you can maybe add something in, but you'll have to do it fast, and if you can't break it down into a general-purpose feature (which you probably can't, because the fundamental philosophy of your "easy" markup language wasn't designed with anything like this in mind), you'll just have to uglify the markup language even more or, again, drop it altogether.
PDF is an insanely complex spec (I’ve spent more time reading it than most because I need to know bits of it for my job and I just generally find it fascinating). But a lot of devs just need to put some content on the screen to match a template they were given. In my experience, a complete enough markup language allows you to bang out and maintain those templates better than code.
I know it doesn’t suit every need, but it’s just a way of representing the data so it’s closer to the final output than imperative code is. Definitely take your point though about the limitations becoming dealbreakers.
What if the code resembles the markup language in terms of readability, but still gives you access to more advanced features? Surely, there is space for various approaches, it all depends on your task and requirements
Webpages and PDFs (paged documents) are fundamentally different: you won't easily be able to support headers and footers, page breaks and orphans on a webpage. You can create basic invoices on webpages, but anything more complex (and by that I mean any serious Word document) will require you to twist HTML. Try to get column headers to repeat on each printed page of an HTML page.
The markup doesn’t need to be html - and would be better not to be. The point is more that templating languages are great for formatting data as markup and markup is great for driving layout. With this library as a backend you can make something super usable.
I believe browsers have been repeating table headers on printed output for some time.
Paged media CSS is designed for this, although most browsers don't fully support it. PrinceXML is the go-to for full paged media support.
IMO they are not fundamentally different; they are both document formats. PDF just has a fixed paged rendering layout baked in, while HTML can flow and adjust to the rendering target. The main issue is the lack of full print CSS support in HTML rendering engines.
Still, to switch back to the previous point, it seems it's more a divergence between using markup or code to design a document. Both have valid usage and benefits depending on your case.
In my case and my apps, I often need to handle complex conditions that fit better, IMO, in procedural code (complex invoices and agreements). In other cases (reports), I prefer to use a markup language.
There are lots of procedural tools for generating HTML. If modern browsers fully supported print CSS, then you could use them for complex PDF generation or direct printing, either client side or headless on the server.
If your app is a web app this is a no-brainer: the user's browser could simply do the print or PDF conversion as needed.
I do see a use for more direct libraries in native apps, although if every native client had a browser control with full print CSS support even then it might not be such an issue.
> If your app is a web app this is a no-brainer: the user's browser could simply do the print or PDF conversion as needed.
That's arguable. IME most would prefer to just get the PDF file with one click (which is also a better UX) rather than deal with additional browser dialogs. Not everyone knows how to do print-to-PDF or even knows it exists.
Or do you mean browsers expose print-to-pdf functionality as an API?
Hitting print in the browser or calling Window.print() if you want to force the dialog.
If you serve a PDF you still need to hit print or use dialog to save, you can use a headless browser server side to serve that if needed.
I do think browsers could use better print APIs, but you're not getting around that with server-side PDFs unless the server prints directly to on-site printers or something.
I am not sure it is a good idea to think about webpage and PDF content as the same. After all, they serve different purposes and their layout should be optimized for the use case.
None of these things are difficult at all with html. Plus you have the benefit of having the document viewable in a web browser too. You use the exact same html layout for both with specific css (heights, widths mainly) for each.
We do a lot of dynamic report gen PDFs and this is something we'd prefer.
Right now, we basically emulate this technique w/ HTML->PDF. We build chunks of report HTML with various string interpolation methods and then compose those to obtain our final HTML output.
Raw, declarative HTML is nice if you don't have an undefined # of things to describe with it. When you are looping and projecting domain types into a report, things get a lot trickier.
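A minimal sketch of that "chunks of HTML composed in code" approach is shown below. The report shape, the tuple layout `(name, quantity, price)` and the function names are all invented for illustration; the point is only that each chunk is a small function and the loop over domain objects lives in ordinary code.

```python
from html import escape


def render_row(item):
    """Render one (name, quantity, price) tuple as a table row."""
    name, qty, price = item
    return f"<tr><td>{escape(name)}</td><td>{qty}</td><td>{price:.2f}</td></tr>"


def render_table(items):
    """Compose the header chunk with one row chunk per item."""
    rows = "".join(render_row(item) for item in items)
    head = "<thead><tr><th>Item</th><th>Qty</th><th>Price</th></tr></thead>"
    return f"<table>{head}<tbody>{rows}</tbody></table>"


def render_report(title, items):
    """Compose the full document from the smaller chunks."""
    total = sum(qty * price for _, qty, price in items)
    return (
        f"<html><body><h1>{escape(title)}</h1>"
        f"{render_table(items)}"
        f"<p>Total: {total:.2f}</p></body></html>"
    )
```

The resulting string would then be handed to whatever HTML->PDF step is in use. Escaping user-supplied values matters here, since the output is fed to a browser engine.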
I used https://github.com/Antaris/RazorEngine to generate all sorts of complex HTML, email body etc. back in the day. Since it follows razor syntax, loops etc. work well
There are many good reasons for choosing a programming language over a markup language. C# has countless features, both functional and syntactic: conditions, loops, methods, formatting, iteration, recursion, etc. Additionally, each of those features is well supported by all major IDEs. Writing your presentation layer in a proper programming language not only builds on your existing skills but also gives you access to tools such as code completion and IntelliSense. Moreover, using a fluent API helps with keeping the code concise and easy to change.
At the end of the day, it all depends on how you use the technology, doesn't it?
Not familiar with .Net but I’d imagine this would probably be fairly easy to build on top of this library (and I agree, xml is often a much better way to generate reports).
I’ve done something similar but in Python and generating Excel documents. I use jinja for templating to create the xml and then parse that and convert to commands that drive the library that creates the final document.
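Roughly, that pipeline looks like the sketch below. To keep it standard-library only, `string.Template` stands in for jinja, and the output is a plain list of command tuples rather than calls into a real spreadsheet library; the element names (`sheet`, `row`, `cell`) and command names (`add_sheet`, `write`) are invented for this example.

```python
import xml.etree.ElementTree as ET
from string import Template
from xml.sax.saxutils import escape, quoteattr

# Stand-in for a jinja template: the layout is described as markup.
SHEET_TEMPLATE = Template("<sheet name=$name>$rows</sheet>")


def render_xml(name, records):
    """Step 1: format the data as XML markup."""
    rows = "".join(
        "<row>" + "".join(f"<cell>{escape(str(v))}</cell>" for v in record) + "</row>"
        for record in records
    )
    return SHEET_TEMPLATE.substitute(name=quoteattr(name), rows=rows)


def to_commands(xml_text):
    """Step 2: parse the markup and turn it into commands for a document library."""
    root = ET.fromstring(xml_text)
    commands = [("add_sheet", root.get("name"))]
    for r, row in enumerate(root.findall("row")):
        for c, cell in enumerate(row.findall("cell")):
            commands.append(("write", r, c, cell.text or ""))
    return commands
```

A final step, not shown, would replay the commands against the library that writes the actual file. Keeping that step separate is what lets the same markup drive a different backend.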
A programming language referencing a template language library for processing a markup language to generate another markup language (PDF) sounds just about right.
Nitpick, but it’s a stretch to classify PDF as a markup language. A PDF is a graph of nodes that can encapsulate myriad different types of data, including things that are probably even Turing complete, like fonts. Even the graphics streams inside PDFs aren’t markup.
We build abstractions for a reason. I think we can all agree that templating markup for layouts has been a reasonable success story of the web generation.
You could just build the example in C#, grab all the required dlls from it and then load them in PowerShell with `Add-Type -Path '.\QuestPDF.dll'`.
Unfortunately it looks like this uses extension methods for everything, and those are a pain to use in PowerShell. You'll probably want to write the PDF creation bits as a C# cmdlet instead.
This is for making PDFs using C# code, right? And you can preview it while you work? I was wondering if that is available for other programming languages.
Since you mention Forth, I might mention that PostScript is another stack-based programming language (different than Forth although there are some similarities), which can be used to make PDF output. Additional PostScript codes could be made which you can load into your file in order to add additional procedures, etc for doing formatting that you will not need to write by yourself.
I don't care about karma at all, I really was just trying to ask a question. I'm interested in learning HTML and CSS. I'm also interested in creating PDFs. My question was: since this looks like it's for making PDFs using C# code, and you can preview it while you work, is that available for other programming languages? Besides HTML and CSS, someday I want to learn C and FORTH. I thought this would be a practical way to get familiar with programming languages.
We must, one day, realize that .NET will disappear when Microsoft stops supporting it. All work done with this framework is only a prison for future developers. This must end.
First of all, what does it have to do with this particular library?
Second, get a grip. .NET has been open source for a while, it's getting more popular again, JetBrains entered the stage with a very competitive IDE, it's easier than ever to use .NET via the CLI, and the days of having to use Visual Studio, or any Microsoft product really, are long gone. I don't like Microsoft either, but no one's forcing you to use any of their products anymore.