I know I should "just do it myself," but I keep waiting for something that can unsplit and unwrap PDFs generated in ACM double-column style with LaTeX word-breaking and turn it into an epub with graphics for the figures/tables. Trying to deal with that 9-ish pt. font is a huge pain for my old eyes. I ended up giving up on reading them on my iPad because keeping a reasonable zoom level and managing to scan down then over to the next column required the finger dexterity of a concert pianist (even on GoodReader, which is quite, well, Good).
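For what it's worth, the "unwrap" half of that wish is the easy part. Here is a minimal sketch, assuming the text has already been extracted one line at a time and that blank lines mark paragraph breaks. The function name `unwrap_lines` and the lowercase-continuation heuristic are my own assumptions, not something any of the tools mentioned here is known to do.

```python
import re

def unwrap_lines(lines):
    """Join hard-wrapped lines into paragraphs and undo line-break hyphenation.

    Assumption: a blank line separates paragraphs.
    Assumption: a line ending in letter + hyphen, followed by a line that
    starts with a lowercase letter, is a word split by the typesetter.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif re.search(r"[A-Za-z]-$", current) and line[0].islower():
            # Drop the trailing hyphen and glue the two halves together.
            current = current[:-1] + line
        else:
            current = current + " " + line
    if current:
        paragraphs.append(current)
    return paragraphs
```

The heuristic will wrongly join a genuine hyphen that happens to fall at a line end ("double-" / "column"), so a real tool would probably want a dictionary check on top of it.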
Calibre was mentioned in the article as being able to convert PDFs into epub format. I had my hopes up for a second, so I downloaded it and tried it on a textbook and a smaller scientific publication.
It threw up on both the math equations and figures. It didn't handle the general formatting of the book too well either.
To my knowledge, a good PDF->epub converter has not yet been built. Any takers?
It lets you upload pdfs and attempts to parse them into editable text.
The pdf parsing is based on my experiments with pdf-miner (http://denis.papathanasiou.org/?p=343), and while still imperfect (in general parsing pdfs is a difficult problem), it works fairly well for certain types of whitepapers.
This is a wicked cool site, but you need to put in screenshots of the input (how it went in) and the output (what the output looked like in an epub reader).
What approach do your algorithms use? Do you do recognition of title, subtitles etc based on differences in fonts, spacing, line length etc.? Or do you need to enter regexps to recognize those?
Do you recognize paragraphs correctly?
Can you filter out front and back filler like the ToC, and extract only the 'content' pages?
If so, it's 90% of what I'm looking for and I think good enough to pay for :)
I have some notes on how to approach this from when I tried to build it myself, including what functionality I consider necessary for an MVP. Let me know if you're interested...
I'm working on an FAQ/Help page which will show some of those features in more detail.
The algorithm I use is a variation of the code described here: http://denis.papathanasiou.org/?p=343 except the output is html, not text, so that I can take into account things like font sizes and paragraph breaks.
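To give a rough idea of what "taking font sizes into account" can look like, here is a minimal sketch. It is not the site's actual code. It assumes the layout step has already produced `(text, font_size)` pairs in reading order, and the function name `lines_to_html` and the 1.15 ratio are made up for illustration.

```python
from collections import Counter
from html import escape

def lines_to_html(lines, heading_ratio=1.15):
    """Turn (text, font_size) pairs into simple HTML.

    Assumption: the font size covering the most characters is body text.
    Assumption: anything at least heading_ratio times larger is a heading.
    An empty text entry is treated as a paragraph break.
    """
    if not lines:
        return ""
    # Find the dominant (body) font size, weighted by character count.
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    out = []
    para = []

    def flush():
        # Emit the paragraph collected so far, if any.
        if para:
            out.append("<p>" + escape(" ".join(para)) + "</p>")
            para.clear()

    for text, size in lines:
        text = text.strip()
        if not text:
            flush()
        elif size >= body_size * heading_ratio:
            flush()
            out.append("<h2>" + escape(text) + "</h2>")
        else:
            para.append(text)
    flush()
    return "\n".join(out)
```

For example, a 14 pt line followed by 9 pt lines comes out as an `<h2>` followed by `<p>` blocks. Spacing and line length, which the question above also mentions, are not used in this sketch.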
If you sign up and try it (it's free for the first 3 days), you'll see that the parser renders each pdf page as text, and it's up to you to decide which range of pages you want to use in your book.
Feel free to contact me by the form on that site, and I can reply in more detail.
I read most non-mathematical text in epub (usually converted from something else) because, as you say, it is better. But there is no tool support for making good epubs of math text, so I still need PDFs. When I have LaTeX source for what I read, I just compile with appropriate margin settings.
Really, PDF is just ill-suited for distribution of text. The only reasonable exceptions are when that text is explicitly meant for printing (à la fliers or posters), or when said text is not computerized, e.g. a scan of written script that has yet to be OCR'd.
Alternatively, provide documents in latex or similar and people can do the final compilation themselves, dictating exactly the details of the physical medium (be that printed paper or an electronic display of some kind) they will be using to view it.
This would require people learning how to do something, though.
On the iPad, GoodReader can do margin cropping on the fly, and remembers the margins you've set up for a document so they're reapplied when you open the document again.
I just wish it had some smarts for two-column PDFs (easily 90% of what I read). I often resize, read down the left column, then shift to the top, which moves the crop window and confuses GR horribly.
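The "smarts" needed here are not huge in the simple case. Below is a minimal sketch of two-column reading order, assuming each text block is a tuple `(x0, y0, x1, y1, text)` with the origin at the top left (native PDF coordinates start at the bottom left, so a real tool would have to flip y). The name `two_column_order` is hypothetical, and this is not something GoodReader is known to do.

```python
def two_column_order(boxes, page_width):
    """Return text boxes in reading order for a plain two-column page.

    Assumption: a box belongs to the left column if its horizontal
    center is left of the page midline, otherwise to the right column.
    Full-width items (titles, wide figures) are not handled specially.
    """
    mid = page_width / 2.0
    left = [b for b in boxes if (b[0] + b[2]) / 2.0 < mid]
    right = [b for b in boxes if (b[0] + b[2]) / 2.0 >= mid]
    # Within a column, read from top to bottom.
    by_top = lambda b: b[1]
    return sorted(left, key=by_top) + sorted(right, key=by_top)
```

A viewer could use the same ordering to jump the crop window from the bottom of the left column to the top of the right one.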
I've been wanting something like this for ages - particularly to print ebooks and latex stuff with their huge side-margins.
The basic aim is to trim all margins and print 2 pages side-by-side (landscape).
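The arithmetic for that layout is simple enough to sketch. This assumes the trimmed page size is already known in points and the sheet is A4 landscape (about 842 x 595 pt). The function name `two_up_layout` and the `gutter` parameter are made up for illustration, and it only computes placement, it does not touch a PDF.

```python
def two_up_layout(crop_w, crop_h, sheet_w=842.0, sheet_h=595.0, gutter=0.0):
    """Compute scale and offsets for two trimmed pages on one landscape sheet.

    Returns (scale, (left_x, y), (right_x, y)), where the offsets are the
    lower-left corners of the two scaled pages, each centered in its half.
    """
    # Each page gets half of the sheet width, minus the gutter between them.
    slot_w = (sheet_w - gutter) / 2.0
    # Scale uniformly so the page fits its slot in both directions.
    scale = min(slot_w / crop_w, sheet_h / crop_h)
    placed_w = crop_w * scale
    placed_h = crop_h * scale
    y = (sheet_h - placed_h) / 2.0
    left_x = (slot_w - placed_w) / 2.0
    right_x = slot_w + gutter + left_x
    return scale, (left_x, y), (right_x, y)
```

With a 400 x 560 pt trimmed page, the width is the limiting side, so each page fills its half of the sheet horizontally with a small band left over top and bottom.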
While Briss trims the margins just fine, printing the (trimmed) document as pdf (or ps) restores the margins. (Tried on okular/evince.) What gives?