In my opinion, anything that touches compiled packages like PyTorch should be packaged with conda/mamba on conda-forge.
I have found it is the only package manager for Python that will reliably detect my hardware and install the correct version of every dependency.
Try pixi! Pixi is a much saner way to build with conda + PyPI packages in a single tool, which makes torch development much easier regardless of whether you use the conda-forge or PyPI builds of PyTorch. https://pixi.sh/latest/
Likewise, this was my experience. Whenever I need to "pip anything", I know I'm in for a bad time. Conda is built for literally this exact problem. Still not a breeze, but much better than trying to manually freeze all your pip dependencies.
Just to illustrate this point, poppler [1] (which is the most popular open-source PDF renderer) has a little tool called pdftocairo [2] which can render a PDF into an SVG.
This means you can delegate all PDF rendering to poppler and work only with actual graphical objects to extract semantics.
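For example, a minimal sketch of that workflow (file names are hypothetical; pdftocairo ships with poppler-utils):

```python
import subprocess
import xml.etree.ElementTree as ET

# Render page 1 of the PDF to SVG; -svg keeps text and lines as
# structured vector objects instead of rasterizing them.
subprocess.run(
    ["pdftocairo", "-svg", "-f", "1", "-l", "1", "input.pdf", "page1.svg"],
    check=True,
)

# The SVG is plain XML, so the graphical objects can be inspected directly.
root = ET.parse("page1.svg").getroot()
print(sum(1 for _ in root.iter()), "elements in the rendered page")
```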
I think the reason this method is not popular is that there are still many ways to encode a semantic object graphically. A sentence can be broken down into words or letters, table lines can be formed from multiple smaller lines, etc.
As mentioned by the parent, rule-based systems work reasonably well for reasonably focused problems, but you will never have a general-purpose extractor, since the rules need to be written by humans.
There is also PDF to HTML and PDF to text, MuPDF also has PDF to XML, both projects (along with a bucketful of other PDF toolkits) have PDF to PS, and there are many, many XML, HTML, and text converters for PS.
Rasterizing and OCR'ing a PDF is like using regex to parse XHTML. My eyes are starting to bleed out; I am done here.
It looks like you make a lot of valid points, but you also have an extremely visceral reaction because there's a company out there that's using AI in a way that offends you. Which is fair, still.
But I'm a guy who's in the market for a PDF parser service, and I'm happy to pay a pretty penny per page processed. I just want a service that works without me thinking for a second about any of the problems you are all discussing. What service do I use? Do I care if it uses AI in the lamest way possible? The only thing that matters is the results. There are two people in this thread, including you, dishing out PDF-parsing wisdom, but from reading it all, it doesn't look like I can do things the right way without spending months fully immersed in this problem alone. If you or anyone else has a non-blunt AI service that I can use, I'll be glad to check it out.
It is a hard problem, yes, but you don't solve it by rasterizing the PDF, OCR'ing it, and then using AI. You render it into a structured format. Then at least you don't have to worry about hallucinations, OCR problems with fancy fonts, text-shaping problems, or the huge waste of GPU and CPU spent painting an image only to OCR it and throw it away.
Use a solution that renders PDF into structured data if you want correct and reliable data.
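As one illustration (not an endorsement of any particular tool), MuPDF's Python bindings can extract the structured text layout directly, with positions and fonts, and no OCR involved:

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # hypothetical file name
page = doc[0]

# "dict" returns the structured layout: blocks -> lines -> spans,
# each span carrying its text, font, and bounding box.
for block in page.get_text("dict")["blocks"]:
    for line in block.get("lines", []):  # image blocks have no "lines"
        for span in line["spans"]:
            print(span["bbox"], span["font"], repr(span["text"]))
```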
This reminds me of the mp-units [1] library, which aims to solve this problem with a focus on physical quantities.
The use of strong quantity types means that you can have both safety and complex conversion logic handled automatically, while keeping generic code that is not tied to a single set of units.
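mp-units is C++, but the same idea can be sketched in Python with the pint library: quantities carry their units, conversions happen automatically, and mixing incompatible dimensions is an error:

```python
import pint

ureg = pint.UnitRegistry()

distance = 100 * ureg.kilometer
time = 2 * ureg.hour

speed = distance / time                    # 50.0 kilometer / hour
print(speed.to(ureg.meter / ureg.second))  # ~13.89 meter / second

# Dimensional safety: adding a distance to a time is rejected.
try:
    distance + time
except pint.DimensionalityError as e:
    print(e)
```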
I have tried to bring that to the Prolog world [2], but I don't think my fellow Prolog programmers are very receptive to the idea ^^.
I remember a long, long time ago, working on a project that handled lots of different types of physical quantities: distance, speed, temperature, pressure, area, volume, and so on. But they were all just passed around as "float" so you'd every so often run into bugs where a distance was passed where a speed was expected, and it would compile fine but have subtle or obvious runtime defects. Or the API required speed in km/h, but you passed it miles/h, with the same result. I always wanted to harden it up with distinct types so we could catch these problems during development rather than testing, but I was a junior guy and could never articulate it well and justify the engineering effort, and nobody wanted to go through the effort of explicitly converting to/from primitive types to operate on the numbers.
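For what it's worth, even without a full units library, distinct wrapper types catch exactly that class of bug at type-check time rather than at runtime. A minimal Python sketch (all names are made up for illustration):

```python
from typing import NewType

KilometersPerHour = NewType("KilometersPerHour", float)
MilesPerHour = NewType("MilesPerHour", float)

def set_cruise_speed(speed: KilometersPerHour) -> None:
    print(f"cruising at {speed} km/h")

def mph_to_kmh(speed: MilesPerHour) -> KilometersPerHour:
    return KilometersPerHour(speed * 1.60934)

measured = MilesPerHour(60.0)
# set_cruise_speed(measured)            # mypy: incompatible type
set_cruise_speed(mph_to_kmh(measured))  # OK: the conversion is explicit
```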
I had kind of written off using types because of the complexity of physical units, so I will be having a look at that!
My biggest problem has been people not specifying their units. In our own code, I'm constantly getting people to suffix variables with the units. But there's still data from clients, standard library functions, etc. where the units aren't specified!
I am working on a unit-aware arithmetic library for SWI-Prolog (1), modeled after the C++ mp-units library (2).
Turns out Prolog is really well suited for this because:
* it can store unit system data as code
* unit conversion maps naturally onto an iterative-deepening depth-first search (see the toy sketch below)
* manipulating symbolic arithmetic is so easy
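To illustrate the search point outside of Prolog, here is a toy Python sketch (not the library's actual API): conversion factors stored as data, and a conversion path found by a depth-first search wrapped in iterative deepening:

```python
# Toy illustration, not the library's API: conversion factors as data.
FACTORS = {
    ("mile", "kilometer"): 1.60934,
    ("kilometer", "meter"): 1000.0,
    ("meter", "centimeter"): 100.0,
}

def edges(unit):
    """Yield (neighbor, factor) pairs reachable in one conversion step."""
    for (a, b), f in FACTORS.items():
        if a == unit:
            yield b, f
        elif b == unit:
            yield a, 1.0 / f

def dfs(value, src, dst, depth):
    """Depth-limited search for a chain of conversions from src to dst."""
    if src == dst:
        return value
    if depth == 0:
        return None
    for nxt, f in edges(src):
        result = dfs(value * f, nxt, dst, depth - 1)
        if result is not None:
            return result
    return None

def convert(value, src, dst, max_depth=8):
    # Iterative deepening: retry DFS with growing depth limits,
    # so the shortest conversion chain is found first.
    for limit in range(max_depth + 1):
        result = dfs(value, src, dst, limit)
        if result is not None:
            return result
    return None

print(convert(1.0, "mile", "centimeter"))  # 160934.0
```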
Unfortunately, it requires users to compile SWI-Prolog from source, because the library uses some unreleased features.
If anyone would like to test it and report some feedback, I would be truly grateful!
Curl is one of the very few projects I managed to contribute to, with a very simple PR.
At the time, I was a bit lost with their custom testing framework, but I was very impressed by the ease of contributing to one of the most successful open-source projects out there.
I now understand why: it is because of their rules around testing and readability (and the friendly attitude of Daniel Stenberg) that a novice like me managed to do it.
I have recently started working on a SWI-Prolog library for unit-aware arithmetic [1].
It is still very bare-bones (especially the documentation), but I have started writing some examples [2] to showcase the library.
It is essentially a port of the C++ mp-units [3] library.
It was a lot of fun, and I found Prolog especially well suited to manipulating symbolic representations of units and quantities.
Funny how we can stumble on the same idea from a different perspective.
In my case, I wanted to generate efficient C code for an array library written in Prolog. The logical solution was to manipulate a C AST from Prolog, but since a lot of the AST needed to be written by me, I chose to make it very similar to the original C syntax.
Here is the grammar generating tokens from the AST and then C code from the tokens: https://github.com/kwon-young/array/blob/master/c99.pl
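The real project does this with Prolog DCGs, but the core idea can be sketched in a few lines of Python: an AST whose shape mirrors C syntax, rendered to source text by a small recursive emitter:

```python
# Toy analogue of the idea (the linked project uses Prolog DCGs).
def emit(node):
    """Render a nested-tuple C AST to C source text."""
    match node:
        case ("assign", lhs, rhs):
            return f"{emit(lhs)} = {emit(rhs)};"
        case ("binop", op, lhs, rhs):
            return f"({emit(lhs)} {op} {emit(rhs)})"
        case ("index", arr, idx):
            return f"{emit(arr)}[{emit(idx)}]"
        case ("var", name):
            return name
        case int() | float():
            return str(node)

ast = ("assign",
       ("index", ("var", "a"), ("var", "i")),
       ("binop", "*", ("index", ("var", "b"), ("var", "i")), 2))
print(emit(ast))  # a[i] = (b[i] * 2);
```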
I'm working on a pure SWI-Prolog grammar to describe modern music notation.
The end goal is to be able to do the last step of Optical Music Recognition and generate the final music score (in the MEI format) from a set of graphical primitives: https://github.com/kwon-young/music
I've been stuck for months on the description of note groups, because of their insanely complex 2D semantics.
A few years ago, my dad, who works in chip quality control, asked me how to do exactly this, but with images from optical microscopes.
I can confirm the post's observation that panorama-stitching software is not able to do the job. But what I found is that the OpenCV Stitcher class can do this perfectly out of the box. Unfortunately, there was no existing GUI for the class at the time, so I quickly made one in 3 days: https://github.com/kwon-young/ImageStitcher
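For reference, a minimal sketch of that out-of-the-box usage (file paths are hypothetical; SCANS mode uses an affine model suited to flat subjects like scans or microscope stages, rather than the rotating-camera panorama model):

```python
import glob
import cv2

# Load the microscope tiles (hypothetical paths).
images = [cv2.imread(p) for p in sorted(glob.glob("tiles/*.png"))]

# SCANS mode: affine transforms between images, no lens/rotation model.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, mosaic = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("mosaic.png", mosaic)
else:
    print("Stitching failed with status", status)
```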
It would have been nice if the post had compared its approach to the Stitcher class. Maybe the number of images, the size of the final image, or the stitching error cannot be sufficiently controlled with the Stitcher class?
Yeah, I should really try to improve it... But it was just 3 days of hacking to put a frontend on the OpenCV Stitcher class and produce an exe that my dad could use.
I don't recall finding Hugin when I did my (short) research on image stitching tools. Thanks to you, I've read the scanned-image stitching documentation, and I suppose it could work.
However, the process seems quite complicated and slow, asking you to draw control points and all that.
Microscope pictures have the particularity that there is nearly no deformation in the images, but you have a lot of them, so you want the process to be as automatic as possible.
That was the goal of my tool: make the simplest GUI and process possible for the task at hand.