I answer questions about Pandas (the Python data analysis framework) on StackOverflow from time to time. It's an exercise in patience, because many people post screenshots of their data instead of a reproducible code example. You have to point about every other newcomer to the documentation on how to write a proper question that one can actually answer.
I'd imagine other areas around StackOverflow (SQL, R?) are fighting similar issues. I've just tried it with a question (sure enough the second newest Pandas tagged question had a table as an image), and your tool produced a nice .csv.
It would be a godsend to have a button on StackOverflow that would replace a user-uploaded image of a table with some Pandas code that constructs the same DataFrame. Currently I would have to download the image, upload it to extract-table.com, download the .csv, load it into Python, run some code to create the code-based DataFrame.
I'd consider sending people on StackOverflow to your tool if you cut down some of the steps: (1) allowing users to paste in the URL of an image, and (2) producing Pandas code output that can be copy/pasted directly from the site (without having to download a csv).
For illustration: here's what the Pandas code would look like for the first example of extract-table.com:
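As a rough sketch of the conversion step (CSV text in, copy/pastable DataFrame code out): the CSV contents below are invented placeholders, not the actual extract-table.com example, and `csv_to_dataframe_code` is a hypothetical helper name. It assumes a rectangular table with a header row and keeps every value as a string.

```python
import csv
import io

# Placeholder CSV text. These values are made up for illustration and are
# NOT the real output of extract-table.com.
csv_text = "name,qty\nfoo,1\nbar,2\n"


def csv_to_dataframe_code(text: str) -> str:
    """Return Python source that rebuilds the CSV table as a pandas DataFrame.

    Assumes the first row is the header and every row has the same length.
    All cell values are emitted as strings; no type inference is attempted.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    lines = ["import pandas as pd", "", "df = pd.DataFrame({"]
    for i, col in enumerate(header):
        values = [row[i] for row in body]
        lines.append(f"    {col!r}: {values!r},")
    lines.append("})")
    return "\n".join(lines)


# Print the generated snippet, ready to paste into a question or answer.
print(csv_to_dataframe_code(csv_text))
```

The generated snippet only needs pandas when someone runs it, so the generator itself works with just the standard library.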
Off topic funny story: My highest voted answer on SO is a very basic one about Pandas, from 7 years ago. It's funny that I only used Pandas for a few weeks, years ago (I would need to relearn it from scratch now), but 90% of my SO score comes from that answer and I still get more points almost daily. In fact I'm in the top 6% of SO mostly thanks to that answer.
I'm in the same boat, 95% of my SO points come from an answer that was basically a copy pasted script to fix an obscure VMWare error with Ubuntu. Turns out a lot of people had the same issue that day.
Since all votes have the same weight I guess it makes sense that the answers to the most basic questions or highly common problems will get the most points. Maybe SO should have a button to donate points to an answer that really saved your bacon, a super-upvote if you will. (I know you can attach bounties to questions, but that's not really feasible when you come across something that has already been answered).
But yeah, crowd behavior is fun. I have the feeling I can time when some computer vision courses (or the semester) start, as suddenly there are many upvotes on my basic answer explaining BGR/RGB color space confusion with OpenCV, the computer vision library :)
Funny that this is brought up. As an undergraduate in a Data Science class we did an analysis of the SO dataset (we processed the whole thing using RStudio running on a big EC2 instance). I found that about 1,000 users who have made fewer than fifty posts have moderator privileges. In that report, I suggested that they should give users quality points (Upvotes / # Page Views) rather than straight reputation points.
Yes it is, IIRC I've given bounties to answers from long ago just to donate points to an answer that was really good. In fact it's in part exactly for this reason, since you can pick as one of the official reasons:
Ok it's feasible, let me reword: it's awkward. You have to hunt for the "start a bounty" link in the question, not the answer, and then presumably still have the minimum bounty period of 24 hours, after which you have to come back to award it to the answer you wanted to reward?
Nice work putting together this tool. Have you seen either Spark OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They both do a pretty good job at data extraction from tables as well.
Pair this with a snipping tool and all sorts of people in banking would use it for a few hours a day, especially if it could paste to Excel or at least fill the clipboard in a way pastable to Excel.
I used to work for a bank on their innovation team and pitched basically this, but as an intern I had neither the skill nor time to do it. But it was certainly something a bunch of people internally wanted.
Do you happen to know how to paste regular UTF-8 text into Excel/Google sheets as multiple cells? If I copy two cells in Sheets, I get a tab character (\t) between the cells. But if I try to paste "hello \t world" into Sheets then it's just dumped into one cell.
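One guess, not verified against every spreadsheet app: the paste only splits into cells when the text contains real tab characters between cells and newlines between rows, with no stray spaces around them. A minimal sketch of producing such text (`rows_to_tsv` is a hypothetical helper name):

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one line per row."""
    buf = io.StringIO()
    # A real tab between cells and "\n" between rows; csv handles quoting
    # for cells that themselves contain tabs or newlines.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


tsv = rows_to_tsv([["hello", "world"], ["foo", "bar"]])
print(repr(tsv))  # 'hello\tworld\nfoo\tbar\n'
```

If pandas is already in the picture, `DataFrame.to_clipboard()` puts a tab-separated copy of the frame on the clipboard, which may save the manual step.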
At least my bank was comfortable with cloud everything and people using APIs from approved partners. If you can write the report in Google Docs, as long as they were the ones plugging in their API key for the OCR, I imagine it would be fine.
With this image[1] from this question on SO[2], the output[3] is missing the last row. FWIW, I've had the occasional miraculous-looking results from AWS Textract, but you do need to keep an eye on what's happening.
Update: I just checked a bit more carefully, and this example[4] is also missing the last row.
Also, Danish ø seems problematic on your web page whereas the CSV has the right UTF-8 encoded bytes.
This is really awesome. I have tried to solve that many times. I got close, with OpenCV and Azure ML. I have even tried AWS Textract (~2 years ago). But this is the best implementation I have seen so far. Congratulations.
I'm not sure what application you are thinking of. But the reason I'm following this problem is UX. Years ago, I worked on a project where anyone could add product prices to a DB. They did that by typing their receipt (line items) into the DB. The major issue was that the UX was horrible.
With an API like yours, this is super simple. One photo. That's all.
Thank you! I have also been kind of obsessed with this problem. I have tried to solve it myself, going from an image to bounding boxes and trying to separate the boxes into columns. But that problem is just fraught with edge cases, so I decided to just use an existing tool.
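To give a flavor of why the column step is fragile, here is a deliberately naive sketch, not what the tool does: it groups word boxes into columns purely by horizontal gaps. The `Box` type, `group_into_columns`, and the `min_gap` threshold are all assumptions made up for this illustration.

```python
from typing import List, Tuple

# Horizontal extent (x_left, x_right) of one word's bounding box, in pixels.
Box = Tuple[float, float]


def group_into_columns(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Group boxes into columns, splitting wherever a horizontal gap exceeds min_gap."""
    columns: List[List[Box]] = []
    col_right = None  # right edge of the column currently being built
    for box in sorted(boxes):
        left, right = box
        if col_right is None or left > col_right + min_gap:
            # Gap is wide enough: start a new column.
            columns.append([box])
            col_right = right
        else:
            # Box overlaps or sits close to the current column: merge it in.
            columns[-1].append(box)
            col_right = max(col_right, right)
    return columns


# Two well-separated columns are found as expected.
print(group_into_columns([(0, 40), (5, 35), (100, 140), (102, 150)]))
# Closely spaced columns collapse into one, which is one of the edge cases.
print(group_into_columns([(0, 40), (45, 80)]))
```

Any fixed threshold fails somewhere: too small and wide cells split apart, too large and tightly set numeric columns merge.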
Neat tool! There appear to be two minor issues in the last example. There is an encoding issue with "ø" characters ("Røkke"), and a column split appears to be missing between the closely spaced numbers ("33 300 22 700" vs "33 300,22 700"). A possible, though probably non-trivial, improvement: harmonize formatting within the same column to avoid mixed occurrences of "7800" / "7 800".
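A minimal sketch of the harmonizing idea, assuming a space is the only thousands separator in play (`normalize_number` is a hypothetical helper, not part of the tool). It only touches cells that look like one well-formed grouped number, so an unsplit cell like "33 300 22 700" is left alone rather than silently merged:

```python
import re

# One to three leading digits followed by one or more space-separated
# groups of exactly three digits, e.g. "7 800" or "1 234 567".
_GROUPED = re.compile(r"\d{1,3}(?: \d{3})+")


def normalize_number(cell: str) -> str:
    """Drop thousands-separator spaces from a cell that is a single grouped number."""
    text = cell.strip()
    if _GROUPED.fullmatch(text):
        return text.replace(" ", "")
    # Anything else (plain numbers, text, ambiguous runs) is returned unchanged.
    return cell


for cell in ["7 800", "7800", "Røkke", "33 300 22 700"]:
    print(repr(cell), "->", repr(normalize_number(cell)))
```

Going the other way (inserting separators everywhere) would work just as well; the point is only to pick one form per column.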
This currently returns an error if it doesn't find exactly one table in the image, so it might be able to work with larger images, but probably not if there are multiple distinct blocks of text.