Show HN: Extract Table from Image

w-m · on Sept 28, 2021

I answer questions about Pandas (the Python data analysis framework) on StackOverflow from time to time. It's an exercise in patience, because many people post screenshots of their data instead of a reproducible code example. You have to point about every other newcomer to the documentation on how to write a proper question that one can actually answer.

I'd imagine other areas around StackOverflow (SQL, R?) are fighting similar issues. I've just tried it with a question (sure enough the second newest Pandas tagged question had a table as an image), and your tool produced a nice .csv.

It would be a godsend to have a button on StackOverflow that would replace a user-uploaded image of a table with some Pandas code that constructs the same DataFrame. Currently I would have to download the image, upload it to extract-table.com, download the .csv, load it into Python, and run some code to create the code-based DataFrame.

I'd consider sending people on StackOverflow to your tool if you cut down some of the steps: (1) allowing users to paste in the URL of an image, and (2) producing Pandas code output that can be directly copy/pasted from the site (not having to download a csv).

For illustration: here's what the Pandas code would look like for the first example of extract-table.com:

  import pandas as pd
  df = pd.DataFrame( {'Name': {0: 'David', 1: 'Jessica', 2: 'Warren'}, 'Gender': {0: 'Male', 1: 'Female', 2: 'Male'}, 'Age': {0: 23, 1: 47, 2: 12}} )
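
A minimal sketch of how the CSV-to-code step could be automated, assuming the tool's CSV output is available as text. The function name `csv_to_pandas_code` and its `var_name` parameter are made up for illustration; this is not something extract-table.com provides. It uses only the standard library and emits a line in the same dict-of-dicts shape as the example above.

```python
import csv
import io


def csv_to_pandas_code(csv_text: str, var_name: str = "df") -> str:
    """Return one line of Pandas code that rebuilds the table in csv_text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    def convert(value: str):
        # Keep integers as numbers; everything else stays a string.
        try:
            return int(value)
        except ValueError:
            return value

    # Column name -> {row index -> value}, the shape DataFrame.to_dict() uses.
    data = {
        name: {i: convert(row[col]) for i, row in enumerate(body)}
        for col, name in enumerate(header)
    }
    return f"{var_name} = pd.DataFrame({data!r})"


# Sample CSV matching the first example table (hypothetical file contents).
example = "Name,Gender,Age\nDavid,Male,23\nJessica,Female,47\nWarren,Male,12\n"
print(csv_to_pandas_code(example))
```

The generated line still needs `import pandas as pd` in front of it before it can be pasted into an answer.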

pastacacioepepe · on Sept 28, 2021

Off topic funny story: My highest voted answer on SO is a very basic one about Pandas, from 7 years ago. It's funny that I've only used Pandas for a few weeks, years ago (I would need to relearn it from scratch now), but 90% of my SO score comes from that answer and I still get more points almost daily. In fact I'm in the top 6% of SO mostly thanks to that answer.

belval · on Sept 28, 2021

I'm in the same boat, 95% of my SO points come from an answer that was basically a copy pasted script to fix an obscure VMWare error with Ubuntu. Turns out a lot of people had the same issue that day.

w-m · on Sept 28, 2021

Since all votes have the same weight I guess it makes sense that the answers to the most basic questions or highly common problems will get the most points. Maybe SO should have a button to donate points to an answer that really saved your bacon, a super-upvote if you will. (I know you can attach bounties to questions, but that's not really feasible when you come across something that has already been answered).

But yeah, crowd behavior is fun. I have the feeling I can time when some computer vision courses (or the semester) start, as suddenly there are many upvotes on my basic answer explaining BGR/RGB color space confusion with OpenCV, the computer vision library :)

naberhausj · on Sept 28, 2021

Funny that this is brought up. As an undergraduate in a Data Science class we did analysis on the SO dataset (we processed the whole thing using RStudio running on a big EC2 instance). I found that about 1,000 users who have made fewer than fifty posts have moderator privileges. In that report, I suggested that they should give users quality points (Upvotes / # Page Views) rather than straight reputation points.

franciscop · on Sept 29, 2021

Yes it is, IIRC I've given bounties to answers from long ago just to donate points to an answer that was really good. In fact it's in part exactly for this reason, since you can pick as one of the official reasons:

"Reward existing answer. One or more of the answers is exemplary and worthy of an additional bounty." - https://stackoverflow.blog/2011/09/23/bounty-reasons-and-pos...

w-m · on Sept 29, 2021

Ok it's feasible, let me reword: it's awkward. You have to hunt for the "start a bounty" link in the question, not the answer, and then presumably still have the minimum bounty period of 24 hours, after which you have to come back to award it to the answer you wanted to reward?

unwind · on Sept 28, 2021

People post images of C code too. Best are the ones that post a link to the image on some external image host. Gaaah.

MattGaiser · on Sept 28, 2021

Could do it with a Chrome extension. Add a button to the right click context menu and get the tabular data in the popup.

v3gas · on Sept 29, 2021

Thanks for the feedback! That is a good suggestion, I'll definitely add support for using the image URL.

v3gas · on Oct 2, 2021

I've added support for URLs now! Please try it.

greaterweb · on Sept 28, 2021

Nice work putting together this tool. Have you seen either Spark OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They both do a pretty good job of data extraction from tables as well.

[1] https://www.johnsnowlabs.com/spark-ocr/

[2] https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.ht...

v3gas · on Oct 2, 2021

Thanks! No, I hadn't heard of either - thank you!

MattGaiser · on Sept 28, 2021

Pair this with a snipping tool and all sorts of people in banking would use it for a few hours a day, especially if it could paste to Excel or at least fill the clipboard in a way pastable to Excel.

I used to work for a bank on their innovation team and pitched basically this, but as an intern I had neither the skill nor time to do it. But it was certainly something a bunch of people internally wanted.

v3gas · on Oct 2, 2021

Interesting, thanks!

Do you happen to know how to paste regular UTF-8 text into Excel/Google sheets as multiple cells? If I copy two cells in Sheets, I get a tab character (\t) between the cells. But if I try to paste "hello \t world" into Sheets then it's just dumped into one cell.

v3gas · on Oct 3, 2021

Never mind, the tab character is indeed what's needed to split it into multiple cells.
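
A minimal sketch of building such clipboard text, assuming the extracted table is held as a list of rows. The helper name `rows_to_tsv` is hypothetical. The cells are joined with a real tab character (U+0009) and the rows with newlines; a literal backslash followed by "t" typed into a cell would not split.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one table row per line."""
    buffer = io.StringIO()
    # csv.writer quotes any cell that itself contains a tab or a newline.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


table = [["Name", "Age"], ["David", 23], ["Jessica", 47]]
print(rows_to_tsv(table))
```

Copying the returned string to the clipboard is left out, since that part depends on the platform.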

EveYoung · on Sept 29, 2021

I can only imagine what a pain it would be to get InfoSec approval for such a tool, unless it's doing everything on-device.

MattGaiser · on Sept 29, 2021

Wouldn’t need to be on device necessarily.

At least my bank was comfortable with cloud everything and people using APIs from approved partners. If you can write the report in Google Docs, as long as they were the ones plugging in their API key for the OCR, I imagine it would be fine.

saradhi · on Sept 30, 2021

You should consider extracttable.com

P.s: I run the linked resource.

nanis · on Sept 28, 2021

With this image[1] from this question on SO[2], the output[3] is missing the last row. FWIW, I've had the occasional miraculous-looking results from AWS Textract, but you do need to keep an eye on what's happening.

Update: I just checked a bit more carefully, and this example[4] is also missing the last row.

Also, Danish ø seems problematic on your web page whereas the CSV has the right UTF-8 encoded bytes.

[1]: https://i.stack.imgur.com/y7Zrt.png

[2]: https://stackoverflow.com/q/69363708/100754

[3]: https://results.extract-table.com/8d4818867ad604792819e98808...

[4]: https://results.extract-table.com/254d95722a2c2b1df72fc26b59...

v3gas · on Oct 2, 2021

That's interesting. Thanks for reporting!

eihli · on Sept 28, 2021

Nice. I worked on something similar but far less robust: https://github.com/eihli/image-table-ocr. It fails to find the tables on the example images at extract-table.com, but the code is heavily commented at https://eihli.github.io/image-table-ocr/pdf_table_extraction... so there's high visibility into what's going on and what needs to change to get it to work with images of different sizes/fonts.

BrandiATMuhkuh · on Sept 28, 2021

This is really awesome. I have tried to solve that many times. I got close, with OpenCV and Azure ML. I have even tried AWS Textract (~2 years ago). But this is the best implementation I have seen so far. Congratulations.

I'm not sure what application you are thinking of. But the reason I'm following this problem is UX. Years ago, I worked on a project where anyone could add product prices into a DB. They did that by typing their receipt (line items) into the DB. The major issue was that the UX was horrible.

With an API like yours, this is super simple. One photo. That's all.

Maybe I'll revisit it as a side project.

v3gas · on Oct 2, 2021

Thank you! I have also been kind of obsessed with this problem. I have tried to solve it myself, going from an image to bounding boxes and trying to separate the boxes into columns. But that problem is just fraught with edge cases, so I decided to just use an existing tool.

BillSaysThis · on Sept 28, 2021

Really nice but... wondering how long this will last as a free tool given AWS fees.

whirlwin · on Sept 28, 2021

Nice. Fun fact: The third example table is an ordered list of Norway's richest people (according to net worth, I think)

howmayiannoyyou · on Sept 28, 2021

Nice job. Actually though, what the world really needs is ML that divines the trend and perhaps indices/values from images of charts.

plaidfuji · on Sept 28, 2021

This has been my pet side project for many years. What use case would you apply it to?

howmayiannoyyou · on Sept 30, 2021

Scraping financial content

pveierland · on Sept 28, 2021

Neat tool! There appear to be two minor issues in the last example. There is an encoding issue with "ø" characters ("RÃ¸kke"), and a column split appears to be missing between the closely spaced numbers ("33 300 22 700" vs "33 300,22 700"). A possible, though perhaps non-trivial, improvement: harmonize formatting within the same column to avoid mixed occurrences of "7800" / "7 800".
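
A minimal sketch of both cleanups, assuming the garbled text comes from UTF-8 bytes being decoded as Latin-1 (which is what turns "ø" into "Ã¸") and that spaces inside numbers are thousands separators. The helper names `fix_mojibake` and `normalize_number` are made up for illustration and say nothing about how the tool itself works.

```python
import re


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were decoded as Latin-1; leave other text alone."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already correct): return unchanged.
        return text


def normalize_number(cell: str) -> str:
    """Drop spaces used as thousands separators, e.g. '7 800' -> '7800'."""
    if re.fullmatch(r"\d{1,3}( \d{3})+", cell):
        return cell.replace(" ", "")
    return cell


# "R\u00c3\u00b8kke" is the garbled spelling quoted in the comment above.
print(fix_mojibake("R\u00c3\u00b8kke"))
print(normalize_number("7 800"), normalize_number("7800"))
```

A cell such as "33 300 22 700" does not match the pattern and is left untouched, so two merged columns are not silently fused into one number.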

mzs · on Sept 28, 2021

https://github.com/vegarsti/extract-table

jnsie · on Sept 28, 2021

Really cool. I'm interested to hear your plans for this. Are you planning to offer it as a service, open source it, etc.?

visarga · on Sept 29, 2021

Does it also do table detection in a larger image and header/body classification?

v3gas · on Oct 2, 2021

This currently returns an error if it doesn't find exactly one table in the image, so it might be able to work with larger images, but probably not if there are multiple distinct blocks of text.
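
A minimal sketch of that kind of check, assuming the OCR result is a list of Textract-style blocks where each table is a block with `BlockType` set to `"TABLE"`. The function name `require_single_table` and the error message are hypothetical and not taken from the tool's code.

```python
def require_single_table(blocks):
    """Return the only TABLE block, or raise if there is not exactly one."""
    tables = [block for block in blocks if block.get("BlockType") == "TABLE"]
    if len(tables) != 1:
        raise ValueError(f"expected exactly one table, found {len(tables)}")
    return tables[0]


# Hypothetical, heavily trimmed blocks for illustration.
sample = [
    {"Id": "t1", "BlockType": "TABLE"},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
]
print(require_single_table(sample)["Id"])
```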

ducktective · on Sept 29, 2021

Awesome project!

Can AWS Textract be used directly with curl to return text strings of an uploaded image?

v3gas · on Oct 2, 2021

Thanks! No, not that I know of. It looks like with the AWS CLI the image needs to be in an S3 bucket, based on this document: https://docs.aws.amazon.com/cli/latest/reference/textract/an...

ducktective · on Oct 2, 2021

hmm...weird. They could have provided a rate-limited API endpoint as a service...

z3t4 · on Sept 28, 2021

Should make it into a browser plugin, so annoying when web sites have tables in images.

basmango · on Sept 29, 2021

Does it use Textract directly? Or are you doing some preprocessing?

v3gas · on Sept 29, 2021

Directly, no preprocessing! The postprocessing is concatenating all words that belong to the same cell.
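
A minimal sketch of that postprocessing step, assuming Textract-style blocks in which each `CELL` block carries 1-based `RowIndex` and `ColumnIndex` fields and lists its `WORD` blocks under a `CHILD` relationship. The function name `table_from_blocks` and the sample data are made up for illustration; this is not the tool's actual code.

```python
def table_from_blocks(blocks):
    """Build a list of rows from Textract-style CELL and WORD blocks."""
    by_id = {block["Id"]: block for block in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
        # Concatenate all words that belong to the same cell.
        cells[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Cells without any words (or missing cells) become empty strings.
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def _cell(cell_id, row, col, word_ids):
    # Helper for building hypothetical sample data only.
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row,
            "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


def _word(word_id, text):
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


sample_blocks = [
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3", "w4"]), _cell("c4", 2, 2, ["w5"]),
    _word("w1", "Name"), _word("w2", "Age"),
    _word("w3", "Jessica"), _word("w4", "Smith"), _word("w5", "47"),
]
print(table_from_blocks(sample_blocks))
```

The resulting list of rows can then be written out as CSV or tab-separated text.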

tuberelay · on Sept 29, 2021

UI Path does this in a nice way