
Is there a script example of de-dupe of a thousand photos, including ones with trimming, and re-sizing?

I believe ImageMagick includes some of the hash-based methods, which often do a DCT reduction and then a 2D similarity analysis, emitting a hash code you can compute a Hamming distance on. But that's a long way from practical use: actually finding, stacking, renaming, and identifying the one golden copy to keep among all the dupes.
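The hash-and-compare idea can be sketched without ImageMagick at all. Below is a minimal, dependency-free toy using an average hash rather than a DCT hash (a real pipeline would load pixels with Pillow or ImageMagick and use pHash/dHash); the 4x4 "images" are hypothetical stand-ins for downscaled grayscale thumbnails:

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values (e.g. an 8x8 downscale).
    Returns an integer bitmask: 1 where pixel >= mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two slightly different 4x4 "images": near-duplicates.
img_a = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [200, 200, 10, 10],
         [200, 200, 10, 10]]
img_b = [[12, 9, 198, 201],
         [11, 10, 202, 199],
         [199, 201, 12, 9],
         [201, 198, 10, 11]]

ha, hb = average_hash(img_a), average_hash(img_b)
print(hamming(ha, hb))  # small distance => likely duplicates
```

A small Hamming distance (here 0) flags the pair as near-duplicates; the threshold you pick trades false positives against missed resizes and crops.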



Deduplication is a hard problem: what constitutes a "dupe" is somewhat arguable.

My first approach just used metadata, but that doesn't aggregate files stripped of their original metadata, like what you get from a Google Takeout, so I had to add image hashing as well. I actually generate three mean hashes in L*a*b color space for PhotoStructure (many hashes ignore color). I've also found that metadata needs to be normalized, including captured-at time and even exposure metadata. It's a lot of whack-a-mole, especially as new cameras and image formats are released every year.

I described more about what I've written for PhotoStructure (which does deduplication for both videos and images) here: https://photostructure.com/faq/what-do-you-mean-by-deduplica... -- it might help you avoid some of the pitfalls I've had to overcome.
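To make the stacking step concrete: this is a hedged sketch (not PhotoStructure's actual code) of grouping files whose perceptual hashes fall within a Hamming threshold, using union-find. The filenames, hash values, and threshold are illustrative assumptions; the O(n²) comparison is fine for a few thousand photos:

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def stack_duplicates(hashes, threshold=5):
    """hashes: dict of filename -> int hash.
    Returns a list of stacks (lists of filenames), one per dupe group."""
    names = list(hashes)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    # Merge every pair whose hashes are close enough.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= threshold:
                parent[find(a)] = find(b)

    stacks = {}
    for n in names:
        stacks.setdefault(find(n), []).append(n)
    return list(stacks.values())

groups = stack_duplicates({
    "orig.jpg":    0b1111000011110000,
    "resized.jpg": 0b1111000011110001,  # 1 bit off: same stack
    "other.jpg":   0b0000111100001111,  # far away: its own stack
})
print(groups)
```

Picking the "golden one" within each stack is then a separate ranking step, e.g. by resolution, file size, or metadata completeness.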


Thank you. It looks interesting. I was heading to much the same place with the order of precedence for matches, silently wondering whether there is a class of bad edit that makes the post-modified file bigger, not smaller. Seems unlikely but not impossible.

A lot of my dupes are Google dupes, but spread across about 4 cameras with a mixture of original/compressed sizes.

A lot of my local copies had jhead run on them to "fix" time, so they have modified EXIF data.

A small number have me playing with IPTC to try to auto-name things for tag matching.

Your program looks to be the one which understands the corner cases.


Not what you asked for, but related to the method you described:

"Image Retrieval Based on Using Hamming Distance"

https://www.sciencedirect.com/science/article/pii/S187705091...

EDIT: Here's a Python package that does image deduplication. Not ImageMagick, and no idea on its performance though.

https://idealo.github.io/imagededup/


Looks interesting


If you are open to alternatives, there are tools like Gemini on macOS (https://macpaw.com/gemini) that scan for duplicates and can find similar pictures. I would assume cropped photos are found unless they are VERY cropped.


From the people who did CleanMyMac. I bought that; I may buy this one too.



