Techniques for matching similar/identical products from different stores?
1 point by victorhooi on Nov 4, 2013
I’m trying to write a simple program to compare prices for products from different suppliers.

Different suppliers may call the same product different things.

For example, the following three strings refer to the same product:

A2 Full Cream Milk Bottle 2l
A2 Milk Full Cream 2L
A2 Full Cream Milk 2L

Or the following two strings are the same product:

Ambi Pur Air Freshener Car Voyage 8mL. Fresh Vanilla Flower fragrance. - 1 each
Ambi Pur Air Freshener Voyage Primary 8ml

Furthermore, some products are not the same but are similar (for example, "Full Cream 2L Milk" may encompass various similar products).

What are currently recommended techniques for matching product strings like this?

From my Googling, I found:

1. Some people recommend using Bayesian filtering techniques.

2. Some recommend doing feature extraction on all the product strings. So you might extract things like brand (e.g. “A2”), product (“Milk”) and capacity (“2L”) from each product, then compute distance vectors between products and use something like a binary classifier to match them (SVM was mentioned). However, I’m not sure how to achieve this without a whole bunch of rules or regexes. I’m assuming there are probably smarter unsupervised learning methods for attacking this problem? Price could probably be another “feature” to include in the distance vector as well.

3. Others recommend using string similarity algorithms, such as Levenshtein distance or Jaro-Winkler distance.
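To make options 2 and 3 concrete, here is a minimal sketch in Python that combines one extracted feature (capacity, via a regex) with a Levenshtein-based similarity over order-insensitive tokens. Everything in it is an assumption made for illustration: the helper names, the tiny unit table, and the 0.7 threshold are all hypothetical, and a real matcher would need more features (brand, product type, price) and a tuned threshold.

```python
import re

# Hypothetical unit table and pattern; only millilitres and litres are handled.
UNIT_TO_ML = {"ml": 1.0, "l": 1000.0}
VOLUME_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(ml|l)\b")


def extract_volume_ml(name):
    """Return the first volume found in the name, in millilitres, or None."""
    match = VOLUME_RE.search(name.lower())
    if match is None:
        return None
    amount, unit = match.groups()
    return float(amount) * UNIT_TO_ML[unit]


def normalize(name):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(na, nb) / longest


def same_product(a, b, threshold=0.7):
    """Match if the volumes agree and the names are close (threshold is a guess)."""
    if extract_volume_ml(a) != extract_volume_ml(b):
        return False
    return similarity(a, b) >= threshold
```

With these assumptions, the three A2 milk strings match each other. The two Ambi Pur strings do not: the first carries a long description, so its length-normalized edit similarity falls below 0.7. That suggests a token-overlap measure or a learned classifier over the extracted features may suit such cases better than raw edit distance.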

Would you use one of the above techniques, or would you use a different technique?

Also, does anybody know of any example code, or even libraries for this sort of problem? I could

(For example, I saw that some people were having performance problems calculating the Jaro-Winkler distance on large datasets. I was hoping there might be a distributed implementation of the algorithm (e.g. with Mahout), but wasn’t able to find anything concrete.)
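One common way to reduce the all-pairs cost before reaching for a distributed implementation is "blocking": only compare products that share a cheap key. The sketch below is an illustration and not a Mahout or library API. The function name is hypothetical, and using the first token as a stand-in for the brand is an assumption that will not hold for every catalogue.

```python
import re
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names):
    """Return index pairs of names that share a blocking key.

    The key here is the first lowercase token (assumed to be the brand).
    """
    blocks = defaultdict(set)
    for i, name in enumerate(names):
        tokens = re.findall(r"[a-z0-9]+", name.lower())
        if tokens:
            blocks[tokens[0]].add(i)

    pairs = set()
    for ids in blocks.values():
        # Only pairs inside the same block are compared in detail later.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

On the five example strings above, this yields 4 candidate pairs instead of the 10 an all-pairs comparison would need. The more expensive similarity measure (Levenshtein, Jaro-Winkler) is then run only on those candidates.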



