This is interesting. I used to work with SVM document categorization, and this was a huge problem.
This was years ago, so forgive me for never having an elegant solution, but I never thought to use something like Markov chains (http://en.wikipedia.org/wiki/Markov_chain) to detect duplicates or filter out layout data. I ended up doing custom layout detection and trying to ignore headers, navigational menus, and footers. It was a constant battle, and very brittle and hackish. It worked OK for my purposes, but I still remember checking a whole bunch of new entries every time the crawler pulled down new stories, to see if I should add some obvious ignores to the filters.
Interesting, but I'm not sure it's any better than bigram or trigram frequency. The real difficulty, of course, is the k-nearest-neighbors problem when you go searching for the duplicate.
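For concreteness, here's roughly what I mean by trigram frequency, as a sketch (character trigrams and cosine similarity are just one common choice on my part, not anyone's actual system; the brute-force scan at the end is exactly the k-nearest-neighbors bottleneck):

    # Sketch: character-trigram frequency vectors compared by cosine
    # similarity. Fine for a pairwise check; the brute-force scan in
    # nearest() is the part that hurts once the corpus gets big.
    from collections import Counter
    import math

    def trigrams(text):
        text = " ".join(text.lower().split())  # normalize whitespace
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(a, b):
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def nearest(query, corpus):
        # O(n) comparisons per lookup -- the hard part at scale
        q = trigrams(query)
        return max(corpus, key=lambda doc: cosine(q, trigrams(doc)))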
It is more efficient: the distance between duplicates is much shorter in their case, because they effectively compare only the important part of the page. That lets you use simpler algorithms that are a lot faster. It's still not foolproof, though: some sites have an about header with lots of text, and some blogs put a blurb before and after a quoted post.
You can think of this as just changing the granularity of your tokens. The set of possible tokens becomes huge while the number of tokens per page shrinks, so of course looking for approximate matches gets much easier.
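A toy illustration of the granularity point (made-up snippets and plain Jaccard similarity, nothing fancier): unrelated pages still share plenty of common words, but almost never share whole sentences, so coarser tokens widen the gap between duplicates and non-duplicates.

    # Same Jaccard comparison at two token granularities: words vs.
    # crude sentences. Coarser tokens -> a huge token space and far
    # fewer tokens per page, so near-duplicates stand out sharply.
    import re

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def words(text):
        return re.findall(r"\w+", text.lower())

    def sentences(text):
        return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]

    dup_a = "The rocket launched successfully this morning. Crowds cheered at the pad."
    dup_b = "The rocket launched successfully this morning. Crowds cheered at the pad. Photo by staff."
    other = "The committee approved the budget this morning. Members cheered at the vote."

    for name, tok in (("words", words), ("sentences", sentences)):
        print(name,
              round(jaccard(tok(dup_a), tok(dup_b)), 2),   # duplicate pair
              round(jaccard(tok(dup_a), tok(other)), 2))   # unrelated pair

At word granularity the unrelated pair still scores fairly high; at sentence granularity it drops to zero while the duplicate pair stays high, which is the separation you want.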