Building a recommendation engine, foursquare style (foursquare.com)
84 points by jcsalterego on March 22, 2011 | hide | past | favorite | 14 comments


By setting some constraints on which scores were significant, it was possible to build the resulting similarity matrix in less than an hour on a 40-machine cluster.
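The "significance constraint" idea in that quote can be sketched in a few lines: compute cosine similarity over the binary user-venue co-occurrence matrix, but only keep pairs that co-occur at least N times. Everything below (data, venue names, threshold) is made up for illustration, not Foursquare's actual pipeline:

```python
from collections import defaultdict
from math import sqrt

# Hypothetical check-in data: user -> set of venues they checked into.
checkins = {
    "alice": {"per_se", "shake_shack", "deli_a"},
    "bob":   {"shake_shack", "deli_a", "deli_b"},
    "carol": {"per_se", "shake_shack"},
    "dave":  {"deli_a", "deli_b"},
}

MIN_COOCCURRENCE = 2  # the "significance" constraint: drop rarely co-visited pairs

def item_similarities(checkins, min_cooc):
    # Count how many users visited each venue and each venue pair.
    venue_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for venues in checkins.values():
        for v in venues:
            venue_counts[v] += 1
        vs = sorted(venues)
        for i in range(len(vs)):
            for j in range(i + 1, len(vs)):
                pair_counts[(vs[i], vs[j])] += 1
    # Cosine similarity over the binary user-venue matrix, pruned by threshold.
    sims = {}
    for (a, b), c in pair_counts.items():
        if c >= min_cooc:
            sims[(a, b)] = c / sqrt(venue_counts[a] * venue_counts[b])
    return sims

sims = item_similarities(checkins, MIN_COOCCURRENCE)
```

Pruning pairs below the threshold is what keeps the pair count (and the shuffle in a map-reduce version) tractable.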

40 machines!?! Wow. We compute our recommendations at Grooveshark in a couple hours on a crappy 2-node Hadoop cluster. hah. We have about as many songs as they have venues, so I have no idea why they need so many nodes. Or maybe we're just extraordinarily harsh in our constraints. I loved this writeup though. It's great to see how other companies tackle these difficult problems.

We also have that "cold start" problem, and another one which we call the "coldplay" problem. For them it would be something like McDonald's, I guess.


For the "coldplay" problem you may want to use an inverse band frequency weighting (much like IDF for text). Bands that are extremely popular won't give you any useful signals. As far as the cold start... yep, that's a beast, and using social data may help. For example, are you using stuff from Facebook/Twitter once you have users sign in? You could crawl band/artist names out of their data.
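The inverse-frequency weighting suggested above, sketched with made-up listening data (the names and the log weighting are illustrative, not Grooveshark's actual scheme):

```python
from math import log

# Hypothetical listening data: user -> set of artists in their library.
libraries = {
    "u1": {"coldplay", "radiohead"},
    "u2": {"coldplay", "sigur_ros"},
    "u3": {"coldplay", "radiohead", "low"},
    "u4": {"coldplay"},
}

def idf_weights(libraries):
    n_users = len(libraries)
    counts = {}
    for artists in libraries.values():
        for a in artists:
            counts[a] = counts.get(a, 0) + 1
    # Artists everyone listens to get weight ~0; rarer artists carry more signal.
    return {a: log(n_users / c) for a, c in counts.items()}

w = idf_weights(libraries)
```

Multiplying each co-occurrence by these weights makes a shared obscure artist count for much more than a shared Coldplay.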

This gives me an excuse to try grooveshark. I've been curious for a while. :)

update: I didn't see any recommendations based on what I've liked on Facebook or Twitter. This would be a good "warm" start. Note: I'm working on stuff like this, but don't mind sharing this simple idea. :) Quora does a great job of this, pulling in data from your Twitter network.


I'm curious: do you see any speedup from that Hadoop cluster? I always assume that on less than like 5 nodes it is faster to just run it all on a single machine unless you are highly disk bound.

Was it written that way in order to support future growth?


Delis are our Coldplay.


Very cool.

10M venues is a lot, and a really hard-to-believe number.

For example, Yelp lists ~13k restaurants in NYC (3.5k in Chicago, 4.5k in SF)

In the top 10 categories for venues, Yelp NYC has 40k VENUES (restaurants, shopping, food, health, spas, nightlife, and some more).

Either 4square has 250 cities like NYC, a fundamentally different definition of what a venue is, or Yelp is SERIOUSLY missing a lot of places (like on the order of getting only 10% of the venues in each city).

People have written about the data sparsity problem in the Yelp dataset as compared to Netflix (http://www.stanford.edu/class/cs229/proj2009/Fennell.pdf) for using CF techniques, I'm very interested to hear what people will think about the 4square implementation.

I'm skeptical, but I really hope it works, because I have very average food far too often on recommendations from friends with dissimilar palates...


Foursquare has way more places in their database, based on my usage of both Yelp's and Foursquare's iPhone apps.

Yelp is based on reviews of businesses, whereas Foursquare is based on checking into places. Therefore, Foursquare's domain is much larger.

I'm not sure if Foursquare or Yelp add places to their database themselves or if it's 100% user-driven. In Yelp's case, it takes more effort to add a place because you need to (or at least you feel like you should) also write a review. Whereas for Foursquare, you can just add your home. Or "RainApocolypse 2011," of which I am two days away from becoming the mayor.


Reduced friction for adding places is a plausible reason for why 4square might have more venues, but what I was getting at is the order of magnitude of the difference.

10 MILLION places is a ton. Like I said above, that's like 250 NYCs. Even given 4square's reduced friction to add venues, the fact that they've probably launched in more cities, and the inclusion of arbitrary venues ("Hey I checked in at the tree that's 10 paces west of my house!"), I don't think 4square is lying. Now I'm just wondering how useful those venues are. (I am also pretending that Yelp has better than 10% coverage of venues that users care about.) Incidentally, either way, you've got to think that 10mm venues exacerbates the data sparsity problem as well.

For reference, the venue count of 3 large american cities:

- NYC: 47,228
- SF: 38,656
- Chi: 19,079

I'd be surprised if Yelp had even 1.5mm venues. Would love if someone could corroborate/dispute this. While there is certainly a disparity between Foursquare and Yelp, I'd imagine that the "useful" venues (those that people want recommendations for) aren't your home or "RainApocolypse 2011." Adding those into the dataset actually makes 4square's job _harder_ with respect to teasing meaningful data out of it.

If my phone ever recommends your house though, I just may come a knockin'.


Yelp adds places to their database, and they have had mixed success with Mechanical Turk.

http://engineeringblog.yelp.com/2011/02/towards-building-a-h...


Regarding the issue of user-based feedback ranking (the Per Se vs. Shake Shack problem), it may not be such a negative thing to have results skewed due to "unequal" ratings. Culinary ratings often reflect factors that civilian diners may not necessarily consider important or even relevant. Looking to Foursquare to show search results based on user approval, which is often submitted as a knee-jerk reaction right after dining, may be the best thing for a prospective diner. After all, Per Se and Shake Shack may both be awarded 5 stars by the same diner, but unless this diner is making hundreds of thousands a year (or is Thomas Keller), they would likely recommend Shake Shack to their Foursquare friends as the spot to eat. To me, hedging this data would end up producing results along the lines of more traditional culinary recommendation systems, and may devalue the Foursquare recommendation engine.


Anyone else used Mahout? What are your experiences with it?


We tried using it a while ago when it was first getting started. It was a pain to use back then, but afaik it has advanced quite a bit, and after seeing this post I'll be taking another stab at it. If you want an out-of-the-box recommender, Mahout is good, and it provides many other machine learning algorithms that can be run at scale.

You definitely need to have some technical knowledge though. One thing that I don't like about things like this is that if there is a bug and you don't understand the software (or the math) you won't be able to troubleshoot.

This is why I tend to write my own for the specific task at hand.
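For what it's worth, a "roll your own" recommender of the kind described here can be tiny. A toy user-based collaborative filter with invented ratings data (a sketch, not anyone's production code):

```python
from math import sqrt

# Made-up ratings: user -> {item: rating}.
ratings = {
    "ann": {"song_a": 5, "song_b": 3, "song_c": 4},
    "ben": {"song_a": 4, "song_b": 3, "song_d": 5},
    "cat": {"song_b": 2, "song_c": 5},
}

def cosine(u, v):
    # Cosine similarity between two sparse rating vectors (dicts).
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(user, ratings, top_n=3):
    scores, sim_totals = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        s = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + s * r
                sim_totals[item] = sim_totals.get(item, 0.0) + s
    # Normalize by total similarity so scores stay on the rating scale.
    ranked = ((scores[i] / sim_totals[i], i) for i in scores if sim_totals[i] > 0)
    return sorted(ranked, reverse=True)[:top_n]

recs = recommend("ann", ratings)
```

The upside of something this small is exactly the troubleshooting point above: when a recommendation looks wrong, every number is inspectable.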


Agreed, it has a very steep learning curve. Once you understand it (and understand how to think in map-reduce), it's pretty awesome -- and easy to extend. Rolling your own definitely makes the learning curve easier, but you can miss out on some of the efficient M-R algorithms built into it.


I am currently using Mahout at work for k-means clustering and singular value decomposition. It seems to be working well.
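The k-means step can be illustrated without Mahout. A toy pure-Python Lloyd's algorithm (points and initialization are made up; Mahout's value is running this distributed over Hadoop on data that doesn't fit one machine):

```python
# Toy 2-D points forming two obvious clusters.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]

def kmeans(points, centers, iters=20):
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
                   if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, [points[0], points[-1]])
```

In the map-reduce version, the assignment step is the map (emit nearest-center id per point) and the mean computation is the reduce.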


My team at Amazon is using it. There's also been some talks given on it at Amazon.



