bradfordcross: good overview, with one minor point. I wouldn't think of Pandora's problem as one mitigated with semi-supervised learning. That's usually applied to a situation where you have a small number of labeled points and a whole mass of unlabeled data; often the task is then to determine low density regions to define boundaries of natural clusters.
In Pandora's case, they have TONS of labeled data. All you'd need to do would be to run a decision tree (or a categorical-variable version of PCA) to (1) determine that many of those features are strongly statistically dependent and (2) reduce the number that need to be populated for any given song.
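A minimal sketch of that first step in Python with scikit-learn, assuming a hypothetical matrix of already-labeled categorical genome features (the data below is a synthetic placeholder, not anything Pandora has published): any feature a shallow tree can predict from the others is strongly dependent on them and is a candidate to drop from the manual workflow.

```python
# Hypothetical sketch: test how predictable each genome feature is from the
# others. A feature a shallow decision tree can recover from the rest is
# statistically dependent on them and need not be hand-labeled per song.
# The feature matrix is synthetic placeholder data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(5000, 20))   # 5000 songs x 20 categorical features
X[:, 1] = X[:, 0]                         # plant one dependency for illustration

for j in range(X.shape[1]):
    rest = np.delete(X, j, axis=1)        # all features except feature j
    score = cross_val_score(
        DecisionTreeClassifier(max_depth=5), rest, X[:, j], cv=3
    ).mean()
    if score > 0.9:                       # arbitrary "inferable" threshold
        print(f"feature {j}: {score:.0%} predictable from the others")
```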
You could probably also do supervised learning on their massive sound database to infer lots of these features automatically (e.g., I bet you could pick out male vs. female vocalists without having someone listen to the track).
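As a hedged illustration of that idea (the file paths, labels, and feature choice are all assumptions, not a description of Pandora's pipeline): summarize each track with mean MFCCs via librosa and fit a plain classifier for a single attribute like vocal gender.

```python
# Sketch of inferring one genome attribute (male vs. female vocalist) straight
# from audio. Paths and labels are placeholders; in practice the training list
# would come from the songs already hand-labeled.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mfcc_summary(path):
    # Mean MFCCs are a crude but standard summary of timbre; enough for a sketch.
    y, sr = librosa.load(path, duration=30.0)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

labeled = [("songs/track001.mp3", 0),     # 0 = male vocalist (placeholder)
           ("songs/track002.mp3", 1)]     # 1 = female vocalist (placeholder)

X = np.array([mfcc_summary(p) for p, _ in labeled])
y = np.array([lab for _, lab in labeled])
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Any unheard track can now be scored without a human listening to it.
print(clf.predict([mfcc_summary("songs/unlabeled.mp3")]))
```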
Combining these (supervised learning on the audio + a decision tree on the historical labels) would likely vastly increase their per-song labeling throughput. Only "global" features like song genre would have to be input by humans.
This is the same point I am making. If they are still manually curating each song at 30 minutes apiece, they could just stop, use the labels they already have, and infer the rest through semi-supervised learning, or by learning the target labels from the deconstructed tracks.
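A minimal sketch of that shortcut using scikit-learn's SelfTrainingClassifier (the features and the "unlabeled backlog" split below are synthetic assumptions): fit on the hand-curated songs, then let the model pseudo-label everything else.

```python
# Sketch: stop manual curation, keep the existing labels, and let a
# self-training (semi-supervised) model infer the backlog. Data is synthetic;
# in practice X would be audio-derived features for the whole catalog.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))          # per-song feature vectors
y_true = (X[:, 0] > 0).astype(int)         # one genome attribute (placeholder)

y_partial = y_true.copy()
y_partial[2_000:] = -1                     # -1 marks the never-curated songs

model = SelfTrainingClassifier(RandomForestClassifier(n_estimators=100))
model.fit(X, y_partial)

inferred = model.predict(X[2_000:])        # labels no human ever typed in
print("agreement with ground truth:", (inferred == y_true[2_000:]).mean())
```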
It is a good approach and one I share. NLP has done this for years: once you have a tagged corpus, it is much easier to work with your data. I also like your idea of using Mechanical Turk to gain traction on the manual tagging; in any case, that is probably what superintelligent computers would do over that 40-year span - use humans to do the tagging - before carrying out the balance of the calculations themselves! :)
One area the article did not touch on is how to introduce controls to detect 'rigging' of the system, i.e., similar to how Google polices link farms. This, in my opinion, is where the problem is turned inside out.