How the new recommendation system works

13 replies [Last post]
benanne
benanne's picture
User offline. Last seen 14 min 59 sec ago. Offline
Joined: 22/08/2009
Posts:

As we've announced today, we have introduced a band recommendation system and used the recommendation algorithm to update the 'map of djent'. If you're curious about how this works, then read on!

Similarity

To build a recommendation system, we need to be able to compare bands. We operate on the assumption that similar bands are typically good recommendations. There are many ways to determine how similar two bands are: we could look at their respective genres, the tags they receive on sites like last.fm and musicbrainz, or we could even analyse their music!

At got-djent.com, the only information we have available is which bands each user is a fan of. It turns out that this information can also be used for this purpose. This approach is called "item-based collaborative filtering". It was popularised by Amazon, which also uses it to compute recommendations (although their approach is of course much more sophisticated).

Let's say the user is viewing a band page, for example Uneven Structure's page. Let's call this band V (from 'viewed'). Similarly, we'll use R to indicate the bands considered for recommendation (from 'recommended'). For example, let's evaluate whether Meshuggah, worC and Veritas are good recommendations.

How can we determine which bands to recommend? For starters, we will compute a similarity metric between Uneven Structure (V) and each other band (R) in the database. This is a number, typically between 0% and 100%, that indicates how similar both bands are. We will then rank all the other bands according to this metric, and present the top scorers from this ranking as recommendations.

This raises the question: how do we compute the similarity metric? To do this, we will count the number of fans that each band has (denoted by #V and #R), and the number of people that are fans of both the viewed band (V) and each other band considered for recommendation (R). We will call these numbers #(R∩V); '∩' is the 'set intersection' symbol, which indicates that these people are in the intersection of the fanbases of both bands. We can visualise this using a Venn diagram:

The grey set represents the fanbase of band V. The red set represents fanbase of band R. Their intersection contains the people that are a fan of both bands.

Popularity bias

A naive way to recommend bands would be to look at the percentage of fans of V that are also a fan of R. It's not unreasonable to assume that, if a large portion of V's fans likes R, maybe you might enjoy R as well. We'll call this percentage R|V and we can compute it as: #(R∩V) / #V.

Unfortunately, this approach is inherently flawed. Let's look at what happens when we consider Meshuggah for recommendation. Meshuggah has a lot more fans than Uneven Structure:

As you can see, this means R is much bigger than V. But also, the percentage R|V will be a lot bigger, because it's simply more likely that a user is a fan of Meshuggah, regardless of whether they are a fan of Uneven Structure or not.

If we consider worC for recommendation, on the other hand, we see the opposite effect: since a user is inherently less likely to be a fan of worC, the overlap between fanbases will also become smaller, so worC will never be recommended.

This problem is called popularity bias. A recommender system based on this percentage will always recommend Meshuggah, Periphery and other big bands that everybody is already familiar with. Useless!

Turning things around

How can we solve this problem? We could look at it the other way around: instead of looking at the percentage of Uneven Structure fans that also like Meshuggah, we could look at the percentage of Meshuggah fans that also like Uneven Structure! This is not the same thing. We will call this percentage V|R and we can compute it as: #(R∩V) / #R.

Using this metric, we will no longer recommend Meshuggah for Uneven Structure fans, seeing as the percentage of Meshuggah fans that also like Uneven Structure is not notably high. We will recommend worC however: practically all worC fans are also Uneven Structure fans, so V|R will be very high in this case.

Problem solved?

Great! No more popularity bias! Unfortunately, this approach creates a new problem. Let's consider another, smaller band: Veritas has only a single fan on got-djent.com, at the time of writing. We now have the opposite situation:

Is Veritas a good recommendation for Uneven Structure fans? Let's look at V|R. If this single Veritas fan also happens to be a fan of Uneven Structure, then 100% of Veritas's fanbase is also a fan of Uneven Structure. In other words, Veritas is the perfect recommendation!

Except it isn't. By completely removing the popularity bias problem, we have created a situation where the most 'obscure' bands will be recommended, if their few fans happen to be a fan of band V as well.

We have implemented two solutions to this problem. One is to simply disregard bands with less than 20 fans. If there are not enough fans, it is impossible to compute recommendations. Too bad, but that's the way it is.

Best of both worlds

However, this does not solve the problem completely. It turns out to be a good idea not to eliminate popularity bias entirely. Instead of looking only at the percentage V|R, we will use a combination of both R|V and V|R. We can combine them by taking the so-called harmonic mean.

The harmonic mean has an interesting property that makes it useful for this application: the harmonic mean of two numbers is only large if both numbers are large. If either one is small, the harmonic mean is also small. This means that we will require both R|V and V|R to be large for a good recommendation.

To control the amount of popularity bias we retain, we can use the weighted harmonic mean. We decided to put three times more emphasis on V|R than on R|V, which makes for interesting recommendations that aren't too obscure either.

For those that are familiar with classification problems, there is an interesting parallel to be drawn. We could consider this a classification problem, where we try to predict whether the user will like the recommended band R, by using whether or not they like band V as an indicator. In this case, R|V and V|R correspond with the precision and the recall of the indicator, respectively. The weighted harmonic mean of both is the F-measure with beta = 1/3.

Summary

To summarise, got-djent.com's new band recommendation system works by computing the percentages of fans of each band that are also fan of another given band, and combining those into a similarity score, in such a way that emphasis is put on discovering new bands that the user may not have heard yet. The recommendations are then made by selecting the bands with the highest similarity scores.

If you are interested in reading more about this subject, some keywords to google are: collaborative filtering, one-class collaborative filtering, item-based collaborative filtering, recommender systems, recommendation engines, popularity bias.

blckravn01
blckravn01's picture
User offline. Last seen 1 day 16 hours ago. Offline
Joined: 21/12/2009
Posts:

I don't understand it but I approve.

rao
rao's picture
User offline. Last seen 7 weeks 4 days ago. Offline
Joined: 16/12/2010
Posts:

Absolutely love it!

StervendeWens
StervendeWens's picture
User offline. Last seen 17 hours 51 min ago. Offline
Joined: 05/09/2009
Posts:

nerdy! love it :3

mysteriousdsf
mysteriousdsf's picture
User offline. Last seen 2 weeks 17 hours ago. Offline
Joined: 01/03/2011
Posts:

very nice concepton to do that. I would really like to express my gratitude for how much you devote for the djent scene. btw never heard of "Veritas", but now I`m gonna check em out ... and become their fan just to get the post outdated, lol ... `

Meowzer
Meowzer's picture
User offline. Last seen 12 hours 29 min ago. Offline
Joined: 13/02/2011
Posts:

For once statistics comes in handy! I'm starting to actually not regret the tuition I paid for that probability theory class...

benanne
benanne's picture
User offline. Last seen 15 min ago. Offline
Joined: 22/08/2009
Posts:

Haha, indeed! While I was working out the theory I approached it probabilistically, hence the notations V|R and R|V, which actually represent conditional probabilities Smile I'm calling the metric weighted harmonic mean of conditionals for that reason. And also because it sounds cool.

r4gw33d
r4gw33d's picture
User offline. Last seen 15 hours 8 min ago. Offline
Joined: 29/08/2010
Posts:

tl;dr Laughing out loud

Opusvoid
Opusvoid's picture
User offline. Last seen 1 week 4 days ago. Offline
Joined: 01/12/2010
Posts:

It wasn't the best way to do it, but for the amount of work and how quickly it was in going to take it was the best course of action. I have a pretty extensive recommendation list and I could never do something like this because it leads down to many paths of choice. Well done.

HOKENSTYFE
HOKENSTYFE's picture
User offline. Last seen 46 min 46 sec ago. Offline
Joined: 01/09/2011
Posts:

It's the maybe computation that get's problematic. It's fan generated, by attention, through what? The components that generate likes, could still limit unheard of bands, even more, by over-whelmingly heavily liked bands with just as heavily liked bands. Though you did state the lower liked band would not be sided to the heavier liked band, the heavier band is still preferenced. Regionalize-site/specific/genre-name/change, additions to further pin-point cohesion.

I did that on purpose! Just to show how it could be picked apart! A novel idea to get more, lesser known bands attention. Thall of a job!

benanne
benanne's picture
User offline. Last seen 15 min ago. Offline
Joined: 22/08/2009
Posts:

I didn't get any of that, sorry. I always have that with your posts Tongue