## Recommender System

This tutorial shows how to build a recommender system using Infer.NET. In particular, it implements the design developed by David Stern, Ralf Herbrich, and Thore Graepel, described in their 2009 paper *Matchbox: Large Scale Online Bayesian Recommendations*.

## The Stern-Herbrich-Graepel model

The problem being solved in this tutorial is the prediction of the ratings that users are likely to assign to items. In this model, each user and each item is characterised by a latent trait vector, and the affinity of a user to an item is derived from the inner product of the two trait vectors together with per-user and per-item biases. Because of the possibly large number of users and items, as well as the high sparsity of the ratings, we want to use an algorithm that scales linearly with the number of observed ratings.
The resulting affinity is a floating-point number which needs to be converted into a rating value. This is achieved by using thresholds. The number of threshold levels is one less than the number of possible star ratings. Ratings are represented by a bool vector whose dimensionality is the number of threshold levels; element i of the vector indicates whether the affinity exceeds threshold i.

## Model implementation

We'll implement the model in several steps: fixing the array sizes and ranges, defining the latent variables and their priors, connecting the two, and building the observation loop.

We start by fixing the sizes of the arrays used in the model and defining the ranges over them. We will consider 50 users, 10 items, and a 3-star rating scale (i.e. 2 threshold levels). We'll only use 2-dimensional trait vectors, but the reader should be aware that a larger number is usually required in practice. The number of traits should be chosen empirically based on the training data and the available computational resources. As to the size of the training data, we'll generate 100 observations.
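To make the rating encoding concrete, here is a small sketch (plain Python rather than the tutorial's C#; the function name and signature are illustrative assumptions):

```python
# A rating on an n-star scale is encoded as a bool vector with n - 1
# elements: element i is True iff the rating exceeds threshold level i.
def encode_rating(stars, num_levels):
    # stars is 1-based (1 .. num_levels + 1), matching a star-rating scale
    return [stars > level + 1 for level in range(num_levels)]

# A 3-star scale has 2 threshold levels:
print(encode_rating(1, 2))  # [False, False]
print(encode_rating(2, 2))  # [True, False]
print(encode_rating(3, 2))  # [True, True]
```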
We then define the latent arrays in the model. Their elements are all of type double. The user and item traits are jagged variable arrays holding one trait vector per user and per item respectively. The user and item biases are simple 1-D arrays (one bias value for each user and each item), and the user thresholds are represented by a jagged array holding one vector of threshold values per user.

The definition of the priors is analogous to that of the latent variables, the only difference being the element type - this time it's Gaussian rather than double. That's because we draw these floating-point parameters from a normal prior.
Now we need to connect the latent variables to their priors. As explained in the Infer.NET 101 paper, this is achieved by drawing each latent variable from its prior using the Random factor.

The initial values of the priors need to be manually set. As to the trait prior, it can be a standard Gaussian(0, 1). In this small example we use the same distribution for the bias prior, but in practice a broader variance might be considered - like Gaussian(0, 10). We set the initial interval size between two user thresholds to 1, and we keep the thresholds zero-centred in the beginning. Thus, if we have 4 threshold levels, they will be -1.5, -0.5, 0.5, and 1.5. For easy initialisation of the priors, we use the static ArrayInit method of the Infer.NET Util class. It has two parameters - the length of the array to be created and a .NET delegate that maps an element's index to its initial value.
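As an illustration of the threshold initialisation, here is the same computation sketched in Python (array_init is a stand-in for Util.ArrayInit; all names are assumptions):

```python
def array_init(length, f):
    # stand-in for Util.ArrayInit: build an array from an index -> value function
    return [f(i) for i in range(length)]

num_levels = 4
# interval of 1 between consecutive thresholds, zero-centred
threshold_means = array_init(num_levels, lambda level: level - num_levels / 2.0 + 0.5)
print(threshold_means)  # [-1.5, -0.5, 0.5, 1.5]
```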
We should now choose how to represent the training data. As previously discussed, a rating matrix with one row per user and one column per item would be extremely sparse. Instead, we store the observations in three parallel arrays: the ids of the rating users, the ids of the rated items, and the corresponding binary rating vectors.
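For instance, three observations on a 3-star scale could look like this (a Python sketch with made-up values; the names mirror the userData, itemData and ratingData arrays used below):

```python
user_data = [4, 12, 4]    # user id of each observation
item_data = [7, 7, 2]     # item id of each observation
rating_data = [           # one bool vector per observation (2 levels)
    [True, False],        # user 4 rated item 7 with 2 of 3 stars
    [True, True],         # user 12 rated item 7 with 3 of 3 stars
    [False, False],       # user 4 rated item 2 with 1 of 3 stars
]
assert len(user_data) == len(item_data) == len(rating_data)
```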
Now that we have the latent variables, their priors, and the training data, it's time to build the model. We do that by iterating over each observation. Intermediate variables are defined locally for simplicity. That is, rather than maintaining an array of affinity values, we define a new affinity variable for each observation. For each observation the current user id is userData[observation], the item id is itemData[observation], and the rating vector is ratingData[observation].

Given this model data, we start by defining the product vector of the user and item traits. This is achieved by iterating over the trait range and computing the corresponding products. The product operator used here ('*') should be the one from Variational Message Passing. And since we use Expectation Propagation, special care needs to be taken when setting the compiler options for this operator (explained in the next section). The affinity is then obtained by summing the elements of the product vector and adding the user and item biases; Gaussian noise is added to this sum to form the noisy affinity.

The noisy affinity should now be compared against the different levels of user thresholds. But since these thresholds might vary with time, some noise should be added here as well. Therefore, we create a new variable array - noisyThresholds - which is formed from the current user thresholds by adding some noise. We use a variance of 0.1 here again, but the developer should tune this parameter in accordance with their data. Now each of the rating levels (ratingData[observation][level]) is formed by comparing the noisy affinity to the corresponding level of the noisy thresholds.
## Running inference and making predictions

Now that the model is in place, we can create an inference engine and make some predictions. The inference algorithm used is Expectation Propagation, but we need to call the implementation of the vector product operator from Variational Message Passing. This is achieved by setting the compiler option GivePriorityTo(typeof(GaussianProductOp_SHG09)).
We now use the inference engine to obtain the posteriors. These are then fed into the model as the new priors by using the ObservedValue property. Note that because we infer multiple distributions, temporary variables should be used here: all posteriors are inferred first, and only then assigned to the priors. Assigning each posterior to its prior as soon as it is inferred would alter the model before the remaining posteriors have been computed, invalidating them.
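The two-phase update can be sketched schematically (a Python stand-in, not the Infer.NET API; infer_posterior is a stub whose result depends on the current priors):

```python
priors = {"traits": 0.0, "biases": 0.0, "thresholds": 0.0}

def infer_posterior(name):
    # stub for engine.Infer(...); the result depends on the current priors
    return priors[name] + 1.0

# First infer every posterior into temporaries...
posteriors = {name: infer_posterior(name) for name in priors}
# ...and only then assign them back as the new priors. Updating a prior
# before the remaining posteriors are inferred would change their results.
priors.update(posteriors)
print(priors)  # {'traits': 1.0, 'biases': 1.0, 'thresholds': 1.0}
```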
With the inferred priors in place, we can use the same model to make predictions. There are only a few minor modifications to be made. First, we need to reset the number of observations. In our example this will be the number of ratings to be inferred; we set it to 1 for the sake of simplicity. Second, we should reset the training data. As to the user and item data, we provide the arrays with the corresponding users and items to be matched. As to the rating data, it needs to be cleared (because it will all be inferred). This is achieved by using the ClearObservedValue() method.

Finally, we can infer the rating arrays (in our case only one). We call the Infer() method with a Bernoulli[][] generic parameter, and the first element of the resulting jagged array is the predicted rating (represented as Bernoulli[]). In our example it contains the probabilities 0.6869 and 0.1147. This means that user 5 will most likely rate item 6 with 2 out of 3 stars.
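One way to read off a star prediction from these per-level probabilities is to difference them as a cumulative curve (a Python sketch using the numbers from the text; it assumes the level probabilities decrease monotonically):

```python
# probs[i] = inferred probability that the affinity exceeds threshold level i
probs = [0.6869, 0.1147]

# P(rating == s stars) ~= P(level s-1 exceeded) - P(level s exceeded)
bounds = [1.0] + probs + [0.0]
star_probs = [bounds[s] - bounds[s + 1] for s in range(len(bounds) - 1)]
best = star_probs.index(max(star_probs)) + 1
print(best)  # 2 -> user 5 most likely rates item 6 with 2 of 3 stars
```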
## Resolving parameter ambiguity

The model explained in this example is symmetric. One way of coping with such problems was discussed in the Mixture of Gaussians tutorial and explained in the User Guide. It involves overriding the initialisation of the algorithm by specifying an initial marginal for some variables. We'll take another approach here, which is to manually set some priors to a fixed value. We choose this approach because later we'll generate data from the model, and we'll want to be able to compare the true parameters to the learned ones (explained in the section on generating data from the model).

The question now is how many traits we need to manually set. That is, what is the number of degrees of freedom in the model? To introduce the problem, consider a one-dimensional trait space with one user and one item. Let the learned means of these two traits be x and y. We always draw the affinity from a function which builds on the product of x and y, but xy = (-x)(-y). So the model can infer (x, y) and (-x, -y) equally well. This can be resolved by fixing the sign of either x or y.

Let's now look at the broader picture. Consider a particular user trait vector u and a particular item trait vector v. The product of the two is then u^{T}v. This is equal to (Au)^{T}(A^{-T}v), where A is some invertible numTraits by numTraits matrix. This means that if we take the whole numUsers by numTraits user trait matrix U and multiply it by A, the result U' = UA will be a valid numUsers by numTraits user trait matrix, provided that we also multiply the item trait matrix by A^{-T}; the resulting products in the model will then be the same. Also note that we can choose A such that the upper square part of U' is the identity matrix.

This shows that the traits have numTraits by numTraits redundant degrees of freedom. Therefore, we remove the ambiguity by fixing the upper square part of the user trait matrix to the identity matrix, i.e. by setting the trait priors of the first numTraits users to point masses.
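This invariance is easy to check numerically (a NumPy sketch; the matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
num_traits = 2
U = rng.normal(size=(5, num_traits))           # user trait matrix
V = rng.normal(size=(4, num_traits))           # item trait matrix
A = rng.normal(size=(num_traits, num_traits))  # invertible with probability 1

# Transform the user traits by A and the item traits by A^{-T} ...
U2 = U @ A
V2 = V @ np.linalg.inv(A).T
# ... and every user-item product stays the same.
assert np.allclose(U @ V.T, U2 @ V2.T)
```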
## Generating data from the model

In order to make sure that the model is working properly, we generate data from it and compare the true parameters to the learned ones. A simple example of model data generation can be found in the code of the Mixture of Gaussians example. The model parameters there are the means and precisions of the distributions to be learned, as well as the vector of mixture weights, so these are the ones that are explicitly set in the code. In the current example the set of true parameters is larger - it comprises the user and item traits, the user and item biases, and the user thresholds. We take a typical approach here, which is to sample these parameters from their priors.

Now that we have the true parameters, we can sample data from the model. We do that by repeating the whole model definition step by step and obtaining the final values. Since the parameters are fixed, we no longer work with probabilities. Therefore, the Infer.NET modelling API is not used here (i.e. there is no mention of the Variable class). The only stochastic part of the data generation is the noise, which we sample from fixed Gaussians.

The code in the data generation is semantically very similar to the one in the model definition. The only difference that might seem odd is the rejection of user-item pairs. Because here we don't iterate over all possible user-item combinations, we randomly generate only some number of them. This makes it possible to draw the same user-item pair twice, so duplicate pairs are rejected and re-sampled until each observation refers to a distinct pair.
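The rejection step can be sketched like this (plain Python; the names and the seed are assumptions):

```python
import random

def sample_pairs(num_users, num_items, num_observations, seed=1):
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < num_observations:
        pair = (rng.randrange(num_users), rng.randrange(num_items))
        # a duplicate user-item pair is implicitly rejected here:
        # adding it to the set again has no effect, so we re-sample
        pairs.add(pair)
    return sorted(pairs)

pairs = sample_pairs(50, 10, 100)
assert len(pairs) == 100 and len(set(pairs)) == 100
```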
Since we generate the data from the model, we can compare the true parameters to the learned ones. For simplicity, we'll only look at the first 5 item traits. We print these out right after data generation (the true values) and again after training (the means of the inferred posteriors).

This is obviously not a very good estimate. If we also printed out the variances here, we would notice that they are very broad. That is because there were too few observations to properly learn the parameters. If we increase the number of users to 200, the number of items to 200, and the number of observations to 20,000, then the results look much better.
Please see the Matchbox recommender Learner for a comprehensive implementation of a recommendation system.