Infer.NET user guide : Tutorials and examples
Difficulty versus ability
This example is a model of how people answer questions on a multiple choice test. It explicitly models the trade-off between a person's ability and the difficulty of the question. The model also allows you to estimate the correct answer to each question, which is useful for crowdsourcing and generalizes majority voting. This model was used in the paper "How To Grade a Test Without Knowing the Answers --- A Bayesian Graphical Model for Adaptive Crowdsourcing and Aptitude Testing" by Bachrach et al. (ICML 2012), where it was called the DARE model. You can run this example in the Examples Browser.
In this model, there are multiple subjects who answer multiple questions, each having multiple choices. The data is simply an integer for each subject and question, describing the answer that was chosen. The following variables set this up:
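As a rough illustration of this data layout in plain Python (not the Infer.NET declarations themselves), the responses can be pictured as a subjects-by-questions table of choice indices. The sizes `n_questions` and `n_choices` and the helper `is_valid` are hypothetical; only the figure of 40 subjects is suggested by the text.

```python
# Hypothetical sizes for illustration.
n_subjects = 40     # consistent with the 40 subjects mentioned in the text
n_questions = 100   # assumption
n_choices = 4       # assumption

# responses[s][q] holds the index (0 .. n_choices - 1) of the choice
# that subject s selected for question q. Filled with a placeholder here.
responses = [[0 for _ in range(n_questions)] for _ in range(n_subjects)]


def is_valid(responses, n_choices):
    """Check that the table is rectangular and every entry is a valid choice index."""
    width = len(responses[0]) if responses else 0
    return all(
        len(row) == width
        and all(isinstance(r, int) and 0 <= r < n_choices for r in row)
        for row in responses
    )
```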
To explain the data, we introduce four different latent variables. For each subject, we hypothesize a real-valued ability variable, where high values increase the subject's probability of answering a question correctly. You can think of this as the subject's level of expertise or concentration on the test. These are assumed to be normally distributed:
For each question, we hypothesize a real-valued difficulty variable, where high values decrease a subject's probability of answering the question correctly. These are also assumed to be normally distributed:
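The two normal priors described above can be sketched by drawing one value per subject and one per question. This is plain Python rather than Infer.NET, and the prior parameters (mean 0, standard deviation 1) and the question count are assumptions made for the sketch, not values taken from the text.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n_subjects = 40    # consistent with the 40 subjects mentioned in the text
n_questions = 100  # assumption

# Assumed prior parameters: mean 0, standard deviation 1 for both quantities.
ability = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
difficulty = [rng.gauss(0.0, 1.0) for _ in range(n_questions)]
```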
Besides difficulty, a question may have high or low discrimination between people of different abilities. For example, a question that is badly worded may be misinterpreted by a fraction of the subjects, leading to noisy answers regardless of the subject's ability. This is captured by a real-valued discrimination variable, where high values increase the effect of a subject's ability. Discrimination is always non-negative. Zero discrimination means that a subject's ability has no effect on whether they will answer the question correctly.
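One way to make these statements concrete is to assume that discrimination acts as the precision (inverse variance) of Gaussian noise added to the difference between ability and difficulty. That reading is an assumption of this sketch, although it agrees with the statement that zero discrimination removes the effect of ability. Under it, the probability that the noisy difference is positive has a closed form. The function names below are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_knows_answer(ability, difficulty, discrimination):
    """Probability that (ability - difficulty) plus Gaussian noise exceeds zero,
    assuming the noise has precision `discrimination` (variance 1/discrimination).
    At discrimination 0 this returns the limiting value 0.5."""
    if discrimination < 0:
        raise ValueError("discrimination must be non-negative")
    return normal_cdf((ability - difficulty) * math.sqrt(discrimination))
```

With `discrimination` equal to 0 the result is 0.5 for every ability, and larger values push the probability toward 0 or 1 depending on the sign of ability minus difficulty.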
Finally, each question has an integer-valued trueAnswer. This may be known, as in a classroom scenario, or it may be unknown, as in a crowdsourcing scenario. The model can handle both cases.
The generative model now works as follows. For each subject and question, the difference between ability and difficulty is the subject's advantage in answering the question correctly. To this advantage we add noise scaled by the discriminatory power of the question. If this noisy advantage is greater than zero, then the subject answers the question correctly; otherwise, they choose an answer at random.
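The generative story above can be written as a small sampler. This is a plain-Python sketch, not the Infer.NET model, and it makes two assumptions that the text does not spell out: the noise is Gaussian with precision equal to the question's discrimination, and a random answer is drawn uniformly over the choices (so it can still coincide with the true answer). The function name and argument layout are hypothetical.

```python
import math
import random


def sample_responses(ability, difficulty, discrimination, true_answer, n_choices, rng):
    """Draw one response per (subject, question) pair.
    Every discrimination value must be strictly positive in this sketch."""
    responses = []
    for a in ability:
        row = []
        for q, d in enumerate(difficulty):
            advantage = a - d
            # Assumption: discrimination is the noise precision, so std = 1/sqrt(precision).
            noise_std = 1.0 / math.sqrt(discrimination[q])
            noisy_advantage = rng.gauss(advantage, noise_std)
            if noisy_advantage > 0:
                row.append(true_answer[q])            # subject answers correctly
            else:
                row.append(rng.randrange(n_choices))  # uniform guess, may still be right
        responses.append(row)
    return responses


# Tiny usage example: 3 subjects, 2 questions, 4 choices.
rng = random.Random(1)
sampled = sample_responses(
    ability=[1.5, 0.0, -1.5],
    difficulty=[0.0, 0.5],
    discrimination=[1.0, 2.0],
    true_answer=[2, 3],
    n_choices=4,
    rng=rng,
)
```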
To get robust inference in this model, some special settings are necessary; otherwise inference tends to generate improper message exceptions. The issue is that the model has highly correlated variables, yet we are using a factorized distribution to approximate it (see the page on Expectation Propagation). This leads to slow and unstable convergence. To help convergence, we instruct the scheduler to process subjects sequentially, so that all variables are updated after each subject, i.e. 40 times per iteration, rather than once per iteration. A nice benefit of these settings is that inference converges quickly (in fewer than 5 iterations).
To test the inference under this model, we generate a data set from known parameters and compare the learned parameters to the true ones. Notice that the Sample method has the same structure as the Infer.NET model. This is because the Infer.NET model is essentially a sampler, but expressed using the Infer.NET primitives instead of C#. The results are shown below. The estimated true answers and difficulty/ability parameters are pretty good. The discrimination parameters are not quite as good.
Note that if the ability parameters are all equal, then the estimate of the true answers will be identical to majority voting, since the most likely true answer will be the answer that most subjects chose. Thus to compare the results of this model to majority voting, just set the ability parameters to a constant. If you do this on this dataset, only 91% of the estimated trueAnswers are correct. Thus the ability parameters help to do better vote aggregation.
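For reference, the majority-voting baseline mentioned above can be sketched directly on the response table, without any model. This is plain Python; the function names and the tie-breaking rule (lowest choice index wins) are choices made for this sketch.

```python
from collections import Counter


def majority_vote(responses, n_choices):
    """Return, for each question, the choice picked by the most subjects.
    Ties are broken in favour of the lowest choice index (arbitrary here)."""
    n_questions = len(responses[0])
    estimates = []
    for q in range(n_questions):
        counts = Counter(row[q] for row in responses)
        estimates.append(max(range(n_choices), key=lambda c: counts[c]))
    return estimates


def accuracy(estimates, true_answer):
    """Fraction of questions whose estimated answer matches the true one."""
    return sum(e == t for e, t in zip(estimates, true_answer)) / len(true_answer)
```

Calling `accuracy(majority_vote(responses, n_choices), true_answer)` on a simulated data set gives the kind of baseline figure that the model's estimates can be compared against.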
How to handle missing data
The provided code assumes that every subject has answered every question. If this is not the case, then some changes are necessary. One approach is to leave the response array unobserved and apply constraints to the individual elements that were observed. Another approach is to use conditionals to skip over the missing elements, as explained in How to handle missing data. However, both of these are inefficient. The most efficient approach is to restructure the data as a collection of (subject, question, response) observations. Instead of looping over all subjects and questions, you only loop over the provided observations. The model becomes:
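As an illustration of the data restructuring only (plain Python, not the Infer.NET model), a table with gaps can be flattened into (subject, question, response) triples. Marking missing entries with `None` and the name `to_observations` are assumptions of this sketch.

```python
def to_observations(responses):
    """Flatten a subjects-by-questions table into (subject, question, response)
    triples, skipping entries marked as missing with None."""
    return [
        (s, q, r)
        for s, row in enumerate(responses)
        for q, r in enumerate(row)
        if r is not None
    ]


# Subject 0 skipped question 1; subject 1 skipped question 0.
table = [[2, None, 0],
         [None, 1, 0]]
observations = to_observations(table)

# Downstream code then loops over observations only, never over empty cells.
answers_per_question = {}
for s, q, r in observations:
    answers_per_question.setdefault(q, []).append(r)
```

The loop at the end visits four observations instead of six table cells; the saving grows with the fraction of missing responses.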
For an example of this approach, see the forum.