Introduction to Item Response Theory

IRT is an adaptive testing framework that can produce graded test scores and reduce the time it takes to administer tests.

The Framework

At its base, IRT defines a model that relates a student's latent ability to the probability of answering a problem correctly: $p(\theta, b_i)$, where $p$ is the probability of answering problem $i$ correctly given the student's ability $\theta$ and the intrinsic parameters of the problem, collected in the vector $b_i$.

We often want to study $p(\theta, b_i)$, with the problem's intrinsic parameters $b_i$ held fixed, as a function of latent ability $\theta$. We call this the characteristic equation of a problem, which is just a plot of $(\theta, p(\theta, b_i))$.

In standard models, $\theta$ is unbounded, that is, $\theta \in \, ]-\infty, \infty[$, and a sigmoid function ($\sigma(x) = \frac{1}{1+e^{-x}}$) is usually used as the base for the characteristic equation. This is convenient since the sigmoid is defined $\forall x \in \, ]-\infty, \infty[$ and has a range of $]0, 1[$, which makes its output a natural probability.

Even though $\theta$ is unbounded, in practice $\theta$ typically lies in the range $[-3, 3]$.
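
To make that practical range concrete, here is a minimal sketch (the helper name is mine) showing that $[-3, 3]$ already spans most of the probability scale:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real number into ]0, 1[."""
    return 1.0 / (1.0 + math.exp(-x))

# The practical range [-3, 3] covers nearly the whole probability scale:
for theta in (-3, 0, 3):
    print(f"sigmoid({theta:+d}) = {sigmoid(theta):.3f}")
# sigmoid(-3) = 0.047, sigmoid(+0) = 0.500, sigmoid(+3) = 0.953
```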

Assumptions

There are some formal assumptions for how $p(\theta, b_i)$ should behave. They're mostly common sense, but it's worth noting them regardless.

  • Monotonicity: The probability of answering correctly, $p(\theta, b_i)$, should increase monotonically as $\theta$ increases.
  • Unidimensionality: Basic models assume that a single latent trait, $\theta$, encodes student ability, but this assumption can be relaxed in more advanced models.
  • Local Independence: Given a student's ability $\theta$, responses to different problems are independent of each other.
  • Invariance: The intrinsic parameters of the problems are stable across different students.

Models

1 Parameter model (Rasch)

The simplest model uses only one intrinsic parameter: the difficulty of the problem. We write the characteristic equation as

$$p(\theta, d_i) = \frac{1}{1 + e^{-(\theta - d_i)}}$$

where $d_i$ is the difficulty parameter. It is analogous to $\theta$ in its range. The neutral value is $0$.
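
As a quick illustration, here is a minimal sketch of the Rasch model in Python (function and variable names are mine):

```python
import math

def rasch(theta: float, d: float) -> float:
    """1 parameter (Rasch) model: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

# When ability exactly matches difficulty, the probability is 1/2:
print(rasch(theta=1.0, d=1.0))   # 0.5
# A stronger student has a better chance on the same problem:
print(rasch(theta=2.0, d=1.0))   # ~0.731
```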

2 Parameter model

We can extend this model by introducing a discrimination parameter that varies the slope of the characteristic equation.

$$p(\theta, d_i, a_i) = \frac{1}{1 + e^{-a_i(\theta - d_i)}}$$

where $a_i$ is the discrimination parameter.

$a_i$ should be in the range $[0, \infty[$. If $a_i = 0$, skill has no effect: everyone has the same probability of answering correctly, namely $\frac{1}{2}$ (or, in the 4 parameter model, the average of $b_i$ and $c_i$). As $a_i$ approaches $\infty$, the characteristic equation becomes a perfect step function. The neutral value is $1$.
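
A short sketch (names are mine) of the two limiting behaviors:

```python
import math

def two_pl(theta: float, d: float, a: float) -> float:
    """2 parameter model: discrimination a scales the slope."""
    return 1.0 / (1.0 + math.exp(-a * (theta - d)))

# a = 0: ability has no effect, everyone sits at 1/2.
print(two_pl(theta=-2.0, d=0.0, a=0.0))   # 0.5
print(two_pl(theta=+2.0, d=0.0, a=0.0))   # 0.5
# Large a: the curve approaches a step function at theta = d.
print(two_pl(theta=-0.1, d=0.0, a=50.0))  # ~0.007
print(two_pl(theta=+0.1, d=0.0, a=50.0))  # ~0.993
```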

3 Parameter model

The 3 parameter model takes into account the probability of guessing correctly by raising the lower asymptote:

$$p(\theta, d_i, a_i, c_i) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - d_i)}}$$

where $c_i$ is the guessing probability parameter.

$c_i$ should be in the range $[0, 1]$, where $0$ means that it's impossible to guess the solution, and $1$ means that it is impossible to fail the question. The neutral value is $0$.
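
A sketch of the lower asymptote, assuming a hypothetical 4-option multiple-choice problem, so $c_i = 0.25$:

```python
import math

def three_pl(theta: float, d: float, a: float, c: float) -> float:
    """3 parameter model: c raises the lower asymptote (guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - d)))

# Even a very weak student keeps roughly a 25% chance by guessing:
print(three_pl(theta=-5.0, d=0.0, a=1.0, c=0.25))  # ~0.255
print(three_pl(theta=+5.0, d=0.0, a=1.0, c=0.25))  # ~0.995
```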

4 Parameter model

The 4 parameter model is used less often, but is still worthwhile to discuss. It introduces a slip factor: the probability that someone who knows the concept makes a mistake.

$$p(\theta, d_i, a_i, c_i, b_i) = c_i + (b_i - c_i)\,\frac{1}{1 + e^{-a_i(\theta - d_i)}}$$

where $b_i$ is the upper bound probability parameter.

$b_i$ should be bigger than $c_i$ and lies in the range $[0, 1]$. It encodes the probability that someone who knows the answer actually answers correctly. The neutral value is $1$.
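
A sketch showing the upper asymptote, and that the neutral values ($a_i = 1$, $c_i = 0$, $b_i = 1$) recover the Rasch model (the parameter values are illustrative):

```python
import math

def four_pl(theta: float, d: float, a: float, c: float, b: float) -> float:
    """4 parameter model: b lowers the upper asymptote (slipping)."""
    return c + (b - c) / (1.0 + math.exp(-a * (theta - d)))

# Even a very strong student tops out at b (here 0.95: a 5% slip rate):
print(four_pl(theta=+5.0, d=0.0, a=1.0, c=0.25, b=0.95))  # ~0.945
# With all parameters at their neutral values, we recover the Rasch model:
print(four_pl(theta=1.0, d=1.0, a=1.0, c=0.0, b=1.0))     # 0.5
```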

Building intuition

[Interactive plot: the characteristic equation with sliders for difficulty (default 0), discrimination (1), guess factor (0), and slip factor (1), i.e. each parameter at its neutral value.]
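
A rough static stand-in for the interactive plot (parameter values are illustrative; the defaults match the sliders above):

```python
import numpy as np
import matplotlib.pyplot as plt

def four_pl(theta, d=0.0, a=1.0, c=0.0, b=1.0):
    """4 parameter characteristic equation with neutral defaults."""
    return c + (b - c) / (1.0 + np.exp(-a * (theta - d)))

theta = np.linspace(-6, 6, 200)
plt.plot(theta, four_pl(theta), label="neutral (d=0, a=1, c=0, b=1)")
plt.plot(theta, four_pl(theta, d=1.5), label="harder (d=1.5)")
plt.plot(theta, four_pl(theta, a=3.0), label="more discriminating (a=3)")
plt.plot(theta, four_pl(theta, c=0.25, b=0.9), label="guess c=0.25, slip b=0.9")
plt.xlabel(r"$\theta$ (ability)")
plt.ylabel("p(correct)")
plt.legend()
plt.show()
```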

Fitting the model

We fit the models using maximum likelihood estimation. The procedure is the same whether or not the intrinsic problem parameters are known. However, if you already know the problem difficulties, you can iteratively estimate a student's ability using only a subset of the problem dataset.

By choosing each problem to maximize the information, $I = p(\theta, b_i)\,(1 - p(\theta, b_i))$, you can quickly estimate the student's ability to a high degree of confidence. The highest-information problem is the one where, given your current estimate of $\theta$, the student's chance of answering correctly is closest to $50\%$.
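
A minimal sketch of that selection rule, assuming a Rasch model with known difficulties (the dataset and helper names are mine):

```python
import math

def rasch(theta: float, d: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - d)))

def information(theta: float, d: float) -> float:
    """Item information: I = p * (1 - p), maximized where p = 1/2."""
    p = rasch(theta, d)
    return p * (1.0 - p)

def pick_next_problem(theta_estimate: float, difficulties: list[float]) -> int:
    """Index of the problem with maximum information, i.e. the one whose
    predicted probability of success is closest to 1/2."""
    return max(range(len(difficulties)),
               key=lambda i: information(theta_estimate, difficulties[i]))

difficulties = [-2.0, -0.5, 0.8, 2.5]
print(pick_next_problem(theta_estimate=1.0, difficulties=difficulties))  # 2
```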
