# Likelihood ratio test for 2x2 independence

Categories: statistics

Author: Stefan Eng

Published: September 17, 2023

## Likelihood Ratio Test for independence in a 2x2 contingency table

|         | Success | Failure | Total   |
|---------|---------|---------|---------|
| Group 1 | $$a$$   | $$b$$   | $$n_1$$ |
| Group 2 | $$c$$   | $$d$$   | $$n_2$$ |
| Total   | $$m_1$$ | $$m_2$$ | $$N$$   |

Assume that $$n_1$$ and $$n_2$$ are fixed and that we sample from binomial distributions with success probabilities $$p_1$$ and $$p_2$$. The maximum likelihood estimates of $$p_1$$ and $$p_2$$ are the observed success proportions $$\hat{p}_1 = a / n_1$$ and $$\hat{p}_2 = c / n_2$$.
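As a quick numeric sketch, using a hypothetical 2x2 table (the counts are made up for illustration), the MLEs are just the per-group success proportions:

```python
# Hypothetical 2x2 table: a, b = group 1 successes/failures,
#                         c, d = group 2 successes/failures.
a, b = 30, 20
c, d = 15, 35

n1, n2 = a + b, c + d  # fixed group sizes

# MLEs under the two independent binomials model.
p1_hat = a / n1  # 30/50 = 0.6
p2_hat = c / n2  # 15/50 = 0.3

print(p1_hat, p2_hat)
```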

Since the two groups are sampled independently, the joint likelihood factors into the product of the two binomial likelihoods:

\begin{aligned} \mathcal L(p_1, p_2 | a,b,c,d) &= {n_1 \choose a} {n_2 \choose c} p_1^a (1 - p_1)^b p_2^c (1 - p_2)^d\\ &\underset{p_1, p_2}{\propto} p_1^a (1 - p_1)^b p_2^c (1 - p_2)^d \end{aligned}

Then the log-likelihood $$\ell$$ is

\begin{aligned} \ell(p_1, p_2 | a,b,c,d) &= \log \mathcal L(p_1, p_2 | a,b,c,d)\\ &\underset{p_1, p_2}{\propto} a \log p_1 + b \log (1 - p_1) + c \log p_2 + d \log (1 - p_2) \end{aligned}

Evaluating the log-likelihood at the maximum likelihood estimates $$\hat{p}_1$$ and $$\hat{p}_2$$, and noting that $$1 - \hat{p}_1 = b / n_1$$ and $$1 - \hat{p}_2 = d / n_2$$:

$\ell(\hat{p}_1, \hat{p}_2 | a,b,c,d) \underset{p_1, p_2}{\propto} a \log \frac{a}{n_1} + b \log \frac{b}{n_1} + c \log \frac{c}{n_2} + d \log \frac{d}{n_2}$

Under the null hypothesis that $$p_1 = p_2$$ we would expect to see the following table

|         | Success | Failure | Total   |
|---------|---------|---------|---------|
| Group 1 | $$a_{E} = \frac{n_1 \cdot m_1}{N}$$ | $$b_{E} = \frac{n_1 \cdot m_2}{N}$$ | $$n_1$$ |
| Group 2 | $$c_{E} = \frac{n_2 \cdot m_1}{N}$$ | $$d_{E} = \frac{n_2 \cdot m_2}{N}$$ | $$n_2$$ |
| Total   | $$m_1$$ | $$m_2$$ | $$N$$   |
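A short sketch of computing the expected counts from the margins, using a hypothetical table with counts $$a=30$$, $$b=20$$, $$c=15$$, $$d=35$$:

```python
# Hypothetical observed 2x2 table.
a, b, c, d = 30, 20, 15, 35

n1, n2 = a + b, c + d  # row totals (group sizes)
m1, m2 = a + c, b + d  # column totals (successes, failures)
N = n1 + n2

# Expected counts under H0: p1 = p2 = m1 / N.
a_E = n1 * m1 / N  # 50 * 45 / 100 = 22.5
b_E = n1 * m2 / N  # 50 * 55 / 100 = 27.5
c_E = n2 * m1 / N  # 22.5
d_E = n2 * m2 / N  # 27.5

print(a_E, b_E, c_E, d_E)
```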

Expressing this in terms of success probabilities, the pooled estimate under the null is $$p_0 = m_1 / N$$. Under this $$p_0$$ the log-likelihood is

\begin{aligned} \ell(p_0, p_0 | a,b,c,d) &\underset{p_1, p_2}{\propto} a \log \frac{m_1}{N} + b \log \frac{m_2}{N} + c \log \frac{m_1}{N} + d \log \frac{m_2}{N} \end{aligned}

Now we can compute the log-likelihood ratio:

\begin{aligned} \ell(\hat{p}_1, \hat{p}_2 | a,b,c,d) - \ell(p_0, p_0 | a,b,c,d) &= a \log \frac{a}{n_1} + b \log \frac{b}{n_1} + c \log \frac{c}{n_2} + d \log \frac{d}{n_2}\\ &\quad - \left( a \log \frac{m_1}{N} + b \log \frac{m_2}{N} + c \log \frac{m_1}{N} + d \log \frac{m_2}{N} \right)\\ &= a \log \left( a \frac{N}{n_1 m_1} \right) + b \log \left( b \frac{N}{n_1 m_2}\right) + c \log \left( c \frac{N}{n_2 m_1}\right) + d \log \left( d \frac{N}{n_2 m_2}\right)\\ &= a \log \left( \frac{a}{a_E} \right) + b \log \left( \frac{b}{b_E}\right) + c \log \left( \frac{c}{c_E}\right) + d \log \left( \frac{d}{d_E}\right) \end{aligned}

Multiplying this expression by 2 gives the likelihood ratio test statistic (also known as the $$G$$-test statistic), which by Wilks' theorem asymptotically follows a Chi-squared distribution under the null hypothesis. The same reasoning applies to larger contingency tables.
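The statistic $$2 \sum O \log(O/E)$$ over the four cells can be computed directly. A minimal sketch, using a hypothetical table with counts $$a=30$$, $$b=20$$, $$c=15$$, $$d=35$$ (`scipy.stats.chi2_contingency` with `lambda_="log-likelihood"` and `correction=False` should, as far as I know, return the same statistic):

```python
import math

# Hypothetical observed counts and their margins.
observed = [30, 20, 15, 35]        # a, b, c, d
n1, n2 = 50, 50                    # row totals
m1, m2 = 45, 55                    # column totals
N = 100

# Expected counts under independence: row total * column total / N.
expected = [n1 * m1 / N, n1 * m2 / N, n2 * m1 / N, n2 * m2 / N]

# Likelihood ratio (G-test) statistic: 2 * sum O log(O / E).
G = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

print(G)  # ≈ 9.24
```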

How does this compare to the Pearson Chi-squared test? We can first look at how the per-cell contributions of each statistic differ: \begin{aligned} f_\chi(O,E) &= \frac{(O - E)^2}{E}\\ f_{LRT}(O,E) &= O \log \left( \frac{O}{E}\right) \end{aligned}

Both statistics tend towards a $$\chi^2$$ distribution with 1 degree of freedom in the 2x2 case, and with $$(J - 1)(K - 1)$$ degrees of freedom for a $$J \times K$$ contingency table.
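To see how close the two statistics are in practice, here is a small sketch that evaluates both on a hypothetical table with counts $$a=30$$, $$b=20$$, $$c=15$$, $$d=35$$ (expected counts 22.5, 27.5, 22.5, 27.5):

```python
import math

# Per-cell contribution of Pearson's chi-squared statistic.
def f_chi(O, E):
    return (O - E) ** 2 / E

# Per-cell contribution of the LRT statistic (G multiplies the sum by 2).
def f_lrt(O, E):
    return O * math.log(O / E)

observed = [30, 20, 15, 35]
expected = [22.5, 27.5, 22.5, 27.5]

chi2_stat = sum(f_chi(o, e) for o, e in zip(observed, expected))   # ≈ 9.09
G_stat = 2 * sum(f_lrt(o, e) for o, e in zip(observed, expected))  # ≈ 9.24

print(chi2_stat, G_stat)
```

The two statistics agree closely here; when every $$|O - E|$$ is small relative to $$E$$, a second-order Taylor expansion of $$O \log(O/E)$$ shows the $$G$$ statistic approximately equals Pearson's chi-squared.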