1 Introduction

This module focuses on the three most commonly used statistical tests in subsequent courses: the Wald, Score, and Likelihood Ratio tests. These three methods form the cornerstone of inference in parametric models, providing powerful and versatile approaches for testing hypotheses, constructing confidence intervals, and assessing model fit.

While rooted in the elegant framework of maximum likelihood estimation, each test offers a distinct geometric or algebraic perspective on the evidence against a null hypothesis. Understanding their theoretical foundations, comparative strengths, and intrinsic relationships is essential for mastering advanced statistical methodology and applying it correctly to complex data.

In this module, we will state formulation, assumptions, steps to perform these tests and explained by numerical examples with manual solution and verification using R.

2 Review of Likelihood Formulation and Estimation

In maximum likelihood estimation, we fit models by finding parameters that make the observed data most probable. This process naturally leads to three related methods for testing hypotheses such as the likelihood ratio, Wald, and score tests. Each test uses a different feature of the likelihood function, but all are rooted in the same principles of likelihood and asymptotic theory. This section reviews basics of maximum likelihood estimation and their asymptotic normality.

Likelihood Function

Let \(X_1, X_2, \dots, X_n\) be independent and identically distributed (i.i.d) random variables with probability density (or mass) function \(f(x_i; \boldsymbol{\theta})\), where \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)^T\) is a \(p\)-dimensional parameter vector. The likelihood function is:

\[ L(\boldsymbol{\theta}) = \prod_{i=1}^n f(x_i; \boldsymbol{\theta}) \]

The log-likelihood function is:

\[ \ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^n \log f(x_i; \boldsymbol{\theta}) \]

Maximum Likelihood Estimation

The maximum likelihood estimator (MLE) \(\hat{\boldsymbol{\theta}}\) satisfies:

\[ \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) \]

Under regularity conditions, \(\hat{\boldsymbol{\theta}}\) is obtained by solving the score equations:

\[ U(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \boldsymbol{0} \]

Fisher Information Matrix

The Fisher information matrix of \(\theta\) based on an entire sample with size \(n\) is given by:

\[ I_n(\boldsymbol{\theta}) = E\left[-\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}\right] \]

The above (theoretical) Fisher information matrix requires integral that integrates variables out from the expression. In practice, we use the following observed information matrix:

\[ J_n(\boldsymbol{\theta}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} \]

Some software programs return Hessian matrix \(H_n( \boldsymbol{\theta}) = -J_n( \boldsymbol{\theta})\), that is

\[ H_n(\boldsymbol{\theta}) = \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} \]

Note that, in some textbooks, research articles and web postings, the Fisher Information Matrix based on single data point is denoted by \(I_0(\boldsymbol{\theta})\),

\[ I_n(\boldsymbol{\theta}) = n\times I_1(\boldsymbol{\theta}) \]

Under some regularity conditions and the null hypothesis \(H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0\), we have the following

\[ \sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} N(\boldsymbol{0}, I_1^{-1}(\boldsymbol{\theta}_0)). \]

The relationship between Fisher Information Matrix of \(\boldsymbol{\theta}\) and the variance of the MLE \(\hat{\boldsymbol{\theta}}\) is

\[ \text{var}(\hat{\boldsymbol{\theta}}) \approx I_n^{-1}(\boldsymbol{\theta}) = [n\times I_1(\boldsymbol{\theta})]^{-1} = \frac{I_1^{-1}(\boldsymbol{\theta})}{n} \] Therefore, we can rewrite the above asymptotic distribution of \(\boldsymbol{\theta}\) as

\[ \hat{\boldsymbol{\theta}}\to N\left( \boldsymbol{\theta}, \frac{I_1^{-1}(\boldsymbol{\theta})}{n} \right) \quad \text{or equivalently} \quad \hat{\boldsymbol{\theta}}\to N\left( \boldsymbol{\theta}, I_n^{-1}(\boldsymbol{\theta}) \right) \]

The diagonal elements in the covariance \(I_n^{-1}(\boldsymbol{\theta})\) represent the variance of the corresponding parameters in \(\hat{\boldsymbol{\theta}} = (\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta_k})\). For example. $(_1) = $ the first element in the main diagonal of the inverse of the fisher information matrix based on the entire sample.

For example, let \(\boldsymbol{\theta} =(\theta_1, \theta_2)\), the corresponding MLE is \(\hat{\boldsymbol{\theta}} =(\hat{\theta_1}, \hat{\theta_2)}\). Let the observed Fisher information matrix be

\[ J_n(\boldsymbol{\theta}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}=-\begin{pmatrix} \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_1\partial \theta_2} \\ \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_2} \end{pmatrix} \]

The estimated Fisher information matrix is given by

\[ \widehat{J_n(\boldsymbol{\theta})} = -\begin{pmatrix} \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_1\partial \theta_2} \\ \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_2} \end{pmatrix}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}. \]

Note that R function optim() returns Hessian matrix \(\widehat{H_n(\boldsymbol{\theta})}=- \widehat{J_n(\boldsymbol{\theta})}\).

Let the inverse of \(widehat{J_n(\boldsymbol{\theta})}\) be of the following form

\[ \left[\widehat{J_n(\boldsymbol{\theta})}\right]^{-1} = -\left[\begin{pmatrix} \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_1\partial \theta_2} \\ \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_2} \end{pmatrix}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}\right]^{-1} =\begin{pmatrix} a & c \\ c & d \end{pmatrix}. \]

Then we have

\[ \widehat{\text{var}(\hat{\boldsymbol{\theta})}} \approx \begin{pmatrix} a & c \\ c & d \end{pmatrix}. \]

That is,

\[ \text{var}(\hat{\theta}_1) \approx a, \quad \text{var}(\hat{\theta}_2) \approx d, \quad \text{and} \quad \text{cov}(\hat{\theta}_1, \hat{\theta}_2) \approx c. \]

The above asymptotic normality is the foundation of the Wald and Score tests.


3 The Three Asymptotic \(\chi^2\) Tests

At the foundation of parametric hypothesis testing lies a trio of asymptotically equivalent methods: the likelihood ratio, Wald, and score tests. All three derive from maximum likelihood theory and converge to the same chi-squared distribution under the null hypothesis, yet they approach the problem of testing from meaningfully different angles.

The likelihood ratio test directly compares the goodness-of-fit between the unrestricted model and the model constrained by the null hypothesis. The Wald test evaluates the distance between the estimated parameter vector and its hypothesized value, standardized by the estimated curvature of the likelihood. The score test, alternatively, examines whether the slope of the log-likelihood at the null parameter value is sufficiently close to zero—an indication that the null point is a plausible maximum.

Despite their asymptotic equivalence, these tests differ in computation, finite-sample performance, and invariance properties, leading to practical trade-offs in applied work. The following sub-sections develop their formulations, contrasts their theoretical and practical attributes, and provide guidance for choosing among them in statistical modeling.

3.1 Wald Test

Formulation Tests the distance between the unconstrained MLE and the hypothesized value. It measures geometric displacement from \(H_0\) in parameter space. To make it simple, let’s \(\boldsymbol{\theta} = (\theta_1, \theta_2)\) be the vector of the parameters of a distribution. Consider hypothesis

\[ H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0, \quad \text{i.e.}\quad \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \begin{pmatrix} \theta_{10} \\ \theta_{20} \end{pmatrix}. \]

The difference between the hypothesized value and the unrestricted MLE is given by

\[ D = \hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0 = \begin{pmatrix} \hat{\theta}_1 -\theta_{10} \\ \hat{\theta}_2 -\theta_{20} \end{pmatrix} \]

Test Statistic

With the above notations, the Wald test statistic is given by

\[ W = \hat{D}^{T} \, \left[ \widehat{\text{var}(\hat{\boldsymbol{\theta}})}\right]^{-1} \, \hat{D} = \hat{D}^{T} \,\text{I}_n(\hat{\boldsymbol{\theta}}) \, \hat{D} \]

More explicitly, we have

\[ W = \begin{pmatrix} \hat{\theta}_1 - \theta_{10} & \hat{\theta}_2 - \theta_{20} \end{pmatrix} \begin{pmatrix} \text{var}(\hat{\theta}_1) & \text{cov}(\hat{\theta}_1, \hat{\theta}_2) \\ \text{cov}(\hat{\theta}_1, \hat{\theta}_2) & \text{var}(\hat{\theta}_2) \end{pmatrix}^{-1} \begin{pmatrix} \hat{\theta}_1 - \theta_{10} \\ \hat{\theta}_2 - \theta_{20} \end{pmatrix}. \]

For testing single parameter case \(H_0: \boldsymbol{\theta}_1 = \boldsymbol{\theta}_{10}\)

\[ W = (\hat{\theta} - \theta_0)[\text{var}(\hat{\theta}_1)]^{-1}(\hat{\theta} - \theta_0) = \frac{(\hat{\theta} - \theta_0)^2}{\text{var}(\hat{\theta}_1)} \]

Asymptotic Distribution: \(W \xrightarrow{d} \chi^2_q\) under \(H_0\), where \(q\) is the number of parameters specified under the null hypothesis.

In the two-parameter and one-parameter cases described above, the degrees of freedom of the \(\chi^2\) test are \(q = 2\) and \(q=1\), respectively.

Assumptions: Wald test requires only the unconstrained MLE (as well as Information matrix to derive the covariance matrix of the MLE) and is sensitive to parameterization.

3.2 Score Test (Rao’s Score Test)

Rao’s score is also known as the Lagrange multiplier test.

Formulation: Tests whether the score at the null hypothesis is significantly different from zero. It measures local gradient at \(H_0\) in likelihood space.

Test Statistic:

\[ S = U(\tilde{\boldsymbol{\theta}})^T I_n^{-1}(\tilde{\boldsymbol{\theta}}) U(\tilde{\boldsymbol{\theta}}) \] where \(\tilde{\boldsymbol{\theta}}\) is the MLE under \(H_0\) (constrained MLE).

Asymptotic Distribution: \(S \xrightarrow{d} \chi^2_q\) under \(H_0\).

Assumptions: Requires only the constrained MLE. Invariant to reparameterization.

3.3 Likelihood Ratio Test

Formulation: Compares the maximized log-likelihood under \(H_0\) and \(H_1\). It measures information loss when imposing \(H_0\).

Test Statistic:

\[ \Lambda = -2[\ell(\tilde{\boldsymbol{\theta}}) - \ell(\hat{\boldsymbol{\theta}})] = 2[\ell(\hat{\boldsymbol{\theta}}) - \ell(\tilde{\boldsymbol{\theta}})] \]

Asymptotic Distribution: \(\Lambda \xrightarrow{d} \chi^2_q\) under \(H_0\).

Assumptions: Requires both constrained and unconstrained MLEs. Invariant to reparameterization.


4 Practical Examples: Single Parameter

To concretely demonstrate the similarities and differences among the three likelihood-based tests, we now turn to practical implementation in R. While these tests often yield similar conclusions with large samples, their distinctions become particularly instructive when examining simple yet fundamental distributions.

Through hands-on examples using Poisson and normal distributions, we will systematically apply the Likelihood Ratio, Wald, and Score tests to identical hypotheses. This comparative approach will illuminate how each test statistic is calculated, showcase scenarios where their results diverge meaningfully, and provide reusable code templates that reveal the computational machinery underlying common statistical procedures.

Unlike t-tests and z-tests which have closed-form formulas, likelihood-based chi-squared tests require explicit derivation of the likelihood function, parameter estimation under null/alternative, and careful construction of the test statistics. This is why they’re more flexible but require more preparation. The key preparation steps required

  • Derive log-likelihood function \(\ell(\theta; x)\)
  • Derive score function \(U(\theta) = \partial \ell/\partial \theta\)
  • Derive Fisher Information \(I_n(\theta) = -E[\partial^2 \ell/\partial \theta^2]\)
  • Find MLEs under both \(H_0\) and \(H_1\)
  • Compute standard errors from inverse Fisher Information
  • Construct test statistics using appropriate formulas

Note that likelihood-based inferences rely on large-sample asymptotic theory. For the purpose of clearly illustrating the procedure, the examples in the following subsections will use small toy datasets.

4.1 Example 1: Poisson Distribution

Problem: Test \(H_0: \lambda = 2\) vs \(H_1: \lambda \neq 2\) for \(X_1, \dots, X_n \sim \text{Poisson}(\lambda)\).

Observed data: 3, 1, 4, 2, 3.

Manual Solution:

  • Likelihood: \(L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}\)

  • Log-likelihood: \(\ell(\lambda) = -n\lambda + (\sum x_i)\log\lambda - \sum \log(x_i!)\)

  • MLE: \(\hat{\lambda} = \bar{x} = \frac{3+1+4+2+3}{5} = 2.6\)

  • Score function: \(U(\lambda) = \frac{d\ell}{d\lambda} = -n + \frac{\sum x_i}{\lambda}\)

  • Information: \(I_n(\lambda) = \frac{n}{\lambda}\) (expected), \(J_n(\lambda) = \frac{\sum x_i}{\lambda^2}\) (observed)

Wald Test

\[ W = (\hat{\lambda} - \lambda_0)^2 \hat{I}_n(\hat{\lambda}) = (2.6 - 2)^2 \times \frac{5}{2.6} = 0.36 \times 1.923 = 0.692 \] Using observed information: \(W = (0.6)^2 \times \frac{13}{2.6^2} = 0.36 \times 1.923 = 0.692\)

Score Test

Score at \(\lambda_0=2\): \(U(2) = -5 + \frac{13}{2} = 1.5\)

\[ I(2) = \frac{5}{2} = 2.5 \]

\[ S = U(2)^2 I_n^{-1}(2) = (1.5)^2 \times \frac{1}{2.5} = 2.25 \times 0.4 = 0.9 \]

Likelihood Ratio Test

\[ \ell(\hat{\lambda}) = -5(2.6) + 13\log(2.6) - \sum \log(x_i!) = -13 + 13(0.9555) - \log(1728) \approx -8.0335 \]

\[ \ell(\lambda_0) = -5(2) + 13\log(2) - \sum \log(x_i!) = -10 + 13(0.6931) - 7.455 \approx -8.4447 \]

\[ \Lambda = 2[-8.0335 - (-8.4447)] = 2(0.4112) = 0.8224 \]

Conclusion All test statistics are approximately between 0.7 and 0.9. Since all three test statistics are less than the critical value \(\chi^2_{1,0.05} = 3.841\), we fail to reject \(H_0\).

R Verification

# Poisson example
x <- c(3, 1, 4, 2, 3)
n <- length(x)

# MLE
lambda_mle <- mean(x)

# Wald test
se_wald <- sqrt(lambda_mle/n)
W <- ((lambda_mle - 2)/se_wald)^2

# Score test
U <- -n + sum(x)/2
I <- n/2
S <- U^2/I

# Likelihood ratio test
ll_mle <- sum(dpois(x, lambda_mle, log = TRUE))
ll_null <- sum(dpois(x, 2, log = TRUE))
LR <- 2*(ll_mle - ll_null)

cat("  Wald:", W, "p-value:", 1-pchisq(W, 1), "\n",
    "Score:", S, "p-value:", 1-pchisq(S, 1), "\n",
    "   LR:", LR, "p-value:", 1-pchisq(LR, 1), "\n")
  Wald: 0.6923077 p-value: 0.4053806 
 Score: 0.9 p-value: 0.3427817 
    LR: 0.8214709 p-value: 0.3647505 

4.2 Example 2: Normal Distribution Mean Test

Problem Test \(H_0: \mu = 10\) vs \(H_1: \mu \neq 10\) for \(X_i \sim N(\mu, \sigma^2)\), where \(\sigma^2 = 0.1\).

Data: 10.2, 9.8, 10.0, 10.2, 9.9.

Manual Solution

  • Likelihood: \(L(\mu) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right\}\).

  • Log-likelihood: \(\ell(\mu) = \log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\)

  • Score Equation: \(S(\mu) = \frac{\partial \ell(\mu)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)\)

  • MLE \(\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{10.2+9.8+10.0+10.2+9.9}{5} = 10.02\)

  • Information \(I(\mu) = -\frac{\partial^2 \ell(\mu)}{\partial \mu^2} = \frac{n}{\sigma^2} = \frac{5}{0.1} = 50\)

With the above manual work, we next define the three test statistic:

Wald Test

\[ W = (\hat{\mu} - \mu_0)^2 I(\hat{\mu}) = (0.02)^2 \times 50 = 0.0004 \times 50 = 0.02 \]

Score Test

\[ U(\mu) = \frac{n}{\sigma^2}(\bar{x} - \mu) \]

At \(\mu_0=10\): \(U(10) = 50 \times (10.02 - 10) = 1\)

\[ S = U(10)^2 I^{-1}(10) = 1^2 \times \frac{1}{50} = 0.02 \]

Likelihood Ratio Test

\[ \ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i - \mu)^2 \]

\[ \ell(\mu_0) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i - \mu_0)^2 \]

\[ \ell(\hat{\mu}) - \ell(\mu_0) = -\frac{1}{2\sigma^2}[\sum(x_i - \hat{\mu})^2 - \sum(x_i - \mu_0)^2] \]

\[ = -\frac{1}{0.2}[0.13 - 0.128] = -5 \times 0.002 = -0.01 \]

\[ \Lambda = 2 \times 0.01 = 0.02. \]


R Verification

# Normal example
x <- c(10.2, 9.8, 10.0, 10.2, 9.9)
n <- length(x)
sigma2 <- 0.1
mu0 <- 10

mu_mle <- mean(x)

# Wald test
W <- (mu_mle - mu0)^2 * (n/sigma2)   # Wald test statistic

# Score test
U <- (n/sigma2) * (mu_mle - mu0)
I <- n/sigma2
S <- U^2/I    # Rao test statistic

# Likelihood ratio test
ll_mle <- sum(dnorm(x, mu_mle, sqrt(sigma2), log = TRUE))  # log = TRUE yields log-likelihood
ll_null <- sum(dnorm(x, mu0, sqrt(sigma2), log = TRUE))
LR <- 2*(ll_mle - ll_null)    # log-likelihood ratio test statistic

cat("  Wald:", W, "p-value:", 1-pchisq(W, 1), "\n",
    "Score:", S, "p-value:", 1-pchisq(S, 1), "\n",
    "   LR:", LR, "p-value:", 1-pchisq(LR, 1), "\n")
  Wald: 0.02 p-value: 0.8875371 
 Score: 0.02 p-value: 0.8875371 
    LR: 0.02 p-value: 0.8875371 


5 Practical Example: Multiple Parameters

In the single-parameter case, the Fisher information number is relatively straightforward to derive. For multi-parameter cases, however, we need to obtain the Fisher information matrix in order to conduct both the Wald and score tests. This section uses the Weibull distribution as an example to illustrate the procedure for the three chi-square tests.

5.1 Components for the Three \(\chi^2\) Tests

In this subsection, we use the gamma distribution as an example to illustrate how to test the significance of parameters. Since the gamma distribution is commonly expressed in two distinct formulations, we adopt the following parameterization.

The gamma distribution with shape parameter \(\alpha > 0\) and rate parameter \(\beta > 0\) has probability density function:

\[ f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0 \]

The case where the shape parameter \(\alpha = 1\) reduces the gamma to an exponential distribution. Thus, testing this null hypothesis indicates whether the gamma distribution provides a significantly better fit or leads to overfitting for the given data.

Assume \(\{x_1, x_2, \cdots, x_n\} \xrightarrow{i.i.d} \text{gamma}(\alpha, \beta)\). \(\psi(\alpha)\) denotes the digamma function

\[ \psi(z) = \frac{d}{dz} \ln \Gamma(z) = \frac{\Gamma'(z)}{\Gamma(z)}. \]

R functions digamma(z), trigamma(z), tetragamma(z) evaluate \(\psi(x), \psi^\prime (z)\) and \(\psi^{\prime\prime}(z)\), respectively.

The objective is to test \(H_0: \alpha = 1\). The major components required for testing the hypothesis are given below.

Log-likelihood

\[ \ell(\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n \log x_i - \beta\sum_{i=1}^n x_i \]

Score Equations

\[ \frac{\partial\ell}{\partial\alpha} = n\log\beta - n\psi(\alpha) + \sum_{i=1}^n \log x_i = 0 \]

\[ \frac{\partial\ell}{\partial\beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^n x_i = 0 \]

Maximum Likelihood Estimation

The maximum likelihood estimates of \(\alpha\) and \(\beta\), denoted by \(\hat{\alpha}\) and \(\hat{\beta}\), are the solutions to the above system of score equations.


Information Matrix

\(\mathcal{I}_n(2,2)\) is the (2,2) element of the Fisher information matrix of \(\alpha\) and \(\beta\) based on entire sample with size \(n\):

\[ \mathcal{I}_n(\alpha, \beta) = - E\begin{pmatrix} \frac{\partial^2\ell}{\partial \alpha^2} & \frac{\partial^2\ell}{\partial\alpha\partial\beta} \\ \frac{\partial^2\ell}{\partial\beta\partial\alpha}& \frac{\partial^2\ell}{\partial\beta^2} \end{pmatrix} = -\begin{pmatrix} n\psi'(\alpha) & \frac{n}{\beta} \\ \frac{n}{\beta} & \frac{n\alpha}{\beta^2} \end{pmatrix} \]

Remark: For the likelihood ratio test, we must obtain both the restricted and unrestricted maximum likelihood estimates (MLEs). The restricted MLE is derived under the null hypothesis \(H_0: \alpha = 1\) (i.e., \(\alpha\) fixed to a constant), while the unrestricted MLE treats both \(\alpha\) and \(\beta\) as unknown parameters. In general, these MLEs must be obtained using numerical approximation methods.

5.2 Numerical Implementation

The dataset used in this example contains measurements of microtubule catastrophe times (the time until a microtubule stops growing). In their Cell paper (2011), Gardner, Zanic, and colleagues (Gardner, Zanic, et al. (2011). Cell, 147(5), 1092-1103.) explicitly modeled these catastrophe times using a gamma distribution. For illustrative purpose, we only use complete times in this example.

We first load the data into R and then follow the steps above to perform the three likelihood-based \(\chi^2\) tests.

time2event <- read.table("https://pengdsci.github.io/STA506/w11/Time2Catastrophe.txt")
colnames(time2event) <- "time"
x <- time2event$time

Loglikelihood function

The following R function evaluate the loglikelihood function and will be used to find the unrestricted MLE of \(\alpha\) and \(\beta\).

# loglikelihood function
log_likelihood <- function(param) {
  alpha <- param[1] 
  beta <- param[2]
  n <- length(x)
  sum_x <- sum(x)
  sum_logx <- sum(log(x))
  ##
  ll <- n * alpha * log(beta) - n * log(gamma(alpha)) + 
         (alpha - 1) * sum_logx - beta * sum(x)
  ll
}

Score Equations: The following R function returns a vector score functions, computed from the data values and parameter estimates.

# Score equations (should be zero at solution)
score_equ <-function(par){
    alpha <- par[1]
    beta <- par[2]
    ##
    n <- length(x)
    sum_x <- sum(x)
    sum_logx <- sum(log(x))
    ##
    score_alpha <- n * log(beta) - n * digamma(alpha) + sum_logx
    score_beta <- n * alpha / beta - sum_x
    c(score_alpha, score_beta)
  }

Fisher Information Matrix is coded in the following

FisherInfo <- function(par){
    # Fisher information matrix (negative expected Hessian)
    # Used for Newton-Raphson update
    # It contains only parameters
    alpha <- par[1]
    beta <- par[2]
    
    # Fisher Info cell
    I_11 <- n * trigamma(alpha)
    I_12 <- -n / beta
    I_22 <- n * alpha / beta^2
    
    # Information matrix
    Fisher <- matrix(c(I_11, I_12, I_12, I_22), nrow = 2, byrow = TRUE)
    Fisher   
}

5.3 Maximum Likelihood Estimation

Instead of relying on special built-in R functions that compute maximum likelihood estimates (MLEs) for specific distribution families, we introduce a general purpose optimization function optim() for obtaining MLEs for arbitrary distributions. Specifically, we will derive the MLEs of the parameters \(\alpha\) and \(\beta\), along with the corresponding Fisher information matrix, which will be used in the Score and Wald tests.

optim() is a wrapper function that provides access to several optimization algorithms. The specific numerical method invoked internally depends on the information supplied to the function. In practice, a popular variant of the Newton–Raphson algorithm known as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is frequently employed. BFGS requires the gradient (score vector) to efficiently guide the search toward the optimal solution. The following R code illustrates how to use the BFGS method via optim().

 x <- time2event$time
 n = length(x)
 # initial values
###
 result <- optim(
          par = c(100,0.1), 
          #need to provide initial values to start the iteration.
                                    # Choosing appropriate initial values is critical.
          fn = log_likelihood,      # Caution: need negative log-likelihood
          gr = score_equ,           # also nee negative score
          method = "BFGS",          # calling BFGS algorithm
          hessian = TRUE,           # return Hessian matrix
          control = list(           # some controls in numerical iteration
                fnscale = -1,       # turn the minimization to maximization problem
                maxit = 1000,       # stopping rule 1: cap number of iterations
                reltol = 1e-8,      # stopping rule: precision control
                REPORT = 10         # The algorithm will print progress updates 
                                    # every 10 iterations
          )
  )
 result
$par
[1] 2.407544212 0.005462866

$value
[1] -1458.997

$counts
function gradient 
      91       20 

$convergence
[1] 0

$message
NULL

$hessian
           [,1]         [,2]
[1,]  -108.2844     38844.57
[2,] 38844.5676 -17612367.12

Double check with the hessian matrix using the derived formula

Fisher <- FisherInfo(result$par)
Fisher
            [,1]        [,2]
[1,]    108.2844   -38624.41
[2,] -38624.4144 17022197.79

Which is consistent with the hessian matrix in the output of optim().

The variance and covariance matrix can be approximated by inverting the Fisher Information Matrix or the inverse of the negative Hessian matrix returned from optim().

varcov <- solve(-result$hessian)   # solve() is a matrix inversion function
varcov
             [,1]         [,2]
[1,] 4.422498e-02 9.753942e-05
[2,] 9.753942e-05 2.719042e-07

That is, \(\text{var}(\alpha) = 4.42\times 10^{-2}\) and \(\text{var}(\beta) = 2.72\times 10^{-7}\).

5.4 Two \(\chi^2\) Tests

The three \(\chi^2\) tests are given respectively in the following.

5.4.1 Likelihood Ratio Test

We have found the un-restrictive MLEs of \(\alpha\) and \(\beta\):

MLE <- result$par
MLE
[1] 2.407544212 0.005462866

We now find restrictive MLE of \(\beta\) under \(H_0: \alpha = 1\). The optimization problems becomes single parameter problem.

##
loglik <- function(beta) n*log(beta)-n -beta*sum(x)
##
score <- function(beta) n/beta - sum(x)
##
 x <- time2event$time
 n = length(x)
 result0 <- optim(
          par = 0.05,             # need to provide initial values to start the iteration.
                                    # Choosing appropriate initial values is critical.
          fn = loglik,              # Caution: need negative log-likelihood
          gr = score,               # also nee negative score
          method = "BFGS",          # calling BFGS algorithm
          hessian = TRUE,           # return Hessian matrix
          control = list(           # some controls in numerical iteration
                fnscale = -1,       # turn the minimization to maximization problem
                maxit = 1000,       # stopping rule 1: cap number of iterations
                reltol = 1e-8,      # stopping rule: precision control
                REPORT = 10         # The algorithm will print progress updates 
                                    # every 10 iterations
          )
  )
 result0
$par
[1] 0.002269067

$value
[1] -1706.65

$counts
function gradient 
      42        7 

$convergence
[1] 0

$message
NULL

$hessian
          [,1]
[1,] -50859711

That is the restrictive MLE of \(\beta\), denoted by \(\hat{\beta}=0.002269067\).

The likelihood ratio test statistic is

\[ LR = −2[\ell(\hat{\alpha}, \hat{\beta})−\ell(1,\hat{\beta})] \to \chi^2_1. \]

The R code that evaluates the above test statistic is given by

 x <- time2event$time
 n = length(x)
 ##
 LR = -2*(loglik(result0$par) - log_likelihood(result$par))
 LR
[1] 495.3061

Since the chi-square test is always right-tailed, the p-value based on the test statistic above is given by \(p = P(\chi^2_1 > LR)\). The p-value can be obtained using the following R code.

pval <- 1 - pchisq(LR, df = 1)
pval
[1] 0

The p-value is nearly zero, leading to the rejection of the null hypothesis \(H_0: \alpha = 1\).

5.4.2 Wald \(\chi^2\) Test

The Wald \(\chi^2\) test statistics for the null hypothesis \(H_0: \alpha = 1\) is given by

\[ W_{\text{Wald}} = [\frac{\hat{\alpha} - 1}{\text{se}(\hat{\alpha})}]^2 = \frac{(\hat{\alpha}-1)^2}{\text{var}(\hat{\alpha})} \to \chi^2_1 \]

where \(\text{Var}(\hat{\alpha})\) is the diagonal element of the inverse of the observed Fisher information matrix (i.e., the variance–covariance matrix). This means that the Fisher information matrix is required for the Wald test.

Next, we define the test statistic and calculate the p-value using the following R code

alpha.hat  <- result$par[1]     # based on the unrestricted likelihood estimation
se.alpha <- sqrt(varcov[1, 1])  # the square root of the top-left diagonal value
W <- (alpha.hat - 1)^2/(se.alpha)^2   # Wald test statistic
##
pval.W <- 1 - pchisq(W, df = 1)
c(W = W, pval.W = pval.W)
           W       pval.W 
4.479778e+01 2.184708e-11 

The p-value is also near 0 implying the rejection of the null hypothesis.


6 Performance Comparison

At the heart of modern statistical inference lies a powerful trio: the Likelihood Ratio, Wald, and Score tests. Each provides a distinct lens for testing hypotheses, transforming the same likelihood function into a chi-squared statistic through different logical and geometrical principles. Their asymptotic equivalence provides a reassuring theoretical foundation, yet their finite-sample divergence reveals a rich landscape of statistical trade-offs, where the choice of test balances considerations of accuracy, convenience, invariance, and computational burden.

Core Performance Characteristics

  • Likelihood Ratio Test (LRT) - best overall performance in most situations
    • Accuracy: Most reliable in finite samples, closest to nominal Type I error rates
    • Robustness: Invariant to parameter transformations (unique advantage)
    • Conservatism: Slightly conservative in very small samples
    • Computation: Most intensive (requires fitting both models)
  • Wald Test - fastest but least reliable in non-ideal conditions
    • Accuracy: Can be poor in small samples, especially with:
      • Skewed likelihoods
      • Parameters near boundaries (variances, probabilities near 0/1)
      • Nonlinear parameters
    • Bias: Often anti-conservative (inflates Type I error)
    • Computation: Fastest (only needs full model)
  • Score Test (Rao)- efficient compromise with specific strengths
    • Accuracy: Good when null hypothesis is true, but can be inconsistent when alternative is true
    • Power: Sometimes less powerful than LRT in small samples
    • Computation: Moderate (only needs null model)
    • Special strength: Best for boundary testing and initial model building

General Recommendations

  • Use the LRT when:
    • Computational cost is not a concern.
    • Sample size is small to moderate.
    • when comparing nested models of general types (GLMs, mixed models, etc.).
  • Use the Wald Test when:
    • only having the unrestricted model output (e.g., from a standard regression summary).
    • Doing quick approximate inference or constructing confidence intervals.
    • The sample size is large and the likelihood is well-behaved (\(\approx\) quadratic).
    • Caution: Can be misleading for parameters like odds ratios (always exponentiate first, then form Wald CI on log scale).
  • Use the Score Test when:
    • The null model is simpler to fit (e.g., testing for an additional regressor in a large data set; only fit the reduced model).
    • In model checking and diagnostics (e.g., residuals tests).
    • When the MLE under the alternative is hard to obtain or at a boundary.

Logical Paradoxes and Resolutions

  • Paradox 1: “Wald can reject when LRT doesn’t”
    • Occurs when: Likelihood is highly non-quadratic
    • Logical explanation: Wald assumes symmetric CI based on curvature at MLE; LRT uses actual likelihood shape
    • Resolution: Trust LRT—it respects the true likelihood geometry
  • Paradox 2: “Score test more powerful than LRT”
    • Occurs when: Parameter at boundary, small samples
    • Logical explanation: Score uses exact null distribution; LRT uses asymptotic \(\chi^2\) approximation
    • Resolution: Use exact distributions or corrected LRT (50:50 mixture \(\chi^2\))
  • Paradox 3: “Tests disagree asymptotically”
    • Logical implication: Likely model misspecification or computational error
    • Resolution: Check model assumptions and numerical stability
---
title: "Wald, Score and Likelihood Ratio Tests"
author: "Cheng Peng"
date: "West Chester University"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("psych")) {
  install.packages("psych")
  library(psych)
}
if (!require("RColorBrewer")) {
  install.packages("RColorBrewer")
  library(RColorBrewer)
}

if (!require("boot")) {
  install.packages("boot")
  library(boot)
}
if (!require("effsize")) {
  install.packages("effsize")
  library(effsize)
}
## library(effsize)
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```

\

# Introduction


This module focuses on the three most commonly used statistical tests in subsequent courses: the Wald, Score, and Likelihood Ratio tests. These three methods form the cornerstone of inference in parametric models, providing powerful and versatile approaches for testing hypotheses, constructing confidence intervals, and assessing model fit. 

While rooted in the elegant framework of maximum likelihood estimation, each test offers a distinct geometric or algebraic perspective on the evidence against a null hypothesis. Understanding their theoretical foundations, comparative strengths, and intrinsic relationships is essential for mastering advanced statistical methodology and applying it correctly to complex data.

In this module, we will state formulation, assumptions, steps to perform these tests and explained by numerical examples with manual solution and verification using R.


# Review of Likelihood Formulation and Estimation

In maximum likelihood estimation, we fit models by finding parameters that make the observed data most probable. This process naturally leads to three related methods for testing hypotheses such as the likelihood ratio, Wald, and score tests. Each test uses a different feature of the likelihood function, but all are rooted in the same principles of likelihood and asymptotic theory. This section reviews basics of maximum likelihood estimation and their asymptotic normality.


**Likelihood Function**

Let $X_1, X_2, \dots, X_n$ be independent and identically distributed (i.i.d) random variables with probability density (or mass) function $f(x_i; \boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)^T$ is a $p$-dimensional parameter vector. The likelihood function is:

$$
L(\boldsymbol{\theta}) = \prod_{i=1}^n f(x_i; \boldsymbol{\theta})
$$

The log-likelihood function is:

$$
\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^n \log f(x_i; \boldsymbol{\theta})
$$

**Maximum Likelihood Estimation**

The maximum likelihood estimator (MLE) $\hat{\boldsymbol{\theta}}$ satisfies:

$$
\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})
$$

Under regularity conditions, $\hat{\boldsymbol{\theta}}$ is obtained by solving the score equations:

$$
U(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \boldsymbol{0}
$$


**Fisher Information Matrix**

The Fisher information matrix of $\theta$ based on an entire sample with size $n$ is given by:

$$
I_n(\boldsymbol{\theta}) = E\left[-\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}\right]
$$

The above (theoretical) Fisher information matrix requires integral that integrates variables out from the expression. In practice, we use the following observed information matrix:

$$
J_n(\boldsymbol{\theta}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}
$$

Some software programs return Hessian matrix $H_n( \boldsymbol{\theta}) = -J_n( \boldsymbol{\theta})$, that is

$$
H_n(\boldsymbol{\theta}) = \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}
$$

Note that, in some textbooks, research articles and web postings, the **Fisher Information Matrix** based on single data point is denoted by $I_0(\boldsymbol{\theta})$,

$$
I_n(\boldsymbol{\theta}) = n\times I_1(\boldsymbol{\theta})
$$

Under some regularity conditions and the null hypothesis $H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$, we have the following 

$$
\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} N(\boldsymbol{0}, I_1^{-1}(\boldsymbol{\theta}_0)).
$$


The relationship between **Fisher Information Matrix** of $\boldsymbol{\theta}$ and the variance of the MLE $\hat{\boldsymbol{\theta}}$ is

$$
\text{var}(\hat{\boldsymbol{\theta}}) \approx I_n^{-1}(\boldsymbol{\theta}) = [n\times I_1(\boldsymbol{\theta})]^{-1} = \frac{I_1^{-1}(\boldsymbol{\theta})}{n}
$$
Therefore, we can rewrite the above asymptotic distribution of $\boldsymbol{\theta}$ as

$$
\hat{\boldsymbol{\theta}}\to N\left( \boldsymbol{\theta}, \frac{I_1^{-1}(\boldsymbol{\theta})}{n} \right) \quad \text{or equivalently} \quad \hat{\boldsymbol{\theta}}\to N\left( \boldsymbol{\theta}, I_n^{-1}(\boldsymbol{\theta}) \right)
$$

The diagonal elements in the covariance $I_n^{-1}(\boldsymbol{\theta})$ represent the variance of the corresponding parameters in $\hat{\boldsymbol{\theta}} = (\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta_k})$. For example. $\text{var}(\hat{\theta}_1) = $ the first element in the main diagonal of the inverse of the fisher information matrix based on the entire sample.  

For example, let $\boldsymbol{\theta} =(\theta_1, \theta_2)$, the corresponding MLE is $\hat{\boldsymbol{\theta}} =(\hat{\theta_1}, \hat{\theta_2)}$. Let the **observed Fisher information matrix** be

$$
J_n(\boldsymbol{\theta}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}=-\begin{pmatrix}
\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_1\partial \theta_2} \\
\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_2}
\end{pmatrix}
$$

The estimated Fisher information matrix is given by

$$
\widehat{J_n(\boldsymbol{\theta})} = -\begin{pmatrix}
\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_1\partial \theta_2} \\
\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_2}
\end{pmatrix}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}.
$$

Note that R function `optim()` returns Hessian matrix $\widehat{H_n(\boldsymbol{\theta})}=- \widehat{J_n(\boldsymbol{\theta})}$.

Let the inverse of $widehat{J_n(\boldsymbol{\theta})}$ be of the following form

$$
\left[\widehat{J_n(\boldsymbol{\theta})}\right]^{-1} = -\left[\begin{pmatrix}
\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_1\partial \theta_2} \\
\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial^2 \theta_2}
\end{pmatrix}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}\right]^{-1}
=\begin{pmatrix}
a & c \\
c & d
\end{pmatrix}.
$$

Then we have

$$
\widehat{\text{var}(\hat{\boldsymbol{\theta})}} \approx \begin{pmatrix}
a & c \\
c & d
\end{pmatrix}.
$$

That is,

$$
\text{var}(\hat{\theta}_1) \approx a, \quad  \text{var}(\hat{\theta}_2) \approx d, \quad \text{and} \quad  \text{cov}(\hat{\theta}_1, \hat{\theta}_2) \approx c.
$$

The above asymptotic normality is the foundation of the Wald and Score tests.


\

# The Three Asymptotic $\chi^2$ Tests

At the foundation of parametric hypothesis testing lies a trio of asymptotically equivalent methods: the likelihood ratio, Wald, and score tests. All three derive from maximum likelihood theory and converge to the same chi-squared distribution under the null hypothesis, yet they approach the problem of testing from meaningfully different angles.

The **likelihood ratio test** directly compares the goodness-of-fit between the unrestricted model and the model constrained by the null hypothesis. The **Wald test** evaluates the distance between the estimated parameter vector and its hypothesized value, standardized by the estimated curvature of the likelihood. The **score test**, alternatively, examines whether the slope of the log-likelihood at the null parameter value is sufficiently close to zero—an indication that the null point is a plausible maximum.

Despite their asymptotic equivalence, these tests differ in computation, finite-sample performance, and invariance properties, leading to practical trade-offs in applied work. The following sub-sections develop their formulations, contrasts their theoretical and practical attributes, and provide guidance for choosing among them in statistical modeling.



## Wald Test

**Formulation** Tests the distance between the **unconstrained MLE** and the **hypothesized value**. It measures **geometric displacement** from $H_0$ in parameter space. To make it simple, let's $\boldsymbol{\theta} = (\theta_1, \theta_2)$ be the vector of the parameters of a distribution. Consider hypothesis 

$$
H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0, \quad \text{i.e.}\quad \begin{pmatrix}
\theta_1  \\
\theta_2 
\end{pmatrix}
=
\begin{pmatrix}
\theta_{10}  \\
\theta_{20} 
\end{pmatrix}.
$$

The difference between the **hypothesized value** and the **unrestricted MLE** is given by

$$
D = \hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0 = \begin{pmatrix}
\hat{\theta}_1 -\theta_{10}  \\
\hat{\theta}_2 -\theta_{20}
\end{pmatrix}
$$


**Test Statistic**

With the above notations, the Wald test statistic is given by

$$
W = \hat{D}^{T} \, \left[ \widehat{\text{var}(\hat{\boldsymbol{\theta}})}\right]^{-1} \, \hat{D}  = \hat{D}^{T} \,\text{I}_n(\hat{\boldsymbol{\theta}}) \, \hat{D} 
$$

More explicitly, we have

$$
W = 
\begin{pmatrix} 
\hat{\theta}_1 - \theta_{10} & \hat{\theta}_2 - \theta_{20}
\end{pmatrix}
\begin{pmatrix}
\text{var}(\hat{\theta}_1) & \text{cov}(\hat{\theta}_1, \hat{\theta}_2) \\
\text{cov}(\hat{\theta}_1, \hat{\theta}_2) & \text{var}(\hat{\theta}_2)
\end{pmatrix}^{-1}
\begin{pmatrix}
\hat{\theta}_1 - \theta_{10} \\
\hat{\theta}_2 - \theta_{20}
\end{pmatrix}.
$$



For testing single parameter case $H_0: \boldsymbol{\theta}_1 = \boldsymbol{\theta}_{10}$


$$
W = (\hat{\theta} - \theta_0)[\text{var}(\hat{\theta}_1)]^{-1}(\hat{\theta} - \theta_0) = \frac{(\hat{\theta} - \theta_0)^2}{\text{var}(\hat{\theta}_1)}
$$



**Asymptotic Distribution**: $W \xrightarrow{d} \chi^2_q$ under $H_0$,  where $q$ is the number of parameters specified under the null hypothesis.

In the two-parameter and one-parameter cases described above, the degrees of freedom of the $\chi^2$ test are $q = 2$ and $q=1$, respectively.



**Assumptions**: Wald test requires only the unconstrained MLE (as well as Information matrix to derive the covariance matrix of the MLE) and is sensitive to parameterization.



## Score Test (Rao's Score Test)

Rao's score is also known as the **Lagrange multiplier test**.

**Formulation**: Tests whether the score at the null hypothesis is significantly different from zero. It measures local gradient at $H_0$ in likelihood space.

**Test Statistic**:

$$
S = U(\tilde{\boldsymbol{\theta}})^T I_n^{-1}(\tilde{\boldsymbol{\theta}}) U(\tilde{\boldsymbol{\theta}})
$$
where $\tilde{\boldsymbol{\theta}}$ is the MLE under $H_0$ (constrained MLE).

**Asymptotic Distribution**: $S \xrightarrow{d} \chi^2_q$ under $H_0$.

**Assumptions**: Requires only the constrained MLE. Invariant to reparameterization.


## Likelihood Ratio Test

**Formulation**: Compares the maximized log-likelihood under $H_0$ and $H_1$. It measures information loss when imposing $H_0$.

**Test Statistic**:

$$
\Lambda = -2[\ell(\tilde{\boldsymbol{\theta}}) - \ell(\hat{\boldsymbol{\theta}})] = 2[\ell(\hat{\boldsymbol{\theta}}) - \ell(\tilde{\boldsymbol{\theta}})]
$$

**Asymptotic Distribution**: $\Lambda \xrightarrow{d} \chi^2_q$ under $H_0$.

**Assumptions**: Requires both constrained and unconstrained MLEs. Invariant to reparameterization.

\



# Practical Examples: Single Parameter

To concretely demonstrate the similarities and differences among the three likelihood-based tests, we now turn to practical implementation in R. While these tests often yield similar conclusions with large samples, their distinctions become particularly instructive when examining simple yet fundamental distributions. 

Through hands-on examples using Poisson and normal distributions, we will systematically apply the Likelihood Ratio, Wald, and Score tests to identical hypotheses. This comparative approach will illuminate how each test statistic is calculated, showcase scenarios where their results diverge meaningfully, and provide reusable code templates that reveal the computational machinery underlying common statistical procedures. 

<font color = "blue">**Unlike t-tests and z-tests which have closed-form formulas, likelihood-based chi-squared tests require explicit derivation of the likelihood function, parameter estimation under null/alternative, and careful construction of the test statistics. This is why they're more flexible but require more preparation.**</font> The key preparation steps required

* Derive log-likelihood function $\ell(\theta; x)$
* Derive score function $U(\theta) = \partial \ell/\partial \theta$
* Derive Fisher Information $I_n(\theta) = -E[\partial^2 \ell/\partial \theta^2]$
* Find MLEs under both $H_0$ and $H_1$
* Compute standard errors from inverse Fisher Information
* Construct test statistics using appropriate formulas

<font color = "red">**Note that likelihood-based inferences rely on large-sample asymptotic theory. For the purpose of clearly illustrating the procedure, the examples in the following subsections will use small toy datasets.**</font>


## Example 1: Poisson Distribution

**Problem**: Test $H_0: \lambda = 2$ vs $H_1: \lambda \neq 2$ for $X_1, \dots, X_n \sim \text{Poisson}(\lambda)$. 

**Observed data**: 3, 1, 4, 2, 3.

**Manual Solution**:

* **Likelihood**: $L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}$
    
* **Log-likelihood**: $\ell(\lambda) = -n\lambda + (\sum x_i)\log\lambda - \sum \log(x_i!)$
    
* **MLE**: $\hat{\lambda} = \bar{x} = \frac{3+1+4+2+3}{5} = 2.6$
    
* **Score function**: $U(\lambda) = \frac{d\ell}{d\lambda} = -n + \frac{\sum x_i}{\lambda}$
    
* **Information**: $I_n(\lambda) = \frac{n}{\lambda}$ (expected), $J_n(\lambda) = \frac{\sum x_i}{\lambda^2}$ (observed)


**Wald Test**

$$
W = (\hat{\lambda} - \lambda_0)^2 \hat{I}_n(\hat{\lambda}) = (2.6 - 2)^2 \times \frac{5}{2.6} = 0.36 \times 1.923 = 0.692
$$
Using observed information: $W = (0.6)^2 \times \frac{13}{2.6^2} = 0.36 \times 1.923 = 0.692$

**Score Test**

Score at $\lambda_0=2$: $U(2) = -5 + \frac{13}{2} = 1.5$

$$
I(2) = \frac{5}{2} = 2.5
$$

$$
S = U(2)^2 I_n^{-1}(2) = (1.5)^2 \times \frac{1}{2.5} = 2.25 \times 0.4 = 0.9
$$

**Likelihood Ratio Test**

$$
\ell(\hat{\lambda}) = -5(2.6) + 13\log(2.6) - \sum \log(x_i!) = -13 + 13(0.9555) - \log(1728) \approx -8.0335
$$

$$
\ell(\lambda_0) = -5(2) + 13\log(2) - \sum \log(x_i!) = -10 + 13(0.6931) - 7.455 \approx -8.4447
$$

$$
\Lambda = 2[-8.0335 - (-8.4447)] = 2(0.4112) = 0.8224
$$

**Conclusion** All test statistics are approximately between 0.7 and 0.9.  Since all three test statistics are less than the critical value $\chi^2_{1,0.05} = 3.841$, we fail to reject $H_0$.


**R Verification**

```{r}
# Poisson example
x <- c(3, 1, 4, 2, 3)
n <- length(x)

# MLE
lambda_mle <- mean(x)

# Wald test
se_wald <- sqrt(lambda_mle/n)
W <- ((lambda_mle - 2)/se_wald)^2

# Score test
U <- -n + sum(x)/2
I <- n/2
S <- U^2/I

# Likelihood ratio test
ll_mle <- sum(dpois(x, lambda_mle, log = TRUE))
ll_null <- sum(dpois(x, 2, log = TRUE))
LR <- 2*(ll_mle - ll_null)

cat("  Wald:", W, "p-value:", 1-pchisq(W, 1), "\n",
    "Score:", S, "p-value:", 1-pchisq(S, 1), "\n",
    "   LR:", LR, "p-value:", 1-pchisq(LR, 1), "\n")
```


## Example 2: Normal Distribution Mean Test

**Problem** Test $H_0: \mu = 10$ vs $H_1: \mu \neq 10$ for $X_i \sim N(\mu, \sigma^2)$, where $\sigma^2 = 0.1$. 

**Data**: 10.2, 9.8, 10.0, 10.2, 9.9.

**Manual Solution**

* **Likelihood**: $L(\mu) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right\}$.

* **Log-likelihood**: $\ell(\mu) = \log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$

* **Score Equation**: $S(\mu) = \frac{\partial \ell(\mu)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)$

* **MLE** $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{10.2+9.8+10.0+10.2+9.9}{5} = 10.02$
    
* **Information** $I(\mu) = -\frac{\partial^2 \ell(\mu)}{\partial \mu^2}  = \frac{n}{\sigma^2} = \frac{5}{0.1} = 50$


With the above manual work, we next define the three test statistic:


**Wald Test**

$$
W = (\hat{\mu} - \mu_0)^2 I(\hat{\mu}) = (0.02)^2 \times 50 = 0.0004 \times 50 = 0.02
$$

**Score Test**

$$
U(\mu) = \frac{n}{\sigma^2}(\bar{x} - \mu)
$$

At $\mu_0=10$: $U(10) = 50 \times (10.02 - 10) = 1$

$$
S = U(10)^2 I^{-1}(10) = 1^2 \times \frac{1}{50} = 0.02
$$

**Likelihood Ratio Test**

$$
\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i - \mu)^2
$$

$$
\ell(\mu_0) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i - \mu_0)^2
$$


$$
\ell(\hat{\mu}) - \ell(\mu_0) = -\frac{1}{2\sigma^2}[\sum(x_i - \hat{\mu})^2 - \sum(x_i - \mu_0)^2]
$$

$$
= -\frac{1}{0.2}[0.13 - 0.128] = -5 \times 0.002 = -0.01
$$

$$
\Lambda = 2 \times 0.01 = 0.02.
$$

\

**R Verification**

```{r}
# Normal example
x <- c(10.2, 9.8, 10.0, 10.2, 9.9)
n <- length(x)
sigma2 <- 0.1
mu0 <- 10

mu_mle <- mean(x)

# Wald test
W <- (mu_mle - mu0)^2 * (n/sigma2)   # Wald test statistic

# Score test
U <- (n/sigma2) * (mu_mle - mu0)
I <- n/sigma2
S <- U^2/I    # Rao test statistic

# Likelihood ratio test
ll_mle <- sum(dnorm(x, mu_mle, sqrt(sigma2), log = TRUE))  # log = TRUE yields log-likelihood
ll_null <- sum(dnorm(x, mu0, sqrt(sigma2), log = TRUE))
LR <- 2*(ll_mle - ll_null)    # log-likelihood ratio test statistic

cat("  Wald:", W, "p-value:", 1-pchisq(W, 1), "\n",
    "Score:", S, "p-value:", 1-pchisq(S, 1), "\n",
    "   LR:", LR, "p-value:", 1-pchisq(LR, 1), "\n")
```

\

# Practical Example: Multiple Parameters

In the single-parameter case, the Fisher information number is relatively straightforward to derive. For multi-parameter cases, however, we need to obtain the Fisher information matrix in order to conduct both the Wald and score tests. This section uses the Weibull distribution as an example to illustrate the procedure for the three chi-square tests.

## Components for the Three $\chi^2$ Tests

In this subsection, we use the gamma distribution as an example to illustrate how to test the significance of parameters. Since the gamma distribution is commonly expressed in two distinct formulations, we adopt the following parameterization.

The gamma distribution with shape parameter $\alpha > 0$ and rate parameter $\beta > 0$ has probability density function:

$$
f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0
$$

The case where the shape parameter $\alpha = 1$ reduces the gamma to an exponential distribution. Thus, testing this null hypothesis indicates whether the gamma distribution provides a significantly better fit or leads to overfitting for the given data.

Assume $\{x_1, x_2, \cdots, x_n\} \xrightarrow{i.i.d} \text{gamma}(\alpha, \beta)$. $\psi(\alpha)$ denotes the **digamma function**

$$
\psi(z) = \frac{d}{dz} \ln \Gamma(z) = \frac{\Gamma'(z)}{\Gamma(z)}.
$$

**R functions** `digamma(z)`, `trigamma(z)`, `tetragamma(z)` evaluate $\psi(x), \psi^\prime (z)$ and $\psi^{\prime\prime}(z)$, respectively.


The objective is to test $H_0: \alpha = 1$. The major components required for testing the hypothesis are given below.  

**Log-likelihood**

$$
\ell(\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n \log x_i - \beta\sum_{i=1}^n x_i
$$

**Score Equations**

$$
\frac{\partial\ell}{\partial\alpha} = n\log\beta - n\psi(\alpha) + \sum_{i=1}^n \log x_i = 0
$$

$$
\frac{\partial\ell}{\partial\beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^n x_i = 0
$$

**Maximum Likelihood Estimation**

The maximum likelihood estimates of $\alpha$ and $\beta$, denoted by $\hat{\alpha}$ and $\hat{\beta}$, are the solutions to the above **system** of score equations.

\

**Information Matrix**

$\mathcal{I}_n(2,2)$ is the (2,2) element of the **Fisher information matrix** of $\alpha$ and $\beta$ based on entire sample with size $n$:

$$
\mathcal{I}_n(\alpha, \beta) = - E\begin{pmatrix} \frac{\partial^2\ell}{\partial \alpha^2} & \frac{\partial^2\ell}{\partial\alpha\partial\beta} \\ \frac{\partial^2\ell}{\partial\beta\partial\alpha}& \frac{\partial^2\ell}{\partial\beta^2} \end{pmatrix} = -\begin{pmatrix} n\psi'(\alpha) & \frac{n}{\beta} \\ \frac{n}{\beta} & \frac{n\alpha}{\beta^2} \end{pmatrix}
$$


**Remark**: For the likelihood ratio test, we must obtain both the restricted and unrestricted maximum likelihood estimates (MLEs). The restricted MLE is derived under the null hypothesis $H_0: \alpha = 1$ (i.e., $\alpha$ fixed to a constant), while the unrestricted MLE treats both $\alpha$ and $\beta$ as unknown parameters. In general, these MLEs must be obtained using numerical approximation methods.



## Numerical Implementation

The dataset used in this example contains measurements of microtubule catastrophe times (the time until a microtubule stops growing). In their Cell paper (2011), Gardner, Zanic, and colleagues (Gardner, Zanic, et al. (2011). Cell, 147(5), 1092-1103.) explicitly modeled these catastrophe times using a **gamma distribution**. For illustrative purpose, we only use complete times in this example.

We first load the data into R and then follow the steps above to perform the three likelihood-based $\chi^2$ tests.

```{r}
time2event <- read.table("https://pengdsci.github.io/STA506/w11/Time2Catastrophe.txt")
colnames(time2event) <- "time"
x <- time2event$time
```

**Loglikelihood function** 

The following R function evaluate the loglikelihood function and will be used to find the unrestricted MLE of $\alpha$ and $\beta$. 

```{r}
# loglikelihood function
log_likelihood <- function(param) {
  alpha <- param[1] 
  beta <- param[2]
  n <- length(x)
  sum_x <- sum(x)
  sum_logx <- sum(log(x))
  ##
  ll <- n * alpha * log(beta) - n * log(gamma(alpha)) + 
         (alpha - 1) * sum_logx - beta * sum(x)
  ll
}
```

**Score Equations**: The following R function returns a vector score functions, computed from the data values and parameter estimates.

```{r}
# Score equations (should be zero at solution)
score_equ <-function(par){
    alpha <- par[1]
    beta <- par[2]
    ##
    n <- length(x)
    sum_x <- sum(x)
    sum_logx <- sum(log(x))
    ##
    score_alpha <- n * log(beta) - n * digamma(alpha) + sum_logx
    score_beta <- n * alpha / beta - sum_x
    c(score_alpha, score_beta)
  }
```

**Fisher Information Matrix** is coded in the following

```{r}
FisherInfo <- function(par){
    # Fisher information matrix (negative expected Hessian)
    # Used for Newton-Raphson update
    # It contains only parameters
    alpha <- par[1]
    beta <- par[2]
    
    # Fisher Info cell
    I_11 <- n * trigamma(alpha)
    I_12 <- -n / beta
    I_22 <- n * alpha / beta^2
    
    # Information matrix
    Fisher <- matrix(c(I_11, I_12, I_12, I_22), nrow = 2, byrow = TRUE)
    Fisher   
}
```


## Maximum Likelihood Estimation

Instead of relying on special built-in R functions that compute maximum likelihood estimates (MLEs) for specific distribution families, we introduce a general purpose optimization function `optim()` for obtaining MLEs for arbitrary distributions. Specifically, we will derive the MLEs of the parameters $\alpha$
 and $\beta$, along with the corresponding Fisher information matrix, which will be used in the Score and Wald tests.

`optim()` is a wrapper function that provides access to several optimization algorithms. The specific numerical method invoked internally depends on the information supplied to the function. In practice, a popular variant of the Newton–Raphson algorithm known as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is frequently employed. BFGS requires the gradient (score vector) to efficiently guide the search toward the optimal solution. The following R code illustrates how to use the BFGS method via `optim()`.


```{r}
 x <- time2event$time
 n = length(x)
 # initial values
###
 result <- optim(
          par = c(100,0.1), 
          #need to provide initial values to start the iteration.
                                    # Choosing appropriate initial values is critical.
          fn = log_likelihood,      # Caution: need negative log-likelihood
          gr = score_equ,           # also nee negative score
          method = "BFGS",          # calling BFGS algorithm
          hessian = TRUE,           # return Hessian matrix
          control = list(           # some controls in numerical iteration
                fnscale = -1,       # turn the minimization to maximization problem
                maxit = 1000,       # stopping rule 1: cap number of iterations
                reltol = 1e-8,      # stopping rule: precision control
                REPORT = 10         # The algorithm will print progress updates 
                                    # every 10 iterations
          )
  )
 result
```

Double check with the hessian matrix using the derived formula

```{r}
Fisher <- FisherInfo(result$par)
Fisher
```

Which is consistent with the **hessian matrix** in the output of `optim()`.


The variance and covariance matrix can be approximated by inverting the **Fisher Information Matrix** or the inverse of the negative Hessian matrix returned from `optim()`.


```{r}
varcov <- solve(-result$hessian)   # solve() is a matrix inversion function
varcov
```

That is, $\text{var}(\alpha) = 4.42\times 10^{-2}$ and $\text{var}(\beta) = 2.72\times 10^{-7}$.


## Two $\chi^2$ Tests 

The three $\chi^2$ tests are given respectively in the following.

### Likelihood Ratio Test

We have found the un-restrictive MLEs of $\alpha$ and $\beta$:

```{r}
MLE <- result$par
MLE
```

We now find restrictive MLE of $\beta$  under $H_0: \alpha = 1$. The optimization problems becomes single parameter problem.


```{r}
##
loglik <- function(beta) n*log(beta)-n -beta*sum(x)
##
score <- function(beta) n/beta - sum(x)
##
 x <- time2event$time
 n = length(x)
 result0 <- optim(
          par = 0.05,             # need to provide initial values to start the iteration.
                                    # Choosing appropriate initial values is critical.
          fn = loglik,              # Caution: need negative log-likelihood
          gr = score,               # also nee negative score
          method = "BFGS",          # calling BFGS algorithm
          hessian = TRUE,           # return Hessian matrix
          control = list(           # some controls in numerical iteration
                fnscale = -1,       # turn the minimization to maximization problem
                maxit = 1000,       # stopping rule 1: cap number of iterations
                reltol = 1e-8,      # stopping rule: precision control
                REPORT = 10         # The algorithm will print progress updates 
                                    # every 10 iterations
          )
  )
 result0
```


That is the restrictive MLE of $\beta$, denoted by $\hat{\beta}=0.002269067$.

The likelihood ratio test statistic is

$$
LR = −2[\ell(\hat{\alpha}, \hat{\beta})−\ell(1,\hat{\beta})] \to \chi^2_1.
$$

The R code that evaluates the above test statistic is given by

```{r}
 x <- time2event$time
 n = length(x)
 ##
 LR = -2*(loglik(result0$par) - log_likelihood(result$par))
 LR
```


Since the chi-square test is always right-tailed, the p-value based on the test statistic above is given by $p = P(\chi^2_1 > LR)$. The p-value can be obtained using the following R code.

```{r}
pval <- 1 - pchisq(LR, df = 1)
pval
```

The p-value is nearly zero, leading to the rejection of the null hypothesis $H_0: \alpha = 1$.


### Wald  $\chi^2$  Test

The Wald $\chi^2$ test statistics for the null hypothesis $H_0: \alpha = 1$ is given by

$$
W_{\text{Wald}} = [\frac{\hat{\alpha} - 1}{\text{se}(\hat{\alpha})}]^2 = \frac{(\hat{\alpha}-1)^2}{\text{var}(\hat{\alpha})} \to \chi^2_1
$$

where $\text{Var}(\hat{\alpha})$ is the diagonal element of the inverse of the observed Fisher information matrix (i.e., the variance–covariance matrix). This means that the Fisher information matrix is required for the Wald test.


Next, we define the test statistic and calculate the p-value using the following R code

```{r}
alpha.hat  <- result$par[1]     # based on the unrestricted likelihood estimation
se.alpha <- sqrt(varcov[1, 1])  # the square root of the top-left diagonal value
W <- (alpha.hat - 1)^2/(se.alpha)^2   # Wald test statistic
##
pval.W <- 1 - pchisq(W, df = 1)
c(W = W, pval.W = pval.W)
```

The p-value is also near 0 implying the rejection of the null hypothesis.


\

# Performance Comparison

At the heart of modern statistical inference lies a powerful trio: the Likelihood Ratio, Wald, and Score tests. Each provides a distinct lens for testing hypotheses, transforming the same likelihood function into a chi-squared statistic through different logical and geometrical principles. Their asymptotic equivalence provides a reassuring theoretical foundation, yet their finite-sample divergence reveals a rich landscape of statistical trade-offs, where the choice of test balances considerations of accuracy, convenience, invariance, and computational burden.


**Core Performance Characteristics**

* **Likelihood Ratio Test (LRT)** - best overall performance in most situations
  + **Accuracy**: Most reliable in finite samples, closest to nominal Type I error rates
  + **Robustness**: Invariant to parameter transformations (unique advantage)
  + **Conservatism**: Slightly conservative in very small samples
  + **Computation**: Most intensive (requires fitting both models)


* **Wald Test** - fastest but least reliable in non-ideal conditions
  + **Accuracy**: Can be poor in small samples, especially with:
    - Skewed likelihoods
    - Parameters near boundaries (variances, probabilities near 0/1)
    - Nonlinear parameters
  + **Bias**: Often anti-conservative (inflates Type I error)
  + **Computation**: Fastest (only needs full model)

* **Score Test (Rao)**- efficient compromise with specific strengths
  + **Accuracy**: Good when null hypothesis is true, but can be inconsistent when alternative is true
  + **Power**: Sometimes less powerful than LRT in small samples
  + **Computation**: Moderate (only needs null model)
  + **Special strength**: Best for boundary testing and initial model building


**General Recommendations**

* Use the LRT when:
  + Computational cost is not a concern.
  + Sample size is small to moderate.
  + when comparing nested models of general types (GLMs, mixed models, etc.).

* Use the Wald Test when:
  + only having the unrestricted model output (e.g., from a standard regression summary).
  + Doing quick approximate inference or constructing confidence intervals.
  + The sample size is large and the likelihood is well-behaved ($\approx$ quadratic).
  + **Caution**: Can be misleading for parameters like odds ratios (always exponentiate first, then form Wald CI on log scale).

* Use the Score Test when:
  + The null model is simpler to fit (e.g., testing for an additional regressor in a large data set; only fit the reduced model).
  + In model checking and diagnostics (e.g., residuals tests).
  + When the MLE under the alternative is hard to obtain or at a boundary.


**Logical Paradoxes and Resolutions**

* **Paradox 1**: "Wald can reject when LRT doesn't"
  + **Occurs when**: Likelihood is highly non-quadratic
  + **Logical explanation**: Wald assumes symmetric CI based on curvature at MLE; LRT uses actual likelihood shape
  + **Resolution**: Trust LRT—it respects the true likelihood geometry

* **Paradox 2**: "Score test more powerful than LRT"
  + **Occurs when**: Parameter at boundary, small samples
  + **Logical explanation**: Score uses exact null distribution; LRT uses asymptotic $\chi^2$ approximation
  + **Resolution**: Use exact distributions or corrected LRT (50:50 mixture $\chi^2$)

* **Paradox 3**: "Tests disagree asymptotically"
  + **Logical implication**: Likely model misspecification or computational error
  + **Resolution**: Check model assumptions and numerical stability





