Introduction
Sampling distributions form the cornerstone of statistical inference.
They describe the probability distribution of a sample
statistic calculated from random samples. This note explores
both exact (finite-sample) and asymptotic (large-sample) distributions
for key statistics including sample means, proportions, and related test
statistics.
Sampling Distribution of the Sample Mean
When the population is normal, the sum (and hence the mean) of iid
normal random variables is exactly normally distributed. If the
population is not normal, the Central Limit Theorem (CLT) implies that
the standardized sum of the iid random variables is asymptotically
normally distributed.
Exact Distribution
For a random sample \(X_1, X_2, \ldots, X_n\) from a normal population
\(N(\mu, \sigma^2)\), the sample mean has an exact normal
distribution:
\[
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)
\]
The standardized version is:
\[
Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)
\]
Example: We simulate data from a normal distribution
to illustrate the sampling distribution of the sample mean from a normal
population.
set.seed(123)
n <- 10
mu <- 5
sigma <- 2
n.samples <- 10000
# replicate() is a convenience wrapper around sapply()
sample.means <- replicate(n.samples, mean(rnorm(n, mu, sigma)))
# Create theoretical curve data
x.vals <- seq(mu - 3*sigma/sqrt(n), mu + 3*sigma/sqrt(n), length.out = 100)
theory.density <- dnorm(x.vals, mean = mu, sd = sigma/sqrt(n))
theory.df <- data.frame(x = x.vals, density = theory.density)
xbar.plt <- ggplot(data.frame(mean = sample.means), aes(x = mean)) +
geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "gray") +
geom_line(data = theory.df, aes(x = x, y = density),
color = "red", linewidth = 1) +
labs(title = "Exact Sampling Distribution of Sample Mean \nNormal Population (n = 10)",
x = "Sample Mean", y = "Density") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(xbar.plt)
Asymptotic Sampling Distribution (Central Limit Theorem)
The asymptotic sampling distribution is the approximate probability
distribution of a sample statistic (like the mean, proportion, or
regression coefficient) when the sample size \(n\) is very large (approaches
infinity).
For any population with finite mean \(\mu\) and variance \(\sigma^2\), as \(n \to \infty\):
\[
Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \overset{d}{\to} N(0, 1)
\]
Example: We simulate 10,000 random samples of
size 50 from a skewed exponential population. Even though the population
is skewed, the sampling distribution of the sample mean is approximately
normal at this sample size.
set.seed(123)
n.large <- 50
lambda <- 1/5 # Mean = 5
# Generate multiple samples from exponential distribution
n.samples <- 10000
exp.means <- replicate(n.samples, mean(rexp(n.large, rate = lambda)))
# Compare with normal approximation
theoretical.mean <- 1/lambda # 5
theoretical.sd <- (1/lambda)/sqrt(n.large) # 5/sqrt(50)
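# Grid for the normal curve, centered on this example's mean and sd
# (reusing x.vals from the previous example would truncate the curve)
x.vals <- seq(theoretical.mean - 4*theoretical.sd, theoretical.mean + 4*theoretical.sd, length.out = 200)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)
# Histogram of simulated means with the normal approximation overlaid as a line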
gg.clt <- ggplot(data.frame(mean = exp.means), aes(x = mean)) +
geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "lightgreen") +
geom_line(data = theory.df, aes(x = x, y = density),
color = "red", linewidth = 1) +
labs(title = "Asymptotic Sampling Distribution of Sample Mean \nExponential Population (n = 50)",
x = "Sample Mean", y = "Density") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
#gg.clt
ggplotly(gg.clt)
Note: The asymptotic
approximation to the sampling distribution of the sample mean holds
regardless of the shape of the population distribution (provided the
population has finite mean and variance).
Remark: The standard normal distribution
\(N(0, 1)\) has no unknown parameters.
Student’s t-Distribution
Let \(\{X_1, X_2, \cdots, X_n \} \overset{\text{i.i.d.}}{\sim} N(\mu,
\sigma^2)\). Define the sample mean to be
\[
\bar{X} = \frac{\sum_{i=1}^n X_i}{n}.
\]
When population variance \(\sigma^2\) is unknown and estimated by
sample variance \(S^2\):
\[
T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}
\]
where
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
Note: The t-distribution
depends on a single parameter: the degrees of freedom, \(\nu\), which equals \(n-1\) for a sample of size \(n\). Because \(\nu\) is fixed once the sample size is
known—and does not need to be estimated from the sample data—it is
occasionally treated in applications as if the distribution had no
parameters.
Example: Since the above t-distribution is defined
based on a normal distribution, we will simulate random samples from a
normal distribution with finite mean and variance.
set.seed(123)
n <- 10
mu <- 5
sigma <- 2
# Generate t-statistics
n.samples <- 10000
# Preallocate a zero vector of length n.samples; growing a vector that
# starts as t.stats <- NULL inside the loop would be slower
t.stats <- numeric(n.samples)
for(i in 1:n.samples) {
sample.data <- rnorm(n, mu, sigma)
x.bar <- mean(sample.data)
s <- sd(sample.data)
t.stats[i] <- (x.bar - mu) / (s/sqrt(n))
}
# Compare with theoretical t-distribution
x.vals <- seq(-4, 4, length.out = 200)
theoretical.t <- dt(x.vals, df = n-1) # calling t-density function
theoretical.normal <- dnorm(x.vals) # standard normal distribution
comparison.df <- data.frame(
x = rep(x.vals, 2),
density = c(theoretical.t, theoretical.normal),
distribution = rep(c("t(9)", "N(0,1)"), each = length(x.vals))
)
t.plt <- ggplot(comparison.df, aes(x = x, y = density, color = distribution)) +
geom_line(linewidth = 1) +
labs(title = "t-Distribution vs Normal Distribution",
x = "Value", y = "Density") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")) +
scale_color_manual(values = c("red", "blue"))
ggplotly(t.plt)
Caution: The sample mean standardized with the known population
standard deviation \(\sigma\) follows \(N(0, 1)\), not a t-distribution.
It is the t-statistic, in which \(\sigma\) is replaced by the sample
standard deviation \(S\), that follows a t-distribution with \(n-1\)
degrees of freedom, and this result is exact only when sampling from a
normal population.
Sampling Distribution of Sample Proportion
Let \(X_1, X_2, \dots, X_n\) be
independent and identically distributed Bernoulli random variables with
parameter \(p\), where:
- \(X_i = 1\) with probability \(p\) (success)
- \(X_i = 0\) with probability \(1-p\) (failure)
The sample proportion is defined as:
\[
\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i
\]
where \(n\) is the fixed sample
size.
In practice, when the sample size is large, the sampling distribution
of the sample proportion is generally characterized using
approximations. For small samples, however, the exact sampling
distribution should be used.
Exact Distribution
Because each \(X_i\) is Bernoulli with success probability \(p\), the
number of successes \(X = \sum_{i=1}^n X_i\) follows \(Binomial(n,p)\),
and the sample proportion is \(\hat{p} = X/n\).
The exact distribution of \(\hat{p}\) is therefore given by the
probability mass function of a binomial distribution with \(n\) trials
and success probability \(p\):
\[
P(\hat{p} = k/n) = P(n \hat{p} = k) = P(X = k) = \frac{n!}{k!(n-k)!}
p^k (1-p)^{n-k}, \ \ k = 0, 1, 2, \cdots, n.
\]
Asymptotic Sampling Distribution (Large \(n\))
By the Central Limit Theorem (specifically, the De Moivre-Laplace
Theorem for Bernoulli trials):
\[
\hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \quad
\text{for large } n
\]
More rigorously, in standardized form:
\[
Z_n = \frac{\hat{p}_n - p}{\sqrt{\frac{p(1-p)}{n}}} \stackrel{d}{\to}
N(0,1) \quad \text{as } n \to \infty
\]
Rules of Thumb for the Approximation:
- \(np \geq 10\) and \(n(1-p) \geq 10\) (common rule of thumb)
- Alternative: \(n > 9 \times \max\left(\frac{p}{1-p}, \frac{1-p}{p}\right)\)
Example: We simulate random samples from a binary
population (also called a Bernoulli population) to demonstrate the
asymptotic sampling distribution of the sample proportion.
set.seed(123)
n <- 100
p <- 0.3
# Generate sample proportions
n.samples <- 10000
# replicate() is a convenience wrapper around sapply()
sample.props <- replicate(n.samples, rbinom(1, n, p)/n)
# Compare with normal approximation
theoretical.mean <- p
theoretical.sd <- sqrt(p*(1-p)/n)
x.vals <- seq(0, 0.6, length.out = 100)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)
binom.plt <- ggplot(data.frame(prop = sample.props), aes(x = prop)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7, fill = "skyblue") +
geom_line(data = theory.df, aes(x = x, y = density),
color = "red", linewidth = 1) +
labs(title = "Sampling Distribution of Sample Proportion",
subtitle = "p = 0.3, n = 100",
x = "Sample Proportion", y = "Density") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(binom.plt)
Sampling Distribution of Sample Variance
We first introduce the asymptotic sampling distribution of sample
variance without derivation. The basic setting is given in the
following.
Let \(X_1, X_2, \dots, X_n
\stackrel{\text{i.i.d.}}{\sim} F\) with:
- \(E[X_i] = \mu\)
- \(\text{Var}(X_i) = \sigma^2 <
\infty\)
- \(E[(X_i - \mu)^4] = \mu_4 <
\infty\) (finite fourth central moment)
Define the sample variance:
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2
\] where \(\bar{X} = \frac{1}{n}
\sum_{i=1}^n X_i\).
Asymptotic Sampling Distribution of Sample Variance
When the sample size \(n\) is large, the sample variance \(S^2\) is
approximately normally distributed:
\[
S^2 \overset{\text{approx}}{\sim} N\left(\sigma^2, \frac{\mu_4 - \sigma^4}{n}\right)
\quad \text{for large } n,
\] or, more precisely, in standardized form,
\[
\frac{S^2 - \sigma^2}{\sqrt{\frac{\mu_4 - \sigma^4}{n}}}
\stackrel{d}{\to} N(0,1) \quad \text{as } n \to \infty.
\]
In practice, the fourth central moment \(\mu_4\) can be estimated from
the sample, which will be discussed in subsequent topics.
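As a rough numerical check of this approximation, the sketch below simulates sample variances from an exponential population, for which \(\sigma^2 = 1/\lambda^2\) and \(\mu_4 = 9/\lambda^4\), and compares their spread with \(\sqrt{(\mu_4 - \sigma^4)/n}\). The seed, sample size, and variable names are illustrative assumptions rather than part of the examples above.

```r
set.seed(2024)                       # illustrative seed (assumption)
lambda <- 1/5                        # exponential rate, so the mean is 5
n      <- 200                        # sample size for each simulated sample
sigma2 <- 1 / lambda^2               # population variance: 25
mu4    <- 9 / lambda^4               # fourth central moment of Exp(rate = lambda)

# Simulate 20000 sample variances
s2.sim  <- replicate(20000, var(rexp(n, rate = lambda)))

# Asymptotic standard deviation of S^2
asym.sd <- sqrt((mu4 - sigma2^2) / n)

round(c(mean.S2 = mean(s2.sim), sd.S2 = sd(s2.sim), asymptotic.sd = asym.sd), 3)
```

The empirical mean of the simulated variances should be close to \(\sigma^2\) and their standard deviation close to the asymptotic value, with the agreement improving as \(n\) grows.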
Special Case: Normal Population
When sampling from a normally distributed population, the sampling
distribution of the sample variance can be fully characterized through a
chi-squared distribution with appropriate scaling.
The chi-squared distribution is a special case of the gamma
distribution and can also be constructed from the standard normal
distribution. Specifically, we have the following result:
For \(Z_1, Z_2, \ldots, Z_k
\stackrel{iid}{\sim} N(0,1)\), using moment generating functions,
we can show that
\[
Q=\sum_{i=1}^k Z_i^2 \sim \chi_k^2.
\]
Using the relationship between the standard normal and chi-squared
distributions, we can derive the exact distribution of the scaled sample
variance for a normal population:
\[
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2.
\]
Proof [optional]: We
prove this in several steps:
We show that for \(X_1, \dots, X_n
\stackrel{iid}{\sim} N(\mu, \sigma^2)\), with
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \quad \bar{X} =
\frac{1}{n} \sum_{i=1}^n X_i,
\]
we have
\[
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2.
\]
Step 1: Standardize and define notation
Let \(Z_i = \frac{X_i - \mu}{\sigma} \sim
N(0,1)\), i.i.d. Then
\[
\bar{Z} = \frac{1}{n} \sum_{i=1}^n Z_i = \frac{\bar{X} - \mu}{\sigma}.
\]
We can write:
\[
\sum_{i=1}^n (X_i - \bar{X})^2 = \sigma^2 \sum_{i=1}^n (Z_i -
\bar{Z})^2.
\]
So
\[
\frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n (X_i -
\bar{X})^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2.
\]
Step 2: Orthogonal transformation
Let \(\mathbf{Z} = (Z_1, \dots,
Z_n)^T\). Choose an \(n \times
n\) orthogonal matrix \(Q\)
whose first row is \(\left(
\frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}} \right)\).
Define
\[
\mathbf{Y} = Q \mathbf{Z}.
\]
Then:
- \(Y_1 = \frac{1}{\sqrt{n}} \sum_{i=1}^n
Z_i = \sqrt{n} \, \bar{Z}\).
- Since \(Q\) is orthogonal and \(\mathbf{Z} \sim N(0, I_n)\), we have \(\mathbf{Y} \sim N(0, I_n)\) as well, so
\(Y_1, \dots, Y_n\) are i.i.d. \(N(0,1)\).
Step 3: Express sum of squares in terms of \(Y_j\)
Orthogonality implies:
\[
\sum_{i=1}^n Z_i^2 = \sum_{j=1}^n Y_j^2.
\]
Also,
\[
\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n Z_i^2 - n \bar{Z}^2.
\]
But \(n \bar{Z}^2 = Y_1^2\), so
\[
\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=1}^n Y_j^2 - Y_1^2 =
\sum_{j=2}^n Y_j^2.
\]
Step 4: Distribution
Since \(Y_2, \dots, Y_n\) are
i.i.d. \(N(0,1)\), we have
\[
\sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2.
\]
Thus
\[
\frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2 =
\sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2.
\]
Step 5: Independence from \(\bar{X}\)
Since \(Y_1 = \sqrt{n} \bar{Z}\) is
independent of \(Y_2, \dots, Y_n\), and \(S^2\) depends only on
\(Y_2, \dots, Y_n\), it follows that \(\bar{X}\) is independent
of \(S^2\). In summary,
\[
\boxed{\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2}
\]
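The algebra in Steps 2 and 3 can be checked numerically. The sketch below builds one possible orthogonal matrix whose first row is \((1/\sqrt{n}, \dots, 1/\sqrt{n})\) from a QR decomposition and verifies the identities for a single simulated vector. The construction via `qr()` and the names `A`, `Qmat`, `z`, and `y` are assumptions made for illustration; the proof only requires that such a matrix exists.

```r
set.seed(2024)                          # illustrative seed (assumption)
n <- 6
z <- rnorm(n)                           # one draw of Z_1, ..., Z_n

# Orthonormalize a matrix whose first column is constant; the first column
# of the resulting Q factor is then +/- (1/sqrt(n), ..., 1/sqrt(n))
A    <- cbind(rep(1, n), matrix(rnorm(n * (n - 1)), n, n - 1))
Qmat <- t(qr.Q(qr(A)))                  # rows of Qmat are orthonormal
if (Qmat[1, 1] < 0) Qmat[1, ] <- -Qmat[1, ]   # make the first row positive

y <- as.numeric(Qmat %*% z)             # Y = Q Z

# Checks: orthogonality, Y_1 = sqrt(n) * Zbar, and the sum-of-squares identity
all.equal(Qmat %*% t(Qmat), diag(n))
all.equal(y[1], sqrt(n) * mean(z))
all.equal(sum(y[-1]^2), sum((z - mean(z))^2))
```

Each `all.equal()` call should return `TRUE` up to floating-point tolerance.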
Example: The \(\chi^2\) distribution is derived from the
standard normal distribution. We simulate standard normal random numbers
and then transform them into \(\chi^2\)
random variables based on the derivations above. A histogram will be
plotted and overlaid with the theoretical \(\chi^2\) density curve.
set.seed(123)
n <- 10
sigma <- 2
# Generate chi-square statistics
n.samples <- 10000
chisq.stats <- numeric(n.samples)
for(i in 1:n.samples) {
sample.data <- rnorm(n, 0, sigma)
chisq.stats[i] <- sum((sample.data/sigma)^2)
}
# Compare with theoretical chi-square
x.vals <- seq(0, 30, length.out = 200)
theoretical.chisq <- dchisq(x.vals, df = n)
theory.df <- data.frame(x = x.vals, density = theoretical.chisq)
chi.plt <- ggplot(data.frame(x = chisq.stats), aes(x = x)) +
geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "steelblue") +
geom_line(data = theory.df, aes(x = x, y = density),
color = "red", linewidth = 1) +
labs(title = "Chi-Square Distribution",
subtitle = "Sum of squared standard normals",
x = "Value", y = "Density") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(chi.plt)
Remark: Both the chi-squared and
t-distributions are parameterized by degrees of freedom. In particular,
for a normal population, the scaled sample variance from a sample of
size \(n\) follows a chi-squared distribution with
\(n-1\) degrees of freedom.
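The example above sums \(n\) squared standard normals, which gives \(n\) degrees of freedom. As a complementary sketch, the code below simulates the scaled sample variance \((n-1)S^2/\sigma^2\) itself and compares its mean and variance with those of a \(\chi^2_{n-1}\) distribution, namely \(n-1\) and \(2(n-1)\). The seed and variable names are illustrative assumptions.

```r
set.seed(2024)                     # illustrative seed (assumption)
n <- 10; mu <- 5; sigma <- 2

# Scaled sample variance (n - 1) S^2 / sigma^2 for 20000 normal samples
scaled.var <- replicate(20000, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)

moments <- rbind(
  simulated   = c(mean = mean(scaled.var), variance = var(scaled.var)),
  theoretical = c(mean = n - 1,            variance = 2 * (n - 1))
)
round(moments, 3)
```

The simulated moments should match \(n-1 = 9\) and \(2(n-1) = 18\) closely, not \(n\) and \(2n\), which reflects the one degree of freedom used in estimating \(\mu\) by \(\bar{X}\).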
F-Distribution
The F-distribution serves as the sampling distribution for the ratio
of two independent sample variances. Variance is a key measure of
quality across disciplines, where higher variance corresponds to lower
quality. When comparing quality via variances, both differences and
ratios are conceivable. However, under normal population assumptions,
the difference of two sample variances lacks a convenient known
distribution, while the appropriately scaled ratio follows the \(F\) distribution.
The setup for the \(F\) distribution is as follows. Consider two
independent random samples from two normal
populations:
\[
\{X_1, X_2, \cdots, X_{n_1}\} \overset{i.i.d}{\sim} N(\mu_1,
\sigma_1^2) \quad\text{ and } \quad \{Y_1, Y_2, \cdots,
Y_{n_2}\} \overset{i.i.d}{\sim} N(\mu_2, \sigma_2^2),
\]
Define
\[
S_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 \quad\text{
and } \quad S_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (Y_i - \bar{Y})^2
\]
\[
F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1, n_2-1}
\]
where \(n_1-1\) and \(n_2-1\) are the degrees of freedom of the numerator
and denominator, respectively. Under the hypothesis \(\sigma_1^2 = \sigma_2^2\) (indicating equal
product quality in variance terms), the statistic reduces to \(F = S_1^2/S_2^2\), which still follows \(F_{n_1-1,
n_2-1}\). Since \(S_1^2\) and \(S_2^2\) are unbiased for \(\sigma_1^2\) and \(\sigma_2^2\), this ratio tends to be close to 1;
its exact expectation is \((n_2-1)/(n_2-3)\) for \(n_2 > 3\).
Example: The F distribution is directly defined
based on two independent \(\chi^2\)
distributions, which are themselves derived from standard normal
distributions. Therefore, we could generate data from normal
distributions and then transform them into F random variables. To keep
the process simple, we generate data directly from \(\chi^2\) distributions.
set.seed(123)
df1 <- 10
df2 <- 15
# Generate F statistics
n.samples <- 10000
f.stats <- numeric(n.samples)
for(i in 1:n.samples) {
u1 <- rchisq(1, df1)
u2 <- rchisq(1, df2)
f.stats[i] <- (u1/df1) / (u2/df2)
}
# Compare with theoretical F-distribution
x.vals <- seq(0, 5, length.out = 200)
theoretical.f <- df(x.vals, df1, df2)
theory.df <- data.frame(x = x.vals, density = theoretical.f)
f.plt <- ggplot(data.frame(x = f.stats), aes(x = x)) +
geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "purple3") +
geom_line(data = theory.df, aes(x = x, y = density),
color = "red", linewidth = 1) +
coord_cartesian(xlim = c(0, 5)) +
labs(title = paste("F-Distribution \n F(", df1, ",", df2, ")", sep = ""),
x = "Value", y = "Density") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(f.plt)
Remarks:
- The \(F\) distribution has two parameters, the numerator and
denominator degrees of freedom, which are the corresponding sample sizes
minus 1. This is similar to the t and chi-squared distributions.
- From the previous section, both the numerator and the denominator
can be re-expressed in terms of two independent chi-squared
random variables. To see this, note that the numerators on the
right-hand sides below follow chi-squared distributions with \(n_1-1\)
and \(n_2-1\) degrees of freedom, respectively:
\[
\frac{S_1^2}{\sigma_1^2} = \frac{(n_1-1)S_1^2/\sigma_1^2}{n_1-1}
\quad \text{ and } \quad \frac{S_2^2}{\sigma_2^2} = \frac{(n_2-1)S_2^2/\sigma_2^2}{n_2-1}
\]
- Denote \(U_1 = (n_1-1)S_1^2/\sigma_1^2
\sim \chi^2_{n_1-1}\) and \(U_2 = (n_2-1)S_2^2/\sigma_2^2 \sim
\chi^2_{n_2-1}\). Then, we can re-express the F-ratio as
\[
F = \frac{U_1/(n_1-1)}{U_2/(n_2-1)} \sim F_{n_1-1, n_2-1}.
\]
- One can also derive the asymptotic sampling
distribution of \(S_1^2/S_2^2\) using a linear approximation
based on a Taylor expansion. However, the analytic expression is complex
and beyond the scope of this class.
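The example above draws directly from \(\chi^2\) distributions. As a complementary sketch, the code below starts from two independent normal samples, forms the scaled variance ratio, and compares simulated quantiles and the mean with those of \(F_{n_1-1, n_2-1}\). The sample sizes, population parameters, and names such as `f.sim` are illustrative assumptions.

```r
set.seed(2024)                      # illustrative seed (assumption)
n1 <- 11; n2 <- 16                  # sample sizes, giving df 10 and 15
sigma1 <- 2; sigma2 <- 3            # population standard deviations

# Scaled ratio of two independent sample variances
f.sim <- replicate(20000, {
  x <- rnorm(n1, mean = 0,  sd = sigma1)
  y <- rnorm(n2, mean = 10, sd = sigma2)
  (var(x) / sigma1^2) / (var(y) / sigma2^2)
})

probs <- c(0.50, 0.90, 0.95)
f.quantiles <- rbind(
  simulated   = as.numeric(quantile(f.sim, probs)),
  theoretical = qf(probs, df1 = n1 - 1, df2 = n2 - 1)
)
colnames(f.quantiles) <- paste0("q", probs)
round(f.quantiles, 3)

# Mean of F(d1, d2) is d2 / (d2 - 2) for d2 > 2
c(simulated.mean = mean(f.sim), theoretical.mean = (n2 - 1) / (n2 - 3))
```

Note that the population means play no role in the ratio, which is why they can differ between the two samples.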
Summary of Key Relationships
We have discussed several exact and asymptotic sampling distributions
for sample means, variances, and their functions. The following table
summarizes these distributions.
| Statistic | Exact Distribution | Asymptotic Distribution | Conditions |
|---|---|---|---|
| \(\bar{X}\) | \(N(\mu, \sigma^2/n)\) | \(N(\mu, \sigma^2/n)\) | Normal population (exact) or large n (asymptotic) |
| \(\frac{\bar{X}-\mu}{S/\sqrt{n}}\) | \(t_{n-1}\) | \(N(0,1)\) | Normal population (exact); large n (asymptotic) |
| \(\hat{p}\) | \(Binomial(n,p)/n\) | \(N(p, p(1-p)/n)\) | \(np, n(1-p) \geq 10\) (asymptotic) |
| \(S^2\) | - | \(N(\sigma^2, (\mu_4-\sigma^4)/n)\) | large n |
| \(\frac{(n-1)S^2}{\sigma^2}\) | \(\chi^2_{n-1}\) | - | Normal population |
| \(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\) | \(F_{n_1-1,n_2-1}\) | - | Normal populations |
Pivotal Quantity
A pivotal quantity (or pivot) is a function of the sample data and an
unknown parameter whose probability distribution does
not depend on the unknown parameter.
For example, for a normal population with known variance,
\[
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \text{N}(\mu, \sigma^2), \quad \sigma^2 \text{
known},
\]
the sample mean follows a normal distribution: \(\bar{X} \sim \text{N}\left(\mu,
\frac{\sigma^2}{n}\right)\).
According to the definition of a pivotal quantity,
\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \text{N}(0,1)
\]
is a pivot since \(N(0, 1)\) does not
depend on \(\mu\). If the variance is unknown, \(\sigma\) is replaced
by the sample standard deviation \(S\), where
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
The resulting studentized expression
\[
T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}
\]
is a pivotal quantity since the \(t_{n-1}\) distribution depends on
neither \(\mu\) nor \(\sigma\).
Similarly, \((n-1)S^2/\sigma^2\) and
\(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\)
in the above summary table are pivotal quantities.
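The defining property of a pivot can be illustrated by simulation. The sketch below computes the statistic \(T\) under two very different normal populations and compares the simulated quantiles with those of \(t_{n-1}\). The helper `simulate.T`, the parameter values, and the seed are hypothetical choices made for illustration.

```r
set.seed(2024)                      # illustrative seed (assumption)
n <- 10

# Hypothetical helper: simulate T = (Xbar - mu) / (S / sqrt(n)) for N(mu, sigma^2)
simulate.T <- function(mu, sigma, reps = 20000) {
  replicate(reps, {
    x <- rnorm(n, mu, sigma)
    (mean(x) - mu) / (sd(x) / sqrt(n))
  })
}

T.a <- simulate.T(mu = 0,   sigma = 1)
T.b <- simulate.T(mu = 100, sigma = 25)

probs <- c(0.025, 0.5, 0.975)
pivot.check <- rbind(
  population.a = as.numeric(quantile(T.a, probs)),
  population.b = as.numeric(quantile(T.b, probs)),
  t.quantiles  = qt(probs, df = n - 1)
)
colnames(pivot.check) <- paste0("q", probs)
round(pivot.check, 3)
```

The three rows should agree up to simulation error, which is what it means for the distribution of \(T\) not to depend on \(\mu\) or \(\sigma\).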
Conclusion
Understanding sampling distributions is fundamental to
statistical inference:
- Exact distributions provide precise results when assumptions are
met.
- Asymptotic distributions offer approximations for large
samples.
- The choice between exact and asymptotic methods depends on sample
size, distributional assumptions, and the specific parameter being
estimated.
- Modern computing allows for empirical verification of these
theoretical results.
These distributions form the theoretical foundation for hypothesis
testing, confidence intervals, and many other statistical
procedures.
---
title: "Sampling Distributions"
author: "Cheng Peng"
date: "West Chester University"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}

## library(leaps)
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, your code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```

\


# Introduction

Sampling distributions form the cornerstone of statistical inference. They describe the probability distribution of a **sample statistic** calculated from random samples. This note explores both exact (finite-sample) and asymptotic (large-sample) distributions for key statistics including sample means, proportions, and related test statistics.


# Sampling Distribution of the Sample Mean

When the population is normal, the sum (and hence the mean) of iid normal random variables is **exactly** normally distributed. If the population is not normal, the Central Limit Theorem (CLT) implies that the standardized sum of the iid random variables is **asymptotically** normally distributed.


## Exact Distribution

For a random sample $X_1, X_2, \ldots, X_n$ from a normal population $N(\mu, \sigma^2)$, the sample mean has an exact normal distribution:

$$
\bar{X} \sim N\left(\mu,  \frac{\sigma^2}{n}\right)
$$

The standardized version is:

$$
Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)
$$

**Example**: We simulate data from a normal distribution to illustrate the sampling distribution of the sample mean from a normal population.

```{r}
set.seed(123)
n <- 10
mu <- 5
sigma <- 2

n.samples <- 10000
# replicate() is a convenience wrapper around sapply()
sample.means <- replicate(n.samples, mean(rnorm(n, mu, sigma)))

# Create theoretical curve data
x.vals <- seq(mu - 3*sigma/sqrt(n), mu + 3*sigma/sqrt(n), length.out = 100)
theory.density <- dnorm(x.vals, mean = mu, sd = sigma/sqrt(n))
theory.df <- data.frame(x = x.vals, density = theory.density)

xbar.plt <- ggplot(data.frame(mean = sample.means), aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "gray") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  labs(title = "Exact Sampling Distribution of Sample Mean \nNormal Population (n = 10)",
       x = "Sample Mean", y = "Density") +
   theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))

ggplotly(xbar.plt)

```
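
As a quick numerical complement to the plot, the following sketch compares the empirical mean and standard deviation of simulated sample means with the theoretical values $\mu$ and $\sigma/\sqrt{n}$. The seed and the names `check.means`, `emp.mean`, `emp.sd`, and `theo.sd` are illustrative assumptions.

```r
set.seed(2024)                     # illustrative seed (assumption)
n <- 10; mu <- 5; sigma <- 2

# 20000 sample means, each from a normal sample of size n
check.means <- replicate(20000, mean(rnorm(n, mu, sigma)))

emp.mean <- mean(check.means)      # should be close to mu
emp.sd   <- sd(check.means)        # should be close to sigma / sqrt(n)
theo.sd  <- sigma / sqrt(n)

round(c(empirical.mean = emp.mean, empirical.sd = emp.sd,
        theoretical.sd = theo.sd), 3)
```

Note that the second argument of $N(\mu, \sigma^2/n)$ is a variance, while `rnorm()` and `dnorm()` take a standard deviation, so the value passed to them is `sigma/sqrt(n)`.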


## Asymptotic Sampling Distribution (Central Limit Theorem)

The asymptotic sampling distribution is the approximate probability distribution of a sample statistic (like the mean, proportion, or regression coefficient) when the sample size $n$ is very large (*approaches infinity*).


For any population with finite mean $\mu$ and variance $\sigma^2$, as $n \to \infty$:



$$
Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \overset{d}{\to} N(0, 1)
$$


**Example**: We simulate 10,000 random samples of size 50 from a skewed exponential population. Even though the population is skewed, the sampling distribution of the sample mean is approximately normal at this sample size.

```{r}
set.seed(123)
n.large <- 50
lambda <- 1/5  # Mean = 5

# Generate multiple samples from exponential distribution
n.samples <- 10000
exp.means <- replicate(n.samples, mean(rexp(n.large, rate = lambda)))

# Compare with normal approximation
theoretical.mean <- 1/lambda  # 5
theoretical.sd <- (1/lambda)/sqrt(n.large)  # 5/sqrt(50)

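# Grid for the normal curve, centered on this example's mean and sd
# (reusing x.vals from the previous chunk would truncate the curve)
x.vals <- seq(theoretical.mean - 4*theoretical.sd, theoretical.mean + 4*theoretical.sd, length.out = 200)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)

# Histogram of simulated means with the normal approximation overlaid as a line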
gg.clt <- ggplot(data.frame(mean = exp.means), aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "lightgreen") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  labs(title = "Asymptotic Sampling Distribution of Sample Mean \nExponential Population (n = 50)",
       x = "Sample Mean", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
#gg.clt
ggplotly(gg.clt)
```


<font color = "red">**Note**: The asymptotic approximation to the sampling distribution of the sample mean holds regardless of the shape of the population distribution (provided the population has finite mean and variance).</font>

<font color = "blue">**Remark**: The standard normal distribution $N(0, 1)$ has no unknown parameters.</font>
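
To see how the approximation improves with sample size, the sketch below tracks the skewness of the simulated sample means for several values of $n$. For an exponential population the skewness of $\bar{X}$ is $2/\sqrt{n}$, so it should shrink toward 0, the skewness of a normal distribution. The helper `skewness`, the chosen sample sizes, and the seed are illustrative assumptions.

```r
set.seed(2024)                      # illustrative seed (assumption)
lambda <- 1/5                       # exponential rate, so the mean is 5

# Hypothetical helper: sample skewness of a numeric vector
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(5, 50, 500)
emp.skew <- sapply(sizes, function(n) {
  xbar <- replicate(20000, mean(rexp(n, rate = lambda)))
  skewness(xbar)
})

data.frame(n = sizes,
           empirical   = round(emp.skew, 3),
           theoretical = round(2 / sqrt(sizes), 3))
```

The empirical skewness should decrease roughly like $1/\sqrt{n}$, which gives a sense of how large $n$ must be before the normal approximation is adequate for a skewed population.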

\

# Student's t-Distribution


<font color = "red">**Let $\{X_1, X_2, \cdots, X_n \} \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$.**</font> Define the sample mean to be

$$
\bar{X} = \frac{\sum_{i=1}^n X_i}{n}.
$$

When population variance $\sigma^2$ is unknown and estimated by sample variance $S^2$:

$$
T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}
$$

where 

$$
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
$$

<font color = "red">**Note**: *The t-distribution depends on a single parameter: the degrees of freedom, $\nu$, which equals $n-1$ for a sample of size $n$. Because $\nu$ is fixed once the sample size is known—and does not need to be estimated from the sample data—it is occasionally treated in applications as if the distribution had no parameters.* </font>


**Example**: Since the above t-distribution is defined based on a normal distribution, we will simulate random samples from a normal distribution with finite mean and variance.

```{r}
set.seed(123)
n <- 10
mu <- 5
sigma <- 2

# Generate t-statistics
n.samples <- 10000
# Preallocate a zero vector of length n.samples; growing a vector that
# starts as t.stats <- NULL inside the loop would be slower
t.stats <- numeric(n.samples)
for(i in 1:n.samples) {
  sample.data <- rnorm(n, mu, sigma)
  x.bar <- mean(sample.data)
  s <- sd(sample.data)
  t.stats[i] <- (x.bar - mu) / (s/sqrt(n))
}

# Compare with theoretical t-distribution
x.vals <- seq(-4, 4, length.out = 200)
theoretical.t <- dt(x.vals, df = n-1)    # calling t-density function
theoretical.normal <- dnorm(x.vals)      # standard normal distribution

comparison.df <- data.frame(
  x = rep(x.vals, 2),
  density = c(theoretical.t, theoretical.normal),
  distribution = rep(c("t(9)", "N(0,1)"), each = length(x.vals))
)

t.plt <- ggplot(comparison.df, aes(x = x, y = density, color = distribution)) +
  geom_line(linewidth = 1) +
  labs(title = "t-Distribution vs Normal Distribution",
       x = "Value", y = "Density") +
    theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")) +
   scale_color_manual(values = c("red", "blue"))
ggplotly(t.plt)
```
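
The chunk above computes simulated t-statistics but plots only the two theoretical densities. As a complementary sketch, the code below compares upper quantiles of simulated t-statistics with those of $t_{9}$ and $N(0,1)$. The seed and the names `t.sim` and `quantile.table` are illustrative assumptions.

```r
set.seed(2024)                      # illustrative seed (assumption)
n <- 10; mu <- 5; sigma <- 2

# Simulated t-statistics from normal samples of size n
t.sim <- replicate(20000, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

probs <- c(0.90, 0.95, 0.975, 0.99)
quantile.table <- data.frame(
  prob      = probs,
  simulated = as.numeric(quantile(t.sim, probs)),
  t.df9     = qt(probs, df = n - 1),
  normal    = qnorm(probs)
)
round(quantile.table, 3)
```

The simulated quantiles should track the `t.df9` column rather than the `normal` column, and the gap is widest in the far tail, where the t-distribution is heavier.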


<font color = "red"> **Caution:** The sample mean standardized with the known population standard deviation $\sigma$ follows $N(0, 1)$, not a t-distribution. It is the t-statistic, in which $\sigma$ is replaced by the sample standard deviation $S$, that follows a t-distribution with $n-1$ degrees of freedom, and this result is exact only when sampling from a normal population. </font>



# Sampling Distribution of Sample Proportion

Let $X_1, X_2, \dots, X_n$ be independent and identically distributed Bernoulli random variables with parameter $p$, where:

* $X_i = 1$ with probability $p$ (success)
* $X_i = 0$ with probability $1-p$ (failure)
 

The sample proportion is defined as:

$$
\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i
$$

where $n$ is the fixed sample size.

In practice, when the sample size is large, the sampling distribution of the sample proportion is generally characterized using approximations. For small samples, however, the exact sampling distribution should be used.

## Exact Distribution

Because each $X_i$ is Bernoulli with success probability $p$, the number of successes $X = \sum_{i=1}^n X_i$ follows $Binomial(n,p)$, and the sample proportion is $\hat{p} = X/n$.

The exact distribution of $\hat{p}$ is therefore given by the probability mass function of a **binomial distribution** with $n$ trials and success probability $p$:

$$
P(\hat{p} = k/n) = P(n \hat{p} = k) = P(X = k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}, \ \ k = 0, 1, 2, \cdots, n.
$$ 
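
The exact probability mass function can be tabulated directly with `dbinom()`. The sketch below assumes the illustrative values $n = 10$ and $p = 0.3$ (not taken from the text) and checks that the exact mean and variance of $\hat{p}$ are $p$ and $p(1-p)/n$.

```{r}
n.exact <- 10     # assumed sample size for illustration
p.exact <- 0.3    # assumed success probability for illustration
k.vals <- 0:n.exact

# Exact distribution of the sample proportion: support k/n with binomial probabilities
exact.pmf <- data.frame(p.hat = k.vals / n.exact,
                        prob  = dbinom(k.vals, size = n.exact, prob = p.exact))
exact.pmf

# Exact mean and variance computed from the pmf
exact.mean <- sum(exact.pmf$p.hat * exact.pmf$prob)
exact.var  <- sum((exact.pmf$p.hat - exact.mean)^2 * exact.pmf$prob)
c(mean = exact.mean, variance = exact.var, theory.var = p.exact * (1 - p.exact) / n.exact)
```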


## Asymptotic Sampling Distribution (Large $n$)

By the Central Limit Theorem (specifically, the De Moivre-Laplace Theorem for Bernoulli trials):

$$
\hat{p} \stackrel{\text{approx}}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \quad \text{for large } n
$$

More rigorously, in standardized form:

$$
Z_n = \frac{\hat{p}_n - p}{\sqrt{\frac{p(1-p)}{n}}} \stackrel{d}{\to} N(0,1) \quad \text{as } n \to \infty
$$

**Rules of Thumb for the Normal Approximation:**

* $np \geq 10$ and $n(1-p) \geq 10$ (common rule of thumb)

* Alternative: $n > 9 \times \max\left(\frac{p}{1-p}, \frac{1-p}{p}\right)$
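
The two rules of thumb above can be checked mechanically. The helper `check.normal.approx()` in the sketch below is hypothetical (it is not part of any package), and the values of $n$ and $p$ are chosen only for illustration. The last lines compare an exact binomial probability with its normal approximation, without a continuity correction.

```{r}
# Hypothetical helper: evaluates both rules of thumb for given n and p
check.normal.approx <- function(n, p) {
  c(rule.np   = (n * p >= 10) && (n * (1 - p) >= 10),
    rule.odds = n > 9 * max(p / (1 - p), (1 - p) / p))
}

check.normal.approx(n = 100, p = 0.3)
check.normal.approx(n = 20,  p = 0.3)

# Exact vs. approximate P(p.hat <= 0.25) when n = 100 and p = 0.3
exact.prob  <- pbinom(25, size = 100, prob = 0.3)
approx.prob <- pnorm(0.25, mean = 0.3, sd = sqrt(0.3 * 0.7 / 100))
c(exact = exact.prob, normal = approx.prob)
```

The two rules do not always agree for borderline cases, so they are best read as rough guidance rather than guarantees.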



**Example**: We simulate random samples from a binary population (also called a Bernoulli population) to demonstrate the asymptotic sampling distribution of the sample proportion.

```{r}
set.seed(123)
n <- 100
p <- 0.3

# Generate sample proportions
n.samples <- 10000
sample.props <- replicate(n.samples, rbinom(1, n, p)/n) # replicate() is a wrapper 
                                                        # function of sapply()

# Compare with normal approximation
theoretical.mean <- p
theoretical.sd <- sqrt(p*(1-p)/n)

x.vals <- seq(0, 0.6, length.out = 100)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)

binom.plt <- ggplot(data.frame(prop = sample.props), aes(x = prop)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7, fill = "skyblue") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  #stat_function(fun = dnorm, 
  #              args = list(mean = theoretical_mean, sd = theoretical_sd),
  #              color = "red", size = 1) +
  labs(title = "Sampling Distribution of Sample Proportion",
       subtitle = "p = 0.3, n = 100",
       x = "Sample Proportion", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(binom.plt)
```


# Sampling Distribution of Sample Variance

We first introduce the asymptotic sampling distribution of sample variance without derivation. The basic setting is given in the following.

Let $X_1, X_2, \dots, X_n \stackrel{\text{i.i.d.}}{\sim} F$ with:

* $E[X_i] = \mu$
* $\text{Var}(X_i) = \sigma^2 < \infty$
* $E[(X_i - \mu)^4] = \mu_4 < \infty$ (finite fourth central moment)


Define the sample variance:

$$
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2
$$
where $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$.


## Asymptotic Sampling Distribution of Sample Variance

When the sample size $n$ is large, the sample variance $S^2$ is approximately normally distributed:

$$
S^2 \stackrel{\text{approx}}{\sim} N\left(\sigma^2, \frac{\mu_4 - \sigma^4}{n}\right) \quad \text{for large } n.
$$
More rigorously, in standardized form,

$$
\frac{S^2 - \sigma^2}{\sqrt{\frac{\mu_4 - \sigma^4}{n}}} \stackrel{d}{\to} N(0,1) \quad \text{as } n \to \infty.
$$


In practice, the fourth central moment, $\mu_4$, can be estimated from the sample, which will be discussed in subsequent topics.
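
A small simulation can make the asymptotic result concrete. The sketch below assumes an Exp(1) population, chosen only because it is clearly non-normal and has known moments ($\sigma^2 = 1$, $\mu_4 = 9$). The plug-in estimate of $\mu_4$ at the end is one simple choice, shown as an assumption rather than as the estimator used later in the course.

```{r}
set.seed(2024)
n.var <- 200          # sample size
n.rep <- 5000         # number of simulated samples
sigma2.exp <- 1       # variance of the Exp(1) population
mu4.exp <- 9          # fourth central moment of the Exp(1) population

# Sample variances from repeated Exp(1) samples
s2.vals <- replicate(n.rep, var(rexp(n.var, rate = 1)))

# Standard deviation suggested by the asymptotic normal distribution
asym.sd <- sqrt((mu4.exp - sigma2.exp^2) / n.var)
c(mean.S2 = mean(s2.vals), sd.S2 = sd(s2.vals), asymptotic.sd = asym.sd)

# Plug-in estimate of mu4 from a single sample (assumed estimator)
x.one <- rexp(n.var, rate = 1)
mu4.hat <- mean((x.one - mean(x.one))^4)
mu4.hat
```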


## Special Case: Normal Population

When sampling from a normally distributed population, the sampling distribution of the sample variance can be fully characterized through a chi-squared distribution with appropriate scaling. 

The chi-squared distribution is a special case of the gamma distribution and can also be constructed from the standard normal distribution. Specifically, we have the following result:

For $Z_1, Z_2, \ldots, Z_k \stackrel{iid}{\sim} N(0,1)$, moment generating functions can be used to show that

$$
Q=\sum_{i=1}^k Z_i^2 \sim \chi_k^2.
$$
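
The gamma connection mentioned above can be checked numerically: a $\chi_k^2$ density coincides with a gamma density with shape $k/2$ and rate $1/2$. The sketch below uses an arbitrary $k = 4$ and an arbitrary grid of points.

```{r}
x.grid <- seq(0.5, 20, by = 0.5)   # arbitrary evaluation points
k.df <- 4                          # arbitrary degrees of freedom

chisq.dens <- dchisq(x.grid, df = k.df)
gamma.dens <- dgamma(x.grid, shape = k.df / 2, rate = 1 / 2)

# The two densities agree up to numerical precision
all.equal(chisq.dens, gamma.dens)
```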
 
Using the relationship between the standard normal and chi-squared distributions, we can derive the exact distribution of the scaled sample variance for a normal population:

$$
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2.
$$

**Proof <font color = "red">[optional]</font>**: We prove this in several steps:

We show that for $X_1, \dots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$, with

$$
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \quad \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i,
$$

we have

$$
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2.
$$

**Step 1: Standardize and define notation**

Let $Z_i = \frac{X_i - \mu}{\sigma} \sim N(0,1)$, i.i.d. Then

$$
\bar{Z} = \frac{1}{n} \sum_{i=1}^n Z_i = \frac{\bar{X} - \mu}{\sigma}.
$$

We can write:

$$
\sum_{i=1}^n (X_i - \bar{X})^2 = \sigma^2 \sum_{i=1}^n (Z_i - \bar{Z})^2.
$$

So

$$
\frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2.
$$

**Step 2: Orthogonal transformation**

Let $\mathbf{Z} = (Z_1, \dots, Z_n)^T$. Choose an $n \times n$ orthogonal matrix $Q$ whose first row is $\left( \frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}} \right)$. Define

$$
\mathbf{Y} = Q \mathbf{Z}.
$$

Then:

* $Y_1 = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i = \sqrt{n} \, \bar{Z}$.
* Since $Q$ is orthogonal and $\mathbf{Z} \sim N(0, I_n)$, we have $\mathbf{Y} \sim N(0, I_n)$ as well, so $Y_1, \dots, Y_n$ are i.i.d.\ $N(0,1)$.


**Step 3: Express sum of squares in terms of $Y_j$**

Orthogonality implies:

$$
\sum_{i=1}^n Z_i^2 = \sum_{j=1}^n Y_j^2.
$$

Also,

$$
\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n Z_i^2 - n \bar{Z}^2.
$$

But $n \bar{Z}^2 = Y_1^2$, so

$$
\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=1}^n Y_j^2 - Y_1^2 = \sum_{j=2}^n Y_j^2.
$$
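
Steps 2 and 3 can be verified numerically for a single vector. The sketch below builds one possible orthogonal matrix with first row $(1/\sqrt{n}, \dots, 1/\sqrt{n})$ through a QR decomposition; this construction is an assumption made for illustration, and any orthogonal matrix with that first row would serve.

```{r}
set.seed(1)
n.dim <- 5
z.vec <- rnorm(n.dim)

# Start from a basis whose first vector is (1, ..., 1)/sqrt(n), then orthonormalize via QR
basis <- cbind(rep(1 / sqrt(n.dim), n.dim), diag(n.dim)[, -1])
Q.mat <- t(qr.Q(qr(basis)))
# QR may return the first direction with a flipped sign; flip the row back if so
if (Q.mat[1, 1] < 0) Q.mat[1, ] <- -Q.mat[1, ]

y.vec <- as.vector(Q.mat %*% z.vec)

all.equal(Q.mat %*% t(Q.mat), diag(n.dim))                  # Q is orthogonal
all.equal(y.vec[1], sqrt(n.dim) * mean(z.vec))              # Y_1 = sqrt(n) * Z-bar
all.equal(sum(y.vec[-1]^2), sum((z.vec - mean(z.vec))^2))   # identity from Step 3
```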

**Step 4: Distribution**

Since $Y_2, \dots, Y_n$ are i.i.d.\ $N(0,1)$, we have

$$
\sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2.
$$

Thus

$$
\frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2.
$$

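**Step 5: Independence from $\bar{X}$**

Since $\bar{X} = \mu + \sigma Y_1/\sqrt{n}$ depends only on $Y_1$, while $S^2$ depends only on $Y_2, \dots, Y_n$, and $Y_1$ is independent of $Y_2, \dots, Y_n$, it follows that $\bar{X}$ is independent of $S^2$. In summary,

$$
\boxed{\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2, \quad \text{independently of } \bar{X}.}
$$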
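
The result of the proof can be checked by simulating the scaled sample variance itself. The sketch below assumes $n = 10$, $\mu = 5$, and $\sigma = 2$ (illustrative values) and compares the simulated mean and variance with those of $\chi_{n-1}^2$, namely $n-1$ and $2(n-1)$. The near-zero correlation with $\bar{X}$ is consistent with independence, although zero correlation alone does not prove it.

```{r}
set.seed(321)
n.chk <- 10
mu.chk <- 5
sigma.chk <- 2
n.rep.chk <- 10000

# Each replicate returns the sample mean and the scaled sample variance
sims <- replicate(n.rep.chk, {
  x <- rnorm(n.chk, mu.chk, sigma.chk)
  c(xbar = mean(x), scaled.var = (n.chk - 1) * var(x) / sigma.chk^2)
})
xbar.vals  <- sims["xbar", ]
scaled.var <- sims["scaled.var", ]

# Theory: mean n - 1 = 9 and variance 2(n - 1) = 18
c(mean = mean(scaled.var), variance = var(scaled.var))

# Sample correlation between the sample mean and the scaled sample variance
xbar.s2.cor <- cor(xbar.vals, scaled.var)
xbar.s2.cor
```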

**Example**: The $\chi^2$ distribution is derived from the standard normal distribution. We simulate standard normal random numbers and then transform them into $\chi^2$ random variables based on the derivations above. A histogram will be plotted and overlaid with the theoretical $\chi^2$ density curve.


```{r}
set.seed(123)
n <- 10
sigma <- 2

# Generate chi-square statistics
n.samples <- 10000
chisq.stats <- numeric(n.samples)

for(i in 1:n.samples) {
  sample.data <- rnorm(n, 0, sigma)
  chisq.stats[i] <- sum((sample.data/sigma)^2)
}

# Compare with theoretical chi-square
x.vals <- seq(0, 30, length.out = 200)
theoretical.chisq <- dchisq(x.vals, df = n)
theory.df <- data.frame(x = x.vals, density = theoretical.chisq)

chi.plt <- ggplot(data.frame(x = chisq.stats), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "steelblue") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  #stat_function(fun = dchisq, args = list(df = n), color = "red", size = 1) +
  labs(title = "Chi-Square Distribution",
       subtitle = "Sum of squared standard normals",
       x = "Value", y = "Density") +
   theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(chi.plt)
```


<font color = "red"> **Remark** Both the chi-squared and t-distributions are parameterized by degrees of freedom. In particular, the scaled sample variance from a sample of size $n$ follows a chi-squared distribution with $n-1$ degrees of freedom. </font>


# F-Distribution

The F-distribution serves as the sampling distribution for the ratio of two independent sample variances. Variance is a key measure of quality across disciplines, where higher variance corresponds to lower quality. When comparing quality via variances, both differences and ratios are conceivable. However, under normal population assumptions, the difference of two sample variances lacks a convenient known distribution, while the appropriately scaled ratio follows the $F$ distribution.  

The setup for the definition of the $F$ distribution is as follows. Consider two **independent** random samples from two normal populations:

$$
\{X_1, X_2, \cdots, X_{n_1}\}  \overset{i.i.d}{\sim} N(\mu_1, \sigma_1^2) \quad\text{ and } \quad \{Y_1, Y_2, \cdots, Y_{n_2}\}  \overset{i.i.d}{\sim} N(\mu_2, \sigma_2^2),
$$

Define the two sample variances

$$
S_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 \quad\text{ and } \quad S_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (Y_i - \bar{Y})^2.
$$

Then

$$
F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1, n_2-1},
$$
 
where $n_1-1$ and $n_2-1$ are the numerator and denominator degrees of freedom, respectively. Since $S_1^2$ and $S_2^2$ are unbiased for $\sigma_1^2$ and $\sigma_2^2$, under the hypothesis $\sigma_1^2 = \sigma_2^2$ (indicating equal product quality in variance terms) the ratio reduces to $F = S_1^2/S_2^2$, which is expected to be close to 1 and follows the $F_{n_1-1, n_2-1}$ distribution exactly. Its mean is $(n_2-1)/(n_2-3)$ for $n_2 > 3$, which approaches 1 as $n_2$ grows.





**Example**: The F distribution is directly defined based on two independent $\chi^2$ distributions, which are themselves derived from standard normal distributions. Therefore, we could generate data from normal distributions and then transform them into F random variables. To keep the process simple, we generate data directly from $\chi^2$ distributions.

```{r}
set.seed(123)
df1 <- 10
df2 <- 15

# Generate F statistics
n.samples <- 10000
f.stats <- numeric(n.samples)

for(i in 1:n.samples) {
  u1 <- rchisq(1, df1)
  u2 <- rchisq(1, df2)
  f.stats[i] <- (u1/df1) / (u2/df2)
}

# Compare with theoretical F-distribution
x.vals <- seq(0, 5, length.out = 200)
theoretical.f <- df(x.vals, df1, df2)
theory.df <- data.frame(x = x.vals, density = theoretical.f)




f.plt <- ggplot(data.frame(x = f.stats), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "purple3") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  coord_cartesian(xlim = c(0, 5)) +
  labs(title = paste("F-Distribution \n F(", df1, ",", df2, ")", sep = ""),
       x = "Value", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(f.plt)
```
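
The same check can be run starting from normal samples rather than from $\chi^2$ draws. The sketch below assumes sample sizes $n_1 = 10$ and $n_2 = 15$ and unequal standard deviations 2 and 3 (all illustrative), forms the scaled variance ratio, and compares its empirical quantiles with those of $F_{n_1-1, n_2-1}$.

```{r}
set.seed(99)
n1.f <- 10
n2.f <- 15
sigma1.f <- 2
sigma2.f <- 3
n.rep.f <- 10000

# Scaled ratio of sample variances from two independent normal samples
f.norm <- replicate(n.rep.f, {
  x <- rnorm(n1.f, mean = 0, sd = sigma1.f)
  y <- rnorm(n2.f, mean = 1, sd = sigma2.f)
  (var(x) / sigma1.f^2) / (var(y) / sigma2.f^2)
})

# Empirical vs. theoretical quantiles
probs <- c(0.5, 0.9, 0.95)
rbind(simulated   = quantile(f.norm, probs),
      theoretical = qf(probs, n1.f - 1, n2.f - 1))

# Proportion of simulated ratios below the theoretical 95th percentile
f.cover <- mean(f.norm <= qf(0.95, n1.f - 1, n2.f - 1))
f.cover
```

Note that the degrees of freedom here are $n_1 - 1 = 9$ and $n_2 - 1 = 14$, since one degree of freedom is lost in each sample when estimating the mean.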


**Remarks**: 

1. The F-distribution has two parameters, the numerator and denominator degrees of freedom, which are the corresponding sample sizes minus 1. This parallels the degrees of freedom of the t and chi-squared distributions.

2. From the previous section, both the numerator and the denominator of the F-ratio can be re-expressed in terms of two independent chi-squared random variables. To see this, note that the numerators on the right-hand sides below follow chi-squared distributions with $n_1-1$ and $n_2-1$ degrees of freedom, respectively.

$$
\frac{S_1^2}{\sigma_1^2} = \frac{(n_1-1)S_1^2/\sigma_1^2}{n_1-1} \quad  \text{ and } \quad \frac{S_2^2}{\sigma_2^2} = \frac{(n_2-1)S_2^2/\sigma_2^2}{n_2-1}.
$$

3. Denote $U_1 = (n_1-1)S_1^2/\sigma_1^2 \sim \chi^2_{n_1-1}$ and $U_2 = (n_2-1)S_2^2/\sigma_2^2 \sim \chi^2_{n_2-1}$. Then we can re-express the F-ratio as

$$
F = \frac{U_1/(n_1-1)}{U_2/(n_2-1)} \sim F_{n_1-1, n_2-1}.
$$

4. One can also derive the **asymptotic sampling distribution** of $S_1^2/S_2^2$ using a linear approximation based on a Taylor expansion. However, the analytic expression is complex and beyond the scope of this class.




# Summary of Key Relationships

We have discussed several exact and asymptotic sampling distributions for sample means, variances, and their functions. The following table summarizes these distributions.


| Statistic | Exact Distribution | Asymptotic Distribution | Conditions |
|:----------|:--------------|:--------------------|:-------------|
| $\bar{X}$ | $N(\mu, \sigma^2/n)$ | $N(\mu, \sigma^2/n)$ | Exact: normal population; asymptotic: large $n$ |
| $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ | $t_{n-1}$ | $N(0,1)$ | Exact: normal population; asymptotic: large $n$ |
| $\hat{p}$ | $Binomial(n,p)/n$ | $N(p, p(1-p)/n)$ | Exact: any $n$; asymptotic: $np \geq 10$ and $n(1-p) \geq 10$ |
| $S^2$ | - | $N(\sigma^2, (\mu_4-\sigma^4)/n)$ | Large $n$, finite fourth moment |
| $\frac{(n-1)S^2}{\sigma^2}$ | $\chi^2_{n-1}$ | - | Normal population |
| $\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$ | $F_{n_1-1,n_2-1}$ | - | Independent normal populations |


**Pivotal Quantity**

A pivotal quantity (or pivot) is a function of the sample data and an **unknown parameter** whose probability distribution does not depend on the **unknown parameter**.

For example, for a random sample from a normal population with known variance,

$$
X_1, \dots, X_n \overset{i.i.d}{\sim} \text{N}(\mu, \sigma^2), \quad \sigma^2 \text{ known},
$$

the sample mean follows a normal distribution: $\bar{X} \sim \text{N}\left(\mu, \frac{\sigma^2}{n}\right)$.

By the definition of a pivotal quantity,

$$
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \text{N}(0,1)
$$

is a pivot, since the $N(0, 1)$ distribution does not depend on $\mu$. If the variance of the normal population is unknown, we replace $\sigma$ with the sample standard deviation $S$, where

$$
 S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
$$

The resulting studentized expression

$$
T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}
$$

is a pivotal quantity, since the $t_{n-1}$ distribution depends on neither $\mu$ nor $\sigma$.

Similarly, $(n-1)S^2/\sigma^2$ and $\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$ in the above summary table are pivotal quantities.
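
The defining property of a pivot, that its distribution is the same whatever the parameter values, can be illustrated by simulation. In the sketch below, `sim.t()` is a hypothetical helper and the two parameter settings are arbitrary; in both settings the proportion of t-statistics falling within the central 95% range of $t_{n-1}$ should be close to 0.95.

```{r}
set.seed(7)
n.piv <- 10
n.rep.piv <- 10000

# Hypothetical helper: simulate the t-statistic for a given (mu, sigma)
sim.t <- function(mu, sigma) {
  replicate(n.rep.piv, {
    x <- rnorm(n.piv, mu, sigma)
    (mean(x) - mu) / (sd(x) / sqrt(n.piv))
  })
}

t.a <- sim.t(mu = 0,  sigma = 1)    # first arbitrary setting
t.b <- sim.t(mu = 50, sigma = 10)   # second arbitrary setting

# Share of simulated statistics inside the central 95% range of t with n - 1 df
t.crit.piv <- qt(0.975, df = n.piv - 1)
cover.a <- mean(abs(t.a) <= t.crit.piv)
cover.b <- mean(abs(t.b) <= t.crit.piv)
c(setting.a = cover.a, setting.b = cover.b)
```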



**Conclusion**

Understanding sampling distributions is fundamental to statistical inference:

* Exact distributions provide precise results when assumptions are met

* Asymptotic distributions offer approximations for large samples

* The choice between exact and asymptotic methods depends on sample size, distributional assumptions, and the specific parameter being estimated

* Modern computing allows for empirical verification of these theoretical results


These distributions form the theoretical foundation for hypothesis testing, confidence intervals, and many other statistical procedures.





