Introduction
Sampling distributions form the cornerstone of statistical inference.
They describe the probability distribution of a sample
statistic calculated from random samples. This note explores
both exact (finite-sample) and asymptotic (large-sample) distributions
for key statistics including sample means, proportions, and related test
statistics.
Sampling Distribution of the Sample Mean
When the population is normal, the sum (and hence the mean) of the iid random variables is exactly normally distributed, because the normal family is closed under sums of independent variables. If the population is not normal, the Central Limit Theorem (CLT) implies that the sum of the iid random variables is asymptotically normally distributed, provided the population variance is finite.
Exact Distribution
For a random sample \(X_1, X_2, \ldots,
X_n\) from a normal population \(N(\mu,
\sigma^2)\), the sample mean has an exact normal
distribution:
\[
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)
\]
The standardized version is:
\[
Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)
\]
Both statements are exact for every sample size \(n\); no limiting argument is involved.
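As a small numerical illustration of the standardized form, the sketch below computes \(P(\bar{X} > 6)\) in two equivalent ways. The values of `mu`, `sigma`, `n`, and `threshold` are illustrative assumptions chosen for this sketch.

```r
# Illustrative values (assumptions chosen for this sketch)
mu <- 5
sigma <- 2
n <- 10
threshold <- 6

# Standard error of the sample mean: sigma / sqrt(n)
se <- sigma / sqrt(n)

# Route 1: standardize the threshold, then use the standard normal CDF
z <- (threshold - mu) / se
p.upper <- pnorm(z, lower.tail = FALSE)

# Route 2: evaluate the N(mu, se^2) CDF directly
p.direct <- pnorm(threshold, mean = mu, sd = se, lower.tail = FALSE)

c(z = z, p.upper = p.upper, p.direct = p.direct)
```

The two routes agree because standardizing only relocates and rescales the normal distribution.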
Example: We simulate data from a normal distribution to illustrate the sampling distribution of the sample mean for a normal population.
library(ggplot2)   # plotting
library(plotly)    # ggplotly() for interactive plots
set.seed(123)
n <- 10
mu <- 5
sigma <- 2
n.samples <- 10000
# replicate() is a convenience wrapper around sapply()
sample.means <- replicate(n.samples, mean(rnorm(n, mu, sigma)))
# Create theoretical curve data: N(mu, sigma^2/n) over mu +/- 3 standard errors
x.vals <- seq(mu - 3*sigma/sqrt(n), mu + 3*sigma/sqrt(n), length.out = 100)
theory.density <- dnorm(x.vals, mean = mu, sd = sigma/sqrt(n))
theory.df <- data.frame(x = x.vals, density = theory.density)
# Histogram of simulated means with the theoretical density overlaid
xbar.plt <- ggplot(data.frame(mean = sample.means), aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "gray") +
  geom_line(data = theory.df, aes(x = x, y = density),
            color = "red", linewidth = 1) +
  labs(title = "Exact Sampling Distribution of Sample Mean \nNormal Population (n = 10)",
       x = "Sample Mean", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(xbar.plt)
Asymptotic Sampling Distribution (Central Limit Theorem)
The asymptotic sampling distribution is the approximate probability
distribution of a sample statistic (like the mean, proportion, or
regression coefficient) when the sample size \(n\) is very large (approaches
infinity).
For any population with finite mean \(\mu\) and variance \(\sigma^2\), as \(n \to \infty\):
\[
Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \overset{d}{\to} N(0, 1)
\]
Example: We simulate 10,000 random samples of size 50 from a skewed exponential population. When the sample size is large, the sampling distribution of the sample mean is approximately normal.
set.seed(123)
n.large <- 50
lambda <- 1/5 # population mean = 1/lambda = 5
# Generate many samples from the exponential distribution
n.samples <- 10000
exp.means <- replicate(n.samples, mean(rexp(n.large, rate = lambda)))
# Normal approximation implied by the CLT
theoretical.mean <- 1/lambda # 5
theoretical.sd <- (1/lambda)/sqrt(n.large) # 5/sqrt(50)
# Recompute the grid for this example (mean +/- 4 standard errors)
x.vals <- seq(theoretical.mean - 4*theoretical.sd,
              theoretical.mean + 4*theoretical.sd, length.out = 200)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)
# Histogram of simulated means with the approximating normal curve overlaid
gg.clt <- ggplot(data.frame(mean = exp.means), aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "lightgreen") +
  geom_line(data = theory.df, aes(x = x, y = density),
            color = "red", linewidth = 1) +
  labs(title = "Asymptotic Sampling Distribution of Sample Mean \nExponential Population (n = 50)",
       x = "Sample Mean", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(gg.clt)
Note: The asymptotic
approximation to the sampling distribution of the sample mean holds
regardless of the shape of the population distribution (provided the
population has finite mean and variance).
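To see how the quality of the approximation depends on the sample size, the following sketch estimates \(P(Z \le z_{0.95})\) for standardized means of exponential samples at several values of \(n\). The seed, the sample sizes, and the number of replications are arbitrary assumptions; under the CLT the estimated proportions should move toward 0.95 as \(n\) grows.

```r
set.seed(2024)      # arbitrary seed for reproducibility
lambda <- 1/5       # exponential rate; population mean = sd = 1/lambda = 5
n.samples <- 5000   # replications per sample size (assumed value)

# Empirical P(Z <= z.cut) for several sample sizes; N(0,1) gives 0.95
z.cut <- qnorm(0.95)
sizes <- c(5, 50, 500)
coverage <- sapply(sizes, function(n) {
  xbar <- replicate(n.samples, mean(rexp(n, rate = lambda)))
  z <- (xbar - 1/lambda) / ((1/lambda) / sqrt(n))   # standardized sample means
  mean(z <= z.cut)
})
names(coverage) <- paste0("n=", sizes)
coverage
```

For a right-skewed population such as the exponential, the small-sample proportion tends to fall somewhat below the nominal value, which is one way to see that the normal approximation is only asymptotic.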
Remark: The standard normal distribution contains no unknown parameters.
Student’s t-Distribution
Let \(\{X_1, X_2, \cdots, X_n \} \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)\). Define the sample mean to be
\[
\bar{X} = \frac{\sum_{i=1}^n X_i}{n}.
\]
When the population variance \(\sigma^2\) is unknown and estimated by the sample variance \(S^2\), the following result holds exactly for every \(n \geq 2\):
\[
T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}
\]
where
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
Note: The t-distribution
depends on a single parameter: the degrees of freedom, \(\nu\), which equals \(n-1\) for a sample of size \(n\). Because \(\nu\) is fixed once the sample size is
known—and does not need to be estimated from the sample data—it is
occasionally treated in applications as if the distribution had no
parameters.
Example: Since the above t-distribution is defined
based on a normal distribution, we will simulate random samples from a
normal distribution with finite mean and variance.
set.seed(123)
n <- 10
mu <- 5
sigma <- 2
# Generate t-statistics
n.samples <- 10000
# Preallocate a zero vector of length 10000; growing a vector from
# t.stats <- NULL inside the loop would be slower
t.stats <- numeric(n.samples)
for (i in 1:n.samples) {
  sample.data <- rnorm(n, mu, sigma)
  x.bar <- mean(sample.data)
  s <- sd(sample.data)   # sample standard deviation (n - 1 denominator)
  t.stats[i] <- (x.bar - mu) / (s/sqrt(n))
}
# Theoretical densities: t with n - 1 df versus the standard normal
x.vals <- seq(-4, 4, length.out = 200)
theoretical.t <- dt(x.vals, df = n-1) # t density
theoretical.normal <- dnorm(x.vals) # standard normal density
comparison.df <- data.frame(
  x = rep(x.vals, 2),
  density = c(theoretical.t, theoretical.normal),
  distribution = rep(c("t(9)", "N(0,1)"), each = length(x.vals))
)
t.plt <- ggplot(comparison.df, aes(x = x, y = density, color = distribution)) +
  geom_line(linewidth = 1) +
  labs(title = "t-Distribution vs Normal Distribution",
       x = "Value", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")) +
  scale_color_manual(values = c("red", "blue"))
ggplotly(t.plt)
Caution: It is the studentized sample mean, not the sample mean standardized with the true \(\sigma\), that follows a t-distribution. More precisely, when sampling from a normal population with unknown mean \(\mu\) and standard deviation \(\sigma\), the t-statistic (calculated using the sample standard deviation \(S\)) follows a t-distribution with \(n-1\) degrees of freedom.
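The practical consequence of using \(S\) in place of \(\sigma\) shows up in the critical values. The sketch below tabulates the 97.5% quantile of \(t_{n-1}\) next to the standard normal quantile; the particular sample sizes are arbitrary assumptions.

```r
# 97.5% critical values: t with n - 1 degrees of freedom vs. standard normal
n.values <- c(5, 10, 30, 100, 1000)   # illustrative sample sizes
crit <- data.frame(
  n      = n.values,
  t.crit = qt(0.975, df = n.values - 1),
  z.crit = qnorm(0.975)
)
# Ratio > 1 reflects the heavier tails of the t-distribution
crit$ratio <- crit$t.crit / crit$z.crit
crit
```

The t critical value always exceeds the normal one and shrinks toward it as \(n\) increases, which is why the two distributions are often treated as interchangeable for large samples.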
Sampling Distribution of Sample Proportion
Let \(X_1, X_2, \dots, X_n\) be
independent and identically distributed Bernoulli random variables with
parameter \(p\), where:
- \(X_i = 1\) with probability \(p\) (success)
- \(X_i = 0\) with probability \(1-p\) (failure)
The sample proportion is defined as:
\[
\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i
\]
where \(n\) is the fixed sample
size.
In practice, when the sample size is large, the sampling distribution of the sample proportion is generally characterized using a normal approximation. For small samples, however, the exact sampling distribution should be used.
Exact Distribution
For a Bernoulli population with success probability \(p\), the sample proportion can be written as \(\hat{p} = X/n\), where \(X = \sum_{i=1}^n X_i \sim Binomial(n,p)\). The exact distribution of \(\hat{p}\) therefore follows from the probability mass function of a binomial distribution with \(n\) trials and success probability \(p\):
\[
P(\hat{p} = k/n) = P(n \times \hat{p} = k) = P(X = k) = \frac{n!}{k!(n-k)!}
p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \cdots, n.
\]
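The exact distribution can be tabulated directly with `dbinom()`. The sketch below does this for illustrative values of `n` and `p` (assumptions chosen for this example) and confirms that the mean and variance computed from the mass function match \(p\) and \(p(1-p)/n\).

```r
n <- 10
p <- 0.3   # illustrative values (assumptions)

# Support of p.hat is {0, 1/n, ..., 1}; probabilities come from Binomial(n, p)
k <- 0:n
pmf <- data.frame(p.hat = k / n, prob = dbinom(k, size = n, prob = p))
pmf

# Exact mean and variance of p.hat computed from the mass function
mean.phat <- sum(pmf$p.hat * pmf$prob)
var.phat <- sum((pmf$p.hat - mean.phat)^2 * pmf$prob)
c(mean = mean.phat, variance = var.phat, theory.var = p * (1 - p) / n)
```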
Asymptotic Sampling Distribution (Large \(n\))
By the Central Limit Theorem (specifically, the De Moivre-Laplace
Theorem for Bernoulli trials):
\[
\hat{p} \stackrel{d}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \quad
\text{for large } n
\]
More rigorously, in standardized form:
\[
Z_n = \frac{\hat{p}_n - p}{\sqrt{\frac{p(1-p)}{n}}} \stackrel{d}{\to}
N(0,1) \quad \text{as } n \to \infty
\]
Conditions commonly used to justify the approximation:
- \(np \geq 10\) and \(n(1-p) \geq 10\) (common rule of thumb)
- Alternative: \(n > 9 \times \max\left(\frac{p}{1-p}, \frac{1-p}{p}\right)\)
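The rule of thumb can be checked against the exact binomial distribution. In the sketch below, `normal.approx.ok()` and `compare.approx()` are hypothetical helper names introduced only for this illustration, and the inputs are arbitrary; no continuity correction is applied.

```r
# Hypothetical helper (not from any package): checks the np >= 10 rule of thumb
normal.approx.ok <- function(n, p, cutoff = 10) {
  n * p >= cutoff && n * (1 - p) >= cutoff
}

# Exact vs. normal-approximate P(p.hat <= q), without continuity correction
compare.approx <- function(n, p, q) {
  exact <- pbinom(floor(n * q), size = n, prob = p)
  approx <- pnorm(q, mean = p, sd = sqrt(p * (1 - p) / n))
  # rule.ok is stored as 1 (TRUE) or 0 (FALSE) because c() coerces to numeric
  c(rule.ok = normal.approx.ok(n, p), exact = exact, approx = approx)
}

rbind(
  "n=20,  p=0.3" = compare.approx(20, 0.3, 0.25),
  "n=200, p=0.3" = compare.approx(200, 0.3, 0.25)
)
```

Comparing the `exact` and `approx` columns row by row gives a rough sense of how much accuracy is lost when the rule of thumb is not satisfied.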
Example: We simulate random samples from a binary population (also called a Bernoulli population) to demonstrate the asymptotic sampling distribution of the sample proportion.
set.seed(123)
n <- 100
p <- 0.3
# Generate sample proportions
n.samples <- 10000
# replicate() is a convenience wrapper around sapply()
sample.props <- replicate(n.samples, rbinom(1, n, p)/n)
# Normal approximation: N(p, p(1-p)/n)
theoretical.mean <- p
theoretical.sd <- sqrt(p*(1-p)/n)
x.vals <- seq(0, 0.6, length.out = 100)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)
# Histogram of simulated proportions with the normal curve overlaid
binom.plt <- ggplot(data.frame(prop = sample.props), aes(x = prop)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7, fill = "skyblue") +
  geom_line(data = theory.df, aes(x = x, y = density),
            color = "red", linewidth = 1) +
  labs(title = "Sampling Distribution of Sample Proportion",
       subtitle = "p = 0.3, n = 100",
       x = "Sample Proportion", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(binom.plt)
Sampling Distribution of Sample Variance
We first introduce the asymptotic sampling distribution of sample
variance without derivation. The basic setting is given in the
following.
Let \(X_1, X_2, \dots, X_n
\stackrel{\text{i.i.d.}}{\sim} F\) with:
- \(E[X_i] = \mu\)
- \(\text{Var}(X_i) = \sigma^2 <
\infty\)
- \(E[(X_i - \mu)^4] = \mu_4 <
\infty\) (finite fourth central moment)
Define the sample variance:
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2
\] where \(\bar{X} = \frac{1}{n}
\sum_{i=1}^n X_i\).
Asymptotic Sampling Distribution of Sample Variance
When the sample size \(n\) is large, the sample variance \(S^2\) is approximately normally distributed:
\[
S^2 \overset{\text{approx}}{\sim} N\left(\sigma^2, \frac{\mu_4 - \sigma^4}{n}\right)
\quad \text{for large } n,
\]
or, stated as a limit in standardized form,
\[
\frac{S^2 - \sigma^2}{\sqrt{\frac{\mu_4 - \sigma^4}{n}}}
\stackrel{d}{\to} N(0,1) \quad \text{as } n \to \infty.
\]
In practice, the fourth central moment, \(\mu_4\), can be estimated from the sample, which will be discussed in subsequent topics.
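The asymptotic variance formula can be checked by simulation for a population whose moments are known in closed form. The sketch below uses an exponential population with rate `lambda`, for which \(\sigma^2 = 1/\lambda^2\) and \(\mu_4 = 9/\lambda^4\); the seed, sample size, and number of replications are arbitrary assumptions.

```r
set.seed(2024)      # arbitrary seed for reproducibility
lambda <- 1         # Exp(1): sigma^2 = 1 and mu_4 = 9
n <- 200            # sample size (assumed value)
n.samples <- 5000   # number of simulated samples (assumed value)

# Sample variance of each simulated sample
s2 <- replicate(n.samples, var(rexp(n, rate = lambda)))

# Asymptotic variance (mu_4 - sigma^4) / n from the formula above
sigma2 <- 1 / lambda^2
mu4 <- 9 / lambda^4
asym.var <- (mu4 - sigma2^2) / n

c(mean.S2 = mean(s2), sim.var.S2 = var(s2), asym.var.S2 = asym.var)
```

The simulated variance of \(S^2\) should be close to, though not identical to, the asymptotic value, since the formula drops terms of smaller order in \(n\).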
Special Case: Normal Population
When sampling from a normally distributed population, the sampling
distribution of the sample variance can be fully characterized through a
chi-squared distribution with appropriate scaling.
The chi-squared distribution is a special case of the gamma
distribution and can also be constructed from the standard normal
distribution. Specifically, we have the following result:
For \(Z_1, Z_2, \ldots, Z_k \stackrel{iid}{\sim} N(0,1)\), using moment generating functions, we can show that
\[
Q=\sum_{i=1}^k Z_i^2 \sim \chi_k^2.
\]
Using the relationship between the standard normal and chi-squared
distributions, we can derive the exact distribution of the scaled sample
variance for a normal population:
\[
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2.
\]
Proof [optional]: We
prove this in several steps:
We show that for \(X_1, \dots, X_n
\stackrel{iid}{\sim} N(\mu, \sigma^2)\), with
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \quad \bar{X} =
\frac{1}{n} \sum_{i=1}^n X_i,
\]
we have
\[
\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2.
\]
Step 1: Standardize and define notation
Let \(Z_i = \frac{X_i - \mu}{\sigma} \sim
N(0,1)\), i.i.d. Then
\[
\bar{Z} = \frac{1}{n} \sum_{i=1}^n Z_i = \frac{\bar{X} - \mu}{\sigma}.
\]
We can write:
\[
\sum_{i=1}^n (X_i - \bar{X})^2 = \sigma^2 \sum_{i=1}^n (Z_i -
\bar{Z})^2.
\]
So
\[
\frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n (X_i -
\bar{X})^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2.
\]
Step 2: Orthogonal transformation
Let \(\mathbf{Z} = (Z_1, \dots,
Z_n)^T\). Choose an \(n \times
n\) orthogonal matrix \(Q\)
whose first row is \(\left(
\frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}} \right)\).
Define
\[
\mathbf{Y} = Q \mathbf{Z}.
\]
Then:
- \(Y_1 = \frac{1}{\sqrt{n}} \sum_{i=1}^n
Z_i = \sqrt{n} \, \bar{Z}\).
- Since \(Q\) is orthogonal and \(\mathbf{Z} \sim N(0, I_n)\), we have \(\mathbf{Y} \sim N(0, I_n)\) as well, so
\(Y_1, \dots, Y_n\) are i.i.d. \(N(0,1)\).
Step 3: Express sum of squares in terms of \(Y_j\)
Orthogonality implies:
\[
\sum_{i=1}^n Z_i^2 = \sum_{j=1}^n Y_j^2.
\]
Also,
\[
\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n Z_i^2 - n \bar{Z}^2.
\]
But \(n \bar{Z}^2 = Y_1^2\), so
\[
\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=1}^n Y_j^2 - Y_1^2 =
\sum_{j=2}^n Y_j^2.
\]
Step 4: Distribution
Since \(Y_2, \dots, Y_n\) are
i.i.d. \(N(0,1)\), we have
\[
\sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2.
\]
Thus
\[
\frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2 =
\sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2.
\]
Step 5: Independence from \(\bar{X}\)
Since \(Y_1 = \sqrt{n} \bar{Z}\) is independent of \(Y_2, \dots, Y_n\), and \(S^2\) depends on the data only through \(Y_2, \dots, Y_n\), it follows that \(\bar{X}\) is independent of \(S^2\). In summary,
\[
\boxed{\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2}
\]
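The orthogonal-transformation argument in Steps 2 and 3 can be verified numerically. The sketch below builds one possible orthogonal matrix with the required first row using a QR decomposition; this construction, the name `Qmat`, and the values of `n` and the seed are assumptions made for illustration, and any orthogonal matrix with that first row would serve.

```r
set.seed(2024)  # arbitrary seed
n <- 6          # illustrative dimension
z <- rnorm(n)

# Build an orthogonal matrix whose first row is (1/sqrt(n), ..., 1/sqrt(n)):
# orthonormalize the columns of A, whose first column is the vector of ones
A <- cbind(1, diag(n)[, -1])
Qmat <- t(qr.Q(qr(A)))
if (Qmat[1, 1] < 0) Qmat[1, ] <- -Qmat[1, ]  # fix the sign of the first row

y <- as.vector(Qmat %*% z)

c(
  y1          = y[1],                    # should equal sqrt(n) * mean(z)
  sqrt.n.zbar = sqrt(n) * mean(z),
  centered.ss = sum((z - mean(z))^2),    # should equal sum of y[2..n]^2
  tail.ss     = sum(y[-1]^2)
)
```

The first two entries agree, as do the last two, which mirrors the identities \(Y_1 = \sqrt{n}\,\bar{Z}\) and \(\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=2}^n Y_j^2\) used in the proof.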
Example: The \(\chi^2\) distribution is derived from the
standard normal distribution. We simulate standard normal random numbers
and then transform them into \(\chi^2\)
random variables based on the derivations above. A histogram will be
plotted and overlaid with the theoretical \(\chi^2\) density curve.
set.seed(123)
n <- 10
sigma <- 2
# Generate chi-square statistics as sums of n squared standard normals
n.samples <- 10000
chisq.stats <- numeric(n.samples)
for (i in 1:n.samples) {
  sample.data <- rnorm(n, 0, sigma)
  chisq.stats[i] <- sum((sample.data/sigma)^2)   # mean is known (0), so df = n
}
# Compare with the theoretical chi-square density (df = n)
x.vals <- seq(0, 30, length.out = 200)
theoretical.chisq <- dchisq(x.vals, df = n)
theory.df <- data.frame(x = x.vals, density = theoretical.chisq)
chi.plt <- ggplot(data.frame(x = chisq.stats), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "steelblue") +
  geom_line(data = theory.df, aes(x = x, y = density),
            color = "red", linewidth = 1) +
  labs(title = "Chi-Square Distribution",
       subtitle = "Sum of squared standard normals",
       x = "Value", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(chi.plt)
Remark: Both the chi-squared and t-distributions are parameterized by degrees of freedom. In particular, the scaled sample variance \((n-1)S^2/\sigma^2\) from a normal sample of size \(n\) follows a chi-squared distribution with \(n-1\) degrees of freedom.
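The loss of one degree of freedom when the mean is estimated can be seen by simulating the scaled sample variance itself. In the sketch below the seed and the values of `n`, `mu`, and `sigma` are illustrative assumptions; the simulated moments and quantiles are compared with those of a chi-squared distribution with \(n-1\) degrees of freedom.

```r
set.seed(2024)  # arbitrary seed
n <- 10
mu <- 5
sigma <- 2      # illustrative values
n.samples <- 10000

# Each column holds the sample mean and scaled sample variance of one sample
sims <- replicate(n.samples, {
  x <- rnorm(n, mu, sigma)
  c(xbar = mean(x), scaled.var = (n - 1) * var(x) / sigma^2)
})
scaled.var <- sims["scaled.var", ]

# Chi-squared with n - 1 df has mean n - 1 and variance 2(n - 1)
c(sim.mean = mean(scaled.var), theory.mean = n - 1,
  sim.var = var(scaled.var), theory.var = 2 * (n - 1))

# Empirical vs. theoretical quantiles
probs <- c(0.05, 0.5, 0.95)
rbind(simulated = quantile(scaled.var, probs),
      theory = qchisq(probs, df = n - 1))

# Sample correlation between the mean and the scaled variance; a value near 0
# is consistent with (though it does not prove) independence
cor(sims["xbar", ], scaled.var)
```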
F-Distribution
The F-distribution serves as the sampling distribution for the ratio
of two independent sample variances. Variance is a key measure of
quality across disciplines, where higher variance corresponds to lower
quality. When comparing quality via variances, both differences and
ratios are conceivable. However, under normal population assumptions,
the difference of two sample variances lacks a convenient known
distribution, while the appropriately scaled ratio follows the \(F\) distribution.
The setup for the definition of the \(F\) distribution is as follows. Consider two independent random samples from two normal populations:
\[
\{X_1, X_2, \cdots, X_{n_1}\} \overset{i.i.d}{\sim} N(\mu_1,
\sigma_1^2) \quad\text{ and } \quad \{Y_1, Y_2, \cdots,
Y_{n_2}\} \overset{i.i.d}{\sim} N(\mu_2, \sigma_2^2),
\]
Define the two sample variances
\[
S_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 \quad\text{
and } \quad S_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (Y_i - \bar{Y})^2.
\]
Then
\[
F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1,
n_2-1}
\]
where \(n_1-1\) and \(n_2-1\) are the degrees of freedom of the numerator and denominator, respectively. Since \(S_1^2\) and \(S_2^2\) are unbiased for \(\sigma_1^2\) and \(\sigma_2^2\), under the hypothesis \(\sigma_1^2 = \sigma_2^2\) (indicating equal product quality in variance terms) the F-ratio reduces to \(F = S_1^2/S_2^2\), which is expected to be close to 1; its exact distribution is \(F_{n_1-1, n_2-1}\).
Example: The F distribution is directly defined
based on two independent \(\chi^2\)
distributions, which are themselves derived from standard normal
distributions. Therefore, we could generate data from normal
distributions and then transform them into F random variables. To keep
the process simple, we generate data directly from \(\chi^2\) distributions.
set.seed(123)
df1 <- 10
df2 <- 15
# Generate F statistics as ratios of scaled independent chi-square variables
n.samples <- 10000
f.stats <- numeric(n.samples)
for (i in 1:n.samples) {
  u1 <- rchisq(1, df1)
  u2 <- rchisq(1, df2)
  f.stats[i] <- (u1/df1) / (u2/df2)
}
# Compare with theoretical F-distribution
x.vals <- seq(0, 5, length.out = 200)
theoretical.f <- df(x.vals, df1, df2)   # df() is the F density function
theory.df <- data.frame(x = x.vals, density = theoretical.f)
f.plt <- ggplot(data.frame(x = f.stats), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "purple3") +
  geom_line(data = theory.df, aes(x = x, y = density),
            color = "red", linewidth = 1) +
  coord_cartesian(xlim = c(0, 5)) +
  labs(title = paste("F-Distribution \n F(", df1, ",", df2, ")", sep = ""),
       x = "Value", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(f.plt)
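The same distribution can be reached by the longer route of sampling from two normal populations and forming the scaled variance ratio. The sketch below does so under assumed values for the sample sizes, population standard deviations, and seed, and compares simulated quantiles with `qf()`.

```r
set.seed(2024)  # arbitrary seed
n1 <- 11
n2 <- 16        # sample sizes giving df1 = 10 and df2 = 15
sigma1 <- 2
sigma2 <- 3     # illustrative population SDs (assumptions)
n.samples <- 10000

# Scaled ratio of two independent sample variances
f.normal <- replicate(n.samples, {
  x <- rnorm(n1, mean = 0, sd = sigma1)
  y <- rnorm(n2, mean = 0, sd = sigma2)
  (var(x) / sigma1^2) / (var(y) / sigma2^2)
})

# Empirical vs. theoretical quantiles of F(n1 - 1, n2 - 1)
probs <- c(0.05, 0.5, 0.95)
rbind(simulated = quantile(f.normal, probs),
      theory = qf(probs, df1 = n1 - 1, df2 = n2 - 1))

# The mean of F(d1, d2) is d2 / (d2 - 2) for d2 > 2, slightly above 1
c(sim.mean = mean(f.normal), theory.mean = (n2 - 1) / (n2 - 3))
```

Dividing each sample variance by its own population variance is what removes the dependence on \(\sigma_1\) and \(\sigma_2\); without that scaling the ratio would follow a rescaled F-distribution.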
Remarks:
- The F-distribution has two parameters, the numerator and denominator degrees of freedom, which are the corresponding sample sizes minus 1. This parallels the degrees-of-freedom parameter of the t and chi-squared distributions.
- From the previous section, both the numerator and the denominator of the F-ratio can be re-expressed in terms of two independent chi-squared random variables. To see this, write
\[
\frac{S_1^2}{\sigma_1^2} = \frac{(n_1-1)S_1^2/\sigma_1^2}{n_1-1}
\quad \text{ and } \quad \frac{S_2^2}{\sigma_2^2} = \frac{(n_2-1)S_2^2/\sigma_2^2}{n_2-1},
\]
where the numerators on the right-hand sides follow chi-squared distributions with \(n_1-1\) and \(n_2-1\) degrees of freedom, respectively.
- Denote \(U_1 = (n_1-1)S_1^2/\sigma_1^2 \sim \chi^2_{n_1-1}\) and \(U_2 = (n_2-1)S_2^2/\sigma_2^2 \sim \chi^2_{n_2-1}\). Then we can re-express the F-ratio as
\[
F = \frac{U_1/(n_1-1)}{U_2/(n_2-1)} \sim F_{n_1-1, n_2-1}.
\]
- One can also derive the asymptotic sampling distribution of \(S_1^2/S_2^2\) using a linear approximation based on a Taylor expansion. However, the analytic expression is complex and beyond the scope of this class.
Summary of Key Relationships
We have discussed several exact and asymptotic sampling distributions
for sample means, variances, and their functions. The following table
summarizes these distributions.
| Statistic | Exact distribution | Asymptotic distribution | Conditions |
|---|---|---|---|
| \(\bar{X}\) | \(N(\mu, \sigma^2/n)\) | \(N(\mu, \sigma^2/n)\) | Normal population (exact) or large n (asymptotic) |
| \(\frac{\bar{X}-\mu}{S/\sqrt{n}}\) | \(t_{n-1}\) | \(N(0,1)\) | Normal population |
| \(\hat{p}\) | \(Binomial(n,p)/n\) | \(N(p, p(1-p)/n)\) | \(np, n(1-p) \geq 10\) (asymptotic) |
| \(S^2\) | - | \(N(\sigma^2, (\mu_4-\sigma^4)/n)\) | large n |
| \(\frac{(n-1)S^2}{\sigma^2}\) | \(\chi^2_{n-1}\) | - | Normal population |
| \(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\) | \(F_{n_1-1,n_2-1}\) | - | Normal populations |
Pivotal Quantity
A pivotal quantity (or pivot) is a function of the sample data and an
unknown parameter whose probability distribution does
not depend on the unknown parameter.
For example, for a normal distribution with known variance, we have
\[
X_1, \dots, X_n \sim \text{N}(\mu, \sigma^2), \quad \sigma^2 \text{
known}
\]
The sample mean follows a normal distribution: \(\bar{X} \sim \text{N}\left(\mu,
\frac{\sigma^2}{n}\right)\)
According to the definition of a pivotal quantity,
\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \text{N}(0,1)
\]
is a pivot, since its distribution, \(N(0, 1)\), does not depend on \(\mu\). If the normal distribution has unknown variance, we estimate it with the sample variance
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
The resulting studentized expression
\[
T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}
\]
is a pivotal quantity, since the t-distribution with \(n-1\) degrees of freedom depends on neither \(\mu\) nor \(\sigma^2\).
Similarly, \((n-1)S^2/\sigma^2\) and
\(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\)
in the above summary table are pivotal quantities.
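One common use of a pivot is to invert a probability statement about it into an interval for the unknown parameter. The sketch below does this for the t pivot and the chi-squared pivot; the simulated data, seed, and `alpha` are illustrative assumptions, and the final line cross-checks the interval for \(\mu\) against base R's `t.test()`.

```r
set.seed(2024)                     # arbitrary seed
x <- rnorm(15, mean = 5, sd = 2)   # illustrative data (assumed values)
n <- length(x)
alpha <- 0.05

# Invert P(-t* <= (xbar - mu) / (S / sqrt(n)) <= t*) = 1 - alpha for mu
t.star <- qt(1 - alpha / 2, df = n - 1)
ci.mu <- mean(x) + c(-1, 1) * t.star * sd(x) / sqrt(n)

# Invert P(a <= (n - 1) S^2 / sigma^2 <= b) = 1 - alpha for sigma^2;
# the upper chi-squared quantile gives the lower endpoint and vice versa
ci.sigma2 <- (n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)

ci.mu
ci.sigma2

# Cross-check the interval for mu against base R's t.test()
t.test(x, conf.level = 1 - alpha)$conf.int
```

Because the distribution of each pivot is fully known, the quantiles `t.star` and `qchisq(...)` can be computed without knowing \(\mu\) or \(\sigma^2\), which is what makes the inversion possible.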
Conclusion
Understanding sampling distributions is fundamental to statistical inference:
- Exact distributions provide precise results when assumptions are met.
- Asymptotic distributions offer approximations for large samples.
- The choice between exact and asymptotic methods depends on sample size, distributional assumptions, and the specific parameter being estimated.
- Modern computing allows for empirical verification of these theoretical results.
These distributions form the theoretical foundation for hypothesis
testing, confidence intervals, and many other statistical
procedures.
