1 Introduction

Sampling distributions form the cornerstone of statistical inference. They describe the probability distribution of a sample statistic calculated from random samples. This note explores both exact (finite-sample) and asymptotic (large-sample) distributions for key statistics including sample means, proportions, and related test statistics.

2 Sampling Distribution of the Sample Mean

When the population is normal, the sum (and hence the mean) of iid random variables is exactly normally distributed, because linear combinations of independent normal random variables are normal. If the population is not normal, the Central Limit Theorem (CLT) implies that the suitably standardized sum of iid random variables with finite variance is asymptotically normally distributed.

2.1 Exact Distribution

For a random sample \(X_1, X_2, \ldots, X_n\) from a normal population \(N(\mu, \sigma^2)\), the sample mean has an exact normal distribution:

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]

The standardized version is:

\[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]

Example: We simulate data from a normal distribution to illustrate the sampling distribution of the sample mean from a normal population.

library(ggplot2)   # plotting
library(plotly)    # ggplotly() for interactive plots

set.seed(123)
n <- 10
mu <- 5
sigma <- 2

n.samples <- 10000
sample.means <- replicate(n.samples, mean(rnorm(n, mu, sigma)))  # replicate() is a wrapper
                                                                 # around sapply()

# Create theoretical curve data
x.vals <- seq(mu - 3*sigma/sqrt(n), mu + 3*sigma/sqrt(n), length.out = 100)
theory.density <- dnorm(x.vals, mean = mu, sd = sigma/sqrt(n))
theory.df <- data.frame(x = x.vals, density = theory.density)

xbar.plt <- ggplot(data.frame(mean = sample.means), aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "gray") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  labs(title = "Exact Sampling Distribution of Sample Mean \nNormal Population (n = 10)",
       x = "Sample Mean", y = "Density") +
   theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))

ggplotly(xbar.plt)
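As a numerical companion to the plot, the following base-R sketch compares the simulated sample means with the theoretical values \(\mu\) and \(\sigma/\sqrt{n}\). It reuses the settings of the example above; the Kolmogorov-Smirnov distance is an added illustration, not part of the original example.

```r
# Base-R check: moments of simulated sample means vs. theory
set.seed(123)
n <- 10; mu <- 5; sigma <- 2          # same settings as the example above
n.samples <- 10000

sample.means <- replicate(n.samples, mean(rnorm(n, mu, sigma)))

check <- c(empirical.mean   = mean(sample.means),
           theoretical.mean = mu,
           empirical.sd     = sd(sample.means),
           theoretical.sd   = sigma / sqrt(n))
print(round(check, 3))

# Kolmogorov-Smirnov distance between the simulated means and N(mu, sigma^2/n)
ks.dist <- unname(ks.test(sample.means, "pnorm",
                          mean = mu, sd = sigma / sqrt(n))$statistic)
print(ks.dist)
```

The empirical mean and standard deviation should be close to 5 and \(2/\sqrt{10}\), and the Kolmogorov-Smirnov distance should be small, since the normal distribution is exact here rather than approximate.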

2.2 Asymptotic Sampling Distribution (Central Limit Theorem)

The asymptotic sampling distribution is the approximate probability distribution of a sample statistic (like the mean, proportion, or regression coefficient) when the sample size \(n\) is very large (approaches infinity).

For any population with finite mean \(\mu\) and variance \(\sigma^2\), as \(n \to \infty\):

\[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \overset{d}{\to} N(0, 1) \]

Example: We simulate 10,000 random samples of size 50 from a skewed exponential population. Although the population is far from normal, the sampling distribution of the sample mean is approximately normal at this sample size.

set.seed(123)
n.large <- 50
lambda <- 1/5  # Mean = 5

# Generate multiple samples from exponential distribution
n.samples <- 10000
exp.means <- replicate(n.samples, mean(rexp(n.large, rate = lambda)))

# Compare with normal approximation
theoretical.mean <- 1/lambda  # 5
theoretical.sd <- (1/lambda)/sqrt(n.large)  # 5/sqrt(50)

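# Grid for the theoretical curve (mean +/- 3 standard deviations)
x.vals <- seq(theoretical.mean - 3*theoretical.sd, theoretical.mean + 3*theoretical.sd, length.out = 100)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)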
theory.df <- data.frame(x = x.vals, density = theory.density)

# Histogram of the simulated means with the normal approximation overlaid
gg.clt <- ggplot(data.frame(mean = exp.means), aes(x = mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "lightgreen") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  labs(title = "Asymptotic Sampling Distribution of Sample Mean \nExponential Population (n = 50)",
       x = "Sample Mean", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
#gg.clt
ggplotly(gg.clt)

Note: The asymptotic approximation to the sampling distribution of the sample mean holds regardless of the shape of the population distribution (provided the population has finite mean and variance).
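To see how the approximation improves with \(n\), the sketch below tracks the skewness of the simulated sample means from the same exponential population. The sample sizes and the helper `skewness()` are illustrative assumptions. For an exponential population the mean of \(n\) observations has skewness \(2/\sqrt{n}\), which shrinks toward the value 0 of a normal distribution.

```r
set.seed(123)
lambda <- 1/5
n.samples <- 10000
sizes <- c(2, 10, 50, 200)   # illustrative sample sizes (an assumption)

# Sample skewness: third central moment divided by (second central moment)^1.5
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

skew.by.n <- sapply(sizes, function(n) {
  means <- replicate(n.samples, mean(rexp(n, rate = lambda)))
  skewness(means)
})

skew.table <- data.frame(n = sizes,
                         simulated.skewness   = round(skew.by.n, 3),
                         theoretical.skewness = round(2 / sqrt(sizes), 3))
print(skew.table)
```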

Remark: The standard normal distribution \(N(0,1)\) contains no unknown parameters.


3 Student’s t-Distribution

Let \(\{X_1, X_2, \cdots, X_n \} \overset{\text{i.i.d}}{\sim} N(\mu, \sigma^2)\). Define the sample mean to be

\[ \bar{X} = \frac{\sum_{i=1}^n X_i}{n}. \]

When the population variance \(\sigma^2\) is unknown and is estimated by the sample variance \(S^2\), the resulting statistic has an exact t-distribution:

\[ T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1} \]

where

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2. \]

Note: The t-distribution depends on a single parameter: the degrees of freedom, \(\nu\), which equals \(n-1\) for a sample of size \(n\). Because \(\nu\) is fixed once the sample size is known—and does not need to be estimated from the sample data—it is occasionally treated in applications as if the distribution had no parameters.

Example: Since the above t-distribution is defined based on a normal distribution, we will simulate random samples from a normal distribution with finite mean and variance.

set.seed(123)
n <- 10
mu <- 5
sigma <- 2

# Generate t-statistics
n.samples <- 10000
t.stats <- numeric(n.samples)  # Preallocate a zero vector of length 10000;
                               # growing a vector from NULL inside the loop is slower
for(i in 1:n.samples) {
  sample.data <- rnorm(n, mu, sigma)
  x.bar <- mean(sample.data)
  s <- sd(sample.data)
  t.stats[i] <- (x.bar - mu) / (s/sqrt(n))
}

# Compare with theoretical t-distribution
x.vals <- seq(-4, 4, length.out = 200)
theoretical.t <- dt(x.vals, df = n-1)    # calling t-density function
theoretical.normal <- dnorm(x.vals)      # standard normal distribution

comparison.df <- data.frame(
  x = rep(x.vals, 2),
  density = c(theoretical.t, theoretical.normal),
  distribution = rep(c("t(9)", "N(0,1)"), each = length(x.vals))
)

t.plt <- ggplot(comparison.df, aes(x = x, y = density, color = distribution)) +
  geom_line(linewidth = 1) +
  labs(title = "t-Distribution vs Normal Distribution",
       x = "Value", y = "Density") +
    theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")) +
   scale_color_manual(values = c("red", "blue"))
ggplotly(t.plt)
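The plot above compares the two theoretical densities only. As a hedged complement, the sketch below puts the simulated t-statistics to use by comparing their two-sided tail probability with the values implied by \(t_{9}\) and \(N(0,1)\). The cutoff of 2 is an arbitrary illustrative choice.

```r
set.seed(123)
n <- 10; mu <- 5; sigma <- 2
n.samples <- 10000

# Simulated t-statistics from normal samples
t.stats <- replicate(n.samples, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

cutoff <- 2   # illustrative threshold (an assumption)
tail.probs <- c(simulated = mean(abs(t.stats) > cutoff),
                t.dist    = 2 * pt(-cutoff, df = n - 1),
                normal    = 2 * pnorm(-cutoff))
print(round(tail.probs, 4))
```

With \(n = 10\), the simulated tail probability should track the t-distribution and sit noticeably above the normal value, which reflects the heavier tails of \(t_{n-1}\) for small samples.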

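Caution: It is the studentized sample mean, standardized with the sample standard deviation \(S\), that follows a t-distribution; standardizing with a known \(\sigma\) gives a standard normal distribution instead. More precisely, when sampling from a normal population with mean \(\mu\) and unknown standard deviation \(\sigma\), the t-statistic \(T = (\bar{X}-\mu)/(S/\sqrt{n})\) follows a t-distribution with \(n-1\) degrees of freedom.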

4 Sampling Distribution of Sample Proportion

Let \(X_1, X_2, \dots, X_n\) be independent and identically distributed Bernoulli random variables with parameter \(p\), where:

  • \(X_i = 1\) with probability \(p\) (success)
  • \(X_i = 0\) with probability \(1-p\) (failure)

The sample proportion is defined as:

\[ \hat{p} = \frac{1}{n} \sum_{i=1}^n X_i \]

where \(n\) is the fixed sample size.

In practice, when the sample size is large, the sampling distribution of the sample proportion is generally characterized using a normal approximation. For small samples, however, the exact sampling distribution should be used.

4.1 Exact Distribution

For a Bernoulli population with success probability \(p\), the sample proportion can be written as \(\hat{p} = X/n\), where \(X = \sum_{i=1}^n X_i \sim \text{Binomial}(n,p)\) is the number of successes.

The exact distribution is simply the probability mass function of a binomial distribution with n trials and success probability \(p\):

\[ P(\hat{p} = k/n) = P(n \times \hat{p} = k) = P(X = k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}, \ \ k = 0, 1, 2, \cdots, n. \]
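A minimal sketch of the exact distribution for a small sample, using `dbinom()` from base R. The values \(n = 5\) and \(p = 0.3\) are illustrative assumptions.

```r
n <- 5; p <- 0.3      # small illustrative values (assumptions)
k <- 0:n

# Exact pmf of the sample proportion: P(p.hat = k/n) = P(X = k)
exact.dist <- data.frame(p.hat = k / n,
                         probability = dbinom(k, size = n, prob = p))
print(exact.dist)

# Mean and variance computed directly from the exact pmf
exact.mean <- sum(exact.dist$p.hat * exact.dist$probability)
exact.var  <- sum((exact.dist$p.hat - exact.mean)^2 * exact.dist$probability)
print(c(mean = exact.mean, variance = exact.var))
```

The mean and variance obtained from the pmf agree with \(p\) and \(p(1-p)/n\), the same two quantities that parameterize the normal approximation for large \(n\).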

4.2 Asymptotic Sampling Distribution (Large \(n\))

By the Central Limit Theorem (specifically, the De Moivre-Laplace Theorem for Bernoulli trials):

\[ \hat{p} \stackrel{d}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \quad \text{for large } n \]

More rigorously, in standardized form:

\[ Z_n = \frac{\hat{p}_n - p}{\sqrt{\frac{p(1-p)}{n}}} \stackrel{d}{\to} N(0,1) \quad \text{as } n \to \infty \]

Rules of thumb for when the approximation is adequate:

  • \(np \geq 10\) and \(n(1-p) \geq 10\) (common rule of thumb)

  • Alternative: \(n > 9 \times \max\left(\frac{p}{1-p}, \frac{1-p}{p}\right)\)
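The following sketch relates the first rule of thumb to the actual approximation error. The helpers `normal.approx.ok()` and `max.cdf.gap()` are hypothetical names introduced only for this illustration, and the three \((n, p)\) settings are arbitrary choices.

```r
# Hypothetical helper: evaluates the rule np >= 10 and n(1 - p) >= 10
normal.approx.ok <- function(n, p) (n * p >= 10) && (n * (1 - p) >= 10)

# Largest gap between the exact CDF of p.hat and its normal approximation,
# evaluated on the support points k/n (no continuity correction)
max.cdf.gap <- function(n, p) {
  k <- 0:n
  exact  <- pbinom(k, size = n, prob = p)
  approx <- pnorm(k / n, mean = p, sd = sqrt(p * (1 - p) / n))
  max(abs(exact - approx))
}

settings <- data.frame(n = c(20, 100, 1000), p = c(0.1, 0.3, 0.3))
settings$rule.met <- mapply(normal.approx.ok, settings$n, settings$p)
settings$max.gap  <- round(mapply(max.cdf.gap, settings$n, settings$p), 4)
print(settings)
```

The setting that violates the rule shows the largest gap, and the gap shrinks as \(n\) grows. Part of the remaining gap comes from approximating a discrete distribution by a continuous one.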

Example: We simulate random samples from binary population (also called Bernoulli population) to demonstrate the asymptotic sampling distribution of sample proportion.

set.seed(123)
n <- 100
p <- 0.3

# Generate sample proportions
n.samples <- 10000
sample.props <- replicate(n.samples, rbinom(1, n, p)/n) # replicate() is a wrapper
                                                        # around sapply()

# Compare with normal approximation
theoretical.mean <- p
theoretical.sd <- sqrt(p*(1-p)/n)

x.vals <- seq(0, 0.6, length.out = 100)
theory.density <- dnorm(x.vals, mean = theoretical.mean, sd = theoretical.sd)
theory.df <- data.frame(x = x.vals, density = theory.density)

binom.plt <- ggplot(data.frame(prop = sample.props), aes(x = prop)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.7, fill = "skyblue") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  #stat_function(fun = dnorm, 
  #              args = list(mean = theoretical_mean, sd = theoretical_sd),
  #              color = "red", size = 1) +
  labs(title = "Sampling Distribution of Sample Proportion",
       subtitle = "p = 0.3, n = 100",
       x = "Sample Proportion", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(binom.plt)

5 Sampling Distribution of Sample Variance

We first introduce the asymptotic sampling distribution of sample variance without derivation. The basic setting is given in the following.

Let \(X_1, X_2, \dots, X_n \stackrel{\text{i.i.d.}}{\sim} F\) with:

  • \(E[X_i] = \mu\)
  • \(\text{Var}(X_i) = \sigma^2 < \infty\)
  • \(E[(X_i - \mu)^4] = \mu_4 < \infty\) (finite fourth central moment)

Define the sample variance:

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \] where \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\).

5.1 Asymptotic Sampling Distribution of Sample Variance

When the sample size \(n\) is large, the sample variance \(S^2\) is approximately normally distributed:

\[ S^2 \overset{\text{approx.}}{\sim} N\left(\sigma^2, \frac{\mu_4 - \sigma^4}{n}\right) \quad \text{for large } n, \] equivalently,

\[ \frac{S^2 - \sigma^2}{\sqrt{\frac{\mu_4 - \sigma^4}{n}}} \stackrel{d}{\to} N(0,1) \quad \text{as } n \to \infty. \]

In practice, the fourth central moment \(\mu_4\) can be estimated from the sample, which will be discussed in subsequent topics.
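A simulation sketch of this result for a non-normal population. It assumes an exponential population with rate \(\lambda = 1/5\), for which \(\sigma^2 = 1/\lambda^2\) and \(\mu_4 = 9/\lambda^4\) are known in closed form; the sample size \(n = 200\) is an illustrative choice.

```r
set.seed(123)
n <- 200                       # illustrative sample size (an assumption)
lambda <- 1/5
sigma2 <- 1 / lambda^2         # variance of Exp(lambda): 25
mu4    <- 9 / lambda^4         # fourth central moment of Exp(lambda): 9 * sigma^4

n.samples <- 10000
s2.vals <- replicate(n.samples, var(rexp(n, rate = lambda)))

# Standard deviation of S^2 implied by the asymptotic result
asymptotic.sd <- sqrt((mu4 - sigma2^2) / n)
print(round(c(mean.S2 = mean(s2.vals), sigma2 = sigma2,
              sd.S2 = sd(s2.vals), asymptotic.sd = asymptotic.sd), 3))
```

The simulated mean of \(S^2\) should be close to \(\sigma^2 = 25\) and its standard deviation close to \(\sqrt{(\mu_4 - \sigma^4)/n} = 5\).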

5.2 Special Case: Normal Population

When sampling from a normally distributed population, the sampling distribution of the sample variance can be fully characterized through a chi-squared distribution with appropriate scaling.

The chi-squared distribution is a special case of the gamma distribution and can also be constructed from the standard normal distribution. Specifically, we have the following result:

For \(Z_1, Z_2, \ldots, Z_k \stackrel{iid}{\sim} N(0,1)\), one can show using moment generating functions that

\[ Q=\sum_{i=1}^k Z_i^2 \sim \chi_k^2. \]

Using the relationship between the standard normal and chi-squared distributions, we can derive the exact distribution of the scaled sample variance for a normal population:

\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2. \]

Proof [optional]: We prove this in several steps:

We show that for \(X_1, \dots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)\), with

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \quad \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, \]

we have

\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2. \]

Step 1: Standardize and define notation

Let \(Z_i = \frac{X_i - \mu}{\sigma} \sim N(0,1)\), i.i.d. Then

\[ \bar{Z} = \frac{1}{n} \sum_{i=1}^n Z_i = \frac{\bar{X} - \mu}{\sigma}. \]

We can write:

\[ \sum_{i=1}^n (X_i - \bar{X})^2 = \sigma^2 \sum_{i=1}^n (Z_i - \bar{Z})^2. \]

So

\[ \frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2. \]

Step 2: Orthogonal transformation

Let \(\mathbf{Z} = (Z_1, \dots, Z_n)^T\). Choose an \(n \times n\) orthogonal matrix \(Q\) whose first row is \(\left( \frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}} \right)\). Define

\[ \mathbf{Y} = Q \mathbf{Z}. \]

Then:

  • \(Y_1 = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i = \sqrt{n} \, \bar{Z}\).
  • Since \(Q\) is orthogonal and \(\mathbf{Z} \sim N(0, I_n)\), we have \(\mathbf{Y} \sim N(0, I_n)\) as well, so \(Y_1, \dots, Y_n\) are i.i.d. \(N(0,1)\).

Step 3: Express sum of squares in terms of \(Y_j\)

Orthogonality implies:

\[ \sum_{i=1}^n Z_i^2 = \sum_{j=1}^n Y_j^2. \]

Also,

\[ \sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n Z_i^2 - n \bar{Z}^2. \]

But \(n \bar{Z}^2 = Y_1^2\), so

\[ \sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=1}^n Y_j^2 - Y_1^2 = \sum_{j=2}^n Y_j^2. \]

Step 4: Distribution

Since \(Y_2, \dots, Y_n\) are i.i.d. \(N(0,1)\), we have

\[ \sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2. \]

Thus

\[ \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{j=2}^n Y_j^2 \sim \chi_{n-1}^2. \]

Step 5: Independence from \(\bar{X}\)

Since \(Y_1 = \sqrt{n} \, \bar{Z}\) is independent of \(Y_2, \dots, Y_n\), and \(S^2\) is a function of \(Y_2, \dots, Y_n\) only, it follows that \(\bar{X}\) is independent of \(S^2\). In summary,

\[ \boxed{\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2} \]
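The algebra in Steps 2 and 3 can be checked numerically. The sketch below builds one possible orthogonal matrix with first row \((1/\sqrt{n}, \dots, 1/\sqrt{n})\) via a QR decomposition. This construction is an assumption made for illustration, and any orthogonal matrix with that first row would serve.

```r
set.seed(123)
n <- 6                                   # illustrative dimension (an assumption)
z <- rnorm(n)

# QR-decompose a full-rank matrix whose first column is the vector of ones;
# the first column of the orthogonal factor is then +/- (1/sqrt(n), ..., 1/sqrt(n)).
A <- cbind(rep(1, n), diag(n)[, -1])
Q <- t(qr.Q(qr(A)))                      # transpose so that the rows are orthonormal
if (Q[1, 1] < 0) Q[1, ] <- -Q[1, ]       # fix the sign of the first row

y <- as.vector(Q %*% z)

# Step 2: Y_1 equals sqrt(n) * Z.bar
print(c(Y1 = y[1], sqrt.n.zbar = sqrt(n) * mean(z)))

# Step 3: the sum of squared deviations equals Y_2^2 + ... + Y_n^2
print(c(sum.sq.dev = sum((z - mean(z))^2), sum.Y2.to.Yn = sum(y[-1]^2)))
```

Both pairs of numbers agree up to floating-point error, for this and any other draw of `z`.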

Example: The \(\chi^2\) distribution is derived from the standard normal distribution. We simulate standard normal random numbers and then transform them into \(\chi^2\) random variables based on the derivations above. A histogram will be plotted and overlaid with the theoretical \(\chi^2\) density curve.

set.seed(123)
n <- 10
sigma <- 2

# Generate chi-square statistics
n.samples <- 10000
chisq.stats <- numeric(n.samples)

for(i in 1:n.samples) {
  sample.data <- rnorm(n, 0, sigma)
  chisq.stats[i] <- sum((sample.data/sigma)^2)
}

# Compare with theoretical chi-square
x.vals <- seq(0, 30, length.out = 200)
theoretical.chisq <- dchisq(x.vals, df = n)
theory.df <- data.frame(x = x.vals, density = theoretical.chisq)

chi.plt <- ggplot(data.frame(x = chisq.stats), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "steelblue") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  #stat_function(fun = dchisq, args = list(df = n), color = "red", size = 1) +
  labs(title = "Chi-Square Distribution",
       subtitle = "Sum of squared standard normals",
       x = "Value", y = "Density") +
   theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(chi.plt)
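The example above sums \(n\) squared standard normals with the mean known, which gives \(n\) degrees of freedom. The sketch below instead simulates the scaled sample variance \((n-1)S^2/\sigma^2\), where the mean is estimated by \(\bar{X}\), and compares its quantiles with those of \(\chi^2_{n-1}\) and \(\chi^2_{n}\). The value \(\mu = 5\) is an illustrative assumption.

```r
set.seed(123)
n <- 10; mu <- 5; sigma <- 2
n.samples <- 10000

# Scaled sample variance (n - 1) S^2 / sigma^2 for each simulated sample
scaled.var <- replicate(n.samples, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)

probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quantile.check <- data.frame(
  prob            = probs,
  simulated       = unname(quantile(scaled.var, probs)),
  chisq.n.minus.1 = qchisq(probs, df = n - 1),
  chisq.n         = qchisq(probs, df = n)
)
print(round(quantile.check, 3))
```

The simulated quantiles should match the \(\chi^2_{n-1}\) column rather than the \(\chi^2_{n}\) column: estimating the mean costs one degree of freedom.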

Remark: Both the chi-squared and t-distributions are parameterized by degrees of freedom. In particular, the scaled sample variance \((n-1)S^2/\sigma^2\) from a normal sample of size \(n\) follows a chi-squared distribution with \(n-1\) degrees of freedom.

6 F-Distribution

The F-distribution serves as the sampling distribution for the ratio of two independent sample variances. Variance is a key measure of quality across disciplines, where higher variance corresponds to lower quality. When comparing quality via variances, both differences and ratios are conceivable. However, under normal population assumptions, the difference of two sample variances lacks a convenient known distribution, while the appropriately scaled ratio follows the \(F\) distribution.

The setup for the definition of the \(F\) distribution is as follows. Consider two independent random samples from two normal populations:

\[ \{X_1, X_2, \cdots, X_{n_1}\} \overset{i.i.d}{\sim} N(\mu_1, \sigma_1^2) \quad\text{ and } \quad \{Y_1, Y_2, \cdots, Y_{n_2}\} \overset{i.i.d}{\sim} N(\mu_2, \sigma_2^2), \]

Define

\[ S_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 \quad\text{ and } \quad S_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (Y_i - \bar{Y})^2 \]

\[ F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1, n_2-1} \]

where \(n_1-1\) and \(n_2-1\) are the numerator and denominator degrees of freedom, respectively. Under the hypothesis \(\sigma_1^2 = \sigma_2^2\) (indicating equal product quality in variance terms), the unknown variances cancel and the F-ratio reduces to \(F = S_1^2/S_2^2 \sim F_{n_1-1, n_2-1}\). Since \(S_1^2\) and \(S_2^2\) are unbiased for \(\sigma_1^2\) and \(\sigma_2^2\), this ratio is expected to be close to 1; its exact mean is \((n_2-1)/(n_2-3)\) for \(n_2 > 3\).

Example: The F distribution is directly defined based on two independent \(\chi^2\) distributions, which are themselves derived from standard normal distributions. Therefore, we could generate data from normal distributions and then transform them into F random variables. To keep the process simple, we generate data directly from \(\chi^2\) distributions.

set.seed(123)
df1 <- 10
df2 <- 15

# Generate F statistics
n.samples <- 10000
f.stats <- numeric(n.samples)

for(i in 1:n.samples) {
  u1 <- rchisq(1, df1)
  u2 <- rchisq(1, df2)
  f.stats[i] <- (u1/df1) / (u2/df2)
}

# Compare with theoretical F-distribution
x.vals <- seq(0, 5, length.out = 200)
theoretical.f <- df(x.vals, df1, df2)
theory.df <- data.frame(x = x.vals, density = theoretical.f)




f.plt <- ggplot(data.frame(x = f.stats), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.7, fill = "purple3") +
  geom_line(data = theory.df, aes(x = x, y = density), 
            color = "red", linewidth = 1) +
  coord_cartesian(xlim = c(0, 5)) +
  labs(title = paste("F-Distribution \n F(", df1, ",", df2, ")", sep = ""),
       x = "Value", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt"))
ggplotly(f.plt)

Remarks:

  1. The F-distribution has two parameters, the numerator and denominator degrees of freedom, which are the corresponding sample sizes minus 1. This is analogous to the single degrees-of-freedom parameter of the t and chi-squared distributions.

  2. From the previous section, both the numerator and the denominator of the F-ratio can be re-expressed in terms of two independent chi-squared random variables. To see this, write

\[ \frac{S_1^2}{\sigma_1^2} = \frac{(n_1-1)S_1^2/\sigma_1^2}{n_1-1} \quad \text{ and } \quad \frac{S_2^2}{\sigma_2^2} = \frac{(n_2-1)S_2^2/\sigma_2^2}{n_2-1}, \]

     where the numerators on the right-hand sides follow chi-squared distributions with \(n_1-1\) and \(n_2-1\) degrees of freedom, respectively.

  3. Denote \(U_1 = (n_1-1)S_1^2/\sigma_1^2 \sim \chi^2_{n_1-1}\) and \(U_2 = (n_2-1)S_2^2/\sigma_2^2 \sim \chi^2_{n_2-1}\). Then we can re-express the F-ratio as

\[ F = \frac{U_1/(n_1-1)}{U_2/(n_2-1)} \sim F_{n_1-1, n_2-1}. \]

  4. One can also derive the asymptotic sampling distribution of \(S_1^2/S_2^2\) using a linear approximation based on a Taylor expansion. However, the analytic expression is complex and beyond the scope of this class.
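The chi-squared representation in the remarks above can also be checked starting from normal data rather than from \(\chi^2\) draws. The sketch below assumes sample sizes 11 and 16 (so that the degrees of freedom are 10 and 15) and arbitrary population means and standard deviations.

```r
set.seed(123)
n1 <- 11; n2 <- 16                  # give 10 and 15 degrees of freedom
sigma1 <- 2; sigma2 <- 3            # illustrative values (assumptions)
n.samples <- 10000

# F-ratio computed from two independent normal samples
f.ratio <- replicate(n.samples, {
  x <- rnorm(n1, mean = 0,  sd = sigma1)
  y <- rnorm(n2, mean = 10, sd = sigma2)
  (var(x) / sigma1^2) / (var(y) / sigma2^2)
})

probs <- c(0.05, 0.50, 0.95)
print(rbind(simulated   = quantile(f.ratio, probs),
            theoretical = qf(probs, df1 = n1 - 1, df2 = n2 - 1)))
```

The simulated quantiles should agree with those of \(F_{10, 15}\) regardless of the population means, which do not enter the F-ratio.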

7 Summary of Key Relationships

We have discussed several exact and asymptotic sampling distributions for sample means, variances, and their functions. The following table summarizes these distributions.

| Statistic | Exact Distribution | Asymptotic Distribution | Conditions |
|---|---|---|---|
| \(\bar{X}\) | \(N(\mu, \sigma^2/n)\) | \(N(\mu, \sigma^2/n)\) | Normal population (exact) or large \(n\) (asymptotic) |
| \(\frac{\bar{X}-\mu}{S/\sqrt{n}}\) | \(t_{n-1}\) | \(N(0,1)\) | Normal population (exact); large \(n\) (asymptotic) |
| \(\hat{p}\) | \(Binomial(n,p)/n\) | \(N(p, p(1-p)/n)\) | \(np \geq 10\) and \(n(1-p) \geq 10\) (asymptotic) |
| \(S^2\) | - | \(N(\sigma^2, (\mu_4-\sigma^4)/n)\) | Large \(n\), finite fourth moment |
| \(\frac{(n-1)S^2}{\sigma^2}\) | \(\chi^2_{n-1}\) | - | Normal population |
| \(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\) | \(F_{n_1-1,n_2-1}\) | - | Normal populations |

Pivotal Quantity

A pivotal quantity (or pivot) is a function of the sample data and an unknown parameter whose probability distribution does not depend on the unknown parameter.

For example, consider a random sample from a normal distribution with known variance:

\[ X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \text{N}(\mu, \sigma^2), \quad \sigma^2 \text{ known} \]

The sample mean follows a normal distribution: \(\bar{X} \sim \text{N}\left(\mu, \frac{\sigma^2}{n}\right)\).

According to the definition of pivotal quantity,

\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \text{N}(0,1) \]

is a pivot, since its distribution \(N(0, 1)\) does not depend on \(\mu\). If the variance of the normal distribution is unknown, it is estimated by the sample variance

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2. \]

The following standardized expression

\[ T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1} \]

is a pivotal quantity, since the \(t_{n-1}\) distribution depends on neither \(\mu\) nor \(\sigma\).

Similarly, \((n-1)S^2/\sigma^2\) and \(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\) in the above summary table are pivotal quantities.
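The defining property of a pivot can be illustrated by simulation: the distribution of \(T\) should look the same whatever values of \(\mu\) and \(\sigma\) generated the data. The helper `sim.t()` and the two parameter settings below are hypothetical choices made only for this sketch.

```r
set.seed(123)
n <- 10
n.samples <- 10000

# Simulate T = (xbar - mu) / (s / sqrt(n)) for a given (mu, sigma)
sim.t <- function(mu, sigma) {
  replicate(n.samples, {
    x <- rnorm(n, mu, sigma)
    (mean(x) - mu) / (sd(x) / sqrt(n))
  })
}

t.A <- sim.t(mu = 0,   sigma = 1)    # setting A (assumption)
t.B <- sim.t(mu = 100, sigma = 50)   # setting B (assumption)

probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
print(round(rbind(setting.A = quantile(t.A, probs),
                  setting.B = quantile(t.B, probs),
                  t.dist    = qt(probs, df = n - 1)), 3))
```

Up to simulation noise, the three rows coincide, which is what makes \(T\) usable for inference about \(\mu\) without knowing \(\sigma\).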

Conclusion

Understanding sampling distributions is fundamental to statistical inference:

  • Exact distributions provide precise results when assumptions are met

  • Asymptotic distributions offer approximations for large samples

  • The choice between exact and asymptotic methods depends on sample size, distributional assumptions, and the specific parameter being estimated

  • Modern computing allows for empirical verification of these theoretical results

These distributions form the theoretical foundation for hypothesis testing, confidence intervals, and many other statistical procedures.

