This module reviews the basic inferential statistics covered in the previous class. The major topics are sampling distribution of sample means and sample proportions, constructing confidence intervals for population means, and testing hypotheses of population means. Details will not be covered in this note. Some of the examples will be explained using software programs such as R and SPSS.
The following YouTube video from Ace Tutors demonstrates how to use the standard normal table to find probabilities of general normal distributions.
Estimation in statistics uses sample data to estimate population parameters such as the population mean (usually denoted by the Greek letter \(\mu\)) and the population proportion (denoted by \(p\)).
Assume that a random sample \(\{x_1,x_2, \cdots, x_n \}\) is drawn from a population with unknown mean \(\mu\). The estimate of the population mean is the sample mean
\[ \bar{x} = \frac{x_1+ x_2+\cdots+x_n}{n} = \frac{\sum_{i=1}^n x_i}{n}. \]
Since the sample is random, the sample mean \(\bar{x}\) is also random. The question is: what is the distribution of \(\bar{x}\)?
We break down the discussion of the sampling distribution of \(\bar{x}\) into three scenarios.
Scenario 1: the sample size is large and the population standard deviation is unknown. By the central limit theorem (CLT),
\[ \bar{x} \rightarrow N\left(\mu, \frac{s}{\sqrt{n}} \right) \ \ \text{ or equivalently } \ \ \frac{\bar{x}-\mu}{s/\sqrt{n}} \rightarrow N(0, 1). \]
Throughout this note, the second argument of \(N(\cdot, \cdot)\) is the standard deviation. Note that, in this scenario, the assumption of a large sample size is crucial to guarantee a good normal approximation. Although there is no theoretically recommended threshold for deciding whether a sample is large, we use an operational threshold of \(n = 30\). That is, in this course, if \(n > 30\), the sample is considered large, and the above sampling distribution applies.
Scenario 2: the population is normal and the population standard deviation \(\sigma\) is known. For any sample size,
\[ \bar{x} \rightarrow N\left(\mu, \frac{\sigma}{\sqrt{n}} \right) \ \ \text{ or equivalently } \ \ \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \rightarrow N(0, 1). \]
Scenario 3: the population is normal and the population standard deviation is unknown. In this case,
\[ \frac{\bar{x} - \mu}{s/\sqrt{n}} \rightarrow t_{n-1}. \]
A t distribution with \(df\) degrees of freedom is symmetric with mean 0 and variance \(df/(df-2) > 1\) (for \(df > 2\)). As the degrees of freedom increase, the t distribution approaches the standard normal distribution; consequently, when \(n\) is large we can simply use the central limit theorem and treat the sampling distribution of \(\bar{x}\) as normal, as discussed in scenario #1. However, under this scenario, if the sample size \(n\) is small, the t distribution with \(df = n-1\) must be used to characterize the sampling distribution:
\[ \frac{\bar{x} - \mu}{s/\sqrt{n}} \rightarrow t_{n-1}. \]
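The large-sample behavior of \(\bar{x}\) can be checked with a small simulation. The following is a minimal sketch using only the Python standard library; the exponential population with mean 2, the sample size, the number of replications, and the seed are all illustrative assumptions, not part of the course material.

```python
import random
import statistics

random.seed(2024)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population: exponential with mean 2 (its SD is also 2).
mu, sigma = 2.0, 2.0
n = 50        # sample size (large-sample scenario, n > 30)
reps = 5000   # number of simulated samples

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = sigma / n ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (mu = {mu})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Even though the population is skewed, the simulated sample means should center near \(\mu\) with a spread close to \(\sigma/\sqrt{n}\), which is what the normal approximation predicts.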
The discussion of the sampling distribution of sample proportions is relatively straightforward. Without loss of generality, a binary population has two distinct values: success and failure. Assume the population size is \(N\); the population proportion is
\[ p = \frac{\text{number of successes in the population}}{N}. \]
Consider a random sample \(\{x_1, x_2, \cdots, x_n\}\) taken from a binary population with distinct values success and failure. Let \(X = \text{ number of successes in the sample}\), then the sample proportion is defined to be \(\hat{p} = X/n\).
By the same logic as for the sample mean, \(\hat{p}\) is random. Its distribution can be approximated by a normal distribution if the following conditions are satisfied.
\[ n\hat{p} \ge 10 \ \ \text{ and } \ \ n(1-\hat{p}) \ge 10. \]
To be more specific, under the above assumptions,
\[ \hat{p} \rightarrow N\left(p, \sqrt{\frac{p(1-p)}{n}} \right). \]
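The conditions and the standard error above can be sketched in a few lines of Python. The function name `proportion_sampling_summary` and the sample counts are hypothetical, and the standard error plugs in \(\hat{p}\) for the unknown \(p\), as is done when building confidence intervals.

```python
import math

def proportion_sampling_summary(successes, n):
    """Return p-hat, its estimated standard error, and whether the
    conditions n*p_hat >= 10 and n*(1 - p_hat) >= 10 both hold."""
    p_hat = successes / n
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # n * p_hat is the success count; n * (1 - p_hat) is the failure count.
    normal_ok = successes >= 10 and (n - successes) >= 10
    return p_hat, std_error, normal_ok

# Hypothetical sample: 45 successes out of 150 observations.
p_hat, std_error, normal_ok = proportion_sampling_summary(45, 150)
print(f"p_hat = {p_hat:.3f}, SE = {std_error:.4f}, normal approximation OK: {normal_ok}")
```

With a very small sample (for instance 3 successes out of 20), the same function reports that the conditions fail, so the normal approximation should not be used.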
We will review one-sample and two-sample confidence intervals separately.
The confidence intervals for the population mean (\(\mu\)) and proportion (\(p\)) are based on the corresponding sample mean \(\bar{x}\) and sample proportion \(\hat{p}\) and their associated sampling errors. The explicit forms of the confidence intervals for \(\mu\) and \(p\) are given respectively by
\[ \mu \in (\bar{x} - E_{\bar{x}}, \bar{x} + E_{\bar{x}}) \ \ \text{and} \ \ p \in (\hat{p}-E_{\hat{p}}, \hat{p}+E_{\hat{p}}), \] where the margins of error \(E_{\bar{x}}\) and \(E_{\hat{p}}\) associated with the sample mean (\(\bar{x}\)) and the sample proportion (\(\hat{p}\)) are defined respectively by
\[ E_{\bar{x}} = \text{CV}\frac{s}{\sqrt{n}} \ \ \text{ and } \ \ E_{\hat{p}} = \text{CV}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]
The critical value \(CV\) is a table value (either from the t- or normal table based on the given confidence level).
Before working on examples, let’s watch the following YouTube video that provides an example of constructing a normal confidence interval of the population mean.
Example 1. Suppose we want to estimate, with 95% confidence level, the mean (average) length of all walleye fingerlings in a fish hatchery pond. A random sample of 100 fingerlings was selected. The average length is 7.5 inches, and the standard deviation is 2.3 inches. That is, \(\bar{x} = 7.5\), \(s = 2.3\), and \(n = 100\).
Solution: Since \(n = 100 > 30\), by scenario #1, the sampling distribution of the sample mean \(\bar{x}\) is approximately normal. The critical value corresponding to a \(1-\alpha = 95\%\) confidence level is \(Z_{\alpha/2} = Z_{0.025} = 1.96\). The margin of error \(E\) is
\[ E = 1.96\times \frac{2.3}{\sqrt{100}} \approx 0.45. \]
The \(95\%\) confidence interval for the population mean is \((7.5-0.45, 7.5 + 0.45) = (7.05, 7.95)\).
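The calculation in Example 1 can be reproduced with a short Python sketch. It assumes only the standard library; `NormalDist().inv_cdf` supplies the critical value in place of the normal table, so the result agrees with the hand calculation up to rounding. The variable names are illustrative.

```python
from statistics import NormalDist
import math

x_bar, s, n = 7.5, 2.3, 100   # summary statistics from Example 1
conf_level = 0.95

alpha = 1 - conf_level
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
margin = z_crit * s / math.sqrt(n)             # standard error is s / sqrt(n)
ci = (x_bar - margin, x_bar + margin)

print(f"critical value: {z_crit:.3f}")
print(f"margin of error: {margin:.3f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```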
Example 2. In a survey of 1219 U.S. adults, 354 said that their favorite sport to watch is football. Construct a 95% confidence interval for the proportion of adults in the United States who say that their favorite sport to watch is football.
Solution: First of all, the sample proportion is \(\hat{p} = 354/1219 \approx 0.29\) and the sample size is \(n = 1219\). Since \(n\hat{p} = 354 \ge 10\) and \(n(1-\hat{p}) = 865 \ge 10\), the sampling distribution of \(\hat{p}\) is approximately normal. The critical value corresponding to a 95% confidence level is \(Z_{0.025} = 1.96\). The margin of error is
\[ E = 1.96 \times \sqrt{\frac{0.29(1-0.29)}{1219}} \approx 0.0255. \]
Therefore, the \(95\%\) confidence interval is given by \((0.29- 0.0255, 0.29 + 0.0255) = (0.2645, 0.3155)\).
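Example 2 can be checked the same way. This sketch assumes only the Python standard library and keeps \(\hat{p}\) unrounded, so its endpoints may differ from a hand calculation with \(\hat{p} \approx 0.29\) in the third or fourth decimal place. The variable names are illustrative.

```python
from statistics import NormalDist
import math

successes, n = 354, 1219   # survey counts from Example 2
conf_level = 0.95

p_hat = successes / n
# Normal-approximation check: success and failure counts should both be >= 10.
conditions_ok = successes >= 10 and (n - successes) >= 10

z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # about 1.96
margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(f"p_hat = {p_hat:.4f}, conditions met: {conditions_ok}")
print(f"margin of error: {margin:.4f}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```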
Example 3. Estimating Car Pollution. In a sample of seven cars, each car was tested for nitrogen-oxide emissions (in grams per mile), and the following results were obtained: 0.06, 0.11, 0.16, 0.15, 0.14, 0.08, 0.15 (based on data from the Environmental Protection Agency). Assume that this sample is representative of the cars in use and that the amounts of nitrogen-oxide emissions for all cars are normally distributed. Construct a 98% confidence interval estimate of the mean amount of nitrogen-oxide emissions for all cars.
Solution: We first calculate the sample mean and sample standard deviation using the formulas introduced in the note on descriptive statistics. \[ \bar{x} = 0.1214, \ \ \ s = 0.0389. \] Since this small sample was taken from a normal population with an unknown standard deviation, the critical value should be based on the t-distribution with \(7 - 1 = 6\) degrees of freedom. The \(98\%\) critical value is \(t_{6, 0.01} = 3.143\). Therefore, the margin of error is
\[ E = 3.143 \times 0.0389/\sqrt{7} \approx 0.0462. \]
The resulting \(98\%\) confidence interval is given by
\[ (0.1214 - 0.0462, 0.1214 + 0.0462) = (0.0752, 0.1676). \]
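The steps of Example 3 can be sketched in Python with the standard library. Because the standard library has no t-distribution quantile function, the critical value \(t_{6, 0.01} = 3.143\) is taken from the t table as in the solution above; everything else is computed from the raw data. The variable names are illustrative.

```python
import math
import statistics

# Nitrogen-oxide emissions (grams per mile) from Example 3.
emissions = [0.06, 0.11, 0.16, 0.15, 0.14, 0.08, 0.15]

n = len(emissions)
x_bar = statistics.mean(emissions)
s = statistics.stdev(emissions)   # sample standard deviation (divides by n - 1)

t_crit = 3.143                    # t table value for df = 6 and 98% confidence
margin = t_crit * s / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)

print(f"x_bar = {x_bar:.4f}, s = {s:.4f}")
print(f"margin of error: {margin:.4f}")
print(f"98% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```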
The foundation of two-sample inference is the sampling distribution of the difference between two sample statistics, such as sample means and sample proportions. The following YouTube video discusses the sampling distribution of the difference of two sample means.
We will only review the confidence intervals for the difference of two population means from two independent random samples. The basic settings are summarized in the following table.
| Statistic | Population #1 | Population #2 |
|---|---|---|
| Sample mean | \(\bar{x}_1\) | \(\bar{x}_2\) |
| Standard deviation (sample or population) | \(s_1\) or \(\sigma_1\) | \(s_2\) or \(\sigma_2\) |
| Sample size | \(n_1\) | \(n_2\) |
Based on different given conditions, we outline the confidence intervals in three different cases:
Case 1: Both Sample Sizes Are Large. If both sample sizes are large, the sampling distribution of \(\bar{x}_1 - \bar{x}_2\) is given by
\[ \bar{x}_1 - \bar{x}_2 \rightarrow N\left(\mu_1-\mu_2, \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \right). \]
The \(100(1-\alpha)\%\) confidence interval is explicitly given by
\[ (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}. \]
Case 2: Both Populations Are Normal and Standard Deviations (\(\sigma_1\), \(\sigma_2\)) Are Known. In this case, there is no restriction on the sample sizes. The sampling distribution of \(\bar{x}_1 - \bar{x}_2\) is given by
\[ \bar{x}_1 - \bar{x}_2 \rightarrow N\left(\mu_1-\mu_2, \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \right). \]
The corresponding \(100(1-\alpha)\%\) confidence interval for \(\mu_1-\mu_2\) is explicitly given by
\[ (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}. \]
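Cases 1 and 2 share the same interval formula; only the source of the standard deviations differs (sample values in Case 1, known population values in Case 2). A minimal sketch follows, assuming the Python standard library; the function name `two_sample_z_ci` and the summary statistics are hypothetical.

```python
from statistics import NormalDist
import math

def two_sample_z_ci(x1_bar, sd1, n1, x2_bar, sd2, n2, conf_level=0.95):
    """Normal-based confidence interval for mu_1 - mu_2.
    sd1 and sd2 are sample SDs (Case 1) or known population SDs (Case 2)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    std_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = x1_bar - x2_bar
    return diff - z_crit * std_error, diff + z_crit * std_error

# Hypothetical summary statistics for two large independent samples.
lower, upper = two_sample_z_ci(52.0, 6.0, 64, 49.5, 5.0, 100)
print(f"95% CI for mu_1 - mu_2: ({lower:.3f}, {upper:.3f})")
```

Since this interval lies entirely above zero for the hypothetical data, it would suggest \(\mu_1 > \mu_2\) at the 95% confidence level.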
Case 3: Both Populations Are Normal and Standard Deviations (\(\sigma_1\), \(\sigma_2\)) Are Unknown but Equal. In this case, we need to combine the two samples to estimate the common variance. If the descriptive statistics are given, the pooled estimate of the common variance is
\[ s_{\text{pool}}^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}. \]
The sampling distribution related to (\(\bar{x}_1 - \bar{x}_2\)) is given by
\[ \frac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{\sqrt{s_{\text{pool}}^2/n_1 + s_{\text{pool}}^2/n_2}} \rightarrow t_{n_1+n_2-2}. \]
Using the t critical value, we write the \(100(1-\alpha)\%\) confidence interval for \(\mu_1 - \mu_2\) as
\[ (\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2, \alpha/2} \sqrt{\frac{s_{\text{pool}}^2}{n_1} + \frac{s_{\text{pool}}^2}{n_2}}. \]
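The pooled interval in Case 3 can be sketched as follows. The summary statistics are hypothetical, and the critical value \(t_{20, 0.025} = 2.086\) is read from a t table because the Python standard library does not provide t quantiles. The function name `pooled_t_ci` is illustrative.

```python
import math

def pooled_t_ci(x1_bar, s1, n1, x2_bar, s2, n2, t_crit):
    """Pooled-variance t confidence interval for mu_1 - mu_2.
    t_crit must be looked up for df = n1 + n2 - 2."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var / n1 + pooled_var / n2)
    diff = x1_bar - x2_bar
    return pooled_var, (diff - t_crit * std_error, diff + t_crit * std_error)

# Hypothetical small samples: n1 = 10, n2 = 12, so df = 20.
pooled_var, ci = pooled_t_ci(25.0, 3.0, 10, 22.0, 4.0, 12, t_crit=2.086)
print(f"pooled variance: {pooled_var:.2f}")
print(f"95% CI for mu_1 - mu_2: ({ci[0]:.3f}, {ci[1]:.3f})")
```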
The following YouTube video summarizes the two-sample t confidence interval with an example.
We’ve reviewed confidence intervals (CI), which estimate a population parameter with a range of plausible values. Let’s focus on the logic of hypothesis testing, the other major type of statistical inference. While CIs provide a range, hypothesis testing evaluates a specific claim about a population parameter. We primarily focus on testing hypotheses about one and two population means.
Hypothesis testing follows a structured process to determine whether sample data provides sufficient evidence to reject a default assumption (null hypothesis) in favor of an alternative claim. Here’s the step-by-step reasoning:
The following YouTube video from Ace Tutors summarizes the basic logic and process of performing hypothesis testing.
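This type of inference verifies a claim about the population mean, which has one of the aforementioned six forms. Based on the available amount of information, we discuss the test in the following scenarios. In each scenario, \(\mu_0\) denotes the value of the population mean specified in the null hypothesis.
Scenario 1: the sample size is large and the population standard deviation is unknown. The test statistic is
\[ TS = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \rightarrow N(0, 1). \]
The critical value and p-value are based on the standard normal distribution.
Scenario 2: the population is normal and the population standard deviation \(\sigma\) is known. The test statistic is
\[ TS = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \rightarrow N(0, 1). \]
In this case, the critical value and p-value are based on the standard normal distribution.
Scenario 3: the population is normal, the population standard deviation is unknown, and the sample size is small. The test statistic is
\[ TS = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \rightarrow t_{n-1}. \]
In this case, the critical value and p-value are based on the t distribution with \(n-1\) degrees of freedom.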
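For the cases in which the test statistic is referred to the standard normal distribution, the computation can be sketched with the Python standard library. The function name `one_sample_z_test` and the summary statistics are hypothetical, and only the two-sided p-value is shown. When the t distribution applies, the test statistic is computed the same way but must be compared with a t table value instead.

```python
from statistics import NormalDist
import math

def one_sample_z_test(x_bar, mu0, sd, n):
    """Return the test statistic and the two-sided p-value based on N(0, 1).
    sd is the sample SD (large n) or the known population SD."""
    ts = (x_bar - mu0) / (sd / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(ts)))
    return ts, p_value

# Hypothetical data: H0: mu = 50 versus Ha: mu != 50.
ts, p_value = one_sample_z_test(x_bar=52.1, mu0=50.0, sd=8.0, n=64)
print(f"TS = {ts:.2f}, two-sided p-value = {p_value:.4f}")
```

For these hypothetical numbers the p-value falls below 0.05, so the null hypothesis would be rejected at the 5% significance level.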
The following YouTube video explains when a Z-test or t-test should be used.
This type of inference focuses on comparing the difference between two population means \(\mu_1- \mu_2\). The definition of the test statistic of this hypothesis testing is dependent on the given conditions.
Both Sample Sizes Are Large: If both sample sizes are large (say \(n_1 > 30\) and \(n_2 > 30\)), the test statistic is defined to be
\[ TS_1 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 -\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}, \]
where \(s_1^2\) and \(s_2^2\) are sample variances. If both population variances \(\sigma_1^2\) and \(\sigma_2^2\) are given, replace the two sample variances in the above formula with the given population variances. By the Central Limit Theorem (CLT), \(TS_1 \rightarrow N(0, 1)\).
Both Populations Are Normal and Corresponding Variances Are Given: In this case, we don’t have a restriction on the sample sizes. The test statistic is defined to be
\[ TS_2 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 -\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \rightarrow N(0,1). \]
Both Populations Are Normal and Population Variances Are Unknown But Equal: In this case, we need to pool the two samples to estimate the common variance. For given descriptive statistics (i.e., sample sizes, sample means, sample variances), the estimated common variance is given by
\[ s_{\text{pool}}^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}. \]
The test statistic is defined as
\[ TS_3 = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 -\mu_2)}{s_{\text{pool}}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \rightarrow t_{n_1+n_2-2}. \]
Note that the condition of equal variances is crucial to guarantee the t distribution with \(n_1 + n_2 -2\) degrees of freedom. If the equal-variance condition is not satisfied, \(TS_3\) does not follow a t distribution. However, we can define a modified test statistic based on the unpooled standard error, as in Welch's t-test, whose distribution is approximately a t distribution.
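The pooled test statistic can be computed directly from summary statistics. The sketch below uses hypothetical data and tests \(H_0: \mu_1 - \mu_2 = 0\) against a two-sided alternative; the critical value \(t_{20, 0.025} = 2.086\) comes from a t table, since the Python standard library has no t quantile function. The function name `pooled_t_statistic` is illustrative.

```python
import math

def pooled_t_statistic(x1_bar, s1, n1, x2_bar, s2, n2, null_diff=0.0):
    """Pooled two-sample t statistic with n1 + n2 - 2 degrees of freedom."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return ((x1_bar - x2_bar) - null_diff) / std_error

# Hypothetical small samples: n1 = 10, n2 = 12, so df = 20.
ts = pooled_t_statistic(25.0, 3.0, 10, 22.0, 4.0, 12)
t_crit = 2.086                  # t table value for df = 20, two-sided 5% level
reject_null = abs(ts) > t_crit

print(f"TS = {ts:.3f}, critical value = {t_crit}, reject H0: {reject_null}")
```

Here the test statistic is smaller in absolute value than the critical value, so the hypothetical data do not provide sufficient evidence of a difference at the 5% level.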
The next YouTube video explains the pooled sample t-test.
In introductory statistics, we also learned how to characterize the relationship between two numerical variables using the correlation coefficient and least squares regression.
The Pearson correlation coefficient measures the magnitude and the direction of the linear relationship. The formula for the sample correlation coefficient (Pearson) is given by
\[ r = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}} \]
where \(x\) and \(y\) are two numerical variables. Although \(r\) carries the magnitude and direction of the linear relationship, it does not indicate how one variable influences the other variable.
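The formula for \(r\) translates directly into code. This is a minimal sketch with the Python standard library; the function name `pearson_r` and the five data points are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient computed from its definition."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Hypothetical paired observations with a strong positive linear trend.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

A value of \(r\) close to 1 reflects the nearly perfect increasing linear pattern in these illustrative data.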
Simple linear regression (SLR) explicitly shows the influence of one variable on the other. A simple linear regression (SLR) is defined as
\[ y = \beta_0 + \beta_1 x + \epsilon, \]
where \(y\) is called the response variable (also called the dependent variable) and \(x\) is called the predictor variable (also called the independent variable or explanatory variable). \(\epsilon\) is a random variable called the error term, which is estimated by the residuals. The basic assumptions on SLR are
\(\beta_0\) and \(\beta_1\) are intercept and slope parameters in the SLR. The slope parameter \(\beta_1\) reflects the change of \(y\) when \(x\) increases by one unit.
Relationship between the Pearson Correlation Coefficient (\(r\)) and the Slope Parameter (\(\beta_1\)): For SLR, the estimates of both regression coefficients (\(\beta_0\) and \(\beta_1\)) can be expressed explicitly in terms of the sample data.
| obs | 1 | 2 | 3 | \(\dots\) | \(n-1\) | \(n\) |
|---|---|---|---|---|---|---|
| \(x\) | \(x_1\) | \(x_2\) | \(x_3\) | \(\dots\) | \(x_{n-1}\) | \(x_{n}\) |
| \(y\) | \(y_1\) | \(y_2\) | \(y_3\) | \(\dots\) | \(y_{n-1}\) | \(y_{n}\) |
Let
\[ S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2, \ \ S_{yy} = \sum_{i=1}^n (y_i-\bar{y})^2, \ \ \text{and} \ \ S_{xy} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \]
Let \(\hat{\beta}_0\) and \(\hat{\beta}_1\) be the least squares estimates of \(\beta_0\) and \(\beta_1\). They can be expressed as
\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \ \ \text{and} \ \ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\times \bar{x}. \]
Note that
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} = \left( \frac{S_{xy}}{S_{xx}}\right)\left( \frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}}\right) = \hat{\beta}_1\sqrt{\frac{S_{xx}}{S_{yy}}}. \]
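The least squares formulas and the identity linking \(r\) to \(\hat{\beta}_1\) can be verified numerically. The sketch below assumes only the Python standard library and uses made-up data; the variable names mirror \(S_{xx}\), \(S_{yy}\), and \(S_{xy}\).

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1_hat = s_xy / s_xx                  # estimated slope
beta0_hat = y_bar - beta1_hat * x_bar    # estimated intercept
r = s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

# The identity r = beta1_hat * sqrt(S_xx / S_yy) should hold up to rounding.
r_from_slope = beta1_hat * math.sqrt(s_xx / s_yy)

print(f"slope = {beta1_hat:.3f}, intercept = {beta0_hat:.3f}")
print(f"r = {r:.4f}, slope * sqrt(Sxx/Syy) = {r_from_slope:.4f}")
```

Because \(\sqrt{S_{xx}/S_{yy}}\) is always positive, the slope estimate and the correlation coefficient always share the same sign.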