Week 5: A Brief Introduction to Multiple Linear Regression with Applications
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Multiple Linear Regression Models with Categorical Predictors: ANOVA Regression
Lecture Note: |HTML| |PDF|
- Simple Linear Regression Model
- Model structure and interpretations of the model coefficients
- Assumptions and model diagnostics
- Linear Regression Model with a Single Multi-category Categorical Variable
- How regression models handle multi-category categorical variables
- Interpretation of coefficients of dummy variables: comparing each non-baseline category with the reference (baseline) category
- Linear regression approach to one-way ANOVA
- Understand the outputs of lm() and TukeyHSD()
- Multiple Linear Regression Model with Multiple Categorical Variables
- Model structure: main effect
- How regression coefficients are estimated from the original data table
- Interpretation of regression coefficients: Main effect and interaction effect
- Understand the statistics in the TukeyHSD() output
- Connections between regression coefficients and TukeyHSD outputs
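The lm()/TukeyHSD() connection above can be sketched as follows, using the built-in PlantGrowth data (an illustrative choice, not necessarily the course dataset):

```r
# One-way ANOVA as a regression, using the built-in PlantGrowth data
# (illustrative dataset). lm() dummy-codes `group` automatically,
# with the first level ("ctrl") as the baseline category.
fit.lm <- lm(weight ~ group, data = PlantGrowth)
summary(fit.lm)   # grouptrt1, grouptrt2 compare each treatment to ctrl

# TukeyHSD() requires an aov object; the fitted model is the same.
fit.aov <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit.aov) # all pairwise differences with family-wise 95% CIs
```

The dummy-variable coefficients only compare against the baseline; TukeyHSD() additionally compares the non-baseline categories with each other, with simultaneous confidence intervals.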
2. A Brief Introduction to General Linear Regression Models: Polynomial and ANCOVA Regression
Lecture Note: |HTML| |PDF|
- Overview of types of multiple linear regression models
- Polynomial Regression Models
- Model structure: Nonlinear relationship between response and predictor, but linear relationship among regression coefficients
- Multicollinearity issue in polynomial regression and remedy
- If a high-order term is significant, all lower degree terms must be retained in the model
- Analysis of Covariance (ANCOVA) Models
- Handling categorical variables in multiple linear regression model
- Understand and correctly interpret interaction terms in ANCOVA
- Understand the visual representation of interaction effect
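The polynomial and ANCOVA ideas above can be sketched with the built-in mtcars data (the variables hp and am are illustrative choices):

```r
# Polynomial regression: raw powers hp and hp^2 are highly correlated;
# poly() fits orthogonal polynomials as a remedy for multicollinearity.
fit.raw  <- lm(mpg ~ hp + I(hp^2), data = mtcars)
fit.orth <- lm(mpg ~ poly(hp, 2), data = mtcars)  # same fitted values,
                                                  # more stable estimates

# ANCOVA with an interaction: `*` expands to main effects plus the
# interaction, giving each transmission type (am) its own slope in hp.
fit.ancova <- lm(mpg ~ hp * factor(am), data = mtcars)
summary(fit.ancova)  # hp:factor(am)1 tests whether the slopes differ
```

The two polynomial parameterizations give identical predictions; only the coefficient scale changes, which is why the orthogonal version is preferred for inference on high-order terms.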
3. Final Exam: Sunday 3:30 PM - 5:30 PM, 08/03/2025
Important Note: The final exam is cumulative. Everyone is required to take it. Failure to do so will result in a failing grade for the course.
- Open: Sunday at 3:30 PM (starting time for everyone!)
- Close: Sunday at 5:30 PM (6:30 PM for students who have submitted accommodations through OEA)
- Guideline: |PDF|
Week 4: Principles of Experimental Designs and Analysis of Variance (ANOVA)
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Principles of Experimental Designs
Lecture Note: |HTML| |PDF|
- Rationale and Logical Process of Experimental Design
- Research questions and objectives: setting up hypotheses
- Identifying the response variable and factor variables
- Commonly used experimental designs
- Roles of randomization, blinding, and assignment of experimental units
- Sampling plans and data collection
- Simple random sampling
- Stratified sampling
- Systematic sampling
- Cluster sampling
- Completely Random Design and Randomized Block Design
- Completely random design (CRD)
- Randomized block design (RBD) with and without replicates
2. One-way Analysis of Variance (ANOVA)
Lecture Note: |HTML| |PDF|
- Objective and Logic of Analyzing CRD Data
- Measuring Discrepancy between the null and alternative hypotheses: sum of squares decomposition
- Basics of F-distribution
- R commands for F-distribution
- One-way ANOVA
- One-way ANOVA table structure
- Understanding the statistics for testing the null hypothesis
- Preparing data for ANOVA analysis using R
- R function aov() for one-way ANOVA testing
- Multiple Comparisons
- Simultaneous comparisons between factor levels (treatment levels)
- Understanding the family-wise significance level: avoiding inflation of the type I error rate
- Tukey's HSD and Bonferroni procedures
- Steps for implementing Tukey's HSD using TukeyHSD()
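The aov()/TukeyHSD() workflow above can be sketched with simulated data (the group means and sample sizes are illustrative):

```r
# One-way ANOVA for a CRD with three treatment levels (simulated data).
set.seed(1)
dat <- data.frame(
  yield = c(rnorm(5, 10), rnorm(5, 12), rnorm(5, 10)),
  trt   = factor(rep(c("A", "B", "C"), each = 5))
)

fit <- aov(yield ~ trt, data = dat)
summary(fit)  # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)

# F critical value and p-value from the F-distribution directly:
qf(0.95, df1 = 2, df2 = 12)   # critical value at alpha = 0.05
pf(summary(fit)[[1]]$`F value`[1], 2, 12, lower.tail = FALSE)

TukeyHSD(fit)  # simultaneous pairwise comparisons, family level 0.95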
3. Two-way Analysis of Variance (ANOVA)
Lecture Note: |HTML| |PDF|
- Types of Null Hypotheses in two-way ANOVA with replicated RBD data
- Testing main effects only: RBD with no replicates
- Testing interaction effects with replicated RBD
- Understanding the structure of two-way ANOVA
- Identify test statistics for testing appropriate hypotheses
- Implementing two-way ANOVA with R
- Multiple Comparison: Post-hoc Tests
- Tukey's HSD
- Main effect: comparing levels within that factor
- Interaction effect: comparing levels of one factor within levels of another
- Using R to implement Tukey's HSD
- Visual presentation of post-hoc comparisons
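The two-way workflow above can be sketched with the built-in ToothGrowth data (an illustrative dataset; dose must be converted to a factor first):

```r
# Two-way ANOVA with replicates, using the built-in ToothGrowth data.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)   # dose is numeric; ANOVA needs a factor

# A * B expands to both main effects plus the interaction A:B.
fit2 <- aov(len ~ supp * dose, data = tg)
summary(fit2)   # F tests for supp, dose, and supp:dose

# Post-hoc comparisons; `which` restricts output to terms of interest.
TukeyHSD(fit2, which = "dose")
```

If the supp:dose interaction is significant, comparisons of one factor should be interpreted within levels of the other, as in the bullet above.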
4. Weekly Exam #4
- Open: Friday at noon
- Close: Sunday at midnight
- Guideline: |PDF|
- Answer Key and Summary: |PDF|
Week 3: Parametric and Nonparametric One- and Two-Sample Tests
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. One-Sample Tests
Lecture Note: |HTML| |PDF|
- Data Structure
- R dataframe structure: row and column indices
- Access the data frame using row and column indices
- Selecting columns and rows
- Conditional selection: R function which()
- One-Sample t-test Revisited
- Basic assumptions
- Data are independently observed
- The population is normally distributed
- Population variance is unknown
- Manual calculation based on descriptive statistics using R commands
- Calling the R function t.test() from the base R package stats
- Linear Regression Approach to One-Sample t Test
- Review of linear regression assumptions
- Data are independently observed
- The population is normally distributed
- Population variance is unknown
- Regression with only intercept: $y = \beta_0 + \epsilon$
- General set-up in R: lm(y ~ 1, data = dataset.name)
- Regression set-up for one-sample t test in R: lm(I(y - mu_0) ~ 1, data = dataset.name)
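The equivalence above can be sketched on simulated data (mu_0 = 50 is an illustrative null value):

```r
# One-sample t test of H0: mu = 50, done two equivalent ways.
set.seed(2)
y   <- rnorm(30, mean = 52, sd = 4)
mu0 <- 50

t.out <- t.test(y, mu = mu0)

# Intercept-only regression on the shifted response: the intercept
# estimates mu - mu_0, and its t test reproduces the one-sample t test.
lm.out <- summary(lm(I(y - mu0) ~ 1))

t.out$statistic                    # t statistic from t.test()
lm.out$coefficients[1, "t value"]  # identical t statistic from lm()
```

The estimate, t statistic, degrees of freedom, and p-value from the intercept row match the t.test() output exactly.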
- Nonparametric One-Sample Test for Median
- Binomial distribution: R functions pbinom() and qbinom()
- Test procedure formulation
- Exact method: binomial critical value and p-value via the built-in function binom.test() and SIGN.test() (package BSDA)
- Normal approximation
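The exact sign test above can be sketched in base R (the data and the null median m0 = 10 are illustrative):

```r
# Sign test for the median, H0: median = m0.
x  <- c(8.2, 9.1, 9.5, 10.4, 11.2, 12.0, 12.5, 13.1, 7.9, 10.8)
m0 <- 10
s  <- sum(x > m0)   # number of observations above m0
n  <- sum(x != m0)  # observations equal to m0 are discarded as ties

# Exact method: under H0, s ~ Binomial(n, 0.5).
binom.test(s, n, p = 0.5)

# The same two-sided exact p-value assembled from pbinom() directly:
2 * min(pbinom(s, n, 0.5), pbinom(s - 1, n, 0.5, lower.tail = FALSE))
```

Because the null success probability is 0.5, the binomial distribution is symmetric and the two-sided p-value is twice the smaller tail probability.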
2. Two-Sample Tests
Lecture Note: |HTML| |PDF|
- Two-Sample t Test Revisited
- Assumptions: independent samples, normal populations, unknown but equal variances
- Manual calculation: pooled variance and test statistic (TS and its distribution, degrees of freedom)
- Built-in function: t.test()
- Linear Regression Approach to Two-Sample t Test
- Response variable and group variable must be either in a dataframe or defined separately.
- The group variable must be a factor: R function factor(group.variable)
- Model formula in R: lm(y ~ factor(x), data = dataset.name)
- Understand the output of the regression model: estimated difference of the two population means, TS, p-value, and degrees of freedom.
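The regression route above can be sketched on simulated data (group labels and means are illustrative):

```r
# Pooled two-sample t test and its regression equivalent.
set.seed(3)
dat <- data.frame(
  y     = c(rnorm(12, 20, 3), rnorm(12, 23, 3)),
  group = rep(c("control", "treated"), each = 12)
)

# Classical pooled t test (equal variances assumed):
t.out <- t.test(y ~ group, data = dat, var.equal = TRUE)

# The slope on the factor estimates the difference in group means;
# its t statistic, df, and p-value match the pooled t test (the sign
# flips because lm() reports treated minus the baseline, control).
lm.out <- summary(lm(y ~ factor(group), data = dat))
lm.out$coefficients["factor(group)treated", ]
```

Note that the equivalence requires var.equal = TRUE; the Welch default of t.test() uses a different degrees-of-freedom formula.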
- Non-parametric Two-Sample Test: Wilcoxon Rank-Sum Test
- Two-tailed Wilcoxon rank-sum test: What is the null hypothesis?
- Formulation of the test procedure and basic steps
- Exact approach: requires tabulation of critical values based on given significance levels
- Normal approximation approach
- Implementation in R: wilcox.test()
- Understand the output of wilcox.test()
- Paired Two-Sample Tests
- Paired t-test: assumptions and R function t.test() with paired = TRUE
- Nonparametric paired-sample test: Wilcoxon signed rank test via wilcox.test() with paired = TRUE
3. Weekly Exam #3 Information
- Open: Friday at noon
- Close: Sunday at midnight
- Guideline: |PDF|
- Answer Key and Summary: |PDF|
Week 2: Chi-square Tests for Goodness-of-fit and Independence
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Chi-square Distribution and Goodness-of-fit Test
Lecture Note: |HTML| |PDF|
- Chi-square Distribution
- Chi-square distribution density curve: skewed to the right
- Finding right-tail probabilities: related to the p-value of chi-square tests
- Finding percentiles: related to critical values
- Two R built-in functions
- right-tail probability: pchisq(x, df, lower.tail = FALSE)
- percentile: qchisq(p, df, lower.tail = TRUE)
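The two functions above can be sketched with df = 3 (an illustrative choice); the tail probability and the percentile are inverses of each other:

```r
# Right-tail probability (p-value) for a chi-square with df = 3:
pchisq(7.81, df = 3, lower.tail = FALSE)  # P(X > 7.81), approx 0.05

# Percentile (critical value): the 95th percentile with df = 3:
qchisq(0.95, df = 3, lower.tail = TRUE)   # approx 7.815
```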
- Chi-square Goodness-of-fit Test
- Setting up Hypothesis
- H0: The data follow the given distribution p = (p1, p2, ..., pk)
- Ha: The data do NOT follow the given distribution p = (p1, p2, ..., pk)
- Calculate expected frequencies under H0
- Find the total number of observations (i.e., sample size), denoted by n
- Expected frequency of the j-th cell: $E_j = n \times p_j$
- Test statistic: $G^2= \sum_{j=1}^k (O_j - E_j)^2/E_j \rightarrow \chi^2_{k-1}$
- Implementation in R: chisq.test(obs.freq, p)
- Observed Frequency: obs.freq = $(n_1, n_2, \cdots, n_k)$
- Hypothetical probability distribution: $p=(p_1, p_2, \cdots, p_k)$
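The steps above can be sketched end to end (the counts and hypothesized proportions are illustrative):

```r
# Goodness-of-fit test: do observed counts match hypothesized proportions?
obs.freq <- c(30, 45, 25)        # observed cell counts, n = 100
p        <- c(0.25, 0.50, 0.25)  # hypothesized distribution under H0

chisq.test(obs.freq, p = p)

# Hand check of each piece of the calculation:
n  <- sum(obs.freq)                       # sample size
E  <- n * p                               # expected frequencies E_j = n * p_j
G2 <- sum((obs.freq - E)^2 / E)           # test statistic
pchisq(G2, df = length(p) - 1, lower.tail = FALSE)  # p-value, df = k - 1
```

The manual G2 and p-value reproduce the chisq.test() output exactly, which is a useful check that the formula is understood.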
2. Chi-square Test of Independence
Lecture Note: |HTML| |PDF|
- Study Designs
- Retrospective cohort study design (looks at historical data)
- Prospective cohort study design (follow-up study)
- Cross-sectional study (looks at current data, a snapshot of observations)
- Measures of Association
- Absolute risk, relative risk, and attributable risk
- Odds ratio
- Risk measures vs study designs
- $\chi^2$ test of independence between two categorical variables
- Observed two-way contingency tables: I rows and J columns
- Null hypothesis: $H_0$ - two categorical variables are independent.
- Expected frequencies under $H_0$:
- Expected frequency of i-th row and j-th column: $E_{ij}=\text{i-th row total}\times \text{j-th column total}/\text{grand total}$
- Test statistic: $G^2 = \sum_{i=1}^I\sum_{j=1}^J (O_{ij}-E_{ij})^2/E_{ij} \rightarrow \chi^2_{(I-1)(J-1)}$
- Implementation in R: chisq.test()
- Using observed contingency table: chisq.test(obs.table)
- Using two categorical variables directly: chisq.test(x,y)
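The table-based call above can be sketched on a small 2 x 2 table (the counts are illustrative; correct = FALSE disables Yates' continuity correction so the statistic matches the $G^2$ formula):

```r
# Chi-square test of independence on a 2 x 2 contingency table.
obs.table <- matrix(c(20, 30,
                      40, 10), nrow = 2, byrow = TRUE,
                    dimnames = list(exposure = c("yes", "no"),
                                    outcome  = c("case", "control")))

chisq.test(obs.table, correct = FALSE)

# Expected frequencies E_ij = row total * column total / grand total:
chisq.test(obs.table, correct = FALSE)$expected
```

With I = J = 2 the statistic has (I-1)(J-1) = 1 degree of freedom; for tables with small expected counts, R warns and Fisher's exact test may be preferable.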
3. Weekly Exam #2 Information
- Open: Friday at noon
- Close: Sunday at midnight
- Guideline: |PDF|
- Answer Key and Summary |PDF|
Week 1: Computing Software and Review of Introductory Statistics
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Course Information and Learning Advice
Note: |HTML| |PDF|
2. Getting started with R and RStudio
Lecture Note: |HTML| |PDF|
3. Basic statistics review
Lecture Note: |HTML| |PDF|
- Sampling distributions and Central Limit Theorem (CLT)
- Confidence intervals of sample means: normal and t distributions
- Testing hypothesis: logic, steps, p-value
4. Least square simple linear regression (SLR): inference and applications
Lecture Note: |HTML| |PDF|
- Structure and interpretation of coefficients, with an emphasis on the slope parameter
- Assumptions and model diagnostics (validation)
- R functions for creating various residual diagnostic plots
- Clear understanding of the R output and how to use the information to
- perform hypothesis testing on the slope
- construct confidence interval of the slope
- assess the goodness-of-fit using $R^2$
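The three inference tasks above can be sketched with the built-in cars data (an illustrative dataset):

```r
# Simple linear regression: inference on the slope and goodness of fit.
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

s$coefficients["speed", ]            # estimate, SE, t statistic, p-value
confint(fit, "speed", level = 0.95)  # 95% confidence interval for the slope
s$r.squared                          # R^2: proportion of variation explained

# Residual diagnostic plots (residuals vs fitted, normal Q-Q, etc.):
# plot(fit)   # commented out: opens interactive graphics devices
```

The t test in the coefficient row tests H0: slope = 0, and the confidence interval is the estimate plus or minus the t critical value times the standard error.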
5. Weekly Exam #1
- Information: |PDF|
- Answer Key and Summary of Exam #1: |PDF|