Week 5: A Brief Introduction to Multiple Linear Regression with Applications
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Multiple Linear Regression Models with Categorical Predictors: ANOVA Regression
Lecture Note: |HTML| |PDF|
- Simple Linear Regression Model
- Model structure and interpretations of the model coefficients
- Assumptions and model diagnostics
- Linear Regression Model with a Single Multi-category Categorical Variable
- How regression models handle multi-category categorical variables
- Interpretation of coefficients of dummy variables: comparing each non-baseline category with the reference (baseline) category
- Linear regression approach to one-way ANOVA
- Understand the outputs of lm() and TukeyHSD()
- Multiple Linear Regression Model with Multiple Categorical Variables
- Model structure: main effect
- How regression coefficients are estimated from the original data table
- Interpretation of regression coefficients: Main effect and interaction effect
- Understand the statistics in the TukeyHSD() output
- Connections between regression coefficients and TukeyHSD outputs
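The lm()/TukeyHSD() connection above can be sketched as follows, using the built-in PlantGrowth data (an illustrative choice, not necessarily the course dataset):

```r
# One-way ANOVA as a regression, using the built-in PlantGrowth data
# (illustrative dataset). lm() dummy-codes `group` automatically,
# with the first level ("ctrl") as the baseline category.
fit.lm <- lm(weight ~ group, data = PlantGrowth)
summary(fit.lm)   # grouptrt1, grouptrt2 compare each treatment to ctrl

# TukeyHSD() requires an aov object; the fitted model is the same.
fit.aov <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit.aov) # all pairwise differences with family-wise 95% CIs
```

The dummy-variable coefficients only compare against the baseline; TukeyHSD() additionally compares the non-baseline categories with each other, with simultaneous confidence intervals.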
2. A Brief Introduction to General Linear Regression Models: Polynomial and ANCOVA Regression
Lecture Note: |HTML| |PDF|
- Overview of types of multiple linear regression models
- Polynomial Regression Models
- Model structure: Nonlinear relationship between response and predictor, but linear relationship among regression coefficients
- Multicollinearity issue in polynomial regression and remedy
- If a high-order term is significant, all lower degree terms must be retained in the model
- Analysis of Covariance (ANCOVA) Models
- Handling categorical variables in multiple linear regression model
- Understand and correctly interpret interaction terms in ANCOVA
- Understand the visual representation of interaction effect
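The polynomial and ANCOVA ideas above can be sketched with the built-in mtcars data (the variables hp and am are illustrative choices):

```r
# Polynomial regression: raw powers hp and hp^2 are highly correlated;
# poly() fits orthogonal polynomials as a remedy for multicollinearity.
fit.raw  <- lm(mpg ~ hp + I(hp^2), data = mtcars)
fit.orth <- lm(mpg ~ poly(hp, 2), data = mtcars)  # same fitted values,
                                                  # more stable estimates

# ANCOVA with an interaction: `*` expands to main effects plus the
# interaction, giving each transmission type (am) its own slope in hp.
fit.ancova <- lm(mpg ~ hp * factor(am), data = mtcars)
summary(fit.ancova)  # hp:factor(am)1 tests whether the slopes differ
```

The two polynomial parameterizations give identical predictions; only the coefficient scale changes, which is why the orthogonal version is preferred for inference on high-order terms.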
3. Final Exam: Sunday 3:30 PM - 5:30 PM, 08/03/2025
Important Note: The final exam is cumulative. Everyone is required to take it. Failure to do so will result in a failing grade for the course.
- Open: Sunday at 3:30 PM (starting time for everyone!)
- Close: Sunday at 5:30 PM (6:30 PM for students who have submitted accommodations through OEA)
- Guideline: |PDF|
Week 4: Principles of Experimental Designs and Analysis of Variance (ANOVA)
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Principles of Experimental Designs
Lecture Note: |HTML| |PDF|
- Rationale and Logical Process of Experimental Design
- Research questions and objectives: setting up hypotheses
- Identifying the response variable and factor variables
- Commonly used experimental designs
- Roles of randomization, blinding, and assignment of experimental units
- Sampling plans and data collection
- Simple random sampling
- Stratified sampling
- Systematic sampling
- Cluster sampling
- Completely Random Design and Randomized Block Design
- Completely random design (CRD)
- Randomized block design (RBD) with and without replicates
2. One-way Analysis of Variance (ANOVA)
Lecture Note: |HTML| |PDF|
- Objective and Logic of Analyzing CRD Data
- Measuring Discrepancy between the null and alternative hypotheses: sum of squares decomposition
- Basics of F-distribution
- R commands for F-distribution
- One-way ANOVA
- One-way ANOVA table structure
- Understanding the statistics for testing the null hypothesis
- Preparing data for ANOVA analysis using R
- R function aov() for one-way ANOVA testing
- Multiple Comparisons
- Simultaneous comparisons between factor levels (treatment levels)
- Understanding the family-wise significance level: avoiding inflation of the type I error rate
- Tukey's HSD and Bonferroni procedures
- Steps for implementing Tukey's HSD using TukeyHSD()
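The aov()/TukeyHSD() workflow above can be sketched with simulated data (the group means and sample sizes are illustrative):

```r
# One-way ANOVA for a CRD with three treatment levels (simulated data).
set.seed(1)
dat <- data.frame(
  yield = c(rnorm(5, 10), rnorm(5, 12), rnorm(5, 10)),
  trt   = factor(rep(c("A", "B", "C"), each = 5))
)

fit <- aov(yield ~ trt, data = dat)
summary(fit)  # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)

# F critical value and p-value from the F-distribution directly:
qf(0.95, df1 = 2, df2 = 12)   # critical value at alpha = 0.05
pf(summary(fit)[[1]]$`F value`[1], 2, 12, lower.tail = FALSE)

TukeyHSD(fit)  # simultaneous pairwise comparisons, family level 0.95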
3. Two-way Analysis of Variance (ANOVA)
Lecture Note: |HTML| |PDF|
- Types of Null Hypotheses in two-way ANOVA with replicated RBD data
- Testing main effects only: RBD with no replicates
- Testing interaction effects with replicated RBD
- Understanding the structure of two-way ANOVA
- Identify test statistics for testing appropriate hypotheses
- Implementing two-way ANOVA with R
- Multiple Comparison: Post-hoc Tests
- Tukey's HSD
- Main effect: comparing levels within that factor
- Interaction effect: comparing levels of one factor within levels of another
- Using R to implement Tukey's HSD
- Visual presentation of post-hoc comparisons
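The two-way workflow above can be sketched with the built-in ToothGrowth data (an illustrative dataset; dose must be converted to a factor first):

```r
# Two-way ANOVA with replicates, using the built-in ToothGrowth data.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)   # dose is numeric; ANOVA needs a factor

# A * B expands to both main effects plus the interaction A:B.
fit2 <- aov(len ~ supp * dose, data = tg)
summary(fit2)   # F tests for supp, dose, and supp:dose

# Post-hoc comparisons; `which` restricts output to terms of interest.
TukeyHSD(fit2, which = "dose")
```

If the supp:dose interaction is significant, comparisons of one factor should be interpreted within levels of the other, as in the bullet above.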
4. Weekly Exam #4
- Open: Friday at noon
- Close: Sunday at midnight
- Guideline: |PDF|
- Answer Key and Summary: |PDF|
Week 3: Parametric and Nonparametric One- and Two-Sample Tests
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. One-Sample Tests
Lecture Note: |HTML| |PDF|
- Data Structure
- R dataframe structure: row and column indices
- Access the data frame using row and column indices
- Selecting columns and rows
- Conditional selection: R function which()
- One-Sample t-test Revisited
- Basic assumptions
- Data are independently observed
- The population is normally distributed
- Population variance is unknown
- Manual calculation based on descriptive statistics using R commands
- Calling the R function t.test() from the base R package stats
- Linear Regression Approach to One-Sample t Test
- Review of linear regression assumptions
- Data are independently observed
- The population is normally distributed
- Population variance is unknown
- Regression with only intercept: $y = \beta_0 + \epsilon$
- General set-up in R: lm(y ~ 1, data = dataset.name)
- Regression set-up for one-sample t test in R: lm(I(y - mu_0) ~ 1, data = dataset.name)
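The equivalence above can be sketched on simulated data (mu_0 = 50 is an illustrative null value):

```r
# One-sample t test of H0: mu = 50, done two equivalent ways.
set.seed(2)
y   <- rnorm(30, mean = 52, sd = 4)
mu0 <- 50

t.out <- t.test(y, mu = mu0)

# Intercept-only regression on the shifted response: the intercept
# estimates mu - mu_0, and its t test reproduces the one-sample t test.
lm.out <- summary(lm(I(y - mu0) ~ 1))

t.out$statistic                    # t statistic from t.test()
lm.out$coefficients[1, "t value"]  # identical t statistic from lm()
```

The estimate, t statistic, degrees of freedom, and p-value from the intercept row match the t.test() output exactly.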
- Nonparametric One-Sample Test for Median
- Binomial distribution: R functions pbinom() and qbinom()
- Test procedure formulation
- Exact method: binomial critical value and p-value via the built-in function binom.test() and SIGN.test() (package BSDA)
- Normal approximation
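The exact sign test above can be sketched in base R (the data and the null median m0 = 10 are illustrative):

```r
# Sign test for the median, H0: median = m0.
x  <- c(8.2, 9.1, 9.5, 10.4, 11.2, 12.0, 12.5, 13.1, 7.9, 10.8)
m0 <- 10
s  <- sum(x > m0)   # number of observations above m0
n  <- sum(x != m0)  # observations equal to m0 are discarded as ties

# Exact method: under H0, s ~ Binomial(n, 0.5).
binom.test(s, n, p = 0.5)

# The same two-sided exact p-value assembled from pbinom() directly:
2 * min(pbinom(s, n, 0.5), pbinom(s - 1, n, 0.5, lower.tail = FALSE))
```

Because the null success probability is 0.5, the binomial distribution is symmetric and the two-sided p-value is twice the smaller tail probability.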
2. Two-Sample Tests
Lecture Note: |HTML| |PDF|
- Two-Sample t Test Revisited
- Assumptions: independent samples, normal populations, unknown but equal variances
- Manual calculation: pooled variance and test statistic (TS and its distribution, degrees of freedom)
- Built-in function: t.test()
- Linear Regression Approach to Two-Sample t Test
- Response variable and group variable must be either in a dataframe or defined separately.
- The group variable must be a factor: R function factor(group.variable)
- Model formula in R: lm(y ~ factor(x), data = dataset.name)
- Understand the output of the regression model: estimated difference of the two population means, TS, p-value, and degrees of freedom.
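The regression route above can be sketched on simulated data (group labels and means are illustrative):

```r
# Pooled two-sample t test and its regression equivalent.
set.seed(3)
dat <- data.frame(
  y     = c(rnorm(12, 20, 3), rnorm(12, 23, 3)),
  group = rep(c("control", "treated"), each = 12)
)

# Classical pooled t test (equal variances assumed):
t.out <- t.test(y ~ group, data = dat, var.equal = TRUE)

# The slope on the factor estimates the difference in group means;
# its t statistic, df, and p-value match the pooled t test (the sign
# flips because lm() reports treated minus the baseline, control).
lm.out <- summary(lm(y ~ factor(group), data = dat))
lm.out$coefficients["factor(group)treated", ]
```

Note that the equivalence requires var.equal = TRUE; the Welch default of t.test() uses a different degrees-of-freedom formula.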
- Non-parametric Two-Sample Test: Wilcoxon Rank-Sum Test
- Two-tailed Wilcoxon rank-sum test: What is the null hypothesis?
- Formulation of the test procedure and basic steps
- Exact approach: requires tabulation of critical values based on given significance levels
- Normal approximation approach
- Implementation in R: wilcox.test()
- Understand the output of wilcox.test()
- Paired Two-Sample Tests
- Paired t-test: assumptions and R function t.test() with paired = TRUE
- Nonparametric paired-sample test: Wilcoxon signed rank test via wilcox.test() with paired = TRUE
3. Weekly Exam #3 Information
- Open: Friday at noon
- Close: Sunday at midnight
- Guideline: |PDF|
- Answer Key and Summary: |PDF|
Week 2: Chi-square Tests for Goodness-of-fit and Independence
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Chi-square Distribution and Goodness-of-fit Test
Lecture Note: |HTML| |PDF|
- Chi-square Distribution
- Chi-square distribution density curve: skewed to the right
- Finding right-tail probabilities: related to the p-value of chi-square tests
- Finding percentiles: related to critical values
- Two R built-in functions
- right-tail probability: pchisq(x, df, lower.tail = FALSE)
- percentile: qchisq(p, df, lower.tail = TRUE)
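The two functions above can be sketched with df = 3 (an illustrative choice); the tail probability and the percentile are inverses of each other:

```r
# Right-tail probability (p-value) for a chi-square with df = 3:
pchisq(7.81, df = 3, lower.tail = FALSE)  # P(X > 7.81), approx 0.05

# Percentile (critical value): the 95th percentile with df = 3:
qchisq(0.95, df = 3, lower.tail = TRUE)   # approx 7.815
```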
- Chi-square Goodness-of-fit Test
- Setting up Hypothesis
- H0: The data follow the given distribution p = (p1, p2, ..., pk)
- Ha: The data do NOT follow the given distribution p = (p1, p2, ..., pk)
- Calculate expected frequencies under H0
- Find the total number of observations (i.e., sample size), denoted by n
- Expected frequency of the j-th cell: $E_j = n \times p_j$
- Test statistic: $G^2= \sum_{j=1}^k (O_j - E_j)^2/E_j \rightarrow \chi^2_{k-1}$
- Implementation in R: chisq.test(obs.freq, p)
- Observed Frequency: obs.freq = $(n_1, n_2, \cdots, n_k)$
- Hypothetical probability distribution: $p=(p_1, p_2, \cdots, p_k)$
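The steps above can be sketched end to end (the counts and hypothesized proportions are illustrative):

```r
# Goodness-of-fit test: do observed counts match hypothesized proportions?
obs.freq <- c(30, 45, 25)        # observed cell counts, n = 100
p        <- c(0.25, 0.50, 0.25)  # hypothesized distribution under H0

chisq.test(obs.freq, p = p)

# Hand check of each piece of the calculation:
n  <- sum(obs.freq)                       # sample size
E  <- n * p                               # expected frequencies E_j = n * p_j
G2 <- sum((obs.freq - E)^2 / E)           # test statistic
pchisq(G2, df = length(p) - 1, lower.tail = FALSE)  # p-value, df = k - 1
```

The manual G2 and p-value reproduce the chisq.test() output exactly, which is a useful check that the formula is understood.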
2. Chi-square Test of Independence
Lecture Note: |HTML| |PDF|
- Study Designs
- Retrospective cohort study design (looks at historical data)
- Prospective cohort study design (follow-up study)
- Cross-sectional study (looks at current data, a snapshot of observations)
- Measures of Association
- Absolute risk, relative risk, and attributable risk
- Odds ratio
- Risk measures vs study designs
- $\chi^2$ test of independence between two categorical variables
- Observed two-way contingency tables: I rows and J columns
- Null hypothesis: $H_0$ - two categorical variables are independent.
- Expected frequencies under $H_0$:
- Expected frequency of i-th row and j-th column: $E_{ij}=\text{i-th row total}\times \text{j-th column total}/\text{grand total}$
- Test statistic: $G^2 = \sum_{i=1}^I\sum_{j=1}^J (O_{ij}-E_{ij})^2/E_{ij} \rightarrow \chi^2_{(I-1)(J-1)}$
- Implementation in R: chisq.test()
- Using observed contingency table: chisq.test(obs.table)
- Using two categorical variables directly: chisq.test(x,y)
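The table-based call above can be sketched on a small 2 x 2 table (the counts are illustrative; correct = FALSE disables Yates' continuity correction so the statistic matches the $G^2$ formula):

```r
# Chi-square test of independence on a 2 x 2 contingency table.
obs.table <- matrix(c(20, 30,
                      40, 10), nrow = 2, byrow = TRUE,
                    dimnames = list(exposure = c("yes", "no"),
                                    outcome  = c("case", "control")))

chisq.test(obs.table, correct = FALSE)

# Expected frequencies E_ij = row total * column total / grand total:
chisq.test(obs.table, correct = FALSE)$expected
```

With I = J = 2 the statistic has (I-1)(J-1) = 1 degree of freedom; for tables with small expected counts, R warns and Fisher's exact test may be preferable.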
3. Weekly Exam #2 Information
- Open: Friday at noon
- Close: Sunday at midnight
- Guideline: |PDF|
- Answer Key and Summary |PDF|
Week 1: Computing Software and Review of Introductory Statistics
Zoom Office Hours: Tuesday/Wednesday/Thursday 1:30 PM - 3:00 PM
1. Course Information and Learning Advice
Note: |HTML| |PDF|
2. Getting started with R and RStudio
Lecture Note: |HTML| |PDF|
3. Basic statistics review
Lecture Note: |HTML| |PDF|
- Sampling distributions and Central Limit Theorem (CLT)
- Confidence intervals of sample means: normal and t distributions
- Testing hypothesis: logic, steps, p-value
4. Least square simple linear regression (SLR): inference and applications
Lecture Note: |HTML| |PDF|
- Structure and interpretation of coefficients, with an emphasis on the slope parameter
- Assumptions and model diagnostics (validation)
- R functions for creating various residual diagnostic plots
- Clear understanding of the R output and how to use the information to
- perform hypothesis testing on the slope
- construct confidence interval of the slope
- assess the goodness-of-fit using $R^2$
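The three inference tasks above can be sketched with the built-in cars data (an illustrative dataset):

```r
# Simple linear regression: inference on the slope and goodness of fit.
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

s$coefficients["speed", ]            # estimate, SE, t statistic, p-value
confint(fit, "speed", level = 0.95)  # 95% confidence interval for the slope
s$r.squared                          # R^2: proportion of variation explained

# Residual diagnostic plots (residuals vs fitted, normal Q-Q, etc.):
# plot(fit)   # commented out: opens interactive graphics devices
```

The t test in the coefficient row tests H0: slope = 0, and the confidence interval is the estimate plus or minus the t critical value times the standard error.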
5. Weekly Exam #1
- Information: |PDF|
- Answer Key and Summary of Exam #1: |PDF|