Epidemiologic Study Designs
In descriptive statistics, we introduced frequency tables and bar charts
to characterize the distribution of a categorical variable. We also briefly
introduced the chi-square test of independence for two categorical variables.
We will review chi-square and related tests and perform data analysis using R
in the subsequent Module 4.
This module is meant to help you gain a better understanding of the various
chi-square tests used to assess the association between two categorical
variables. For illustration, we focus on the case of two binary categorical
variables. Some of the concepts that follow can be generalized to
categorical variables with more than two categories.
Study Designs for Data Collection
The three study designs below are fundamental in epidemiology and
clinical research in which two categorical variables are involved:
an outcome variable (such as disease status) and
an exposure variable (such as smoking status). Each design has
distinct strengths, weaknesses, and applications. Below is a structured
comparison.
1. Cohort Study Design
A cohort study is a type of epidemiological study in which a group of
people with a common characteristic is followed
over time to find how many reach a certain health
outcome of interest (disease, condition, event, death,
or a change in health status or behavior).
Cohort studies compare an exposed group of
individuals to an unexposed (or less exposed) group of individuals to
determine whether the outcome of interest is associated with the
exposure. That is, cohort studies focus on the
relationship between the outcome and exposure variables.
Data collection in cohort studies is based on
stratified sampling. That is, the entire population is divided into
subpopulations according to the values of the outcome variable or the
exposure variable, and a random subsample is then taken from each
subpopulation.
- If the population is stratified by the outcome variable (for example,
lung cancer status), the two subsamples are taken from the cancer
population and the cancer-free population.
- If the population is stratified by the exposure variable (for example,
smoking status), the two subsamples are taken from the smoking
population and the non-smoking population.
The combined sample is called a stratified sample. There are two types
of cohort studies: prospective and
retrospective (or historical) cohorts.
Prospective studies follow a cohort into the
future for a health outcome. That is, the exposure variable splits
the population into an exposed population and an unexposed population. Two
subsamples are taken from the exposed and unexposed populations,
respectively, and then followed for some time to observe the
outcome.
Retrospective studies trace the cohort back
in time for exposure information
after the outcome has occurred. When the two subsamples are drawn by
outcome status (cases and disease-free controls) and past exposure is then
ascertained, the design is usually called a case-control study.
Caution:
2. Cross-sectional Study Design
A cross-sectional study design is a type of
observational research method that analyzes data from a population, or a
representative subset, at a single point in
time. It is commonly used in epidemiology, public
health, social sciences, and market research.
Most statistical models were developed for
cross-sectional data. For example, a researcher surveys 1,000
adults to assess the prevalence of hypertension and collects information
about age, BMI, and smoking status at the time of the
survey.
3. Randomized Controlled Trials
A Randomized Controlled Trial (RCT) is a
prospective experimental study design considered the
gold standard for evaluating the effectiveness of interventions (e.g.,
drugs, treatments, policies).
- Randomization: Participants are randomly
assigned to either an intervention group or a control group (e.g.,
placebo or standard treatment).
- Control: A comparison group is used to measure
the effect of the intervention.
- Blinding (optional but common): Reduces bias.
- Prospective: Follows participants forward in
time after assignment.
Example: A clinical trial randomly assigns 200
patients with high blood pressure to receive either a new
antihypertensive drug or a placebo. Blood pressure is monitored over 6
months to evaluate the drug’s effectiveness.
The following short YouTube video summarizes the above major study
designs.
Measures of Association
Contingency Tables
Contingency tables (also called cross-tabulations or
crosstabs) are fundamental tools in statistics for
analyzing relationships between categorical variables. They organize
data into rows and columns to display frequency distributions, enabling
researchers to identify patterns, test hypotheses, and measure
associations.
Structure of Contingency Tables
The general structure of a contingency table is depicted in the
following:
| | Outcome (Yes) | Outcome (No) | Total |
|---|---|---|---|
| Exposure (Yes) | a | b | a + b |
| Exposure (No) | c | d | c + d |
| Total | a + c | b + d | a + b + c + d |
We can see that a basic contingency table is an
\(r \times c\) matrix where:
+ Rows (\(r\)): Represent categories of one variable.
+ Columns (\(c\)): Represent categories of another variable.
+ Cells: Contain frequency counts for each variable combination.
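As a minimal sketch of how such a table arises from raw data, the Python snippet below tallies a handful of made-up (exposure, outcome) records into a 2-by-2 table and its margins. The records and the variable names are hypothetical and serve only to illustrate the row/column layout (rows for exposure, columns for outcome).

```python
from collections import Counter

# Hypothetical raw records: one (exposure, outcome) pair per subject.
records = [
    ("yes", "yes"), ("yes", "yes"), ("yes", "no"),
    ("no", "yes"), ("no", "no"), ("no", "no"), ("no", "no"),
]

exposure_levels = ["yes", "no"]  # row categories
outcome_levels = ["yes", "no"]   # column categories

counts = Counter(records)

# table[i][j] = number of subjects with exposure level i and outcome level j
table = [[counts[(e, o)] for o in outcome_levels] for e in exposure_levels]

row_totals = [sum(row) for row in table]        # a + b, c + d
col_totals = [sum(col) for col in zip(*table)]  # a + c, b + d
grand_total = sum(row_totals)                   # a + b + c + d

print(table)
print(row_totals, col_totals, grand_total)
```

A category pair that never occurs in the records simply gets a count of zero, because `Counter` returns 0 for missing keys.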
The above contingency table is essentially a two-way (or bivariate)
frequency table. As we did in introductory statistics
(MAT121/125 at WCU), we can turn the above raw (ordinary)
frequency table into the corresponding relative bivariate
frequency table of the following form, where T = a + b + c +
d.
| | Outcome (Yes) | Outcome (No) | Total |
|---|---|---|---|
| Exposure (Yes) | a/T | b/T | (a + b)/T |
| Exposure (No) | c/T | d/T | (c + d)/T |
| Total | (a + c)/T | (b + d)/T | T/T = 1 |
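The conversion from counts to relative frequencies can be sketched in Python as follows. The four counts are hypothetical and chosen only for illustration; every cell and every margin is divided by the grand total T.

```python
# Hypothetical 2x2 counts: rows = exposure (yes, no), columns = outcome (yes, no).
table = [[30, 70],
         [10, 90]]

T = sum(sum(row) for row in table)  # grand total a + b + c + d

# Divide every cell by T to obtain the joint relative frequencies.
rel_table = [[cell / T for cell in row] for row in table]

# Marginal relative frequencies are the row and column sums of rel_table.
row_margins = [sum(row) for row in rel_table]
col_margins = [sum(col) for col in zip(*rel_table)]

print(rel_table)
print(row_margins, col_margins)
```

The relative frequencies in the body of the table add up to 1, as do each set of margins.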
Important note on the layout of a
contingency table: Column names
MUST be the distinct values of the OUTCOME variable and row names MUST be the
distinct values of the EXPOSURE variable!
The following YouTube video (https://www.youtube.com/watch?v=W95BgQCp_rQ) explains
the above contingency table with an example.
Risk Measures of Association
For a single categorical variable, we focus on its distribution using
frequency tables and charts. In the case of two categorical variables,
we focus primarily on the association between them. The basic analytic
logic is to assess whether an association between the two categorical
variables exists and, if it does, to define numerical measures of
the strength of that association.
We have introduced different study designs for data collection.
Datasets collected under different study designs
contain different amounts of information. This means that
when analyzing a contingency table and defining measures of association,
we need to know the study design associated with the contingency
table.
Here are commonly used risk measures based on the following general
2-by-2 contingency table.
| | Disease (+) | Disease (-) | Total |
|---|---|---|---|
| Exposed (+) | a | b | a+b |
| Unexposed (-) | c | d | c+d |
| Total | a+c | b+d | N=a+b+c+d |
1. Absolute Risk Measures
- Risk in Exposed (Attack Rate) is the probability of disease in the
exposed group, defined by
\[
AR_\text{exp} = \frac{a}{a+b}.
\]
- Risk in Unexposed is the probability of disease in the unexposed group,
defined by
\[
AR_\text{unexp} = \frac{c}{c+d}.
\]
2. Relative Risk Measures
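The relative risk (RR) is the ratio of the risk in the exposed group to the
risk in the unexposed group:
\[
RR = \frac{AR_\text{exp}}{AR_\text{unexp}} = \frac{a/(a+b)}{c/(c+d)}.
\]
The odds ratio (OR) is the ratio of the odds of disease in the two groups:
\[
OR = \frac{a/b}{c/d} = \frac{ad}{bc},
\] where \(a/b\) is the odds of
disease in the exposed group and \(c/d\) is the odds of
disease in the unexposed group.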
Interpretation
- RR (OR) = 1: No association
- RR (OR) > 1: Increased risk with exposure
- RR (OR) < 1: Protective effect
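The risk measures above are simple arithmetic on the four cell counts. The following Python sketch computes them for a hypothetical table (the counts 30, 70, 10, 90 are made up), assuming the layout implied by the formulas: rows are exposed/unexposed and columns are disease/no disease. The function name `risk_measures` is illustrative only.

```python
def risk_measures(a, b, c, d):
    """Return (AR_exp, AR_unexp, RR, OR) for a 2x2 table laid out as
    rows = exposed / unexposed, columns = disease / no disease.
    Assumes b, c, and both row totals are nonzero."""
    ar_exp = a / (a + b)            # risk in the exposed group
    ar_unexp = c / (c + d)          # risk in the unexposed group
    rr = ar_exp / ar_unexp          # relative risk
    odds_ratio = (a * d) / (b * c)  # odds ratio ad / bc
    return ar_exp, ar_unexp, rr, odds_ratio

# Hypothetical counts, for illustration only.
ar_exp, ar_unexp, rr, odds_ratio = risk_measures(a=30, b=70, c=10, d=90)

print(f"AR_exp = {ar_exp:.2f}, AR_unexp = {ar_unexp:.2f}")
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")
```

With these made-up counts both RR and OR exceed 1, which would be read as an increased risk with exposure.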
When to use:
| Study Design | Measure | Formula |
|---|---|---|
| Case-Control | Odds Ratio (OR) | \(OR=ad/bc\) |
| Cohort | Relative Risk (RR) | \(RR=\frac{a/(a+b)}{c/(c+d)}\) |
| Cross-Sectional | Prevalence Ratio | \(\frac{\text{Prevalence}_1}{\text{Prevalence}_2}\) |
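One way to see why the odds ratio is the measure paired with case-control data is a small numerical experiment. The Python sketch below uses a hypothetical population table and mimics outcome-based sampling by keeping every diseased subject but only 1 in 100 disease-free subjects. The counts and the sampling fraction are assumptions made purely for illustration.

```python
def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# Hypothetical population counts: rows = exposed / unexposed,
# columns = disease / no disease.
a, b, c, d = 200, 9800, 100, 19900

# Outcome-based (case-control style) sampling: keep all cases (first column)
# and 1 in 100 of the disease-free subjects (second column).
a_s, c_s = a, c
b_s, d_s = b // 100, d // 100

rr_pop, rr_cc = relative_risk(a, b, c, d), relative_risk(a_s, b_s, c_s, d_s)
or_pop, or_cc = odds_ratio(a, b, c, d), odds_ratio(a_s, b_s, c_s, d_s)

print(f"RR: population {rr_pop:.3f}, sampled {rr_cc:.3f}")
print(f"OR: population {or_pop:.3f}, sampled {or_cc:.3f}")
```

Scaling a whole outcome column by a sampling fraction cancels in \(ad/bc\) but not in the row-based risks, so the sampled OR matches the population OR while the sampled RR does not.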
The next excellent YouTube video (https://www.youtube.com/watch?v=Sec4fewyUig) discusses
the two most commonly used measures of risk: relative risk (RR) and odds
ratio (OR).
Chi-square Test of Independence
Let \(X\) and \(Y\) be two categorical variables with \(k\) and \(m\) categories, respectively. The
relationship between \(X\) and \(Y\) is characterized by their joint
distribution (table). For simplicity, we use the following special
cases of binary categorical variables to explain the ideas behind statistical
testing of independence.
Independence of Two Categorical Variables
We use the following example to illustrate
independence and dependence between
two categorical variables.
Example 5. Joint
probabilities and contingency tables. Let \(X =\) political preference (Democrat vs
Republican) and \(Y =\) gender (Male and
Female). Assume their joint distribution is given by
the following contingency table.
include_graphics("week02/twoWayContingencyTable.png")

The cell numbers are joint probabilities. For example, \(p_{12} = 0.2 = 20\%\) says that \(20\%\) of the study population are
male Republicans. The row and column totals represent the percentages of
males/females and Democrats/Republicans in the study population. Any
observed data table is governed by the above joint
distribution table.
Definition Two categorical variables are
independent if and only if their joint probabilities
are equal to the product of their corresponding marginal
probabilities.
With this definition, we can see that \(X\) and \(Y\) with the joint distribution specified in
the above table (in Example 5) are NOT independent since \(p_{11} =0.3 \ne 0.5\times 0.5 = 0.25\).
Example 6. We
consider two variables \(X =\)
preference of hair color (Blonde and Brunette) and \(Y =\) gender (Male and Female). Assume the
joint distribution of the two variables is given by
include_graphics("week02/independenceContingencyTable.png")

Based on the definition of independence, the preference for hair
color is independent of gender, since all joint
probabilities are equal to the product of their corresponding marginal
probabilities:
\(0.45\times 0.40 = 0.18\), \(0.45 \times 0.60
= 0.27\), \(0.55 \times 0.40 = 0.22\), and \(0.55 \times 0.60 = 0.33.\)
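The definition of independence can be checked mechanically: compute the margins from the joint table and compare every cell with the product of its row and column margins. The Python sketch below does this for the four joint probabilities quoted in Example 6. Which variable sits on the rows and which on the columns is an assumption here, and the second table is a hypothetical dependent case whose first cell (0.3) and margins (0.5, 0.5) match the numbers quoted for Example 5.

```python
import math

def is_independent(joint, tol=1e-9):
    """True if every joint probability equals the product of its margins."""
    rows = [sum(r) for r in joint]        # row marginal probabilities
    cols = [sum(c) for c in zip(*joint)]  # column marginal probabilities
    return all(
        math.isclose(joint[i][j], rows[i] * cols[j], abs_tol=tol)
        for i in range(len(rows))
        for j in range(len(cols))
    )

# Joint probabilities quoted in Example 6 (axis assignment assumed).
example6 = [[0.18, 0.27],
            [0.22, 0.33]]

# Hypothetical dependent table: first cell 0.3, all margins 0.5.
dependent = [[0.3, 0.2],
             [0.2, 0.3]]

print(is_independent(example6))
print(is_independent(dependent))
```

A small tolerance is used because products such as 0.45 x 0.40 are not represented exactly in floating-point arithmetic.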
Expected Table Under Independence Assumption (\(H_0\))
We construct the expected table under the
null hypothesis of independence from the
observed contingency table. For ease of interpretation,
we use an example to illustrate the steps for obtaining the expected
table.
Example 7. Consider
the potential dependence between the attendance (good vs poor) and
course grade (pass vs fail). We take 50 students from a population and
obtain the following observed table.
include_graphics("week02/attendancePassFail.png")

Question: Is attendance independent of
class performance?
\[
H_0: \text{attendance is independent of class performance}
\]
\[\text{versus}\]
\[
H_a: \text{attendance is dependent on class performance}
\]
To obtain the expected table, we follow the next few steps.
- Estimate the marginal probabilities
include_graphics("week02/marginalTable.png")

where marginal probabilities are calculated by Pr(Good) =
27/50 = 0.54, Pr(Poor) = 23/50 = 0.46, Pr(Pass) = 33/50 = 0.66, Pr(Fail)
= 17/50 = 0.34.
- Estimate the joint probability under the null hypothesis of
independence
include_graphics("week02/jointProb.png")

where joint probabilities under the independence
assumption (\(H_0\)) are calculated by
taking the product of the corresponding marginal probabilities. For
example, 0.54 \(\times\) 0.66 =
0.3564.
- Calculate Expected Table
The expected frequencies are calculated in the following
table (with detailed steps).
include_graphics("week02/expectedTable.png")
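The three steps can be sketched in a few lines of Python. The observed counts below (25, 2, 8, 15) are the ones that appear in the test-statistic computation for this example and are consistent with the stated margins (27, 23, 33, 17); arranging them with attendance on the rows and grade on the columns is an assumption.

```python
# Observed counts: rows = attendance (good, poor), columns = grade (pass, fail).
observed = [[25, 2],
            [8, 15]]

n = sum(sum(row) for row in observed)              # grand total: 50
row_totals = [sum(row) for row in observed]        # 27, 23
col_totals = [sum(col) for col in zip(*observed)]  # 33, 17

# Step 1: estimate the marginal probabilities.
row_probs = [r / n for r in row_totals]  # Pr(Good), Pr(Poor)
col_probs = [c / n for c in col_totals]  # Pr(Pass), Pr(Fail)

# Step 2: joint probabilities under H0 are products of the margins.
joint_h0 = [[rp * cp for cp in col_probs] for rp in row_probs]

# Step 3: expected count = grand total * joint probability under H0.
expected = [[n * p for p in row] for row in joint_h0]

for row in expected:
    print([round(e, 2) for e in row])
```

The expected counts keep the same row and column totals as the observed table, which is a quick sanity check on the arithmetic.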

Remarks
For categorical variables with more than two
categories, the expected table can be found using the same 3
steps as those used in the above example.
The generic formula for calculating any expected frequency is given
by
\[
\text{Expected Frequency} = \frac{\text{column total} \times \text{row
total}}{\text{grand total}}
\]
- We use the above formula on the previous example to find the expected
frequency in the first column (Pass) and second row (Poor). The sample
size is 50 (the grand total).
\[
\frac{33\times 23}{50} = 15.18.
\]
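The generic formula is just the three steps collapsed into one. As a short derivation, write \(n\) for the grand total, \(R_i\) for the total of row \(i\), and \(C_j\) for the total of column \(j\) (this notation is introduced here only for the derivation). The expected frequency in cell \((i, j)\) is then

\[
E_{ij} = n \times \frac{R_i}{n} \times \frac{C_j}{n} = \frac{R_i \times C_j}{n},
\]

where \(R_i/n\) and \(C_j/n\) are the estimated marginal probabilities and their product is the estimated joint probability under \(H_0\).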
The following YouTube video (https://www.youtube.com/watch?v=S9XTAXn_qm4) gives
another example on how to calculate the expected frequency manually.
Formulation of Chi-square Test of Independence
The test statistic used to test the independence of two categorical
variables is the same as that used in the goodness-of-fit test. That is, it is
the standardized “distance” between the observed and the expected tables
(under \(H_0\)).
Assume that the two
categorical variables have \(k\) and
\(m\) categories, respectively. Then the
resulting test statistic has a chi-square distribution with \((k-1)\times(m-1)\) degrees of
freedom.
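Written out, this standardized distance is the usual Pearson chi-square statistic. With \(O_{ij}\) denoting the observed count and \(E_{ij}\) the expected count under \(H_0\) in row \(i\) and column \(j\) (notation introduced here for convenience), it takes the form

\[
TS = \sum_{i=1}^{k}\sum_{j=1}^{m} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.
\]

Each term measures how far an observed cell is from its expected value, scaled by the expected value, so large values of \(TS\) count as evidence against independence.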
Example 8.
[Continuation of Example 7]. Test whether
attendance and class performance are independent.
Solution: We found the expected table under
\(H_0\) in Example 7.
The observed and expected tables are shown together in the following.
include_graphics("week02/example08ExpObs.png")

The test statistic is given by
\[
TS = \frac{(25-17.82)^2}{17.82} + \frac{(2-9.18)^2}{9.18} +
\frac{(8-15.18)^2}{15.18} + \frac{(15-7.82)^2}{7.82} \approx 18.50
\]
The test statistic has a chi-square distribution with \((2-1)\times (2-1) = 1\) degree of freedom.
The critical value at the significance level of 0.05 is found in the
following figure.
include_graphics("week02/example08ChisqCV.png")

Since the test statistic is inside the rejection region, we reject
the null hypothesis that attendance and class performance are
independent.
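The whole test can be reproduced numerically. The Python sketch below recomputes the statistic from the observed counts using unrounded expected counts, and compares it with 3.841, the standard table value for the 0.95 quantile of the chi-square distribution with 1 degree of freedom. The closed-form p-value via `math.erfc` is valid only for 1 degree of freedom. The variable names are illustrative.

```python
import math

# Observed counts: rows = attendance (good, poor), columns = grade (pass, fail).
observed = [[25, 2],
            [8, 15]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
ts = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# For df = 1 only: P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(ts / 2))

critical_value = 3.841  # table value: 0.95 quantile of chi-square, 1 df
reject = ts > critical_value

print(round(ts, 2), df, reject)
print(p_value)
```

For tables with more than one degree of freedom the same statistic applies, but the critical value or p-value would have to come from a chi-square table or a statistical library.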
Example 9. Do some
college majors require more studying than others? The National Survey of
Student Engagement asked a number of college freshmen what their major
was and how many hours per week they spent studying, on average. A
sample of 1000 of these students was chosen, and the numbers of students
in each category are tabulated in the following two-way contingency
table.
include_graphics("week02/example09Data.png")

Solution: The null and alternative hypotheses are
given by
Ho: studying time is independent of major
versus
Ha: studying time is dependent on major.
Under the null hypothesis, we obtain the expected table below using the
same steps as in Example 7.
include_graphics("week02/example09ExpTable.png")

The test statistic is given by
include_graphics("week02/example09TS.png")

The critical value and rejection region based on a significance level of
0.05 are given by
include_graphics("week02/example09CV.png")

Conclusion: Since the test statistic is inside the
rejection region, we reject the null hypothesis and conclude that the
studying time is dependent on the majors.
To conclude this section, watch the following YouTube video (https://www.youtube.com/watch?v=y5nxiL6civU) for another
manually worked-out example of the chi-square test of independence.
Practice Exercises
- Political Affiliation and Opinion
The following table based on the sample will be used to explore the
relationship between Party Affiliation and Opinion on Tax Reform.
include_graphics("week02/practiceEx02Data.png")

Find the expected counts for all of the cells.
- Tire Quality
The operations manager of a company that manufactures tires wants to
determine whether there are any differences in the quality of work among
the three daily shifts. She randomly selects 496 tires and carefully
inspects them. Each tire is either classified as perfect, satisfactory,
or defective, and the shift that produced it is also recorded. The two
categorical variables of interest are the shift and condition of the
tire produced. The data can be summarized by the accompanying two-way
table. Does the data provide sufficient evidence at the 5% significance
level to infer that there are differences in quality among the three
shifts?
include_graphics("week02/practiceEx03Data.png")

- Condiment preference and gender
A food services manager for a baseball park wants to know if there is
a relationship between gender (male or female) and the preferred
condiment on a hot dog. The following table summarizes the results. Test
the hypothesis with a significance level of 10%.
include_graphics("week02/practiceEx04Data.png")

