Introduction
Regardless of the research design, statistics are a crucial component
of research since it allows the researchers to summarize the collected
data and give it to others for interpretation.
We need a defined analytic plan before we start collecting data. The
SAP (statistical analysis plan) will direct us from the beginning to the
conclusion, help us summarize and describe the data, and test our
hypotheses.
The statistical analysis plan (SAP) describes the intended clinical
trial analysis. The SAP is a technical document that describes the
statistical methods of research analysis, as opposed to the protocol,
which represents the analysis.
What is a statistical
analysis plan?
Statistical Analysis Plan (SAP) is a detailed
document specifying how data will be analyzed, ensuring transparency,
reproducibility, and minimization of bias.
- Data Collection
- Primary Data: Clearly define the main outcome
variables (e.g., clinical endpoints, survey responses).
- Secondary Data: Specify additional variables (e.g.,
covariates, confounders) and their sources.
- Data Handling: Procedures for missing data,
outliers, and transformations (e.g., log-transformation for skewed
data).
- Methods of Analysis
- Descriptive Statistics: Summarize data (means,
proportions, SDs, visualizations).
- Inferential Statistics: Hypothesis testing,
confidence intervals, effect sizes.
- Software/Tools: Specify (e.g., R, SAS, SPSS) and
versions.
- Primary Analysis
- Define the primary outcome(s) and statistical test(s) (e.g., t-test,
ANOVA, regression).
- State how the primary hypothesis will be tested (e.g., superiority,
non-inferiority).
- Comparisons & Significance Levels
- Predefined Comparisons: Subgroup analyses, pairwise comparisons
(adjust for multiple testing if needed).
- Alpha (Significance Level): Typically 0.05, with justification if
adjusted.
- Power Analysis: Sample size justification based on
expected effect size.
- Exploratory Data Analyses (EDA)
- Unplanned analyses to identify patterns (e.g., trends,
interactions).
- Clarify that results are hypothesis-generating (not
confirmatory).
- Statistical Models
- Primary Model: Specify (e.g., linear regression,
logistic regression, Cox model).
- Response Variable: Clearly defined (e.g., blood
pressure, survival time).
- Predictors/Covariates: List included variables and
rationale.
- Alternative Models: Robustness checks (e.g.,
sensitivity analyses, different covariate adjustments).
Identifying the need
for an SAP
While a study protocol outlines the general research methodology, an
SAP provides a deeper, technical specification of statistical
procedures. Here’s when an SAP becomes essential:
- High-Risk or Regulated Studies
- Clinical trials (especially Phase II/III) requiring
regulatory approval (FDA, EMA).
- Studies with major clinical/public health
implications (e.g., drug efficacy, policy decisions).
- Pre-specified analyses to prevent bias (e.g.,
avoiding selective reporting).
- Complex Statistical Methods
- Advanced modeling (e.g., mixed-effects models,
survival analysis, machine learning).
- Adaptive trial designs (interim analyses, Bayesian
methods).
- Handling missing data (multiple imputation, inverse
probability weighting).
- Large or Multi-Center Studies
- Ensures consistency across sites/analysts.
- Prevents post-hoc decisions that could introduce bias.
- Reproducibility & Transparency
- Needed for peer-reviewed journals (e.g., ICMJE, CONSORT
requirements).
- Allows an independent statistician to replicate analyses without
ambiguity.
Study Protocol vs
SAP
The study protocol and SAP are complementary documents that guide
different aspects of a research study. While they overlap in some areas,
they serve distinct purposes and audiences.
The following table gives some comparisons of the two formal
documents:
Types |
Study Protocol |
Statistical Analysis Plan (SAP) |
Purposes |
Describes the overall study design, objectives, and methodology.
Ensures ethical and scientific validity (used for approvals by IRBs, regulators).
Guides the conduct of the study (e.g., recruitment, interventions, data |
Provides detailed, technical instructions for statistical analysis.
Ensures reproducibility, transparency, and minimization of bias.
Acts as a binding pre-specification to prevent data-driven decisions (e.g., p-hacking). |
Contents |
Research question & hypotheses
Study population & eligibility criteria
Study design (randomized trial, cohort, case-control, etc.)
Data collection procedures
General statistical approach (but not highly technical details) |
Exact statistical models (e.g., regression formulas, survival analysis methods)
Handling of missing data, outliers, and covariates
Primary/secondary endpoint analysis (including multiplicity adjustments)
Sensitivity & subgroup analyses
Software & code specifications (if applicable) |
Audience |
Investigators, ethics committees, funding agencies |
Statisticians, data analysts, regulatory reviewers, peer reviewers |
The protocol and SAP are interdependent: The protocol sets the rules;
the SAP enforces them statistically. Here are some examples:
- SAP Expands on the Protocol
- The protocol states: “We will compare Group A and
Group B using regression analysis.”
- The SAP specifies: “A Cox proportional hazards
model will be used with covariates X, Y, Z. Hazard ratios will be
reported with 95% CIs. Missing data will be handled via multiple
imputation.”
- SAP is More Rigid (Pre-Specified)
- The protocol may allow minor methodological flexibility.
- The SAP locks in analysis details before data unblinding to prevent
bias.
- SAP is Often a Separate Document
- For simple studies, statistical methods may be fully described in
the protocol.
- For complex/high-stakes studies, a standalone SAP is required (e.g.,
clinical trials for FDA submission).
---
title: 'A Brief Description of Basic Components of A SAP'
author: "Cheng Peng"
date: " "
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("factoextra")) {
   install.packages("factoextra")
   library(factoextra)
}
if (!require("TSstudio")) {
   install.packages("TSstudio")
   library(TSstudio)
}
if (!require("plotrix")) {
   install.packages("plotrix")
library(plotrix)
}
if (!require("ggridges")) {
   install.packages("ggridges")
library(ggridges)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("GGally")) {
   install.packages("GGally")
library(GGally)
}
knitr::opts_chunk$set(echo = TRUE,   
                      warning = FALSE,  
                      result = TRUE,    
                      message = FALSE,
                      comment = NA
                      )  
```

\

# Introduction

Regardless of the research design, statistics are a crucial component of research since it allows the researchers to summarize the collected data and give it to others for interpretation. 

We need a defined analytic plan before we start collecting data. The SAP (statistical analysis plan) will direct us from the beginning to the conclusion, help us summarize and describe the data, and test our hypotheses.


The statistical analysis plan (SAP) describes the intended clinical trial analysis. The SAP is a technical document that describes the statistical methods of research analysis, as opposed to the protocol, which represents the analysis.

# What is a statistical analysis plan?

**Statistical Analysis Plan (SAP)** is a detailed document specifying how data will be analyzed, ensuring transparency, reproducibility, and minimization of bias.

* **Data Collection**
  + **Primary Data**: Clearly define the main outcome variables (e.g., clinical endpoints, survey responses).
  + **Secondary Data**: Specify additional variables (e.g., covariates, confounders) and their sources.
  + **Data Handling**: Procedures for missing data, outliers, and transformations (e.g., log-transformation for skewed data).

* **Methods of Analysis**
  + **Descriptive Statistics**: Summarize data (means, proportions, SDs, visualizations).
  + **Inferential Statistics**: Hypothesis testing, confidence intervals, effect sizes.
  + **Software/Tools**: Specify (e.g., R, SAS, SPSS) and versions.

* **Primary Analysis**
  + Define the primary outcome(s) and statistical test(s) (e.g., t-test, ANOVA, regression).
  + State how the primary hypothesis will be tested (e.g., superiority, non-inferiority).

* **Comparisons & Significance Levels**
  + Predefined Comparisons: Subgroup analyses, pairwise comparisons (adjust for multiple testing if needed).
  + Alpha (Significance Level): Typically 0.05, with justification if adjusted.
  + **Power Analysis**: Sample size justification based on expected effect size.

* **Exploratory Data Analyses (EDA)**
  + Unplanned analyses to identify patterns (e.g., trends, interactions).
  + Clarify that results are hypothesis-generating (not confirmatory).

* **Statistical Models**
  + **Primary Model**: Specify (e.g., linear regression, logistic regression, Cox model).
  + **Response Variable**: Clearly defined (e.g., blood pressure, survival time).
  + **Predictors/Covariates**: List included variables and rationale.
  + **Alternative Models**: Robustness checks (e.g., sensitivity analyses, different covariate adjustments).



# Identifying the need for an SAP

While a study protocol outlines the general research methodology, an SAP provides a deeper, technical specification of statistical procedures. Here’s when an SAP becomes essential:

* **High-Risk or Regulated Studies**
  + **Clinical trials** (especially Phase II/III) requiring regulatory approval (FDA, EMA).
  + **Studies with major clinical/public health implications** (e.g., drug efficacy, policy decisions).
  + **Pre-specified analyses to prevent bias** (e.g., avoiding selective reporting).

* **Complex Statistical Methods**
  + **Advanced modeling** (e.g., mixed-effects models, survival analysis, machine learning).
  + **Adaptive trial designs** (interim analyses, Bayesian methods).
  + **Handling missing data** (multiple imputation, inverse probability weighting).

* **Large or Multi-Center Studies**
  + Ensures consistency across sites/analysts.
  + Prevents post-hoc decisions that could introduce bias.

* **Reproducibility & Transparency**
  + Needed for peer-reviewed journals (e.g., ICMJE, CONSORT requirements).
  + Allows an independent statistician to replicate analyses without ambiguity.



# Study Protocol vs SAP

The study protocol and SAP are complementary documents that guide different aspects of a research study. While they overlap in some areas, they serve distinct purposes and audiences.

The following table gives some comparisons of the two formal documents:

```{=html}
<table border="1">
  <tr>
    <th> Types </th>
    <th>Study Protocol</th>
    <th>Statistical Analysis Plan (SAP)</th>
  </tr>
  <tr>
    <td>Purposes</td>
    <td> <li> Describes the overall study design, objectives, and methodology.
         <li> Ensures ethical and scientific validity (used for approvals by IRBs, regulators).
         <li> Guides the conduct of the study (e.g., recruitment, interventions, data</td>
    <td> <li> Provides detailed, technical instructions for statistical analysis.
         <li> Ensures reproducibility, transparency, and minimization of bias.
         <li> Acts as a binding pre-specification to prevent data-driven decisions (e.g., p-hacking).</td>
  </tr>
  
  <tr>
    <td>Contents</td>
    <td> <li> Research question & hypotheses
         <li> Study population & eligibility criteria
         <li> Study design (randomized trial, cohort, case-control, etc.)
         <li> Data collection procedures
         <li> General statistical approach (but not highly technical details)</td>
    <td> <li> Exact statistical models (e.g., regression formulas, survival analysis methods)
         <li> Handling of missing data, outliers, and covariates
         <li> Primary/secondary endpoint analysis (including multiplicity adjustments)
         <li> Sensitivity & subgroup analyses
         <li> Software & code specifications (if applicable)</td>
  </tr>
  
  <tr>
    <td>Audience</td>
    <td>Investigators, ethics committees, funding agencies</td>
    <td>Statisticians, data analysts, regulatory reviewers, peer reviewers </td>
  </tr>
</table>
```


The protocol and SAP are interdependent: The protocol sets the rules; the SAP enforces them statistically. Here are some examples:

* **SAP Expands on the Protocol**
  + **The protocol** states: "We will compare Group A and Group B using regression analysis."
  + **The SAP** specifies: "A Cox proportional hazards model will be used with covariates X, Y, Z. Hazard ratios will be reported with 95% CIs. Missing data will be handled via multiple imputation."


* **SAP is More Rigid (Pre-Specified)**
  + The protocol may allow minor methodological flexibility.
  + The SAP locks in analysis details before data unblinding to prevent bias.

* **SAP is Often a Separate Document**
  + For simple studies, statistical methods may be fully described in the protocol.
  + For complex/high-stakes studies, a standalone SAP is required (e.g., clinical trials for FDA submission).


# Key Information in the SAP

The SAP should contain various sample size calculations for different statistical procedures to achieve certain statistical power and a thorough explanation of the main and any interim analyses used in the data analysis technique. 

The SAP should also thoroughly explain the procedures used to analyze and display the study results.

* **Statistical Significance** – The predefined level of statistical significance (e.g., α = 0.05) and whether one-tailed or two-tailed tests will be employed.

* **Missing Data Handling** – Methods for addressing missing data (e.g., imputation techniques, complete-case analysis).

* **Outlier Management** – Approaches for identifying and handling outliers.

* **Estimation Methods** – Techniques for point and interval estimation.

* **Composite/Derived Variables** – Rules for calculating composite or derived variables, including data-driven definitions, with sufficient detail to minimize ambiguity.

* **Baseline and Covariate Data** – How baseline and covariate data will be incorporated into the analysis.

* **Randomization Factors** – Inclusion of randomization factors (if applicable).

* **Multi-sources Data Handling** – Methods for managing data from multiple sources.

* **Multiple Comparisons & Subgroup Analysis** – Methods for adjusting for multiple comparisons and conducting subgroup analyses.

* **Interim/Sequential Analyses** – Details of any planned interim or sequential analyses.

* **Software Specifications** – Identification of the computer systems and statistical software packages used for data analysis.

* **Assumptions & Sensitivity Analyses** – Critical assumptions of the statistical models and methods for conducting sensitivity analyses to validate these assumptions.

* **Data Presentation** – Guidelines for tables and figures to present study data.

* **Safety Population Definition** – A clear definition of the safety population.

* **Model Validation & Alternatives** – Provisions for testing the statistical model and alternative methods if model assumptions are violated.


<font color = "red">*\color{red}The SAP must include provisions for testing the statistical model, along with alternative methods to be used if the model assumptions are not met.*</font>
















