---
title: 'Project One: Regression Algorithms and Cross-validation'
author: "Part I - EDA and Feature Engineering"
date: "STA 511 - Foundations of Data Science"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
div#TOC li {     /* table of contents */
    list-style:upper-roman;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* author line (rendered as a level-4 header) */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* date line (rendered as a level-4 header) */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* level-1 section headers */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* level-2 section headers */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* level-3 section headers */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* level-4 section headers */
    font-size: 14px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# Load the required packages, installing any that are missing.
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("GGally")) {
  install.packages("GGally")
  library(GGally)
}
# Global chunk options: these control whether the R code, warnings,
# messages, and output are included in the output files.
knitr::opts_chunk$set(
  echo = TRUE,         # include code chunks in the output file
  warning = FALSE,     # hide warning messages (the knitr option is `warning`, not `warnings`)
  results = "markup",  # include printed output in the output file
  message = FALSE,     # hide messages such as package start-up notes
  comment = NA         # do not prefix printed output lines
)
```



\

# Data Set


Choose a data set that has at least four categorical variables and four numerical variables. The sample size should be at least 200. You can find a data set either from my teaching data repository or other data sources. The data set should be cross-sectional (i.e., each of the data points must be observed/collected/generated at the same time).

The data set must have both continuous and categorical (ideally binary) variables so that you can perform linear and logistic regression modeling.
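
A quick way to confirm that a candidate data set meets these requirements is to count its rows and variable types. The sketch below is only an illustration: `check_requirements()` is a hypothetical helper, and `toy` is simulated stand-in data rather than a real data set. Note that categorical variables stored as numeric codes (e.g., 0/1) would be counted as numerical here, so recode them first.

```r
# Hypothetical helper: checks a data frame against the sample size and
# variable-type requirements stated above.
check_requirements <- function(df, min_n = 200, min_cat = 4, min_num = 4) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_cat <- vapply(
    df,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )
  list(
    n_obs         = nrow(df),
    n_numerical   = sum(is_num),
    n_categorical = sum(is_cat),
    meets_all     = nrow(df) >= min_n &&
      sum(is_num) >= min_num &&
      sum(is_cat) >= min_cat
  )
}

# Simulated stand-in data, used only to show the call.
set.seed(511)
n <- 250
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n), x4 = rexp(n),
  g1 = sample(c("a", "b"), n, replace = TRUE),
  g2 = sample(c("low", "mid", "high"), n, replace = TRUE),
  g3 = factor(sample(c("yes", "no"), n, replace = TRUE)),
  g4 = sample(c("north", "south"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

req <- check_requirements(toy)
str(req)
```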

# Description of Data

The following information about the data should be provided in the report:

* A brief description of the data source.

* How the data set is generated or collected.

* The number of variables, their types (categorical or numerical), and the size of the data set.

* A list of the variable names and their descriptions/definitions.
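
Part of this description can be generated from the data itself. The following sketch assumes a hypothetical helper, `describe_variables()`, and uses R's built-in `iris` data purely as a stand-in. The variable definitions still have to be written by hand from the data source's documentation.

```r
# Hypothetical helper: one row per variable with its type, number of
# missing values, and number of distinct values.
describe_variables <- function(df) {
  data.frame(
    variable  = names(df),
    type      = ifelse(vapply(df, is.numeric, logical(1)),
                       "numerical", "categorical"),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_unique  = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Built-in data used only as a stand-in for your own data set.
var_table <- describe_variables(iris)
print(var_table)
cat("Number of observations:", nrow(iris), "\n")
```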


# Problem Statements and Candidate Models

Formulate at least two practical questions based on the continuous and categorical response variables. State the practical questions clearly and convert them into unambiguous analytic questions so that you can identify candidate models, with sufficient justification, to address the practical questions.

Write model formulas and assumptions of all candidate models explicitly.
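
As an illustration of the expected level of detail, the generic forms of the two model families named above can be written as follows. The notation is an assumption made for this sketch: $y_i$ is a continuous response, $Y_i$ is a binary response coded 0/1, $x_{i1}, \dots, x_{ip}$ are the feature variables, and $\beta_0, \dots, \beta_p$ are regression coefficients. Your own formulas should use your actual variable names.

Multiple linear regression, assuming independent, normally distributed errors with constant variance and a mean that is linear in the features:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
$$

Binary logistic regression, assuming independent responses and a log-odds that is linear in the features:

$$
\log\frac{\pi_i}{1 - \pi_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},
\qquad Y_i \sim \text{Bernoulli}(\pi_i), \quad \pi_i = P(Y_i = 1 \mid x_{i1}, \dots, x_{ip})
$$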



# Exploratory Data Analysis and Feature Engineering

Perform the standard EDA to serve the following major purposes:

* Inspecting data issues such as missing values, mistakenly recorded data values, inconsistent data formats, etc., and fixing them;

* Identifying new patterns/insights to improve subsequent modeling;

* Checking assumptions of candidate models and performing appropriate feature engineering.
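
The first purpose, inspecting and fixing data issues, might look like the following minimal sketch. The data frame `raw` contains made-up values with deliberately planted problems, and the valid age range of 0 to 120 is an assumption chosen for illustration.

```r
# Toy data with planted issues (hypothetical values).
raw <- data.frame(
  age    = c(34, 51, NA, 29, 420, 46),         # one NA, one implausible value
  income = c(52000, 61000, 48000, NA, 75000, 58000),
  sex    = c("F", "f", "M", "Male", "M", "F"), # inconsistent labels
  stringsAsFactors = FALSE
)

# 1. Missing values: count NAs in each variable.
n_missing <- colSums(is.na(raw))
print(n_missing)

# 2. Mistakenly recorded values: flag ages outside the assumed valid range.
bad_age <- which(!is.na(raw$age) & (raw$age < 0 | raw$age > 120))

# 3. Inconsistent formats: list the distinct labels actually used.
print(table(raw$sex, useNA = "ifany"))

# Fixes: treat the implausible age as missing and harmonize the labels
# by keeping the upper-cased first letter ("f" and "Male" become "F" and "M").
clean <- raw
clean$age[bad_age] <- NA
clean$sex <- toupper(substr(trimws(clean$sex), 1, 1))
print(table(clean$sex))
```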

To present your EDA in a clear, logical order, you are encouraged to use subsections to organize your work.

For each EDA and associated representation, you should 

* Open a paragraph with one or a few sentences describing the reasons for the specific EDA before the actual analysis;

* After the analysis, interpret what you observed and its implications for potential feature engineering;

* Perform feature engineering (if necessary) based on EDA findings, and thoroughly document all steps to ensure reproducibility.
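
The three steps above can be sketched for a single numerical feature as follows. This is only one possible example: `income` is simulated, `skewness()` is a hypothetical helper that computes the moment-based sample skewness, and the log transformation is one common remedy for right skew, not the only option.

```r
# Reason for this EDA: strongly skewed features can produce influential
# observations in a linear regression, so we check the skewness first.
set.seed(511)
income <- rlnorm(300, meanlog = 10, sdlog = 0.8)  # simulated, right-skewed

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Analysis: compare skewness before and after a log transformation.
skew_raw <- skewness(income)
skew_log <- skewness(log(income))
cat("Skewness of income:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(income):", round(skew_log, 2), "\n")

# Interpretation and feature engineering: if the log scale is much closer
# to symmetric, keep the log-transformed feature and document the step.
log_income <- log(income)
```

In a report, a histogram of the feature before and after the transformation would normally accompany these numbers.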


# Creating Analytical Data

Create an analytical data set that includes

* all original feature variables for which no feature engineering is needed
* all feature-engineered variables, excluding the corresponding original variables

All variables will be called directly in subsequent models. Note that all numerical feature variables need to be standardized for predictive modeling.
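
A minimal sketch of assembling such a data set is shown below. All variable names (`income`, `age`, `region`, `default`) and values are simulated for illustration. The response variable is left unstandardized here, which is an assumption of this sketch.

```r
# Simulated stand-in data: two numerical features, one categorical
# feature, and a binary response.
set.seed(511)
dat <- data.frame(
  income  = rlnorm(200, meanlog = 10, sdlog = 0.8),
  age     = round(runif(200, min = 18, max = 80)),
  region  = sample(c("north", "south"), 200, replace = TRUE),
  default = rbinom(200, size = 1, prob = 0.3),
  stringsAsFactors = FALSE
)

# Feature engineering: log_income replaces the original income variable.
dat$log_income <- log(dat$income)

# Keep engineered variables and drop the originals they replace.
analytic <- dat[, setdiff(names(dat), "income")]

# Standardize the numerical features (mean 0, standard deviation 1).
num_features <- c("age", "log_income")
analytic[num_features] <- lapply(
  analytic[num_features],
  function(x) as.numeric(scale(x))
)

str(analytic)
```

When the data are later split into training and test sets, the means and standard deviations used for standardization are usually computed from the training portion only.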


# Wrapping Feature Engineering Code

Wrap the feature engineering code into **reusable functions** for predictive modeling. This ensures that the same transformations applied during training can be seamlessly applied to new raw data during inference.

* **Modularity**: Each feature engineering step should be a separate function.

* **Consistency**: Transformations must behave identically on training and new data.

* **Stateful Transformations**: Store the parameters learned during training (e.g., imputation values, scalers) so that they can be reused on new data.
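
The three principles can be sketched with a fit/apply pair of functions. All function names below (`impute_median()`, `standardize()`, `fit_preprocessor()`, `apply_preprocessor()`) are hypothetical, and median imputation followed by standardization is just one possible pipeline chosen for illustration.

```r
# Modularity: each step is its own function.
impute_median <- function(df, medians) {
  for (v in names(medians)) {
    df[[v]][is.na(df[[v]])] <- medians[[v]]
  }
  df
}

standardize <- function(df, means, sds) {
  for (v in names(means)) {
    df[[v]] <- (df[[v]] - means[[v]]) / sds[[v]]
  }
  df
}

# Stateful transformation: learn the parameters from the training data only.
fit_preprocessor <- function(train, num_vars) {
  medians <- vapply(train[num_vars], median, numeric(1), na.rm = TRUE)
  imputed <- impute_median(train, medians)
  list(
    medians = medians,
    means   = vapply(imputed[num_vars], mean, numeric(1)),
    sds     = vapply(imputed[num_vars], sd, numeric(1))
  )
}

# Consistency: the same stored parameters are applied to any data set.
apply_preprocessor <- function(df, state) {
  df <- impute_median(df, state$medians)
  standardize(df, state$means, state$sds)
}

# Made-up training data and new raw data.
train    <- data.frame(x = c(1, 2, NA, 4, 5), z = c(10, 20, 30, 40, 50))
new_data <- data.frame(x = c(NA, 3), z = c(30, 60))

state       <- fit_preprocessor(train, num_vars = c("x", "z"))
train_ready <- apply_preprocessor(train, state)
new_ready   <- apply_preprocessor(new_data, state)
print(new_ready)
```

The new data are imputed and scaled with the training medians, means, and standard deviations, never with statistics computed from the new data themselves.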


# Reporting and Format

Use the suggested reporting template (the RMarkdown source can be found at <https://pengdsci.github.io/STA551/w01/w01-ReportingRMarkdoenSource.txt>) and the report components described at <https://pengdsci.github.io/STA551/w02/w02-AssignSunmission.html>.
