The objective of this project is to bring together several of the topics we learned this semester and apply them to real-world problems.
The data management part aims to create an analytic data set that has enough information for the exploratory data analysis (EDA) to be discussed next week. The specific information and variables to be integrated into the final data set will be detailed in the subsequent sections.
Part I: Data Integration - Using Either R or SAS
2. Data Sources
The involved data sets can be found in the following.
The information to be included in the analytic data set is outlined below.
Presidential Election Data: The following information should be kept in the final data set
only use the 2020 election data
only keep the data for the two major parties: Democrats and Republicans
aggregate the votes for each party within each county and keep only the winning party in the data
county FIPS code, state name, county name, total votes received by the winning party, and the name of the winning party
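The election steps above can be sketched in base R as follows. This is only a sketch on toy data: the column names (`year`, `state`, `county_name`, `county_fips`, `party`, `candidatevotes`) and the party labels are assumptions and need to be adjusted to the actual file. The FIPS code is stored as character so that leading zeros are not lost.

```r
# Toy stand-in for the election file; all names and values are made up.
election <- data.frame(
  year           = c(2016, 2020, 2020, 2020, 2020, 2020, 2020),
  state          = rep("EXAMPLE STATE", 7),
  county_name    = c(rep("ALPHA", 5), rep("BETA", 2)),
  county_fips    = c(rep("99001", 5), rep("99003", 2)),
  party          = c("REPUBLICAN", "DEMOCRAT", "DEMOCRAT", "REPUBLICAN",
                     "LIBERTARIAN", "DEMOCRAT", "REPUBLICAN"),
  candidatevotes = c(500, 100, 150, 300, 20, 400, 250),
  stringsAsFactors = FALSE
)

# Keep the 2020 election and the two major parties only.
election_2020 <- subset(
  election,
  year == 2020 & party %in% c("DEMOCRAT", "REPUBLICAN")
)

# Sum the votes per county and party (a party may appear on several rows).
party_totals <- aggregate(
  candidatevotes ~ county_fips + state + county_name + party,
  data = election_2020,
  FUN  = sum
)

# Sort so the party with the most votes comes first within each county,
# then keep that first row (ties are broken by row order).
party_totals <- party_totals[
  order(party_totals$county_fips, -party_totals$candidatevotes),
]
election_winner <- party_totals[!duplicated(party_totals$county_fips), ]

# Rename to make the meaning of the kept columns explicit.
names(election_winner)[names(election_winner) == "party"] <- "winner_party"
names(election_winner)[names(election_winner) == "candidatevotes"] <- "winner_votes"
rownames(election_winner) <- NULL

print(election_winner)
```

The result has one row per county, which is the shape needed for the later merge.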
Unemployment Data: The following information should be kept in the data
Only keep the unemployment rate for 2020, or for the most recent year if the 2020 unemployment rate is unavailable
County FIPS code
Unemployment rate
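One way to implement the "2020 or most recent year" rule is sketched below in base R. It assumes the unemployment data are in long format (one row per county and year) with hypothetical column names `county_fips`, `year`, and `unemployment_rate`, and it reads "most recent year" as the latest year up to 2020 with a non-missing rate. Both are assumptions to check against the actual file.

```r
# Toy stand-in for the unemployment file; names and values are made up.
unemployment <- data.frame(
  county_fips       = c("99001", "99001", "99003", "99003", "99003"),
  year              = c(2019, 2020, 2018, 2019, 2020),
  unemployment_rate = c(3.1, 5.4, 4.0, 4.2, NA),
  stringsAsFactors  = FALSE
)

# Drop missing rates and anything after 2020.
usable <- subset(unemployment, !is.na(unemployment_rate) & year <= 2020)

# Latest usable year first within each county, then keep the first row.
usable <- usable[order(usable$county_fips, -usable$year), ]
unemployment_latest <- usable[!duplicated(usable$county_fips), ]
rownames(unemployment_latest) <- NULL

print(unemployment_latest)
```

In the toy data, county 99001 keeps its 2020 rate, while county 99003 falls back to 2019 because its 2020 rate is missing.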
Poverty Data: The following information should be kept in the data
Keep only 2019 poverty rate [the variable name in the data set: PCTPOVALL_2019]
County FIPS code
Education Data: The following information should be kept in the data
Only keep the percentages for each education level for the 2015-2019 period.
Education levels
less than a high school diploma
high school diploma
completed some college (1-3 years)
completed four years of college
County FIPS code
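The poverty and education selections both amount to keeping the FIPS code plus a few columns. The base R sketch below uses toy data; apart from `PCTPOVALL_2019`, which is named above, every column name is hypothetical and should be replaced with the names in the actual files.

```r
# Toy stand-ins; only PCTPOVALL_2019 is a name taken from the assignment.
poverty <- data.frame(
  fips             = c("99001", "99003"),
  PCTPOVALL_2019   = c(12.3, 18.0),
  other_measure    = c(1, 2),
  stringsAsFactors = FALSE
)
education <- data.frame(
  fips                       = c("99001", "99003"),
  pct_less_than_hs_2015_19   = c(11.5, 9.0),
  pct_hs_only_2015_19        = c(33.0, 27.5),
  pct_some_college_2015_19   = c(28.5, 31.0),
  pct_four_year_2015_19      = c(27.0, 32.5),
  pct_less_than_hs_2000      = c(18.0, 15.0),
  stringsAsFactors           = FALSE
)

# Poverty: keep the key and the 2019 poverty rate, with clearer names.
poverty_2019 <- poverty[, c("fips", "PCTPOVALL_2019")]
names(poverty_2019) <- c("county_fips", "poverty_rate_2019")

# Education: keep the key and every column for the 2015-2019 period.
education_columns <- grep("_2015_19$", names(education), value = TRUE)
education_2015_19 <- education[, c("fips", education_columns)]
names(education_2015_19)[1] <- "county_fips"

print(poverty_2019)
print(education_2015_19)
```

Renaming the key to the same name in every data set keeps the later merge simple.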
After preparing the relational data sets above, merge them by county FIPS code into a single data set in which each county has exactly one record.
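A minimal sketch of this final merge in base R is given below. The four small data frames and their column names are hypothetical stand-ins for the cleaned data sets, and the choice of a left join that keeps every county in the election data is an assumption rather than a requirement of the assignment.

```r
# Toy stand-ins for the four cleaned data sets (names and values made up).
election_winner <- data.frame(
  county_fips  = c("99001", "99003", "99005"),
  winner_party = c("REPUBLICAN", "DEMOCRAT", "REPUBLICAN"),
  winner_votes = c(300, 400, 120),
  stringsAsFactors = FALSE
)
unemployment_latest <- data.frame(
  county_fips       = c("99001", "99003", "99005"),
  unemployment_rate = c(5.4, 4.2, 6.1),
  stringsAsFactors  = FALSE
)
poverty_2019 <- data.frame(
  county_fips       = c("99001", "99003"),
  poverty_rate_2019 = c(12.3, 18.0),
  stringsAsFactors  = FALSE
)
education_2015_19 <- data.frame(
  county_fips           = c("99001", "99003", "99005"),
  pct_four_year_2015_19 = c(27.0, 32.5, 19.0),
  stringsAsFactors      = FALSE
)

# Merge the data sets one after another on the county FIPS code.
# all.x = TRUE keeps every county from the left-hand data set.
county_data <- Reduce(
  function(left, right) merge(left, right, by = "county_fips", all.x = TRUE),
  list(election_winner, unemployment_latest, poverty_2019, education_2015_19)
)

# Confirm that each county appears exactly once.
stopifnot(anyDuplicated(county_data$county_fips) == 0)

print(county_data)
```

County 99005 has no poverty record in the toy data, so its poverty rate is `NA` after the merge; counts of such missing values are worth reporting.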
4. Some Suggestions on Coding Conventions
The following are some suggestions on best practices in coding projects with multiple tasks.
Formatting: A good programmer writes code in a way that makes it as easy as possible to read and review. Good formatting is essential.
Avoid putting multiple statements on one line; it hurts readability.
Put each variable on its own line when a statement lists multiple variables.
Indent consistently. In SAS, indent every statement in a DATA step or PROC step except the first and the last.
Commenting: Comments make the code more readable and understandable, and they should be used appropriately. The following are a few occasions on which a comment should be used.
A program header containing a list of information about the program
When you change the code, add a comment describing the change.
When a line or block of code relies on a clever idea that is not obvious to others, explain it in a comment.
Naming Conventions: Use consistent naming conventions in your code.
Dataset names should describe what is in the dataset (i.e., the name should be descriptive).
The key to choosing variable names is clarity.
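The short R script below illustrates these suggestions together: a program header, one statement per line, one variable per line, a comment on a non-obvious choice, and descriptive names. The file name, header fields, and data are hypothetical; the layout is one reasonable style, not a required template.

```r
# ------------------------------------------------------------
# Program : clean_unemployment.R   (hypothetical file name)
# Purpose : Flag counties with a high unemployment rate
# Author  : <your name>
# Created : <date>
# Changes : <date> <initials> - <what changed and why>
# ------------------------------------------------------------

# Descriptive data set name; one variable per line.
county_unemployment <- data.frame(
  county_fips       = c("99001", "99003"),
  unemployment_rate = c(5.4, 4.2),
  stringsAsFactors  = FALSE
)

# Non-obvious choice: FIPS codes are kept as character so that
# leading zeros survive later merges.

# One statement per line, with names that say what they hold.
high_unemployment_cutoff <- 5
is_high_unemployment <- county_unemployment$unemployment_rate > high_unemployment_cutoff
county_unemployment$high_unemployment <- is_high_unemployment

print(county_unemployment)
```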
Part II: EDA and Visualization - Using R Only
5. EDA and Statistical Graphics
As an integral part of exploratory data analysis, statistical graphics make it easier to identify patterns, trends, outliers, etc. in a data set. The primary goal of EDA is to detect (hidden) patterns in the data. I will not ask you to perform a specific EDA. However, I would like to see the following three basic techniques in your analysis report.
Distributional information for a single variable, with a relevant visual representation
Relationship between two variables
Pairwise comparisons
Please keep in mind that the characterization of categorical and continuous variables uses different approaches. When assessing the relationship between two variables, you must consider whether they are continuous, categorical, or a mixture of both.
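As a rough illustration of how the variable types drive the choice of technique, the base R sketch below runs each of the three techniques on simulated data. The variable names are hypothetical, the values are random, and "pairwise comparisons" is interpreted here as a correlation matrix plus a scatterplot matrix, which is only one possible reading.

```r
# Simulated stand-in for the analytic data set (values are random).
set.seed(1)
county_data <- data.frame(
  winner_party      = sample(c("DEMOCRAT", "REPUBLICAN"), 50, replace = TRUE),
  unemployment_rate = round(runif(50, 2, 12), 1),
  poverty_rate      = round(runif(50, 5, 30), 1),
  pct_four_year     = round(runif(50, 10, 50), 1),
  stringsAsFactors  = FALSE
)

# 1. Single variable: counts and a bar chart for a categorical variable,
#    a numeric summary and a histogram for a continuous variable.
party_counts <- table(county_data$winner_party)
barplot(party_counts, main = "Counties won by party",
        xlab = "Winning party", ylab = "Number of counties")
print(summary(county_data$poverty_rate))
hist(county_data$poverty_rate, main = "Distribution of poverty rate",
     xlab = "Poverty rate (%)")

# 2. Two variables: boxplots for continuous vs. categorical,
#    a scatterplot for continuous vs. continuous.
boxplot(poverty_rate ~ winner_party, data = county_data,
        main = "Poverty rate by winning party",
        xlab = "Winning party", ylab = "Poverty rate (%)")
plot(county_data$unemployment_rate, county_data$poverty_rate,
     main = "Poverty vs. unemployment",
     xlab = "Unemployment rate (%)", ylab = "Poverty rate (%)")

# 3. Pairwise: correlations and a scatterplot matrix of the numeric variables.
numeric_vars <- county_data[, c("unemployment_rate", "poverty_rate", "pct_four_year")]
cor_matrix <- cor(numeric_vars)
print(round(cor_matrix, 2))
pairs(numeric_vars, main = "Pairwise scatterplots")
```

For two categorical variables, a two-way table from `table()` would play the role that the scatterplot plays here.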
5. Comments on Visual Representation
First of all, visual representation includes both tables and graphics. Please follow the best practice of visual representation to include the required components of tables and figures that are listed below.
Table Representation
Table caption
Table header
Figure Representation
Title
Labels of both axes
Figure captions
Legends
Annotations if needed
In addition, each table and figure must be interpreted briefly based on the observed patterns.
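The figure components listed above can be sketched with base R graphics as follows. The data are simulated and the variable names are hypothetical. In an R Markdown report, captions are commonly supplied through the `fig.cap` chunk option for figures and the `caption` argument of `knitr::kable()` for tables; neither is shown in the code below.

```r
# Simulated data; names and values are for illustration only.
set.seed(2)
county_data <- data.frame(
  winner_party      = rep(c("DEMOCRAT", "REPUBLICAN"), each = 20),
  unemployment_rate = runif(40, 2, 12),
  poverty_rate      = runif(40, 5, 30),
  stringsAsFactors  = FALSE
)

# One color per party, looked up by name for each point.
party_colors <- c(DEMOCRAT = "blue", REPUBLICAN = "red")

# Title and labels for both axes.
plot(county_data$unemployment_rate, county_data$poverty_rate,
     col  = party_colors[county_data$winner_party],
     pch  = 19,
     main = "Poverty rate vs. unemployment rate by winning party",
     xlab = "Unemployment rate (%)",
     ylab = "Poverty rate (%)")

# Legend explaining the colors.
legend("topleft", legend = names(party_colors), col = party_colors,
       pch = 19, title = "Winning party")

# Annotation pointing out the county with the highest poverty rate.
highest <- which.max(county_data$poverty_rate)
text(county_data$unemployment_rate[highest], county_data$poverty_rate[highest],
     labels = "Highest poverty rate", pos = 1)
```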
6. Project Reporting - Using RMarkdown
Your project report is the formal description of your project. The format should be similar to statistical journal articles. The report should be about 10 pages in length (excluding code) and contain the following components.
Problem Statement and Background
Give a clear and complete statement of the problem(s) you plan to address. Where does the data come from, and what are its characteristics?
Include background material as appropriate: who cares about the problem and what impact the questions have
Objectives of data integration
Software programs to be used in the project and the reasons for using them
Data Integration
Write narrative paragraphs to describe the process of data integration (refer to the specified steps)
Use sections and subsections with descriptive headings to organize your work
Describe the final aggregated data set: size, variable lists, variable types
Describe the steps of data inspection for anomalies
Exploratory Data Analysis
You may perform various analyses, but report only practically meaningful results
Each subsection focuses on one analysis with visual representations whenever relevant
Each analysis should be accompanied by narrative paragraphs to describe the patterns you observed from the visual representations.
Results and Discussions
Give a detailed summary of the results of your work. This is also where you make inferential statements based on your statistical knowledge and formulate new research questions, or update your initial questions, based on the newly uncovered patterns.
Include only the important outputs (figures and tables) that support your arguments in the report. Please do not copy or screenshot visuals from any output and paste them into the report. Tables and figures should be generated directly by the software program you used in the EDA.
Interpret each of the tables and figures that you included in the report.
Please use visualizations whenever possible. You are particularly encouraged to use interactive graphics in your report.
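For the data inspection step listed under Data Integration, a few basic anomaly checks can be sketched in base R as shown below. The data frame and its column names are hypothetical, and the checks (duplicate keys, missing values, rates outside 0 to 100) are a starting point rather than a complete list.

```r
# Toy data with deliberate problems: a duplicated county, missing
# poverty rates, and an impossible unemployment rate.
county_data <- data.frame(
  county_fips       = c("99001", "99003", "99003", "99005"),
  unemployment_rate = c(5.4, 4.2, 4.2, 104),
  poverty_rate      = c(12.3, NA, NA, 18.0),
  stringsAsFactors  = FALSE
)

# Counties that appear more than once.
duplicate_fips <- county_data$county_fips[duplicated(county_data$county_fips)]

# Number of missing values in each column.
missing_counts <- colSums(is.na(county_data))

# Number of rates outside the valid 0-100 range in each rate column.
rate_columns <- c("unemployment_rate", "poverty_rate")
out_of_range <- sapply(
  county_data[rate_columns],
  function(x) sum(x < 0 | x > 100, na.rm = TRUE)
)

print(duplicate_fips)
print(missing_counts)
print(out_of_range)
```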