```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("nleqslv")) {
   install.packages("nleqslv")
   library(nleqslv)
}
#
# specifications of outputs of code in code chunks
# (the knitr option names are singular: `warning` and `message`)
knitr::opts_chunk$set(echo = TRUE,      # include code chunk in the output file
                      warning = FALSE,  # sometimes, your code may produce warning messages;
                                        # you can choose to include the warning messages in
                                        # the output file.
                      message = FALSE,  # likewise for messages, e.g., from loading packages
                      results = TRUE    # you can also decide whether to include the output
                                        # in the output file.
                      )
```

\

# Introduction

This week's assignment has two components.

* Finding a data set for **project #1** that is due on Sunday, 10/1/2023. The detailed requirements of the data will be described in the next section.

* Fitting a simple linear regression (SLR) with a numerical explanatory variable selected from the data set, using the least squares approach, and then constructing 95% bootstrap confidence intervals for the regression coefficients.

# Data Requirements and Sources

The selected data set will be used for multiple linear regression analysis as well as bootstrap analysis. To implement the commonly used regression techniques, the data must meet some requirements. You also have the flexibility to choose a data set that interests you, so you can easily formulate the analytic questions and tell a better story from the analytic results.

## Data set requirements

The desired data set must meet the following requirements:

+ The response variable **must** be a continuous random variable.
+ There are at least two categorical explanatory variables.
+ At least one of the categorical variables has more than two categories.
+ There are at least two numerical explanatory variables.
+ At least 15 observations are required for estimating each regression coefficient.
For example, if your final linear model has 11 variables (including dummy variables), it has 12 regression coefficients (11 slopes plus the intercept), so you need at least $12 \times 15 = 180$ observations.

## Data Sources

The following websites contain many links to sites that have different types of data sets (some of the links may not be active).

* 10 open data sets for linear regression
* UFL Larry Winner's Teaching Data Sets
* The suggested data repository for this class
* Datasets for Teaching (Univ. Sheffield, UK)
* Data.World

## Post Your Selected Data on D2L

Before you start searching for a data set, check the D2L discussion board and make sure you do not select a data set that a classmate has already chosen for their project. After you identify your data set, please post its name and the link to it.

# This Week's Data Analysis

Due: Sunday, 9/17/2023

Please prepare an RMarkdown document to include the following two parts.

Please start your work earlier to

## Description of the Data Set

Write an essay to describe the data set. The following information is expected to be included in this description.

* How was the data collected?
* A list of all variables: their names and variable types.
* What are your practical and analytic questions?
* Does the data set have enough information to answer the questions?

## Simple Linear Regression

Make a pairwise scatter plot of all variables in your selected data set and choose an explanatory variable that is linearly correlated with the response variable.

* Make a pairwise scatter plot and comment on the relationship between the response and explanatory variables.
  + If there is a non-linear pattern, can you transform one of the variables so that the transformed variable and the other original variable have a linear pattern?
  + If you have a choice to transform either the response variable or the explanatory variable, what is your choice and why?

* Fit a simple linear regression (SLR) by ordinary least squares to capture the linear relationship between the two variables.
If you transformed one of the variables to achieve the linear relationship, use the transformed variable in the model. Then perform the model diagnostics: comment on the residual plots and point out any violations of the model assumptions.

* Use the bootstrap algorithm on the **previous final linear regression model** to construct bootstrap confidence intervals for the regression coefficients (using the $95\%$ confidence level).

* Compare the p-values and the bootstrap confidence intervals of the corresponding regression coefficients of the final linear regression model, recommend which inferential result should be reported, and justify your recommendation.
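
The steps above can be sketched in R as follows. This is only a minimal illustration under stated assumptions, not the required solution: it uses the built-in `cars` data set as a stand-in for your project data, and the variable names `dist` and `speed`, the seed, and the choice of `B = 1000` replicates are all illustrative. It resamples cases (rows) and reports percentile intervals; other bootstrap schemes, such as resampling residuals, are also reasonable.

```r
# Illustrative stand-in data: the built-in `cars` data set (an assumption,
# not the project data). Replace `dist` and `speed` with your own variables.
dat <- cars

# Least-squares fit of the simple linear regression.
fit <- lm(dist ~ speed, data = dat)
summary(fit)$coefficients      # estimates, standard errors, t- and p-values

# Residual diagnostics: the four standard plots for an lm object.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Case-resampling bootstrap: resample rows with replacement and refit.
set.seed(123)                  # arbitrary seed, for reproducibility only
B <- 1000                      # number of bootstrap replicates (assumed value)
n <- nrow(dat)
boot_coef <- matrix(NA_real_, nrow = B, ncol = 2,
                    dimnames = list(NULL, names(coef(fit))))
for (b in seq_len(B)) {
  idx <- sample(n, size = n, replace = TRUE)
  boot_coef[b, ] <- coef(lm(dist ~ speed, data = dat[idx, ]))
}

# 95% percentile bootstrap confidence intervals, one row per coefficient.
boot_ci <- t(apply(boot_coef, 2, quantile, probs = c(0.025, 0.975)))
boot_ci

# Normal-theory intervals from the same fit, for comparison.
confint(fit, level = 0.95)
```

Placing `boot_ci` next to `confint(fit)` and the p-values in `summary(fit)` gives the side-by-side view needed for the final comparison. When the residual plots suggest that the normality or constant-variance assumption is doubtful, the bootstrap intervals do not rely on those assumptions in the same way, which is one possible basis for the recommendation.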