This week’s assignment has two components.
Finding a data set for project #1, which is due on Sunday, 10/1/2023. The detailed requirements for the data set are described in the next section.
Fitting a simple linear regression (SLR) model by the least squares method, using a numerical explanatory variable selected from the data set, and then constructing 95% bootstrap confidence intervals for the regression coefficients.
The selected data set will also be used for multiple linear regression analysis and bootstrap analysis. To support the commonly used regression techniques, the data set must meet some requirements. You are free to choose a data set that interests you, so that you can more easily formulate analytic questions and tell a better story from the analytic results.
The desired data set must have
The following websites contain many links to sites that have different types of data sets (some of the links may not be active).
10 open data sets for linear regression https://lionbridge.ai/datasets/10-open-datasets-for-linear-regression/
UFL Larry Winner’s Teaching Data Sets http://users.stat.ufl.edu/~winner/datasets.html
The suggested data repository for this class http://stat321.s3.amazonaws.com/w00-datasets.html
Datasets for Teaching (Univ. Sheffield, UK) https://www.sheffield.ac.uk/mash/statistics/datasets
Data.World https://data.world/datasets/regression
Before you start searching for a data set, check the D2L discussion board to make sure you do not select a data set that a classmate has already chosen for their project. After you identify your data set, please post its name and a link to it on the discussion board.
Please prepare an RMarkdown document to include the following two parts. Please start your work earlier to
Write an essay to describe the data set. The following information is expected to be included in this description.
Make a pairwise scatter plot of all variables in your selected data set and choose an explanatory variable that is linearly correlated with the response variable.
Make a pairwise scatter plot and comment on the relationship between the response and explanatory variables.
Fit a simple linear regression (SLR) model by ordinary least squares to capture the linear relationship between the two variables. If you transformed one of the variables to achieve a linear relationship, use the transformed variable in the model. Then perform the model diagnostics: comment on the residual plots and point out any violations of the model assumptions.
Apply the bootstrap algorithm to the final linear regression model from the previous step to construct bootstrap confidence intervals for the regression coefficients (using a \(95\%\) confidence level).
Compare the p-values and the bootstrap confidence intervals of the corresponding regression coefficients of the final linear regression model, recommend which inferential result should be reported, and justify your recommendation.
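The fitting and bootstrap steps above can be sketched as follows. This is only an illustration of the logic, not the required implementation: the assignment itself is to be written in RMarkdown, and plain Python is used here just to keep the example self-contained. The assignment does not say which bootstrap variant to use, so the sketch assumes case resampling (resampling (x, y) pairs with replacement) and percentile intervals. The function names `fit_slr`, `percentile`, and `bootstrap_ci`, the number of replicates, the seeds, and the simulated data are all hypothetical choices made for this example.

```python
import math
import random


def fit_slr(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def percentile(sorted_vals, p):
    """Percentile (0 <= p <= 1) of a sorted list, using linear interpolation."""
    pos = p * (len(sorted_vals) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=321):
    """Percentile bootstrap CIs for the intercept and slope (case resampling).

    n_boot, level, and seed are illustrative defaults, not course requirements.
    """
    rng = random.Random(seed)
    n = len(x)
    intercepts, slopes = [], []
    while len(slopes) < n_boot:
        # Draw n observation indices with replacement, keeping (x, y) pairs intact.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if max(xs) == min(xs):
            continue  # slope is undefined when all resampled x values coincide
        b0, b1 = fit_slr(xs, ys)
        intercepts.append(b0)
        slopes.append(b1)
    intercepts.sort()
    slopes.sort()
    tail = (1 - level) / 2
    return {
        "intercept": (percentile(intercepts, tail), percentile(intercepts, 1 - tail)),
        "slope": (percentile(slopes, tail), percentile(slopes, 1 - tail)),
    }


# Simulated data for illustration only; replace with the two chosen variables.
data_rng = random.Random(2023)
x = [data_rng.uniform(0, 10) for _ in range(50)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

b0_hat, b1_hat = fit_slr(x, y)
ci = bootstrap_ci(x, y)
print("least squares estimates:", b0_hat, b1_hat)
print("95% bootstrap CI, intercept:", ci["intercept"])
print("95% bootstrap CI, slope:", ci["slope"])
```

When comparing the two kinds of inference, a 95% bootstrap interval for the slope that excludes 0 plays the same role as a p-value below 0.05 for the slope. The bootstrap interval does not rely on normally distributed residuals, which is one consideration when the residual plots show violations of that assumption.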