In this note, we focus on the relationship between continuous numeric
variables. There are different types of relationships between two
numeric variables. The relationship we are interested in is the linear
relationship. Two specific topics covered in this note are:
Correlation coefficient - determining whether there is a linear relationship between the two variables.
Linear regression - describing how the values of one variable change in response to changes in the other variable.
A correlation exists between two numeric variables when one of them is related to the other in some way. To visualize the relational pattern, we use a graphical tool, the scatter plot (or scatter diagram), which is a graph of the paired (x, y) data with a horizontal x-axis and a vertical y-axis.
We now look at a few scatter plots that demonstrate different general relationships.
A relationship is linear when the points on a scatter plot follow a somewhat straight-line pattern.
Non-linear relationships have an apparent
pattern that is not a straight line. The following two figures represent a
quadratic relationship between two numeric variables.
When two variables have no relationship, there
is no straight-line relationship or non-linear relationship. When one
variable changes, it does not influence the other
variable.
The above visual representations and examples demonstrate the various
relationships between two numeric variables. To quantify the
strength and direction of the linear
relationship between two variables, we use the linear
correlation coefficient, which can be estimated from sample data using the
following formula \[
r =
\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}
\] The data used in estimating the correlation
coefficient have a structure like the following data set, which is used in
Example 5 below.
We can think of Height (cm) and Weight (kg) as \(X\) and \(Y\). The components in the formula of the correlation coefficient are important sums of squares that are also used to estimate the parameters of the simple linear regression (with only one independent variable).
\[ SS_{xx} = \sum_{i=1}^n(x_i-\bar{x})^2, \ \ \ \ \ SS_{yy} = \sum_{i=1}^n(y_i-\bar{y})^2, \ \ \ \ \text{and} \ \ \ \ \ SS_{xy} = \sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}). \]
With the above notation, we can re-express the correlation coefficient as
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}}\sqrt{SS_{yy}}} \]
The calculation of the sum of squares is not difficult but could be time-consuming.
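Because the hand calculation is tedious, it can help to see the sums of squares computed by a short program. The following is a minimal Python sketch, not the method used by the app mentioned in this note. The function names `sums_of_squares` and `correlation` are our own, and the height and weight values are hypothetical numbers chosen only for illustration; they are not the data table used in Example 5.

```python
from math import sqrt


def sums_of_squares(x, y):
    """Return (SS_xx, SS_yy, SS_xy) for paired samples x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xx, ss_yy, ss_xy


def correlation(x, y):
    """Linear correlation coefficient r = SS_xy / (sqrt(SS_xx) * sqrt(SS_yy))."""
    ss_xx, ss_yy, ss_xy = sums_of_squares(x, y)
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


# Hypothetical data for illustration only (NOT the table from the note).
heights = [150.0, 160.0, 170.0, 180.0, 190.0]   # cm
weights = [52.0, 58.0, 69.0, 74.0, 88.0]        # kg

ss_xx, ss_yy, ss_xy = sums_of_squares(heights, weights)
r = correlation(heights, weights)
print("SS_xx =", ss_xx, "SS_yy =", ss_yy, "SS_xy =", ss_xy)
print("r =", round(r, 3))
```

Replacing the two lists with the actual height and weight columns would reproduce the calculation for the data table.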
Example 5. Let’s consider the relationship between height and weight. A sample data set is given in the above data table. Make a scatter plot and calculate the correlation coefficient.
Solution: We first make the scatter plot shown
below.
Using the above formula, we calculate the coefficient of correlation between weight and height and obtain \(r = 0.789\). You can use IntroStatsApps (https://chengpeng.shinyapps.io/correlation-reg/) to calculate the correlation coefficient on any other data. This data was used in the App as a default example.
The interpretation of the correlation coefficient is summarized in the following table.
Important Remarks
The correlation coefficient is defined to measure the strength of the linear correlation between two numeric variables. Therefore, the correlation coefficient should never be used to measure a non-linear relationship between two numeric variables.
In general, a linear correlation does not necessarily imply causation.
If we use the notation cor(X, Y) to denote the correlation coefficient between X and Y, then \(cor(X, Y) = cor(Y, X)\).
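Two of the remarks above can be checked numerically. The sketch below uses made-up data and a helper function `correlation` that we define here for illustration. It shows that a perfect quadratic relationship on symmetric x-values yields a correlation coefficient of zero, and that swapping the two variables does not change the coefficient.

```python
from math import sqrt


def correlation(x, y):
    """Linear correlation coefficient computed from the sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


# Made-up data: y is an exact quadratic function of x, with x symmetric about 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

# SS_xy is exactly 0 here, so r is 0 even though y is fully determined by x.
r_quadratic = correlation(x, y)
print("r for y = x^2:", r_quadratic)

# Symmetry: the order of the two variables does not matter.
u = [1.0, 2.0, 3.0, 4.0, 5.0]
v = [2.1, 3.9, 6.2, 7.8, 10.1]
print("cor(u, v) =", correlation(u, v))
print("cor(v, u) =", correlation(v, u))
```

The zero value for the quadratic data is a reminder that r near zero means "no linear relationship", not "no relationship".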
The linear correlation coefficient provides us with the strength and the direction of the association between two numeric variables. However, it does not tell us how much one variable changes when the other variable changes. For example, in the following figure, if we increase x by one unit, the change in y in the left plot is less than the change in y in the right plot. However, the correlation coefficients in the two plots are the same.
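The point that two data sets can share the same correlation coefficient while y responds very differently to x can be illustrated with a small sketch. The data below are made up (two exact straight lines with different steepness) and are not the data behind the figure; the helper `correlation` is our own.

```python
from math import sqrt


def correlation(x, y):
    """Linear correlation coefficient computed from the sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_flat = [1.0 + 0.5 * xi for xi in x]    # y changes by 0.5 per unit of x
y_steep = [1.0 + 2.0 * xi for xi in x]   # y changes by 2.0 per unit of x

r_flat = correlation(x, y_flat)
r_steep = correlation(x, y_steep)

# Both coefficients equal 1 (up to floating-point rounding),
# yet the change in y per unit change in x differs by a factor of 4.
print("r (flat line): ", round(r_flat, 6))
print("r (steep line):", round(r_steep, 6))
print("change in y per unit x:", y_flat[1] - y_flat[0], "vs", y_steep[1] - y_steep[0])
```

This is why a second tool is needed to describe how y changes with x.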
The equation of the linear regression line is, in general, given by \[ y = b + m x \] It gives the explicit relationship between two variables. \(b\) is the intercept and \(m\) is the slope. Variable \(x\) is called a predictor or explanatory variable that explains the other variable \(y\) called the response or dependent variable.
If \(m > 0\), \(x\) and \(y\) are positively (linearly) correlated.
If \(m < 0\), \(x\) and \(y\) are negatively (linearly) correlated.
If \(m = 0\), \(x\) and \(y\) are NOT linearly correlated.
Note that both \(b\) and \(m\) are estimated from the data. Once their estimated values are obtained, the estimated regression model is written in the following form \[ \hat{y} = \hat{b} + \hat{m} x \] where
\(\hat{y}\) = predicted (or fitted) value.
\(\hat{b}\) and \(\hat{m}\) are estimated intercept and slope.
The following figure shows the concepts given above.
Example 6. A hydrologist creates a model to predict the volume flow for a stream at a bridge crossing with a predictor variable of daily rainfall in inches.
\[ \hat{y} = 1.6 + 29x. \] The y-intercept \(b = 1.6\) can be interpreted this way: On a day with no rainfall, there will be 1.6 gal. of water/min. flowing in the stream at that bridge crossing.
The slope \(m = 29\) tells us that if it rained one inch that day the flow in the stream would increase by an additional 29 gal./min. If it rained 2 inches that day, the flow would increase by an additional 58 gal./min.
Prediction: What would be the average stream flow if it rained 0.45 inches that day?
\(\hat{y} = 1.6 + 29x = 1.6 + 29(0.45) =14.65\) gal./min.
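The prediction in Example 6 can be written as a one-line function. This is only a sketch: the function name `predict_flow` and its parameter names are our own, while the intercept 1.6 and slope 29 come from the fitted equation above.

```python
def predict_flow(rainfall_inches, intercept=1.6, slope=29.0):
    """Predicted stream flow (gal./min.) from daily rainfall (inches),
    using the fitted line y-hat = 1.6 + 29x from Example 6."""
    return intercept + slope * rainfall_inches


# Prediction for 0.45 inches of rain, rounded to two decimals.
print(round(predict_flow(0.45), 2))

# The slope is the change in predicted flow per additional inch of rain.
print(round(predict_flow(1.0) - predict_flow(0.0), 2))
```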
The structure of the data set for the regression is the same as the one used in calculating the correlation coefficient (see the Weight and Height data set). In fact, we can use the sums of squares introduced above to estimate the regression coefficients as follows. \[ m = \frac{SS_{xy}}{SS_{xx}} \ \ \ \ \text{and} \ \ \ \ b = \bar{y} - m \bar{x} \]
With the above explicit expression of the regression coefficient, we can estimate the intercept and slope from given data sets.
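As an illustration of these formulas, here is a minimal Python sketch that estimates the slope and intercept from paired data. The function name `fit_line` and the data values are hypothetical and chosen only to show the steps; they do not come from any table in this note.

```python
def fit_line(x, y):
    """Least-squares estimates (b, m) for the line y = b + m x,
    using m = SS_xy / SS_xx and b = y_bar - m * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = ss_xy / ss_xx
    b = y_bar - m * x_bar
    return b, m


# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.2, 6.8, 9.1, 11.0]

b_hat, m_hat = fit_line(x, y)
print("intercept:", round(b_hat, 3), "slope:", round(m_hat, 3))

# Use the fitted line to predict y at a new x value.
x_new = 6.0
print("prediction at x = 6:", round(b_hat + m_hat * x_new, 3))
```

Note that the fitted line always passes through the point of means, since \(b = \bar{y} - m\bar{x}\).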
Example 7. Determining If There Is a Relationship: Is there a relationship between the alcohol content and the number of calories in a 12-ounce beer? To determine if there is one, a random sample of beers was taken, and the alcohol content and calories of each are given in the following table.
Solution: The objective of least-squares regression is to find the intercept \(b\) and the slope \(m\) that uniquely determine the regression line based on the data set, and then use the fitted regression equation to answer the questions.
We use the following table to calculate the sums of squares that are used to estimate the regression coefficients.
Based on the sums of squares in the above table and the formulas for the regression coefficients, we have \[ \hat{m} = \frac{SS_{xy}}{SS_{xx}} = \frac{327.667}{12.45} \approx 26.3 \] and, using the unrounded slope 26.319, \[ \hat{b} = \bar{y} - \hat{m}\bar{x} = 170.222 - 26.319 \times 5.51667 \approx 25.0. \] Therefore, the estimated (also called fitted) regression line is given by \[ \hat{y} = 25 + 26.3 x. \] This fitted line indicates that if the alcohol content increases by 1 unit, the predicted number of calories increases by about 26.3 units. The regression equation can also be used as a prediction model when a new alcohol content value is given.
The coefficient of determination assesses the goodness of the regression line by measuring the amount of variation in the response captured by the regression model. To develop a formula to calculate the coefficient of determination, we need the following sum of squares of errors depicted in the following figure.
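where
\(SST = \sum_{i=1}^n(y_i-\bar{y})^2\) is the total variation in the response,
\(SSR = \sum_{i=1}^n(\hat{y}_i-\bar{y})^2\) is the variation explained by the regression line, and
\(SSE = \sum_{i=1}^n(y_i-\hat{y}_i)^2\) is the unexplained variation (error).
Note that Total Variation (SST) = Explained Variation (SSR) + Unexplained Variation (SSE).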
Therefore, we have the following definition of the coefficient of determination \[ R^2 = \frac{Variation \ \ Explained}{Total \ \ Variation} = \frac{SSR}{SST} \]
Interpretation of \(R^2\): The percentage of total variation (in the response) explained by the regression.
Relationship between the correlation coefficient (\(r\)) and the coefficient of determination (\(R^2\)): \(R^2 = r^2\)
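Both the decomposition of the total variation and the identity \(R^2 = r^2\) can be verified numerically. The following is a rough sketch with hypothetical data, where SST, SSR, and SSE are taken to be the usual sums of squared deviations (observed from the mean, fitted from the mean, and observed from fitted, respectively). The function name `variation_decomposition` is our own.

```python
from math import sqrt


def variation_decomposition(x, y):
    """Fit y = b + m x by least squares and return (SST, SSR, SSE, r)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = ss_xy / ss_xx
    b = y_bar - m * x_bar
    y_hat = [b + m * xi for xi in x]                          # fitted values
    sst = sum((yi - y_bar) ** 2 for yi in y)                  # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)              # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))     # unexplained variation
    r = ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))
    return sst, ssr, sse, r


# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.2, 6.8, 9.1, 11.0]

sst, ssr, sse, r = variation_decomposition(x, y)
r_squared = ssr / sst

print("SST =", round(sst, 4), " SSR + SSE =", round(ssr + sse, 4))
print("R^2 =", round(r_squared, 4), " r^2 =", round(r ** 2, 4))
```

The two printed pairs should agree up to rounding, which is what the two relationships above state.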
Two major applications of regression models are association analysis and predictive analysis. The inference will focus on these two applications. Although the following discussions are valid for more general regression models, we restrict our discussion to simple linear regression: \(y = b + mx\).
Association Analysis: The goal of association analysis is to assess the relationship between the two numeric variables through slope coefficient(s). As usual, confidence intervals and testing hypotheses are inferential tools for analyzing regression coefficients.
Two StatsApps were created for studying the linear relationship between two numeric variables.
This simulation demonstrates the correlation between \(x\) and \(y\) through simulated data sets. The app is at (https://chpeng.shinyapps.io/LSE-Reg/). The following is the screenshot of the simulator. You can click the arrows under the slider bar to automatically select different intercepts and slopes as well as the random y-values.
You can watch the animation in the video.
(https://github.com/pengdsci/MAT121/raw/main/notes/video/MAT121-corRegDemo.mp4)
This app analyzes user input data. You can click this link https://chengpeng.shinyapps.io/correlation-reg/ to use it. The following is a screenshot of the app.
When an anthropologist finds skeletal remains, they need to
figure out the height of the person. The height of a person (in cm) and
the length of their metacarpal bone (in cm) were collected and are in
the following table.
The World Bank collected data on the percentage of GDP that a
country spends on health expenditures (“Health expenditure,” 2013) and
also the percentage of women receiving prenatal care (“Pregnant woman
receiving,” 2013). The part of the data for the countries where this
information is available for the year 2011 is in the following table.
(1). Create a scatter plot of the data and find a regression equation between the percentage spent on health expenditure and the percentage of women receiving prenatal care.
(2). Use the regression equation to find the percent of women receiving prenatal care for a country that spends 5.0% of GDP on health expenditure and for a country that spends 12.0% of GDP.
(3). Which prenatal care percentage that you calculated do you think is closer to the true percentage? Why?
(1). Create a scatter plot and find a regression equation between the number of calories and the amount of sodium.
(2). Use the regression equation to find the amount of sodium a beef hotdog has if it is 170 calories and if it is 120 calories. Which sodium level that you calculated do you think is closer to the true sodium level? Why?