A correlation is a relationship between two variables. The data can be represented by the ordered pairs $(x, y)$, where $x$ is the independent (or explanatory) variable and $y$ is the dependent (or response) variable.
The table shows the total assets (in billions of dollars) of individual retirement accounts (IRAs) and federal pension plans for nine years.
IRAs, $x$ | 2619 | 2533 | 2993 | 3299 | 3652 | 4207 | 4784 | 3585 | 4251 |
Federal pension plans, $y$ | 860 | 894 | 958 | 1023 | 1072 | 1141 | 1197 | 1221 | 1324 |
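As a quick check of the strength of this relationship, the correlation coefficient $r$ can be computed directly from the table with the usual computational formula (a minimal Python sketch using the nine data pairs above):

```python
from math import sqrt

# IRA assets (x) and federal pension plan assets (y), in billions of dollars
x = [2619, 2533, 2993, 3299, 3652, 4207, 4784, 3585, 4251]
y = [860, 894, 958, 1023, 1072, 1141, 1197, 1221, 1324]
n = len(x)

# Computational formula for the Pearson correlation coefficient
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)
r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print(round(r, 3))
```

The value of $r$ is positive and fairly strong, which matches the generally increasing pattern in both rows of the table.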
The total variation $\sum(y_i-\bar{y})^2$ is the sum of the squares of the vertical distances of each
point from the mean. The total variation can be divided into two parts: the part
attributed to the relationship between $x$ and $y$, and the part due to chance. The variation
obtained from the relationship (i.e., from the predicted $\hat{y_i}$ values) is $\sum(\hat{y_i}-\bar{y})^2$ and is
called the explained variation. When the relationship is strong, most of the variation can be explained by it.
The closer the value of $r$ is to $-1$ or $1$, the better the points fit the line and the closer
$\sum(\hat{y_i}-\bar{y})^2$ is to $\sum(y_i-\bar{y})^2$. In fact, if all points fall on the regression line, $\sum(\hat{y_i}-\bar{y})^2$ will
equal $\sum(y_i-\bar{y})^2$, since $\hat{y_i}$ is equal to $y_i$ in each case.
On the other hand, the variation due to chance, found by $\sum(y_i-\hat{y_i})^2$, is called the
unexplained variation. This variation cannot be attributed to the relationship. When
the unexplained variation is small, the value of $r$ is close to -1 or 1. If all points fall on
the regression line, the unexplained variation $\sum(y_i-\hat{y_i})^2$ will be 0. Hence, the total variation
is equal to the sum of the explained variation and the unexplained variation. That is,
$\sum(y_i-\bar{y})^2 = \sum(\hat{y_i}-\bar{y})^2 + \sum(y_i-\hat{y_i})^2$
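This decomposition can be verified numerically for the IRA data. The sketch below fits the least-squares line with the standard slope and intercept formulas and checks that total variation equals explained plus unexplained variation, up to floating-point rounding:

```python
x = [2619, 2533, 2993, 3299, 3652, 4207, 4784, 3585, 4251]
y = [860, 894, 958, 1023, 1072, 1141, 1197, 1221, 1324]
n = len(x)

# Least-squares slope m and intercept b
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
b = (sy - m * sx) / n

ybar = sy / n
yhat = [m * a + b for a in x]

total = sum((yi - ybar) ** 2 for yi in y)                      # total variation
explained = sum((yh - ybar) ** 2 for yh in yhat)               # explained variation
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # unexplained variation

# The identity holds exactly for a least-squares fit
print(abs(total - (explained + unexplained)) < 1e-6 * total)
```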
These values are shown in the figure above. For a single point, the differences are called deviations.
The values $y-\hat{y}$ are called residuals. A residual is the difference between
the actual value of y and the predicted value $\hat{y}$ for a given $x$ value. The mean of the residuals
is always zero. As stated previously, the regression line determined by the formulas
given in the textbook is the line that best fits the points of the scatter plot. The sum of the
squares of the residuals computed by using the regression line is the smallest possible
value. For this reason, a regression line is also called a least-squares line.
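Both claims — that the residuals of the least-squares line average to zero, and that any other line produces a larger sum of squared residuals — can be illustrated with a short sketch (the intercept shift of 5.0 is an arbitrary perturbation chosen for illustration):

```python
x = [2619, 2533, 2993, 3299, 3652, 4207, 4784, 3585, 4251]
y = [860, 894, 958, 1023, 1072, 1141, 1197, 1221, 1324]
n = len(x)

# Least-squares slope m and intercept b
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
b = (sy - m * sx) / n

residuals = [yi - (m * a + b) for a, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-6)   # the residuals sum (and average) to zero

# Nudging the intercept strictly increases the sum of squared residuals
sse = sum(e * e for e in residuals)
sse_shifted = sum((yi - (m * a + b + 5.0)) ** 2 for a, yi in zip(x, y))
print(sse_shifted > sse)
```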
The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable. The symbol for the coefficient of determination is $r^2$. That is, \[ r^2 = \frac{\text{explained variation}}{\text{total variation}} \]
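For the IRA data, $r^2$ computed by squaring the correlation coefficient agrees with the ratio of explained to total variation, as the following sketch confirms:

```python
from math import sqrt

x = [2619, 2533, 2993, 3299, 3652, 4207, 4784, 3585, 4251]
y = [860, 894, 958, 1023, 1072, 1141, 1197, 1221, 1324]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

# Correlation coefficient and least-squares line
r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
b = (sy - m * sx) / n

ybar = sy / n
yhat = [m * a + b for a in x]
explained = sum((yh - ybar) ** 2 for yh in yhat)
total = sum((yi - ybar) ** 2 for yi in y)

# r^2 equals explained variation divided by total variation
print(abs(r ** 2 - explained / total) < 1e-9)
```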
The coefficient of nondetermination is $1-r^2$, the proportion of the variation in $y$ that is not explained by the regression line.
The standard error of the estimate, denoted by $s_e$, is the standard deviation of the observed $y_i$ values about the $\hat{y}$ predicted values. The formula for the standard error of the estimate is \[ s_e = \sqrt{\frac{\sum(y_i-\hat{y_i})^2}{n-2}} \] where $n$ is the number of pairs of data.
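The formula translates directly into code; here is a sketch that computes $s_e$ for the IRA data from the residuals of the least-squares line:

```python
from math import sqrt

x = [2619, 2533, 2993, 3299, 3652, 4207, 4784, 3585, 4251]
y = [860, 894, 958, 1023, 1072, 1141, 1197, 1221, 1324]
n = len(x)

# Least-squares slope m and intercept b
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
b = (sy - m * sx) / n

# Standard error of the estimate: sqrt of unexplained variation over n - 2
unexplained = sum((yi - (m * a + b)) ** 2 for a, yi in zip(x, y))
s_e = sqrt(unexplained / (n - 2))
print(round(s_e, 1))
```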
Given a linear regression equation $\hat{y}=mx+b$ and $x_0$, a specific value of $x$, a $c$-prediction interval for $y$ is \[ \hat{y}-E < y < \hat{y}+E \] where \[ E=t_c\cdot s_e\cdot\sqrt{1+\frac{1}{n}+\frac{n(x_0-\bar{x})^2}{n\cdot\sum x^2-(\sum x)^2}} \] and the degrees of freedom used to find $t_c$ are $df=n-2$, with $n$ being equal to the number of pairs of data in the sample. The point estimate is $\hat{y}$ and the margin of error is $E$. The probability that the prediction interval contains $y$ is $c$ (the level of confidence), assuming that the estimation process is repeated a large number of times.
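Putting the pieces together, a prediction interval can be computed for a specific $x$ value. In the sketch below, $x_0 = 3500$ is an arbitrary illustrative choice, and $t_c = 2.365$ is the two-tailed critical value for $c = 0.95$ with $df = n - 2 = 7$, taken from a $t$ table:

```python
from math import sqrt

x = [2619, 2533, 2993, 3299, 3652, 4207, 4784, 3585, 4251]
y = [860, 894, 958, 1023, 1072, 1141, 1197, 1221, 1324]
n = len(x)

# Least-squares slope m and intercept b
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
b = (sy - m * sx) / n
xbar = sx / n

# Standard error of the estimate
s_e = sqrt(sum((yi - (m * a + b)) ** 2 for a, yi in zip(x, y)) / (n - 2))

# 95% prediction interval at a hypothetical x0; t_c = 2.365 for df = 7
x0, t_c = 3500, 2.365
y0 = m * x0 + b
E = t_c * s_e * sqrt(1 + 1 / n + n * (x0 - xbar) ** 2 / (n * sx2 - sx ** 2))
print(round(y0 - E, 1), round(y0 + E, 1))
```

The point estimate is `y0` and the margin of error is `E`, so the interval is `y0 - E < y < y0 + E`.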