Correlation can be used to summarise the amount of linear association between two continuous variables x and y.
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ denote the data points.
A scatter plot gives a "cloud" of points:
If there is a strong linear association between the two variables, then the points lie nearly in a straight line, like this:
A positive association between the x and y variables (i.e. an increase in x is accompanied by an increase in y) is shown by a scatter plot whose points slope upwards from left to right. Similarly, a negative association (i.e. an increase in x is accompanied by a decrease in y) is shown by points that slope downwards.
If the points are nearly in a straight line then knowing the value of one variable helps you to predict the value of the other.
If there is little or no association, the "cloud" is more spread out and information about one variable doesn't tell you much about the other.
Warning: Correlation measures only the strength of linear association. Variables can be strongly associated in a nonlinear (curved) way yet still have zero correlation. If two variables are independent (have no association of any sort) then their correlation is zero, but the converse does not hold: zero correlation means only that there is no linear relationship, not that the variables are independent.
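The warning can be checked numerically. The sketch below uses made-up data in which y is completely determined by x (y = x squared) yet the correlation comes out as zero. The helper name `pearson_r` and the data are illustrative assumptions, not part of the original page; the helper uses the usual product-moment formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y depends exactly on x, but along a curve, not a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(r_curved)  # zero, even though y is a function of x
```

The positive and negative products of deviations cancel exactly because the x values are symmetric about their mean, so a perfect curved relationship goes undetected.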
Compare these two shapes of data:
These "clouds" have the same centre, defined by $(\bar{x}, \bar{y})$, and the same standard deviations $s_x$ and $s_y$ for the marginal distributions of x and y.
But (A) is tightly clustered and (B) is loosely clustered.
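As a rough illustration with invented numbers (not the data behind the original figures), the sketch below builds two small data sets that share the same x values and the same set of y values, so their means and standard deviations agree, but pair the values differently. The helper `pearson_r` is a hypothetical name for the usual product-moment calculation.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys_a = [1, 2, 3, 4, 5]  # (A): points lie exactly on a line
ys_b = [2, 1, 4, 3, 5]  # (B): same y values, paired more loosely with x

for label, ys in (("A", ys_a), ("B", ys_b)):
    # Mean and standard deviation of y are identical for A and B;
    # only the correlation differs.
    print(label, statistics.mean(ys), statistics.stdev(ys), pearson_r(xs, ys))
```

Because (B) is just a re-pairing of the values in (A), the marginal summaries cannot tell the two apart; only a measure of joint behaviour does.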
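The degree of clustering, i.e. the strength of linear association, is summarised by the correlation coefficient, defined as

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Equivalently, when $s_x$ and $s_y$ are computed with divisor $n - 1$, $r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y}$.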
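Divide the plot into four quadrants by drawing axes through the centre $(\bar{x}, \bar{y})$. If points in the positive quadrants are dominant (i.e. those for which $(x_i - \bar{x})(y_i - \bar{y}) > 0$) then $\sum (x_i - \bar{x})(y_i - \bar{y}) > 0$, so $r > 0$ and there is positive association between x and y, i.e. they increase together. If instead points in the negative quadrants predominate (i.e. those with $(x_i - \bar{x})(y_i - \bar{y}) < 0$) then $r < 0$ and there is negative association, i.e. y tends to decrease as x increases.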
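The sign argument can be traced on a small example. The sketch below assumes the usual product-moment form of r (sum of products of deviations divided by the square root of the product of the two sums of squared deviations) and uses invented data; the function name `deviation_products` is hypothetical.

```python
import math

def deviation_products(xs, ys):
    """Products (x_i - x_bar) * (y_i - y_bar), one per data point."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]

def r_from_definition(xs, ys):
    """Correlation coefficient computed directly from deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum(deviation_products(xs, ys)) / math.sqrt(s_xx * s_yy)

# Hypothetical data with centre (6, 3).
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 4.0, 2.0, 3.0, 5.0]

products = deviation_products(xs, ys)
print(products)       # one negative product, outweighed by the positive ones
print(sum(products))  # positive sum, so r > 0
r_pos = r_from_definition(xs, ys)

# Flipping the sign of y moves every point to the opposite kind of quadrant.
ys_flipped = [-y for y in ys]
r_neg = r_from_definition(xs, ys_flipped)
print(r_pos, r_neg)
```

Only the numerator can change sign, since the denominator is a square root of sums of squares; that is why the balance between quadrants decides the sign of r.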
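It can be shown that

$$ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} $$

Similarly

$$ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 $$

and likewise for y. These versions are often easier to use because the intermediate calculations $(x_i - \bar{x})$ and $(y_i - \bar{y})$ are not needed.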
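A quick numerical check of the shortcut versions, assuming they are the standard identities: the sum of products of deviations equals the sum of x_i * y_i minus n * x_bar * y_bar, and the sum of squared deviations equals the sum of x_i squared minus n * x_bar squared. The data and variable names are invented for illustration.

```python
import math

# Hypothetical data.
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 4.0, 2.0, 3.0, 5.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Long way: form each deviation first.
s_xy_long = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx_long = sum((x - x_bar) ** 2 for x in xs)
s_yy_long = sum((y - y_bar) ** 2 for y in ys)

# Shortcut: only raw sums of products and squares are needed.
s_xy_short = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
s_xx_short = sum(x * x for x in xs) - n * x_bar ** 2
s_yy_short = sum(y * y for y in ys) - n * y_bar ** 2

r_long = s_xy_long / math.sqrt(s_xx_long * s_yy_long)
r_short = s_xy_short / math.sqrt(s_xx_short * s_yy_short)
print(r_long, r_short)  # the two routes agree
```

The shortcut is convenient for hand calculation, but with floating-point data whose mean is large relative to its spread it can lose precision through cancellation, so the deviation form is generally the safer choice in code.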