Surfstat.australia: an online text in introductory Statistics

SUMMARISING AND PRESENTING DATA

PRESENTING DATA FOR TWO CONTINUOUS MEASUREMENTS

Correlation

Correlation can be used to summarise the amount of linear association between two continuous variables x and y.

Let (x1, y1), (x2, y2), ..., (xn, yn) denote the data points.
A scatter plot of these pairs shows them as a "cloud" of points.

If there is a strong linear association between the two variables, then the points lie nearly on a straight line.

Positive and Negative Association

A positive association between the x and y variables (i.e. an increase in x is accompanied by an increase in y) is shown by a cloud of points that slopes upward from left to right. Similarly, a negative association (i.e. an increase in x is accompanied by a decrease in y) is shown by a cloud of points that slopes downward.

If the points are nearly in a straight line then knowing the value of one variable helps you to predict the value of the other.

If there is little or no association, the "cloud" is more spread out and information about one variable doesn't tell you much about the other.

Linear Versus Nonlinear Association

Warning: Correlation measures only the strength of linear association. Variables can be strongly associated in a nonlinear (curved) way yet still have zero correlation. If two variables are independent (have no association of any sort) then their correlation is zero, but zero correlation does not necessarily mean that they are independent, only that they have no linear relationship.
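The warning can be checked numerically. The following is a minimal sketch in Python; the helper name `correlation` and the small data set are illustrative assumptions, not part of the original text. It computes r for a perfectly curved relationship (y = x squared, with x values symmetric about zero) and for a perfectly linear one.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r of paired data (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only: x is symmetric about zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs]     # y is completely determined by x, but not linearly
ys_linear = [2 * x + 1 for x in xs]  # exact straight-line relationship

r_curved = correlation(xs, ys_curved)
r_linear = correlation(xs, ys_linear)
print("curved:", r_curved)
print("linear:", r_linear)
```

In the curved case y depends entirely on x, yet the positive and negative products of deviations cancel, so the correlation carries no trace of the association.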

Correlation Coefficient

Compare two scatter plots, (A) and (B), showing differently shaped "clouds" of data.

These "clouds" have the same values for the centre, defined by $(\bar{x}, \bar{y})$, and the same standard deviations $s_x$ and $s_y$ for the marginal distributions of x and y.

But (A) is tightly clustered and (B) is loosely clustered.
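To make the comparison concrete, here is a rough sketch that builds two synthetic clouds with the same centre and the same standard deviations but different degrees of clustering. The data, the `noise` pattern and the helper names `correlation` and `make_cloud` are all assumptions made for illustration; they are not the data behind (A) and (B).

```python
import math
import statistics

def correlation(xs, ys):
    """Sample correlation coefficient r (standardised products, n - 1 divisor)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Illustrative x values centred on zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A pattern with mean zero that is uncorrelated with x,
# rescaled so that it has the same spread as x.
noise = [-1.0, 2.0, -2.0, 2.0, -1.0]
scale = math.sqrt(sum(v * v for v in x) / sum(v * v for v in noise))
noise = [scale * v for v in noise]

def make_cloud(a):
    """Mix x and noise so that y keeps the same mean and standard deviation."""
    b = math.sqrt(1.0 - a * a)
    return [a * xi + b * ei for xi, ei in zip(x, noise)]

y_tight = make_cloud(0.95)  # plays the role of a tightly clustered cloud
y_loose = make_cloud(0.30)  # plays the role of a loosely clustered cloud

for label, y in (("tight", y_tight), ("loose", y_loose)):
    print(label, statistics.mean(y), statistics.stdev(y), correlation(x, y))
```

Both clouds share their means and standard deviations, so those summaries alone cannot distinguish them; only the correlation differs.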

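The degree of clustering, i.e. the strength of linear association, is summarised by the correlation coefficient, defined as

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$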

If points in the positive quadrants are dominant (i.e. those for which $(x_i - \bar{x})(y_i - \bar{y}) > 0$) then

$$ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) > 0 $$

so r > 0 and there is positive association between x and y, i.e. they increase together. If instead points in the negative quadrants predominate (i.e. those with $(x_i - \bar{x})(y_i - \bar{y}) < 0$) then r < 0 and there is negative association, i.e. y tends to decrease as x increases.
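The quadrant argument can be sketched in code. This example assumes the usual sample definition of r, the sum of products of standardised deviations divided by n - 1; the function names and the five data points are made up for illustration.

```python
import statistics

def correlation_from_definition(xs, ys):
    """r as the sum of products of standardised deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

def quadrant_counts(xs, ys):
    """Count points whose deviation product is positive or negative."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    positive = sum(1 for p in products if p > 0)
    negative = sum(1 for p in products if p < 0)
    return positive, negative

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 1, 3, 5]    # mostly increases with x
ys_down = [4, 2, 5, 3, 1]  # mirror image: mostly decreases with x

r_up = correlation_from_definition(xs, ys_up)
r_down = correlation_from_definition(xs, ys_down)
print(quadrant_counts(xs, ys_up), r_up)
print(quadrant_counts(xs, ys_down), r_down)
```

Points lying exactly on one of the mean lines have a zero product and fall in neither group. The sign of r depends on the sum of the products rather than on the counts alone, so a few points far from the centre can outweigh many points near it.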

Example - Calculation of r for students' heights and weights

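It can be shown that

$$ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} $$

Similarly

$$ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} $$

and likewise for y.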

These versions are often easier to use because the intermediate calculations $x_i - \bar{x}$ and $y_i - \bar{y}$ are not needed.
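As a quick check, the sketch below computes r twice: once from the deviations and once from raw sums only. It assumes the common shortcut identities, namely that the sum of (x - x̄)(y - ȳ) equals the sum of xy minus (sum of x)(sum of y)/n, and the analogous identity for squares. The function names and data are illustrative and are not the students' heights and weights.

```python
import math

def r_from_deviations(xs, ys):
    """r computed via the intermediate deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_from_raw_sums(xs, ys):
    """r computed from raw sums only, with no deviations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sxy = sum_xy - sum_x * sum_y / n
    sxx = sum_xx - sum_x ** 2 / n
    syy = sum_yy - sum_y ** 2 / n
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 1, 3, 5]

r_dev = r_from_deviations(xs, ys)
r_raw = r_from_raw_sums(xs, ys)
print(r_dev, r_raw)
```

The raw-sums form suits hand or calculator work. In floating-point arithmetic it can lose precision when the values are large relative to their spread, so the deviation form is usually the safer choice in software.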


