Surfstat.australia: an online text in introductory Statistics

STATISTICAL INFERENCE

MORE ON CORRELATION AND REGRESSION

Use of Correlation and Regression

Data for one group or sample consist of two continuous measurements on each subject (or object)

Data : (x1,y1), (x2,y2),..., (xn,yn).

Plot y against x.

Correlation and Correlation coefficient

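Correlation measures the extent of linear association between x and y. The "correlation coefficient" is given by:

$$ r = \frac{s_{xy}}{s_x s_y} $$

where

$$ s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2} $$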

In calculating a correlation, x and y are treated interchangeably. Both measurements are assumed to be subject to random variation.

Example - 20 female students in STAT101

The data below are the heights (cm) and weights (kg) of 20 female students taking STAT101.

 ROW   fht   fwt

   1   167    60
   2   164    65
   3   170    64
   4   163    47
   5   152    46
   6   160    57
   7   170    57
   8   160    55
   9   157    55
  10   170    65
  11   150    50
  12   156    46
  13   168    60
  14   159    55
  15   160    50
  16   172    69
  17   175    56
  18   169    56
  19   169    72
  20   156    56

MTB> gplot c2 c1
MTB> correl c2 c1

Correlation of fwt and fht = 0.673
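
The correlation reported above can be checked by hand. The following is a minimal sketch in Python (not part of the Minitab session), assuming the usual definition r = sxy / (sx sy) with divisor n - 1; the function name `correlation` is chosen here for illustration only.

```python
import math

# Heights (cm) and weights (kg) of the 20 students, copied from the table above.
fht = [167, 164, 170, 163, 152, 160, 170, 160, 157, 170,
       150, 156, 168, 159, 160, 172, 175, 169, 169, 156]
fwt = [60, 65, 64, 47, 46, 57, 57, 55, 55, 65,
       50, 46, 60, 55, 50, 69, 56, 56, 72, 56]


def correlation(x, y):
    """Sample correlation coefficient r = s_xy / (s_x * s_y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


r = correlation(fht, fwt)
print(f"Correlation of fwt and fht = {r:.3f}")
```

Swapping the two arguments gives the same value, which illustrates that x and y are treated interchangeably in a correlation.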

For regression, on the other hand, one variable y is regarded as an outcome of, or response to, the other variable x. The two variables are not interchangeable.

We find the equation of a straight line, y = a + bx, which summarises the relationship between them. Here y is the outcome, response or dependent variable (output), while x is the predictor, explanatory or independent variable (input).

Often the values of x are chosen by the investigator and are not random, whereas the y values are measurements which are treated as random variables taken from distributions that depend on x.

Example - Health expenditure growth in Australia 1983-4 to 1988-9

The following table shows the rate of growth (%) of health expenditure per person in Australia at constant 1984-5 prices.

(Source: Aust Inst. of Health, "Health Expenditure", Info.Bull. No 6, May 1991)

Year      1983-4    84-5    85-6    86-7    87-8    88-9
Growth       4.6     2.7     4.1     2.3     1.8     2.0

Denote the years by x = 1 for 1983-4, x = 2 for 1984-5 and so on.
MTB> plot c4 c3

         -
 growth  -     *
         -
         -
      4.0+                         *
         -
         -
         -
         -
      3.0+
         -               *
         -
         -
         -                                   *
      2.0+                                                       *
         -                                             *
         -
           ----+---------+---------+---------+---------+---------+--year
             1.0       2.0       3.0       4.0       5.0       6.0

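We fit a line $\hat{y} = \hat{a} + \hat{b}x$ using the method of least squares to calculate $\hat{a}$ and $\hat{b}$ so that the sum of squares of the vertical distances, $\sum (y_i - \hat{y}_i)^2$, is minimised.

These values are

$$ \hat{b} = \frac{s_{xy}}{s_x^2} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} $$

(sxy and sx are defined above)

$$ \hat{a} = \bar{y} - \hat{b}\,\bar{x} $$

You can calculate $\hat{a}$ and $\hat{b}$ and then predict the y value for any given x:

$$ \hat{y} = \hat{a} + \hat{b}x $$

$\hat{y}$ is called the fitted value.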
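
As an illustration, the least squares estimates for the health expenditure data can be computed directly. This is a minimal sketch in Python rather than Minitab; the helper name `least_squares` and the variable names are assumptions made for this example.

```python
# Growth (%) of health expenditure per person; x = 1 for 1983-4, ..., x = 6 for 1988-9.
year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]


def least_squares(x, y):
    """Return (a_hat, b_hat) for the fitted line y_hat = a_hat + b_hat * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # The (n - 1) divisors in s_xy and s_x^2 cancel, so plain sums are enough.
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum_xy / sum_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


a_hat, b_hat = least_squares(year, growth)

# Fitted value for each observed year.
fitted = [a_hat + b_hat * xi for xi in year]

print(f"intercept = {a_hat:.4f}, slope = {b_hat:.4f}")
print("fitted values:", [round(f, 3) for f in fitted])
```

The intercept and slope should agree, up to rounding, with the coefficients 4.6667 and -0.5000 in the Minitab regression output for this example.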

Analysis of Variance

Analysis of variance is a technique for measuring the importance of a regression, and for estimating how much unexplained variation in Y is left after allowing for X.

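The total (overall) variation among measured y values, ignoring the x's, is $\sum (y_i - \bar{y})^2$. The variation of y values from the line is less, and is given by $\sum (y_i - \hat{y}_i)^2$. The regression line is precisely the line that makes this second quantity as small as possible. It can be shown that

$$ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 $$

Total Variation = Explained Variation + Unexplained Variation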

Sometimes (e.g. by MINITAB) these values are shown in an Analysis of Variance (ANOVA) table:

Source of variation    Sum of Squares (SS)
Regression             $\sum (\hat{y}_i - \bar{y})^2$    <-- "explained variation"
Error                  $\sum (y_i - \hat{y}_i)^2$        <-- "unexplained variation"
Total                  $\sum (y_i - \bar{y})^2$
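
The three sums of squares in this table can be computed for the health expenditure data with a short script. This is a sketch in Python under the definitions above; the variable names (`ss_regression`, `ss_error`, `ss_total`) are chosen here for illustration.

```python
year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]
n = len(year)

x_bar = sum(year) / n
y_bar = sum(growth) / n

# Least squares slope and intercept.
b_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(year, growth))
         / sum((x - x_bar) ** 2 for x in year))
a_hat = y_bar - b_hat * x_bar
fitted = [a_hat + b_hat * x for x in year]

# Explained, unexplained and total variation.
ss_regression = sum((f - y_bar) ** 2 for f in fitted)
ss_error = sum((y - f) ** 2 for y, f in zip(growth, fitted))
ss_total = sum((y - y_bar) ** 2 for y in growth)

print(f"Regression SS = {ss_regression:.4f}")
print(f"Error SS      = {ss_error:.4f}")
print(f"Total SS      = {ss_total:.4f}")
```

Up to floating-point rounding, the regression and error sums of squares add up to the total.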

Coefficient of Determination

This is a measure of how well the line fits the data.

It can be shown that

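coefficient of determination = explained variation / total variation

$$ = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = r^2 $$

= square of the correlation coefficient.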

So $100r^2$ is the percentage of variation "explained" by the regression line.

If the points all lie exactly on the line, i.e. if $y_i = \hat{y}_i$ for every i, then the "unexplained" (or error) variation is zero, so the coefficient of determination is one.

If $\hat{y}_i = \bar{y}$ for every i, so that the slope of the line is zero (i.e. no regression effect), then the coefficient of determination is zero.

In general: 0 ≤ coefficient of determination ≤ 1.
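
The identity between the coefficient of determination and the squared correlation can be checked numerically on the health expenditure data. The following Python sketch computes both quantities separately; the names `r_squared` and `r` are chosen for this example.

```python
import math

year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]
n = len(year)

x_bar = sum(year) / n
y_bar = sum(growth) / n
sum_xx = sum((x - x_bar) ** 2 for x in year)
sum_yy = sum((y - y_bar) ** 2 for y in growth)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, growth))

# Coefficient of determination: explained variation / total variation.
b_hat = sum_xy / sum_xx
a_hat = y_bar - b_hat * x_bar
explained = sum((a_hat + b_hat * x - y_bar) ** 2 for x in year)
r_squared = explained / sum_yy

# Correlation coefficient (the n - 1 divisors cancel).
r = sum_xy / math.sqrt(sum_xx * sum_yy)

print(f"coefficient of determination = {r_squared:.4f}")
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}")
```

Here r is negative because growth falls over time, but its square still matches the proportion of variation explained.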

Example on health expenditure continued

MTB> brief 3
(note: specifies amount of output - see Minitab HELP or handbook)

MTB> regress c4 1 c3
(note: this regresses the dependent (y) variable, growth, on "1" predictor (x) variable, year)

The regression equation is
growth = 4.67 - 0.500 year
(note: the relationship between growth and year. On average, growth declined by 1/2% per year)

Predictor       Coef     Stdev   t-ratio       p
Constant      4.6667    0.7171      6.51   0.003
year         -0.5000    0.1841     -2.72   0.053

s = 0.7703     R-sq = 64.8%     R-sq(adj) = 56.0%
(note: s is the estimated standard deviation of the data about the regression line. 64.8% of variation in growth is explained by the fitted equation)

Analysis of Variance

SOURCE        DF        SS        MS       F       p
Regression     1    4.3750    4.3750    7.37   0.053
Error          4    2.3733    0.5933
Total          5    6.7483

(note: Fit is the value predicted by the regression equation)

Obs.   year   growth     Fit   Stdev Fit*   Residual   St.Resid*
   1   1.00    4.600   4.167        0.557      0.433        0.82
   2   2.00    2.700   3.667        0.419     -0.967       -1.49
   3   3.00    4.100   3.167        0.328      0.933        1.34
   4   4.00    2.300   2.667        0.328     -0.367       -0.53
   5   5.00    1.800   2.167        0.419     -0.367       -0.57
   6   6.00    2.000   1.667        0.557      0.333        0.63

* Not treated in this course.
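
Most of the summary figures in this output can be reproduced from the six data points. The sketch below uses Python and the standard textbook formulas for simple linear regression (standard errors, t-ratios, F and adjusted R-sq); these formulas are not derived in this text, so treat them as assumptions of the example. The p-values are omitted because they need t and F distribution tables.

```python
import math

year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]
n = len(year)

x_bar = sum(year) / n
y_bar = sum(growth) / n
sum_xx = sum((x - x_bar) ** 2 for x in year)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, growth))

# Coefficients ("Coef" column).
b_hat = sum_xy / sum_xx
a_hat = y_bar - b_hat * x_bar

# Sums of squares for the ANOVA table.
ss_total = sum((y - y_bar) ** 2 for y in growth)
ss_error = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(year, growth))
ss_regression = ss_total - ss_error

# Error mean square on n - 2 degrees of freedom, and s.
ms_error = ss_error / (n - 2)
s = math.sqrt(ms_error)

# Standard errors ("Stdev" column) and t-ratios.
se_b = s / math.sqrt(sum_xx)
se_a = s * math.sqrt(1 / n + x_bar ** 2 / sum_xx)
t_a = a_hat / se_a
t_b = b_hat / se_b

# F ratio (regression has 1 degree of freedom), R-sq and adjusted R-sq.
f_ratio = ss_regression / ms_error
r_sq = ss_regression / ss_total
r_sq_adj = 1 - ms_error / (ss_total / (n - 1))

print(f"Constant  {a_hat:8.4f} {se_a:8.4f} {t_a:7.2f}")
print(f"year      {b_hat:8.4f} {se_b:8.4f} {t_b:7.2f}")
print(f"s = {s:.4f}  R-sq = {100 * r_sq:.1f}%  R-sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"F = {f_ratio:.2f}")
```

With a single predictor the F ratio equals the square of the t-ratio for the slope, which is why both carry the same p-value in the output above.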

