Surfstat.australia: an online text in introductory Statistics

STATISTICAL INFERENCE

MORE ON CORRELATION AND REGRESSION

Statistical Inference for Linear Regression

We assume for any given x that the random variable Y has mean (expected value) on the population regression line E(Y) = a + bx, with observed values of Y being normally distributed about that line.

Assume the errors $e_i \sim N(0, \sigma^2)$, with $\sigma^2$ the same for all $e_i$. Assume the $e_i$ are independent.

The error terms

$$e_i = Y_i - a - b x_i$$

are estimated by

$$\hat{e}_i = y_i - \hat{a} - \hat{b} x_i$$

where $y_i$ and $x_i$ are the data values and $\hat{a}$ and $\hat{b}$ are the least squares estimates.

The $\hat{e}_i$ are called the residuals. It can be shown that the mean of the residuals is zero, i.e. $\frac{1}{n} \sum \hat{e}_i = 0$.
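One way to see why the residuals average to zero is to start from the least squares criterion. The short derivation below is only a sketch; the symbol $S(a, b)$ for the sum of squared deviations is introduced here for illustration and is not part of the original notation.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a}\bigg|_{(\hat{a}, \hat{b})}
  &= -2 \sum_{i=1}^{n} (y_i - \hat{a} - \hat{b} x_i) = 0 \\
\Rightarrow \quad \sum_{i=1}^{n} \hat{e}_i &= 0
\end{align*}
```

Setting the derivative with respect to the intercept to zero is what forces the residuals to sum to zero, so the property holds whenever the fitted line includes an intercept.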

The following two results are needed in order to make statistical inferences from regression lines.

  1. The estimate of $\sigma^2$ is $s^2$, the average of the terms $\hat{e}_i^2$:

    $$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{e}_i^2$$

    The denominator must be $(n-2)$ to give an unbiased estimator.

  2. The slope estimate satisfies

    $$\hat{b} \sim N\left(b, \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$$

    so if $\sigma^2$ is estimated by $s^2$, then the t-statistic

    $$t = \frac{\hat{b} - b}{s / \sqrt{\sum (x_i - \bar{x})^2}} \sim t_{n-2}$$

    can be used to make inferences about the slope b of the regression line. The inferences most commonly required are (i) tests of particular values of the parameters, in particular whether the slope could be zero, and (ii) confidence intervals for the parameters, especially the slope parameter b.
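The variance of the slope estimate in result 2 follows from writing the least squares slope as a weighted sum of the $Y_i$. The sketch below assumes the model stated above (independent errors with common variance $\sigma^2$) and uses the shorthand $S_{xx} = \sum (x_i - \bar{x})^2$, which is introduced here and does not appear in the original text.

```latex
\begin{align*}
\hat{b} &= \frac{\sum (x_i - \bar{x})(Y_i - \bar{Y})}{S_{xx}}
         = \frac{\sum (x_i - \bar{x})\, Y_i}{S_{xx}} \\
\operatorname{Var}(\hat{b}) &= \frac{\sum (x_i - \bar{x})^2 \, \sigma^2}{S_{xx}^2}
         = \frac{\sigma^2}{S_{xx}} \\
t &= \frac{\hat{b} - b}{s / \sqrt{S_{xx}}} \sim t_{n-2}
\end{align*}
```

The second equality in the first line uses $\sum (x_i - \bar{x}) = 0$, and replacing $\sigma$ by its estimate $s$ is what changes the reference distribution from normal to t with $n-2$ degrees of freedom.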

Example - Statistical model for predicting delivery time

A statistician working for a car manufacturer wishes to develop a statistical model for predicting delivery time (the number of days between ordering a car and actual delivery of the car) of a particular model for which there is a range of factory-fitted options. A random sample of 16 cars was selected giving the following results
car#   options   delivery time
  1       3          25
  2       4          32
  3       4          26
  4       7          38
  5       7          34
  6       8          41
  7       9          39
  8      11          46
  9      12          44
 10      12          51
 11      14          53
 12      16          58
 13      17          61
 14      20          64
 15      23          66
 16      25          70
  1. Find the least squares regression equation relating delivery time (y) to number of options (x) and test whether the regression is significant.
  2. Find a 95% confidence interval for the rate at which delivery time increases with the number of options.
  3. Estimate the average delivery time for all cars ordered with 18 options.
MTB> name c1 'options' c2 'delivery'
MTB> plot c2 c1

MTB> brief 3
MTB> regress c2 1 c1;
SUBC> predict 18;
SUBC> predict 10;
SUBC> residuals c3.

The regression equation is
delivery = 21.9 + 2.07 options

Predictor       Coef       Stdev    t-ratio        p
Constant      21.925       1.591      13.78    0.000
options       2.0687      0.1164      17.77    0.000

s = 3.045       R-sq = 95.8%     R-sq(adj) = 95.5%

Analysis of Variance

SOURCE       DF          SS          MS         F        p
Regression    1      2927.2      2927.2    315.80    0.000
Error        14       129.8         9.3
Total        15      3057.0

Obs. options  delivery       Fit Stdev.Fit  Residual   St.Resid
  1      3.0    25.000    28.132     1.295    -3.132     -1.14
  2      4.0    32.000    30.200     1.203     1.800     0.64 
  3      4.0    26.000    30.200     1.203    -4.200     -1.50
  4      7.0    38.000    36.406     0.958     1.594     0.55 
  5      7.0    34.000    36.406     0.958    -2.406     -0.83
  6      8.0    41.000    38.475     0.892     2.525     0.87 
  7      9.0    39.000    40.544     0.837    -1.544     -0.53
  8     11.0    46.000    44.681     0.770     1.319     0.45 
  9     12.0    44.000    46.750     0.761    -2.750     -0.93
 10     12.0    51.000    46.750     0.761     4.250     1.44 
 11     14.0    53.000    50.887     0.796     2.113     0.72 
 12     16.0    58.000    55.025     0.892     2.975     1.02 
 13     17.0    61.000    57.094     0.958     3.906     1.35 
 14     20.0    64.000    63.300     1.203     0.700     0.25 
 15     23.0    66.000    69.506     1.490    -3.506     -1.32
 16     25.0    70.000    73.643     1.694    -3.643     -1.44

     Fit  Stdev.Fit         95% C.I.         95% P.I.
  59.162      1.033   ( 56.946, 61.379)  ( 52.265, 66.060)

  42.613      0.796   ( 40.905, 44.320)  ( 35.861, 49.364)
PLOT 'DELIVERY' 'OPTIONS'       <- scatter plot
BRIEF 3                         <- print full output
REGRESS 'DELIVERY' 1 'OPTIONS'  <- regress the response variable DELIVERY
                                   (y) against one predictor OPTIONS (x)
PREDICT 18;                     <- predict DELIVERY for OPTIONS = 18
PREDICT 10;                     <- predict DELIVERY for OPTIONS = 10
RESIDUALS C3.                   <- store the residuals $\hat{e}_i = y_i - \hat{y}_i$
                                   in column C3

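(i) Regression equation is
          delivery  =   21.9 +  2.07   options

which has the form $\hat{y} = \hat{a} + \hat{b} x$, with y = delivery, x = options, intercept $\hat{a} = 21.9$ and slope $\hat{b} = 2.07$.

Interpretation - delivery time increases by about 2 days for each additional option ordered.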

- For the first data point $x_1 = 3$, $y_1 = 25$

Fitted value: $\hat{y}_1 = 21.925 + 2.0687 \times 3 = 28.13$

Residual: $\hat{e}_1 = 25 - 28.13 = -3.13$
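The fitted value and residual for the first car can be checked by redoing the least squares fit by hand. The following is a minimal sketch in plain Python, using the 16 data points from the example; the variable names (`s_xx`, `s_xy`, `a_hat`, `b_hat`) are illustrative choices and not part of the Minitab session.

```python
# Delivery-time data from the example: number of options and delivery time in days.
options = [3, 4, 4, 7, 7, 8, 9, 11, 12, 12, 14, 16, 17, 20, 23, 25]
delivery = [25, 32, 26, 38, 34, 41, 39, 46, 44, 51, 53, 58, 61, 64, 66, 70]

n = len(options)
x_bar = sum(options) / n
y_bar = sum(delivery) / n

# Least squares estimates: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar.
s_xx = sum((x - x_bar) ** 2 for x in options)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(options, delivery))
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

# Fitted values and residuals for every observation.
fitted = [a_hat + b_hat * x for x in options]
residuals = [y - f for y, f in zip(delivery, fitted)]

print(f"intercept = {a_hat:.3f}, slope = {b_hat:.4f}")
print(f"first fitted value = {fitted[0]:.2f}, first residual = {residuals[0]:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

The full list `residuals` corresponds to what the RESIDUALS subcommand stores in column C3, up to rounding.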

- To test the hypothesis b = 0 (i.e. delivery time is not related to number of options) use

$$t = \frac{\hat{b}}{\text{st.dev}(\hat{b})} = \frac{2.0687}{0.1164} = 17.77$$

p-value = $P(t_{14} < -17.77 \text{ or } t_{14} > 17.77) < 0.0001$ (see t-tables).

As the p-value is very small we would reject the null hypothesis that b=0, and conclude that delivery time was related to the number of options ordered.
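The t-ratio in the printout can be reproduced from the raw data. The sketch below recomputes the residual standard deviation and the standard error of the slope in plain Python. It does not compute the p-value itself; instead it compares the statistic with 2.145, the tabulated two-sided 5% critical value for 14 degrees of freedom, which is a simplification made here. The variable names are illustrative assumptions.

```python
import math

# Delivery-time data from the example: number of options and delivery time in days.
options = [3, 4, 4, 7, 7, 8, 9, 11, 12, 12, 14, 16, 17, 20, 23, 25]
delivery = [25, 32, 26, 38, 34, 41, 39, 46, 44, 51, 53, 58, 61, 64, 66, 70]

n = len(options)
x_bar = sum(options) / n
y_bar = sum(delivery) / n
s_xx = sum((x - x_bar) ** 2 for x in options)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(options, delivery))
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar

# Estimate of the error variance: residual sum of squares over n - 2.
sse = sum((y - a_hat - b_hat * x) ** 2 for x, y in zip(options, delivery))
s2 = sse / (n - 2)
s = math.sqrt(s2)

# Standard error of the slope and the t-statistic for testing b = 0.
se_b = s / math.sqrt(s_xx)
t_stat = b_hat / se_b

T_CRIT_14 = 2.145  # two-sided 5% critical value for t with 14 d.f., from t-tables
reject_null = abs(t_stat) > T_CRIT_14

print(f"s = {s:.3f}, st.dev of slope = {se_b:.4f}, t = {t_stat:.2f}")
print("reject b = 0 at the 5% level:", reject_null)
```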

(ii) To obtain a 95% confidence interval (C.I.) for b use $\hat{b} \pm t_{n-2} \times \text{st.dev}(\hat{b})$, where n = 16 so $t_{n-2} = t_{14}$, i.e. 2.0687 ± 2.145 × 0.1164, which gives (1.82, 2.32).

This is the 95% C.I. for the rate at which delivery time increases with the number of options.
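The interval arithmetic can be written as a small helper. This is a sketch that takes the estimate, its standard deviation and the tabulated t value directly from the printout and the t-tables; the function name `slope_confidence_interval` is hypothetical.

```python
def slope_confidence_interval(estimate, std_dev, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_dev."""
    margin = t_crit * std_dev
    return estimate - margin, estimate + margin


# Values from the Minitab printout and the t-tables (14 degrees of freedom).
lower, upper = slope_confidence_interval(estimate=2.0687, std_dev=0.1164, t_crit=2.145)
print(f"95% C.I. for the slope: ({lower:.2f}, {upper:.2f})")
```

Because the interval excludes zero, it agrees with the conclusion of the t-test that the slope is not zero.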

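- To interpret the Analysis of Variance table

source        DF    SS                                 MS
regression     1    $\sum (\hat{y}_i - \bar{y})^2$     SS / DF
error         14    $\sum (y_i - \hat{y}_i)^2$         SS / DF
total         15    $\sum (y_i - \bar{y})^2$

MS = mean square = SS / DF

Error MS = $\frac{\sum \hat{e}_i^2}{n-2} = s^2$ (see result 1 above)

From the printout s = 3.045 so $s^2$ = 9.272 = error MS.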

Coefficient of determination $R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = \frac{2927.2}{3057.0} = 0.958$

so 95.8% of variation in delivery times is 'explained' by the number of options ordered.
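The entries of the Analysis of Variance table can be rebuilt from the data. The sketch below, in plain Python with illustrative variable names, computes the three sums of squares, the mean squares, the F ratio and $R^2$, and checks that the regression and error sums of squares add up to the total.

```python
# Delivery-time data from the example: number of options and delivery time in days.
options = [3, 4, 4, 7, 7, 8, 9, 11, 12, 12, 14, 16, 17, 20, 23, 25]
delivery = [25, 32, 26, 38, 34, 41, 39, 46, 44, 51, 53, 58, 61, 64, 66, 70]

n = len(options)
x_bar = sum(options) / n
y_bar = sum(delivery) / n
s_xx = sum((x - x_bar) ** 2 for x in options)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(options, delivery))
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
fitted = [a_hat + b_hat * x for x in options]

# Sums of squares: total variation splits into regression plus error.
ss_total = sum((y - y_bar) ** 2 for y in delivery)
ss_regression = sum((f - y_bar) ** 2 for f in fitted)
ss_error = sum((y - f) ** 2 for y, f in zip(delivery, fitted))

# Mean squares are sums of squares divided by their degrees of freedom.
df_regression, df_error = 1, n - 2
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

f_stat = ms_regression / ms_error
r_squared = ss_regression / ss_total

print(f"Regression SS = {ss_regression:.1f}, Error SS = {ss_error:.1f}, "
      f"Total SS = {ss_total:.1f}")
print(f"Error MS = {ms_error:.2f}, F = {f_stat:.2f}, R-sq = {100 * r_squared:.1f}%")
```

With a single predictor the F ratio is the square of the slope's t-ratio, which is why both give the same p-value in the printout.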

$R^2$(adj) adjusts $R^2$ for the number of parameters estimated (via the d.f.'s), giving an approximately unbiased estimate of the square of the population correlation coefficient. The adjustment matters most when more than one predictor is used.
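As a worked check of the adjusted value in the printout, the usual degrees-of-freedom form of the adjustment can be applied to the Analysis of Variance entries. This is a sketch assuming the standard definition, in which each sum of squares is divided by its d.f. before the ratio is taken.

```latex
\begin{align*}
R^2(\text{adj}) &= 1 - \frac{\text{Error SS} / (n-2)}{\text{Total SS} / (n-1)} \\
                &= 1 - \frac{129.8 / 14}{3057.0 / 15} \\
                &= 1 - \frac{9.271}{203.8} \approx 0.955
\end{align*}
```

This agrees with the R-sq(adj) = 95.5% shown in the printout.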

(iii) Predictions, e.g. for x = 18 options the predicted (fitted) value for delivery time is

$\hat{y} = \hat{a} + \hat{b} x$

= 21.925 + 2.0687 × 18

= 59.162 days
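The printout also reports a standard deviation for this fitted value, together with a 95% C.I. for the mean delivery time and a 95% P.I. for a single car. The sketch below reproduces those figures in plain Python under the usual simple-regression formulas, $s\sqrt{1/n + (x_0 - \bar{x})^2 / \sum (x_i - \bar{x})^2}$ for the fit and the same expression with an extra 1 under the square root for a new observation. The function name `predict_with_intervals` and the use of the tabulated value 2.145 are assumptions made here.

```python
import math

# Delivery-time data from the example: number of options and delivery time in days.
options = [3, 4, 4, 7, 7, 8, 9, 11, 12, 12, 14, 16, 17, 20, 23, 25]
delivery = [25, 32, 26, 38, 34, 41, 39, 46, 44, 51, 53, 58, 61, 64, 66, 70]

T_CRIT_14 = 2.145  # two-sided 5% critical value for t with 14 d.f., from t-tables


def predict_with_intervals(x, y, x0, t_crit):
    """Fit y = a + b*x by least squares and return the fit at x0 with intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar

    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    fit = a_hat + b_hat * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_fit = s * math.sqrt(leverage)       # uncertainty in the mean response
    se_new = s * math.sqrt(1 + leverage)   # uncertainty for one new observation

    conf_int = (fit - t_crit * se_fit, fit + t_crit * se_fit)
    pred_int = (fit - t_crit * se_new, fit + t_crit * se_new)
    return fit, se_fit, conf_int, pred_int


fit, se_fit, conf_int, pred_int = predict_with_intervals(options, delivery, 18, T_CRIT_14)
print(f"fit = {fit:.3f}, st.dev of fit = {se_fit:.3f}")
print(f"95% C.I. = ({conf_int[0]:.2f}, {conf_int[1]:.2f})")
print(f"95% P.I. = ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

The C.I. is the relevant interval for the average delivery time of all cars ordered with 18 options; the wider P.I. applies to the delivery time of one individual car.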

