Surfstat.australia: an online text in introductory Statistics

STATISTICAL INFERENCE

INFERENCE FOR COUNT DATA

Inference for Contingency Tables

Example - Vitamin C on colds

Does vitamin C reduce the incidence of colds? To test this, suppose we asked 100 people two questions: whether they take vitamin C tablets, and whether they had a cold last year.

The results were:

                     Vitamin C
Had a cold?      Yes     No     Total
No                35      5        40
Yes               35     25        60
Total             70     30       100

Do these data suggest that vitamin C prevents colds?

Calculate the expected frequencies by assuming that vitamin C is not related to colds (i.e. that there is no relationship between the two variables).

If 40% of the population had no colds and 60% had a cold (estimated from the last column of the table) then of the 70 who took vitamin C you'd expect

70 × 40/100 = 28 to have no colds

and 70 × 60/100 = 42 to have had a cold.

Similarly among the 30 who did not take vitamin C you'd expect

30 × 40/100 = 12 to have no colds, and

30 × 60/100 = 18 to have had a cold.
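As a minimal sketch of this arithmetic (Python is used purely for illustration, and the variable names are assumptions, not part of the text), each expected frequency is a column total multiplied by the matching row proportion:

```python
# Totals taken from the vitamin C table.
row_totals = {"no cold": 40, "had a cold": 60}        # last column of the table
col_totals = {"vitamin C": 70, "no vitamin C": 30}    # bottom row of the table
n = 100                                               # overall total

# Under "no relationship", every column splits in the same 40% / 60% proportions.
expected = {
    (cold, vit): col_total * row_total / n
    for vit, col_total in col_totals.items()
    for cold, row_total in row_totals.items()
}

for cell, value in expected.items():
    print(cell, value)
```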

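You could compare these expected frequencies (shown in brackets) with the observed frequencies:

                     Vitamin C
Had a cold?      Yes        No        Total
No             35 (28)    5 (12)        40
Yes            35 (42)   25 (18)        60
Total             70        30         100

using the chi-squared statistic, where o is an observed frequency and e is the corresponding expected frequency:

$\chi^2 = \sum \frac{(o - e)^2}{e} = \frac{(35-28)^2}{28} + \frac{(5-12)^2}{12} + \frac{(35-42)^2}{42} + \frac{(25-18)^2}{18} = 9.72$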
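The value 9.72 can be checked by summing (observed - expected)² / expected over the four cells. The following Python sketch is illustrative only; the list layout is an assumption made for this example.

```python
# Observed and expected frequencies, row by row (no cold, had a cold),
# with columns in the order vitamin C: yes, no.
observed = [[35, 5], [35, 25]]
expected = [[28, 12], [42, 18]]

chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(round(chi_squared, 2))  # 9.72
```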

The procedure used for this example was:

  1. Take as the null hypothesis that taking vitamin C and having colds are statistically independent events
    H0 : vitamin C and colds are independent
  2. Estimate the probability pv of taking vitamin C (and hence qv = 1 - pv) and the probability pc of having no colds (and hence qc = 1 - pc). Here pv = 70/100 = 0.7 and pc = 40/100 = 0.4.
  3. Use H0 to calculate probabilities for each cell in the table, e.g. for vitamin C and no cold, prob = pv × pc because the events are independent

    = 70/100 × 40/100 = 0.28
  4. Calculate expected frequencies by multiplying these probabilities by total frequency n = 100,

    e.g. for vitamin C and no cold, e = 100 × 0.28 = 28
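The four steps of the procedure can be sketched in Python as follows. This is an illustration under stated assumptions: the variable names are invented here, and exact fractions are used only so that the products come out without rounding error.

```python
from fractions import Fraction

n = 100

# Step 2: marginal estimates from the table totals.
p_v = Fraction(70, n)   # estimated probability of taking vitamin C
q_v = 1 - p_v
p_c = Fraction(40, n)   # estimated probability of having no colds
q_c = 1 - p_c

# Step 3: under H0 (independence) each cell probability is a product.
cell_probs = {
    ("vitamin C", "no cold"): p_v * p_c,
    ("vitamin C", "cold"): p_v * q_c,
    ("no vitamin C", "no cold"): q_v * p_c,
    ("no vitamin C", "cold"): q_v * q_c,
}
total_prob = sum(cell_probs.values())  # should be exactly 1

# Step 4: expected frequency = n x probability.
expected = {cell: n * prob for cell, prob in cell_probs.items()}

for cell in cell_probs:
    print(cell, float(cell_probs[cell]), int(expected[cell]))
```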

For this example pv and pc were estimated from the data (but qv and qc were calculated from these) so

degrees of freedom = # categories - # parameters estimated - 1

= 4 - 2 - 1 = 1.

Hence the p-value is P($\chi^2_1$ > 9.72).

From the first row of Table 3, P($\chi^2_1$ > 9.72) is between 0.0025 and 0.001, so the p-value is less than 0.0025 (p < 0.0025).

The observed frequencies differ significantly from the frequencies expected under the hypothesis that vitamin C and colds are independent. So we reject this hypothesis and conclude there is some association.
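Instead of a table lookup, the upper tail probability for 1 degree of freedom can be computed directly, because a chi-squared variable with 1 degree of freedom is the square of a standard normal variable. The sketch below uses only Python's standard library; the function name is hypothetical.

```python
import math

def chi2_upper_tail_df1(x):
    """P(chi-squared with 1 df > x), computed as P(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))

p_value = chi2_upper_tail_df1(9.72)
print(p_value)  # lies between 0.001 and 0.0025
```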

General case:

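Two categorical variables, A with categories A1, ..., Ar (rows) and B with categories B1, ..., Bc (columns), are cross classified. Let $o_{ij}$ be the observed frequency for cell Ai Bj, $R_i$ the total for row Ai, $C_j$ the total for column Bj, and n the overall total.

Estimate probabilities for A1, ..., Ar by $\hat{p}_{A_1} = R_1/n$, ..., $\hat{p}_{A_r} = R_r/n$

(this involves estimating (r-1) independent parameters since their total must be 1 because they are probabilities).

Similarly estimate probabilities for B1, ..., Bc by $\hat{p}_{B_1} = C_1/n$, ..., $\hat{p}_{B_c} = C_c/n$

(this involves (c-1) independent estimates).

Null hypothesis H0: A and B are independent

Alternative hypothesis H1 : A and B are not independent

If H0 is true, probability for cell Ai Bj is

$p_{ij} = \hat{p}_{A_i} \times \hat{p}_{B_j}$ for i = 1, ..., r ; j = 1, ..., c

So the expected frequency for cell Ai Bj is n × pij , i.e.

$e_{ij} = n \times \frac{R_i}{n} \times \frac{C_j}{n} = \frac{R_i \times C_j}{n}$

= (row total × column total) / overall total

Calculate all the expected frequencies using this formula.

Then calculate

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$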

Degrees of freedom = # cells - # estimates - 1

= (r×c) - [ (r-1) + (c-1) ] - 1

= rc - r - c + 1 = (r - 1)(c - 1).

Degrees of freedom = (r-1)(c-1)
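The general case can be sketched as one small Python function. This is an illustrative implementation, not a library routine: the function name is hypothetical, the table is assumed to be a list of rows of counts, and every expected frequency is assumed to be greater than zero.

```python
def chi_squared_independence(observed):
    """Return (expected, chi_squared, df) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected frequency = (row total x column total) / overall total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    chi_squared = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi_squared, df

# Usage: the vitamin C table from the first example.
expected, chi_squared, df = chi_squared_independence([[35, 5], [35, 25]])
print(expected, round(chi_squared, 2), df)
```

Working from the row and column totals, rather than from estimated probabilities, avoids the intermediate division by n twice and matches the simplified formula for the expected frequencies.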

Example - A homogeneous group of people

Another study is conducted using a homogeneous group of people, estimating the average amount of vitamin C they get from all sources (fruit, etc.) and then following them for a year and seeing how many colds they get.

Suppose the results were

                     Vitamin C (amount g, in grams)
Colds          0 ≤ g < 1    1 ≤ g < 2    g ≥ 2    Total
none               64           62         24       150
at least 1         36            6          8        50
Total             100           68         32       200

Do these data suggest that vitamin C protects against colds?

Null hypothesis H0: colds and vitamin C are independent.

Assuming H0 is correct, the expected frequencies are calculated as follows, e.g. for no colds and 0 ≤ g < 1,

e11 = (150 × 100) / 200 = 75

and similarly for all other cells.

Expected frequencies are often shown in the table (e.g. in brackets)

                     Vitamin C (amount g, in grams)
Colds          0 ≤ g < 1    1 ≤ g < 2    g ≥ 2    Total
none            64 (75)      62 (51)    24 (24)     150
at least 1      36 (25)       6 (17)     8 (8)       50
Total             100           68         32       200

Notice the row and column totals are the same for observed and expected frequencies (this is a good check on your arithmetic)
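This arithmetic check is easy to automate. The Python sketch below is illustrative; the helper name `margins` is an assumption introduced for this example.

```python
import math

observed = [[64, 62, 24], [36, 6, 8]]
expected = [[75, 51, 24], [25, 17, 8]]

def margins(table):
    """Row totals and column totals of a table given as a list of rows."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]

obs_rows, obs_cols = margins(observed)
exp_rows, exp_cols = margins(expected)

# isclose is used because expected frequencies are not whole numbers in general.
rows_match = all(math.isclose(a, b) for a, b in zip(obs_rows, exp_rows))
cols_match = all(math.isclose(a, b) for a, b in zip(obs_cols, exp_cols))
print(rows_match, cols_match)  # True True
```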

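$\chi^2 = \frac{(64-75)^2}{75} + \frac{(62-51)^2}{51} + \frac{(24-24)^2}{24} + \frac{(36-25)^2}{25} + \frac{(6-17)^2}{17} + \frac{(8-8)^2}{8} = 15.9$
degrees of freedom = (2 - 1) × (3 - 1) = 2
p-value = P($\chi^2_2$ > 15.9)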

From Table 3, second row, this is less than 0.0005. Since the p-value is small, you would reject the null hypothesis of independence and conclude there is some association between amount of vitamin C and incidence of colds.
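For 2 degrees of freedom the upper tail probability has a simple closed form, exp(-x/2), so the table lookup can be checked directly. The function name in this Python sketch is hypothetical.

```python
import math

def chi2_upper_tail_df2(x):
    """P(chi-squared with 2 df > x); with 2 df the distribution is exponential with mean 2."""
    return math.exp(-x / 2)

p_value = chi2_upper_tail_df2(15.9)
print(p_value < 0.0005)  # True
```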

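Notice, however, that the pattern of differences between observed and expected frequencies is not consistent with a dose-response relationship: the middle group (1 ≤ g < 2) had far fewer people with colds than expected (6 against 17), but the highest group (g ≥ 2) had exactly the expected number (8).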
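One way to see the lack of a dose-response pattern is to compare the proportion of people with at least one cold in each dose group. The Python sketch below uses only the counts from the table; the variable names are assumptions.

```python
# Counts from the table: people with at least one cold, and group sizes,
# for the dose groups 0 <= g < 1, 1 <= g < 2 and g >= 2.
groups = ["0 <= g < 1", "1 <= g < 2", "g >= 2"]
with_colds = [36, 6, 8]
group_sizes = [100, 68, 32]

rates = [colds / size for colds, size in zip(with_colds, group_sizes)]
for group, rate in zip(groups, rates):
    print(f"{group}: {rate:.2f}")

# A dose-response pattern would show the rate falling steadily as the dose rises.
steadily_falling = all(a > b for a, b in zip(rates, rates[1:]))
print(steadily_falling)  # False
```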

MINITAB commands for chi-squared analysis

CHISQUARED C1 C2 C3

where frequencies are stored in columns C1, C2 and C3.

MTB > chis c1 c2 c3

Expected counts are printed below observed counts

          0<=g<1   1<=g<2     g>=2    Total
    1         64       62       24      150
           75.00    51.00    24.00

    2         36        6        8       50
           25.00    17.00     8.00

Total        100       68       32      200

ChiSq = 1.613 + 2.373 + 0.000 + 4.840 + 7.118 + 0.000 = 15.944
df = 2
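The six terms in the ChiSq line are the per-cell contributions (observed - expected)² / expected, listed row by row. As a cross-check outside MINITAB, they can be reproduced with a short Python sketch (illustrative only):

```python
observed = [[64, 62, 24], [36, 6, 8]]
expected = [[75.0, 51.0, 24.0], [25.0, 17.0, 8.0]]

# One contribution per cell, in the same row-by-row order as the MINITAB output.
contributions = [
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]
total = sum(contributions)

print(" + ".join(f"{c:.3f}" for c in contributions), "=", f"{total:.3f}")
```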

If the data are stored as category codes you can create frequencies in a cross-classified table using, e.g.

TABLE C1 BY C2

To do a chi-squared analysis for this table use the subcommand

CHISQUARE 2.

This asks MINITAB to print the observed and expected frequencies for each cell as well as doing the chi-squared calculation.
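Outside MINITAB, the same tabulation of category codes can be sketched in Python. The two code lists below are hypothetical raw data, constructed here so that they reproduce the vitamin C table from the first example; the coding scheme (1 and 2) is also an assumption.

```python
from collections import Counter

# Hypothetical raw data: one pair of codes per person.
# Cold code: 1 = no cold, 2 = had a cold. Vitamin C code: 1 = yes, 2 = no.
pairs = [(1, 1)] * 35 + [(1, 2)] * 5 + [(2, 1)] * 35 + [(2, 2)] * 25
c1 = [cold for cold, _ in pairs]   # plays the role of column C1
c2 = [vit for _, vit in pairs]     # plays the role of column C2

# Count how many people fall in each combination of codes.
counts = Counter(zip(c1, c2))
table = [[counts[(i, j)] for j in (1, 2)] for i in (1, 2)]
print(table)  # [[35, 5], [35, 25]]
```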

