Surfstat.australia: an online text in introductory Statistics

STATISTICAL INFERENCE

INFERENCE FOR COUNT DATA

Sampling Distribution of a Sample Proportion

Suppose you are interested in a population in which every individual can be classified into one of two categories, e.g.
manufactured items: defective, OK
experimental animals: dead, alive
exam results: pass, fail
In general, we call these "success" and "failure". We want to arrive at conclusions about p, the proportion of successes in the population, using information from a sample.

Suppose you take a random sample of n individuals. Let the random variable X be the number of successes in the sample. A reasonable model would be to assume that X has the binomial distribution,   X ~ binomial (n, p)   where usually n, the sample size, is known but p is not.

The proportion of successes in the sample, X/n, provides an estimate for p (the population proportion). Write this as $\hat{p} = X/n$ (where $\hat{p}$ denotes "an estimate of p").

If X ~ binomial(n,p) then E(X)=np and var(X)=npq where q=1-p.

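Since $\hat{p} = X/n$ is a function of X, and n is known, we have

$$E(\hat{p}) = \frac{E(X)}{n} = \frac{np}{n} = p, \qquad \mathrm{var}(\hat{p}) = \frac{\mathrm{var}(X)}{n^2} = \frac{npq}{n^2} = \frac{pq}{n}$$

so the standard error of $\hat{p}$ is

$$SE(\hat{p}) = \sqrt{\frac{pq}{n}}$$

$SE(\hat{p})$ describes the variability of estimates derived from different samples from the same population.

If p and q are estimated by the sample values $\hat{p}$ and $\hat{q} = 1 - \hat{p}$, then approximately

$$SE(\hat{p}) \approx \sqrt{\frac{\hat{p}\hat{q}}{n}}$$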

The binomial distribution of X can be approximated by the Normal distribution N(np, npq), provided n>20, np>5 and nq>5.

Hence the sampling distribution of $\hat{p}$ is approximately Normal: N(p, pq/n),
and $Z = \dfrac{\hat{p} - p}{\sqrt{pq/n}}$ is approximately N(0,1). This gives us a basis for making inferences about p: constructing confidence intervals and performing hypothesis tests.
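The claim that the sample proportion is centred on p with spread $\sqrt{pq/n}$ can be checked by simulation. The following is a minimal sketch only: the values n = 100 and p = 0.3, the number of repeated samples, and the function name `simulate_sample_proportions` are illustrative assumptions, not part of the text.

```python
import math
import random
import statistics


def simulate_sample_proportions(n, p, num_samples, seed=0):
    """Draw num_samples binomial(n, p) counts and return each sample proportion X/n."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    proportions = []
    for _ in range(num_samples):
        # X = number of successes among n independent trials
        x = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(x / n)
    return proportions


# Illustrative values (assumed); they satisfy n > 20, np > 5 and nq > 5.
n, p = 100, 0.3
p_hats = simulate_sample_proportions(n, p, num_samples=2000)

theoretical_se = math.sqrt(p * (1 - p) / n)  # sqrt(pq/n)
empirical_sd = statistics.stdev(p_hats)      # spread of the simulated estimates

print(f"mean of simulated p-hat: {statistics.mean(p_hats):.4f} (p = {p})")
print(f"sd of simulated p-hat:   {empirical_sd:.4f} (sqrt(pq/n) = {theoretical_se:.4f})")
```

With a few thousand repeated samples, the mean and standard deviation of the simulated proportions should sit close to p and $\sqrt{pq/n}$ respectively.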

Confidence Intervals and tests for Proportions

To find a 95% CI for p, we first need to find c such that Pr (-c < Z < c) = 0.95. Using Table 1, c = 1.96. This can be written as:

Pr(-1.96 < Z < 1.96) = 0.95 where

$$Z = \frac{\hat{p} - p}{SE(\hat{p})}$$

so

$$\Pr\left(-1.96 < \frac{\hat{p} - p}{SE(\hat{p})} < 1.96\right) = 0.95.$$

This can be rearranged to obtain the 95% confidence limits for p. A 95% CI for p, the true population proportion, is $\hat{p} \pm 1.96\, SE(\hat{p})$,

where $\hat{p}$ is the observed sample proportion X/n.

Example - Confidence interval

To predict the outcome of a referendum an opinion poll is conducted. A random sample of 552 people on the electoral roll are asked "will you vote for ....?" Their answers are recorded as either "yes" or "other" ("other" includes "no", "don't know", etc).

If 239 people say "yes" find a 95% confidence interval for the true population proportion of "yes" votes.

The proportion of 'yes's in the sample is

$\hat{p}$ = 239 / 552 = 0.433

so $\hat{q}$ = 1 - 0.433 = 0.567 and the estimated standard error is

$$SE(\hat{p}) = \sqrt{\frac{0.433 \times 0.567}{552}} = 0.021$$

So a 95% confidence interval for the population proportion p is given by

$\hat{p} \pm 1.96\, SE(\hat{p})$ = 0.433 ± 1.96 × 0.021, i.e. (0.39, 0.47).

Thus with 95% confidence, the true proportion of "yes" votes lies between 39% and 47%.
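The referendum calculation can be reproduced in a few lines. This is a sketch under the Normal approximation used in this section; the function name `proportion_ci` and its default `z=1.96` are illustrative choices, not something defined by the text.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return (p_hat, estimated SE, (lower, upper)) for a single proportion.

    Uses the Normal-approximation interval p_hat +/- z * sqrt(p_hat * q_hat / n).
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)


# Referendum poll from the example: 239 "yes" answers out of 552 people.
p_hat, se, (lower, upper) = proportion_ci(239, 552)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Passing a different `z` (for example 2.576 for 99% confidence) gives intervals at other confidence levels.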

Example - Sample size

In most Australian national elections, about 50% of votes are cast for the Liberal/National Coalition and 50% for the Australian Labor Party. How large a sample do you need to estimate the vote to within ±3% with 95% confidence?

Let p be the proportion of Liberal/National votes.

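The interval $\hat{p} \pm 1.96\, SE(\hat{p})$ will contain p with 95% confidence.

If we take the term $\pm 1.96\, SE(\hat{p})$ as the ± 3% (or 0.03) error, we need

$1.96\, SE(\hat{p}) = 0.03$

so $SE(\hat{p})$ = 0.03/1.96 = 0.0153

Using the estimate that approximately 50% of votes are for the Liberal/National Coalition, $\hat{p} = 0.5$ and so $\hat{q} = 0.5$.

$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.5 \times 0.5}{n}} = \frac{0.5}{\sqrt{n}}$$

Hence $\dfrac{0.5}{\sqrt{n}} = 0.0153$

and $\sqrt{n} = \dfrac{0.5}{0.0153} = 32.67$,

which gives $n = (32.67)^2 = 1067$. This is the reason why most opinion polls use samples of about 1000 subjects.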
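Solving $z\sqrt{pq/n} = \text{margin}$ for n gives $n = (z/\text{margin})^2\, pq$, which is easy to evaluate for other margins of error. The sketch below assumes that rearrangement; the function name `required_sample_size` is hypothetical. Note that a sample size is normally rounded up to the next whole number, which gives 1068 here rather than the rounded figure of 1067 quoted above.

```python
import math


def required_sample_size(margin, p=0.5, z=1.96):
    """Solve z * sqrt(p * (1 - p) / n) = margin for n (not yet rounded)."""
    return (z / margin) ** 2 * p * (1 - p)


n_exact = required_sample_size(0.03)  # +/-3% margin, p assumed to be 0.5
n_needed = math.ceil(n_exact)         # round up to a whole number of people

print(f"n = {n_exact:.2f}, so sample at least {n_needed} people")
```

Taking p = 0.5 is the conservative choice, since pq is largest there; any other assumed p gives a smaller required n.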

Example - Hypothesis Test

A company claims to have 40% of the market for some product. You conduct a survey and find 38 out of 112 buyers (i.e. 34%) purchased this brand. Are these data consistent with the company's claim or is your survey result of 34% significantly different to the company's claim of 40%?

Let X be the number of buyers who choose this brand.

Assume X ~ Binomial (n,p) with n = 112 since the buyer either purchases this brand or not (binomial).

Null hypothesis H0: p = 0.4 (i.e. the true population proportion is 40%).

If this is true then   X ~ Binomial (112, 0.4) and so   E(X) = np = 112 × 0.4 = 44.8

Now to calculate the p-value for the test, we must calculate the probability of obtaining the observed value of 38, or an even more extreme value, in either direction from the expected value of 44.8.

Values of X that are either less than or equal to 38, or greater than or equal to 51.6 ( = 44.8 + 6.8), are as extreme or more extreme than the observed value, so

p-value = $P(X \le 38 \text{ or } X \ge 51.6)$

= P(X=0)+P(X=1)+...+P(X=38)+P(X=52)+...+P(X=112),

This could be calculated using the exact binomial formula to obtain the result 0.2102.
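A sketch of the exact calculation, summing binomial probabilities over both tails, is shown here. It uses only the standard library; the helper name `binomial_pmf` is an illustrative choice. The result can be compared with the 0.2102 quoted in the text.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p0 = 112, 0.4  # sample size and hypothesised proportion from the example

# Lower tail: X <= 38. Upper tail: X >= 52 (the integers at or above 51.6).
lower_tail = sum(binomial_pmf(k, n, p0) for k in range(0, 39))
upper_tail = sum(binomial_pmf(k, n, p0) for k in range(52, n + 1))
p_value = lower_tail + upper_tail

print(f"P(X <= 38) = {lower_tail:.4f}")
print(f"P(X >= 52) = {upper_tail:.4f}")
print(f"two-sided p-value = {p_value:.4f}")
```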

Alternatively, since n is large, we might use the Normal approximation to the Binomial distribution

X approx ~ N(np, npq)

np = 112 × .4 = 44.8

npq = 112 × .4 × .6 = 26.88

i.e. X ~ approx N(44.8, 26.88)

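p-value = $P(X \le 38 \text{ or } X \ge 51.6)$ or, using the continuity correction,

$= P(X < 38.5 \text{ or } X > 51.5)$
$= P\left(Z < \dfrac{38.5 - 44.8}{\sqrt{26.88}} \text{ or } Z > \dfrac{51.5 - 44.8}{\sqrt{26.88}}\right)$
$= P(Z < -1.215 \text{ or } Z > 1.292)$
= 0.1122 + (1 - 0.9015)
= 0.210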

This p-value is greater than 0.05, so if the conventional significance level of 0.05 is chosen the result is "not statistically significant". The p-value is not small so we do not reject H0 but instead conclude that the data are consistent with the claim that the true population proportion of buyers purchasing this brand is 40%. The sample proportion differed from this, but not by a statistically significant amount.
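The Normal-approximation p-value with continuity correction can also be computed without tables. This sketch builds the standard Normal CDF from `math.erf`; the helper name `normal_cdf` is an assumption for illustration, and small differences from the table-based figures above are due to rounding.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


n, p0 = 112, 0.4
mean = n * p0                      # np = 44.8
sd = math.sqrt(n * p0 * (1 - p0))  # sqrt(npq) = sqrt(26.88)

# Continuity-corrected boundaries for X <= 38 and X >= 52.
z_lower = (38.5 - mean) / sd
z_upper = (51.5 - mean) / sd

p_value = normal_cdf(z_lower) + (1 - normal_cdf(z_upper))
print(f"z_lower = {z_lower:.3f}, z_upper = {z_upper:.3f}, p-value = {p_value:.3f}")
```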

Confidence Interval for the Difference Between Two Proportions

Example - City/country market survey

In a market survey it is found that 51 of 198 people in cities (26%) use brand 'A' of a product, compared with 26 of 145 country people (18%). Can we conclude that brand 'A' really is more popular in cities or could this difference have occurred by chance, i.e. due to sampling variability?

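In general we may wish to compare the population proportions based on data from two samples:

| | Group 1 | Group 2 |
|---|---|---|
| population proportions | $p_1$ | $p_2$ |
| sample proportions | $\hat{p}_1 = X_1/n_1$ | $\hat{p}_2 = X_2/n_2$ |
| standard errors | $\sqrt{p_1 q_1/n_1}$ | $\sqrt{p_2 q_2/n_2}$ |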

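We compare p1 and p2 by making inferences about their difference (p1 - p2). This is estimated by $\hat{p}_1 - \hat{p}_2$.

To obtain the sampling distribution for $\hat{p}_1 - \hat{p}_2$ we use the formulas for expected values and variances of functions of random variables.

$$\hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}$$

so

$$E(\hat{p}_1 - \hat{p}_2) = \frac{1}{n_1}E(X_1) - \frac{1}{n_2}E(X_2) = \frac{n_1 p_1}{n_1} - \frac{n_2 p_2}{n_2} = p_1 - p_2$$

and, since the two samples are independent,

$$\mathrm{var}(\hat{p}_1 - \hat{p}_2) = \frac{\mathrm{var}(X_1)}{n_1^2} + \frac{\mathrm{var}(X_2)}{n_2^2} = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$$

where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.

If p1, q1, p2 and q2 are replaced by sample estimates then

$$SE(\hat{p}_1 - \hat{p}_2) \approx \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$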

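Also, provided both samples are large (say n1 > 20 and n2 > 20) and neither p1 nor p2 is very near 0 or 1, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately Normal.

So, approximately,

$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}\right)$$

Hence a 95% confidence interval for (p1 - p2) is given by

$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$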

For the data in the example above

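$\hat{p}_1$ = 51/198 ≈ 0.2576, $\hat{p}_2$ = 26/145 ≈ 0.1793

so a 95% confidence interval is

$$(0.2576 - 0.1793) \pm 1.96 \sqrt{\frac{0.2576 \times 0.7424}{198} + \frac{0.1793 \times 0.8207}{145}}$$

i.e. (-0.01, 0.17)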

Thus the data suggest that, with 95% confidence, the true population difference in proportions who favour brand 'A' between cities and country areas is between -1% and 17%.

In particular, since this interval includes zero, the data are consistent with there being no difference in preferences (they are also consistent with differences of 10% or 17% or -1%, etc). The null hypothesis of "no difference" cannot be rejected and we cannot conclude that there is any real difference in preferences.
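The city/country interval can be reproduced with a short function. This is a sketch assuming two independent samples and the Normal approximation described in this section; the name `two_proportion_ci` is hypothetical.

```python
import math


def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Return (difference, SE, (lower, upper)) for p1 - p2 from two independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Variances of the two sample proportions add because the samples are independent.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff, se, (diff - z * se, diff + z * se)


# Market survey from the example: 51 of 198 city people, 26 of 145 country people.
diff, se, (lower, upper) = two_proportion_ci(51, 198, 26, 145)
print(f"difference = {diff:.4f}, SE = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("interval includes zero:", lower < 0 < upper)
```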

