Suppose you take a random sample of n individuals. Let the random variable X be the number of successes in the sample. A reasonable model would be to assume that X has the binomial distribution, X ~ binomial (n, p) where usually n, the sample size, is known but p is not.
The proportion of successes in the sample, $X/n$, provides an estimate for p (the population proportion). Write this as $\hat{p} = X/n$ (where $\hat{p}$ denotes "an estimate of p").
If X ~ binomial(n,p) then E(X)=np and var(X)=npq where q=1-p.
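Since $\hat{p} = X/n$ is a function of X, and n is known, we have

$$E(\hat{p}) = \frac{E(X)}{n} = \frac{np}{n} = p, \qquad \mathrm{var}(\hat{p}) = \frac{\mathrm{var}(X)}{n^2} = \frac{npq}{n^2} = \frac{pq}{n},$$

so the standard error of $\hat{p}$ is

$$SE(\hat{p}) = \sqrt{\frac{pq}{n}}.$$

$SE(\hat{p})$ describes the variability of estimates $\hat{p}$ derived from different samples from the same population.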
If p and q are estimated by the sample values $\hat{p}$ and $\hat{q} = 1 - \hat{p}$, then approximately

$$SE(\hat{p}) \approx \sqrt{\frac{\hat{p}\hat{q}}{n}}.$$
Hence, for large n, the sampling distribution of $\hat{p}$ is approximately Normal: $\hat{p} \sim N(p, pq/n)$, and

$$Z = \frac{\hat{p} - p}{SE(\hat{p})} \sim N(0,1) \text{ approximately.}$$

This gives us a basis for making inferences about p: constructing confidence intervals and performing hypothesis tests.
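The claim that the sample proportion has mean p and standard error sqrt(pq/n) can be checked with a small simulation. The following is a minimal sketch only: the function names and the values n = 200, p = 0.3 and 2000 repeated samples are illustrative assumptions, not taken from the text.

```python
import math
import random

def simulate_sample_proportions(n, p, n_samples, seed=0):
    """Return n_samples simulated sample proportions X/n, with X ~ binomial(n, p)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(n_samples):
        # Count successes among n independent trials, each with success probability p
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

def mean_and_sd(values):
    """Sample mean and sample standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)

# Illustrative values (assumptions, not from the text)
n, p = 200, 0.3
props = simulate_sample_proportions(n, p, n_samples=2000)
mean_hat, sd_hat = mean_and_sd(props)
theoretical_se = math.sqrt(p * (1 - p) / n)

print(f"mean of p_hat: {mean_hat:.4f} (theory: {p})")
print(f"sd of p_hat:   {sd_hat:.4f} (theory: {theoretical_se:.4f})")
```

Under these assumptions the simulated mean and standard deviation should sit close to p and sqrt(pq/n) respectively, with the agreement improving as the number of repeated samples grows.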
To find a 95% CI for p, we first need to find c such that Pr (-c < Z < c) = 0.95. Using Table 1, c = 1.96. This can be written as:
$$P(-1.96 < Z < 1.96) = 0.95 \quad \text{where} \quad Z = \frac{\hat{p} - p}{SE(\hat{p})},$$

so

$$P\left(-1.96 < \frac{\hat{p} - p}{SE(\hat{p})} < 1.96\right) = 0.95.$$
This can be rearranged to obtain the 95% confidence limits for p.
A 95% CI for p, the true population proportion, is

$$\hat{p} \pm 1.96 \, SE(\hat{p})$$

where $\hat{p}$ is the observed sample proportion X/n.
Example - Confidence interval
To predict the outcome of a referendum an opinion poll is conducted. A random sample of 552 people on the electoral roll are asked "will you vote for ....?" Their answers are recorded as either "yes" or "other" ("other" includes "no", "don't know", etc).
If 239 people say "yes" find a 95% confidence interval for the true population proportion of "yes" votes.
The proportion of "yes" answers in the sample is $\hat{p} = 239 / 552 = 0.433$, so $\hat{q} = 1 - 0.433 = 0.567$, and the estimated standard error is

$$SE(\hat{p}) = \sqrt{\frac{0.433 \times 0.567}{552}} = 0.021.$$

So a 95% confidence interval for the population proportion p is given by

$$\hat{p} \pm 1.96 \, SE(\hat{p}) = 0.433 \pm 1.96 \times 0.021, \text{ i.e. } (0.39, 0.47).$$
Thus with 95% confidence, the true proportion of "yes" votes lies between 39% and 47%.
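The calculation above can be reproduced in a few lines. This is a minimal sketch; the helper name `proportion_ci` is an illustrative assumption, and the multiplier 1.96 is the 95% value quoted in the text.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate confidence interval for a proportion: p_hat +/- z * SE(p_hat)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Referendum poll from the example: 239 "yes" answers out of 552
p_hat, se, (lower, upper) = proportion_ci(239, 552)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: p_hat = 0.433, SE = 0.021, 95% CI = (0.39, 0.47)
```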
Example - Sample size
In most Australian national elections, about 50% of votes are cast for the Liberal/National Coalition and 50% for the Australian Labor Party. How large a sample do you need to estimate the vote to within ±3% with 95% confidence?
Let p be the proportion of Liberal/National votes.
The interval $\hat{p} \pm 1.96 \, SE(\hat{p})$ will contain p with 95% confidence.
If we take the term $\pm 1.96 \, SE(\hat{p})$ as the ±3% (or 0.03) error, we need

$$1.96 \, SE(\hat{p}) = 0.03, \quad \text{so} \quad SE(\hat{p}) = 0.03/1.96 = 0.0153.$$
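Using the estimate that approximately 50% of votes are for the Liberal/National Coalition, $\hat{p} = 0.5$ and so $\hat{q} = 0.5$. Then

$$SE(\hat{p}) = \sqrt{\frac{0.5 \times 0.5}{n}} = \frac{0.5}{\sqrt{n}}.$$

Hence

$$\frac{0.5}{\sqrt{n}} = 0.0153 \quad \text{and} \quad \sqrt{n} = \frac{0.5}{0.0153} = 32.67,$$

which gives $n = (32.67)^2 = 1067$. This is the reason why most opinion polls use samples of about 1000 subjects.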
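Solving 1.96 × sqrt(pq/n) = margin for n gives n = (1.96/margin)² × pq, which is easy to evaluate directly. The sketch below assumes a hypothetical helper name, `sample_size_for_margin`, and uses the 50% guess from the example.

```python
import math

def sample_size_for_margin(margin, p_guess=0.5, z=1.96):
    """Sample size n for which z * sqrt(p(1-p)/n) equals the requested margin."""
    return (z / margin) ** 2 * p_guess * (1 - p_guess)

n_exact = sample_size_for_margin(0.03)  # margin of +/-3%, 95% confidence
n_needed = math.ceil(n_exact)           # round up to a whole number of people

print(f"n = {n_exact:.2f}, rounded up: {n_needed}")
# prints: n = 1067.11, rounded up: 1068
```

The unrounded value agrees with the 1067 quoted above; rounding up to 1068 is the cautious choice if the margin must be no larger than 3%.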
Example - Hypothesis Test
A company claims to have 40% of the market for some product. You conduct a survey and find 38 out of 112 buyers (i.e. 34%) purchased this brand. Are these data consistent with the company's claim or is your survey result of 34% significantly different from the company's claim of 40%?
Let X be the number of buyers who choose this brand.
Assume X ~ Binomial (n,p) with n = 112 since the buyer either purchases this brand or not (binomial).
Null hypothesis H0: p = 0.4 (i.e. true population proportion is 40%).
If this is true then X ~ Binomial (112, 0.4) and so E(X) = np = 112 × 0.4 = 44.8
Now to calculate the p-value for the test, we must calculate the probability of obtaining the observed value of 38, or an even more extreme value, in either direction from the expected value of 44.8.
Values of X that are either less than or equal to 38, or greater than or equal to 51.6 ( = 44.8 + 6.8), are as extreme or more extreme than the observed value, so
$$\text{p-value} = P(X \le 38 \text{ or } X \ge 51.6)$$
This could be calculated using the exact binomial formula to obtain the result 0.2102.
Alternatively, since n is large, we might use the Normal approximation to the Binomial distribution
X approx ~ N(np, npq)
np = 112 × .4 = 44.8
npq = 112 × .4 × .6 = 26.88
i.e. X ~ approx N(44.8, 26.88)
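$$\text{p-value} = P(X \le 38 \text{ or } X \ge 51.6)$$

or, using the continuity correction,

$$\text{p-value} \approx P(X \le 38.5 \text{ or } X \ge 51.5) = P\left(Z \le \frac{38.5 - 44.8}{\sqrt{26.88}} \text{ or } Z \ge \frac{51.5 - 44.8}{\sqrt{26.88}}\right) = P(Z \le -1.215 \text{ or } Z \ge 1.292) \approx 0.21$$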
This p-value is greater than 0.05, so if the conventional significance level of 0.05 is chosen the result is "not statistically significant". The p-value is not small so we do not reject H0 but instead conclude that the data are consistent with the claim that the true population proportion of buyers purchasing this brand is 40%. The sample proportion differed from this, but not by a statistically significant amount.
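Both routes to the p-value can be sketched with the Python standard library alone. This is an illustration under stated assumptions: the helper names `binom_pmf` and `normal_cdf` are hypothetical, and the upper cut-off 51.6 is rounded up to 52 for the exact calculation because X takes only whole-number values.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def normal_cdf(z):
    """Standard Normal cumulative distribution function, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p0, observed = 112, 0.4, 38
expected = n * p0                           # 44.8
distance = expected - observed              # 6.8
upper_cut = math.ceil(expected + distance)  # 51.6 rounds up to 52 (X is an integer)

# Exact binomial: P(X <= 38) + P(X >= 52)
exact_p = (sum(binom_pmf(k, n, p0) for k in range(0, observed + 1))
           + sum(binom_pmf(k, n, p0) for k in range(upper_cut, n + 1)))

# Normal approximation with continuity correction: P(X <= 38.5 or X >= 51.5)
sd = math.sqrt(n * p0 * (1 - p0))           # sqrt(26.88)
z_low = (observed + 0.5 - expected) / sd
z_high = (upper_cut - 0.5 - expected) / sd
approx_p = normal_cdf(z_low) + (1 - normal_cdf(z_high))

print(f"z_low = {z_low:.3f}, z_high = {z_high:.3f}")
print(f"exact p-value = {exact_p:.4f}, Normal approximation = {approx_p:.4f}")
```

Either way the p-value is well above 0.05, which matches the conclusion drawn in the text.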
Example - City/country market survey
In a market survey it is found that 51 of 198 people in cities (26%) use brand 'A' of a product and 26 of 145 country people (18%) use it. Can we conclude that brand 'A' really is more popular in cities or could this difference have occurred by chance, i.e. due to sampling variability?
In general we may wish to compare the population proportions based on data from samples
| | Group 1 | Group 2 |
| --- | --- | --- |
| population proportions | p1 | p2 |
| sample proportions | $\hat{p}_1 = X_1/n_1$ | $\hat{p}_2 = X_2/n_2$ |
| standard errors | $\sqrt{p_1 q_1 / n_1}$ | $\sqrt{p_2 q_2 / n_2}$ |
We compare p1 and p2 by making inferences
about their difference (p1 - p2).
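This is estimated by $\hat{p}_1 - \hat{p}_2$.
To obtain the sampling distribution for $\hat{p}_1 - \hat{p}_2$ we use the formulas for expected values and variances of functions of random variables.

$$\hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}$$

so

$$E(\hat{p}_1 - \hat{p}_2) = \frac{1}{n_1}E(X_1) - \frac{1}{n_2}E(X_2) = \frac{n_1 p_1}{n_1} - \frac{n_2 p_2}{n_2} = p_1 - p_2.$$

Since the two samples are independent, the variances add:

$$\mathrm{var}(\hat{p}_1 - \hat{p}_2) = \frac{\mathrm{var}(X_1)}{n_1^2} + \frac{\mathrm{var}(X_2)}{n_2^2} = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}.$$

If p1, q1, p2 and q2 are replaced by sample estimates then

$$SE(\hat{p}_1 - \hat{p}_2) \approx \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}.$$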
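Also, provided both samples are large (say n1 > 20 and n2 > 20) and neither p1 nor p2 is very near 0 or 1, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately Normal. So

$$\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE(\hat{p}_1 - \hat{p}_2)} \sim N(0,1) \text{ approximately.}$$

Hence a 95% confidence interval for (p1 - p2) is given by

$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}.$$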
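For the data in the example above, $\hat{p}_1 = 51/198 \approx 0.2576$ and $\hat{p}_2 = 26/145 \approx 0.1793$, so a 95% confidence interval is

$$(0.2576 - 0.1793) \pm 1.96 \sqrt{\frac{0.2576 \times 0.7424}{198} + \frac{0.1793 \times 0.8207}{145}} = 0.0783 \pm 1.96 \times 0.0445,$$

i.e. (-0.01, 0.17).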
Thus the data suggest that, with 95% confidence, the true population difference in proportions who favour brand 'A' between cities and country areas is between -1% and 17%.
In particular, since this interval includes zero, the data are consistent with there being no difference in preferences (they are also consistent with differences of 10% or 17% or - 1%, etc). The null hypothesis of "no difference" cannot be rejected and we cannot conclude that there is any real difference in preferences.
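The interval for the difference in proportions can be reproduced as follows. This is a minimal sketch; the function name `two_proportion_ci` is an illustrative assumption, and the standard error uses the sample estimates as in the text.

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Approximate confidence interval for p1 - p2 from two independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat
    # Variances of the two sample proportions add because the samples are independent
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return diff, se, (diff - z * se, diff + z * se)

# Market survey: 51 of 198 city respondents, 26 of 145 country respondents
diff, se, (lower, upper) = two_proportion_ci(51, 198, 26, 145)
print(f"difference = {diff:.4f}, SE = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: difference = 0.0783, SE = 0.0445, 95% CI = (-0.01, 0.17)

# The interval contains zero, so "no difference" cannot be rejected at the 5% level
print("interval includes zero:", lower < 0 < upper)
```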