Surfstat.australia: an online text in introductory Statistics

STATISTICAL INFERENCE

ONE CONTINUOUS VARIABLE

Sampling Distribution of the Sample Mean

Suppose we have a sample of n observations of a continuous random variable X (e.g. height, cost, temperature).

Let µ = E(X) be the population mean, and σ = √Var(X) be the population standard deviation of X.

Usually both µ and σ are unknown, and we want primarily to estimate µ.

From the sample we can calculate

x̄ = sample mean
s = sample standard deviation

The sample mean is an estimate of µ, but how accurate is it?

We know the approximate sampling distribution of the random variable X̄ from the Central Limit Theorem. Suppose many different samples of the same size are obtained by repeatedly sampling from a population:

- for each sample, x̄ is calculated and
- a histogram of the x̄ values is drawn

The shape of the histogram depends on the size n of the sample, and approximates the sampling distribution of X̄.

Properties of the Sampling Distribution

  1. Sampling distributions of X̄ have the same expected value, regardless of sample size n, equal to µ.
  2. Even if the distribution of the X values is not Normal, as n increases the distribution of X̄ becomes more like the Normal distribution - this is the Central Limit Theorem.
  3. As n increases, the distribution of X̄ becomes narrower - that is, the x̄'s cluster more tightly around µ. In fact the variance is inversely proportional to n: Var(X̄) = σ²/n.
  4. The square root of this variance, σ/√n, is called the "standard error" of X̄: SE(X̄) = σ/√n.
  5. The sampling distribution of X̄ is (approx) N(µ, σ²/n).

    This follows from the previous definition of variance of combinations of random variables, since for independent observations X1, ..., Xn, Var(X̄) = Var((X1 + ... + Xn)/n) = (1/n²)(Var(X1) + ... + Var(Xn)) = nσ²/n² = σ²/n.
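The repeated-sampling idea behind these properties can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a skewed population (exponential with mean 1, so µ = 1 and σ = 1), and the function name `simulate_sample_means` is hypothetical. For each sample size it prints the mean and variance of the simulated sample means next to the theoretical variance σ²/n.

```python
import random
import statistics

def simulate_sample_means(n, num_samples=5000, seed=0):
    """Draw num_samples samples of size n and return the list of sample means."""
    rng = random.Random(seed)
    # Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

for n in (5, 25, 100):
    means = simulate_sample_means(n)
    # Columns: n, mean of the sample means, their variance, theoretical sigma^2 / n
    print(n,
          round(statistics.fmean(means), 3),
          round(statistics.pvariance(means), 4),
          1.0 / n)
```

The mean of the simulated x̄ values should stay close to µ for every n, while their variance should shrink roughly in proportion to 1/n. Plotting a histogram of `means` for each n would show the shape becoming more Normal as n grows.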

Confidence Limits for the Population Mean when σ is known

To calculate probabilities for X̄ use Z = (X̄ - µ)/(σ/√n), which is (approx) N(0, 1).

e.g. from table for Z ~ N(0, 1),

P(Z < 1) = 0.8413

so P(-1 < Z < 1) = 0.8413 - (1 - 0.8413) = 0.6826.
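The table lookup above can also be done in software. This is a small sketch using `NormalDist` from Python's standard `statistics` module; the helper name `prob_between` is hypothetical. Software works with more decimal places than a four-figure table, so the last digit of a result may differ slightly from the table-based arithmetic.

```python
from statistics import NormalDist

# Standard normal distribution, Z ~ N(0, 1)
Z = NormalDist(mu=0.0, sigma=1.0)

def prob_between(lo, hi, dist=Z):
    """Return P(lo < X < hi) for a normal random variable X with distribution dist."""
    return dist.cdf(hi) - dist.cdf(lo)

print(round(Z.cdf(1.0), 4))               # P(Z < 1)
print(round(prob_between(-1.0, 1.0), 4))  # P(-1 < Z < 1)
```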

So with probability 0.6826, -1 < Z < 1. Equivalently, with probability 0.6826, -1 < (X̄ - µ)/(σ/√n) < 1.

We want to rearrange these inequalities so that µ alone is in the middle. If you rearrange a statement that has a certain probability of being true into an equivalent statement, the new statement has that same probability.

First, multiply each term by σ/√n to obtain

-σ/√n < X̄ - µ < σ/√n

Multiply each term by -1 (this means the inequality signs are reversed)

σ/√n > µ - X̄ > -σ/√n

Add X̄ to each term

X̄ + σ/√n > µ > X̄ - σ/√n

Rearrange this to obtain

X̄ - σ/√n < µ < X̄ + σ/√n

This tells us that, for 68.3% of all samples, the population mean µ lies in the random interval (L, U), where L = X̄ - σ/√n and U = X̄ + σ/√n.

L is called the lower 68% confidence limit for µ and U is called the upper 68% confidence limit for µ.

More usually 90% or 95% or 99% confidence limits are used.

E.g. from tables for N(0, 1),

P(Z < 1.96) = 0.975

Hence P(-1.96 < Z < 1.96) = 0.95

This can be rearranged as before to obtain 95% confidence limits for µ: X̄ ± 1.96 σ/√n.

Similarly, 90% confidence limits for µ are X̄ ± 1.645 σ/√n and 99% confidence limits for µ are X̄ ± 2.575 σ/√n.
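The multipliers for other confidence levels come from the inverse of the standard normal distribution function. The sketch below (the function name `z_critical` is hypothetical) computes them with `NormalDist.inv_cdf` from Python's standard library. The computed 99% value is closer to 2.576; printed tables often give 2.575 by interpolation, and the difference is negligible in practice.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided standard normal multiplier for the given confidence level."""
    tail = (1.0 - confidence) / 2.0   # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
```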

The probability that µ lies in the random interval X̄ ± 1.96 σ/√n is 0.95, where X̄ is the random variable of which each sample mean x̄ is an observation.

If we observe a sample mean, x̄, then the interval x̄ ± 1.96 σ/√n is the 95% C.I. for µ. However, this is the observed interval (it is not random) and hence it either does or does not contain µ. Thus, the statement that x̄ ± 1.96 σ/√n contains µ with probability 0.95 is incorrect, because there is no random variable in this statement to which the probability can refer.

Instead, we say that the interval x̄ ± 1.96 σ/√n contains plausible values for µ that are consistent with the sample data, or that we are 95% confident that µ lies in the interval x̄ ± 1.96 σ/√n. "95% confident" does not refer to a probability for the observed interval, but to the notion of repeated sampling.
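The repeated-sampling interpretation can be illustrated by simulation. This is a sketch under stated assumptions, not the text's own method: the population is taken to be Normal with µ = 165 and σ = 5 (illustrative values only), and the function name `coverage` is hypothetical. It draws many samples, builds x̄ ± 1.96 σ/√n for each, and reports the fraction of intervals that contain µ.

```python
import math
import random
import statistics

def coverage(mu=165.0, sigma=5.0, n=100, num_samples=2000, z=1.96, seed=0):
    """Fraction of simulated intervals xbar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # the same for every sample: sigma is known
    hits = 0
    for _ in range(num_samples):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / num_samples

print(coverage())
```

Each individual interval either contains µ or does not; it is the long-run fraction across samples that should be close to 0.95.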

In practice, σ is usually unknown, but it can be estimated by the sample standard deviation, s. If this is done, the t-distribution must be used instead of the standard normal.

Health survey example continued:

Sample of n = 100 women aged 25-29 years

sample mean x̄ = 165 cm
sample standard deviation s = 5 cm

For now, we will pretend that the population SD is known to be exactly 5. A more accurate method must take account of the fact that the value 5 is only an estimate. This method is based on the t-distribution instead of the normal distribution. The difference is small when the sample is large, as here.

95% confidence limits for population mean, µ are

165 + 1.96 × 5/√100 = 165.98 ≈ 166
165 - 1.96 × 5/√100 = 164.02 ≈ 164

i.e. the 95% confidence interval for µ is (164, 166) cm

Hence, plausible values for µ are 164-166 cm, or with 95% confidence the true study population mean height of women aged 25-29 years lies between 164 and 166 cm.
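The arithmetic of this example can be reproduced in a few lines. The sketch below treats σ as known and equal to 5, as the example does; the function name `z_confidence_interval` is hypothetical, and the multiplier is computed from the standard normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Confidence limits xbar +/- z * sigma / sqrt(n), assuming sigma is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Health survey values: n = 100, sample mean 165 cm, SD taken as 5 cm
lower, upper = z_confidence_interval(xbar=165.0, sigma=5.0, n=100)
print(round(lower, 2), round(upper, 2))
```

Passing `confidence=0.90` or `confidence=0.99` gives the narrower or wider limits for the same data.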

