Surfstat.australia: an online text in introductory Statistics

# STATISTICAL INFERENCE

## POPULATIONS, SAMPLES, ESTIMATES AND REPEATED SAMPLING

Statistical inference is the use of probability theory to make inferences about a population from sample data.

Suppose we want to estimate a characteristic of a population, such as the average weight of all 30-year-old women in Australia, or the percentage of voters in N.S.W. who think the Government is doing a good job of controlling inflation.

In practice we cannot obtain data from every member of the population. Instead, we obtain data from a sample and use the results to make inferences about the population.

### Definitions

Population - collection of all subjects or objects of interest (not necessarily people)

Sample - subset of the population used to make inferences about the characteristics of the population

Population parameter - numerical characteristic of a population, a fixed and usually unknown quantity.

e.g. average weight, µ, of all 30-year-old women in Australia,

e.g. % of voters, p, in N.S.W. who think the Government is doing a good job of controlling inflation.

Data - values measured or recorded on the sample.

Sample statistic - numerical characteristic of the sample data such as the mean, proportion or variance. It can be used to provide estimates of the corresponding population parameters.

e.g. % of voters in an opinion poll in Sydney who think the Government is doing a good job of controlling inflation.

Different samples give different values for sample statistics. If you took many different samples and calculated a sample statistic for each one (e.g. the sample mean), you could draw a histogram of the resulting values. A statistic from a sample or randomised experiment can be regarded as a random variable, and the histogram is an approximation to its probability distribution. The term sampling distribution is used to describe this distribution, i.e. how the statistic (regarded as a random variable) varies if random samples are repeatedly taken from the population.
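
The idea of repeated sampling can be tried out by simulation. The following is a minimal sketch, not part of the original text: the population of 10,000 weights is made up, and the function name `sampling_distribution` is a hypothetical helper introduced only for illustration.

```python
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, seed=0):
    """Draw n_samples simple random samples and return their sample means."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population (assumption): 10,000 weights in kg, roughly normal.
pop_rng = random.Random(1)
population = [pop_rng.gauss(68, 12) for _ in range(10_000)]

# 2,000 repeated samples of size 50, each reduced to its sample mean.
means = sampling_distribution(population, sample_size=50, n_samples=2_000)

print("population mean:      ", round(statistics.mean(population), 2))
print("mean of sample means: ", round(statistics.mean(means), 2))
print("sd of sample means:   ", round(statistics.stdev(means), 2))
```

A histogram of `means` would approximate the sampling distribution of the sample mean for samples of size 50 from this population. The sample means should cluster around the population mean and vary much less than the individual weights do.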

### Bias

If the sampling distribution is known then the ability of the sample statistic to estimate the corresponding population parameter can be determined.

In particular, the sampling distribution determines the expected value and variance of the sample statistic. If the expected value of the statistic is equal to the population parameter, the statistic is an unbiased estimator of that parameter. If the statistic is unbiased and its variance is also 'small', then an observed value of the statistic is likely to be close to the population parameter.

Bias = difference between the expected value of the sample statistic and the population parameter
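
Bias can be estimated by simulation when the population is known. The sketch below is an illustration under stated assumptions, not an example from the original text: the population is taken to be normal with standard deviation 5 (so the true variance is 25), and bias is computed with the common sign convention "expected value minus parameter". It compares two sample statistics that both estimate the population variance, one using divisor n and one using divisor n - 1.

```python
import random
import statistics

rng = random.Random(0)

TRUE_VARIANCE = 25.0  # assumed population: normal with sd 5, so variance 25
SAMPLE_SIZE = 5
N_SAMPLES = 10_000

divide_by_n = []          # sample variance computed with divisor n
divide_by_n_minus_1 = []  # sample variance computed with divisor n - 1
for _ in range(N_SAMPLES):
    sample = [rng.gauss(0.0, 5.0) for _ in range(SAMPLE_SIZE)]
    divide_by_n.append(statistics.pvariance(sample))
    divide_by_n_minus_1.append(statistics.variance(sample))

# Estimated bias = (average of the statistic over many samples) - parameter.
bias_n = statistics.mean(divide_by_n) - TRUE_VARIANCE
bias_n_minus_1 = statistics.mean(divide_by_n_minus_1) - TRUE_VARIANCE

print("estimated bias, divisor n:    ", round(bias_n, 2))
print("estimated bias, divisor n - 1:", round(bias_n_minus_1, 2))
```

In theory the divisor-n statistic has expected value (n - 1)/n times the true variance, which is 20 here, so its bias is -5. The divisor n - 1 statistic is unbiased, so its estimated bias should be close to zero apart from simulation noise.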

Consequently, sample statistics can be classified into four cases, as shown in the following diagrams.

1. Estimates have low bias because their average is near the population parameter, but high variability because they are widely spread, so a single sample value could be far from the parameter.

2. Estimates are biased because their expected value is not equal to the parameter. They also have high variability because they are widely spread out.

3. Estimates are biased because all of them are systematically higher than the population parameter. They have low variability, however, because they are all close together.

4. Estimates have both low bias and low variability.

Experimental design aims to reduce bias and variability simultaneously by producing a sampling distribution like the one in case 4.

In general

sample statistic = population parameter + bias + chance variation
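
One way to write this decomposition in symbols is sketched below. The notation is an assumption introduced here: $\theta$ stands for the population parameter, $\hat{\theta}$ for the sample statistic, and $E[\hat{\theta}]$ for its expected value over repeated sampling.

$$
\hat{\theta} \;=\; \theta \;+\; \underbrace{\bigl(E[\hat{\theta}] - \theta\bigr)}_{\text{bias}} \;+\; \underbrace{\bigl(\hat{\theta} - E[\hat{\theta}]\bigr)}_{\text{chance variation}}
$$

The identity holds because the two $E[\hat{\theta}]$ terms cancel. The bias term is the same for every sample, whereas the chance variation term changes from sample to sample and has expected value zero.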

Inferences about the characteristics of a population are based on data from a sample.

• If the sample is not representative of the population being studied, the sample statistic may be biased, so you cannot use it to make valid inferences about the population parameter.
• To minimise bias the sample should be chosen by random sampling from a list of all individuals in the relevant population. This list is called the sampling frame, and having one is essential.
• For a simple random sample the individuals are chosen in such a way that each individual in the sampling frame has an equal chance of being selected. This may involve using computer generated random numbers to select the sample.
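
Selecting a simple random sample with computer generated random numbers might look like the following sketch. The sampling frame of 1,000 identifiers is hypothetical, and the seed is fixed only so that the selection can be reproduced.

```python
import random

# Hypothetical sampling frame (assumption): one identifier per individual.
sampling_frame = [f"person_{i:04d}" for i in range(1, 1001)]

rng = random.Random(42)  # fixed seed so the same sample can be re-drawn

# random.sample draws without replacement, so no individual is chosen twice
# and every individual in the frame has the same chance of selection.
sample = rng.sample(sampling_frame, k=50)

print(len(sample), "individuals selected, e.g.", sample[:3])
```

Each individual in this frame has a 50 in 1,000 (1 in 20) chance of being selected.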

#### Example - Health survey conducted in the Hunter Region

Study population - all residents of the lower Hunter Region (Newcastle, Lake Macquarie, Port Stephens, Maitland, Cessnock) aged 25-69 years.

Sampling frame - electoral roll (note: some bias is introduced here, because younger persons (< 35 years) and migrants are less likely to be on the roll).

Sample selection - sample chosen using computer generated random numbers so each person on the electoral roll in this age group has a 1 in 100 chance of selection.

Actual sample - those who responded to the request to participate in the study

Non-respondents may differ from the respondents in many ways (e.g. being less healthy) and this could lead to bias in estimates of the proportion of smokers, average weight, etc.
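
The effect of non-response can be illustrated with a small simulation. Every figure in this sketch is made up for illustration and none comes from the Hunter Region survey: the frame size, the 25% smoking rate and the two response probabilities are all assumptions. Only the 1 in 100 selection chance follows the example.

```python
import random

rng = random.Random(7)

# Hypothetical frame (assumption): 200,000 people, each a smoker with prob. 0.25.
FRAME_SIZE = 200_000
is_smoker = [rng.random() < 0.25 for _ in range(FRAME_SIZE)]

# Sample selection: each person has a 1 in 100 chance of being chosen.
selected = [smoker for smoker in is_smoker if rng.random() < 0.01]

# Non-response (assumption): smokers respond with probability 0.5,
# non-smokers with probability 0.7.
responded = [
    smoker for smoker in selected
    if rng.random() < (0.5 if smoker else 0.7)
]

p_frame = sum(is_smoker) / len(is_smoker)
p_selected = sum(selected) / len(selected)
p_responded = sum(responded) / len(responded)

print("proportion of smokers in frame:          ", round(p_frame, 3))
print("proportion of smokers in selected sample:", round(p_selected, 3))
print("proportion of smokers among respondents: ", round(p_responded, 3))
```

Under these assumed response rates the selected sample estimates the smoking proportion well, but the respondents systematically underestimate it. Taking a larger sample would not remove this bias, because it comes from who responds rather than from chance variation.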

#### Example - Heights of women aged 25-29

Suppose you measured the heights of a random sample of 100 women aged 25-29 years and calculated

sample mean = 165 cm

sample standard deviation s = 5 cm

What can you conclude about the heights of all women in this population aged 25-29 years?

Supposing any bias is negligibly small, the population mean is approximately 165 cm. But how close is the approximation? How might the estimate have varied if a different random sample had been selected?
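
One standard way to begin answering these questions, which this section does not itself develop, is the standard error of the sample mean, s divided by the square root of n. The sketch below applies it to the figures above. The multiplier 1.96 for an approximate 95% interval assumes that the sampling distribution of the mean is close to normal, which is usually reasonable for a random sample of 100.

```python
import math

n = 100              # sample size
sample_mean = 165.0  # cm
s = 5.0              # sample standard deviation, cm

# Estimated standard deviation of the sample mean over repeated samples.
standard_error = s / math.sqrt(n)

# Approximate 95% interval, assuming a roughly normal sampling distribution.
z = 1.96
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"standard error = {standard_error:.2f} cm")
print(f"approximate 95% interval: {lower:.2f} cm to {upper:.2f} cm")
```

With these figures the standard error is 0.5 cm, so sample means from different random samples of 100 women would typically differ from the population mean by about half a centimetre.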

In parameter estimation, we assume that the distribution of the variable we are interested in is adequately described by a distribution with one or more (unknown) parameters. We attempt to estimate the population parameter using the sample data. To emphasise the difference between sample and population, the parameters we wish to estimate are called population parameters. Estimates of the population parameters obtained from a sample are called sample statistics (or sample estimates).
