Surfstat.australia: an online text in introductory Statistics

PRODUCING DATA

PRINCIPLES OF STUDY DESIGN

"Statistical designs for producing trustworthy data are perhaps the single most influential contribution of statistics to the advancement of knowledge." -(Moore and McCabe, 1993).

Overview

There are two purposes for analysing data: to search for patterns, and to provide clear answers to specific questions. Exploratory data analysis (EDA) may be followed by formal statistical inference, which answers specific questions and provides measures of the uncertainty associated with the answers.

Data analysis is important, but researchers also need to develop skills:

to produce reliable and valid data;
to judge the quality of data produced by others.

Producing new data is expensive, but a clear answer to a specific question often requires new data. The alternative is "available data": data that were collected for some other purpose but that may help to answer the present question.

Statistical designs for producing new data rely on either random sampling or controlled experimentation.

The purpose of sampling is to study a part in order to gain information about the whole. For example, accountants may sample a firm's inventory to verify the accuracy of its records as a whole.

An experiment randomly allocates treatments to the experimental units or subjects in order to observe their response.

Design of experiments

The design of an experiment begins with a description of the response (dependent) variable or variables, the factors (explanatory or independent variables) and what specific treatments will be administered.

Treatment => observation

Before-and-after

observation 1 => treatment => observation 2

Experiments should compare treatments rather than attempt to assess a single treatment in isolation.

For example, compare the response of the treated group with that of a control group not given the treatment.

The control of the effects of other factors and variables (placebo effect, selection of subjects, etc.) is the first aim of the statistical design of experiments.

Bias - the design of a study is biased if it systematically favours certain outcomes.

Randomisation, Replication and Blocking

The use of chance to allocate experimental units into groups is called randomisation. Randomisation is the major principle of the statistical design of experiments.

Randomisation produces groups of experimental units that are more likely to be similar in all respects, before the treatments are applied, than groups formed by non-random methods.

At the end of the study, if the difference in the outcome variable between the two groups is too large to attribute to chance, the difference is called statistically significant. The decision about how large a difference must be to count as "significant" depends on statistical inference using the laws of probability. This will be discussed in later sections.

Another principle is that experiments with more subjects are more likely to detect differences than those with fewer subjects. Repeating an experiment on many subjects is called replication.
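The effect of replication can be illustrated with a small simulation. The sketch below is an illustration only and is not taken from the text: it assumes normally distributed responses with standard deviation 1, a hypothetical true treatment effect of 0.5, and a simple rule that flags a difference larger than 1.96 estimated standard errors. The function name detection_rate and all of the numbers are assumptions chosen for the example.

```python
import random
import statistics

def detection_rate(n_per_group, true_effect=0.5, n_experiments=500, seed=1):
    """Fraction of simulated experiments that flag a treatment difference.

    Assumptions (hypothetical, for illustration only): responses are
    normal with standard deviation 1, and a difference is flagged when
    it exceeds 1.96 estimated standard errors.
    """
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treated) - statistics.mean(control)
        # Standard error of the difference between two group means.
        se = ((statistics.variance(treated) + statistics.variance(control))
              / n_per_group) ** 0.5
        if abs(diff) > 1.96 * se:
            detected += 1
    return detected / n_experiments

# Compare small, medium and large experiments with the same true effect.
for n in (5, 20, 100):
    print(n, "subjects per group:", detection_rate(n))
```

Under these assumptions the experiments with more subjects per group flag the same true difference far more often, which is the point of replication.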

If it is known, before the experiment is carried out, that other variables of no interest influence the outcome (e.g. age or sex of a patient), then randomisation can be carried out within subsets of experimental units defined by these variables. This is called a BLOCK DESIGN.

For example, the response to treatment for a given type of cancer is expected to depend on the sex of the patient. Ideally, the males and the females should each be split equally between the control and treatment groups. This can be achieved by randomising 50% of the males and 50% of the females to the treatment group.
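A block design of this kind might be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function block_randomise, the patient labels and the numbers of patients are all hypothetical, and Python's standard random module stands in for a table of random numbers.

```python
import random

def block_randomise(subjects, block_of, seed=None):
    """Randomise half of each block to treatment and the rest to control.

    `subjects` is a list of identifiers and `block_of` maps each
    identifier to its block (here, the patient's sex). Both inputs are
    hypothetical and used for illustration only.
    """
    rng = random.Random(seed)
    # Group the subjects into blocks.
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)
    # Shuffle within each block, then split the block in half.
    allocation = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for s in shuffled[:half]:
            allocation[s] = "treatment"
        for s in shuffled[half:]:
            allocation[s] = "control"
    return allocation

# Hypothetical patients: 6 males (M1..M6) and 4 females (F1..F4).
sex = {f"M{i}": "male" for i in range(1, 7)}
sex.update({f"F{i}": "female" for i in range(1, 5)})

allocation = block_randomise(list(sex), sex, seed=42)
for patient, group in sorted(allocation.items()):
    print(patient, sex[patient], group)
```

Because the shuffle is done separately within each sex, every run gives 3 of the 6 males and 2 of the 4 females to the treatment group, whichever individuals happen to be chosen.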

Example - Mice in the lab

In a laboratory experiment with mice, the 5 mice to receive the treatment were selected by the investigator by plunging his hand into the cage containing 10 mice and catching them one at a time.

What is wrong with this method of sampling?

Can you suggest a better method?

This method may appear to be random, but perhaps the mice chosen are slower, friendlier or even sicker than the mice that were not caught. This would result in the treated group being slower or sicker than the control group. Randomisation would give each mouse an equal chance of being chosen for the treatment group.
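One way to give each mouse an equal chance is sketched below. It assumes the mice can be labelled 1 to 10; the function randomise_mice and the seed value are hypothetical, and the random number generator stands in for a coin or a table of random numbers.

```python
import random

def randomise_mice(n_mice=10, n_treated=5, seed=None):
    """Choose the treatment group so that every mouse is equally likely.

    The mice are assumed to be labelled 1..n_mice (an assumption made
    for this illustration).
    """
    rng = random.Random(seed)
    mice = list(range(1, n_mice + 1))
    # Sample without replacement: each mouse has the same chance.
    treated = sorted(rng.sample(mice, n_treated))
    control = [m for m in mice if m not in treated]
    return treated, control

treated, control = randomise_mice(seed=2024)
print("treatment:", treated)
print("control:  ", control)
```

The choice no longer depends on which mice are easiest to catch, so slow or sick mice are no more likely to end up in the treatment group than in the control group.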

The steps in randomisation usually consist of

Comparison Groups

Most statistical analyses involve comparisons of groups.

Ideally you take a collection of subjects or objects which are initially alike, divide them into groups, treat the groups differently and measure the outcomes (or responses).

If the investigator randomises the subjects to the respective groups and hence the intervention, the study is called an experiment.

If subjects are not randomised, it is an observational study.

Randomisation is obtained by use of a coin or a table of random numbers.

(References: Moore and McCabe, chapters 3.1-3.3, Freedman, Pisani and Purves, chapters 1 and 2)

Examples of Observational Studies

(i) Opinion Poll

Two "measurements" were made on each subject

political party supported

views on national debt

These were "measured" at the same time so this is an example of a cross-sectional study.

(ii) Cigarette smoking and lung cancer

It would be unethical to conduct an experiment because you couldn't make people smoke cigarettes (or not smoke cigarettes) and follow them for 20 years or more to see who developed lung cancer.

Also, taking a sample from the general population and classifying its members by lung cancer status and smoking status (i.e. a cross-sectional study) would result in a sample with very few cases of lung cancer. At any time there are not many people in the population who are alive and known to have lung cancer, since their life expectancy after the diagnosis of cancer is two years.

(Reference: Bonnet, Dickman et al. 1992. Cancer Survival in South Australia. Univ. Newcastle)

Therefore, a different study design is used in which a group of people with lung cancer (e.g. from a hospital) and a group of people without lung cancer (e.g. their neighbours) are selected by the investigator and asked about their cigarette smoking in the past. This is an example of a retrospective case-control study.

Confounding

A potential problem with observational studies is that there may be factors which differ between the comparison groups and which may affect the outcomes.

If such a factor affects the outcomes, its effect is called a confounding effect.

You need to look out for such factors when you analyse data and make sure you present the results appropriately.

Example - Gender and discrimination at UCB

Is there gender discrimination in the admission of graduate students at the University of California at Berkeley? (Freedman, Pisani & Purves, ch. 2)

Table 1

Students    Applied    % admitted
Men           8,442           44%
Women         4,321           35%

Admissions were decided by "major" (equivalent to Faculty in Australia).

Table 2

                  Men                      Women
Major    Applied   % admitted    Applied   % admitted
A            825           62        108           82
B            560           63         25           68
C            325           37        593           35
D            417           33        375           35
E            191           28        393           24
F            373            6        341            7
Total       2691           45       1835           30

This table doesn't cover all "majors", so the numbers are not the same as in Table 1.
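The role of "major" as a possible confounding factor can be explored directly from Table 2. The sketch below is only an illustration: it uses the figures exactly as printed in the table, and because the percentages are rounded, the overall rates it recomputes are approximate. The names table2, overall_rate and share_applying are introduced here for the example.

```python
# Table 2 above: major -> (men applied, men % admitted,
#                          women applied, women % admitted)
table2 = {
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 35),
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}

def overall_rate(rows):
    """Overall % admitted from (applied, % admitted) pairs.

    The percentages in the table are rounded, so this is approximate.
    """
    applied = sum(n for n, _ in rows)
    admitted = sum(n * pct / 100 for n, pct in rows)
    return 100 * admitted / applied

def share_applying(applied_by_major, majors):
    """Percentage of all applicants who applied to the given majors."""
    chosen = sum(applied_by_major[m] for m in majors)
    return 100 * chosen / sum(applied_by_major.values())

men = [(row[0], row[1]) for row in table2.values()]
women = [(row[2], row[3]) for row in table2.values()]
print("overall % admitted, men:  ", round(overall_rate(men), 1))
print("overall % admitted, women:", round(overall_rate(women), 1))

# Majors in which women have the higher admission rate.
women_higher = [m for m, row in table2.items() if row[3] > row[1]]
print("women's rate higher in majors:", women_higher)

# Share of each sex applying to majors A and B, which admit the most.
men_applied = {m: row[0] for m, row in table2.items()}
women_applied = {m: row[2] for m, row in table2.items()}
print("% of men applying to A or B:  ",
      round(share_applying(men_applied, ["A", "B"]), 1))
print("% of women applying to A or B:",
      round(share_applying(women_applied, ["A", "B"]), 1))
```

On the table's figures, women have the higher admission rate in four of the six majors (A, B, D and F) yet the lower rate overall. About half of the men but fewer than a tenth of the women applied to majors A and B, which have the highest admission rates. The choice of major therefore differs between the comparison groups and affects the outcome, which is the pattern described under Confounding.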

Questionnaire Design

There are often research questions that cannot be answered using routinely collected data (e.g. census data, Bureau of Statistics surveys). In these situations, the data required may have to be obtained through the use of questionnaires. There are many good reference texts available on the subject of questionnaire design but some of the key points to note are:

Progress check

  1. Selecting every 10th file in a filing cabinet constitutes taking a random sample.

