1998-02-13 (Week 6)
The Stages of a Statistical Study: "PPDAC"
Problem
Plan
Data
Analysis
Conclusions
Problem
- Why is the study being done? Set down a Statement of Purpose.
- Identify the Target Population for the study.
- Decide which Population Attributes you wish to measure.
- Identify the Sources of Variation. Distinguish between natural
variation (inherent in the material or population you are studying)
and measurement error (which depends on your measurement technique
and can often be reduced with better instruments or methodology).
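The distinction between natural variation and measurement error can be sketched numerically. The following is only an illustration, assuming each unit is measured the same number of times. The function name variance_components and the simulated values (mean 50, natural standard deviation 4, measurement standard deviation 1) are invented for this sketch and are not from any study discussed in class. The spread of repeat readings on the same unit estimates measurement error; the spread of the unit averages, less the share due to measurement error, estimates natural variation.

```python
import random
import statistics


def variance_components(measurements):
    """Split total variation into measurement and natural components.

    measurements: list of lists, one inner list of repeat readings per
    unit. Every unit is assumed to have the same number of repeats.
    """
    repeats = len(measurements[0])
    # Average within-unit variance: an estimate of measurement error.
    within = statistics.mean(statistics.variance(unit) for unit in measurements)
    # The variance of the unit means contains natural variation plus
    # (measurement variance / repeats), so subtract that share.
    unit_means = [statistics.mean(unit) for unit in measurements]
    between = statistics.variance(unit_means) - within / repeats
    return {"measurement": within, "natural": max(between, 0.0)}


# Simulated data (made up): 200 units, 3 repeat readings per unit.
rng = random.Random(1)
true_values = [rng.gauss(50.0, 4.0) for _ in range(200)]
data = [[t + rng.gauss(0.0, 1.0) for _ in range(3)] for t in true_values]
components = variance_components(data)
print(components)
```

If the measurement component dominates, better instruments or methodology may be worth the effort; if the natural component dominates, they will not help much.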
Plan
- Do a Preliminary Study of the Population. Is it as you were
expecting, or will you have to re-think the study from the
beginning?
- Test and study the Measurement System you propose to use. Will it
work in practice?
- Determine a Sampling Protocol. How will units be selected? How many
units will be needed?
- Is this an Observational Study (just observe what is there) or an
Experimental Study (apply an intervention or "treatment" and observe
the result of the intervention)?
- If it is an Experimental Study, choose the Treatment Levels.
- Decide in advance on the Data Preparation Protocol, aiming to
minimise the risk of transcription errors and other data loss.
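The two Sampling Protocol questions (how will units be selected, and how many) can be sketched as follows. This is one possible approach, not the only one: it assumes simple random sampling from a complete list of units (the "frame"), and it assumes the goal is to estimate a population mean to within a chosen margin, using a standard deviation guessed from the Preliminary Study. It uses the usual normal-approximation formula n = (z * sigma / margin)^2 and ignores any finite-population correction. The function names and the example numbers are invented for the illustration.

```python
import math
import random
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Units needed to estimate a mean to within +/- margin.

    sigma is an assumed standard deviation, e.g. from a preliminary study.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


def simple_random_sample(frame, n, seed=None):
    """Select n distinct units from the frame, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical example: 500 numbered units, sigma guessed as 10,
# and a desired margin of +/- 2 at 95% confidence.
frame = list(range(1, 501))
n_needed = sample_size_for_mean(sigma=10.0, margin=2.0)
chosen = simple_random_sample(frame, n_needed, seed=42)
print(n_needed, sorted(chosen)[:5])
```

Recording the seed makes the selection reproducible, which helps if the protocol is ever questioned.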
Data
- Carry out the Plan; obtain the
data. If there are difficulties doing so and it becomes apparent
that the data will be too sparse or badly biased, you will have to
loop back and re-think the Plan or perhaps revise the original
Statement of Purpose.
- Check over the data; do Data Cleaning as necessary, but be ethical:
do not bias the results by, e.g., discarding data values that do not
conform to your preconceptions.
- Look after Data Storage. Ensure
that the data are secure and accessible. Can authorised persons
access the data easily? Can unauthorised persons break into the
database? How long do you expect to keep the data? How far into
the future will the storage medium be readable? Will it
deteriorate physically? Will it require specialised software or
hardware that may become obsolete? Keep backups in different
locations. Keep the original data and a record of any editing
done.
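The advice to keep the original data and a record of any editing can be sketched in a few lines. This is a minimal illustration only: the record layout, the field names ("site", "ph"), the values, and the helper names checksum and apply_edit are all made up. Each edit is applied to a copy, the original is never altered, and a log entry records what was changed and why. A checksum of the original lets you confirm later that it has not been tampered with or corrupted.

```python
import hashlib
import json


def checksum(records):
    """SHA-256 digest of the records, for later integrity checks."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apply_edit(records, log, index, field, new_value, reason):
    """Return an edited copy of records and append an entry to log."""
    edited = [dict(r) for r in records]  # copy; the original is untouched
    log.append({
        "index": index,
        "field": field,
        "old": edited[index][field],
        "new": new_value,
        "reason": reason,
    })
    edited[index][field] = new_value
    return edited


# Made-up example: the second reading looks like a misplaced decimal point.
original = [{"site": "A", "ph": 7.1}, {"site": "B", "ph": 71.0}]
original_digest = checksum(original)
edit_log = []
cleaned = apply_edit(original, edit_log, 1, "ph", 7.1,
                     "decimal point misplaced; checked against field sheet")
print(edit_log)
```

Store the original, the log, and the digest together with the backups, so that anyone can retrace how the cleaned data were obtained.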
Analysis
- Use exploratory data analysis (EDA) methods to get a Data Summary.
This may indicate the need to go back for more extensive Data
Cleaning. It should give an early indication of which relationships,
if any, are evident in the data.
- Do a Formal Analysis (ANOVA or other tests of hypotheses). Be aware
that testing hypotheses suggested by EDA on the same data will
exaggerate their apparent level of significance.
- Do an Assessment of your results.
Look for sources of bias in the sampling. Consider how small
samples lead to unreliable conclusions while large samples can
attach a high level of significance to an effect that is too small
to be scientifically interesting. At this point you may decide
that more work is needed and that a formal report on the study
would be premature.
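Two of the cautions above can be checked numerically. The sketch below is an illustration under simplifying assumptions (normally distributed test statistics, known standard deviation, independent comparisons); the function names and the chosen numbers are invented. The first function shows how a fixed, tiny effect becomes "highly significant" once the sample is large enough. The second simulates data with no real effects at all, lets "EDA" pick the most striking of k comparisons, and then tests that one as if it had been chosen in advance.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(effect, sigma, n):
    """P-value of a z-test for a mean shift of `effect` with n units."""
    z = effect / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate_after_peeking(k, trials, seed=0):
    """Share of no-effect data sets in which the largest of k
    z-statistics exceeds the nominal 5% critical value."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(trials):
        zs = [rng.gauss(0.0, 1.0) for _ in range(k)]
        if max(abs(z) for z in zs) > critical:
            hits += 1
    return hits / trials


# A shift of 0.02 standard deviations, at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(effect=0.02, sigma=1.0, n=n))

# One pre-planned test versus the "best" of 10 found by looking.
print(false_positive_rate_after_peeking(k=1, trials=2000))
print(false_positive_rate_after_peeking(k=10, trials=2000))
```

With k independent comparisons and no real effects, the chance that at least one passes a 5% test is 1 - 0.95^k, which is about 40% for k = 10. That is the sense in which hypotheses suggested by the data look more significant than they are.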
Conclusions
- Write a Plain Language Report.
Say exactly what you did and what you have learned from the study
in words that any educated person can understand. At this point,
the process by which you reached your conclusions is less
important than the conclusions themselves and the evidence you
give to support them.
- Give an honest assessment of the Limitations of the Study: adequacy of the
sample size, sources of sampling bias, precision of the estimates,
applicability to other populations, etc.
- Give your Recommendations. If you
found serious limitations of the study, one recommendation might
be to repeat the whole cycle, perhaps even re-defining the
original Problem.
What to Prepare for the Next Class...
Consider our work on the "Effects of effluent on the Kapuskasing
River" data in terms of the PPDAC model. At what stage did we enter
the process? What do we know about earlier stages of the study? Is
there anything we do not know about the earlier stages but need to
know to make our analysis more accurate or relevant? If it were
decided to go back and re-do earlier stages of the study (not likely,
as it would set the work back by a year or more!), what
recommendations would you make?
Would the Report you submitted for Assignment #1 be suitable as
the "Plain Language Report" described above? If not, how should it be
modified?
Re-read the Preface, the Prelude, "How to tackle statistical
problems," and Chapters 1 to 6 of Chatfield, Problem Solving: A
Statistician's Guide. Much of what he says is summarised above.
What other interesting ideas do you find?
The discussion of Box Plots from February 13 is summarised in
"How to Interpret a Box Plot in Terms of a Normal Distribution."