Verifying your Assumptions
Independence
- Review the experimental protocol, check out the equipment used. Could your subjects confer together and share their opinions? Did they know whether they were assigned to the treatment or to the control group? Could your instrumentation lose its calibration? Was there contamination carried over from one determination to the next? What you can do here will depend on the context of the study, and how much information, in addition to numerical data, is available to you.
- If you suspect that there is autocorrelation, you can check the data with lagged scatter plots or with Time Series Analysis, but these methods are only useful if there are several hundred observations in the series.
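The lag-1 autocorrelation behind those lagged scatter plots is easy to compute directly. A minimal Python sketch (the notes use R; the function name `lag1_autocorr` is just for illustration):

```python
def lag1_autocorr(x):
    """Sample lag-1 autocorrelation: near 0 for independent data,
    large for trending or serially correlated data."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# A steadily increasing series is strongly autocorrelated:
print(lag1_autocorr(list(range(10))))  # 0.7
```

A lagged scatter plot is just this quantity drawn out: plot x[t+1] against x[t] and look for a trend.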
Normality
- Try histograms, stem-and-leaf plots, box plots, anything that will indicate whether the shape of the distribution departs too much from a symmetric bell curve. Probability plots aren't on the course, and, curiously, they aren't in Rosner, but if you do qqnorm(x) and qqline(x) in R, normally-distributed data will lie on a straight line, while non-normal data will show a curve. None of these plots will tell you much unless you have at least 40 or 50 observations.
Homoscedasticity
- If you are comparing means in different groups, you usually want to assume that the variance is the same in each group. You can compare two variances using the F-test, but you know that the test isn't very powerful unless you have a large sample in each group, and it isn't even valid unless you have independent, normal data from each group (see above!).
- Comparative Box Plots are a useful way of checking the data to see if the dispersion is more or less comparable in each group.
- If you are fitting a regression line, the conditional variance should be the same at any point on the line. For simple linear regression, plot the data on a scatter plot with the fitted line and check that the scatter of points is reasonably consistent along the line. For multiple regression, plot the residuals against fitted values and check that the scatter of points is reasonably consistent across the graph.
- If the variance increases with the mean (groups with larger means are more variable), then try log or square-root transformations. Redraw the comparative box plots to see if the variances are more equal with the transformed data. Of course, comparing means of transformed data isn't the same as comparing means in the original scale, and a linear relationship on a log scale is not a linear relationship on the original scale, but that is an unavoidable problem with transformations.
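The F-test for comparing two variances, mentioned above, can be sketched as follows (a Python sketch using scipy; the data and the function name `var_ratio_test` are made up for illustration):

```python
import statistics
from scipy.stats import f  # F distribution

def var_ratio_test(x, y):
    """Two-sided F-test for equality of two variances.
    Only valid for independent, normally distributed samples."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    # Put the larger sample variance on top so F >= 1.
    if vx >= vy:
        F, dfn, dfd = vx / vy, len(x) - 1, len(y) - 1
    else:
        F, dfn, dfd = vy / vx, len(y) - 1, len(x) - 1
    p = 2 * f.sf(F, dfn, dfd)  # two-sided p-value
    return F, min(p, 1.0)

group_a = [5.1, 4.9, 5.4, 4.7, 5.0, 5.2]   # small spread
group_b = [5.0, 6.1, 4.2, 5.8, 3.9, 6.0]   # similar mean, larger spread
F, p = var_ratio_test(group_a, group_b)
```

Note how small the samples are here: even a large variance ratio gives only modest evidence, which is the lack of power referred to above.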
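The effect of a variance-stabilizing transformation can be seen numerically. A minimal Python sketch with made-up data in which the standard deviation is proportional to the mean, so the log transform equalizes the spread exactly:

```python
import math
import statistics

# Two made-up groups where the spread grows with the mean:
low  = [10, 12, 8, 11, 9]
high = [100, 120, 80, 110, 90]   # same shape, ten times larger

# On the original scale the SDs differ by a factor of 10:
print(statistics.stdev(low), statistics.stdev(high))

# After a log transform the SDs are identical, because
# log(10 * x) = log(10) + log(x) just shifts the group:
log_low  = [math.log(v) for v in low]
log_high = [math.log(v) for v in high]
print(statistics.stdev(log_low), statistics.stdev(log_high))
```

Real data won't line up this neatly, which is why the notes say to redraw the comparative box plots and look, rather than expect exact equality.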
Linearity
- If you have repeated x-values in a simple linear regression, you can compute "Pure Error" and "Lack of Fit" terms in the regression ANOVA to test the assumption that the true relationship is a straight line. A less satisfactory method is to fit a quadratic model and test to see if the fit is significantly better than the linear model.
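The Pure Error / Lack of Fit split can be sketched directly: Pure Error is the variation within groups of repeated x-values, Lack of Fit is whatever is left of the residual sum of squares, and the ratio of their mean squares is compared to an F distribution with (k − 2, n − k) degrees of freedom, where k is the number of distinct x-values. A Python sketch (the function name `lack_of_fit` is made up for illustration):

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Split the residual SS from a simple linear regression into
    Pure Error (within repeated x-values) and Lack of Fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_pe = sum(sum((yi - sum(g) / len(g)) ** 2 for yi in g)
                for g in groups.values())
    ss_lof = sse - ss_pe
    k = len(groups)  # number of distinct x-values
    F = (ss_lof / (k - 2)) / (ss_pe / (n - k))
    return ss_lof, ss_pe, F

# Replicated x-values whose group means lie exactly on a line,
# so all the residual variation is Pure Error:
x = [1, 1, 2, 2, 3, 3]
y = [1.1, 0.9, 2.0, 2.0, 3.1, 2.9]
ss_lof, ss_pe, F = lack_of_fit(x, y)   # ss_lof is (numerically) zero
```

A large F here would be evidence that the straight-line model doesn't fit; in this toy example the group means are exactly linear, so there is no lack of fit at all.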
Robustness
A statistical method is said to be "Robust" if it does what it is supposed to do even if the assumptions aren't satisfied. Generally, methods are more robust in large samples than they are with small samples. This is frustrating, because it is only with large samples that you can test the assumptions.
An example is the t-test. The t distribution is only justified if the data are independent and normal, but with enough degrees of freedom the t distribution approaches the standard normal, so it makes little difference whether you treat the variance as known or estimated; hence the assumption that (n-1)s²/σ² follows a Chi-squared distribution (which is only true for normal data) is no longer important.
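The convergence of the t distribution to the standard normal is easy to see numerically. A Python sketch using scipy, comparing 97.5% quantiles as the degrees of freedom grow:

```python
from scipy.stats import t, norm

z = norm.ppf(0.975)            # about 1.96
for df in (5, 30, 1000):
    # t quantiles shrink towards the normal quantile as df grows
    print(df, t.ppf(0.975, df))
```

With 5 degrees of freedom the t quantile is well above 2.5, but by 1000 degrees of freedom it is essentially indistinguishable from 1.96, which is why the normality of s² stops mattering in large samples.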