The following data (taken from Rosner (2000), Fundamentals of Biostatistics, 5th Edition, p. 39) shows data collected on patients discharged from a hospital. Using GLMs, determine which factors predict duration of stay. Try normal error with identity link and gamma error with inverse link. Try modelling log(duration) with normal error and identity link. Which of these is best? Use the dispersion parameter to estimate the gamma shape parameter.
The variables are: id number; duration of hospital stay; age; sex (1=M, 2=F); first temperature following admission; first WBC (¥ 103) following admission; received antibiotic (1=Y, 2=N); received bacterial culture (1=Y, 2=N); service (1=medical, 2=surgical).
Be sure to make the categorical variables factors.
The models duration ~ age + temp1 + wbc1 and log(duration) ~ age + temp1 + wbc1 are interesting, try them to get started.
fit1n <- glm(duration~age+temp1+wbc1, data=hosp) summary(fit1n) anova(fit1n, test="F") fit1g <- glm(duration~age+temp1+wbc1, data=hosp, family=Gamma(link=inverse)) summary(fit1g) anova(fit1g, test="F") fit1l <- glm(log(duration)~age+temp1+wbc1, data=hosp) summary(fit1l) anova(fit1l, test="F") id duration age sex temp1 wbc1 antib bact serv 1 5 30 2 99.0 8 2 2 1 2 10 73 2 98.0 5 2 1 1 3 6 40 2 99.0 12 2 2 2 4 11 47 2 98.2 4 2 2 2 5 5 25 2 98.5 11 2 2 2 6 14 82 1 96.8 6 1 2 2 7 30 60 1 99.5 8 1 1 1 8 11 56 2 98.6 7 2 2 1 9 17 43 2 98.0 7 2 2 1 10 3 50 1 98.0 12 2 1 2 11 9 59 2 97.6 7 2 1 1 12 3 4 1 97.8 3 2 2 2 13 8 22 2 99.5 11 1 2 2 14 8 33 2 98.4 14 1 1 2 15 5 20 2 98.4 11 2 1 2 16 5 32 1 99.0 9 2 2 2 17 7 36 1 99.2 6 1 2 2 18 4 69 1 98.0 6 2 2 2 19 3 47 1 97.0 5 1 2 1 20 7 22 1 98.2 6 2 2 2 21 9 11 1 98.2 10 2 2 2 22 11 19 1 98.6 14 1 2 2 23 11 67 2 97.6 4 2 2 1 24 9 43 2 98.6 5 2 2 2 25 4 41 2 98.0 5 2 2 1
The following table (from Bishop, Fienberg & Holland (1975), Discrete Multivariate Analysis, MIT Press, p. 103 ) shows the total number of patients and the number surviving, in 3 cities, 3 age groups, classified by the initial histology of the tumour (less/more inflammation, benign/malignant). Use logistic regression to determine which variables predict survival.
There are several different ways to set up binomial data for GLM analysis in Splus; this is the simplest:
fitc1 <- glm(cbind(survive, n-survive) ~ city*benign, data=cancer.logit, family = binomial(link=logit))
Use anova(fitc1, test="Chi") to test the significance of individual terms in the model and the overall goodness of fit. You could fit the saturated model, inflam*benign*city*age, then use step() and update() to remove terms until you have a model you like.
This example can also be run using GLMStat on the Macintosh in BSB-102; the data are already entered in the GLMStat file cancer.logit in the Cancer folder.
The example can also be analysed as a loglinear model, but it requires a different data format. Click here for a detailed comparison.
When you input the data, be sure to make the categorical variables factors.
inflam benign city age survive n 1 1 1 1 26 35 1 1 1 2 20 29 1 1 1 3 1 3 1 1 2 1 11 17 1 1 2 2 18 26 1 1 2 3 15 24 1 1 3 1 16 32 1 1 3 2 27 41 1 1 3 3 12 15 1 2 1 1 68 75 1 2 1 2 46 55 1 2 1 3 6 9 1 2 2 1 24 31 1 2 2 2 58 78 1 2 2 3 26 44 1 2 3 1 20 27 1 2 3 2 39 51 1 2 3 3 11 18 2 1 1 1 25 29 2 1 1 2 18 29 2 1 1 3 5 6 2 1 2 1 4 10 2 1 2 2 10 13 2 1 2 3 1 4 2 1 3 1 8 11 2 1 3 2 10 13 2 1 3 3 4 7 2 2 1 1 9 12 2 2 1 2 5 7 2 2 1 3 1 1 2 2 2 1 0 0 2 2 2 2 3 5 2 2 2 3 1 1 2 2 3 1 1 1 2 2 3 2 4 4 2 2 3 3 1 1