William Dupont’s Statistical Modeling for Biomedical Researchers, Second Edition is ideal for a one-semester graduate course in biostatistics and epidemiology. Dupont assumes only a basic knowledge of statistics, such as that obtained from a standard introductory statistics course. Stata is used extensively throughout the text, making it possible to introduce computationally complex methods with little or no higher-level mathematics. As a result, Dupont focuses on concepts and model assumptions, rather than on the underlying mathematics. The text covers linear regression, logistic regression, Poisson regression, survival analysis, and analysis of variance. Two chapters are devoted to each topic: an introductory chapter that uses simple data to develop the concept and a more advanced chapter devoted to explaining more complex models, case studies, diagnostic measures, etc.
Dupont pays equal attention to the methods and to using Stata to apply them. When Stata output is displayed, the most important elements of the output are highlighted and explained in notes that follow the output. These notes help the reader make sense of the output by providing the appropriate focus for the problem at hand. The notes also include instructions for reproducing the analysis via Stata’s point-and-click user interface. The text, replete with examples featuring real medical data, uses Stata graphics extensively, providing ample explanation and detail for reproduction.
1. INTRODUCTION
Algebraic notation
Descriptive statistics
Dot plot
Sample mean
Residual
Sample variance
Sample standard deviation
Percentile and median
Box plot
Histogram
Scatter plot
The Stata Statistical Software Package
Downloading data from my website
Creating histograms with Stata
Stata command syntax
Obtaining interactive help from Stata
Stata log files
Stata graphics and schemes
Stata do files
Stata pulldown menus
Displaying other descriptive statistics with Stata
Inferential statistics
Probability density function
Mean, variance, and standard deviation
Normal distribution
Expected value
Standard error
Null hypothesis, alternative hypothesis, and P-value
95% confidence interval
Statistical power
The z and Student’s t distributions
Paired t test
Performing paired t tests with Stata
Independent t test using a pooled standard error estimate
Independent t test using separate standard error estimates
Independent t tests using Stata
The chi-squared distribution
Overview of methods discussed in this text
Models with one response per patient
Models with multiple responses per patient
Additional reading
Exercises
2. SIMPLE LINEAR REGRESSION
Sample covariance
Sample correlation coefficient
Population covariance and correlation coefficient
Conditional expectation
Simple linear regression model
Fitting the linear regression model
Historical trivia: origin of the term regression
Determining the accuracy of linear regression estimates
Ethylene glycol poisoning example
95% confidence interval for y[x] = α + βx evaluated at x
95% prediction interval for the response of a new patient
Simple linear regression with Stata
Lowess regression
Plotting a lowess regression curve in Stata
Residual analyses
Studentized residual analysis using Stata
Transforming the x and y variables
Stabilizing the variance
Correcting for non-linearity
Example: research funding and morbidity for 29 diseases
Analyzing transformed data with Stata
Testing the equality of regression slopes
Example: the Framingham Heart Study
Comparing slope estimates with Stata
Density-distribution sunflower plots
Creating density-distribution sunflower plots with Stata
Additional reading
Exercises
3. MULTIPLE LINEAR REGRESSION
The model
Confounding variables
Estimating the parameters for a multiple linear regression model
R² statistic for multiple regression models
Expected response in the multiple regression model
The accuracy of multiple regression parameter estimates
Hypothesis tests
Leverage
95% confidence interval for ŷ_i
95% prediction intervals
Example: the Framingham Heart Study
Preliminary univariate analyses
Scatter plot matrix graphs
Producing scatter plot matrix graphs with Stata
Modeling interaction in multiple linear regression
The Framingham example
Multiple regression modeling of the Framingham data
Intuitive understanding of a multiple regression model
The Framingham example
Calculating 95% confidence and prediction intervals
Multiple linear regression with Stata
Automatic methods of model selection
Forward selection using Stata
Backward selection
Forward stepwise selection
Backward stepwise selection
Pros and cons of automated model selection
Collinearity
Residual analyses
Influence
Δβ̂ influence statistic
Cook’s distance
The Framingham example
Residual and influence analyses using Stata
Using multiple linear regression for non-linear models
Building non-linear models with restricted cubic splines
Choosing the knots for a restricted cubic spline model
The SUPPORT Study of hospitalized patients
Modeling length-of-stay and MAP using restricted cubic splines
Using Stata for non-linear models with restricted cubic splines
Additional reading
Exercises
4. SIMPLE LOGISTIC REGRESSION
Example: APACHE score and mortality in patients with sepsis
Sigmoidal family of logistic regression curves
The log odds of death given a logistic probability function
The binomial distribution
Simple logistic regression model
Generalized linear model
Contrast between logistic and linear regression
Maximum likelihood estimation
Variance of maximum likelihood parameter estimates
Statistical tests and confidence intervals
Likelihood ratio tests
Quadratic approximations to the log likelihood ratio function
Score tests
Wald tests and confidence intervals
Which test should you use?
Sepsis example
Logistic regression with Stata
Odds ratios and the logistic regression model
95% confidence interval for the odds ratio associated with a unit increase in x
Calculating this odds ratio with Stata
Logistic regression with grouped response data
95% confidence interval for π[x]
Exact 100(1 − α)% confidence intervals for proportions
Example: the Ibuprofen in Sepsis Study
Logistic regression with grouped data using Stata
Simple 2 × 2 case–control studies
Example: the Ille-et-Vilaine study of esophageal cancer and alcohol
Review of classical case–control theory
95% confidence interval for the odds ratio: Woolf’s method
Test of the null hypothesis that the odds ratio equals one
Test of the null hypothesis that two proportions are equal
Logistic regression models for 2 × 2 contingency tables
Nuisance parameters
95% confidence interval for the odds ratio: logistic regression
Creating a Stata data file
Analyzing case–control data with Stata
Regressing disease against exposure
Additional reading
Exercises
5. MULTIPLE LOGISTIC REGRESSION
Mantel–Haenszel estimate of an age-adjusted odds ratio
Mantel–Haenszel χ² statistic for multiple 2 × 2 tables
95% confidence interval for the age-adjusted odds ratio
Breslow–Day–Tarone test for homogeneity
Calculating the Mantel–Haenszel odds ratio using Stata
Multiple logistic regression model
Likelihood ratio test of the influence of the covariates on the response variable
95% confidence interval for an adjusted odds ratio
Logistic regression for multiple 2 × 2 contingency tables
Analyzing multiple 2 × 2 tables with Stata
Handling categorical variables in Stata
Effect of dose of alcohol on esophageal cancer risk
Analyzing model (5.25) with Stata
Effect of dose of tobacco on esophageal cancer risk
Deriving odds ratios from multiple parameters
The standard error of a weighted sum of regression coefficients
Confidence intervals for weighted sums of coefficients
Hypothesis tests for weighted sums of coefficients
The estimated variance–covariance matrix
Multiplicative models of two risk factors
Multiplicative model of smoking, alcohol, and esophageal cancer
Fitting a multiplicative model with Stata
Model of two risk factors with interaction
Model of alcohol, tobacco, and esophageal cancer with interaction terms
Fitting a model with interaction using Stata
Model fitting: nested models and model deviance
Effect modifiers and confounding variables
Goodness-of-fit tests
The Pearson χ² goodness-of-fit statistic
Hosmer–Lemeshow goodness-of-fit test
An example: the Ille-et-Vilaine cancer data set
Residual and influence analysis
Standardized Pearson residual
Δβ̂_j influence statistic
Residual plots of the Ille-et-Vilaine data on esophageal cancer
Using Stata for goodness-of-fit tests and residual analyses
Frequency matched case–control studies
Conditional logistic regression
Analyzing data with missing values
Imputing data that is missing at random
Cardiac output in the Ibuprofen in Sepsis Study
Modeling missing values with Stata
Logistic regression using restricted cubic splines
Odds ratios from restricted cubic spline models
95% confidence intervals for ψ̂[x]
Modeling hospital mortality in the SUPPORT Study
Using Stata for logistic regression with restricted cubic splines
Regression methods with a categorical response variable
Proportional odds logistic regression
Polytomous logistic regression
Additional reading
Exercises
6. INTRODUCTION TO SURVIVAL ANALYSIS
Survival and cumulative mortality functions
Right censored data
Kaplan–Meier survival curves
An example: genetic risk of recurrent intracerebral hemorrhage
95% confidence intervals for survival functions
Cumulative mortality function
Censoring and bias
Log-rank test
Using Stata to derive survival functions and the log-rank test
Log-rank test for multiple patient groups
Hazard functions
Proportional hazards
Relative risks and hazard ratios
Proportional hazards regression analysis
Hazard regression analysis of the intracerebral hemorrhage data
Proportional hazards regression analysis with Stata
Tied failure times
Additional reading
Exercises
7. HAZARD REGRESSION ANALYSIS
Proportional hazards model
Relative risks and hazard ratios
95% confidence intervals and hypothesis tests
Nested models and model deviance
An example: the Framingham Heart Study
Kaplan–Meier survival curves for DBP
Simple hazard regression model for CHD risk and DBP
Restricted cubic spline model of CHD risk and DBP
Categorical hazard regression model of CHD risk and DBP
Simple hazard regression model of CHD risk and gender
Multiplicative model of DBP and gender on risk of CHD
Using interaction terms to model the effects of gender and DBP on CHD
Adjusting for confounding variables
Interpretation
Alternative models
Proportional hazards regression analysis using Stata
Stratified proportional hazards models
Survival analysis with ragged study entry
Kaplan–Meier survival curve and the log-rank test with ragged entry
Age, sex, and CHD in the Framingham Heart Study
Proportional hazards regression analysis with ragged entry
Survival analysis with ragged entry using Stata
Predicted survival, log–log plots, and the proportional hazards assumption
Evaluating the proportional hazards assumption with Stata
Hazard regression models with time-dependent covariates
Testing the proportional hazards assumption
Modeling time-dependent covariates with Stata
Additional reading
Exercises
8. INTRODUCTION TO POISSON REGRESSION: INFERENCES ON MORBIDITY AND MORTALITY RATES
Elementary statistics involving rates
Calculating relative risks from incidence data using Stata
The binomial and Poisson distributions
Simple Poisson regression for 2 × 2 tables
Poisson regression and the generalized linear model
Contrast between Poisson, logistic, and linear regression
Simple Poisson regression with Stata
Poisson regression and survival analysis
Recoding survival data on patients as patient–year data
Converting survival records to person–years of follow-up using Stata
Converting the Framingham survival data set to person–time data
Simple Poisson regression with multiple data records
Poisson regression with a classification variable
Applying simple Poisson regression to the Framingham data
Additional reading
Exercises
9. MULTIPLE POISSON REGRESSION
Multiple Poisson regression model
An example: the Framingham Heart Study
A multiplicative model of gender, age, and coronary heart disease
A model of age, gender, and CHD with interaction terms
Adding confounding variables to the model
Using Stata to perform Poisson regression
Residual analyses for Poisson regression models
Deviance residuals
Residual analysis of Poisson regression models using Stata
Additional reading
Exercises
10. FIXED EFFECTS ANALYSIS OF VARIANCE
One-way analysis of variance
Multiple comparisons
Reformulating analysis of variance as a linear regression model
Non-parametric methods
Kruskal–Wallis test
Example: a polymorphism in the estrogen receptor gene
User contributed software in Stata
One-way analyses of variance using Stata
Two-way analysis of variance, analysis of covariance, and other models
Additional reading
Exercises
11. REPEATED-MEASURES ANALYSIS OF VARIANCE
Example: effect of race and dose of isoproterenol on blood flow
Exploratory analysis of repeated measures data using Stata
Response feature analysis
Example: the isoproterenol data set
Response feature analysis using Stata
The area-under-the-curve response feature
Generalized estimating equations
Common correlation structures
GEE analysis and the Huber–White sandwich estimator
Example: analyzing the isoproterenol data with GEE
Using Stata to analyze the isoproterenol data set using GEE
GEE analyses with logistic or Poisson models
Additional reading
Exercises
Appendices
A. SUMMARY OF STATISTICAL MODELS DISCUSSED IN THIS TEXT
Models for continuous response variables with one response per patient
Models for dichotomous or categorical response variables with one response per patient
Models for survival data (follow-up time plus fate at exit observed on each patient)
Models for response variables that are event rates or the number of events during a specified number of patient–years of follow-up. The event must be rare
Models with multiple observations per patient or matched or clustered patients
B. SUMMARY OF STATA COMMANDS USED IN THIS TEXT
Data manipulation and description
Analysis commands
Graph commands
Common options for graph commands (insert after comma)
Post-estimation commands (affected by preceding regression-type command)
Command prefixes
Command qualifiers (insert before comma)
Logical and relational operators and system variables (see Stata User’s Guide)
Functions (see Stata Data Management Manual)