INTRODUCTION
In this tutorial, we will examine several ways to use formula strings for generalized linear models. The formula string specification in the GAUSS procedure glm requires at least three inputs: the dataset name, the formula, and the distribution family. In addition, an optional control structure may be used to specify the link function and control other aspects of estimation.
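As a rough illustration of the optional control structure, the sketch below requests a probit link instead of the family default. It assumes that the glmControl structure, the glmControlCreate procedure, and the link member are available as in recent GAUSS releases; check the glm reference for your version before relying on these names. The dataset binary.csv and its variables are described in the next section.

```gauss
// Assumption: glmControl, glmControlCreate and the 'link' member
// exist as in recent GAUSS releases
struct glmControl ctl;
ctl = glmControlCreate();

// Request a probit link rather than the default link for the family
ctl.link = "probit";

// Create string with fully pathed file name
fname = getGAUSShome() $+ "examples/binary.csv";

// Pass the control structure as the optional fourth input
call glm(fname, "admit ~ gre + gpa", "binomial", ctl);
```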
LOGISTIC REGRESSION WITH A FORMULA STRING
To begin, let’s consider a simple logistic regression. This model estimates the impact of two variables, gre and gpa, on college admission. Let’s consider our three inputs: dataset, formula, and distribution family.
- Dataset
  - Data for this example is stored in the dataset binary.csv. This file can be passed directly to glm, as GAUSS formula string syntax supports CSV, Excel (XLS, XLSX), HDF5, GAUSS Matrix (FMT), GAUSS Dataset (DAT), Stata (DTA), and SAS (SAS7BDAT, SAS7BCAT) dataset types.
- Formula
  - In a model with a dependent (or response) variable, the formula lists the dependent variable first, followed by a tilde ~ and then the independent variables. Independent variables should be separated by a +. The formula for this model is admit ~ gre + gpa.
- Distribution Family
  - We wish to estimate a logistic regression with a binary dependent variable, so we use the "binomial" distribution family specification.
Putting it all together:
// Create string with fully pathed file name
fname = getGAUSShome() $+ "examples/binary.csv";

// Call glm function with formula string
call glm(fname, "admit ~ gre + gpa", "binomial");
The printed results:
Generalized Linear Model

Valid cases:             400    Dependent Variable:      admit
Degrees of freedom:      397    Distribution:         binomial
Deviance:              480.3    Link function:           logit
Pearson Chi-square:    398.1    AIC:                     486.3
Log likelihood:       -240.2    BIC:                     498.3
Dispersion:                1    Iterations:                  4

                                    Standard                        Prob
Variable              Estimate         Error       z-value          >|z|
----------------  ------------  ------------  ------------  ------------
CONSTANT               -4.9494        1.0751       -4.6037      < 0.0001
gre                  0.0026907     0.0010575        2.5444     0.0109465
gpa                    0.75469       0.31959        2.3615     0.0182034

Note: Dispersion parameter for BINOMIAL distribution taken to be 1
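Written out, the estimates in this table correspond to the fitted equation below. Here p denotes the probability that admit equals 1; this notation is introduced for illustration and does not appear in the GAUSS output.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = -4.9494 + 0.0026907\,\mathrm{gre} + 0.75469\,\mathrm{gpa}
```

The left-hand side is the log-odds of admission, which is what the logit link reported in the output models as a linear function of the predictors.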
INCLUDE FACTOR VARIABLES
Now, let’s extend our previous model to include the categorical variable rank. To specify that a variable is categorical in a formula, we use factor followed by the name of the variable inside a pair of parentheses. Using factor in the formula string tells GAUSS that dummy variables representing the different categories of rank should be included in the regression. The formula for our extended model will be "admit ~ factor(rank) + gre + gpa".
// Call glm function with formula string using
// 'factor' keyword to create dummy variables
call glm(fname, "admit ~ factor(rank) + gre + gpa", "binomial");
The printed output table now includes coefficients for rank = 2, 3, and 4. Note that rank = 1 is automatically excluded from the regression as the base level.
Generalized Linear Model

Valid cases:             400    Dependent Variable:      admit
Degrees of freedom:      394    Distribution:         binomial
Deviance:              458.5    Link function:           logit
Pearson Chi-square:    397.5    AIC:                     470.5
Log likelihood:       -229.3    BIC:                     494.5
Dispersion:                1    Iterations:                  4

                                    Standard                        Prob
Variable              Estimate         Error       z-value          >|z|
----------------  ------------  ------------  ------------  ------------
CONSTANT                 -3.99          1.14       -3.5001   0.000465027
rank: 2               -0.67544       0.31649       -2.1342     0.0328288
rank: 3                -1.3402       0.34531       -3.8812   0.000103942
rank: 4                -1.5515       0.41783       -3.7131   0.000204711
gre                  0.0022644      0.001094        2.0699     0.0384651
gpa                    0.80404       0.33182        2.4231     0.0153879
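In equation form, factor(rank) contributes one indicator term for each non-base level of rank. A sketch of the linear predictor follows, where p is the probability that admit equals 1 and the β symbols are notation introduced here for illustration only:

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_{r2}\,\mathbf{1}[\mathrm{rank}=2]
  + \beta_{r3}\,\mathbf{1}[\mathrm{rank}=3]
  + \beta_{r4}\,\mathbf{1}[\mathrm{rank}=4]
  + \beta_{\mathrm{gre}}\,\mathrm{gre}
  + \beta_{\mathrm{gpa}}\,\mathrm{gpa}
```

Because rank = 1 is the base level, all three indicators are zero for those observations, and each rank coefficient in the table is read as a shift in log-odds relative to rank = 1.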
INCLUDE INTERACTION EFFECTS
Now let’s look at extending our model one step further to include interaction effects using formula strings. Two different operators are available for adding interaction terms. The colon operator, :, adds only the pure interaction term, while the asterisk, *, adds each individual term as well as the interaction term.
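The difference between the two operators can be written out as the terms each one contributes to the linear predictor. The β symbols below are illustrative notation, not names used in GAUSS output:

```latex
\begin{align*}
\texttt{gre:gpa} &\;\Rightarrow\; \beta_{\mathrm{gre:gpa}}\,(\mathrm{gre}\times\mathrm{gpa}) \\
\texttt{gre*gpa} &\;\Rightarrow\; \beta_{\mathrm{gre}}\,\mathrm{gre}
  + \beta_{\mathrm{gpa}}\,\mathrm{gpa}
  + \beta_{\mathrm{gre:gpa}}\,(\mathrm{gre}\times\mathrm{gpa})
\end{align*}
```

In other words, gre*gpa is shorthand for gre + gpa + gre:gpa.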
Let’s first consider using : to add the interaction of gpa and gre to our model. In this case the formula for our model is "admit ~ factor(rank) + gre + gpa + gre:gpa".
// Call glm function using ":"
call glm(fname, "admit ~ factor(rank) + gre + gpa + gre:gpa", "binomial");
In the output from this call we see that the coefficient for the interaction term gre:gpa has been added to our output table just below the coefficient for gpa.
Generalized Linear Model

Valid cases:             400    Dependent Variable:      admit
Degrees of freedom:      393    Distribution:         binomial
Deviance:              455.8    Link function:           logit
Pearson Chi-square:    401.8    AIC:                     469.8
Log likelihood:       -227.9    BIC:                     497.7
Dispersion:                1    Iterations:                  4

                                    Standard                        Prob
Variable              Estimate         Error       z-value          >|z|
----------------  ------------  ------------  ------------  ------------
CONSTANT               -13.609        6.0711       -2.2416     0.0249902
rank: 2                -0.7217       0.31915       -2.2613     0.0237401
rank: 3                -1.3435       0.34638       -3.8786   0.000105077
rank: 4                -1.6063       0.42099       -3.8155   0.000135897
gre                   0.018344      0.009968        1.8403     0.0657216
gpa                     3.6522        1.7884        2.0421     0.0411427
gre:gpa              -0.004719     0.0028977       -1.6285      0.103418

Note: Dispersion parameter for BINOMIAL distribution taken to be 1
Now we will estimate the same model using *. In this case the formula for our model is "admit ~ factor(rank) + gre*gpa".
// Call glm function using "*"
call glm(fname, "admit ~ factor(rank) + gre*gpa", "binomial");
The resulting output table is exactly the same as in the case using : and shows that coefficients for gre, gpa, and gre:gpa are estimated.
Generalized Linear Model

Valid cases:             400    Dependent Variable:      admit
Degrees of freedom:      393    Distribution:         binomial
Deviance:              455.8    Link function:           logit
Pearson Chi-square:    401.8    AIC:                     469.8
Log likelihood:       -227.9    BIC:                     497.7
Dispersion:                1    Iterations:                  4

                                    Standard                        Prob
Variable              Estimate         Error       z-value          >|z|
----------------  ------------  ------------  ------------  ------------
CONSTANT               -13.609        6.0711       -2.2416     0.0249902
rank: 2                -0.7217       0.31915       -2.2613     0.0237401
rank: 3                -1.3435       0.34638       -3.8786   0.000105077
rank: 4                -1.6063       0.42099       -3.8155   0.000135897
gre                   0.018344      0.009968        1.8403     0.0657216
gpa                     3.6522        1.7884        2.0421     0.0411427
gre:gpa              -0.004719     0.0028977       -1.6285      0.103418

Note: Dispersion parameter for BINOMIAL distribution taken to be 1
GLM WITHOUT A CONSTANT
As a final adjustment to our model, let’s remove the constant from our regression. The default when using GAUSS formulas for glm is to include a constant in the model. To run the model without a constant, we must add a -1 after the ~ in our formula. The -1 should be the first item in our list of independent variables. To remove the constant from our previous model we use the formula "admit ~ -1 + factor(rank) + gre*gpa".
// Call glm function using "-1" to remove the constant
call glm(fname, "admit ~ -1 + factor(rank) + gre*gpa", "binomial");
The resulting output table is:
Valid cases:             400    Dependent Variable:      admit
Degrees of freedom:      394    Distribution:         binomial
Deviance:              461.2    Link function:           logit
Pearson Chi-square:    397.6    AIC:                     473.2
Log likelihood:       -230.6    BIC:                     497.1
Dispersion:                1    Iterations:                  4

                                    Standard                        Prob
Variable              Estimate         Error       z-value          >|z|
----------------  ------------  ------------  ------------  ------------
rank: 2               -0.69527       0.31615       -2.1992     0.0278659
rank: 3                -1.3651       0.34366       -3.9723      < 0.0001
rank: 4                -1.5741       0.41732        -3.772   0.000161944
gre                 -0.0037016     0.0019285       -1.9194     0.0549365
gpa                   -0.35181       0.20923       -1.6814     0.0926763
gre:gpa              0.0017265      0.000545        3.1679    0.00153547

Note: Dispersion parameter for BINOMIAL distribution taken to be 1