INTRODUCTION

In this tutorial, we will examine several ways to use formula strings for generalized linear models. The formula string specification in the GAUSS procedure glm requires at least three inputs: the dataset name, the formula, and the distribution family. In addition, an optional control structure may be used to specify the link function and control other aspects of estimation.

LOGISTIC REGRESSION WITH A FORMULA STRING 

To begin, let’s consider a simple logistic regression. This model estimates the impact of two variables, gre and gpa, on college admission. Let’s look at our three inputs: dataset, formula, and distribution.

  • Dataset
    • Data for this example is stored in the file binary.csv. This file can be passed directly to glm, as GAUSS formula string syntax supports CSV, Excel (XLS, XLSX), HDF5, GAUSS Matrix (FMT), GAUSS Dataset (DAT), Stata (DTA) and SAS (SAS7BDAT, SAS7BCAT) dataset types.
  • Formula
    • In a model with a dependent (or response) variable, the formula lists the dependent variable first, followed by a tilde ~ and then the independent variables. Independent variables should be separated by a +. The formula for this model is admit ~ gre + gpa.
  • Distribution Family
    • We wish to estimate a logistic regression with a binary dependent variable, so we use the "binomial" distribution family specification.

 

Putting it all together:

 

 // Create string with fully pathed file name
 fname = getGAUSShome() $+ "examples/binary.csv";

 // Call glm function with formula string
 call glm(fname, "admit ~ gre + gpa", "binomial");

 

The printed results:

 

Generalized Linear Model

Valid cases:                  400     Dependent Variable:                      admit
Degrees of freedom:           397     Distribution:                         binomial
Deviance:                   480.3     Link function:                           logit
Pearson Chi-square:         398.1     AIC:                                     486.3
Log likelihood:            -240.2     BIC:                                     498.3
Dispersion:                     1     Iterations:                                  4

                                          Standard                              Prob
Variable                 Estimate            Error          z-value             >|z|
----------------     ------------     ------------     ------------     ------------
CONSTANT                  -4.9494           1.0751          -4.6037         < 0.0001
gre                     0.0026907        0.0010575           2.5444        0.0109465
gpa                       0.75469          0.31959           2.3615        0.0182034


Note: Dispersion parameter for BINOMIAL distribution taken to be 1
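
To read these estimates on the probability scale, the fitted linear predictor can be passed through the inverse logit. The following is a minimal sketch that hard-codes the three coefficients from the printed table. The applicant values (gre = 600, gpa = 3.5) and the variable names b, x, eta and p are illustrative assumptions and are not part of the original example.

```gauss
// Coefficients copied from the printed results (CONSTANT, gre, gpa)
b = { -4.9494, 0.0026907, 0.75469 };

// Hypothetical applicant: constant term, gre = 600, gpa = 3.5
x = { 1, 600, 3.5 };

// Linear predictor on the logit (log-odds) scale
eta = x' * b;

// Inverse logit converts log-odds to a predicted probability
p = 1 ./ (1 + exp(-eta));

print "Linear predictor:      " eta;
print "Predicted probability: " p;
```

For these assumed inputs the linear predictor is about -0.69, which corresponds to a predicted admission probability of roughly one in three.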

 

 

INCLUDE FACTOR VARIABLES

Now, let’s extend our previous model to include the categorical variable rank. To specify that a variable is categorical in a formula, we use the factor keyword followed by the name of the variable in parentheses. Using factor in the formula string tells GAUSS that dummy variables representing the different categories of rank should be included in the regression. The formula for our extended model will be "admit ~ factor(rank) + gre + gpa".

 

 // Call glm function with formula string using
 // 'factor' keyword to create dummy variables
 call glm(fname, "admit ~ factor(rank) + gre + gpa", "binomial");

 

The printed output table now includes coefficients for rank=2,3,4. Note that rank=1 is automatically excluded from the regression as the base level.

 

Generalized Linear Model

Valid cases:                  400     Dependent Variable:                      admit
Degrees of freedom:           394     Distribution:                         binomial
Deviance:                   458.5     Link function:                           logit
Pearson Chi-square:         397.5     AIC:                                     470.5
Log likelihood:            -229.3     BIC:                                     494.5
Dispersion:                     1     Iterations:                                  4

                                          Standard                              Prob
Variable                 Estimate            Error          z-value             >|z|
----------------     ------------     ------------     ------------     ------------
CONSTANT                    -3.99             1.14          -3.5001      0.000465027
rank: 2                  -0.67544          0.31649          -2.1342        0.0328288
rank: 3                   -1.3402          0.34531          -3.8812      0.000103942
rank: 4                   -1.5515          0.41783          -3.7131      0.000204711
gre                     0.0022644         0.001094           2.0699        0.0384651
gpa                       0.80404          0.33182           2.4231        0.0153879
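
To make the dummy variables behind the rank: 2, rank: 3 and rank: 4 rows more concrete, the sketch below builds comparable 0/1 columns by hand with the GAUSS design function. It only illustrates the idea and is not necessarily how glm constructs the columns internally. The five category codes are hypothetical, and the vector is named rnk (an assumed name) because rank is already the name of a built-in GAUSS function.

```gauss
// Hypothetical category codes for five observations (values 1 to 4)
rnk = { 1, 3, 2, 4, 2 };

// design() returns one 0/1 indicator column per category
d = design(rnk);

// Drop the first column so that category 1 acts as the base level
d = d[., 2:cols(d)];

print d;
```

Under these assumptions, d has three columns, and the first row is all zeros because that observation belongs to the base category.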

 

INCLUDE INTERACTION EFFECTS 

Let’s extend our model one step further to include interaction effects using formula strings. Two different operators are available for adding interaction terms. The colon operator, :, adds only the pure interaction term, while the asterisk operator, *, adds each individual term as well as the interaction term.
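
For two continuous variables, a pure interaction term corresponds to the elementwise product of the two columns. The sketch below illustrates this by hand. The three rows of values and the name gre_gpa are assumptions made for illustration only, and nothing here is required when using formula strings.

```gauss
// Illustrative values for three applicants
gre = { 380, 660, 800 };
gpa = { 3.61, 3.67, 4.00 };

// Pure interaction term: elementwise product of the two columns
gre_gpa = gre .* gpa;

// Show the two original columns next to the interaction column
print gre~gpa~gre_gpa;
```

In formula terms, gre:gpa contributes only the third column, while gre*gpa contributes all three.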

 

© 2024 Aptech Systems, Inc. All rights reserved.

 

Let’s first consider using : to add the interaction of gpa and gre to our model. In this case the formula for our model is "admit ~ factor(rank) + gre + gpa + gre:gpa".

 

// Call glm function using ":"
call glm(fname, "admit ~ factor(rank) + gre + gpa + gre:gpa", "binomial");

 

In the output from this call we see that the coefficient for the interaction term gre:gpa has been added to our output table just below the coefficient for gpa.

 

Generalized Linear Model

Valid cases:                  400     Dependent Variable:                      admit
Degrees of freedom:           393     Distribution:                         binomial
Deviance:                   455.8     Link function:                           logit
Pearson Chi-square:         401.8     AIC:                                     469.8
Log likelihood:            -227.9     BIC:                                     497.7
Dispersion:                     1     Iterations:                                  4

                                          Standard                              Prob
Variable                 Estimate            Error          z-value             >|z|
----------------     ------------     ------------     ------------     ------------
CONSTANT                  -13.609           6.0711          -2.2416        0.0249902
rank: 2                   -0.7217          0.31915          -2.2613        0.0237401
rank: 3                   -1.3435          0.34638          -3.8786      0.000105077
rank: 4                   -1.6063          0.42099          -3.8155      0.000135897
gre                      0.018344         0.009968           1.8403        0.0657216
gpa                        3.6522           1.7884           2.0421        0.0411427
gre:gpa                 -0.004719        0.0028977          -1.6285         0.103418

Note: Dispersion parameter for BINOMIAL distribution taken to be 1
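
With the interaction term in the model, the effect of gre on the log-odds of admission is no longer a single number: it equals the gre coefficient plus the gre:gpa coefficient times gpa. The sketch below evaluates that expression using the estimates from the table above. The gpa values chosen and the variable names are illustrative assumptions.

```gauss
// Coefficients copied from the printed results above
b_gre = 0.018344;
b_int = -0.004719;

// Hypothetical gpa values at which to evaluate the effect of gre
gpa_vals = { 2.5, 3.0, 3.5, 4.0 };

// Change in the log-odds of admission per one-point increase in gre
gre_slope = b_gre + b_int .* gpa_vals;

print gpa_vals~gre_slope;
```

Because the estimated interaction coefficient is negative, the implied slope for gre shrinks as gpa increases.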

 

Now we will estimate the same model using *. In this case the formula for our model is "admit ~ factor(rank) + gre*gpa".

 

// Call glm function using "*"
call glm(fname, "admit ~ factor(rank) + gre*gpa", "binomial");

 

The resulting output table is exactly the same as in the case using : and shows that coefficients for gre, gpa, and gre:gpa are estimated.

 

Generalized Linear Model

Valid cases:                  400     Dependent Variable:                      admit
Degrees of freedom:           393     Distribution:                         binomial
Deviance:                   455.8     Link function:                           logit
Pearson Chi-square:         401.8     AIC:                                     469.8
Log likelihood:            -227.9     BIC:                                     497.7
Dispersion:                     1     Iterations:                                  4

                                          Standard                              Prob
Variable                 Estimate            Error          z-value             >|z|
----------------     ------------     ------------     ------------     ------------
CONSTANT                  -13.609           6.0711          -2.2416        0.0249902
rank: 2                   -0.7217          0.31915          -2.2613        0.0237401
rank: 3                   -1.3435          0.34638          -3.8786      0.000105077
rank: 4                   -1.6063          0.42099          -3.8155      0.000135897
gre                      0.018344         0.009968           1.8403        0.0657216
gpa                        3.6522           1.7884           2.0421        0.0411427
gre:gpa                 -0.004719        0.0028977          -1.6285         0.103418


Note: Dispersion parameter for BINOMIAL distribution taken to be 1

 

GLM WITHOUT A CONSTANT 

As a final adjustment to our model, let’s remove the constant from our regression. The default when using GAUSS formulas for glm is to include a constant in the model. In order to run the model without a constant, we must add a -1 after the ~ in our formula. The -1 should be the first item on our list of independent variables. To remove the constant from our previous model we use the formula "admit ~ -1 + factor(rank) + gre*gpa".

 

// Call glm function using "-1" to remove the constant
call glm(fname, "admit ~ -1 + factor(rank) + gre*gpa", "binomial");

 

The resulting output table is:

 

Valid cases:                  400     Dependent Variable:                      admit
Degrees of freedom:           394     Distribution:                         binomial
Deviance:                   461.2     Link function:                           logit
Pearson Chi-square:         397.6     AIC:                                     473.2
Log likelihood:            -230.6     BIC:                                     497.1
Dispersion:                     1     Iterations:                                  4

                                          Standard                              Prob
Variable                 Estimate            Error          z-value             >|z|
----------------     ------------     ------------     ------------     ------------
rank: 2                  -0.69527          0.31615          -2.1992        0.0278659
rank: 3                   -1.3651          0.34366          -3.9723         < 0.0001
rank: 4                   -1.5741          0.41732           -3.772      0.000161944
gre                    -0.0037016        0.0019285          -1.9194        0.0549365
gpa                      -0.35181          0.20923          -1.6814        0.0926763
gre:gpa                 0.0017265         0.000545           3.1679       0.00153547


Note: Dispersion parameter for BINOMIAL distribution taken to be 1