GOALS

This tutorial builds on the first five econometrics tutorials. It is suggested that you complete those tutorials prior to starting this one.

 

This tutorial demonstrates how to test for multicollinearity after OLS regression. After completing this tutorial, you should be able to:

  • Examine the correlation between variables.
  • Compute the variance inflation factor (VIF) to test for multicollinearity.

 

INTRODUCTION 

Multicollinearity between regressors does not directly violate OLS assumptions unless it is exact. High but imperfect multicollinearity inflates the variance of the coefficient estimates and makes them harder to interpret, while exact multicollinearity makes estimation impossible. Signs of multicollinearity include large standard errors combined with a high R-squared, high correlation between independent variables, and high correlation between estimated coefficients. We will check for multicollinearity by examining the correlation between regressors and calculating the variance inflation factor (VIF).

 

THE OLS MODEL

Multicollinearity becomes a concern only when we have multiple regressors in our model. For this reason, we will change our linear model for this tutorial using a data generating process with multiple independent variables:

 

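$$ y_i = 1.3 + 5.7x_{i,1} + 0.5x_{i,2} + 1.9x_{i,3} + \epsilon_i $$

 

where $\epsilon_i$ is the random disturbance term. However, for demonstration we will make x3 a function of x1 and x2 plus random noise:

 

$$ x_{i,3} = 0.4x_{i,1} + 0.8x_{i,2} + u_i, \qquad u_i \sim N(0,1) $$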

 

 

Once we’ve created the data, we estimate the model parameters using the GAUSS function ols and store the results, as we did in previous tutorials. The corresponding code appears in the full program at the end of this tutorial.

 

COMPUTE THE CORRELATION MATRIX 

We will first look for signs of multicollinearity in the correlation matrix of the independent variables, which we compute with the GAUSS command corrx:

 

 print "corr(x):" corrx(indepvars);

 

The above code will print the following output:

 

corr(x):
  1.0000   0.0933   0.3042
  0.0933   1.0000   0.6982
  0.3042   0.6982   1.0000

 

As we should expect given our data generating process, the correlation matrix shows a moderate correlation between x3 and x2 (0.70) and a weaker one between x3 and x1 (0.30), while x1 and x2 are nearly uncorrelated (0.09). These correlations don’t seem to be affecting our regression: we don’t see any unusually large standard errors.
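
To see why correlations of this size are expected, the population correlations implied by the data generating process can be worked out by hand. This is a sketch that assumes x1, x2, and the noise in x3 are independent N(0,1) draws, as in the full program at the end of this tutorial; the symbol $u_i$ for the noise in x3 is introduced here only for illustration.

```latex
\begin{aligned}
x_{i,3} &= 0.4\,x_{i,1} + 0.8\,x_{i,2} + u_i, \qquad u_i \sim N(0,1) \\
\operatorname{Var}(x_{i,3}) &= 0.4^2 + 0.8^2 + 1 = 1.8 \\
\operatorname{corr}(x_1, x_3) &= \frac{0.4}{\sqrt{1.8}} \approx 0.298 \\
\operatorname{corr}(x_2, x_3) &= \frac{0.8}{\sqrt{1.8}} \approx 0.596 \\
\operatorname{corr}(x_1, x_2) &= 0
\end{aligned}
```

The sample values 0.3042, 0.6982, and 0.0933 differ from these population values because of sampling variation in 100 observations.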

 

VARIANCE INFLATION FACTOR (VIF)

The variance inflation factor for xj is given by

 

$$ \mathrm{VIF}_j = \frac{1}{1 - \hat{R}_j^2} $$

 

where $\hat{R}_j^2$ is the R-squared that results when xj is regressed, with an intercept, against all other explanatory variables. As a rule of thumb, VIF values over 10 are concerning.
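
The name "variance inflation factor" comes from the way this quantity enters the sampling variance of the OLS coefficient on xj. The following is a sketch under the usual textbook assumptions (a model with an intercept and homoskedastic errors with variance $\sigma^2$); here $\hat{R}_j^2$ is the R-squared from regressing xj on the other regressors, and $\bar{x}_j$ is the sample mean of xj.

```latex
\begin{aligned}
\operatorname{Var}(\hat{\beta}_j)
  &= \frac{\sigma^2}{\sum_{i}(x_{i,j} - \bar{x}_j)^2} \cdot \frac{1}{1 - \hat{R}_j^2}
   = \frac{\sigma^2}{\sum_{i}(x_{i,j} - \bar{x}_j)^2} \cdot \mathrm{VIF}_j \\[4pt]
\hat{R}_j^2 = 0    &\;\Rightarrow\; \mathrm{VIF}_j = 1 \\
\hat{R}_j^2 = 0.5  &\;\Rightarrow\; \mathrm{VIF}_j = 2 \\
\hat{R}_j^2 = 0.9  &\;\Rightarrow\; \mathrm{VIF}_j = 10 \\
\hat{R}_j^2 = 0.99 &\;\Rightarrow\; \mathrm{VIF}_j = 100
\end{aligned}
```

Read this way, the rule-of-thumb threshold of 10 corresponds to the other regressors explaining 90% of the variation in xj, which inflates the standard error of its coefficient by a factor of about $\sqrt{10} \approx 3.16$ relative to the uncorrelated case.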

 

To run the VIF test, we will first create a GAUSS procedure that:

  • Runs the appropriate OLS regression for the inputs y and x.
  • Computes the VIF using the OLS results.

 

 //proc (number of return values) = procedure_name(input1, input2)
 proc (1) = vif(y,x);
 //local variables only exist in this procedure
 local nam, m, b, stb, vc, std,
 sig, cx, rsq, resid, dbw, VIF_x;

//Turn off printing of 'ols' report
 __output = 0;

//Run regression
 { nam, m, b, stb, vc, std,
 sig, cx, rsq, resid, dbw } = ols("", y, x);

//Calculate the VIF
 VIF_x = 1/(1 - rsq);

//Return the VIF
 retp(VIF_x);

//The 'endp' keyword ends the procedure
 endp;

 

We then use this new procedure to find the VIF for x1, x2, and x3.

 

//Call 'vif' procedure for each variable

//Y = column 1, X = column 2 and 3
 vif_x1 = vif(indepvars[.,1], indepvars[.,2:3]);
 print "VIF for x_1 = " vif_x1;

//Y = column 2, X = column 1 and 3
 vif_x2 = vif(indepvars[.,2], indepvars[.,1 3]);
 print "VIF for x_2 = " vif_x2;

//Y = column 3, X = column 1 and 2
 vif_x3 = vif(indepvars[.,3], indepvars[.,1:2]);
 print "VIF for x_3 = " vif_x3;

 

© 2024 Aptech Systems, Inc. All rights reserved.

 

The above code should print the following output:

 

VIF for x_1 =   1.1367
VIF for x_2 =   2.0126
VIF for x_3 =   2.1986
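
As a cross-check, the VIFs can also be read off the diagonal of the inverse of the regressors' correlation matrix, since the j-th diagonal element of that inverse equals 1/(1 - R-squared) for xj. The sketch below is not part of the original program: it regenerates the data with the same seed and draw order as the full program at the end of this tutorial, and the names r and vif_all are introduced here for illustration.

```gauss
//Sketch: cross-check the VIFs using the inverse correlation matrix
//Data generation mirrors the full program in this tutorial
new;
rndseed 23423;

num_obs = 100;
x = rndn(num_obs, 2);
x_3 = 0.4*x[.,1] + 0.8*x[.,2] + rndn(num_obs, 1);
indepvars = x ~ x_3;

//Correlation matrix of the regressors
r = corrx(indepvars);

//The j-th diagonal element of inv(r) equals 1/(1 - R_j^2)
//'r' and 'vif_all' are illustrative names, not from the tutorial
vif_all = diag(inv(r));

print "VIFs from the inverse correlation matrix:";
print vif_all;
```

If the data match, the three printed values should agree with the VIFs reported above, up to rounding. This approach avoids running one auxiliary regression per variable, although the regression-based vif procedure makes the definition more explicit.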

 

CONCLUSION

Congratulations! You have:

  • Computed correlation between variables.
  • Found the VIF for each variable.

 

Further reading on diagnosing issues related to multicollinearity is available in our blog post “Diagnosing a Singular Matrix”.

 

The next tutorial examines model specification.

 

For convenience, the full program text is below.

 

//Clear the work space
 new;

//Set seed to replicate results
 rndseed 23423;

//Create 100 observations of two variables
 //which are each distributed as N(0,1)
 num_obs = 100;
 x = rndn(num_obs,2);

//Introduce a potential source of multicollinearity
 x_3 = 0.4*x[.,1] + 0.8*x[.,2] + rndn(num_obs, 1);

//Independent variables
 //The tilde operator performs horizontal concatenation
 indepvars = x ~ x_3;

//Generate error terms
 error_term = rndn(num_obs, 1);

//Generate y from indepvars and error_term
 y = 1.3 + 5.7*indepvars[.,1] + 0.5*indepvars[.,2] + 1.9*indepvars[.,3] + error_term;

//Turn on residuals computation
 _olsres = 1;

//Estimate model and store results in variables
 { nam, m, b, stb, vc, std, sig, cx, rsq, resid, dbw } = ols("", y, indepvars);

//Test correlation between independent variables
 print "corr(x):" corrx(indepvars);

//Call 'vif' procedure for each variable

//Y = column 1, X = column 2 and 3
 vif_x1 = vif(indepvars[.,1], indepvars[.,2:3]);
 print "VIF for x_1 = " vif_x1;

//Y = column 2, X = column 1 and 3
 vif_x2 = vif(indepvars[.,2], indepvars[.,1 3]);
 print "VIF for x_2 = " vif_x2;

//Y = column 3, X = column 1 and 2
 vif_x3 = vif(indepvars[.,3], indepvars[.,1:2]);
 print "VIF for x_3 = " vif_x3;

//proc (number of return values) = procedure_name(input1, input2)
 proc (1) = vif(y,x);
 //local variables only exist in this procedure
 local nam, m, b, stb, vc, std,
 sig, cx, rsq, resid, dbw, VIF_x;

//Turn off printing of 'ols' report
 __output = 0;

//Run regression
 { nam, m, b, stb, vc, std,
 sig, cx, rsq, resid, dbw } = ols("", y, x);

//Calculate the VIF
 VIF_x = 1/(1 - rsq);

//Return the VIF
 retp(VIF_x);

//The 'endp' keyword ends the procedure
 endp;