INTRODUCTION 

Loops account for a large share of computation time in many scientific computing problems. For this reason, GAUSS provides an easy way to run loops in parallel with the threadfor keyword.

 

This tutorial will start by writing simple single-threaded code to compute parameter estimates for a Rolling Regression on some simulated data. Then we will use the threadfor keyword to parallelize the code.

 

SIMULATE SOME DATA

To keep this tutorial focused on parallelization and make it simple for you to scale the problem up and down, we will just simulate random normal data.

 

//Number of observations
nobs = 1e6;

//Number of variables
nvars = 3;

//Set random seed for repeatable random numbers
rndseed 432343;

//Random independent variables, N(0,1)
X = rndn(nobs, nvars);

//Simulate parameter values
b_true = rndn(nvars, 1);

//Simulate error term
err = rndn(nobs, 1);

//Simulate our dependent variable
y = X * b_true + err;

 

SINGLE-THREADED ROLLING REGRESSION

Now that we have computed our y variable, we can implement a single-threaded Rolling Regression estimation. Our algorithm will be:

  1. Choose a window size.
  2. Estimate the parameters of a linear model on observations 1:window_size.
  3. Slide our window forward 1 observation, i.e. to 2:window_size + 1.
  4. Estimate the parameters of our current window.
  5. Continue steps 3 and 4 until we reach the end of the y vector.

 

window_size = 50;

//Number of iterations to completion
iters = rows(y) - window_size + 1;

//Pre-allocate matrix to hold parameter estimates
out = zeros(iters, cols(X));

//Start timer
h_start = hsec();

for i(1, iters, 1);
    //Estimate parameters of current window,
    //transpose them to a row vector and place
    //them in a row of the 'out' matrix
    out[i, .] = (y[i:i+window_size-1] / X[i:i+window_size-1, .])';
endfor;

h_final = hsec();
print (h_final - h_start)/100 "seconds for single-thread";

 

The time to run the above code will vary based on your computer and the version of GAUSS you are running. On the reference machine for this tutorial, a quad-core MacBook Pro, this code takes about 1.73 seconds.

 

threadfor BASICS

The threadfor keyword takes the same inputs as the standard for loop:

 

threadfor i(start, stop, step);
    //loop body
threadendfor;

  • i
    • The loop counter
  • start
    • The value of the loop counter on the first iteration
  • stop
    • The value of the loop counter on the last iteration
  • step
    • The amount to change the loop counter on each iteration

 

The difference, of course, is that threadfor runs multiple iterations at the same time. Since any two iterations may be running concurrently, separate iterations must not write data to the same memory location.
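For instance, a reduction that accumulates into a single location from every iteration is exactly the pattern to avoid. This hypothetical sketch is not from the tutorial; it shows every iteration targeting the same element, so the result would depend on the order the iterations happen to run in:

```gauss
// UNSAFE sketch: do not do this in a threadfor loop
total = zeros(1, 1);

threadfor i(1, 10, 1);
    // Every iteration reads and writes total[1,1],
    // so separate iterations collide on the same
    // memory location
    total[1, 1] = total[1, 1] + i;
threadendfor;
```

A safe alternative is to write each iteration's result to its own element, as the Rolling Regression example does, and combine the results after the loop.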

 

© 2024 Aptech Systems, Inc. All rights reserved.

 

Fortunately, in this example each iteration writes to a separate row. Since that row is determined by the loop counter, i, we do not have to worry about collisions. So we only need to change for to threadfor and endfor to threadendfor, and our code will run in parallel:

 

//Pre-allocate matrix to hold parameter estimates
out = zeros(iters, cols(X));

//Start timer
h_start = hsec();

//Parallel loop
threadfor i(1, iters, 1);
    //Estimate parameters of current window,
    //transpose them to a row vector and place
    //them in a row of the 'out' matrix
    out[i, .] = (y[i:i+window_size-1] / X[i:i+window_size-1, .])';
threadendfor;

h_final = hsec();
print (h_final - h_start)/100 "seconds with threadfor";

 

On the same quad-core Macbook Pro, the threaded code runs in about 0.44 seconds. This gives us a speed-up of almost exactly 4 times. Not all algorithms will scale as perfectly as this one, but it does indicate that threadfor can be quite simple to use and provide a great performance boost.

TEMPORARY VARIABLES

Our first example was easy and clean to convert to parallel, with excellent performance as well. However, we stated above that you cannot assign to the same variable (or, more precisely, the same index location) from separate iterations. Since it is common for loops to require some scalar temporary variables, how can we handle this? Do we need to create an array for every single temporary variable? Fortunately, the answer is no.

 

In threadfor loops, GAUSS automatically creates and manages iteration-specific versions of temporary variables that are created and used inside the loop. To illustrate this behavior, let’s make a slight modification to our parallel Rolling Regression example:

 

//Parallel loop
threadfor i(1, iters, 1);
    //Assign to temporary variables 'a' and 'b'
    a = i;
    b = i + window_size - 1;
    //Estimate parameters of current window,
    //transpose them to a row vector and place
    //them in a row of the 'out' matrix
    out[i, .] = (y[a:b] / X[a:b, .])';
threadendfor;

 

In this case, the temporary variables a and b are not critical to the code. But there are many cases in which not being able to use them would be a burden. However, because we do not know the order in which the iterations will run, these temporary variables have some special behavior and rules. The rules for threadfor temporary variables are:

  • A variable that is assigned to without an index is a temporary variable. Variables that are assigned to by index, such as the out matrix in our example, are called ‘slice’ variables.
  • The first reference to a temporary variable inside a threadfor loop must be an assignment.
  • The value of a variable after the loop will not be changed by actions inside the loop.
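As a hypothetical illustration of rule 2 (this sketch is not from the tutorial), the loop below reads the temporary variable tmp before assigning it, which is not allowed:

```gauss
// INVALID sketch: violates rule 2
threadfor i(1, 4, 1);
    // ERROR: 'tmp' is read here before it has been
    // assigned inside the loop body
    b = tmp + i;
    tmp = i * 2;
threadendfor;
```

Assigning tmp first, before any use, would satisfy the rule.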

Rules 2 and 3 above are intended to prevent you from creating bugs called ‘data races’. A data race occurs when the output of your program depends on the order in which parallel computations run. To make this clearer, let’s look at a very simple example:

 

my_var = 3;

threadfor i(1, 4, 1);
    my_var = i * 2;
threadendfor;

print my_var;

 

What do you think my_var will be equal to after the threadfor statement? The answer is 3. Since my_var was not assigned to by index inside the threadfor loop, it became a temporary variable. While a loop is a natural way to think about this code, the actual behavior of the threadfor loop is more like this:
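One way to sketch that behavior, based on the temporary-variable rules above: each iteration gets its own private copy of my_var, and none of those copies is written back after the loop. This is a conceptual illustration, not literal GAUSS code:

```gauss
// Conceptual expansion of the threadfor loop above:
// iteration i = 1:  private my_var = 2;
// iteration i = 2:  private my_var = 4;
// iteration i = 3:  private my_var = 6;
// iteration i = 4:  private my_var = 8;
// Each private copy is discarded when its iteration ends,
// so the outer my_var keeps its original value, 3.
```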