INTRODUCTION
Loops account for a large percentage of computation time in many scientific computing problems. For this reason, GAUSS provides an easy way to create parallel loops with the `threadfor` keyword.

This tutorial will start by writing simple single-threaded code to compute parameter estimates for a Rolling Regression on simulated data. Then we will use the `threadfor` keyword to parallelize the code.
SIMULATE SOME DATA
In order to keep this tutorial focused on parallelization and make it simple for you to scale the problem up and down easily, we will just simulate random normal data.
// Number of observations
nobs = 1e6;

// Number of variables
nvars = 3;

// Set random seed for repeatable random numbers
rndseed 432343;

// Random independent variables, N(0,1)
X = rndn(nobs, nvars);

// Simulate parameter values
b_true = rndn(nvars, 1);

// Simulate error term
err = rndn(nobs, 1);

// Simulate our dependent variable
y = X * b_true + err;
SINGLE-THREADED ROLLING REGRESSION
Now that we have computed our `y` variable, we can implement a single-threaded Rolling Regression estimation. Our algorithm will be:

1. Choose a window size.
2. Estimate the parameters of a linear model from observations `1:window_size`.
3. Slide our window forward 1 observation, i.e. to `2:window_size + 1`.
4. Estimate the parameters of our current window.
5. Continue steps 3 and 4 until we reach the end of the `y` vector.
window_size = 50;

// Number of iterations to completion
iters = rows(y) - window_size + 1;

// Pre-allocate matrix to hold parameter estimates
out = zeros(iters, cols(X));

// Start timer
h_start = hsec();

for i(1, iters, 1);
    // Estimate parameters of the current window
    // (observations i through i + window_size - 1),
    // transpose them to a row vector and place
    // them in the 'out' matrix
    out[i, .] = (y[i:i+window_size-1] / X[i:i+window_size-1, .])';
endfor;

h_final = hsec();

print (h_final - h_start)/100 "seconds for single-thread";
The time to run the above code will vary based upon your computer and the version of GAUSS you are running. On the reference machine for this tutorial, a quad-core MacBook Pro, this code takes about 1.73 seconds.
threadfor BASICS
The `threadfor` keyword takes the same inputs as the standard `for` loop:
threadfor i(start, stop, step);
    // loop body
threadendfor;
- `i`: the loop counter.
- `start`: the value of the loop counter for the first iteration.
- `stop`: the final value of the loop counter in the last iteration.
- `step`: the amount to change the loop counter on each iteration.
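For instance, here is a minimal sketch (the data and variable names here are hypothetical) that uses a step of 2 to compute the mean of every other column of a matrix:

// Hypothetical data: 100 observations of 10 variables
x = rndn(100, 10);

// Pre-allocate a row vector for the results
col_means = zeros(1, 10);

// Visit columns 1, 3, 5, 7 and 9 by stepping the counter by 2
threadfor j(1, 10, 2);
    col_means[1, j] = meanc(x[., j]);
threadendfor;

Note that each iteration writes to its own column of `col_means`; why that matters is discussed next.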
The difference, of course, is that `threadfor` will run multiple iterations at the same time. Since any two iterations could be running at the same time, we cannot write data to the same location from separate iterations.
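To make this concrete, here is a minimal sketch (the names `x`, `partial` and `total` are illustrative) of summing the squares of a vector. The unsafe version reads and writes one shared location, so its result would depend on the order the threads happen to run; the safe version gives each iteration its own slice and reduces afterwards:

n = 1000;
x = rndn(n, 1);

// Unsafe: every iteration reads and writes the same location,
// creating a data race.
// total = 0;
// threadfor i(1, n, 1);
//     total[1] = total[1] + x[i]^2;
// threadendfor;

// Safe: each iteration writes only to its own slice
partial = zeros(n, 1);
threadfor i(1, n, 1);
    partial[i] = x[i]^2;
threadendfor;

// Reduce after the loop
total = sumc(partial);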
Fortunately, in our Rolling Regression example, we write to a separate row of `out` in each iteration. Since that row is controlled by the loop counter, `i`, we do not have to worry about collisions. So we only need to change `for` to `threadfor` (and `endfor` to `threadendfor`) and our code will run in parallel:
// Pre-allocate matrix to hold parameter estimates
out = zeros(iters, cols(X));

// Start timer
h_start = hsec();

// Parallel loop
threadfor i(1, iters, 1);
    // Estimate parameters of the current window,
    // transpose them to a row vector and place
    // them in the 'out' matrix
    out[i, .] = (y[i:i+window_size-1] / X[i:i+window_size-1, .])';
threadendfor;

h_final = hsec();

print (h_final - h_start)/100 "seconds with threadfor";
On the same quad-core MacBook Pro, the threaded code runs in about 0.44 seconds, a speed-up of almost exactly 4 times. Not all algorithms will scale this well, but it shows that `threadfor` can be quite simple to use and can provide a substantial performance boost.
TEMPORARY VARIABLES
Our first example was clean and easy to convert to parallel code, with excellent performance as well. However, we stated above that you cannot assign to the same variable (or, more precisely, the same index location) from separate threads. Since it is common for loops to require some scalar temporary variables, how can we handle this? Do we need to create an array for every single temporary variable? Fortunately, the answer is no.
GAUSS automatically creates and manages a loop-specific version of any temporary variable that is created and used inside a `threadfor` loop. To illustrate this behavior, let’s make a slight modification to our parallel Rolling Regression example:
// Parallel loop
threadfor i(1, iters, 1);
    // Assign to temporary variables 'a' and 'b'
    a = i;
    b = i + window_size - 1;

    // Estimate parameters of the current window,
    // transpose them to a row vector and place
    // them in the 'out' matrix
    out[i, .] = (y[a:b] / X[a:b, .])';
threadendfor;
In this case, the temporary variables `a` and `b` are not critical to the code, but in many cases it would be a burden not to be able to use them. However, because we do not know the order in which the iterations will run, these temporary variables have some special behavior and rules. The rules for `threadfor` temporary variables are:
1. A variable that is assigned to without using an index is a temporary variable. Variables that are assigned to by index, such as the `out` matrix in our example, are called ‘slice’ variables.
2. The first reference to a temporary variable in a `threadfor` loop must be an assignment (see the sketch after this list).
3. The value of a variable after the loop will not be changed by actions inside the loop.
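For instance, here is a small sketch (with hypothetical temporaries `tmp` and `b`) of what rule 2 forbids and how to write it correctly:

// Invalid: the first reference to 'tmp' is a read, not
// an assignment, so this breaks rule 2.
// threadfor i(1, 4, 1);
//     b = tmp + i;
//     tmp = i * 2;
// threadendfor;

// Valid: 'tmp' is assigned before it is used
threadfor i(1, 4, 1);
    tmp = i * 2;
    b = tmp + i;
threadendfor;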
Rules 2 and 3 above are intended to prevent you from creating code with bugs called ‘data races’. A ‘data race’ occurs when the output of your program depends on the order in which parallel calculations are run. To make this clearer, let’s look at a very simple example:
my_var = 3;

threadfor i(1, 4, 1);
    my_var = i * 2;
threadendfor;

print my_var;
What do you think `my_var` will be equal to after the `threadfor` statement? The answer is 3. Since `my_var` was not assigned to by index inside the `threadfor` loop, it became a temporary variable. While a loop is a natural way to think about this code, the actual behavior of the `threadfor` loop is more like this:
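Conceptually, each iteration gets its own private copy of the temporary variable, and the outer `my_var` is never touched. A rough sketch of that behavior (the `my_var_thread*` names are purely illustrative, not what GAUSS creates internally):

my_var = 3;

// Each iteration assigns to its own private copy
my_var_thread1 = 1 * 2;
my_var_thread2 = 2 * 2;
my_var_thread3 = 3 * 2;
my_var_thread4 = 4 * 2;

// The outer 'my_var' keeps its pre-loop value
print my_var;   // prints 3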