Compiler

May 08, 2007

Loop Carried Dependencies in Sieve

Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to parallelize the loop, they must resolve this dependency.

Many of these simple dependencies can be removed by migrating the dependency to the iteration count, instead of previous iteration values. All of the OpenMP examples provided are from Michael's site (and the OpenMP Mailing list) so it may be a good idea to read his article first.

Original, with carried dependency Version with dependency on iteration
  const double up = 1.1;
  double Sn=1000.0;
  double opt[N+1];
  int n;
  for (n=0; n<=N; ++n) {
    opt[n] = Sn;
    Sn *= up;
  }
  const double up = 1.1;
  double Sn, origSn=1000.0;
  double opt[N+1];
  int n;
  opt[0] = origSn;

  for (n=1; n<=N; ++n) {
    Sn = origSn * pow(up, n);
    opt[n] = Sn;
  }
  Sn *= up;

So a few simple changes transform a loop into one which can be parallelized easily. This method of eliminating dependencies is very common, and can become quite complex. In addition to this, the parallel version of the loop can be optimized so that the costly pow() call is performed once per work slice, thereafter using previous iterations we know are available.

An optimized version of the same loop is shown opposite with the OpenMP pragma intact. This version only calls pow() once to initialize the local Sn upon entry into the loop. This is accomplished using a condition on lastn, which has iteration n assigned to it. The first encounter with the conditional means lastn is initialized to the value outside the loop, allowing to choose the costly pow() path instead of a simple multiply.

In the end, there was actually a performance hit due to the fact the loop just does not perform enough computation to benefit from parallelism. There are many writes/reads too, which doesn't help considering how bandwidth limited X86 architectures are.

 

OpenMP optimized version
 
  const double up = 1.1;
  double Sn, origSn=1000.0;
  double opt[N+1];
  int n, lastn;
  lastn = -2;
  #pragma omp parallel for private(n) firstprivate(lastn) lastprivate(Sn)
  for (n=0; n<=N; ++n) {
    if (lastn != n - 1) {
      Sn = origSn * pow(up, n);
    } else {
      Sn *= up;
    }
    opt[n] = Sn;
    lastn = n;
  }
  Sn *= up;

An Implementation using Sieve C++

Now, after all that explanation, we get back to the point of this post. I read the article and instantly started hacking away on a Sieve version of the above example. One of the really big features of Sieve is that you can define your own Iterator and Accumulator classes. These classes will work on any platform, and are deterministic in nature due to the semantics employed by the system.

  const double up = 1.1;
  double origSn = 1000.0;
  double Sn = origSn;
 
  sieve {
    DPowIterator iSn(origSn, up);
    for (IntIterator n(0); n<=N; ++n) {
      optSieve[n] = iSn;
      iSn.multiplyUp();
      splithere;
    }
    Sn = iSn
  }

I decided to explicitly define a splitpoint within the for loop so it will parallelize the loop iterations. IntIterator is your standard parallel integer iterator class, provided for you in Sieve library headers. DPowIterator I decided to create myself. The code for this is listed below.

  iteratorclass DPowIterator {
    double m_val, m_power;
  public:
    inline immediate DPowIterator(double i, double p):
        m_val(i), m_power(p) {
    }
 
    inline update void setupValue(int i) {
      m_val *= pow(m_power, i);
    }
 
    inline update void multiplyUp() {
      m_val *= m_power;
    }
 
    inline immediate operator double() const {
      return val;
    }
  };

This code results in the same behaviour as the optimized OpenMP example.

Sieve Iterators are based upon tracking modifications to them; a call to the update multiplyUp() function will result in a modification delta of 1, calling the update setupValue() function will result in a delta of its integer parameter. Upon entering a Sieve fragment, Iterator update methods are called with relevant deltas. This regenerates the Iterator to a suitable state. This method of handling Iterators is incredibly powerful, resulting in the ability to create more than just simple integer Iterators - deterministic pseudo-random number generators are an example.

Those who have read the Sieve Whitepaper or attended any of our presentations may notice some syntax changes in the code presented above. Immediate, update and splithere keywords are sieve extentions that will be revealed and explained in further detail soon.

This post was intended to show how a programmer can employ custom Sieve Iterator classes to improve the parallel potential of their code. Hopefully this has done this - we will be exploring Accumulator classes in future posts.

May 07, 2007

Catching the Bus (Part I)

When thinking about parallelizing programs, one's main concern is typically inter-process communication.  Achieving good parallel performance is challenging, even impossible, if there are many inter-processor dependencies, since every processor may need to frequently wait for results from the others.  To gain performance from multi-core in these situations it is necessary to come up with totally new algorithms.

Consequently, the most obvious candidates for easy parallelization among existing algorithms are problems where each processor can be given a load of work to do, independently from all the others.  Examples include raytracing, computing matrix products, and image convolution. However, there is a problem which is often overlooked (especially by theoreticians who may not consider low level implementation details): on a shared memory machine there is only one data bus! This means that although a given data-intensive algorithm is parallelizable, limited memory bandwidth may degrade performance.

As an example, consider image convolution.  Say we have a 3000x2500 image which we want to sharpen using a 5x5 matrix filter.  The standard way to do this is to apply the matrix to every pixel in the image (perhaps excluding a small number of border pixels, an irrelevant complication which we ignore here).  There are 7.5 million pixels, and the filter can be applied to any two pixels independently.  On a dual core machine the obvious way to parallelize this is to give each core 3.75 million pixels and the sharpening matrix and let them get on with the job.  This appears to be embarrassingly parallel, and we would quite reasonably expect run-time to approach ½ of the serial run-time.  However, the computation involved in applying the filter to a pixel (75 floating point multiplications and 75 floating point additions for a colour pixel, perhaps fewer if we can identify zeros and ones in the matrix) cannot all be done in registers on a Pentium.  The problem is that both cores are simultaneously accessing main memory all the time, and the bus cannot cope.

In a series of blog posts I hope to illustrate some ways to cope with this problem which involve managing registers, L1 and L2 caches properly.