Parallel

November 05, 2007

Sieve vs. OpenMP: Custom reduction operations

Reduction is the process of combining elements of a vector (or array) to yield a single aggregate element.  Examples of reduction include: summing or multiplying all elements of a list, and computing minima/maxima.  In fact, reduction can basically performed using any associative operation [but see Sam Lindley's comment below about monoids].

Both Sieve and OpenMP provide some support for reductions: Sieve via its accumulator classes, OpenMP via its reduction clause. However, the reduction clause of OpenMP is very limited - certain reductions operators are available over basic types (float and int), but there is no way to define your own custom reductions for specific applications.

In this post I will illustrate the respective approaches to reduction taken by Sieve and OpenMP, and show how Sieve's accumulator mechanism can be used to define a custom reduction over structures - leading to parallel reductions which are not possible using OpenMP.

Continue reading "Sieve vs. OpenMP: Custom reduction operations" »

June 12, 2007

The Risks of Parallelism

You've seen all the spectacular performance results on the net for parallel software on the latest multicore CPU or GPU, so you think you'll have a go yourself. You work for ages adapting your algorithm to run in parallel.  You fill your code with threading or OpenMP or Sh, or HLSL or whatever else you can get hold of. You spend ages getting rid of nasty unexpected timing problems. And then, at the end of it all, it goes 2x slower.

You might think this is unusual, but apparently it isn't. It turns out that the more parallel your hardware, the harder it is to predict what the performance will be. I have seen code that runs very fast until you give it a big problem to solve, in which case it runs slowly. Or, code that runs very slow until you give it a big problem to solve, in which case it runs fast.

For most of the parallel hardware available today, the memory latency is the most serious problem. DRAM latency is already a serious issue. Increasing the number of processor cores increases the effects of limited memory bandwidth and latency.

The second big problem is to do with memory latency and bandwidth too. Access by a core to its local cache is very fast, so if you give a core a problem to solve that fits in its local memory then it's very fast. If you give a core a problem to solve that doesn't fit in its local memory then it is at least an order of magnitude slower. This can be complicated because data that is streamed in and out doesn't need to be in a local cache and is less affected by latency. So, you need to understand which of your data is being streamed in and out, and which is being constantly accessed via a cache. A good example is ray-tracing: the resolution of the scene isn't affected by the cache size, but the complexity of the scene is.

The third big problem is Amdahl's Law. If you find a hot-spot that takes up 50% of your processing time, then the maximum possible speed-up you can get is 2x. So you then need to find another hot-spot that takes up a smaller percentage and parallelize that as well. As you find more and more hot-spots, you tend to find that the gain from parallelization is smaller for each new hot-spot, and that the amount of programming work required increases. You rapidly hit a law of diminishing returns.

But the really big problem facing software developers right now is the uncertainty over the future of processor architecture. Multi-threaded shared-memory processors have traditionally only ever worked up to 4 cores. Beyond 4 cores the memory bottleneck of a shared-memory model holds back the system performance. So, developers investing large amounts of time and money in parallelizing for any existing processor architecture has to consider that there investment may be very short-lived and they have to switch to a completely new programming model in a few years time.

May 08, 2007

Loop Carried Dependencies in Sieve

Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to parallelize the loop, they must resolve this dependency.

Many of these simple dependencies can be removed by migrating the dependency to the iteration count, instead of previous iteration values. All of the OpenMP examples provided are from Michael's site (and the OpenMP Mailing list) so it may be a good idea to read his article first.

Original, with carried dependency Version with dependency on iteration
  const double up = 1.1;
  double Sn=1000.0;
  double opt[N+1];
  int n;
  for (n=0; n<=N; ++n) {
    opt[n] = Sn;
    Sn *= up;
  }
  const double up = 1.1;
  double Sn, origSn=1000.0;
  double opt[N+1];
  int n;
  opt[0] = origSn;

  for (n=1; n<=N; ++n) {
    Sn = origSn * pow(up, n);
    opt[n] = Sn;
  }
  Sn *= up;

So a few simple changes transform a loop into one which can be parallelized easily. This method of eliminating dependencies is very common, and can become quite complex. In addition to this, the parallel version of the loop can be optimized so that the costly pow() call is performed once per work slice, thereafter using previous iterations we know are available.

An optimized version of the same loop is shown opposite with the OpenMP pragma intact. This version only calls pow() once to initialize the local Sn upon entry into the loop. This is accomplished using a condition on lastn, which has iteration n assigned to it. The first encounter with the conditional means lastn is initialized to the value outside the loop, allowing to choose the costly pow() path instead of a simple multiply.

In the end, there was actually a performance hit due to the fact the loop just does not perform enough computation to benefit from parallelism. There are many writes/reads too, which doesn't help considering how bandwidth limited X86 architectures are.

 

OpenMP optimized version
 
  const double up = 1.1;
  double Sn, origSn=1000.0;
  double opt[N+1];
  int n, lastn;
  lastn = -2;
  #pragma omp parallel for private(n) firstprivate(lastn) lastprivate(Sn)
  for (n=0; n<=N; ++n) {
    if (lastn != n - 1) {
      Sn = origSn * pow(up, n);
    } else {
      Sn *= up;
    }
    opt[n] = Sn;
    lastn = n;
  }
  Sn *= up;

An Implementation using Sieve C++

Now, after all that explanation, we get back to the point of this post. I read the article and instantly started hacking away on a Sieve version of the above example. One of the really big features of Sieve is that you can define your own Iterator and Accumulator classes. These classes will work on any platform, and are deterministic in nature due to the semantics employed by the system.

  const double up = 1.1;
  double origSn = 1000.0;
  double Sn = origSn;
 
  sieve {
    DPowIterator iSn(origSn, up);
    for (IntIterator n(0); n<=N; ++n) {
      optSieve[n] = iSn;
      iSn.multiplyUp();
      splithere;
    }
    Sn = iSn
  }

I decided to explicitly define a splitpoint within the for loop so it will parallelize the loop iterations. IntIterator is your standard parallel integer iterator class, provided for you in Sieve library headers. DPowIterator I decided to create myself. The code for this is listed below.

  iteratorclass DPowIterator {
    double m_val, m_power;
  public:
    inline immediate DPowIterator(double i, double p):
        m_val(i), m_power(p) {
    }
 
    inline update void setupValue(int i) {
      m_val *= pow(m_power, i);
    }
 
    inline update void multiplyUp() {
      m_val *= m_power;
    }
 
    inline immediate operator double() const {
      return val;
    }
  };

This code results in the same behaviour as the optimized OpenMP example.

Sieve Iterators are based upon tracking modifications to them; a call to the update multiplyUp() function will result in a modification delta of 1, calling the update setupValue() function will result in a delta of its integer parameter. Upon entering a Sieve fragment, Iterator update methods are called with relevant deltas. This regenerates the Iterator to a suitable state. This method of handling Iterators is incredibly powerful, resulting in the ability to create more than just simple integer Iterators - deterministic pseudo-random number generators are an example.

Those who have read the Sieve Whitepaper or attended any of our presentations may notice some syntax changes in the code presented above. Immediate, update and splithere keywords are sieve extentions that will be revealed and explained in further detail soon.

This post was intended to show how a programmer can employ custom Sieve Iterator classes to improve the parallel potential of their code. Hopefully this has done this - we will be exploring Accumulator classes in future posts.

May 07, 2007

Catching the Bus (Part I)

When thinking about parallelizing programs, one's main concern is typically inter-process communication.  Achieving good parallel performance is challenging, even impossible, if there are many inter-processor dependencies, since every processor may need to frequently wait for results from the others.  To gain performance from multi-core in these situations it is necessary to come up with totally new algorithms.

Consequently, the most obvious candidates for easy parallelization among existing algorithms are problems where each processor can be given a load of work to do, independently from all the others.  Examples include raytracing, computing matrix products, and image convolution. However, there is a problem which is often overlooked (especially by theoreticians who may not consider low level implementation details): on a shared memory machine there is only one data bus! This means that although a given data-intensive algorithm is parallelizable, limited memory bandwidth may degrade performance.

As an example, consider image convolution.  Say we have a 3000x2500 image which we want to sharpen using a 5x5 matrix filter.  The standard way to do this is to apply the matrix to every pixel in the image (perhaps excluding a small number of border pixels, an irrelevant complication which we ignore here).  There are 7.5 million pixels, and the filter can be applied to any two pixels independently.  On a dual core machine the obvious way to parallelize this is to give each core 3.75 million pixels and the sharpening matrix and let them get on with the job.  This appears to be embarrassingly parallel, and we would quite reasonably expect run-time to approach ½ of the serial run-time.  However, the computation involved in applying the filter to a pixel (75 floating point multiplications and 75 floating point additions for a colour pixel, perhaps fewer if we can identify zeros and ones in the matrix) cannot all be done in registers on a Pentium.  The problem is that both cores are simultaneously accessing main memory all the time, and the bus cannot cope.

In a series of blog posts I hope to illustrate some ways to cope with this problem which involve managing registers, L1 and L2 caches properly.