C++

November 05, 2007

Sieve vs. OpenMP: Custom reduction operations

Reduction is the process of combining elements of a vector (or array) to yield a single aggregate element.  Examples of reduction include: summing or multiplying all elements of a list, and computing minima/maxima.  In fact, reduction can basically performed using any associative operation [but see Sam Lindley's comment below about monoids].

Both Sieve and OpenMP provide some support for reductions: Sieve via its accumulator classes, OpenMP via its reduction clause. However, the reduction clause of OpenMP is very limited - certain reductions operators are available over basic types (float and int), but there is no way to define your own custom reductions for specific applications.

In this post I will illustrate the respective approaches to reduction taken by Sieve and OpenMP, and show how Sieve's accumulator mechanism can be used to define a custom reduction over structures - leading to parallel reductions which are not possible using OpenMP.

Continue reading "Sieve vs. OpenMP: Custom reduction operations" ยป

July 23, 2007

Malloc and Scalability

The C standard library malloc memory allocator is now a potential bottleneck for multi-threaded applications running on multiprocessor systems. Dynamic memory management on shared memory multiprocessor systems is now a significant bottleneck, especially with regard to the stock implementations of malloc provided with development tools. The naive approach to a malloc implementation is to wrap a single global heap area with a lock, and have all concurrent attempts to allocate and deallocate memory contend for that lock. Obviously, scalability cannot be high under such a system.

Issues with the efficiency of dynamic memory management are not new. Doug Lea's venerable dlmalloc was written in response to such concerns over a decade ago; However, programmers rely increasingly heavily on such systems, and now they pose a new challenge as multiprocessors become ubiquitous and affordable. Applications that use modern C++ features, such as the STL, may be performing a significant amount of dynamic memory management behind the scenes. Not only do programmers need to be aware of the concurrency issues within their own code, they must also be aware of those issues within each 3rd party library used within a project. The memory allocation behaviour of such libraries is often far from obvious.

Fortunately, C++ allows for the overriding of the global new and delete operators (plus the equivalents for arrays) so that memory allocations can be performed by a scalable and concurrent allocator used as a drop in replacement for the system allocator. This process is explained in Bruce Eckel's Thinking in C++.

If memory management is becoming a bottleneck for your concurrent application, it is relatively easy to obtain a significant performance increase. In one of our in house applications, simply replacing the default malloc with a 3rd party malloc has increased performance by up to 80% on an eight core x86 system.

Links to further resources:

NedMalloc - A Scalable Malloc
PT Malloc
Googles Thread Caching Malloc
Multiprocessor Malloc Comparison

May 08, 2007

Loop Carried Dependencies in Sieve

Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to parallelize the loop, they must resolve this dependency.

Many of these simple dependencies can be removed by migrating the dependency to the iteration count, instead of previous iteration values. All of the OpenMP examples provided are from Michael's site (and the OpenMP Mailing list) so it may be a good idea to read his article first.

Original, with carried dependency Version with dependency on iteration
  const double up = 1.1;
  double Sn=1000.0;
  double opt[N+1];
  int n;
  for (n=0; n<=N; ++n) {
    opt[n] = Sn;
    Sn *= up;
  }
  const double up = 1.1;
  double Sn, origSn=1000.0;
  double opt[N+1];
  int n;
  opt[0] = origSn;

  for (n=1; n<=N; ++n) {
    Sn = origSn * pow(up, n);
    opt[n] = Sn;
  }
  Sn *= up;

So a few simple changes transform a loop into one which can be parallelized easily. This method of eliminating dependencies is very common, and can become quite complex. In addition to this, the parallel version of the loop can be optimized so that the costly pow() call is performed once per work slice, thereafter using previous iterations we know are available.

An optimized version of the same loop is shown opposite with the OpenMP pragma intact. This version only calls pow() once to initialize the local Sn upon entry into the loop. This is accomplished using a condition on lastn, which has iteration n assigned to it. The first encounter with the conditional means lastn is initialized to the value outside the loop, allowing to choose the costly pow() path instead of a simple multiply.

In the end, there was actually a performance hit due to the fact the loop just does not perform enough computation to benefit from parallelism. There are many writes/reads too, which doesn't help considering how bandwidth limited X86 architectures are.

 

OpenMP optimized version
 
  const double up = 1.1;
  double Sn, origSn=1000.0;
  double opt[N+1];
  int n, lastn;
  lastn = -2;
  #pragma omp parallel for private(n) firstprivate(lastn) lastprivate(Sn)
  for (n=0; n<=N; ++n) {
    if (lastn != n - 1) {
      Sn = origSn * pow(up, n);
    } else {
      Sn *= up;
    }
    opt[n] = Sn;
    lastn = n;
  }
  Sn *= up;

An Implementation using Sieve C++

Now, after all that explanation, we get back to the point of this post. I read the article and instantly started hacking away on a Sieve version of the above example. One of the really big features of Sieve is that you can define your own Iterator and Accumulator classes. These classes will work on any platform, and are deterministic in nature due to the semantics employed by the system.

  const double up = 1.1;
  double origSn = 1000.0;
  double Sn = origSn;
 
  sieve {
    DPowIterator iSn(origSn, up);
    for (IntIterator n(0); n<=N; ++n) {
      optSieve[n] = iSn;
      iSn.multiplyUp();
      splithere;
    }
    Sn = iSn
  }

I decided to explicitly define a splitpoint within the for loop so it will parallelize the loop iterations. IntIterator is your standard parallel integer iterator class, provided for you in Sieve library headers. DPowIterator I decided to create myself. The code for this is listed below.

  iteratorclass DPowIterator {
    double m_val, m_power;
  public:
    inline immediate DPowIterator(double i, double p):
        m_val(i), m_power(p) {
    }
 
    inline update void setupValue(int i) {
      m_val *= pow(m_power, i);
    }
 
    inline update void multiplyUp() {
      m_val *= m_power;
    }
 
    inline immediate operator double() const {
      return val;
    }
  };

This code results in the same behaviour as the optimized OpenMP example.

Sieve Iterators are based upon tracking modifications to them; a call to the update multiplyUp() function will result in a modification delta of 1, calling the update setupValue() function will result in a delta of its integer parameter. Upon entering a Sieve fragment, Iterator update methods are called with relevant deltas. This regenerates the Iterator to a suitable state. This method of handling Iterators is incredibly powerful, resulting in the ability to create more than just simple integer Iterators - deterministic pseudo-random number generators are an example.

Those who have read the Sieve Whitepaper or attended any of our presentations may notice some syntax changes in the code presented above. Immediate, update and splithere keywords are sieve extentions that will be revealed and explained in further detail soon.

This post was intended to show how a programmer can employ custom Sieve Iterator classes to improve the parallel potential of their code. Hopefully this has done this - we will be exploring Accumulator classes in future posts.