Catching the Bus (Part I)
When thinking about parallelizing programs, one's main concern is typically inter-process communication. Achieving good parallel performance is challenging, even impossible, if there are many inter-processor dependencies, since every processor may need to frequently wait for results from the others. To gain performance from multi-core in these situations it is necessary to come up with totally new algorithms.
Consequently, the most obvious
candidates for easy parallelization among existing algorithms
are problems where each processor can be given a load of work to do,
independently from all the others. Examples include
raytracing, computing matrix products, and image convolution.
However, there is a problem which is often overlooked (especially by
theoreticians who may not consider low level implementation details):
on a shared memory machine there is only one data bus!
This means that although a given data-intensive algorithm is
parallelizable, limited memory bandwidth may degrade performance.
As an example,
consider image convolution. Say we have a 3000x2500 image which we
want to sharpen using a 5x5 matrix filter. The standard way to do
this is to apply the matrix to every pixel in the image (perhaps
excluding a small number of border pixels, an irrelevant complication
which we ignore here). There are 7.5 million pixels, and the filter
can be applied to any two pixels independently. On a dual core
machine the obvious way to parallelize this is to give each core 3.75
million pixels and the sharpening matrix and let them get on with the
job. This appears to be embarrassingly parallel, and we
would quite reasonably expect run-time to approach ½ of the
serial run-time. However, the computation involved in applying the
filter to a pixel (75 floating point multiplications and 75 floating
point additions for a colour pixel, perhaps fewer if we can identify
zeros and ones in the matrix) cannot all be done in registers on a
Pentium. The problem is that both cores are simultaneously accessing
main memory all the time, and the bus cannot cope.
In a series of blog posts I hope to illustrate some ways to cope with this problem which involve managing registers, L1 and L2 caches properly.

