Here is a piece of assignment I delivered for COMP 426 at Concordia Univerisity, Montreal. We had to parallelize any given program or algorithm we knew into a multi-core architecture. My professor was nice enough to allow us to write it in any language / platform. Since I’ve heard a little about OpenCL I decided to get my hands dirty and implement the Matrix Convolution algorithm using OpenCL.. Here are my findings. A complete example with source code can be found at the end..
Convolution of matrix is a commonly used technique for image filtering. This filtering enable a lot of effects like pre-processing in edge detections, facial recognition, image sharpen or blur. Since the convolution is rather generic and widely used it’s interesting to optimize it for multi-core so all these applications run faster with less resources.
The convolution technique usually consists of the following steps:
- Select a pixel in the original to convolute
- Apply the mask to the pixel by reading the pixel’s neighbor original value
- Writing the new value into an answer buffer.
I took the original convolution problem from a COMP428 (parallel programming) assignment. This assignment required of students to create a parallel implementation of a given sequential convolution algorithm by using hpmpi (message passing on the ENCS cirrus cluster powered by HP Beowolf cluster).
A parallel solution
Open CL
I have chosen to develop using OpenCL because I was interested in the possibilities this new library offers. With the JustInTime compilation mechanism built-in OpenCL, one can code a single OpenCL kernel and hope it runs on a wide variety of supported hardware. My choice was also motivated by the fact that I was using an Apple Macbook Pro running Snow Leopard so I was basically in possession of all I needed to develop using OpenCL.
However, after looking for documentation on OpenCL I was left quite puzzled. The library had no books neither does the Internet. All there is was quite complex specification details from the Kronos group (the group that maintains the OpenCL specifications among other open-source royalty free spec) and a couple of example from academic papers and vendors. These example were usually tailored for extremely specific scientific problems so they are of little help when trying to grasp the basics of OpenCL.
I was lucky enough to stumble on an excellent series of OpenCL tutorials by David W. Gohara, PhD at the Washington University School of Medecine Center for Computational Biology available online at http://www.macresearch.org/opencl. These tutorials walked the viewer in complete OpenCL code ranging from simple to complex computations.
I basically took the simple matrix addition problem and modified it so it can integrate with the sequential convolution algorithm. I then realized my 2D array won’t fit into the kind of buffer OpenCL required so I has to re-arrange my data to fit an OpenCL buffer.
Decomposition Technique
The problem can be separated in as many tasks as the total number of pixels and the tasks exhibits no data dependencies. Such an embarrassingly parallel problem can easily be implemented in a GPU. Each of the many thousand of core will be responsible of computing the value of a single pixel up to the point were all pixels are calculated. OpenCL is perfectly suited for this task but a couple of modification to the data structures must be done.
OpenCL supports problem in either 1, 2 or 3 dimension. In the program kernel, these dimensions are useful to get which pixel is being computer. One can get the dimension the kernel is running under with the following code:
int col = get_global_id(0);
int row = get_global_id(1);
OpenCL does not support complex data structure. We can only utilize one-dimensional buffers of primitive data structures. I therefore had to convert my 2D arrays of integers used in convolution to a single continuous stream of integers.
Results
Speedup Analysis
(click on the image for a full res)
Image Size | Mask Size | |
Problem A | 2816×2112 | 10×10 |
Problem B | 3264×2448 | 12×12 |
Table 1 – Problem properties
OpenCL + GeForce 9400M | OpenCL + GeForce 9600M GT | |
Problem A | 12.13492063 | 4.778125 |
Problem B | 12.70909091 | 5.792417651 |
Table 2 – Speedup data
As expected the OpenCL version is much faster than the sequential version running on a single core of my CPU. However, strange results happen between the two different GPU of my system.
The standard mid-range MacBook Pro has two GPU. An integrated NVIDIA GeForce 9400M and an additional discrete NVIDIA GeForce 9600M GT with 256MB of dedicated RAM. The latter is supposed to run twice as fast as the first one since it has more core and more memory.
Mac OS X Snow Leopard gives the user to run one GPU or the other in the energy saver menu.
Surprisingly, the High Performance mode yields performance two times slower than the better battery life mode. I’m clueless about the source of this issue: is OpenCL not mature yet and still buggy? Is the 9400M benefiting from better RAM access time since it’s closer to the CPU/RAM?
So I started the discussion with fellow programmers at:
http://stackoverflow.com/questions/2620599/my-opencl-kernel-is-slower-on-faster-hardware-but-why
and still haven’t found a satisfying answer. Some users suggested that having OpenCL decide the local workgroup size might yield this performance gap but playing with the number yielded slower result for the higher-end hardware in a consistent fashion.
While OpenCL supports specifying work group size, all it ask about a problem is the number of work to be done.. The details about the number of core used is hidden to the programmer so the core vs performance analysis required in the project description was not possible.
But further improvements are possible
While I was able to get XYZ X increase in performance using OpenCL for convolution, the code could have been optimized further by using vector based operation and loop unfolding.
Unfortunately, end of semesters being what they are, I lacked the required time to make that happen so I only ported the regular integer version of convolution but it might be reasonable to expect another four to five fold increase in overall performance by applying these techniques.
We have implemented a same code like you. But we are having some problems…
1. The execution time on GPU is getting more than on CPU.
2. I also want to know what is the difference between asynchronous mode and synchronous mode of data transmission. You also used synchronous. When we tried with asynchronous we are missing some data.
3. You used a function gettimeofday() in getclock(). I didnt find the function definition.
Please help me in this regard.
Thanks in advance…