The following is the content of a talk on OpenCL I gave last night at Aberdeen’s inaugural techmeetup. You should be able to watch it online shortly from the Techmeetup website.
http://techmeetup.co.uk/blog/
What's on your computer?
Back in 1965 Gordon Moore, one of the co-founders of Intel, hypothesised that the speed of processors, relative to their cost, would double approximately every two years. Until recently this trend was holding.
Well, as you’ll know, things more or less continued along these lines until a couple of years ago, when we started reaching processor speeds approaching 4GHz. But I was being a little disingenuous when I referred to Moore’s Law as a doubling of speed, because what Moore was actually talking about was the number of transistors which could be placed inexpensively on a chip. From this standpoint Moore’s Law is still holding; these days more computing power comes in the form of multiple processing units.
For example, my current computer has a 2.4GHz Core 2 Duo processor capable of performing around 20 GFLOPS (or 2×10^10 floating point operations per second). This is about the same as Intel’s single core Pentium 4 running at 3.2GHz. If this is what 2 cores can do, imagine what 32 cores could do?
Well, it just so happens that I have just such a machine and it’s not that uncommon, many of you will as well.
Of course I’m referring to the graphics card. In my case I have an nVIDIA GeForce 9600M GT. This particular card has 32 cores, each running at 500MHz and providing a staggering 120 GFLOPS; six times that of the CPU alone. In fact, counting also the second GPU built into the logic board I have a machine which is theoretically capable of almost 200 GFLOPS.
Taking another example, the nVIDIA GeForce GTX 285 has 240 cores each running at around 650MHz. We’re now talking about a GPU capable of around 1 TFLOPS, around 50 times more than my processor. What’s more, a card like this will set you back about £300, so we’re not talking about large fortunes here.
So the question is: How can we utilise this additional power? Looking at the traditional frameworks for concurrent programming such as POSIX threads or OpenMP there is a fairly obvious drawback: They only utilise the multiprocessor capabilities of the CPU.
This may change over time as the boundaries between GPUs and CPUs blur. According to nVIDIA’s CEO, Jen-Hsun Huang, the future of computing may well see the CPU replaced by beefed up GPUs. Whether that happens remains to be seen. However, right now we need another approach. (I have to confess that I read about this in an article a couple of months back. However, looking at some more recent reports, it appears that there was either some misreporting or some backtracking.)
So for the moment at least we need to move into the realm of GPGPU (general-purpose computing on GPUs). Let’s examine some of the options we have here.
Firstly there is CUDA. CUDA was developed by nVIDIA and stands for Compute Unified Device Architecture. CUDA programmers use a C based language to code for the GPU. I believe it is quite widely used and there are third party wrappers for languages such as Python or Java. However, being developed by nVIDIA means that only nVIDIA hardware will be supported.
A similar offering is FireStream, which is AMD’s equivalent to nVIDIA’s CUDA. I’m not sure whether FireStream has as much of a following as CUDA, but again the disadvantage is that it is hardware specific.
A third option is DirectCompute. This is an extension to Microsoft’s DirectX collection of APIs and possibly worth considering if you’re only targeting Windows. But as a Mac developer it’s not what I’m looking for.
OpenCL
Another option, and the focus of this talk, is OpenCL. So what is OpenCL?
OpenCL stands for the Open Computing Language and is touted as “The Open Standard for Heterogeneous Parallel Programming”. It is intended to be to general purpose programming what OpenGL is to graphics programming or what OpenAL is to audio programming.
So why OpenCL? Well it is hardware and platform agnostic.
Some background: OpenCL was originally devised by Apple. However, in June 2008 Apple handed responsibility for the specification over to the not-for-profit consortium the Khronos Group. I think it would be fair to say that any major software or hardware company with an interest in promoting open standards has some sort of input into the Khronos Group. This list includes nVIDIA, AMD, Apple, Google, ARM, IBM, Intel and many more.

Image courtesy of Khronos Group
The fact that Intel is in this list is interesting. Remember I said that OpenCL was hardware agnostic? Well, that’s true. OpenCL is not just limited to the GPU but can be run on CPUs, FPGAs and any other processing unit for which an OpenCL implementation has been provided. In fact, if you’re lucky enough to be writing software for Snow Leopard then you are guaranteed to have an OpenCL compatible device; if the GPU isn’t compliant then at least the CPU will be.
This is great if you’re an Apple developer, but it is important to note that OpenCL is not just for Macs. That would be equivalent to saying that, just because Apple provide great OpenGL support, OpenGL is just for Macs. SDKs are becoming available for other platforms, although as far as I can tell at this stage they are mostly in beta. This is fairly understandable as the whole specification has been pushed through in record time.
So, you will need to be a little more careful about supported devices on other platforms. Although you should be able to check hardware support on your system through the APIs it is something you should be aware of.
So how does it work?
So how do you go about writing an application with OpenCL? Well you actually have to write two distinct sets of code: kernels written in a C based language and a controller which, through the OpenCL API, manages the running of these kernels on different compute devices.
An OpenCL application consists of a host which communicates with a set of Compute Devices. You can think of the host as your application and the compute devices as the individual CPUs, GPUs etc.
An OpenCL kernel may be compiled from code at runtime. By doing this the kernel may be optimised specifically for the hardware on which it is to be run. We use the API provided by the OpenCL SDK to perform this compilation and any developer who is familiar with OpenGL should feel quite comfortable with this API. I believe it is also possible to pre-compile kernels, but I have not done so and I believe this would be the exception rather than the rule.
Each compute device is further subdivided into Compute Units which are in turn divided into Processing Elements. So, for example, the nVIDIA GeForce GTX 285 with 30 streaming multiprocessors, each comprising 8 streaming processors would have 30 compute units and 240 processing elements.
The kernels you create will be executed on individual processing elements. A kernel executing on a single processing element is known in OpenCL as a work-item. You can think of this as a single thread of execution. Work-items are in turn grouped into work groups with a single work group running on a single compute unit. Hold that thought while I talk about memory quickly...
There are basically two types of memory which a kernel can access: global and local. Global memory is memory on the compute device, accessible by all work-items on that device. Local memory on the other hand is accessible only by work-items within the same work group. Going back to the graphics card you can see that the local memory corresponds to the local memory cache on each of the streaming multiprocessor units.

There are actually two other memory types: constant memory, which is really just constant global memory and private memory which is just memory defined within the scope of the kernel’s function.
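To make the scoping of these memory types a little more concrete, here is a rough sketch in plain C that emulates it on the CPU. It makes no OpenCL API calls, and the names `GROUP_SIZE`, `run_work_group` and `sum_per_group` are hypothetical, chosen purely for illustration. Each emulated work group copies its slice of "global" memory into its own "local" scratch array, sums it, and writes a single result back to global memory.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical work-group size */

/* Emulates one work group: each "work-item" copies one value from global
   memory into the group's local scratch, then the scratch is summed and
   a single result is written back to global memory. */
static void run_work_group(const float *global_in, float *global_out, size_t group_id)
{
    float local_scratch[GROUP_SIZE];  /* visible to this work group only */

    for (size_t local_id = 0; local_id < GROUP_SIZE; ++local_id) {
        size_t global_id = group_id * GROUP_SIZE + local_id;
        local_scratch[local_id] = global_in[global_id];
    }

    float sum = 0.0f;  /* plays the role of private memory */
    for (size_t i = 0; i < GROUP_SIZE; ++i)
        sum += local_scratch[i];

    global_out[group_id] = sum;
}

/* Runs every work group in turn; on a real device they could run in parallel. */
void sum_per_group(const float *global_in, float *global_out, size_t n)
{
    assert(n % GROUP_SIZE == 0);
    for (size_t group_id = 0; group_id < n / GROUP_SIZE; ++group_id)
        run_work_group(global_in, global_out, group_id);
}
```

Calling `sum_per_group` on eight input values produces two partial sums, one per work group. Neither group ever sees the other's scratch array, which is the property local memory gives you on a real device.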
Each executing kernel (work-item) can uniquely identify itself by a global unique id, unique across the device on which it is running. A work-item can also identify itself by a unique local id and an id unique to the work-group to which the work-item belongs. These IDs will be used to access memory which the work-item needs to read from or write to. Memory in OpenCL may be allocated either as a single dimensional buffer or as a 2 or 3D image. Correspondingly the work-items have either 1, 2 or 3 global (or local) ids. For example the method get_global_id(0) may be used to get the x co-ordinate of a 2D image while get_global_id(1) will return the y co-ordinate.
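The relationship between these IDs can be sketched in plain C, without touching the OpenCL API. This is a CPU-side illustration only, assuming a one-dimensional index space that starts at zero and a row-major image buffer. The names `work_item_id`, `make_work_item` and `pixel_index` are hypothetical; in a real kernel the equivalent values come from get_global_id, get_local_id and get_group_id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the IDs one work-item sees in one dimension.
   In a real kernel these come from get_group_id(), get_local_id()
   and get_global_id(); nothing here calls OpenCL. */
typedef struct {
    size_t group_id;   /* which work group the item belongs to */
    size_t local_id;   /* position inside that work group */
    size_t global_id;  /* position across the whole index space */
} work_item_id;

/* Derive the group and local IDs from a global ID, given the work-group size. */
work_item_id make_work_item(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_id id;
    id.group_id = global_id / local_size;
    id.local_id = global_id % local_size;
    /* The identity that ties the three IDs together. */
    id.global_id = id.group_id * local_size + id.local_id;
    return id;
}

/* Map 2D global IDs (x, y) to an offset in a row-major buffer of the given width. */
size_t pixel_index(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}
```

For example, with work groups of 8 items, the work-item with global ID 13 is the item with local ID 5 in work group 1. A work-item with 2D global IDs (3, 2) in a 1000 pixel wide image would write to offset 2003 of a flat buffer.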

So are there any drawbacks to OpenCL? Well, yes, there are a couple. Firstly, at the moment the processors on most graphics cards will only perform 32-bit floating point operations, although you can set #pragma directives to tell the device to do double precision calculations if it is capable. Of course there are ways around this if you need to carry out higher precision calculations, but these will generally require more operations.
Another point to bear in mind is that transferring memory onto a graphics card is relatively slow. Any performance gain achieved by running the process on multiple cores may be negated by performing this memory transfer.
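A back-of-the-envelope model of this trade-off might look like the following sketch in plain C. It is not a measurement: the function names are hypothetical, and the transfer bandwidth and compute times are estimates the caller has to supply. It only captures the idea that the copy time is added to the device's compute time before comparing against the host.

```c
#include <assert.h>

/* Estimated wall-clock time when offloading: copy the data across, then compute.
   All inputs are caller-supplied estimates; nothing is measured here. */
double offload_seconds(double bytes, double bytes_per_second, double device_compute_seconds)
{
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second + device_compute_seconds;
}

/* Returns 1 if offloading is estimated to beat running on the host, else 0. */
int offload_is_worthwhile(double host_seconds, double bytes,
                          double bytes_per_second, double device_compute_seconds)
{
    return offload_seconds(bytes, bytes_per_second, device_compute_seconds) < host_seconds;
}
```

With a purely hypothetical bandwidth of 1 GB/s, copying a 4 MB buffer costs about 4 milliseconds. That is negligible next to a calculation lasting several seconds, but it would dominate a kernel that finishes in under a millisecond.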
You will also need to take into account the fact that your code will only have access to those methods defined by the OpenCL specifications. printf statements, for example, will not work (and, somewhat related to that, debugging your kernel objects is going to be tough).
Finally (and here I really have to confess to being fuzzy on the detail), because we are running on streaming processors, branching in your code is not going to behave the same as in normal code executing on the CPU. For instance, if you have an if-else statement in a kernel running in 8 work-items, all the work-items taking the if branch need to complete it before the else branch executes.
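One way to picture the cost of branching is a deliberately simplified lockstep model, sketched below in plain C. It assumes the work-items in a group advance together, so the group pays for the if branch whenever any item takes it, and for the else branch whenever any item takes that. Real hardware differs in detail, and the function name `lockstep_cost` and its abstract cost units are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: work-items in a group advance in lockstep, so the group
   pays for the "if" branch when any item takes it, and for the "else"
   branch when any item takes that. Costs are abstract units, not timings. */
unsigned lockstep_cost(const int *takes_if, size_t n, unsigned if_cost, unsigned else_cost)
{
    assert(n > 0);
    int any_if = 0, any_else = 0;
    for (size_t i = 0; i < n; ++i) {
        if (takes_if[i])
            any_if = 1;
        else
            any_else = 1;
    }
    return (any_if ? if_cost : 0u) + (any_else ? else_cost : 0u);
}
```

Under this model, 8 work-items that all take an if branch costing 10 units pay 10 units in total. If even one of them takes an else branch costing 30 units, the whole group pays 40, which is why kernels with little divergence between neighbouring work-items tend to be preferable.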
Demonstration
For the demonstration I’ve written a little application which generates Mandelbrot set images.
I’ve always been fascinated by the Mandelbrot set; I’m amazed by the beautiful and complex images produced by such a simple equation.
To briefly explain: a complex number $C$ can be said to be in the Mandelbrot set if the absolute value of $Z_n$ in the equation $Z_{n+1} = Z_n^2 + C$ is less than a given value after $n$ iterations. If a given point is not within the set we can assign it a colour based on the number of iterations required before $Z_n$ exceeds the given tolerance.
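The iteration described above can be sketched in plain C as follows. This is a minimal illustration rather than the code from the demonstration: the function name `mandelbrot_iterations` is hypothetical, and it assumes the usual conventions of starting from Z = 0 and treating a point as escaped once the absolute value of Z exceeds 2 (tested here as the squared magnitude exceeding 4).

```c
#include <assert.h>

/* Returns the number of iterations of Z(n+1) = Z(n)^2 + C, starting from
   Z(0) = 0, performed before |Z| exceeds 2. Returns max_iterations if |Z|
   never exceeds 2, in which case C is treated as being inside the set. */
unsigned mandelbrot_iterations(double c_re, double c_im, unsigned max_iterations)
{
    assert(max_iterations > 0);
    double z_re = 0.0, z_im = 0.0;
    unsigned n = 0;
    /* Compare squared magnitudes to avoid a square root per iteration. */
    while (n < max_iterations && z_re * z_re + z_im * z_im <= 4.0) {
        double next_re = z_re * z_re - z_im * z_im + c_re;  /* real part of Z^2 + C */
        z_im = 2.0 * z_re * z_im + c_im;                    /* imaginary part of Z^2 + C */
        z_re = next_re;
        ++n;
    }
    return n;
}
```

A point such as C = 0 never escapes, so the function returns the iteration cap and the pixel would be drawn as part of the set. A point such as C = 1 escapes after a few iterations, and that small count is what would be mapped to a colour.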
If we define the maximum number of iterations required to determine whether a value is within the set as 1500 (which is the value I have chosen to use in the demonstration) we could potentially have to perform this calculation 1500 times for each pixel. Each of these calculations could itself comprise 10 floating point operations. Applying this over a 1000 by 1000 pixel image, it should be apparent that we could be carrying out around 15 billion operations (1500 × 10 × 1,000,000).
It should also be clear that the calculations carried out on a single pixel should in no way influence the calculations carried out on adjacent pixels. So as well as being an interesting example, the Mandelbrot set should also provide a good candidate for optimisation using OpenCL.
I wrote the example to perform the calculation in 3 ways. The first example simply performs the calculation on the main thread of the application using standard C. The second two perform the same calculation using OpenCL, firstly running solely on the CPU and then running on the GPU.
The results (I’m happy to report) are fairly conclusive. Running on a single thread the calculation takes a little over 10 seconds on my machine. Using OpenCL on the CPU the time is reduced to just over 4 seconds. This makes sense: we are doubling the number of cores and approximately halving the time required for the calculation.
Running on the GPU reduces the time further to around 0.67 seconds. Again this is about a sixth of the time required by the CPU and back at the start I said that the GPU was capable of processing around six times more operations per second than the CPU.
The code for the example can be downloaded from here. Although the GUI is written in Objective-C, I have written the OpenCL part in C++, so if you’re not on a Mac you should still be able to give it a go.
I would dearly love to see this example running on the GTX 285 I talked about - if anyone does this please let me know how you get on!
References