It’s been a while since I completed the Coursera class “Heterogeneous Parallel Programming”. It was mainly concerned with CUDA, which is Nvidia’s GPGPU framework. GPGPU is about running general-purpose computations on the graphics card. The class also briefly covered OpenCL, OpenACC, C++ AMP and MPI.
In the programming assignments, we juggled a lot of low-level details such as distributing the workload across thread blocks, which I had hardly cared about when using OpenCL so far. After seeing CUDA and OpenCL, it was a bit of a surprise that C++ AMP is indeed a more convenient programming model, and not just a C++ compiler for the graphics card. Let’s hope that it gets ported to other platforms soon.
The most eye-opening revelation for me was that it is possible to parallelize prefix sum computation. When I was first presented with the problem, I thought it was a showcase for serial execution. But apparently it’s not. Making it parallel is a two-step process. First split the input into a number of blocks, and compute the sum at the boundary of each one using something like a tree structure (in parallel). Once you have those boundary sums, it’s more obvious how to parallelize the rest: every block can compute its own prefix sums independently, starting from the total of all the blocks before it.
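To make the two steps a bit more concrete, here is a minimal sketch in plain Python. It is not the CUDA kernel from the class, only a sequential simulation of the idea, and the names `tree_sum`, `blocked_prefix_sum` and `block_size` are my own for illustration. The comments mark which loops could run in parallel, since no iteration depends on another.

```python
from itertools import accumulate


def tree_sum(values):
    """Sum by repeated pairwise addition; additions within a level are independent."""
    level = list(values)
    while len(level) > 1:
        # Pair up neighbours; an odd element at the end is carried over unchanged.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0] if level else 0


def blocked_prefix_sum(data, block_size=4):
    """Inclusive prefix sum computed block by block (sequential simulation)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Step 1: total of every block. Blocks do not depend on each other,
    # so on a GPU each one could be handled by its own thread block.
    totals = [tree_sum(block) for block in blocks]

    # Offset of a block = sum of all blocks before it (a short scan over the totals).
    offsets = [0] + list(accumulate(totals))[:-1]

    # Step 2: scan each block locally and shift it by its offset.
    # Again every block is independent of the others.
    result = []
    for block, offset in zip(blocks, offsets):
        result.extend(offset + partial for partial in accumulate(block))
    return result
```

For example, `blocked_prefix_sum([3, 1, 7, 0, 4, 1, 6, 3])` gives `[3, 4, 11, 11, 15, 16, 22, 25]`, the same as a plain serial running sum. The only remaining serial part is the scan over the block totals, which is much shorter than the input.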