Tags: c, cuda, openmp, mpi, opencl

What is an exemplar problem and solution for learning shared memory, distributed memory, and/or GPU programming?


We are looking for exemplar problems and codes that will run on any or all of shared memory, distributed memory, and GPGPU architectures. The reference platform we are using is LittleFe (littlefe.net), an open-design, low-cost educational cluster, currently with six dual-core CPUs, each with an NVIDIA chipset.

These problems and solutions would be good for teaching parallelism to any newbie by providing working examples and opportunities to roll up your sleeves and code. Stack Overflow experts have good insight and are likely to have some favorites.

Calculating the area under a curve is interesting, simple, and easy to understand, but there are bound to be problems that are just as easily expressed and chock full of opportunities to practice and learn.
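To make that concrete, here is a minimal shared-memory sketch of the area-under-a-curve exercise using OpenMP (compiled with, e.g., gcc -fopenmp). The integrand 4/(1+x*x) and the strip count are my own choices, picked only because the integral over [0,1] is pi, which makes the answer easy to check:

    /* Midpoint-rule integration of f(x) = 4/(1+x*x) on [0,1],
     * parallelized with an OpenMP reduction. Result should be ~pi. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 100000000;     /* number of strips */
        const double h = 1.0 / n;     /* strip width */
        double sum = 0.0;

        /* Each thread accumulates a private partial sum; OpenMP
         * combines the partial sums via the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * h; /* midpoint of strip i */
            sum += 4.0 / (1.0 + x * x);
        }

        printf("area = %.12f\n", sum * h);
        return 0;
    }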

Hybrid examples using more than one of the memory architectures are most desirable, and reflective of where parallel programming seems to be trending.

On LittleFe we have predominantly been using three applications. The first is an analysis of optimal targets on a dartboard, which is highly parallel with little communication overhead. The second is Conway's Game of Life, which is typical of problems that share boundary conditions; it has a moderate communication overhead. The third is an n-body model of galaxy formation, which requires heavy communication.
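For the first, embarrassingly parallel pattern, a distributed-memory sketch looks like the following (my own illustration, not the actual dartboard analysis: each rank throws darts independently and a single MPI_Reduce sums the hit counts, so communication is negligible; a real code would use a better per-rank random stream than seeded rand()). Build and run with, e.g., mpicc darts.c -o darts && mpirun -np 4 ./darts:

    /* Monte Carlo "dartboard" pattern with MPI: every rank throws darts
     * at the unit square and counts hits inside the quarter circle;
     * one reduction at the end yields an estimate of pi. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long throws_per_rank = 10000000;
        long local_hits = 0;

        srand(rank + 1);  /* simplistic per-rank seeding for illustration */
        for (long i = 0; i < throws_per_rank; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                local_hits++;
        }

        long total_hits = 0;
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double pi = 4.0 * total_hits / ((double)throws_per_rank * size);
            printf("pi ~= %f\n", pi);
        }

        MPI_Finalize();
        return 0;
    }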


Solution

  • The CUDA programming guide contains a detailed analysis of the implementation of matrix multiplication on a GPU. That seems to be the staple "hello world" example for learning GPU programming (a stripped-down sketch of the naive kernel appears at the end of this answer).

    Furthermore, the CUDA SDK contains dozens of other well-explained examples of GPU programming in CUDA and OpenCL. My favorite is the colliding balls example (a demo with a few thousand balls colliding in real time).

    Update:

    The CUDA samples are not packaged with the Toolkit anymore. Instead you can find them on GitHub.
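    To give a flavor of that matrix multiplication example, here is a hedged sketch of the naive starting point (not the guide's exact code; the kernel name and launch parameters are my own): each thread computes one element of C = A * B for square N x N row-major matrices, which the guide then improves with shared-memory tiling.

        // Naive matrix multiplication: one thread per output element.
        __global__ void matMulNaive(const float *A, const float *B, float *C, int N)
        {
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            int col = blockIdx.x * blockDim.x + threadIdx.x;

            if (row < N && col < N) {
                float acc = 0.0f;
                for (int k = 0; k < N; ++k)
                    acc += A[row * N + k] * B[k * N + col];
                C[row * N + col] = acc;
            }
        }

        // Typical launch: 16x16 thread blocks covering the N x N output.
        // dim3 block(16, 16);
        // dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
        // matMulNaive<<<grid, block>>>(d_A, d_B, d_C, N);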