Two weeks ago, my focus was entirely devoted to presenting (and passing) my research proposal committee meeting. That accomplished, I spent the majority of this past week at a high performance computing summer school put on by SciNet, in collaboration with SHARCNET. The four-day school offered mostly lecture-based instruction, directed at researchers who need to use high performance computing facilities for their work, which typically means some form of parallel computation. Two streams were offered: programming GPUs with CUDA, and parallel programming with Open MPI. Having been exposed to some MPI instruction in fourth year, I chose to take the CUDA programming lessons.
What is CUDA? It stands for Compute Unified Device Architecture, and it exposes NVIDIA GPUs as general-purpose computing devices that operate in parallel with the CPU on your machine. In CUDA terminology, the CPU is the host, with the GPU acting as a coprocessor. Computation on the device is performed by kernels, which are functions executed in parallel on the coprocessor. A little bit about the computing model for GPUs:
- The compute device is composed of a number of multiprocessors, each of which contains a number of SIMD (single instruction, multiple data) processors
- To take advantage of the multiple multiprocessors, kernels are executed as a grid of thread blocks. All threads in a thread block are executed by a single multiprocessor.
- The resources of a multiprocessor are divided among the threads in a block (registers, shared memory, etc.)
- Each multiprocessor can physically execute K threads in parallel, where K is called the warp size
- At runtime, a thread can determine the block that it belongs to, the block dimensions, and the thread index within the block
- These values are used to compute indices into input and output arrays
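The last two points are easiest to see in code. Here is a minimal sketch of a CUDA kernel and its launch, assuming a simple "scale and add" operation on arrays (the kernel name, sizes, and launch parameters are illustrative, not from the course):

```cuda
#include <cstdio>

// Each thread computes one element of y = a*x + y. The global array
// index is built from the block index, block dimension, and thread
// index, exactly as described in the points above.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the last, partially filled block
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *x, *y;                   // device pointers
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // (in real code, fill host arrays and cudaMemcpy them to x and y here)

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();        // wait for the kernel to finish

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note the bounds check inside the kernel: the grid is rounded up to a whole number of blocks, so a few threads in the final block may have no element to work on.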
There’s much more; check out the SciNet wiki or the SHARCNET wiki for details. But the theme of the course was that if you have a CPU-bound problem with massive data parallelism (little to no dependence among data elements), then porting your code to CUDA will be worthwhile. This process is not free, of course. Memory management is key to achieving worthwhile speed-up, and tricky asynchronous data transfers must be coordinated by the programmer. There is little to no built-in error checking. Libraries that are CUDA-aware are still being developed, but there are some cool tools to check out: particularly Theano and PyCUDA if you are of a pythonic persuasion.
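Since error checking is largely left to the programmer, a common idiom is to wrap every CUDA API call in a checking macro so failures surface immediately instead of silently corrupting later results. A sketch of that idiom (the macro name is my own choice):

```cuda
#include <cstdio>
#include <cstdlib>

// Report the file and line of any failed CUDA runtime call, then abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void)
{
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    // Kernel launches return no status themselves, so check afterwards:
    // mykernel<<<1, 256>>>(d);
    CUDA_CHECK(cudaGetLastError());        // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

The two-step check after a launch matters: `cudaGetLastError` catches bad launch configurations, while `cudaDeviceSynchronize` surfaces errors that occur while the kernel actually runs.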
One good tip we received from our instructors was to begin measuring your code as soon as you develop it. Huge speed-up *can* be achieved (one instructor achieved 100x in his problem domain), but it takes only one memory access mistake (improper striding, improper packing of threads, overuse of registers, etc.) to slow the whole process down to negligible improvement. Time early, time often.
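One way to follow that advice is to time kernels with CUDA events, which measure elapsed time on the device itself rather than on the host. A minimal sketch, with the kernel launch left as a commented placeholder:

```cuda
#include <cstdio>

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // mykernel<<<blocks, threadsPerBlock>>>(...);   // the code being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Because kernel launches are asynchronous, a plain host-side timer around the launch would measure almost nothing; events recorded in the device's stream avoid that trap.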
What types of problems are well suited to GPU programming? Problems with many operations performed on the same data, and problems with high data parallelism. Should I look into pursuing GPU development for ISOMAP and LLE, which require eigenvalue decompositions and nearest-neighbour graph construction? The relative scarcity of computing time on our local cluster, and the large amount of GPU programming capability available on both SciNet and SHARCNET, suggest it’s worth exploring. However, given that we have *no* GPU capabilities on our local cluster, it would be risky to commit to a GPU-dependent solution.