For those using Python to develop and apply machine learning algorithms, there is now a plethora of tools at your disposal. There are plenty of toolkits that collect classic or standard models (my favourite of which is scikit-learn), but for various reasons these do not scale to larger data sets. Aside from a few use cases such as boosting or cross-validation, few of these methods or model-fitting procedures can be parallelized simply. So what to do if you want your code to run faster, but cannot just throw more processors at the problem, and don’t want to reinvent a slightly better wheel?
One possible out is to use a high-level GPU compiler. I have recently been looking into projects which combine accelerated computation with NumPy’s unadorned simplicity of expression. They all make use of GPUs for acceleration, and vary in how complex they are to install and configure. All of them depend on installing the CUDA SDK and the CUBLAS library in some form, but both are easily installable via your Linux package manager. For example, on Debian wheezy / sid / experimental,
```shell
sudo apt-get install nvidia-cuda-dev nvidia-cuda-toolkit libcublas4
```

will do the trick.
- CUDAmat: this library wraps up various matrix operations in CUDA kernels, which are in turn backed by CUBLAS. Perhaps the simplest to install.
- Theano: this library does quite a lot beyond pushing more complex calculations to a GPU device (it is also reliant on CUDA/CUBLAS). It performs symbolic differentiation, and has a pretty thriving user community. Definitely recommended.
- Numba: while I have not yet tried this package, it seems very promising. Travis Oliphant of NumPy fame heads the company developing this NumPy-aware Python compiler. It uses LLVM to compile decorated Python functions down to fast machine code. See Travis speak about it here, and Jake Vanderplas compare it with Cython here. Just recently a CUDA backend for Numba was announced as well.
My first experiments have been with Theano. Configuring it to work on the ARC of SciNet has been a bit of a pain, not having root access and all, but so far so good. Initial tests show a speed-up of a factor of about 40 when using the GPU versus the CPU (single-threaded) on matrix product tasks. I should know in short order whether the auto-encoder-based dimensionality reduction model I proposed in December will stack up (see what I did there?) against PCA, ISOMAP and LLE.