# The Bit-Bucket Challenge

The ALS Ice Bucket Challenge is sweeping the internet, and people are busy coming up with crafty and audacious ways to dump icy water on their heads. I’d like to propose a lesser challenge, one that would help people who work with data everywhere. I call it the bit-bucket challenge. In the words of Godard:

it’s not important where you steal from, it’s where you bring it to

So, here’s my proposal:

No need to make a video, no need to agonize over the fine line between slackto-narcisissism, and do-goodery. None of that. This is *much* easier. Here’s how to do it:

2. Take the readme.txt you stuck in the main working directory of your project, and annotate that in markdown. It should tell people how to build your code, and make note of any obvious stumbling points in the code. Take this time to write one, if you haven’t already done so.
3. Upload all of that onto the internet (see step 1).
4. Make a blog post or a tweet briefly describing what you’ve just posted.
5. Tag three colleagues in your post, challenging them to release a project that they’ve published as a paper.

That’s it.

Is your code ugly? Full of hacks? Hard coded values? Bubble sort? Do not let this deter you; if your code is the least bit functional, someone who has read your paper and would like to try your algorithm will take that mess and improve it. They will patch it, refactor it or make other radical changes to suit their purpose. Chances are good that they’ll share these improvements with you, gratis! Heck, they might even write new tests for your code (you have some tests, right?). But they can’t do that unless you give them the chance.

Why is this a worthwhile challenge? Well, re-implementing a scientific paper is hard. Also, software is now firmly entrenched in the scientific process. If you’re doing computational method development and just writing about your results, I don’t think you’re doing it properly.

“if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do.” (V. Stodden, The scientific method in practice)

(h/t to Gael Varoquaux)

Worried that no one will care about your code? You’re in good company. There are plenty of projects out there not used by anyone other than the author, or algorithms of marginal value. Believe me, I’ve written plenty. But on the off-chance that someone *does* find your code useful, you’ve just saved this person many, many hours of work in trying to re-implement your algorithm from scratch.

So please, dear friends, let’s all share some code. Just do it. Do it for the betterment of us all.

# This is why we need code review.

One of the most fundamental operations in parallel computing is an operation called scan. Scan comes in two varieties, inclusive scan and exclusive scan (depending on whether $y_i$ includes $x_i$ or not.

From wikipedia: In computer science, the prefix sum, scan, or cumulative sum of a sequence of numbers $x_0, x_1, x_2,\; \dots$ is a second sequence of numbers $y_0, y_1, y_2,\; \dots$ , the sums of prefixes (running totals) of the input sequence $y_i = y_{i-1} + x_i$ (note this definition is inclusive).

Simple enough right? I’ve seen two simple variants for computing scan, one by Hillis and Steele, the other by Blelloch. Though it seems like scan is an inherently serial computation due to the dependency on the previous input, it actually decouples very nicely and can be computed in parallel very easily. Hillis and Steele’s exclusive scan algorithm goes something like this:

``` len = input.size; output[threadId] = input[threadId]; for (int i = 1; i = i) { output[threadId] += input[threadId - i] } } ```[ed: wordpress is not rendering this properly, but you can find the pseudo-code description in the link below]

Hillis and Steele’s scan is what is called step efficient: it executes in $O(\log(n))$ steps for inputs of size $n$. But at each step, it performs $O(n)$ operations, and so the overall work complexity is $O(n * \log(n))$. Blelloch’s is more complex, but is more work efficient: it requires only $O(n)$ operations. Here’s my code for Hillis and Steele’s inclusive scan:

```__global__ void hs_prefix_add(int * d_in, unsigned int *d_out) { extern __shared__ unsigned int s_in[]; extern __shared__ unsigned int s_out[]; int tid = threadIdx.x; ```

``` // load input into shared memory. s_in[tid] = static_cast(d_in[tid]); s_out[tid] = s_in[tid]; __syncthreads(); for (int offset = 1; offset < blockDim.x; offset <= offset) { s_out[tid] += s_in[tid - offset]; } __syncthreads(); } ```

``` d_out[tid] = s_out[tid]; }```

I’m using shared memory for the buffers since it’s directly on board the thread blocks, and is thus much faster to access than going back to global memory for every step of the loop. Simple, right? When I first tried to implement scan, I thought I’d see if Nvidia would offer any pointers explaining the finer points. After all, my implementation doesn’t handle cases with arrays that are not a power of 2, and won’t scale to arrays larger than the dimension of a block (which happened to be fine for the problem I was solving, but not o.k in general). So I found this document, talking about prefix sum. Here’s their algorithm for Hillis and Steele’s exclusive scan:

```__global__ void scan(float *g_odata, float *g_idata, int n) { extern __shared__ float temp[]; // allocated on invocation int thid = threadIdx.x; int pout = 0, pin = 1; // load input into shared memory. // This is exclusive scan, so shift right by one and set first elt to 0 temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0; __syncthreads(); for (int offset = 1; offset = offset) temp[pout*n+thid] += temp[pin*n+thid - offset]; else temp[pout*n+thid] = temp[pin*n+thid]; __syncthreads(); } g_odata[thid] = temp[pout*n+thid1]; // write output }```

When I read this code, I have to ask at least the following:

• I don’t understand why this code uses ‘double buffer indices’ for one large array, rather than two shared memory arrays. It makes for less readable code, and introduces a potential bug if the user places the swap below the if block rather than above.
• The last line which writes the output back to global memory contains a typo (thid1 should be thid)

This could go into a textbook as an example of why we need code review. That it was published with a compilation error tells me that this code was un-tested. Even after fixing it, try running it at home and see what you get (hint: not what you would expect).

Perhaps the mistakes and obfuscation were intentional, to foil google-copy-and-paste homework submissions for future courses. But I think that greater damage is done by having present and future searches point to obfuscated and incorrect code (the same code, error and all, appears in these slides). Parallel computation is hard enough without having to decode stuff like this. We should be teaching people to write legible, easily verifiable code that can later be optimized or generalized.

# A quick and useful note for organizing comp bio projects

I just happened upon William Noble’s guide to organizing computational biology projects. There are many good practices discussed therein, especially if you don’t have much experience with structuring a project yourself. He touches on a few important points like keeping a readme.txt file in every directory to help chart your progress throughout the project, the virtues of version control, separation of code from data and results.

For me, I would like to push as many of the bookkeeping aspects of a project (status of the project, directory readme.txt files, lab notebooks) onto version control. Why? Because it’s more easily accessible, stored in one place, and can be used to generate any of the reports or views that a readme.txt or lab notebook represents for the given project. Running an experiment? That’s a private github gist with a link to your lab notebook entry (which can be a private blog hosted by github). This way, you use one tool to synchronize different related parts of your project, so your readme.txt files aren’t out of date (or out of sync with your code/results/documentation). It’s easy to share with collaborators, and best of all you don’t have to maintain the infrastructure.

FYI: if you use git, and do not want .pyc / .o / .a / … files cluttering your repo, set up an ignore file.

# Virtualized software environments

By now, I’ve ironed out most of the problems that arose during my time installing CellProfiler on our cluster. Though the actual science can now begin, and this is where my focus should lie (just after finishing my NSERC application, of course), I can’t help but thinking about simpler ways to install complex software. This application runs in python, with many dependencies written in C (e.g hdf5, ATLAS). There is currently no build system provided other than a makefile, which is a bit brittle. I think this underscores the focus of the CP developers on desktop installations, rather than cluster or cloud computing environments.

CellProfiler requires a version of Python that was not installed on our cluster nodes. This suggests a solution which would let it run in a virtual environment. Here are two such alternatives that would likely have made this a less time-consuming process:

• Virtualenv: A python package that provides virtual python environments. This is actually recommended on the CP2 wiki. A great introduction to virtualenv is found here
• CDE: A form of linux-centric lightweight application virtualization. It runs your application, noting everything that is imported and creates a self contained package with all the dependencies included. Yaroslav Bulatov has a short intro on his blog. A larger series of use cases reside at the CDE website.

Note that since I’ve gone through the hard part of creating a virtualized environment from scratch, I haven’t actually used either of these 🙂 The next time I’m asked to install something unwieldy that has loads of dependencies though, will be a different story.