This is why we need code review.

One of the most fundamental operations in parallel computing is called scan. Scan comes in two varieties, inclusive and exclusive, depending on whether each output y_i includes the corresponding input x_i or not.

From Wikipedia: in computer science, the prefix sum, scan, or cumulative sum of a sequence of numbers x_0, x_1, x_2, \dots is a second sequence of numbers y_0, y_1, y_2, \dots, the sums of prefixes (running totals) of the input sequence: y_0 = x_0 and y_i = y_{i-1} + x_i (note this definition is inclusive).
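To make the distinction concrete, here's a tiny serial reference (my own illustration, not from the Wikipedia article): the exclusive scan is just the inclusive scan shifted right by one, with the identity element (0 for addition) in front.

#include <cstdio>

// Serial reference scans. Inclusive: y_i includes x_i.
void inclusive_scan(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) { sum += in[i]; out[i] = sum; }
}

// Exclusive: y_i is the sum of everything strictly before x_i.
void exclusive_scan(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) { out[i] = sum; sum += in[i]; }
}

int main()
{
    int x[4] = {3, 1, 7, 0}, inc[4], exc[4];
    inclusive_scan(x, inc, 4); // 3 4 11 11
    exclusive_scan(x, exc, 4); // 0 3 4 11
    for (int i = 0; i < 4; ++i) printf("%d %d\n", inc[i], exc[i]);
    return 0;
}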

Simple enough, right? I’ve seen two simple variants for computing scan, one by Hillis and Steele, the other by Blelloch. Though scan seems like an inherently serial computation due to the dependency on the previous output, it actually decouples nicely and can be computed in parallel quite easily. Hillis and Steele’s inclusive scan algorithm goes something like this:


len = input.size;
output[threadId] = input[threadId];
for (int i = 1; i < len; i *= 2)
{
    if (threadId >= i)
    {
        output[threadId] += input[threadId - i];
    }
    // conceptually, each step reads the values the previous step produced;
    // the CUDA code below makes that explicit with a second buffer
}
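To see why the doubling works, here's a trace on a small example of my own (n = 8): at each step every element adds in the value i positions to its left, so after \log_2(n) = 3 steps each position holds the sum of everything up to and including itself.

input: 3  1  7  0  4  1  6  3
i = 1: 3  4  8  7  4  5  7  9
i = 2: 3  4 11 11 12 12 11 14
i = 4: 3  4 11 11 15 16 22 25  <- the inclusive scan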

Hillis and Steele’s scan is what is called step efficient: it finishes in O(\log(n)) steps for an input of size n. But each step performs O(n) operations, so the overall work complexity is O(n \log(n)). Blelloch’s scan is more complex, but more work efficient: it requires only O(n) operations.
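To see where that O(n) bound comes from, here's a serial sketch of the two phases of Blelloch's exclusive scan (my own illustration of the idea, not his code, and it assumes n is a power of 2): an up-sweep builds a reduction tree in place using n - 1 additions, and a down-sweep pushes prefixes back down the tree using another n - 1 additions.

// Exclusive scan after Blelloch, serial sketch. Assumes n is a power of 2.
void blelloch_scan(int *x, int n)
{
    // Up-sweep (reduce): after this, x[n-1] holds the sum of the whole array.
    for (int d = 1; d < n; d *= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            x[i] += x[i - d];

    // Down-sweep: clear the root, then at each level give the left child
    // the parent's prefix, and the right child that prefix plus the left
    // subtree's sum.
    x[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
        {
            int t = x[i - d];
            x[i - d] = x[i];
            x[i] += t;
        }
}

In a parallel version, each level of each phase becomes one step, so the step count is still O(\log(n)), at the price of a second pass. Here's my code for Hillis and Steele's inclusive scan: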

__global__ void hs_prefix_add(int *d_in, unsigned int *d_out)
{
    // CUDA only gives a kernel one dynamically sized extern __shared__
    // allocation, so carve it into two named buffers of blockDim.x each.
    extern __shared__ unsigned int s_mem[];
    unsigned int *s_in = s_mem;
    unsigned int *s_out = s_mem + blockDim.x;
    int tid = threadIdx.x;

    // load input into shared memory.
    s_in[tid] = static_cast<unsigned int>(d_in[tid]);
    s_out[tid] = s_in[tid];
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1)
    {
        if (tid >= offset)
        {
            s_out[tid] += s_in[tid - offset];
        }
        __syncthreads();
        s_in[tid] = s_out[tid]; // publish this step's result for the next one
        __syncthreads();
    }

    d_out[tid] = s_out[tid];
}
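Something like the following launch should drive it (a sketch; the block size is my choice, and the device pointers d_in and d_out are assumed to be allocated and filled elsewhere). The third launch parameter sizes the dynamic shared allocation to hold both buffers:

int n = 512; // my choice; must not exceed the device's max block size
hs_prefix_add<<<1, n, 2 * n * sizeof(unsigned int)>>>(d_in, d_out);
cudaDeviceSynchronize();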

I’m using shared memory for the buffers since it’s on-chip, local to each thread block, and thus much faster to access than going back to global memory on every step of the loop. Simple, right? When I first tried to implement scan, I thought I’d see if Nvidia offered any pointers explaining the finer points. After all, my implementation doesn’t handle arrays whose length is not a power of 2, and won’t scale to arrays larger than a single thread block (which happened to be fine for the problem I was solving, but isn’t OK in general). So I found this document on prefix sums. Here’s their algorithm for Hillis and Steele’s exclusive scan:

__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // load input into shared memory.
    // This is exclusive scan, so shift right by one and set first elt to 0
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            temp[pout*n+thid] += temp[pin*n+thid - offset];
        else
            temp[pout*n+thid] = temp[pin*n+thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout*n+thid1]; // write output
}

When I read this code, I have to ask at least the following:

  • I don’t understand why this code uses ‘double buffer indices’ into one large array rather than two named buffers. CUDA does restrict a kernel to a single extern __shared__ allocation, but two named pointers into it would work just as well (see the sketch after this list). The index arithmetic makes the code less readable and invites a bug if a reader moves the swap below the if block instead of above it.
  • The last line, which writes the output back to global memory, contains a typo (thid1 should be thid).
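For comparison, here's how I would write the same exclusive scan (my own untested rewrite, not Nvidia's: one extern allocation split into two named buffers, with the swap made explicit at the bottom of each step):

__global__ void scan_readable(float *g_odata, float *g_idata, int n)
{
    // One dynamic shared allocation of 2*n floats, viewed as two buffers.
    extern __shared__ float s_mem[];
    float *s_in = s_mem;
    float *s_out = s_mem + n;
    int thid = threadIdx.x;

    // Exclusive scan: shift right by one and seed the first element with 0.
    s_in[thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        if (thid >= offset)
            s_out[thid] = s_in[thid] + s_in[thid - offset];
        else
            s_out[thid] = s_in[thid];
        __syncthreads();

        // This step's output becomes the next step's input.
        float *tmp = s_in; s_in = s_out; s_out = tmp;
    }

    g_odata[thid] = s_in[thid]; // after the final swap, s_in holds the result
}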

Nvidia’s version could go into a textbook as an example of why we need code review. That it was published with a compilation error tells me this code was untested. Even after fixing it, try running it at home and see what you get (hint: not what you would expect).
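If you want to try that experiment, here's a minimal harness along the lines I mean (a sketch: the input size, single-block launch, and shared memory sizing are my assumptions, and it compares against a serial exclusive scan):

#include <cstdio>
#include <cuda_runtime.h>

// ... the scan kernel from above goes here ...

int main()
{
    const int n = 8;
    float h_in[n] = {3, 1, 7, 0, 4, 1, 6, 3};
    float h_out[n], expected[n];

    // Serial exclusive scan as the reference answer.
    float sum = 0;
    for (int i = 0; i < n; ++i) { expected[i] = sum; sum += h_in[i]; }

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // One block of n threads; 2*n floats of dynamic shared memory
    // for the double buffer.
    scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%d: got %g, expected %g\n", i, h_out[i], expected[i]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}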

Perhaps the mistakes and obfuscation were intentional, to foil google-copy-and-paste homework submissions for future courses. But I think that greater damage is done by having present and future searches point to obfuscated and incorrect code (the same code, error and all, appears in these slides). Parallel computation is hard enough without having to decode stuff like this. We should be teaching people to write legible, easily verifiable code that can later be optimized or generalized.
