Non-linear Thinking with CUDA.

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion in more than one sense of the word.

The Problem

Given a set A = \{a_1, a_2 \ldots a_n\}, output a set S_A = \{0,\ \sum\limits_{k=1}^{n} a_k,\ \sum\limits_{k=i}^{i + j \mod n} a_k | 1 \le i < n, 0 \le j \le n - 1\}

In other words, pretend our array is cyclical, and we want all partial sums of the array elements, in order, starting from element i, ending with i + j. (We will be looping through the end, since the array is cyclical.)

In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide.

So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

57 71 113 113 128
57 + 71 71 + 113 113 + 113 113 + 128 128 + 57
57 + 71 + 113 71 + 113 + 113 113 + 113 + 128 113 + 128 + 57 128 + 57 + 71

Code

On my GitHub.

Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is O(n^3):

// brute force O(n^3) solution
let cyclospec (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |&gt; Seq.toArray

    let mutable parsum = 0

    (seq {
        yield 0
        yield pepArr |&gt; Array.sum
        for i = 0 to len - 2 do
            yield! [0..len-1]
                |> Seq.map
                    (fun j -&gt;
                        //F# 4.0 Feature!
                        parsum &lt;- 0
                        for ind = j to j + i do
                            parsum &lt;- parsum + pepArr.[if ind >= len then ind - len else ind]
                        parsum
                        )
    }) |> Seq.sort |> Seq.toArray

Note the F# 4.0 feature: a variable mutated within a closure.

This will work, but slowly. To significantly speed it up we only need to notice that there is no need to recompute values that have been precomputed already over, and over, and over again. So, memoization yields an O(n^2) solution:

let cyclospecOpt (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray

    let output = Array.zeroCreate ((len - 1) * len)

    for i = 0 to len - 2 do
        [0..len-1] 
            |> Seq.iter 
                (fun j ->  
                    let ind = i + j
                    output.[i * len + j] <- 
                        if i = 0 then pepArr.[j] 
                        else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind]
                    )
    seq {yield 0; yield pepArr |> Array.sum; yield! output} |> Seq.sort |> Seq.toArray

Ok, so far so good.

Multi-dimensional thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of a_5 + a_6 + a_7, thus it’s a grid X of dimensions n x (n-1)( where x_{i, j} is the sum, starting from i-th element of the input set, and ranging over j elements.

This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).

Since we now have ALEA.Cuda, we can do this without leaving F#, and, even better, we can script it:

[<Kernel; ReflectedDefinition>]
let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    let lenElem = blockIdx.y * blockDim.y + threadIdx.y

    if ind < len && lenElem < len - 1 then
        let mutable parsum = 0
        for i = 0 to lenElem do
            let idx = ind + i
            parsum <- parsum + arr.[if idx >= len then idx - len else idx]
        out.[lenElem * len + ind] <- parsum

And we invoke it:

let cyclospecGpu (peptide : int []) =
    let blockSize = dim3(16, 16)
    let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y)
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    worker.Launch <@cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

Each kernel thread is computing its own element in the grid, which is flattened into a solution. Time has indeed merged with data.

But what about our optimization? The easy way to implement it is to streamline invocations of a kernel, where each consecutive run will compute the new sum using the results of previous kernel invocations. There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

[[]
let cyclospecOptKernel (arr : deviceptr) (len : int) (out : deviceptr) (lenElem : int)=
let ind = blockIdx.x * blockDim.x + threadIdx.x

if ind < len then
let idx = ind + lenElem
out.[lenElem * len + ind] = len then idx – len else idx]

Invocation:

let cyclospecOptGpu (peptide : int []) =
    let blockSize = 256
    let gridSize = divup peptide.Length blockSize
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    for i = 0 to peptide.Length - 2 do
        worker.Launch <@cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

So, how did we do? (Here comes the mandatory chart):
cyc

And so:

  1. A little algorithm is a beautiful thing
  2. A little CUDA is an even more beautiful thing
  3. CUDA optimization performs no worse and at larger sizes of the input starts performing better, at some point massive amounts of computation slow us down even in parallel. See #1.
  4. And what is the GPU algorithm performance in O notation? It’s definitely non-linear, is it possible to estimate analytically?

Here is another chart that only compares optimized algorithms. It’s not a semi-log chart for easier visualization:
cyc_nonlog

One thought on “Non-linear Thinking with CUDA.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.