This blog chronicles my experiences learning F# via implementing Push: a programming language designed for use in evolving populations of programs geared for solving symbolic regression problems (genetic programming). It has since evolved to contain bits and pieces reflecting my thorny twisted software development path.

Look-and-say: [Alea.]CUDA

January 6, 2016 Leave a comment

Continuing the Advent of Code theme from the previous post. Figured since this year is going to be my year of CUDA, this would be a good opportunity to take it for a ride. A good April 1st post, but why wait?

So, how can we make this even faster than the already fast imperative solution?

Naive Approach

The complete script for this is on GitHub.

The kernel computes one iteration of the look-and-say sequence. Every instance looks at a single group of repeating digits and outputs the result into an array. The new array is twice the size of the original one (representing a previous member of the sequence), every digit or the first digit of a group will produce two number (n, c) and every repeating digit will produce two “0”s on the output:

1 1 1 3 1 2 2 1 1 3 -> 3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3

[<Kernel; ReflectedDefinition>]
let lookAndSayKernel (arr : deviceptr<int8>) (len : int) (out : deviceptr<int8>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x

    if ind < len then
        let c = arr.[ind]
        let idxOut = 2 * ind

        if ind = 0 || arr.[ind - 1] <> c then
            let mutable i = 1
            while ind + i < len && c = arr.[ind + i] do
                i <- i + 1
            out.[idxOut] <- int8 i
            out.[idxOut + 1] <- c
            out.[idxOut] <- 0y
            out.[idxOut + 1] <- 0y

In the naive approach we bring the resulting sequence into the memory and repeat:

let lookAndSayGpu (s : string) n =
    let arr = s |> (fun c -> (string>>int8) c) |> Seq.toArray

    let rec loop (arr : int8 []) n =
        if n = 0 then arr.Length
            let blockSize = 512
            let gridSize = divup arr.Length blockSize
            let lp = LaunchParam(gridSize, blockSize)

            use dArr = worker.Malloc(arr)
            use dOut = worker.Malloc(arr.Length * 2)

            worker.Launch <@lookAndSayKernel@> lp dArr.Ptr arr.Length dOut.Ptr

            let out = dOut.Gather()

            let arr = 
                |> Array.filter (fun b -> b > 0y) 

            loop arr (n - 1)

    loop arr n

This is almost not worth checking against the high performing imperative algorithm of the last post, but why not?

Not unexpectedly we are still falling short of the imperative solution. Not a problem. There are a few very low hanging fruit here that are just waiting to be picked.

A Better Solution

The complete script for this is on GitHub.

There are two problems with our solution so far:

  1. Two much memory being moved between the GPU and RAM
  2. A significant part of the algorithm is executed on the CPU with memory allocated and destroyed in the process (that would be the array compacting part)

The following algorithm addresses both problems:

GenerateSequences(str, n) {
//Allocate the maximum amount of memory for all data structures used on the GPU
len = str.Length
initializeGpuMemory(str, len)
for i = 1 to n {
    compactResultStream(2 * len)
    len = getNewLength()

Of the 3 steps in this iterative loop, there is only one where in the implementation we move data from the GPU. It would be easy enough to not do it if necessary, though.

At the core of this algorithm there is a stream compaction routine that is executed hopefully entirely on the GPU. Compaction is performed by copying only the values needed to the addresses in the output stream. These addresses are determined by running an inclusive scan on the array of integers that contains “1” in a meaningful position, and “0” in all the rest. The details of how scan is used for stream compaction and how scans are implemented on the GPU, are discussed in GPU Gems 3 (free from NVIDIA). I’m using a device scan that comes with Alea.CUDA, which, hopefully, minimizes memory copies between main memory and GPU.

Now that the stream has been compacted, we get the length of the new sequence as the last entry in the address map, obtained during compaction (and explained in the linked chapter of GPU Gems 3).

[<Kernel; ReflectedDefinition>]
let lookAndSayKernelSimple (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) (flags : deviceptr<int>)=
    let ind = blockIdx.x * blockDim.x + threadIdx.x

    if ind < len then
        let c = arr.[ind]
        let idxOut = 2 * ind
        let prevUnrepeated = if ind = 0 || arr.[ind - 1] <> c then 1 else 0
        flags.[idxOut] <- prevUnrepeated
        flags.[idxOut + 1] <- prevUnrepeated

        if prevUnrepeated = 1 then
            let mutable i = 1
            while ind + i < len && c = arr.[ind + i] do
                i <- i + 1
            out.[idxOut] <- i
            out.[idxOut + 1] <- c
            out.[idxOut] <- 0
            out.[idxOut + 1] <- 0

[<Kernel; ReflectedDefinition>]
let copyScanned (arr : deviceptr<int>) (out : deviceptr<int>) (len : int) (flags : deviceptr<int>) (addrmap : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x

    if ind < len && flags.[ind] > 0 then out.[addrmap.[ind] - 1] <- arr.[ind]

In the modified kernel, we also fill out the “flags” array which are scanned to obtain the address map for compacting the stream. The second kernel uses the result of the scan performed on these “flags” to produce the new compacted sequence. So for the above example:

original: 1 1 1 3 1 2 2 1 1 3
new:     3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3
flags:     1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1
map:      1 2 2 2 2 2 3 4 5 6 7 8 8 8 9 10 10 10 11 12 (inclusive scan of the flags)

The last entry in the map is also the new length.

let lookAndSayGpuScan (s : string) n =
    let maxLen = 20 * 1024 * 1024
    let arr = s |> (fun c -> (string>>int) c) |> Seq.toArray

    use dOut = worker.Malloc(maxLen)
    use dFlags = worker.Malloc(maxLen)
    use dAddressMap = worker.Malloc(maxLen)
    use dArr = worker.Malloc(maxLen)

    use scanModule = new DeviceScanModule<int>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    scanModule.Create(100) |> ignore

    let sw = Stopwatch()
    let rec loop n len =
        if n = 0 then len
            let blockSize = 512
            let gridSize = divup len blockSize
            let lp = LaunchParam(gridSize, blockSize)
            use scan = scanModule.Create(2 * len)

            worker.Launch <@lookAndSayKernelSimple@> lp dArr.Ptr len dOut.Ptr dFlags.Ptr
            scan.InclusiveScan(dFlags.Ptr, dAddressMap.Ptr, 2 * len)

            let gridSize = divup (2 * len) blockSize
            let lp = LaunchParam(gridSize, blockSize)

            worker.Launch <@copyScanned@> lp dOut.Ptr dArr.Ptr (2 * len) dFlags.Ptr dAddressMap.Ptr
            let len = dAddressMap.GatherScalar(2 * len - 1)

            loop (n - 1) len

    let res = loop n s.Length
    res, sw.ElapsedMilliseconds

The complete code is above. Notice, that we are not cheating when we are measuring the speed of the internal loop and not including allocations. These can be factored out completely, so the only experiment that matters is what actually runs.

Here are the comparisons:

Now, we are finally fast! But it’s a mighty weird chart, probably because most of the time is spent in context switching and all other latency necessary to make this run…

So, takeaways: GPU programming makes us think differently, cool GPU algorithms exist, writing for the GPU is tough but rewarding, and we can really blow a small problem out of any proportion considered descent.

Categories: CUDA, F# Tags: , , ,

Look-and-say: F#

December 27, 2015 1 comment

This holiday season my brief indulgence was solving the Advent of Code puzzles. One was about the look-and-say sequence. It starts with “1” and grows as follows:

1 is read off as “one 1” or 11.
11 is read off as “two 1s” or 21.
21 is read off as “one 2, then one 1” or 1211.
1211 is read off as “one 1, then one 2, then two 1s” or 111221.
111221 is read off as “three 1s, then two 2s, then one 1” or 312211.

This has been analyzed. Members’ length grows exponentially, so |\mathcal{L}_{i+1}| \approx |\mathcal{L}_{i}| \times 1.3
Advent of Code Day 10 is asking to output the length of the 41st member of this sequence, starting from 1113122113. And then, in the second part – of the 51st. This is simple enough, however, if one is not careful… It may take a while.

I wrote a solution fast enough, however I wasn’t satisfied with it at all: it looked too generic in style: it could be written this way in any language. The whole point of the indulgence being the aesthetic enjoyment of writing F#.

I looked around and found the following post, which I am delighted to quote as an example of the author’s amazing style and command of the language

let read (input : string) =
    |> Seq.fold (fun acc x ->
        match acc with
        | (n, x')::tl when x = x' -> (n+1, x')::tl
        | _ -> (1, x)::acc) []
    |> List.rev
    |> Seq.collect (fun (n, x) -> sprintf "%d%c" n x)
    |> fun xs -> System.String.Join("", xs)

If you take a moment to read through it you will be admiring it as much as I was.

Here is my boringly generic solution.

let genNext (s : List<byte>) =
    let mutable i = 0y
    let mutable c = s.[0]
    let reps = List<byte>()
    for ch in s do
        if c <> ch then
            reps.Add(byte i)
            c <- ch
            i <- 0y
        i <- i + 1y
    reps.Add(byte i)

If you look at mine – it is totally un-F#. It may as well have been written in C++! It is not motivated by a sense of style, or good taste, or sadly, by a deep sense of the language. It is, however, motivated by performance. The key to great performance is, of course, to know well ahead of time where your performance goes. In this case we do know it: since the structure holding the next sequence member grows in length exponentially, we would like to minimize all operations related to its manipulation. As far as reads – we cannot escape O(n) reads. Same for writes at every step. And now we are faced with a dilemma that a C++ (and possibly a C#) programmer would never face: are we being wasteful with our memory. In fact, so wasteful that it can bog us down? The answer in the first case is – yes, of course. In the second – no.

The first solution that would satisfy any functional purist, creates and destroys its immutable structures every step of the way. Not only that, its input is necessarily a string and so we also need to serialize the sequence we come up with into a string for the sole purpose of deconstructing it again in the next iteration.

At the same time, the second solution, using only an almost minimal amount of memory (it is provable, that the sequence never contains any numbers beyond 1, 2, and 3, so bytes are good), uses exactly one new structure which it returns. We actually don’t need to serialize it back to a string! Let’s see how these compare time wise (number of iterations over the sequence is along the X axis):

It is not unexpected, still kinda staggering. To zoom into the boring solution (it is still exponential):


So this raises a bunch of questions: as our technology evolves, are we going to be able to write more and more aesthetically impeccable (and shorter!) code without caring much for the real world? Or are performance concerns going to ever be looming somewhere very close, forcing more sensible choices? Who is the developer of the future? Is s/he going to be required to write great code but hold more down-to-earth alternatives in mind, effectively cancelling her immunity to trials and tribulations of machine machine-oriented languages like C++ or even C, something that Java set out to free us from ages ago? I don’t know, but I believe the answer for the nearest future at least, should be sought in the world of algorithms. Algorithmic and especially data structural efficiency would help us write in style while still holding on to great performance gains.

Non-linear Thinking with CUDA.

September 8, 2015 1 comment

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion in more than one sense of the word.

The Problem

Given a set A = \{a_1, a_2 \ldots a_n\}, output a set S_A = \{0,\ \sum\limits_{k=1}^{n} a_k,\ \sum\limits_{k=i}^{i + j \mod n} a_k | 1 \le i < n, 0 \le j \le n - 1\}

In other words, pretend our array is cyclical, and we want all partial sums of the array elements, in order, starting from element i, ending with i + j. (We will be looping through the end, since the array is cyclical.)

In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide.

So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

57 71 113 113 128
57 + 71 71 + 113 113 + 113 113 + 128 128 + 57
57 + 71 + 113 71 + 113 + 113 113 + 113 + 128 113 + 128 + 57 128 + 57 + 71


On my GitHub.

Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is O(n^3):

// brute force O(n^3) solution
let cyclospec (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |&gt; Seq.toArray

    let mutable parsum = 0

    (seq {
        yield 0
        yield pepArr |&gt; Array.sum
        for i = 0 to len - 2 do
            yield! [0..len-1]
                    (fun j -&gt;
                        //F# 4.0 Feature!
                        parsum &lt;- 0
                        for ind = j to j + i do
                            parsum &lt;- parsum + pepArr.[if ind >= len then ind - len else ind]
    }) |> Seq.sort |> Seq.toArray

Note the F# 4.0 feature: a variable mutated within a closure.

This will work, but slowly. To significantly speed it up we only need to notice that there is no need to recompute values that have been precomputed already over, and over, and over again. So, memoization yields an O(n^2) solution:

let cyclospecOpt (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray

    let output = Array.zeroCreate ((len - 1) * len)

    for i = 0 to len - 2 do
            |> Seq.iter 
                (fun j ->  
                    let ind = i + j
                    output.[i * len + j] <- 
                        if i = 0 then pepArr.[j] 
                        else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind]
    seq {yield 0; yield pepArr |> Array.sum; yield! output} |> Seq.sort |> Seq.toArray

Ok, so far so good.

Multi-dimensional thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of a_5 + a_6 + a_7, thus it’s a grid X of dimensions n x (n-1)( where x_{i, j} is the sum, starting from i-th element of the input set, and ranging over j elements.

This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).

Since we now have ALEA.Cuda, we can do this without leaving F#, and, even better, we can script it:

[<Kernel; ReflectedDefinition>]
let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    let lenElem = blockIdx.y * blockDim.y + threadIdx.y

    if ind < len && lenElem < len - 1 then
        let mutable parsum = 0
        for i = 0 to lenElem do
            let idx = ind + i
            parsum <- parsum + arr.[if idx >= len then idx - len else idx]
        out.[lenElem * len + ind] <- parsum

And we invoke it:

let cyclospecGpu (peptide : int []) =
    let blockSize = dim3(16, 16)
    let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y)
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    worker.Launch <@cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

Each kernel thread is computing its own element in the grid, which is flattened into a solution. Time has indeed merged with data.

But what about our optimization? The easy way to implement it is to streamline invocations of a kernel, where each consecutive run will compute the new sum using the results of previous kernel invocations. There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

let cyclospecOptKernel (arr : deviceptr) (len : int) (out : deviceptr) (lenElem : int)=
let ind = blockIdx.x * blockDim.x + threadIdx.x

if ind < len then
let idx = ind + lenElem
out.[lenElem * len + ind] = len then idx – len else idx]


let cyclospecOptGpu (peptide : int []) =
    let blockSize = 256
    let gridSize = divup peptide.Length blockSize
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    for i = 0 to peptide.Length - 2 do
        worker.Launch <@cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

So, how did we do? (Here comes the mandatory chart):

And so:

  1. A little algorithm is a beautiful thing
  2. A little CUDA is an even more beautiful thing
  3. CUDA optimization performs no worse and at larger sizes of the input starts performing better, at some point massive amounts of computation slow us down even in parallel. See #1.
  4. And what is the GPU algorithm performance in O notation? It’s definitely non-linear, is it possible to estimate analytically?

Here is another chart that only compares optimized algorithms. It’s not a semi-log chart for easier visualization:

Turnpike Problem (F#)

September 6, 2015 Leave a comment

The Problem

This is a problem for which no polynomial time solution exists, although a pseudo-polynomial (polynomial in the degree of the maximum element) does exist. In this paper, the algorithm proposed for the sum-function with full information is simple enough. For the original problem with the difference funcition, the algorithm is simple as well:

Problem. Given a set D_A=\{a_i-a_j\},\ 1\le i, j\le n, reconstruct the original set A=\{a_i\},\ 1 \le i \le n. (Naturally, we can only consider positive deltas:
D_A=\{a_i-a_j\},\ 1\le i<j \le n, the set is sorted in the descending order).

A simple algorithm exists. It performs in O(2^n\cdot\log{n}) time, but in practice its performance is closer to O(n^2\cdot\log{n}), as it "probably" won't hit the worst case.

We first reduce our set D_A to only positive values and assume they are sorted in the descending order.
Then the pseudocode will run as follows:

solution = [0; Da[0]]
max_element = Da[0]
remove (Da, Da[0])
while (Da not empty) {
   if solution = [0] return []
   if distances(solution.Add(Da[0])) \subseteq Da && not explored_this_path()
      remove(Da, distances)
   else if distances(solution.Add(max - Da[0])) \subseteq Da && not explored_this_path()
      remove(Da, distances)
      backtrack(solution, Da)
return solution;

We start by adding 0 and the maximal element to the solution, and remove the latter from the Da. At every step, we have a new version of the solution and Da. We take \max{D_{A_cur}} to be the possible next element of the set A. Then we compute pair-wise distances of the current sub-solution and see if they represent the subset of D_{A_cur}. If that fails – we then try a_{max} - \max{D_{A_cur}} the same way. If that fails as well, we backtrack to the previous state of the solution and previous D_{A_cur}, and keep backtracking until we reach a state, where not all of alternatives have been tried.

Once current Da is empty – we have a solution. If we have reached the root state and all alternatives have been tried – there is no solution.

Essentially, we are building a binary tree, where each node contains the current solution, current Da, and an edge connecting it with one or both of the alternatives above. Hopefully, we do not need to keep building this tree for too long. Here is a nice illustration of this process.

Note, that at every step we favor the remaining maximal element of D_{A_cur} as opposed to a_{max} - \max{D_{A_cur}}. Doing the latter would produce a symmetric solution.

F# Implementation

The source for this implementation is here

At every step, we keep building the following tree:

type DecisionTree =
    | Empty
    | TreeNode of deltas : Dictionary<int, int> * solution : int list * visited: uint32 * maxElem : DecisionTree * maxDist : DecisionTree * prev : DecisionTree 

Every node contains the current deltas, current state of the solution, the number of times we have visited this node (for optimization), two possible branches and a pointer to the parent node.

The function that implements the current algorithm:

let rec insert (deltas : Dictionary<int, int>) (res : int list) (node : DecisionTree) (prevNode : DecisionTree) maxSol =
    let insPrev () =
        let res, deltas, prevPrev = getPrev prevNode
        insert deltas res prevNode prevPrev maxSol

    match node with 
    | Empty -> TreeNode(deltas, res, 0u, Empty, Empty, prevNode)
    | TreeNode(dct, rslt, visited, maxElem, maxDist, prev) when visited < 2u ->
        let curVisited = visit node
        let elem = 
            if visited = 0u then keySeqMax deltas else maxSol - keySeqMax deltas
        let dists = stepOk elem res deltas
        if dists.Length > 0 then
            let newDeltas = removeDist deltas dists
            insert newDeltas (elem::res) (if visited = 0u then maxElem else maxDist) curVisited maxSol
        elif visited = 0u then
            insert deltas res curVisited prev maxSol
    | _ -> insPrev()

The insert function implements each step of the algorithm and returns a reference to the currently inserted node.
The driver is a loop, implemented, naturally as a tail-recursive function:

let rec buildSolution (deltas : Dictionary<int, int>) (res : int list) (node : DecisionTree) (prev : DecisionTree) =
    if deltas.Count = 0 then res |> List.sort
        let newNode = insert deltas res node prev maxSol
        let prevNode = node
        match newNode with
        | Empty -> [] // no solution
        | TreeNode(deltas, res, visited, maxElem, maxDist, prev) ->
             if visited >=2u && prev = Empty then [] // came all the way back, no solution
                 buildSolution deltas res newNode prevNode

This algorithm is biased towards always selecting the maximum remaining element from the Da as the next element in the solution. So, for an input set [2; 2; 3; 3; 4; 5; 6; 7; 8; 10], this algorithm produces [0; 3; 6; 8; 10]. Symmetrically, [0; 2; 4; 7; 10] is also a solution.

Categories: bioinformatics, F# Tags: ,

Decomposition Problem with F#, Dynamic Programming

August 20, 2015 Leave a comment

As my former boss and mentor used to say, in order to understand recursion one must first understand recursion. This is funny, ha-ha, but if we tweak it slightly we get a really useful statement: in order to understand dynamic programming, one must first understand recursion. Here is an example.



Decomposition Problem

Given a number, m, and an (ordered) set \mathcal{N} = \{n_i\}, \ n_i > 0, \ n_{i+1} \geq n_{i} output how many sets \mathcal{A}=\{a_i\}, \ a_i \geq 0 there are, such that m=\sum\limits_i a_i\dot n_i

Sequencing Antibiotics

I was solving a “particular” case (the solution is nonetheless general, but just to explain function names appearing in the code), in Chapter 2 on Antibiotics sequencing in my bioinformatics book.

Antibiotics sequencing is done through mass spectrum analysis, so if we have a spectrum of weight m, we may need to build the list of linear peptides of the same mass. In a toy version of this problem, we have an integer value of m and a list of amino acid weights (\mathcal{N}) out of which antibiotics can be built:

let weights = [|57; 71; 87; 97; 99; 101; 103; 113; 114; 115; 128; 129; 131; 137; 147; 156; 163; 186|]

To Understand Recursion

Given a number m, we assume that the problem is solved for m-n_0, m-n_1, etc until m-n_i, and we have values k_0, k_1, ..., k_i for these solutions Our solution, then is simply \sum\limits_i k_i. We are done.

If we look a bit closer, however, we’ll see that our recursion creates a tree with one root node (a solution for the case of m), branching exponentially: at the first level we get l = |\mathcal{N}| new nodes, each of which also produces l nodes at the next level. So this will pretty quickly become unfeasible.

Even if we did have unlimited computing power, this is still not the most efficient use of it, since lots of quantities will be recursively computed multiple times as we descend down the recursion tree.

Dynamic Programming

We therefore turn to dynamic programming, which is sort of recursion backwards. We still use the idea above, but instead of going back we go forward. We pre-compute all the necessary values by starting from n_0, the minimal member of \mathcal(N), marching until m. At each step we will perform exactly the recursive step described above, so once we reach higher values of m, all previous values of it will be ready!

The Solution

We allocate an array of 0 elements and length m (this can actually be optimized to be of some constant length of O(l), perhaps just of length l, I didn’t check, since as we move along old values become useless). We set all entries corresponding to n_i to 1.

The rest is pretty straight-forward.

open System

// array of amino acid weights
let weights = [|57; 71; 87; 97; 99; 101; 103; 113; 114; 115; 128; 129; 131; 137; 147; 156; 163; 186|]

/// Count the number of peptides of mass m
let countNumPeptides m =
    let allCounts : int64 [] = Array.zeroCreate (max m weights.[weights.Length - 1] + 1)
    // initialize dynamic programming store
    Array.ForEach(weights, fun w -> allCounts.[w] <- 1L)

    // equivalent to a simple loop, but we are using tail recursion
    let rec fillDynArray n =
        if n > m then allCounts.[m]
            // dynamic_count(N) = Sum(dynamic_count((N - weights[0]), dynamic_count(N - weights[1]),.., dynamic_count(N - weights.Last())
            let step = 
                |> (fun w -> if n - w < 0 then 0L else allCounts.[n - w])
                |> Array.sum

            allCounts.[n] <- allCounts.[n] + step
            fillDynArray (n + 1)

    fillDynArray weights.[0], allCounts

…and a mandatory chart

I really enjoy working with FSharp.Charting (available through NuGet), and it would be cool to visualize how fast the number of solutions grows with m increasing linearly.

Notice above I am returning the entire array of counts we have generated, so since we are saving all of the solutions from n_0 to m, let’s see what we get:

#load @"..\packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx"
open FSharp.Charting

let chartNumPeptides m =
    let _, counts = countNumPeptides m
    let series = 
        Array.init m (fun i -> i, if counts.[i] > 0L then counts.[i] else 1L)
    Chart.Line series.[weights.[0]..]
    |> Chart.WithXAxis(Min=100., Max=1000.)
    |> Chart.WithYAxis(Log=true)

Cheating a little here: if values are 0 I’m setting them to 1 since this is a semi-log chart (no log of 0). I think it’s easier than matplotlib!

Fun with Alea.CUDA, F# Interactive, Charts

July 19, 2015 1 comment

Source code for this post can be found on my GitHub.

It’s great to see technologies evolving over the years. Alea.CUDA has done so in leaps and bounds since the first time I laid eyes on it a couple of years ago. At the time the name seemed unfortunate and hinted at the “aleatic” programming paradigm, meaning that you could not reliably predict the results your code would produce in consequent runs (alea is a Latin dice). It has all changed since.

I listened to an NVIDA Webinar on the subject and immediately wanted to take Alea.CUDA development for a test drive. I wanted to do it the same way I would in Python: just like we use Numba and IPython console, use F# scripting with F# Interactive.

For a test subject, I picked the same 1D LBP described in my post here. The actual algorithm for extracting the local binary pattern and its use in face recognition is described in this paper.

The algorithm consists of two parts:

  1. Sliding a window of size 2*n across the array and computing the pattern at each location
  2. Computing the histogram of these values.

Here is my F# way of doing it:

let lbp (arr : int[]) n =
    if arr.Length < n * 2 + 1 
        then failwith ("Length should be at least " + (n * 2 + 1).ToString())

    // computes a pattern around an array element i
    let calcOnePattern i (around : int [])=
        [0..around.Length - 1] 
        |> Seq.fold (fun st j -> if around.[j] > i then st ||| (1 <<< j) else st) 0 

    // moves sliding windoe of size '2*n' through the array and computes the pattern for each 
    // element at the center of a window
    let prep =
        [|n .. arr.Length - n - 1|]
        |> (fun i -> calcOnePattern arr.[i] (Array.concat [arr.[i - n..i - 1]; arr.[i + 1..i + n]]))
    // return histogram
    |> Array.fold (fun (st : int[]) v -> st.[v] <- st.[v] + 1; st) 
          (Array.zeroCreate (1 <<< n * 2))

For comparison, here is the Python code (here p is 2 ** [0..(2n-1)]), in F# code I’m simply shifting left, so no such array is necessary.

def extract_1dlbp_cpu(input, neighborhood, p):
    hist = np.zeros(1 << (2 * neighborhood))
    for i in range(neighborhood, len(input) - neighborhood):
        left = input[i - neighborhood : i]
        right = input[i + 1 : i + neighborhood + 1]
        both = np.r_[left, right]
        hist[np.sum(p [both >= input[i]])] += 1
    return hist

Putting it on GPU

I took one of Alea.CUDA provided sample scripts and modified it slightly. This is why my code is wrapped in a class. Here it is:

type LBP(arr : int [], ?n : int) =
    inherit GPUModule(GPUModuleTarget.DefaultWorker)

    let mutable arr = arr
    let n = defaultArg n 4
        if arr.Length < n * 2 + 1 then failwith ("Length should be at least " + (n * 2 + 1).ToString())

    member this.Arr 
        with get() = arr
        and set (value) = arr <- value

    member this.Kernel (arr:deviceptr<int>) (n:int)  (hist:deviceptr<int>) (len : int)=
        let mutable i = blockIdx.x * blockDim.x + threadIdx.x
        let mutable res = 0
        if i < len - 2 * n then
            i <- i + n
            for j = i - n to i - 1 do
                if arr.[j] >= arr.[i] then res <- res ||| (1 <<< (j - (i - n)))

            for j = i + 1 to i + n do
                if arr.[j] >= arr.[i] then res <- res ||| (1 <<< (j - (i - n + 1)))

            __atomic_add (hist + res) 1 |> ignore

    member this.Compute () =
        let blockSize = 512
        let gridSize = divup (arr.Length - 2 * n) blockSize
        let lp = LaunchParam(gridSize, blockSize)

        let hist = Array.zeroCreate (1 <<< n * 2)
        use d_arr = this.GPUWorker.Malloc(arr)
        use d_hist = this.GPUWorker.Malloc(hist)
        this.GPULaunch <@ this.Kernel @> lp d_arr.Ptr n d_hist.Ptr (arr.Length)

There is not much here, at least in spirit, that any CUDA developer is not used to. We define a kernel and then launch it in a very familiar way, albeit wrapped in a slightly unfamiliar form. Nothing that is not easy to understand or adopt, though. F# quotations are used to interface with nvcc, so makes sense. Pointer arithmetic is wrapped nicely, so code like the one on line 25 above (__atomic_add) is possible. Works great, easy to use.

Here the kernel, as one would expect, computes a pattern around a single array element, at the same time contributing its result to the final histogram. Definitely room for optimization as this atomic_add is an obvious bottleneck, but I didn’t go any further since the results are quite spectacular already.

Also, I really like the “divup” function above, that figures out the number of blocks. Normally it is done as:

gridSize = (total + blockSize) / blockSize

… a small thing, but a nice convenience.


Since I was doing it in F# Interactive, some boilerplate was necessary. I installed my binaries through NuGet and here are the references. The “send to F# interactive” functionality that has been around for a while now is a great help to figure things out.

#r "System.Configuration.dll"
#I @"packages\Alea.Cuda.\lib\net40"
#I @"packages\NUnit.2.6.3\lib\"
#I @"packages\FsUnit.\Lib\Net40"
#r "Alea.CUDA.dll"
#r "nunit.framework.dll"
#r "FsUnit.NUnit.dll"
#load @"packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx"

open System
open System.IO
open Alea.CUDA
open Alea.CUDA.Utilities
open FsUnit
open FSharp.Charting
open System.Diagnostics

Alea.CUDA.Settings.Instance.Resource.AssemblyPath <- Path.Combine(__SOURCE_DIRECTORY__, @"packages\Alea.Cuda.\private")
Alea.CUDA.Settings.Instance.Resource.Path <- Path.Combine(__SOURCE_DIRECTORY__, @"release")

The last two lines come straight from Alea tutorial sample and are necessary to hook up the jit compiler. That’s it!

Experiments & results

I used the following functions to drive the experiments:

// random array of length n
let generate n =
    let rng = Random(int DateTime.Now.Ticks)
    Array.init n (fun _ -> rng.Next())
// experiment: run from 10 ** low to 10 ** high array length
let experiment low high =
    if low >= high || low < 0 then failwith "must be: low < high, both non-negative"

    // salt it
    let arr = [|0..1000|]
    use lb = new LBP(arr)
    lb.Compute() |> ignore

    let sw = Stopwatch()
    let cpuTimes = Array.zeroCreate (high - low + 1)
    let gpuTimes = Array.zeroCreate (high - low + 1)

    for i = low to high do
        let range = int (10.0 ** float i)
        let arr = generate range
        lb.Arr <- arr

        // Run on CPU
        printfn "Legnth: %d" range
        printfn "-----------"
        let h1 = lbp arr 4

        let idx = i - low

        cpuTimes.[idx] <- range, sw.Elapsed.TotalSeconds
        printfn "Computed on CPU: %0.5f sec" (snd cpuTimes.[idx])

        //Run on GPU
        let h2 = lb.Compute()
        gpuTimes.[idx] <- range, float sw.Elapsed.TotalSeconds
        printfn "Computed on GPU: %0.5f sec" (snd gpuTimes.[idx])
        printfn ""

        // make sure we are ok
        should equal h1 h2

        [Chart.Line(cpuTimes, Name="CPU")
         Chart.Line(gpuTimes, Name="GPU")
        .WithYAxis(Log=true, Title = "sec")
        .WithXAxis(Min=10.**float low, Log=true, Title = "length")

The helper “generate” generates a random integer array of arbitrary length, and the “experiment” function runs the experiments on arrays of sizes 10 ** [low..high]

experiment 3 7

I got:

Legnth: 1000
Computed on CPU: 0.00937 sec
Computed on GPU: 0.00285 sec

Legnth: 10000
Computed on CPU: 0.02445 sec
Computed on GPU: 0.00308 sec

Legnth: 100000
Computed on CPU: 0.17697 sec
Computed on GPU: 0.00388 sec

Legnth: 1000000
Computed on CPU: 1.70085 sec
Computed on GPU: 0.00412 sec

Legnth: 10000000
Computed on CPU: 16.53772 sec
Computed on GPU: 0.03045 sec

Here the straight F# mode runs about twice as fast as pure Python:

Length: 1000
Finished on CPU: time: 0.02263s

Length: 10000
Finished on CPU: time: 0.28701s

Length: 100000
Finished on CPU: time: 2.88549s

Length: 1000000
Finished on CPU: time: 29.97346s

The GPU performances are about comparable between Python and Numba and F# with Alea.CUDA. Same card: GTX Titan with 6Gb RAM, 14 SMPs

Of course a mandatory chart produced with FSharpChart and the following small scriptlet:

        [Chart.Line(cpuTimes, Name="CPU")
         Chart.Line(gpuTimes, Name="GPU")
        .WithYAxis(Log=true, Title = "sec")
        .WithXAxis(Min=10.**float low, Log=true, Title = "length")

(Still not sure, why my X axis starts from 100 and not from 1000, but I guess it’s minor :)).


A great experience doing all this overall: smooth and painless.

Categories: CUDA, F#, Parallel Tags: , , , , ,

Motif Finding with Gibbs Sampling (F#)

May 17, 2015 Leave a comment

The Problem

“Motif finding is a problem of finding common substrings of specified length in a set of strings. In bioinformatics, this is useful for finding transcription binding sites” (recap here). The problem is succinctly stated on Rosalind.

Given a set of strings DNA of size t, find “most common” substrings of length k. “Most common” means, that substrings should deviate from each other as little as possible, but they may not necessarily be the same. With this in mind, scoring is defined as follows:

let colScore (col : int []) =
    col.Length - col.GroupBy(fun c -> c).Select(fun gr -> gr.Count()).Max()

let score (motifs : int [,]) =
    [0..Array2D.length2 motifs - 1] 
    |> (fun col -> motifs.[*,col] |> colScore) 
    |> List.sum

To illustrate with the DNA alphabet (we only have 4 letters in the alphabet = {A, C, G, T}), the following:


will be scored as: 1 + 0 + 2 = 3. We add up scores of each column, where the score of the column is defined as col_length – max_repeating_letters

So, column 1 score is 1 because A repeats 3 times – it is the maximum repeating letter, etc. The problem is then to find an arrangement of k-length strings (called k-mers) that minimize the score.


Given a collection (matrix) of k-mers, a profile is built in order to determine what would be “the most probable” next k-mer if we were to augment our collection. A profile is a 4xk matrix, where P(i, j) is the probability of a given letter alphabet.[i] in the column (k-mer position) j. For the matrix above:

3/4 0 0
1/4 4/4 1/4
0 0 1/4
0 0 2/4

(it’s the number of letters occurring in the column, divided by the total number of k-mers, t)

Laplace Rule

We amend our simple profile calculation with the Laplace rule of succession. This is done to compensate for 0 and 1 probabilities: keeping those will have a disastrous effect on our algorithm, where we will never be able to pick a k-mer that, like in the example above, has anything other than “C” in the second position.

We apply the rule by always adding 4 strings to the collection, each consisting of one letter of the alphabet repeated k times. So the collection above becomes:


And the profile becomes:

4/8 1/8 1/8
2/8 5/8 2/8
1/8 1/8 2/8
1/8 1/8 3/8
// borrowed from StackOverflow. Stack two 2D arrays vertically (I miss Python!)
let stackV (a1: 'a[,]) (a2: 'a[,]) =
    let a1l1,a1l2,a2l1,a2l2 = (Array2D.length1 a1),(Array2D.length2 a1),(Array2D.length1 a2),(Array2D.length2 a2)
    if a1l2 <> a2l2 then failwith "arrays have different column sizes"
    let result = Array2D.zeroCreate (a1l1 + a2l1) a1l2
    Array2D.blit a1 0 0 result 0 0 a1l1 a1l2
    Array2D.blit a2 0 0 result a1l1 0 a2l1 a2l2

let createProfile (motifs : int[,]) =
    let k = Array2D.length2 motifs
    let t = Array2D.length1 motifs + 4

    let appendLaplace () =
        let As = Array.zeroCreate k
        let rest = 
           [|yield As 
             for i = 1 to 3 do 
                 yield As |> (fun a -> a + i) |]
           |> array2D
        stackV motifs rest
    let m = appendLaplace()
    let profile = 
        [|0..k - 1|] 
            (fun col -> m.[0..,col]
                .GroupBy(fun c -> c)
                .OrderBy(fun gr -> gr.Key)
                .Select(fun gr -> float(gr.Count()) / float t)
    Array2D.init 4 k (fun i j -> profile.[j].[i])

Here we append the motif matrix (represented by a 2D integer array in F#, where a row consists of integers in the range [0..3], corresponding to the letters of our DNA alphabet) with 4 additional entries as described above and retrieve the resulting profile matrix.

Consensus k-mer

The profile matrix is used to pick a k-mer that fits the profile best. For instance, if we have built a profile based on motifs extracted from t – 1 DNA strings, and need to pick a k-mer from the t-th DNA string that fits the profile best, we first compute probabilities of all k-mers in string t based on the profile matrix:

Pr(k-mer|profile) = profile[k-mer[0], 0] * profile[k-mer[1], 1] *..* profile[k-mer[k-1], k-1]

(Here is where Laplace rule comes in handy: we would really be in trouble if there were 0 entries in the profilie matrix. Many k-mers would have Pr(k-mer|profile) = 0 and would short-circuit the algorithm).

and then pick the k-mer with the highest probability.

Code below computes probabilities of a set of k-mers given a profile.

let kmerProbs (kmers : string []) (profile : float [,]) =
    let k = Array2D.length2 profile
    let probs = 
            (fun kmer -> 
                kmer, kmer.ToCharArray() 
                    |> (fun c -> alphabet.[c])
                    |> Array.mapi (fun i ind -> profile.[ind, i])
                    |> Array.fold (fun state p -> state * p) 1.)


Gibbs Sampling

Given a discrete distribution, we want to sample from it:

  1. Pick a sample s from the uniform distribution [0, n)
  2. Lookup its probability, ps
  3. Sample from a uniform [0, 1], pu
  4. If pu <= ps – accept the sample and return it, otherwise repeat.

One thing to note here is that our probabilities do not necessarily sum up to 1 by design. So we first normalize them, by dividing each probability value by the sum of all probabilities.

The function below returns a motif sampled from the distribution we built from a profile:

let gibbsSample (rnd : Random) (probs : (string * float) []) =
    let sumall = probs |> Array.sumBy (fun (s, p) -> p)
    let prs = probs |> (fun (k, v) -> k, v / sumall)
    let len = probs.Length

    let rec rollUntilAccepted value =
        let n = rnd.NextDouble()
        if n <= snd prs.[value] then value
            rollUntilAccepted (rnd.Next(0, len))

    fst probs.[rollUntilAccepted (rnd.Next(0, len))]

The Algorithm

Since solving the problem exactly by brute force is not feasible in practice with long strings and huge collections, approximations are used to get solutions that are “close enough”.

The Gibbs sampling based solution starts from a collection of random k-mers, each picked from one string of DNA and tries to improve on this original pick, performing a sequence of iterations, using Gibbs sampling.

Since the solution evidently depends on the initial collection picked, this is re-run many times to pick the collection with the minimal score.

Briefly the algorithm is:
Given a collection of strings (DNA) of size t, length k,

  1. Start from a collection of motifs (k-mers) of size t randomly picked from DNA, one motif from one string of DNA
  2. Store the score of this collection
  3. Loop many times:
    1. Select a random number j in [0, t)
    2. Exclude j-th motif from the motif collection and build the profile of the remaining collection
    3. Compute probability distribution of motifs in DNA[j] using the profile matrix from the previous step
    4. Sample from the distribution and replace j-th motif with the obtained sample
    5. Memorize best score, best collection of motifs

Here are the above steps translated to F#:

let gibbsSamplingMotifSearch (dna : string []) k iters rndStarts =
    let t = dna.Length
    //prepopulate the list of k-mers for all DNA strings
    let kmers = 
       |> s -> 
                        [0..s.Length - k]
                         .Select(fun i -> s.Substring(i, k)).Distinct().ToArray())

    let len = dna.[0].Length - k

    // single pass of the algorithm (single "random start")
    let runSingle () = 
        let firstMotifs = 
            |> (fun i -> rnd.Next(0, len))
            |> Array.mapi (fun i p -> dna.[i].Substring(p, k) |> toInts)

        let bestMotifs = array2D firstMotifs
        let bestScore = score bestMotifs

        // repeat the sampling step n times
        let rec runSampler (motifs : int [,]) (bestMotifs : int [,]) bestScore n =
            if n = 0 then bestScore, bestMotifs
                let except = rnd.Next(0, t)  // excluded motif
                // exclude from the collection
                let profileMotifs = 
                   stackV motifs.[0..except-1, *] motifs.[except+1.., *] 
                let profile = createProfile profileMotifs // create profile from the rest
                let probs = kmerProbs kmers.[except] profile // build probability distribution 
                let motif = array2D [|gibbsSample rnd probs |> toInts|] // sample to obtain a replacement
                let curMotifs = 
                    motifs.[except+1.., *] 
                    |> stackV motif  
                    |> stackV motifs.[0..except-1, *] // replace  
                let curScore = score motifs // obtain the current score

                //recursive step, memorizing best results
                runSampler curMotifs 
                     (if curScore < bestScore then curMotifs else bestMotifs) 
                     (if curScore < bestScore then curScore else bestScore) 
                     (n - 1) 
        runSampler bestMotifs bestMotifs bestScore iters

    let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()



Here is the header for this script:

open System
open System.Linq
open System.IO

#load "..\packages\MathNet.Numerics.FSharp.3.6.0\MathNet.Numerics.fsx"

open MathNet.Numerics.Random

let rndSeed = RandomSeed.Robust()
let rnd = Random.mersenneTwisterSeed rndSeed

I am using MathNet.Numerics random generator because it happens to be thread safe (which the .NET one is not). Thread safety is crucial because:

Line 46 above:

    let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()

Since all single runs of the algorithm are created equal, it is easy to parallelize and improve your running time up to as many times as you have processor cores. This comes almost for free from .NET if you care enough to add AsParallel().

I do freely mix LINQ and native F# functionality. Perhaps many a purist would scorn me, but I find it works great in practice. Sometimes generic .NET structures which LINQ likes so much (and which I am not using here, not explicitly anyway) are simply faster because they are mutable (another heresy in the functional world, I realize, however it is faster to add an element to a collection than to rebuild the entire collection in order to add one element. For long collections it adds up).

I don’t use the query{} syntax in F# 3.1. I do like it, it is just that I’m used to lambdas and chaining. A matter of taste, I suppose.

The code above only has one explicit “for” loop (Line 17 in Laplace Rule code), tail recursion rocks!

And the final observation: I have seen lots of folks note how F# lets you do things succinctly. I would like to add, that it also enables you to be “right” the first time much more often than many other languages. (Meaning, that the functions you write have a much better chance to have fewer bugs the very first time you launch them than if you were to write them in a non-functional language).


Get every new post delivered to your Inbox.

Join 60 other followers