Archive for the ‘bioinformatics’ Category

HashSet, Graph, Cognac

February 13, 2017 Leave a comment

The post on applying GPU to finding Eulerian path mentioned a stumbling block: partitioning a very specific kind of graph.

In general, partitioning is a hard problem, NP-hard actually. The graphs we are dealing with are very specific, though, and may be partitioned in O(|E|) time (E – the set of graph edges). Our graphs are of the following form: partitions If we are “lucky” it consists of just one or very few (directed) loops, if not – we have lots of such disconnected loops. And it is this unlucky case that kept the bottle of good cognac sealed all these months (only a descent solution to the partitioning problem would break it out).

To partition such a graph, i.e., color every loop in its distinct color, all we need to do is walk the graph from any vertex and once the loop is detected, we pick a vertex at random from the set of those we haven’t visited yet and start walking again. We repeat until everything is visited:

let partitionLinear (end' : int [])=
    let allVertices = HashSet<int>(end')
    let colors = Array.create end'.Length -1
    let mutable color = 0
    while allVertices.Count > 0 do
        let mutable v = allVertices.First()
        while colors.[v] < 0 do
            allVertices.Remove v |> ignore
            colors.[v] <- color
            v <- end'.[v]
        color <- color + 1
    colors, color

This is great, except it doesn’t work. The problem is revealed when a graph is very large and very fragmented. This is when the code at line 7 fails us. The problem is, we expect it to work in O(1): how hard can it be to retrieve the “first” element?! Well, since we are dealing with a data structure that does not have an intrinsic notion of order, it may be quite hard. In fact, the complexity here is O(|E|) (actually O(|V|, but for our graphs |V| = |E|), thus the complexity of the code above is O(|E|^2).

The following code actually works:

let partitionLinear (end' : int [])=
    let colors = Array.create end'.Length -1
    let mutable color = 0
    let mutable num = end'.Length
    let mutable curIdx = 0
    let mutable nextIdx = num - 1

    while num > 0 && curIdx >= 0 do
        let mutable v = curIdx
        while colors.[v] < 0 do
            colors.[v] <- color
            v <- end'.[v]
        color <- color + 1
        num <- num - 1
        while nextIdx >= 0 && colors.[nextIdx] >= 0 do
            nextIdx <- nextIdx - 1
        curIdx <- nextIdx
    colors, color

In the worst case it is still O(|E|^2), however, this worst case is unlikely and we expect the performance close to O(|E|) in general.
This won’t cure the performance problems of the GPU algorithm, which still relies on the number of graph partitions to be somewhat reasonable, but it will enable it to run in some respectable time. Perhaps time to break out the cognac after all!

Zooming Through Euler Path: Supercharging with GPU

December 26, 2016 Leave a comment

So, continuing where we left off:

  1. Walking the Euler Path: Intro
  2. Visualizing Graphs
  3. Walking the Euler Path: GPU for the Road
  4. Walking the Euler Path: PIN Cracking and DNA Sequencing

For the Win

And finally I ran the GPU-enabled algorithm for finding the Euler path.

let sw = Stopwatch()
let N = 1024 * 1024
let k = 7
let avgedges k = [1..k] |> float |> List.average
let gr = StrGraph.GenerateEulerGraph(N * 10, k)
printfn "Generated euler graph in %A, edges: %s" sw.Elapsed (String.Format("{0:N0}", gr.NumEdges))

let eulerCycle = findEulerTimed gr // GPU-based algorithm
let eulerVert = gr.FindEulerCycle() // Hierholzer algorithm
let cpu = float sw.ElapsedMilliseconds
printfn "CPU: Euler cycle generated in %A" sw.Elapsed

And the results:

Generating euler graph: vertices = 10,485,760; avg out/vertex: 4
Generated euler graph in 00:00:19.7520656, edges: 41,944,529
Euler graph: vertices - 10,485,760.00, edges - 41,944,529.00
1. Predecessors computed in 00:00:03.2146705
2. Partitioned linear graph in 00:00:06.4475982
Partitions of LG: 6
3. Circuit graph generated in 00:00:31.4655218
4. Swips implemented in 00:00:00.2189634

GPU: Euler cycle generated in 00:00:41.3474044
CPU: Euler cycle generated in 00:01:02.9022833

And I was like: WOW! Finally! Victory is mine! This is awesome! I’m awesome, etc. Victory dance, expensive cognac.
Then, after the euphoria subsided a little, I decided to make the mandatory chart:

Well, this was sobering!
While the CPU series line displays expected behavior, something is definitely not right with the GPU series: there is obviously some variable at work that I am not taking into account. So, from the beginning.

The Algorithm

I owe the algorithm to this master thesis, which actually implements the algorithm proposed by B. Awerbuch, A. Israeli and Y. Shiloach, “Finding euler circuits in logarithmic parallel time,” in Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 249-257.

The algorithm as I see it may be split into 4 stages (even 3, but 4 is slightly more convenient implementation-wise). Let’s illustrate.
Start with an Euler graph like the one below. It has 15 vertices with an average of 3 edges/vertex in one direction (maxOutOrInEdges = k, we have 44 edges in this case):

let N = 15
let k = 5
let gr = StrGraph.GenerateEulerGraph(N, k)


1. We walk it as we like, computing edge predecessors. For two edges e_1 = (u_1, v_1), e_2 = (u_2, v_2),\ e_1 is a predecessor of e_2 iff v_1 \equiv u_2, i.e. One edge begins where its predecessor ends. In our representation it’s easy to construct the array of predecessors:

let predecessors (gr : DirectedGraph<'a>) =

    let rowIndex = arrayCopy gr.RowIndex
    let ends = gr.ColIndex

    let predecessors = Array.create gr.NumEdges -1

    [|0..ends.Length - 1|]
        |> Array.iter
        (fun i ->
            predecessors.[rowIndex.[ends.[i]]] <- i
            rowIndex.[ends.[i]] <- rowIndex.[ends.[i]] + 1


2. At this point, if we are lucky, we have the representation of an Euler cycle as edges of the graph. We just need to walk the array we have “backwards”, seeding the final list of edges with edge 0, constructing the list recursively like so: predecessors.[List.head result] :: result. Alternatively, we may generate a graph out of the result and reverse it. (directions of the arrows need to be reversed since this is a predecessor graph. Euler cycles of the graph, where all directions are reversed are the same as those of the original one, reversed.)

In case we aren’t lucky, we consider our predecessor array to be a graph, where each edge of the original graph becomes a vertex and identify partitions of the graph:


This is the weak point of the algorithm. Partitioning a graph is, in general, a hard problem (NP-complete, to be precise), however, in this case, due to a very simple structure of the predecessor graph, the complexity is linear in the number of edges of the original graph: O(|E|).

let partitionLinear (end' : int [])=
    let allVertices = HashSet<int>(end')
    let colors = Array.create end'.Length -1
    let mutable color = 0

    while allVertices.Count > 0 do
        let mutable v = allVertices.First()
        while colors.[v] < 0 do
            allVertices.Remove v |> ignore
            colors.[v] <- color
            v <- end'.[v]
        color <- color + 1
    colors, color

So, now the goal is to join all the circles above into one circle, this is done in the crucial step 3

3. We further collapse the graph based on partitioning. Now, each partition becomes a vertex of the new graph. Edges of this new “circuit graph” are vertices of the original graph, such that each edge represents a vertex two partitions have in common.

This is the only part of the algorithm where the GPU is used and is very effective. Incidentally, I took the code almost verbatim from the original thesis, however, the author for some reason preferred not to implement this step on the GPU.

The idea is simple: we loop over the original graph vertex-by-vertex and try to figure out whether edges entering this vertex belong to different partitions (have different colors in the terminology of the code above). Each vertex is processed in a CUDA kernel:

let gcGraph, links, validity = generateCircuitGraph gr.RowIndex partition maxPartition


4. This graph is greatly over-determined: we don’t need ALL vertices that partitions have in common (represented by edges here). Also, it’s important to note that this graph is not directed: if partition 0 has a vertex in common with partition 1, then this is the same vertex partition 1 has in common with partition 0. In our implementation this un-directionality is reflected by over-directionality: every edge (u_1, v_1) is repeated as (v_1, u_1) All we actually need is a spanning tree of this graph:



Alright, this is much better – ignore directions. The output of step 3 gives us vertices of the original graph where our partitions intersect. We now need to swap edges of our original predecessor array around these vertices, so that each partition is not closed off on itself, but merges with its neighbor (it’s but a small correction to our original predecessor walk). We do this one-by-one, so partition 0 merges first with 1, then with 2. And 2 – with 3. And 1 with 4.

let fixedPredecessors = fixPredecessors gcGraph links edgePredecessors validity
let finalGraph = StrGraph.FromVectorOfInts fixedPredecessors


And it’s a beautiful circle, we are done!

Why not Break out That Cognac?

let N = 1024 * 1024
let i = 1

let gr = StrGraph.GenerateEulerGraphAlt(N * i, 3 * i * N)
let eulerCycle = findEulerTimed gr
Euler graph: vertices - 1,048,575.00, edges - 3,145,727.00
1. Predecessors computed in 00:00:00.3479258
2. Partitioned linear graph in 00:02:48.3658898
Partitions of LG: 45514
# of partitions: 45514 (CPU generation of CG)
3. Circuit graph generated in 00:00:34.1632645
4. Swips implemented in 00:00:00.1707746
GPU: Euler cycle generated in 00:03:23.0505569

This is not very impressive. What’s happening? Unfortunately graph structure holds the key together with the HashSet implementation.

The deeper the graph the better it will fare in the new algorithm. The bottleneck is the partitioning stage. Even though its complexity is theoretically O(|E|), I am using a HashSet to restart partitioning when needed and that presents a problem, as accessing it is not always O(1)!

The methods for Euler graph generation are implemented as GenerateEulerGraph and GenerateEulerGraphAlt. The first one “pleases the code”, and generates graphs that are very deep even when the number of edges is large. Usually I get less than 10 partitions, which means that every time I generate predecessors, I’m pretty much guaranteed to be “almost there” as far as finding a cycle. The second method tends to generate very shallow graphs, as the example above shows: I got a fairly large number of partitions while the number of edges is only around 3 million. So while the rest of the algorithm performance is pretty descent, computing partitions just kills the whole thing.

Store the cognac for another time.

Categories: bioinformatics, CUDA, F# Tags: , , , ,

Walking the Euler Path: PIN Cracking and DNA Sequencing

November 8, 2016 Leave a comment

Continuing on to some cool applications of Eulerian paths.

  1. Intro
  2. Visualization
  3. GPU/CUDA Injection

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) =
    let mutable curVertex = defaultArg start 0

    let stack = Stack<int>()
    let connections = Dictionary<int, int []>()
    let start = curVertex
    let mutable cycle = []
    connections.Add(curVertex, this.GetConnectedVertices curVertex)
    let mutable first = true

    while stack.Count > 0 || first do
        first <- false
        let connected = connections.[curVertex]
        if connected.Length = 0 then
            cycle <- curVertex :: cycle
            curVertex <- stack.Pop()
            stack.Push curVertex
            connections.[curVertex] <- connected.[1..]
            curVertex <- connected.[0]
            if not (connections.ContainsKey curVertex) then
                connections.Add(curVertex, this.GetConnectedVertices curVertex)

    let path = start::cycle
    if path.Length <> this.NumEdges + 1 then []
        start::cycle |> (fun i -> verticesOrdinalToNames.[i])

Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal =
    let start = rowIndex.[ordinal]
    let end' = rowIndex.[ordinal + 1] - 1

De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

  1. Contains all sub-sequences of length n, and
  2. Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:


We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: 01011100. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2]
let suffix (s:string) = s.[1..]
let prefSuf s = prefix s, suffix s // shorthand

let numToBinary len n =
    let rec numToBinaryRec n len acc =
        if len = 0 then acc
           numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc))
    numToBinaryRec n len ""

let binaryDebruijnSeq n =
    if n <= 0 then failwith "n should be positive"
    let finish = pown 2 n
    let gr =
        |> (numToBinary n >> prefSuf)
        |> List.groupBy fst
        |> (fun (v, prefSuf) -> v + " -> " + (prefSuf |> snd |> List.reduce (fun st e -> st + "," + e )))
        |> DirectedGraph<string>.FromStrings

    let debruinSeq = gr.FindEulerPath()
    let debruinNum = 
        |> List.windowed 2 
        |> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"")

    gr.Visualize(euler = true, eulerLabels = debruinNum)

Here the function binaryDeruijnSeq computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: p\ ->\ s_1,\ s_2...s_n , connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a DirectedGraph class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.

Walking the Euler Path: GPU for the Road

September 25, 2016 Leave a comment

Continuation of the previous posts:

  1. Intro
  2. Visualization

GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “alea iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite Aela.CUDA library.

GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) =
    use dSorted = sortGpu dArr
    use dGrouped = worker.Malloc<int>(dSorted.Length)

    let lp = LaunchParam(divup dSorted.Length blockSize, blockSize)
    worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr

    compactGpuWithKernel <@createDistinctMap @> dGrouped
  1. We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
  2. We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
  3. Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000))
GpuDistinct.distinct sample

Here is the comparison:

Length: 2,097,152
CPU distinct: 00:00:00.0262776
GPU distinct: 00:00:02.9162098
Length: 26,214,400
CPU distinct: 00:00:00.5622276
GPU distinct: 00:00:03.2298218
Length: 262,144,000
CPU distinct: 00:00:03.8712437
GPU distinct: 00:00:05.7540822

Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000
CPU distinct: 00:00:04.2019013
GPU distinct: 00:00:06.7728424

The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

Computing Graph Properties


Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true)

euler.Visualize(spanningTree=true, washNonSpanning=false)


Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing IEnumerable<DirectedGraph> to accomplish this, right? Wrong! What we want is the AsEnumerable property. Makes things clean and easy. It uses Seq.init method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices 
    (fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])

Now we can also do ourselves a favor and decorate our class with the StructuredFormatDisplay("{AsEnumerable}") to enable F# Interactive pretty printing of our graph:

type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : 

Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph =
    [("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]);
     ("3", [|"5"; "6"; "7"|]); ...]

We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray

to completely actualize the sequence and see the textual representation of the entire graph.

“Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given G=(V, E), G_r=(V_r, E_r):\ V_r=V,\ E_r=\{(v_i, v_j) | (v_j, v_i) \in V \}

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple (v_i, v_j), we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse =
    lazy (
        let allExistingRows = [0..rowIndex.Length - 1]

        let subSeq =
            if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse
                let start, end' =
                    let dStart, dEnd = getEdgesGpu rowIndex colIndex
                    sortStartEnd dStart dEnd

       end' start
                asOrdinalsEnumerable ()
                |> (fun (i, verts) -> verts |> (fun v -> (v, i)))
                |> Seq.collect id

        let grSeq =
            |> Seq.groupBy fst
            |> (fun (key, sq) -> key, sq |> snd |> Seq.toArray)

        let allRows : seq<int * int []> =
            allExistingRows.Except (grSeq |> fst) |> (fun e -> e, [||])
            |> fun col -> col.Union grSeq
            |> Seq.sortBy fst

        let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length
        let revColIndex = allRows |> Seq.collect snd

        DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal)

    member this.Reverse = reverse.Force()

On line 35, .Force() will only call the computation once and cache the result. Each subsequent call to .Force() will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” grSeq will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|];

let gr = StrGraph.FromStrings s

rays rays_rev

Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to |V| = 10 \cdot 2^{10} \cdot 2^{10}.

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: start and end' that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.

    let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) =
        let idx = blockIdx.x * blockDim.x + threadIdx.x
        if idx < len - 1 then
            for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do
                start.[vertex] <- idx
                end'.[vertex] <- colIndex.[vertex]

Here is the test:

let mutable N = 10 * 1024 * 1024
let k = 5

let gr = StrGraph.GenerateEulerGraph(N, k)

printfn "Graph: %s vertices, %s edges generated in %A" 
    (String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed

let starts, ends = getEdges gr.RowIndex gr.ColIndex

printfn "GPU edges: %A" sw.Elapsed



printfn "CPU edges: %A" sw.Elapsed

And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697
GPU edges: 00:00:01.5234606
CPU edges: 00:00:16.5161326

Finally. I’m happy to take the win!

Categories: bioinformatics, CUDA, F# Tags: , , , ,

Walking the Euler Path: Intro

September 17, 2016 2 comments

Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.


The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row i will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with INumeric included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs

/// <summary>
/// Sparse matrix implementation with CSR and CSC storage
/// </summary>
type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) =


static member CreateMatrix (row : 'a []) (isCSR : bool) =
    let ops = GlobalAssociations.GetNumericAssociation<'a>()
    let colIdx, vals = [|0..row.Length - 1|] row 
        |> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0)
        |> Array.unzip

    SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)

The idea is to use the GlobalAssociations to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to CreateMatrix and then rows can be appended to it by calling AddValues row. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix |M| = m \cdot n, then the answer is given by the equation: m \cdot n > 2 \cdot e + m + 1, here e is the number of non-zero elements in the matrix.

For a graph G=(V, E) the set V takes a place of rows, and E – that of columns. The above inequality becomes: v \cdot e > e + v + 1 \ (v = |V|,\ e = |E|), so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:

type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this =
    let rowIndex  = rowIndex.ToArray()
    let colIndex = colIndex.ToArray()
    let nEdges = colIndex.Length
    let verticesNameToOrdinal = verticesNameToOrdinal
    let nVertices = verticesNameToOrdinal.Count

    // vertices connected to the ordinal vertex
    let getVertexConnections ordinal =
        let start = rowIndex.[ordinal]
        let end' = rowIndex.[ordinal + 1] - 1

This is not very useful, however, since it assumes that we already have rowIndex for the CSR type “R” and colIndex for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file Generators.fs.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.
In the next post I’ll talk about visualizing things, and vsiualization methods really have nothing to do with the graph itself. Nevertheless, it is natural to write:


instead of:

Visualize(myGraph, euler=true)

So we use type augmentations, for instance, going back to the generators:


type Graphs.DirectedGraph<'a when 'a:comparison> with
    /// <summary>
    /// Create the graph from a file
    /// </summary>
    /// <param name="fileName"></param>
    static member FromFile (fileName : string) =

        if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file"

        let lines = File.ReadLines(fileName)

which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.

Non-linear Thinking with CUDA.

September 8, 2015 1 comment

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion in more than one sense of the word.

The Problem

Given a set A = \{a_1, a_2 \ldots a_n\}, output a set S_A = \{0,\ \sum\limits_{k=1}^{n} a_k,\ \sum\limits_{k=i}^{i + j \mod n} a_k | 1 \le i < n, 0 \le j \le n - 1\}

In other words, pretend our array is cyclical, and we want all partial sums of the array elements, in order, starting from element i, ending with i + j. (We will be looping through the end, since the array is cyclical.)

In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide.

So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

57 71 113 113 128
57 + 71 71 + 113 113 + 113 113 + 128 128 + 57
57 + 71 + 113 71 + 113 + 113 113 + 113 + 128 113 + 128 + 57 128 + 57 + 71


On my GitHub.

Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is O(n^3):

// brute force O(n^3) solution
let cyclospec (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |&gt; Seq.toArray

    let mutable parsum = 0

    (seq {
        yield 0
        yield pepArr |&gt; Array.sum
        for i = 0 to len - 2 do
            yield! [0..len-1]
                    (fun j -&gt;
                        //F# 4.0 Feature!
                        parsum &lt;- 0
                        for ind = j to j + i do
                            parsum &lt;- parsum + pepArr.[if ind >= len then ind - len else ind]
    }) |> Seq.sort |> Seq.toArray

Note the F# 4.0 feature: a variable mutated within a closure.

This will work, but slowly. To significantly speed it up we only need to notice that there is no need to recompute values that have been precomputed already over, and over, and over again. So, memoization yields an O(n^2) solution:

let cyclospecOpt (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray

    let output = Array.zeroCreate ((len - 1) * len)

    for i = 0 to len - 2 do
            |> Seq.iter 
                (fun j ->  
                    let ind = i + j
                    output.[i * len + j] <- 
                        if i = 0 then pepArr.[j] 
                        else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind]
    seq {yield 0; yield pepArr |> Array.sum; yield! output} |> Seq.sort |> Seq.toArray

Ok, so far so good.

Multi-dimensional thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of a_5 + a_6 + a_7, thus it’s a grid X of dimensions n x (n-1)( where x_{i, j} is the sum, starting from i-th element of the input set, and ranging over j elements.

This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).

Since we now have ALEA.Cuda, we can do this without leaving F#, and, even better, we can script it:

[<Kernel; ReflectedDefinition>]
let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    let lenElem = blockIdx.y * blockDim.y + threadIdx.y

    if ind < len && lenElem < len - 1 then
        let mutable parsum = 0
        for i = 0 to lenElem do
            let idx = ind + i
            parsum <- parsum + arr.[if idx >= len then idx - len else idx]
        out.[lenElem * len + ind] <- parsum

And we invoke it:

let cyclospecGpu (peptide : int []) =
    let blockSize = dim3(16, 16)
    let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y)
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    worker.Launch <@cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

Each kernel thread is computing its own element in the grid, which is flattened into a solution. Time has indeed merged with data.

But what about our optimization? The easy way to implement it is to streamline invocations of a kernel, where each consecutive run will compute the new sum using the results of previous kernel invocations. There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

let cyclospecOptKernel (arr : deviceptr) (len : int) (out : deviceptr) (lenElem : int)=
let ind = blockIdx.x * blockDim.x + threadIdx.x

if ind < len then
let idx = ind + lenElem
out.[lenElem * len + ind] = len then idx – len else idx]


let cyclospecOptGpu (peptide : int []) =
    let blockSize = 256
    let gridSize = divup peptide.Length blockSize
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    for i = 0 to peptide.Length - 2 do
        worker.Launch <@cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

So, how did we do? (Here comes the mandatory chart):

And so:

  1. A little algorithm is a beautiful thing
  2. A little CUDA is an even more beautiful thing
  3. CUDA optimization performs no worse and at larger sizes of the input starts performing better, at some point massive amounts of computation slow us down even in parallel. See #1.
  4. And what is the GPU algorithm performance in O notation? It’s definitely non-linear, is it possible to estimate analytically?

Here is another chart that only compares optimized algorithms. It’s not a semi-log chart for easier visualization:

Turnpike Problem (F#)

September 6, 2015 Leave a comment

The Problem

This is a problem for which no polynomial time solution exists, although a pseudo-polynomial (polynomial in the degree of the maximum element) does exist. In this paper, the algorithm proposed for the sum-function with full information is simple enough. For the original problem with the difference funcition, the algorithm is simple as well:

Problem. Given a set D_A=\{a_i-a_j\},\ 1\le i, j\le n, reconstruct the original set A=\{a_i\},\ 1 \le i \le n. (Naturally, we can only consider positive deltas:
D_A=\{a_i-a_j\},\ 1\le i<j \le n, the set is sorted in the descending order).

A simple algorithm exists. It performs in O(2^n\cdot\log{n}) time, but in practice its performance is closer to O(n^2\cdot\log{n}), as it "probably" won't hit the worst case.

We first reduce our set D_A to only positive values and assume they are sorted in the descending order.
Then the pseudocode will run as follows:

solution = [0; Da[0]]
max_element = Da[0]
remove (Da, Da[0])
while (Da not empty) {
   if solution = [0] return []
   if distances(solution.Add(Da[0])) \subseteq Da && not explored_this_path()
      remove(Da, distances)
   else if distances(solution.Add(max - Da[0])) \subseteq Da && not explored_this_path()
      remove(Da, distances)
      backtrack(solution, Da)
return solution;

We start by adding 0 and the maximal element to the solution, and remove the latter from the Da. At every step, we have a new version of the solution and Da. We take \max{D_{A_cur}} to be the possible next element of the set A. Then we compute pair-wise distances of the current sub-solution and see if they represent the subset of D_{A_cur}. If that fails – we then try a_{max} - \max{D_{A_cur}} the same way. If that fails as well, we backtrack to the previous state of the solution and previous D_{A_cur}, and keep backtracking until we reach a state, where not all of alternatives have been tried.

Once current Da is empty – we have a solution. If we have reached the root state and all alternatives have been tried – there is no solution.

Essentially, we are building a binary tree, where each node contains the current solution, current Da, and an edge connecting it with one or both of the alternatives above. Hopefully, we do not need to keep building this tree for too long. Here is a nice illustration of this process.

Note, that at every step we favor the remaining maximal element of D_{A_cur} as opposed to a_{max} - \max{D_{A_cur}}. Doing the latter would produce a symmetric solution.

F# Implementation

The source for this implementation is here

At every step, we keep building the following tree:

type DecisionTree =
    | Empty
    | TreeNode of deltas : Dictionary<int, int> * solution : int list * visited: uint32 * maxElem : DecisionTree * maxDist : DecisionTree * prev : DecisionTree 

Every node contains the current deltas, current state of the solution, the number of times we have visited this node (for optimization), two possible branches and a pointer to the parent node.

The function that implements the current algorithm:

let rec insert (deltas : Dictionary<int, int>) (res : int list) (node : DecisionTree) (prevNode : DecisionTree) maxSol =
    let insPrev () =
        let res, deltas, prevPrev = getPrev prevNode
        insert deltas res prevNode prevPrev maxSol

    match node with 
    | Empty -> TreeNode(deltas, res, 0u, Empty, Empty, prevNode)
    | TreeNode(dct, rslt, visited, maxElem, maxDist, prev) when visited < 2u ->
        let curVisited = visit node
        let elem = 
            if visited = 0u then keySeqMax deltas else maxSol - keySeqMax deltas
        let dists = stepOk elem res deltas
        if dists.Length > 0 then
            let newDeltas = removeDist deltas dists
            insert newDeltas (elem::res) (if visited = 0u then maxElem else maxDist) curVisited maxSol
        elif visited = 0u then
            insert deltas res curVisited prev maxSol
    | _ -> insPrev()

The insert function implements each step of the algorithm and returns a reference to the currently inserted node.
The driver is a loop, implemented, naturally as a tail-recursive function:

let rec buildSolution (deltas : Dictionary<int, int>) (res : int list) (node : DecisionTree) (prev : DecisionTree) =
    if deltas.Count = 0 then res |> List.sort
        let newNode = insert deltas res node prev maxSol
        let prevNode = node
        match newNode with
        | Empty -> [] // no solution
        | TreeNode(deltas, res, visited, maxElem, maxDist, prev) ->
             if visited >=2u && prev = Empty then [] // came all the way back, no solution
                 buildSolution deltas res newNode prevNode

This algorithm is biased towards always selecting the maximum remaining element from the Da as the next element in the solution. So, for an input set [2; 2; 3; 3; 4; 5; 6; 7; 8; 10], this algorithm produces [0; 3; 6; 8; 10]. Symmetrically, [0; 2; 4; 7; 10] is also a solution.

Categories: bioinformatics, F# Tags: ,