## Zooming Through Euler Path: Supercharging with GPU

So, continuing where we left off:

- Walking the Euler Path: Intro
- Visualizing Graphs
- Walking the Euler Path: GPU for the Road
- Walking the Euler Path: PIN Cracking and DNA Sequencing

### For the Win

And finally I ran the GPU-enabled algorithm for finding the Euler path.

let sw = Stopwatch() let N = 1024 * 1024 let k = 7 let avgedges k = [1..k] |> List.map float |> List.average let gr = StrGraph.GenerateEulerGraph(N * 10, k) printfn "Generated euler graph in %A, edges: %s" sw.Elapsed (String.Format("{0:N0}", gr.NumEdges)) let eulerCycle = findEulerTimed gr // GPU-based algorithm sw.Restart() let eulerVert = gr.FindEulerCycle() // Hierholzer algorithm sw.Stop() let cpu = float sw.ElapsedMilliseconds printfn "CPU: Euler cycle generated in %A" sw.Elapsed

And the results:

Generating euler graph: vertices = 10,485,760; avg out/vertex: 4 Generated euler graph in 00:00:19.7520656, edges: 41,944,529 Euler graph: vertices - 10,485,760.00, edges - 41,944,529.00 1. Predecessors computed in 00:00:03.2146705 2. Partitioned linear graph in 00:00:06.4475982 Partitions of LG: 6 3. Circuit graph generated in 00:00:31.4655218 4. Swips implemented in 00:00:00.2189634 GPU: Euler cycle generated in00:00:41.3474044CPU: Euler cycle generated in00:01:02.9022833

And I was like: WOW! Finally! Victory is mine! This is awesome! I’m awesome, etc. Victory dance, expensive cognac.

Then, after the euphoria subsided a little, I decided to make the mandatory chart:

Well, this was sobering!

While the CPU series line displays expected behavior, something is definitely not right with the GPU series: there is obviously some variable at work that I am not taking into account. So, from the beginning.

### The Algorithm

I owe the algorithm to this master thesis, which actually implements the algorithm proposed by B. Awerbuch, A. Israeli and Y. Shiloach, “Finding euler circuits in logarithmic parallel time,” in Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 249-257.

The algorithm as I see it may be split into 4 stages (even 3, but 4 is slightly more convenient implementation-wise). Let’s illustrate.

Start with an Euler graph like the one below. It has 15 vertices with an average of 3 edges/vertex in one direction (maxOutOrInEdges = k, we have 44 edges in this case):

let N = 15 let k = 5 let gr = StrGraph.GenerateEulerGraph(N, k) gr.Visualize(edges=true)

1. We walk it as we like, computing edge predecessors. For two edges is a predecessor of iff , i.e. One edge begins where its predecessor ends. In our representation it’s easy to construct the array of predecessors:

let predecessors (gr : DirectedGraph<'a>) = let rowIndex = arrayCopy gr.RowIndex let ends = gr.ColIndex let predecessors = Array.create gr.NumEdges -1 [|0..ends.Length - 1|] |> Array.iter (fun i -> predecessors.[rowIndex.[ends.[i]]] <- i rowIndex.[ends.[i]] <- rowIndex.[ends.[i]] + 1 ) predecessors

2. At this point, if we are lucky, we have the representation of an Euler cycle as edges of the graph. We just need to walk the array we have “backwards”, seeding the final list of edges with edge 0, constructing the list recursively like so:` predecessors.[List.head result] :: result`

. Alternatively, we may generate a graph out of the result and reverse it. (directions of the arrows need to be reversed since this is a predecessor graph. Euler cycles of the graph, where all directions are reversed are the same as those of the original one, reversed.)

In case we aren’t lucky, we consider our predecessor array to be a graph, where each edge of the original graph becomes a vertex and identify partitions of the graph:

This is the weak point of the algorithm. Partitioning a graph is, in general, a hard problem (NP-complete, to be precise), however, in this case, due to a very simple structure of the predecessor graph, the complexity is linear in the number of edges of the original graph: O(|E|).

let partitionLinear (end' : int [])= let allVertices = HashSet<int>(end') let colors = Array.create end'.Length -1 let mutable color = 0 while allVertices.Count > 0 do let mutable v = allVertices.First() while colors.[v] < 0 do allVertices.Remove v |> ignore colors.[v] <- color v <- end'.[v] color <- color + 1 colors, color

So, now the goal is to join all the circles above into one circle, this is done in the crucial step 3

3. We further collapse the graph based on partitioning. Now, each partition becomes a vertex of the new graph. Edges of this new “circuit graph” are vertices of the original graph, such that each edge represents a **vertex two partitions have in common**.

This is the only part of the algorithm where the GPU is used and is very effective. Incidentally, I took the code almost verbatim from the original thesis, however, the author for some reason preferred not to implement this step on the GPU.

The idea is simple: we loop over the original graph vertex-by-vertex and try to figure out whether edges entering this vertex belong to different partitions (have different colors in the terminology of the code above). Each vertex is processed in a CUDA kernel:

let gcGraph, links, validity = generateCircuitGraph gr.RowIndex partition maxPartition gcGraph.Visualize()

4. This graph is greatly over-determined: we don’t need ALL vertices that partitions have in common (represented by edges here). Also, it’s important to note that this graph is not directed: if partition 0 has a vertex in common with partition 1, then this is the same vertex partition 1 has in common with partition 0. In our implementation this un-directionality is reflected by over-directionality: every edge is repeated as All we actually need is a spanning tree of this graph:

gcGraph.Visualize(spanningTree=true)

Alright, this is much better – ignore directions. The output of step 3 gives us vertices of the original graph where our partitions intersect. We now need to swap edges of our original predecessor array around these vertices, so that each partition is not closed off on itself, but merges with its neighbor (it’s but a small correction to our original predecessor walk). We do this one-by-one, so partition 0 merges first with 1, then with 2. And 2 – with 3. And 1 with 4.

let fixedPredecessors = fixPredecessors gcGraph links edgePredecessors validity let finalGraph = StrGraph.FromVectorOfInts fixedPredecessors finalGraph.Reverse.Visualize()

And it’s a beautiful circle, we are done!

### Why not Break out That Cognac?

let N = 1024 * 1024 let i = 1 let gr = StrGraph.GenerateEulerGraphAlt(N * i, 3 * i * N) let eulerCycle = findEulerTimed gr

Euler graph: vertices - 1,048,575.00, edges - 3,145,727.00 1. Predecessors computed in 00:00:00.3479258 2. Partitioned linear graph in00:02:48.3658898Partitions of LG: 45514 # of partitions:45514 (CPU generation of CG)3. Circuit graph generated in 00:00:34.1632645 4. Swips implemented in 00:00:00.1707746 GPU: Euler cycle generated in 00:03:23.0505569

This is not very impressive. What’s happening? Unfortunately graph structure holds the key together with the `HashSet`

implementation.

The deeper the graph the better it will fare in the new algorithm. The bottleneck is the partitioning stage. Even though its complexity is theoretically O(|E|), I am using a `HashSet`

to restart partitioning when needed and that presents a problem, as accessing it is not always O(1)!

The methods for Euler graph generation are implemented as `GenerateEulerGraph`

and `GenerateEulerGraphAlt`

. The first one “pleases the code”, and generates graphs that are very deep even when the number of edges is large. Usually I get less than 10 partitions, which means that every time I generate predecessors, I’m pretty much guaranteed to be “almost there” as far as finding a cycle. The second method tends to generate very shallow graphs, as the example above shows: I got a fairly large number of partitions while the number of edges is only around 3 million. So while the rest of the algorithm performance is pretty descent, computing partitions just kills the whole thing.

Store the cognac for another time.

## Walking the Euler Path: PIN Cracking and DNA Sequencing

Continuing on to some cool applications of Eulerian paths.

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

### Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) = let mutable curVertex = defaultArg start 0 let stack = Stack<int>() let connections = Dictionary<int, int []>() let start = curVertex let mutable cycle = [] connections.Add(curVertex, this.GetConnectedVertices curVertex) let mutable first = true while stack.Count > 0 || first do first <- false let connected = connections.[curVertex] if connected.Length = 0 then cycle <- curVertex :: cycle curVertex <- stack.Pop() else stack.Push curVertex connections.[curVertex] <- connected.[1..] curVertex <- connected.[0] if not (connections.ContainsKey curVertex) then connections.Add(curVertex, this.GetConnectedVertices curVertex) let path = start::cycle if path.Length <> this.NumEdges + 1 then [] else start::cycle |> List.map (fun i -> verticesOrdinalToNames.[i])

Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal = let start = rowIndex.[ordinal] let end' = rowIndex.[ordinal + 1] - 1 colIndex.[start..end']

### De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

- Contains all sub-sequences of length n, and
- Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:

We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: `01011100`

. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2] let suffix (s:string) = s.[1..] let prefSuf s = prefix s, suffix s // shorthand let numToBinary len n = let rec numToBinaryRec n len acc = if len = 0 then acc else numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc)) numToBinaryRec n len "" let binaryDebruijnSeq n = if n <= 0 then failwith "n should be positive" let finish = pown 2 n let gr = [0..finish-1] |> List.map (numToBinary n >> prefSuf) |> List.groupBy fst |> List.map (fun (v, prefSuf) -> v + " -> " + (prefSuf |> List.map snd |> List.reduce (fun st e -> st + "," + e ))) |> DirectedGraph<string>.FromStrings let debruinSeq = gr.FindEulerPath() let debruinNum = debruinSeq |> List.windowed 2 |> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"") gr.Visualize(euler = true, eulerLabels = debruinNum)

Here the function `binaryDeruijnSeq`

computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: , connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a `DirectedGraph`

class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

### PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

### DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.

## Motif Finding with Gibbs Sampling (F#)

## The Problem

“Motif finding is a problem of finding common substrings of specified length in a set of strings. In bioinformatics, this is useful for finding transcription binding sites” (recap here). The problem is succinctly stated on Rosalind.

Given a set of strings DNA of size *t*, find “most common” substrings of length *k.* “Most common” means, that substrings should deviate from each other as little as possible, but they may not necessarily be the same. With this in mind, scoring is defined as follows:

let colScore (col : int []) = col.Length - col.GroupBy(fun c -> c).Select(fun gr -> gr.Count()).Max() let score (motifs : int [,]) = [0..Array2D.length2 motifs - 1] |> List.map (fun col -> motifs.[*,col] |> colScore) |> List.sum

To illustrate with the DNA alphabet (we only have 4 letters in the alphabet = {A, C, G, T}), the following:

ACC

ACG

ACT

CCT

will be scored as: 1 + 0 + 2 = 3. We add up scores of each column, where the score of the column is defined as col_length – max_repeating_letters

So, column 1 score is 1 because A repeats 3 times – it is the maximum repeating letter, etc. The problem is then to find an arrangement of k-length strings (called **k-mers**) that minimize the score.

## Profile

Given a collection (matrix) of k-mers, a profile is built in order to determine what would be “the most probable” next k-mer if we were to augment our collection. A profile is a 4xk matrix, where P(i, j) is the probability of a given letter alphabet.[i] in the column (k-mer position) j. For the matrix above:

3/4 | 0 | 0 |

1/4 | 4/4 | 1/4 |

0 | 0 | 1/4 |

0 | 0 | 2/4 |

(it’s the number of letters occurring in the column, divided by the total number of k-mers, *t*)

### Laplace Rule

We amend our simple profile calculation with the Laplace rule of succession. This is done to compensate for 0 and 1 probabilities: keeping those will have a disastrous effect on our algorithm, where we will never be able to pick a k-mer that, like in the example above, has anything other than “C” in the second position.

We apply the rule by always adding 4 strings to the collection, each consisting of one letter of the alphabet repeated *k* times. So the collection above becomes:

ACC

ACG

ACT

CCT

AAA

CCC

GGG

TTT

And the profile becomes:

// borrowed from StackOverflow. Stack two 2D arrays vertically (I miss Python!) let stackV (a1: 'a[,]) (a2: 'a[,]) = let a1l1,a1l2,a2l1,a2l2 = (Array2D.length1 a1),(Array2D.length2 a1),(Array2D.length1 a2),(Array2D.length2 a2) if a1l2 <> a2l2 then failwith "arrays have different column sizes" let result = Array2D.zeroCreate (a1l1 + a2l1) a1l2 Array2D.blit a1 0 0 result 0 0 a1l1 a1l2 Array2D.blit a2 0 0 result a1l1 0 a2l1 a2l2 result let createProfile (motifs : int[,]) = let k = Array2D.length2 motifs let t = Array2D.length1 motifs + 4 let appendLaplace () = let As = Array.zeroCreate k let rest = [|yield As for i = 1 to 3 do yield As |> Array.map (fun a -> a + i) |] |> array2D stackV motifs rest let m = appendLaplace() let profile = [|0..k - 1|] |> Array.map (fun col -> m.[0..,col] .GroupBy(fun c -> c) .OrderBy(fun gr -> gr.Key) .Select(fun gr -> float(gr.Count()) / float t) .ToArray()) Array2D.init 4 k (fun i j -> profile.[j].[i])

Here we append the motif matrix (represented by a 2D integer array in F#, where a row consists of integers in the range [0..3], corresponding to the letters of our DNA alphabet) with 4 additional entries as described above and retrieve the resulting profile matrix.

## Consensus k-mer

The profile matrix is used to pick a k-mer that fits the profile best. For instance, if we have built a profile based on motifs extracted from *t – 1* DNA strings, and need to pick a k-mer from the *t-th* DNA string that fits the profile best, we first compute probabilities of all k-mers in string t based on the profile matrix:

Pr(k-mer|profile) = profile[k-mer[0], 0] * profile[k-mer[1], 1] *..* profile[k-mer[k-1], k-1]

(Here is where Laplace rule comes in handy: we would really be in trouble if there were 0 entries in the profilie matrix. Many k-mers would have Pr(k-mer|profile) = 0 and would short-circuit the algorithm).

and then pick the k-mer with the highest probability.

Code below computes probabilities of a set of k-mers given a profile.

let kmerProbs (kmers : string []) (profile : float [,]) = let k = Array2D.length2 profile let probs = kmers |> Array.map (fun kmer -> kmer, kmer.ToCharArray() |> Array.map (fun c -> alphabet.[c]) |> Array.mapi (fun i ind -> profile.[ind, i]) |> Array.fold (fun state p -> state * p) 1.) probs

## Gibbs Sampling

Given a discrete distribution, we want to sample from it:

- Pick a sample
*s*from the uniform distribution [0, n) - Lookup its probability,
*p*_{s} - Sample from a uniform [0, 1],
*p*_{u} - If
*p*<=_{u}*p*– accept the sample and return it, otherwise repeat._{s}

One thing to note here is that our probabilities do not necessarily sum up to 1 by design. So we first normalize them, by dividing each probability value by the sum of all probabilities.

The function below returns a motif sampled from the distribution we built from a profile:

let gibbsSample (rnd : Random) (probs : (string * float) []) = let sumall = probs |> Array.sumBy (fun (s, p) -> p) let prs = probs |> Array.map (fun (k, v) -> k, v / sumall) let len = probs.Length let rec rollUntilAccepted value = let n = rnd.NextDouble() if n <= snd prs.[value] then value else rollUntilAccepted (rnd.Next(0, len)) fst probs.[rollUntilAccepted (rnd.Next(0, len))]

## The Algorithm

Since solving the problem exactly by brute force is not feasible in practice with long strings and huge collections, approximations are used to get solutions that are “close enough”.

The Gibbs sampling based solution starts from a collection of random k-mers, each picked from one string of DNA and tries to improve on this original pick, performing a sequence of iterations, using Gibbs sampling.

Since the solution evidently depends on the initial collection picked, this is re-run many times to pick the collection with the minimal score.

Briefly the algorithm is:

Given a collection of strings (DNA) of size *t*, length *k*,

- Start from a collection of motifs (k-mers) of size
*t*randomly picked from DNA, one motif from one string of DNA - Store the score of this collection
- Loop many times:
- Select a random number j in [0, t)
- Exclude j-th motif from the motif collection and build the profile of the remaining collection
- Compute probability distribution of motifs in DNA[j] using the profile matrix from the previous step
- Sample from the distribution and replace j-th motif with the obtained sample
- Memorize best score, best collection of motifs

Here are the above steps translated to F#:

let gibbsSamplingMotifSearch (dna : string []) k iters rndStarts = let t = dna.Length //prepopulate the list of k-mers for all DNA strings let kmers = dna |> Array.map(fun s -> [0..s.Length - k] .Select(fun i -> s.Substring(i, k)).Distinct().ToArray()) let len = dna.[0].Length - k // single pass of the algorithm (single "random start") let runSingle () = let firstMotifs = [|1..t|] |> Array.map (fun i -> rnd.Next(0, len)) |> Array.mapi (fun i p -> dna.[i].Substring(p, k) |> toInts) let bestMotifs = array2D firstMotifs let bestScore = score bestMotifs // repeat the sampling step n times let rec runSampler (motifs : int [,]) (bestMotifs : int [,]) bestScore n = if n = 0 then bestScore, bestMotifs else let except = rnd.Next(0, t) // excluded motif // exclude from the collection let profileMotifs = stackV motifs.[0..except-1, *] motifs.[except+1.., *] let profile = createProfile profileMotifs // create profile from the rest let probs = kmerProbs kmers.[except] profile // build probability distribution let motif = array2D [|gibbsSample rnd probs |> toInts|] // sample to obtain a replacement let curMotifs = motifs.[except+1.., *] |> stackV motif |> stackV motifs.[0..except-1, *] // replace let curScore = score motifs // obtain the current score //recursive step, memorizing best results runSampler curMotifs (if curScore < bestScore then curMotifs else bestMotifs) (if curScore < bestScore then curScore else bestScore) (n - 1) runSampler bestMotifs bestMotifs bestScore iters let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()

## Notes

Here is the header for this script:

open System open System.Linq open System.IO #load "..\packages\MathNet.Numerics.FSharp.3.6.0\MathNet.Numerics.fsx" open MathNet.Numerics.Random let rndSeed = RandomSeed.Robust() let rnd = Random.mersenneTwisterSeed rndSeed

I am using MathNet.Numerics random generator because it happens to be thread safe (which the .NET one is not). Thread safety is crucial because:

Line 46 above:

let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()

Since all single runs of the algorithm are created equal, it is easy to parallelize and improve your running time up to as many times as you have processor cores. This comes almost for free from .NET if you care enough to add AsParallel().

I do freely mix LINQ and native F# functionality. Perhaps many a purist would scorn me, but I find it works great in practice. Sometimes generic .NET structures which LINQ likes so much (and which I am not using here, not explicitly anyway) are simply faster because they are mutable (another heresy in the functional world, I realize, however it is faster to add an element to a collection than to rebuild the entire collection in order to add one element. For long collections it adds up).

I don’t use the query{} syntax in F# 3.1. I do like it, it is just that I’m used to lambdas and chaining. A matter of taste, I suppose.

The code above only has one explicit “for” loop (Line 17 in Laplace Rule code), tail recursion rocks!

And the final observation: I have seen lots of folks note how F# lets you do things succinctly. I would like to add, that it also enables you to be “right” the first time much more often than many other languages. (Meaning, that the functions you write have a much better chance to have fewer bugs the very first time you launch them than if you were to write them in a non-functional language).