## Zooming Through Euler Path: Supercharging with GPU

So, continuing where we left off:

- Walking the Euler Path: Intro
- Visualizing Graphs
- Walking the Euler Path: GPU for the Road
- Walking the Euler Path: PIN Cracking and DNA Sequencing

### For the Win

And finally I ran the GPU-enabled algorithm for finding the Euler path.

let sw = Stopwatch() let N = 1024 * 1024 let k = 7 let avgedges k = [1..k] |> List.map float |> List.average let gr = StrGraph.GenerateEulerGraph(N * 10, k) printfn "Generated euler graph in %A, edges: %s" sw.Elapsed (String.Format("{0:N0}", gr.NumEdges)) let eulerCycle = findEulerTimed gr // GPU-based algorithm sw.Restart() let eulerVert = gr.FindEulerCycle() // Hierholzer algorithm sw.Stop() let cpu = float sw.ElapsedMilliseconds printfn "CPU: Euler cycle generated in %A" sw.Elapsed

And the results:

Generating euler graph: vertices = 10,485,760; avg out/vertex: 4 Generated euler graph in 00:00:19.7520656, edges: 41,944,529 Euler graph: vertices - 10,485,760.00, edges - 41,944,529.00 1. Predecessors computed in 00:00:03.2146705 2. Partitioned linear graph in 00:00:06.4475982 Partitions of LG: 6 3. Circuit graph generated in 00:00:31.4655218 4. Swips implemented in 00:00:00.2189634 GPU: Euler cycle generated in00:00:41.3474044CPU: Euler cycle generated in00:01:02.9022833

And I was like: WOW! Finally! Victory is mine! This is awesome! I’m awesome, etc. Victory dance, expensive cognac.

Then, after the euphoria subsided a little, I decided to make the mandatory chart:

Well, this was sobering!

While the CPU series line displays expected behavior, something is definitely not right with the GPU series: there is obviously some variable at work that I am not taking into account. So, from the beginning.

### The Algorithm

I owe the algorithm to this master thesis, which actually implements the algorithm proposed by B. Awerbuch, A. Israeli and Y. Shiloach, “Finding euler circuits in logarithmic parallel time,” in Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 249-257.

The algorithm as I see it may be split into 4 stages (even 3, but 4 is slightly more convenient implementation-wise). Let’s illustrate.

Start with an Euler graph like the one below. It has 15 vertices with an average of 3 edges/vertex in one direction (maxOutOrInEdges = k, we have 44 edges in this case):

let N = 15 let k = 5 let gr = StrGraph.GenerateEulerGraph(N, k) gr.Visualize(edges=true)

1. We walk it as we like, computing edge predecessors. For two edges is a predecessor of iff , i.e. One edge begins where its predecessor ends. In our representation it’s easy to construct the array of predecessors:

let predecessors (gr : DirectedGraph<'a>) = let rowIndex = arrayCopy gr.RowIndex let ends = gr.ColIndex let predecessors = Array.create gr.NumEdges -1 [|0..ends.Length - 1|] |> Array.iter (fun i -> predecessors.[rowIndex.[ends.[i]]] <- i rowIndex.[ends.[i]] <- rowIndex.[ends.[i]] + 1 ) predecessors

2. At this point, if we are lucky, we have the representation of an Euler cycle as edges of the graph. We just need to walk the array we have “backwards”, seeding the final list of edges with edge 0, constructing the list recursively like so:` predecessors.[List.head result] :: result`

. Alternatively, we may generate a graph out of the result and reverse it. (directions of the arrows need to be reversed since this is a predecessor graph. Euler cycles of the graph, where all directions are reversed are the same as those of the original one, reversed.)

In case we aren’t lucky, we consider our predecessor array to be a graph, where each edge of the original graph becomes a vertex and identify partitions of the graph:

This is the weak point of the algorithm. Partitioning a graph is, in general, a hard problem (NP-complete, to be precise), however, in this case, due to a very simple structure of the predecessor graph, the complexity is linear in the number of edges of the original graph: O(|E|).

let partitionLinear (end' : int [])= let allVertices = HashSet<int>(end') let colors = Array.create end'.Length -1 let mutable color = 0 while allVertices.Count > 0 do let mutable v = allVertices.First() while colors.[v] < 0 do allVertices.Remove v |> ignore colors.[v] <- color v <- end'.[v] color <- color + 1 colors, color

So, now the goal is to join all the circles above into one circle, this is done in the crucial step 3

3. We further collapse the graph based on partitioning. Now, each partition becomes a vertex of the new graph. Edges of this new “circuit graph” are vertices of the original graph, such that each edge represents a **vertex two partitions have in common**.

This is the only part of the algorithm where the GPU is used and is very effective. Incidentally, I took the code almost verbatim from the original thesis, however, the author for some reason preferred not to implement this step on the GPU.

The idea is simple: we loop over the original graph vertex-by-vertex and try to figure out whether edges entering this vertex belong to different partitions (have different colors in the terminology of the code above). Each vertex is processed in a CUDA kernel:

let gcGraph, links, validity = generateCircuitGraph gr.RowIndex partition maxPartition gcGraph.Visualize()

4. This graph is greatly over-determined: we don’t need ALL vertices that partitions have in common (represented by edges here). Also, it’s important to note that this graph is not directed: if partition 0 has a vertex in common with partition 1, then this is the same vertex partition 1 has in common with partition 0. In our implementation this un-directionality is reflected by over-directionality: every edge is repeated as All we actually need is a spanning tree of this graph:

gcGraph.Visualize(spanningTree=true)

Alright, this is much better – ignore directions. The output of step 3 gives us vertices of the original graph where our partitions intersect. We now need to swap edges of our original predecessor array around these vertices, so that each partition is not closed off on itself, but merges with its neighbor (it’s but a small correction to our original predecessor walk). We do this one-by-one, so partition 0 merges first with 1, then with 2. And 2 – with 3. And 1 with 4.

let fixedPredecessors = fixPredecessors gcGraph links edgePredecessors validity let finalGraph = StrGraph.FromVectorOfInts fixedPredecessors finalGraph.Reverse.Visualize()

And it’s a beautiful circle, we are done!

### Why not Break out That Cognac?

let N = 1024 * 1024 let i = 1 let gr = StrGraph.GenerateEulerGraphAlt(N * i, 3 * i * N) let eulerCycle = findEulerTimed gr

Euler graph: vertices - 1,048,575.00, edges - 3,145,727.00 1. Predecessors computed in 00:00:00.3479258 2. Partitioned linear graph in00:02:48.3658898Partitions of LG: 45514 # of partitions:45514 (CPU generation of CG)3. Circuit graph generated in 00:00:34.1632645 4. Swips implemented in 00:00:00.1707746 GPU: Euler cycle generated in 00:03:23.0505569

This is not very impressive. What’s happening? Unfortunately graph structure holds the key together with the `HashSet`

implementation.

The deeper the graph the better it will fare in the new algorithm. The bottleneck is the partitioning stage. Even though its complexity is theoretically O(|E|), I am using a `HashSet`

to restart partitioning when needed and that presents a problem, as accessing it is not always O(1)!

The methods for Euler graph generation are implemented as `GenerateEulerGraph`

and `GenerateEulerGraphAlt`

. The first one “pleases the code”, and generates graphs that are very deep even when the number of edges is large. Usually I get less than 10 partitions, which means that every time I generate predecessors, I’m pretty much guaranteed to be “almost there” as far as finding a cycle. The second method tends to generate very shallow graphs, as the example above shows: I got a fairly large number of partitions while the number of edges is only around 3 million. So while the rest of the algorithm performance is pretty descent, computing partitions just kills the whole thing.

Store the cognac for another time.

## Walking the Euler Path: GPU for the Road

Continuation of the previous posts:

#### GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “*alea* iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite *Aela*.CUDA library.

#### GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) = use dSorted = sortGpu dArr use dGrouped = worker.Malloc<int>(dSorted.Length) let lp = LaunchParam(divup dSorted.Length blockSize, blockSize) worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr compactGpuWithKernel <@createDistinctMap @> dGrouped

- We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
- We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
- Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000)) GpuDistinct.distinct sample

Here is the comparison:

Length: 2,097,152 CPU distinct: 00:00:00.0262776 GPU distinct: 00:00:02.9162098 Length: 26,214,400 CPU distinct: 00:00:00.5622276 GPU distinct: 00:00:03.2298218 Length: 262,144,000 CPU distinct: 00:00:03.8712437 GPU distinct: 00:00:05.7540822

Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000 CPU distinct: 00:00:04.2019013 GPU distinct: 00:00:06.7728424

The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

### Computing Graph Properties

#### Motivation

Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true) euler.Visualize(spanningTree=true, washNonSpanning=false)

Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

#### Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing `IEnumerable<DirectedGraph> `

to accomplish this, right? Wrong! What we want is the `AsEnumerable`

property. Makes things clean and easy. It uses `Seq.init`

method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices (fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])

Now we can also do ourselves a favor and decorate our class with the `StructuredFormatDisplay("{AsEnumerable}")`

to enable F# Interactive pretty printing of our graph:

[<StructuredFormatDisplay("{AsEnumerable}")>] type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal :

Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph = seq [("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]); ("3", [|"5"; "6"; "7"|]); ...]

We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray

to completely actualize the sequence and see the textual representation of the entire graph.

#### “Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given ,

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple , we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse = lazy ( let allExistingRows = [0..rowIndex.Length - 1] let subSeq = if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse let start, end' = let dStart, dEnd = getEdgesGpu rowIndex colIndex sortStartEnd dStart dEnd Seq.zip end' start else asOrdinalsEnumerable () |> Seq.map (fun (i, verts) -> verts |> Seq.map (fun v -> (v, i))) |> Seq.collect id let grSeq = subSeq |> Seq.groupBy fst |> Seq.map (fun (key, sq) -> key, sq |> Seq.map snd |> Seq.toArray) let allRows : seq<int * int []> = allExistingRows.Except (grSeq |> Seq.map fst) |> Seq.map (fun e -> e, [||]) |> fun col -> col.Union grSeq |> Seq.sortBy fst let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length let revColIndex = allRows |> Seq.collect snd DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal) ) member this.Reverse = reverse.Force()

On line 35, `.Force()`

will only call the computation once and cache the result. Each subsequent call to `.Force()`

will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” `grSeq`

will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|]; let gr = StrGraph.FromStrings s gr.Visualize() gr.Reverse.Visualize()

#### Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to .

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: `start`

and `end'`

that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.

[<Kernel;ReflectedDefinition>] let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) = let idx = blockIdx.x * blockDim.x + threadIdx.x if idx < len - 1 then for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do start.[vertex] <- idx end'.[vertex] <- colIndex.[vertex]

Here is the test:

let mutable N = 10 * 1024 * 1024 let k = 5 sw.Restart() let gr = StrGraph.GenerateEulerGraph(N, k) sw.Stop() printfn "Graph: %s vertices, %s edges generated in %A" (String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed sw.Restart() let starts, ends = getEdges gr.RowIndex gr.ColIndex sw.Stop() printfn "GPU edges: %A" sw.Elapsed sw.Restart() gr.OrdinalEdges sw.Stop() printfn "CPU edges: %A" sw.Elapsed

And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697 GPU edges: 00:00:01.5234606 CPU edges: 00:00:16.5161326

Finally. I’m happy to take the win!

## Walking the Euler Path: Intro

### Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

### Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.

### Representation

The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

#### Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

#### Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with `INumeric`

included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs /// <summary> /// Sparse matrix implementation with CSR and CSC storage /// </summary> [<StructuredFormatDisplay("{PrintMatrix}")>] type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) = .... static member CreateMatrix (row : 'a []) (isCSR : bool) = let ops = GlobalAssociations.GetNumericAssociation<'a>() let colIdx, vals = Array.zip [|0..row.Length - 1|] row |> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0) |> Array.unzip SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)

The idea is to use the `GlobalAssociations`

to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to `CreateMatrix`

and then rows can be appended to it by calling `AddValues row`

. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

#### Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix , then the answer is given by the equation: , here is the number of non-zero elements in the matrix.

For a graph the set V takes a place of rows, and E – that of columns. The above inequality becomes: , so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:

[<StructuredFormatDisplay("{AsEnumerable}")>] type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this = let rowIndex = rowIndex.ToArray() let colIndex = colIndex.ToArray() let nEdges = colIndex.Length let verticesNameToOrdinal = verticesNameToOrdinal let nVertices = verticesNameToOrdinal.Count // vertices connected to the ordinal vertex let getVertexConnections ordinal = let start = rowIndex.[ordinal] let end' = rowIndex.[ordinal + 1] - 1 colIndex.[start..end']

This is not very useful, however, since it assumes that we already have `rowIndex`

for the CSR type “R” and `colIndex`

for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file `Generators.fs`

.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.

In the next post I’ll talk about visualizing things, and vsiualization methods *really* have nothing to do with the graph itself. Nevertheless, it is natural to write:

myGraph.Visualize(euler=true)

instead of:

Visualize(myGraph, euler=true)

So we use type augmentations, for instance, going back to the generators:

//Generators.fs type Graphs.DirectedGraph<'a when 'a:comparison> with /// <summary> /// Create the graph from a file /// </summary> /// <param name="fileName"></param> static member FromFile (fileName : string) = if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file" let lines = File.ReadLines(fileName) DirectedGraph<string>.FromStrings(lines)

which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.

## Look-and-say: [Alea.]CUDA

Continuing the Advent of Code theme from the previous post. Figured since this year is going to be my year of CUDA, this would be a good opportunity to take it for a ride. A good April 1st post, but why wait?

So, how can we make this even faster than the already fast imperative solution?

### Naive Approach

The complete script for this is on GitHub.

The kernel computes one iteration of the look-and-say sequence. Every instance looks at a single group of repeating digits and outputs the result into an array. The new array is twice the size of the original one (representing a previous member of the sequence), every digit or the first digit of a group will produce two number (n, c) and every repeating digit will produce two “0”s on the output:

1 1 1 3 1 2 2 1 1 3 -> 3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3

[<Kernel; ReflectedDefinition>] let lookAndSayKernel (arr : deviceptr<int8>) (len : int) (out : deviceptr<int8>) = let ind = blockIdx.x * blockDim.x + threadIdx.x if ind < len then let c = arr.[ind] let idxOut = 2 * ind if ind = 0 || arr.[ind - 1] <> c then let mutable i = 1 while ind + i < len && c = arr.[ind + i] do i <- i + 1 out.[idxOut] <- int8 i out.[idxOut + 1] <- c else out.[idxOut] <- 0y out.[idxOut + 1] <- 0y

In the naive approach we bring the resulting sequence into the memory and repeat:

let lookAndSayGpu (s : string) n = let arr = s |> Seq.map (fun c -> (string>>int8) c) |> Seq.toArray let rec loop (arr : int8 []) n = if n = 0 then arr.Length else let blockSize = 512 let gridSize = divup arr.Length blockSize let lp = LaunchParam(gridSize, blockSize) use dArr = worker.Malloc(arr) use dOut = worker.Malloc(arr.Length * 2) worker.Launch <@lookAndSayKernel@> lp dArr.Ptr arr.Length dOut.Ptr let out = dOut.Gather() let arr = out |> Array.filter (fun b -> b > 0y) loop arr (n - 1) loop arr n

This is almost not worth checking against the high performing imperative algorithm of the last post, but why not?

Not unexpectedly we are still falling short of the imperative solution. Not a problem. There are a few very low hanging fruit here that are just waiting to be picked.

### A Better Solution

The complete script for this is on GitHub.

There are two problems with our solution so far:

- Two much memory being moved between the GPU and RAM
- A significant part of the algorithm is executed on the CPU with memory allocated and destroyed in the process (that would be the array compacting part)

The following algorithm addresses both problems:

GenerateSequences(str, n) {

//Allocate the maximum amount of memory for all data structures used on the GPU

len = str.Length

initializeGpuMemory(str, len)

for i = 1 to n {

lookAndSayKernel(len)

compactResultStream(2 * len)

len = getNewLength()

}

}

Of the 3 steps in this iterative loop, there is only one where in the implementation we move data from the GPU. It would be easy enough to not do it if necessary, though.

At the core of this algorithm there is a stream compaction routine that is executed hopefully entirely on the GPU. Compaction is performed by copying only the values needed to the addresses in the output stream. These addresses are determined by running an inclusive scan on the array of integers that contains “1” in a meaningful position, and “0” in all the rest. The details of how scan is used for stream compaction and how scans are implemented on the GPU, are discussed in GPU Gems 3 (free from NVIDIA). I’m using a device scan that comes with Alea.CUDA, which, hopefully, minimizes memory copies between main memory and GPU.

Now that the stream has been compacted, we get the length of the new sequence as the last entry in the address map, obtained during compaction (and explained in the linked chapter of GPU Gems 3).

[<Kernel; ReflectedDefinition>] let lookAndSayKernelSimple (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) (flags : deviceptr<int>)= let ind = blockIdx.x * blockDim.x + threadIdx.x if ind < len then let c = arr.[ind] let idxOut = 2 * ind let prevUnrepeated = if ind = 0 || arr.[ind - 1] <> c then 1 else 0 flags.[idxOut] <- prevUnrepeated flags.[idxOut + 1] <- prevUnrepeated if prevUnrepeated = 1 then let mutable i = 1 while ind + i < len && c = arr.[ind + i] do i <- i + 1 out.[idxOut] <- i out.[idxOut + 1] <- c else out.[idxOut] <- 0 out.[idxOut + 1] <- 0 [<Kernel; ReflectedDefinition>] let copyScanned (arr : deviceptr<int>) (out : deviceptr<int>) (len : int) (flags : deviceptr<int>) (addrmap : deviceptr<int>) = let ind = blockIdx.x * blockDim.x + threadIdx.x if ind < len && flags.[ind] > 0 then out.[addrmap.[ind] - 1] <- arr.[ind]

In the modified kernel, we also fill out the “flags” array which are scanned to obtain the address map for compacting the stream. The second kernel uses the result of the scan performed on these “flags” to produce the new compacted sequence. So for the above example:

original: 1 1 1 3 1 2 2 1 1 3

new: 3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3

flags: 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1

map: 1 2 2 2 2 2 3 4 5 6 7 8 8 8 9 10 10 10 11 12 (inclusive scan of the flags)

The last entry in the map is also the new length.

let lookAndSayGpuScan (s : string) n = let maxLen = 20 * 1024 * 1024 let arr = s |> Seq.map (fun c -> (string>>int) c) |> Seq.toArray use dOut = worker.Malloc(maxLen) use dFlags = worker.Malloc(maxLen) use dAddressMap = worker.Malloc(maxLen) use dArr = worker.Malloc(maxLen) dArr.Scatter(arr) use scanModule = new DeviceScanModule<int>(GPUModuleTarget.Worker(worker), <@ (+) @>) scanModule.Create(100) |> ignore let sw = Stopwatch() let rec loop n len = if n = 0 then len else let blockSize = 512 let gridSize = divup len blockSize let lp = LaunchParam(gridSize, blockSize) use scan = scanModule.Create(2 * len) worker.Launch <@lookAndSayKernelSimple@> lp dArr.Ptr len dOut.Ptr dFlags.Ptr scan.InclusiveScan(dFlags.Ptr, dAddressMap.Ptr, 2 * len) let gridSize = divup (2 * len) blockSize let lp = LaunchParam(gridSize, blockSize) worker.Launch <@copyScanned@> lp dOut.Ptr dArr.Ptr (2 * len) dFlags.Ptr dAddressMap.Ptr let len = dAddressMap.GatherScalar(2 * len - 1) loop (n - 1) len sw.Start() let res = loop n s.Length sw.Stop() res, sw.ElapsedMilliseconds

The complete code is above. Notice, that we are not cheating when we are measuring the speed of the internal loop and not including allocations. These can be factored out completely, so the only experiment that matters is what actually runs.

Now, we are finally fast! But it’s a mighty weird chart, probably because most of the time is spent in context switching and all other latency necessary to make this run…

So, takeaways: GPU programming makes us think differently, cool GPU algorithms exist, writing for the GPU is tough but rewarding, and we can really blow a small problem out of any proportion considered descent.

## Non-linear Thinking with CUDA.

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion in more than one sense of the word.

### The Problem

Given a set , output a set

In other words, pretend our array is *cyclical*, and we want all partial sums of the array elements, in order, starting from element i, ending with i + j. (We will be looping through the end, since the array is cyclical.)

In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide.

So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

57 | 71 | 113 | 113 | 128 |

57 + 71 | 71 + 113 | 113 + 113 | 113 + 128 | 128 + 57 |

57 + 71 + 113 | 71 + 113 + 113 | 113 + 113 + 128 | 113 + 128 + 57 | 128 + 57 + 71 |

### Code

On my GitHub.

### Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is :

// brute force O(n^3) solution let cyclospec (peptide : int seq) = let len = peptide.Count() let pepArr = peptide |> Seq.toArray let mutable parsum = 0 (seq { yield 0 yield pepArr |> Array.sum for i = 0 to len - 2 do yield! [0..len-1] |> Seq.map (fun j -> //F# 4.0 Feature! parsum <- 0 for ind = j to j + i do parsum <- parsum + pepArr.[if ind >= len then ind - len else ind] parsum ) }) |> Seq.sort |> Seq.toArray

Note the F# 4.0 feature: a variable mutated within a closure.

This will work, but slowly. To significantly speed it up we only need to notice that there is no need to recompute values that have been precomputed already over, and over, and over again. So, memoization yields an solution:

let cyclospecOpt (peptide : int seq) = let len = peptide.Count() let pepArr = peptide |> Seq.toArray let output = Array.zeroCreate ((len - 1) * len) for i = 0 to len - 2 do [0..len-1] |> Seq.iter (fun j -> let ind = i + j output.[i * len + j] <- if i = 0 then pepArr.[j] else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind] ) seq {yield 0; yield pepArr |> Array.sum; yield! output} |> Seq.sort |> Seq.toArray

Ok, so far so good.

### Multi-dimensional thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of , thus it’s a grid X of dimensions n x (n-1)( where is the sum, starting from i-th element of the input set, and ranging over j elements.

This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).

Since we now have ALEA.Cuda, we can do this without leaving F#, and, even better, we can script it:

[<Kernel; ReflectedDefinition>] let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) = let ind = blockIdx.x * blockDim.x + threadIdx.x let lenElem = blockIdx.y * blockDim.y + threadIdx.y if ind < len && lenElem < len - 1 then let mutable parsum = 0 for i = 0 to lenElem do let idx = ind + i parsum <- parsum + arr.[if idx >= len then idx - len else idx] out.[lenElem * len + ind] <- parsum

And we invoke it:

let cyclospecGpu (peptide : int []) = let blockSize = dim3(16, 16) let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y) let lp = LaunchParam(gridSize, blockSize) use dPeptide = worker.Malloc(peptide) use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1)) worker.Launch <@cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr let output = dOutput.Gather() seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

Each kernel thread is computing its own element in the grid, which is flattened into a solution. Time has indeed merged with data.

But what about our optimization? The easy way to implement it is to streamline invocations of a kernel, where each consecutive run will compute the new sum using the results of previous kernel invocations. There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

[[]

let cyclospecOptKernel (arr : deviceptr) (len : int) (out : deviceptr) (lenElem : int)=

let ind = blockIdx.x * blockDim.x + threadIdx.x

if ind < len then

let idx = ind + lenElem

out.[lenElem * len + ind] = len then idx – len else idx]

Invocation:

let cyclospecOptGpu (peptide : int []) = let blockSize = 256 let gridSize = divup peptide.Length blockSize let lp = LaunchParam(gridSize, blockSize) use dPeptide = worker.Malloc(peptide) use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1)) for i = 0 to peptide.Length - 2 do worker.Launch <@cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i let output = dOutput.Gather() seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

So, how did we do? (Here comes the mandatory chart):

And so:

- A little algorithm is a beautiful thing
- A little CUDA is an even more beautiful thing
- CUDA optimization performs no worse and at larger sizes of the input starts performing better, at some point massive amounts of computation slow us down even in parallel. See #1.
- And what is the GPU algorithm performance in O notation? It’s definitely non-linear, is it possible to estimate analytically?

Here is another chart that only compares optimized algorithms. It’s not a semi-log chart for easier visualization:

## Generating Permutations: Clojure or F#: Part 2

Marching on from the last post.

### Lazy Sequences

This is my favorite feature ever. If I want to generate just a few of 10! (nobody even knows how much that is) permutations, I could:

(take 10 (permute [1 2 3 4 5 6 7 8 9 10]))

provided, the function is defined (as described in the first post):

(defn permute [v] (when-let [[pos2 pos1] (findStartingPos v)] (let [nxt (sort-remainder (swapDigits v pos2 pos1) (inc pos1))] (cons nxt (lazy-seq (permute nxt))))))

Here I am not sure which language I like more. Clojure has easier syntax: everything fits nicely within the recursive function call. Returning nil terminates the loop, while in F# you need to know to return an option type where None terminates iteration. On the other hand, I like the fact that everything is neatly wrapped in the “unfold” function: seems more “natural” to me: fold/unfold – there is a certain symmetry here. Also, everything exists nicely in this LINQ-like world…

let permute (v : 'a array when 'a: comparison) = Seq.unfold (fun prev -> match findStartingPos prev with | None -> None | Some (cur, pos) -> Some(prev, sortRemainder (swapPositions prev cur pos) (pos + 1))) v

### Weak Typing

I really like Clojure weak typing. And I like the F# strong type system:

let sortRemainder (v : 'a array) pos = if v.Length - 1 = pos then v else [| yield! v.[0..pos - 1] yield! Array.sort v.[pos..v.Length - 1]; |]

F# type system requires that the first argument be qualified, but it is happy with this abbreviation, while the full qualification should be:

let sortRemainder (v : 'a array when 'a: comparison) pos =

Since we are sorting a subvector, the array has to be of a “comparable” type. Which is the condition of the applicability of the algorithm.

In Clojure it looks simpler, but it’s essentially the same:

(defn sort-remainder [v pos1] (if (= (dec (count v)) pos1) v (into (subvec v 0 pos1) (sort (subvec v pos1)))))

### Tail Recursion

One more cool feature of functional languages. I think it’s another tie once you use it, although the “loop” construct that demands it is very nice.

The following function returns a tuple (current, found) of two positions within the array: one of the element that is being “promoted” up (current), and the other – of the smaller element being pushed back. (So, current > found && v[current] < v[found]). Or nil/None if no such pair can be found. This is the key function of the algorithm:

(defn findStartingPos [v] (loop [cur (dec (count v)) acc [-1 -1]] (let [maxPos (second acc)] (if (or (< cur maxPos) (< cur 0)) (if (= maxPos -1) nil acc) (if-let [pos (findFirstLessThan v cur)] (recur (dec cur) (if (< maxPos pos) [cur pos] acc)) (recur (dec cur) acc))))))

And F#:

let findStartingPos v = let rec findStartingPosRec cur acc = let maxPos = snd acc if cur < 0 || cur < maxPos then if maxPos < 0 then None else Some acc else let pos = findFirstLessThan v cur match pos with | Some pos -> findStartingPosRec (cur - 1) (if maxPos < pos then (cur, pos) else acc) | None -> findStartingPosRec (cur - 1) acc findStartingPosRec (v.Length - 1) (-1, -1)

It’s nice that we have a “loop” keyword in Clojure to provide cleaner syntax and more discipline for defining tail-recursive functions, but I am not appalled with the way we do it in F# either.

(The above functions contain obvious optimizations: we stop scanning once we have a pair of “swappable” elements and we have moved beyond the “found” position. Also, we discard a valid pair if we already have a pair where “found” position is larger than the “found” position of the current iteration).

## Doing it in a Massively Parallel Way

Of course, I like everything parallel… So what about doing it on a GPU, using, say CUDA? It is definitely possible, although probably not very practical. Even if we only have an array of 10 distinct elements, the number of permutations is already ridiculously large (although who knows what we are going to be using them for)… In any event, this is solvable if we can get “random access” to permutations. Instead of unfolding them as a lazy sequence, generate them all at once in a massively parallel fashion.

This is possible because permutations are neatly tied to factoradic numbers, as this Wikipedia article explains. So, it is always possible to generate “permutation #10” to be guaranteed different from “permutation #5” for distinct, fully ordered sets. (Any sets where ordering relationship is not defined can still be easily permuted as long as its elements are stored in indexed data structures, such as arrays, by simply generating permutations of indices). Thus, taking CUDA “single data multiple threads” computation model it is easy to generate all (or however many) permutations in parallel. Naturally, if we are not just outputting the results but need to store them, the exponential nature of the problem memory growth, as well as the number of threads required, and the limited amount of GPU memory (a single computer RAM for that matter) will quickly become a problem. I guess the CUDA C++ version of this will have to wait until the next job interview…

## Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 2: Algorithms

In the previous post I spoke briefly about motivations for implementing self-organizing maps in F# using GPU with CUDA. I have finally been able to outperform a single threaded C++ implementation by a factor of about 1.5. This is quite modest, but on the other hand rather impressive since we started out by being 60 times slower. At this point I am bidding farewell to F# and switching to C++. It will be interesting to see how this works out, but here are my initial algorithms.

So, we are parallelizing the following:

member this.GetBMU (node : Node) = let min = ref Double.MaxValue let minI = ref -1 let minJ = ref -1 this.somMap |> Array2D.iteri (fun i j e -> let dist = getDistance e node this.Metric if dist < !min then min := dist; minI := i; minJ := j) !minI, !minJ

Here somMap is just a two-dimensional array of Node-s. And a Node is simply an array of float’s (double’s in C#), of the size equal to dimensionality of our space.

The code for getDistance is also simple:

let getDistanceEuclidian (x : Node) (y : Node) = Math.Sqrt([0..x.Dimension - 1] |> Seq.fold(fun sq i -> sq + (x.[i] - y.[i]) ** 2.) 0.) let getDistance (x : Node) (y : Node) metric = if x.Dimension <> y.Dimension then failwith "Dimensions must match" else match metric with | Euclidian -> getDistanceEuclidian x y | Taxicab -> getDistanceTaxicab x y

In my parallel version I am only using Euclidian distance for simplicity.

### Parallel Algorithms

As I have mentioned above, Alea.cuBase (my F#-to-CUDA framework) does not support multidimensional arrays (as well as shared memory or calling a kernel from a kernel), so this is the price I had to pay for sticking to my favorite language. Given these constraints, here is what I came up with:

#### Node-by-node

The very first idea (that proved also the very best) is quite intuitive. To find BMUs for the entire set of nodes, just take each individual node, find its BMU in parallel, repeat.

The map is first flattened into a single dimension, where if the cell coordinates were (i, j, k), they are mapped to (i * j * k + j * k + k). All distance sub-calculations can be performed in one shot, and then one thread finishes them up. By “sub-calculation” I mean computing (node(i) – map(j)) ** 2. These are then added up to the distance squared (I don’t calculate sqrt, since I don’t need it to find the minimum).

So, here is the implementation:

let pDistances = cuda { let! kernel = <@ fun nodeLen len (node : DevicePtr<float>) (map : DevicePtr<float>) (distances : DevicePtr<float>) (minDist : DevicePtr<float>) (minIndex : DevicePtr<int>) -> // index into the original map, assuming // a node is a single entity let mapI = blockIdx.x * blockDim.x // actual position of the node component in the map let i = mapI + threadIdx.x if i < len then // index into the node let j = threadIdx.x % nodeLen distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j]) if threadIdx.x = 0 then __syncthreads() let mutable thread = 0 minIndex.[blockIdx.x] <- -1 // find the minimum among threads while mapI + thread < len && thread < blockDim.x do let k = mapI + thread let mutable sum = 0. for j = 0 to nodeLen - 1 do sum <- sum + distances.[k + j] if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then minDist.[blockIdx.x] <- sum minIndex.[blockIdx.x] <- k / nodeLen thread <- thread + nodeLen @> |> defineKernelFunc let diagnose (stats:KernelExecutionStats) = printfn "gpu timing: %10.3f ms %6.2f%% threads(%d) reg(%d) smem(%d)" stats.TimeSpan (stats.Occupancy * 100.0) stats.LaunchParam.BlockDim.Size stats.Kernel.NumRegs stats.Kernel.StaticSharedMemBytes return PFunc(fun (m:Module) (nodes : float [] list) (map : float []) -> let kernel = kernel.Apply m let nodeLen = nodes.[0].Length let chunk = map.Length let nt = (256 / nodeLen) * nodeLen // number of threads divisible by nodeLen let nBlocks = (chunk + nt - 1)/ nt //map.Length is a multiple of nodeLen by construction use dMap = m.Worker.Malloc(map) use dDist = m.Worker.Malloc<float>(map.Length) use dMinIndices = m.Worker.Malloc<int>(nBlocks * nodes.Length) use dMinDists = m.Worker.Malloc<float>(nBlocks * nodes.Length) use dNodes = m.Worker.Malloc(nodes.SelectMany(fun n -> n :> float seq).ToArray()) let lp = LaunchParam(nBlocks, nt) //|> Engine.setDiagnoser diagnose nodes |> List.iteri (fun i node -> kernel.Launch lp nodeLen chunk (dNodes.Ptr + i * nodeLen) dMap.Ptr dDist.Ptr (dMinDists.Ptr + i * nBlocks) (dMinIndices.Ptr + i * nBlocks)) let minDists = dMinDists.ToHost() let indices = dMinIndices.ToHost() let mins = (Array.zeroCreate nodes.Length) for i = 0 to nodes.Length - 1 do let baseI = i * nBlocks let mutable min = minDists.[baseI] mins.[i] <- indices.[baseI] for j = 1 to nBlocks - 1 do if minDists.[baseI + j] < min then min <-minDists.[baseI + j] mins.[i] <- indices.[baseI + j] mins ) }

To get as much parallelism as possible, I start with 256 threads per block (maximum on Kepler is 1024, but 256 gives the best performance). Since each block of threads is going to compute as many distances as possible, I try to allocate the maximum number of threads divisible by node dimensionality. In my case: 256 / 12 * 256 = 252. Pretty good, we get almost an optimal number of threads per block.

The number of blocks are computed from the length of the map. I want to split this in an optimal way since each block is scheduled on one SP, and I have 192 of those I don’t want them to idle. The algorithm leans towards parallelizing calculations relative to the map (see picture above, I want all those “arrows” to be executed in parallel), so the number of blocks will be (somArray.length + nt – 1) / nt (nt is the number of threads – 252, somArray.length is 200 * 200 * 12 = 480000, the formula above takes into account the fact that this number may not be a multiple of 252, in which case we will need one more incomplete block). My block size is 1905 – pretty good, CUDA devs recommend that to be at least twice the number of multiprocessors. It is necessary to hide latency, which is killer in this development paradigm (you need to rely on your PCI slot to transfer data).

One weird thing here is that I have to allocate a relatively large dDist arrray. This is where temporary values of (map(i) – node(j))**2 go. In reality (if I were writing in C++) I would not do that. I would just use the super-fast shared memory for this throw-away array. I could not get that to work with Alea, although the documentation says it is supported.

Another thing: the call to __syncthreads() is issued under a conditional statement. This, in general, is a horrible idea, because in the SIMT (single instruction multiple threads), threads march “in sync” so to speak, instruction by instruction. Thus, doing what I have done may lead to things hanging as some threads may take different branches and other threads will wait forever. Here, however, we are good, because the only way to go if the condition evaluates to false is to simply quit.

The kernel is called in a loop. One call for each node: 10000 passes (vs 10000 * 40000) in the sequential case. I also make sure that all of my memory is allocated once and everything I need is copied there. Not doing that leads to disastrous consequences, since all you would have in that case is pure latency.

The code is self-explanatory. Once everyone has computed their part of the distance, these parts are assembled by the 0-th thread of the block. That thread is responsible for calculating the final square of the distance, comparing the result of what we already have in the minDist array for this map node and storing that result.

#### Node-by-node optimized

And here lies the mistake that makes this algorithm lose out on performance: there is no need to delegate this work to the 0-th thread (looping over all 252 threads of the block). It is enough to delegate that to each “threadIdx.x % nodeLen = 0″‘s thread of the block, so now 21 threads do this work in parallel, each for only 12 threads. Algorithm re-written this way outperforms everything else I could come up with.

Here is the re-write of the kernel function:

let! kernel = <@ fun nodeLen len (node : DevicePtr<float>) (map : DevicePtr<float>) (distances : DevicePtr<float>) (minDist : DevicePtr<float>) (minIndex : DevicePtr<int>) -> // index into the original map, assuming // a node is a single entity let mapI = blockIdx.x * blockDim.x // actual position of the node component in the map let i = mapI + threadIdx.x if i < len then // index into the node let j = threadIdx.x % nodeLen distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j]) __syncthreads() if threadIdx.x % nodeLen = 0 then minIndex.[blockIdx.x] <- -1 // find the minimum among threads let k = mapI + threadIdx.x let mutable sum = 0. for j = k to k + nodeLen - 1 do sum <- sum + distances.[j] if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then minDist.[blockIdx.x] <- sum minIndex.[blockIdx.x] <- k / nodeLen @> |> defineKernelFunc

This kernel function only stores “winning” (minimal) distances within each section of the map. Now they need to be reduced to one value per node. There are 1905 blocks and 10000 nodes. 10000 minimum values are computed in one pass over the 1905 * 10000-length array of accumulated minimal distances:

let minDists = dMinDists.ToHost() let indices = dMinIndices.ToHost() let mins = (Array.zeroCreate nodes.Length) for i = 0 to nodes.Length - 1 do let baseI = i * nBlocks let mutable min = minDists.[baseI] mins.[i] <- indices.[baseI] for j = 1 to nBlocks - 1 do if minDists.[baseI + j] < min then min <-minDists.[baseI + j] mins.[i] <- indices.[baseI + j] mins

And we are done. The improved version beats all the rest of my algorithms. Since all of these performance improvements looked so “juicy”, I decided it was high time to abandon managed code and go back to the C++ world. Especially since writing C++ code is not going to be so different: no annoying loops to write, just express it linearly and reap the massively parallel goodness.