## Zooming Through Euler Path: Supercharging with GPU

So, continuing where we left off:

- Walking the Euler Path: Intro
- Visualizing Graphs
- Walking the Euler Path: GPU for the Road
- Walking the Euler Path: PIN Cracking and DNA Sequencing

### For the Win

And finally I ran the GPU-enabled algorithm for finding the Euler path.

let sw = Stopwatch() let N = 1024 * 1024 let k = 7 let avgedges k = [1..k] |> List.map float |> List.average let gr = StrGraph.GenerateEulerGraph(N * 10, k) printfn "Generated euler graph in %A, edges: %s" sw.Elapsed (String.Format("{0:N0}", gr.NumEdges)) let eulerCycle = findEulerTimed gr // GPU-based algorithm sw.Restart() let eulerVert = gr.FindEulerCycle() // Hierholzer algorithm sw.Stop() let cpu = float sw.ElapsedMilliseconds printfn "CPU: Euler cycle generated in %A" sw.Elapsed

And the results:

Generating euler graph: vertices = 10,485,760; avg out/vertex: 4 Generated euler graph in 00:00:19.7520656, edges: 41,944,529 Euler graph: vertices - 10,485,760.00, edges - 41,944,529.00 1. Predecessors computed in 00:00:03.2146705 2. Partitioned linear graph in 00:00:06.4475982 Partitions of LG: 6 3. Circuit graph generated in 00:00:31.4655218 4. Swips implemented in 00:00:00.2189634 GPU: Euler cycle generated in00:00:41.3474044CPU: Euler cycle generated in00:01:02.9022833

And I was like: WOW! Finally! Victory is mine! This is awesome! I’m awesome, etc. Victory dance, expensive cognac.

Then, after the euphoria subsided a little, I decided to make the mandatory chart:

Well, this was sobering!

While the CPU series line displays expected behavior, something is definitely not right with the GPU series: there is obviously some variable at work that I am not taking into account. So, from the beginning.

### The Algorithm

I owe the algorithm to this master thesis, which actually implements the algorithm proposed by B. Awerbuch, A. Israeli and Y. Shiloach, “Finding euler circuits in logarithmic parallel time,” in Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 249-257.

The algorithm as I see it may be split into 4 stages (even 3, but 4 is slightly more convenient implementation-wise). Let’s illustrate.

Start with an Euler graph like the one below. It has 15 vertices with an average of 3 edges/vertex in one direction (maxOutOrInEdges = k, we have 44 edges in this case):

let N = 15 let k = 5 let gr = StrGraph.GenerateEulerGraph(N, k) gr.Visualize(edges=true)

1. We walk it as we like, computing edge predecessors. For two edges is a predecessor of iff , i.e. One edge begins where its predecessor ends. In our representation it’s easy to construct the array of predecessors:

let predecessors (gr : DirectedGraph<'a>) = let rowIndex = arrayCopy gr.RowIndex let ends = gr.ColIndex let predecessors = Array.create gr.NumEdges -1 [|0..ends.Length - 1|] |> Array.iter (fun i -> predecessors.[rowIndex.[ends.[i]]] <- i rowIndex.[ends.[i]] <- rowIndex.[ends.[i]] + 1 ) predecessors

2. At this point, if we are lucky, we have the representation of an Euler cycle as edges of the graph. We just need to walk the array we have “backwards”, seeding the final list of edges with edge 0, constructing the list recursively like so:` predecessors.[List.head result] :: result`

. Alternatively, we may generate a graph out of the result and reverse it. (directions of the arrows need to be reversed since this is a predecessor graph. Euler cycles of the graph, where all directions are reversed are the same as those of the original one, reversed.)

In case we aren’t lucky, we consider our predecessor array to be a graph, where each edge of the original graph becomes a vertex and identify partitions of the graph:

This is the weak point of the algorithm. Partitioning a graph is, in general, a hard problem (NP-complete, to be precise), however, in this case, due to a very simple structure of the predecessor graph, the complexity is linear in the number of edges of the original graph: O(|E|).

let partitionLinear (end' : int [])= let allVertices = HashSet<int>(end') let colors = Array.create end'.Length -1 let mutable color = 0 while allVertices.Count > 0 do let mutable v = allVertices.First() while colors.[v] < 0 do allVertices.Remove v |> ignore colors.[v] <- color v <- end'.[v] color <- color + 1 colors, color

So, now the goal is to join all the circles above into one circle, this is done in the crucial step 3

3. We further collapse the graph based on partitioning. Now, each partition becomes a vertex of the new graph. Edges of this new “circuit graph” are vertices of the original graph, such that each edge represents a **vertex two partitions have in common**.

This is the only part of the algorithm where the GPU is used and is very effective. Incidentally, I took the code almost verbatim from the original thesis, however, the author for some reason preferred not to implement this step on the GPU.

The idea is simple: we loop over the original graph vertex-by-vertex and try to figure out whether edges entering this vertex belong to different partitions (have different colors in the terminology of the code above). Each vertex is processed in a CUDA kernel:

let gcGraph, links, validity = generateCircuitGraph gr.RowIndex partition maxPartition gcGraph.Visualize()

4. This graph is greatly over-determined: we don’t need ALL vertices that partitions have in common (represented by edges here). Also, it’s important to note that this graph is not directed: if partition 0 has a vertex in common with partition 1, then this is the same vertex partition 1 has in common with partition 0. In our implementation this un-directionality is reflected by over-directionality: every edge is repeated as All we actually need is a spanning tree of this graph:

gcGraph.Visualize(spanningTree=true)

Alright, this is much better – ignore directions. The output of step 3 gives us vertices of the original graph where our partitions intersect. We now need to swap edges of our original predecessor array around these vertices, so that each partition is not closed off on itself, but merges with its neighbor (it’s but a small correction to our original predecessor walk). We do this one-by-one, so partition 0 merges first with 1, then with 2. And 2 – with 3. And 1 with 4.

let fixedPredecessors = fixPredecessors gcGraph links edgePredecessors validity let finalGraph = StrGraph.FromVectorOfInts fixedPredecessors finalGraph.Reverse.Visualize()

And it’s a beautiful circle, we are done!

### Why not Break out That Cognac?

let N = 1024 * 1024 let i = 1 let gr = StrGraph.GenerateEulerGraphAlt(N * i, 3 * i * N) let eulerCycle = findEulerTimed gr

Euler graph: vertices - 1,048,575.00, edges - 3,145,727.00 1. Predecessors computed in 00:00:00.3479258 2. Partitioned linear graph in00:02:48.3658898Partitions of LG: 45514 # of partitions:45514 (CPU generation of CG)3. Circuit graph generated in 00:00:34.1632645 4. Swips implemented in 00:00:00.1707746 GPU: Euler cycle generated in 00:03:23.0505569

This is not very impressive. What’s happening? Unfortunately graph structure holds the key together with the `HashSet`

implementation.

The deeper the graph the better it will fare in the new algorithm. The bottleneck is the partitioning stage. Even though its complexity is theoretically O(|E|), I am using a `HashSet`

to restart partitioning when needed and that presents a problem, as accessing it is not always O(1)!

The methods for Euler graph generation are implemented as `GenerateEulerGraph`

and `GenerateEulerGraphAlt`

. The first one “pleases the code”, and generates graphs that are very deep even when the number of edges is large. Usually I get less than 10 partitions, which means that every time I generate predecessors, I’m pretty much guaranteed to be “almost there” as far as finding a cycle. The second method tends to generate very shallow graphs, as the example above shows: I got a fairly large number of partitions while the number of edges is only around 3 million. So while the rest of the algorithm performance is pretty descent, computing partitions just kills the whole thing.

Store the cognac for another time.

## Walking the Euler Path: GPU for the Road

Continuation of the previous posts:

#### GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “*alea* iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite *Aela*.CUDA library.

#### GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) = use dSorted = sortGpu dArr use dGrouped = worker.Malloc<int>(dSorted.Length) let lp = LaunchParam(divup dSorted.Length blockSize, blockSize) worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr compactGpuWithKernel <@createDistinctMap @> dGrouped

- We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
- We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
- Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000)) GpuDistinct.distinct sample

Here is the comparison:

Length: 2,097,152 CPU distinct: 00:00:00.0262776 GPU distinct: 00:00:02.9162098 Length: 26,214,400 CPU distinct: 00:00:00.5622276 GPU distinct: 00:00:03.2298218 Length: 262,144,000 CPU distinct: 00:00:03.8712437 GPU distinct: 00:00:05.7540822

Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000 CPU distinct: 00:00:04.2019013 GPU distinct: 00:00:06.7728424

The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

### Computing Graph Properties

#### Motivation

Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true) euler.Visualize(spanningTree=true, washNonSpanning=false)

Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

#### Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing `IEnumerable<DirectedGraph> `

to accomplish this, right? Wrong! What we want is the `AsEnumerable`

property. Makes things clean and easy. It uses `Seq.init`

method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices (fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])

Now we can also do ourselves a favor and decorate our class with the `StructuredFormatDisplay("{AsEnumerable}")`

to enable F# Interactive pretty printing of our graph:

[<StructuredFormatDisplay("{AsEnumerable}")>] type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal :

Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph = seq [("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]); ("3", [|"5"; "6"; "7"|]); ...]

We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray

to completely actualize the sequence and see the textual representation of the entire graph.

#### “Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given ,

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple , we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse = lazy ( let allExistingRows = [0..rowIndex.Length - 1] let subSeq = if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse let start, end' = let dStart, dEnd = getEdgesGpu rowIndex colIndex sortStartEnd dStart dEnd Seq.zip end' start else asOrdinalsEnumerable () |> Seq.map (fun (i, verts) -> verts |> Seq.map (fun v -> (v, i))) |> Seq.collect id let grSeq = subSeq |> Seq.groupBy fst |> Seq.map (fun (key, sq) -> key, sq |> Seq.map snd |> Seq.toArray) let allRows : seq<int * int []> = allExistingRows.Except (grSeq |> Seq.map fst) |> Seq.map (fun e -> e, [||]) |> fun col -> col.Union grSeq |> Seq.sortBy fst let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length let revColIndex = allRows |> Seq.collect snd DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal) ) member this.Reverse = reverse.Force()

On line 35, `.Force()`

will only call the computation once and cache the result. Each subsequent call to `.Force()`

will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” `grSeq`

will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|]; let gr = StrGraph.FromStrings s gr.Visualize() gr.Reverse.Visualize()

#### Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to .

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: `start`

and `end'`

that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.

[<Kernel;ReflectedDefinition>] let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) = let idx = blockIdx.x * blockDim.x + threadIdx.x if idx < len - 1 then for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do start.[vertex] <- idx end'.[vertex] <- colIndex.[vertex]

Here is the test:

let mutable N = 10 * 1024 * 1024 let k = 5 sw.Restart() let gr = StrGraph.GenerateEulerGraph(N, k) sw.Stop() printfn "Graph: %s vertices, %s edges generated in %A" (String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed sw.Restart() let starts, ends = getEdges gr.RowIndex gr.ColIndex sw.Stop() printfn "GPU edges: %A" sw.Elapsed sw.Restart() gr.OrdinalEdges sw.Stop() printfn "CPU edges: %A" sw.Elapsed

And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697 GPU edges: 00:00:01.5234606 CPU edges: 00:00:16.5161326

Finally. I’m happy to take the win!

## Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 1: F#

`By 2017, it is expected that GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture. Such a system eliminates some of accelerator architecturesâ€™ historical challenges, including requiring the programmer to manage multiple memory spaces, suffering from bandwidth limitations from an interface such as PCI Express for transfers between CPUs and GPUs, and the system-level energy overheads for both chip crossings and replicated chip infrastructure.`

```
```This is going to be all about parallelism, CUDA, performance, and new direction in software development. For me personally anyway. All the code mentioned below is here.

So, here we've got it. The need for SOMs. SOMs are wonderful for clustering and visualizing high dimensional data. Naturally, using the sacred principle of software development (best software engineers steal) I looked around for some existing code. Found this and this very fast. The first of these shows you very quickly what SOMs are and how to do build them step-by-step, while the second already has the C++ code that can just be taken as is and used for computing SOMs. Which was what I did.

In my experiments, I used around 12,000 12-dimensional nodes with a map of 200x200. On my puny i5 650 (3.2 Ghz), 8 Gb RAM, generating a SOM with these parameters takes around 4 hrs, maybe less. One "epoch" takes around 40 sec and I run 500 epochs, however, since the neighborhood of code vectors that gets trained in one epoch gradually diminishes, it is not a straight multiplication of 500 * 40.

These experiments have actually not yielded the results I was hoping for, perhaps because the training set is not large enough for the data I am trying to cluster. Be it as it may, more experiments are needed with a larger dataset, and I am already at the brink of feasibility as far as performance. It does increase linearly with the number of nodes. The C++ code that I have (stolen) is actually pretty good, but it is single-threaded, so doing things in parallel seems to be a logical next step.

### Step 0. CPU F#

At this point, I was actually interested more in performance improvements than in re-implementing SOM in F#. I did re-implement it in F#, but for my performance bench-marking I did not use the whole algorithm, just the first part where BMUs are calculated. Since BMU is a map vector that is closest to the given node, in order to compute one BMU it is necessary to iterate over the entire map. So computing BMUs for the entire set of nodes gets us the cost of O(n * d * m1 * m2) (m1, m2 are map dimensions, n is the length of the nodes array, d is dimensionality of each node vector). That's for one epoch only. And there are 500 of those. It adds up.

My F# implementation computed the BMU for one 12 dimensional node on a 200x200 SOM in a whopping 140 ms. Vs just 4ms for C++. I did expect a perfromance drop from C++, I just did not expect it to be that drastic.

member this.GetBMU (node : Node) =
let min = ref Double.MaxValue
let minI = ref -1
let minJ = ref -1
this.somMap |> Array2D.iteri (fun i j e ->
let dist = getDistance e node this.Metric
if dist < !min then min := dist; minI := i; minJ := j)
!minI, !minJ

Then I added parallelism. And Parallel.ForEach worked slightly better than Parallel.For.

member this.GetBMUParallel (node : Node) =
let monitor = new obj()
let minList = ref []
Parallel.ForEach(
Partitioner.Create(0, fst dims),
(fun () -> ref (Double.MaxValue, -1, -1)),
(fun range state local ->
let mutable(min, minI, minJ) =
match !local with
| m, i, j -> m, i, j
for i = fst range to snd range - 1 do
for j = 0 to snd this.Dimensions - 1 do
let dist = getDistance this.somMap.[i, j] node this.Metric
if dist < min then
min <- dist; minI <- i; minJ <- j
local := (min, minI, minJ)
local),
(fun local -> lock monitor (fun () ->
match !local with
| m, i, j when i > 0 ->
minList := (m, i, j) :: !minList
|_ -> ()
))) |> ignore
let minTuple = !minList |> List.minBy (fun (x, i, j) -> x)
match minTuple with
| x, i, j -> i, j

Nothing fancy here. Split the first dimension of the map into chunks and try processing them as much as possible in parallel, by utilizing all the 4 logical cores. The inner loop could also be re-written the same way (to use Parallel.For or Parallel.ForEach), but it would probably not do much good since we are already as parallel as we can be. (And in reality it did not. Do any good, that is). While I expected an at least 4-fold performance increase, I did not get. I did get a 2 times increase. Now it only took 70 ms for one node.

### Going massively parallel

At this point, things are really intuitive. If it were up to me, I'd do every calculation there is in parallel and then reduce them once they are done. If it takes, I dunno, say 0.001 mks to multiply 2 numbers in one processor thread, how long does it take to multiply 12000 * 12 * 2 numbers on 144000 processors? Obviously the same 0.001 ms. So the problem becomes almost constant in the number of nodes if we only could always have as many processors as the number of nodes * their dimensions.

Reality is of course vastly different but it does not have to be measured by the number 4 (of CPU cores). Thus, CUDA or OpenCL. I invested $166 in a Quadro K600 Graphics card, which has 192 cores and 1 Gb of on-board RAM. I still wanted to remain faithful to F#, so I looked for a .NET/F# CUDA framework. After briefly evaluating several such frameworks (and they are all in different stages of nascent at this point), I picked Alea.cuBase from QuantAlea.

#### The Framework

Alea.cuBase is pretty neat. I like the paradigm - using quotations to write CUDA code. The documentation is pretty good, gets you up and running very quickly, once the basic concepts of CUDA have been grasped. There are problems, though.

- I could not get shared memory to work despite claims that it is supported. Things just crashed.
- No support yet for multi-dimensional arrays. This was kind of a bummer, because it meant some preprocessing on my part to get things going on the GPU. Oh well. Whatever doesn't kill you...

So how did I do having re-written things to run massively in parallel (with one hand cuffed to the radiator, since I could not use multidimensional arrays)? Well, I did several implementations, I will describe them next time, but here are the charts.

#### Results

The results are averaged over 3 repetitions. Y-axis values are in ms, both axes are logarithmic. Experiments, using CPU are in dashed lines, they start fading at the point where I simply estimated the run-time based on previous performance (did not want to wait 168 sec). On the x-axis are the number of nodes (200x200 SOM, 12 dimensions). I finally did outperform a single-threaded C++ implementation with the "GPU-iterations" algorithm. (Why is its performance so staggeringly awful on small sizes of the dataset? I will talk about it in my next post). Although the gains are not that impressive. At the end I was able to shave about 13-16 seconds off of the "real" 12 000 strong dataset. Which, I guess, is not bad, although not quite there yet... Why? As it turns out, the parallel part of all of this has never exceeded 50% of the total run-time. Which means, the algorithms work in the "hurry-up-and-wait" mode. While parallel calculations do their part pretty fast, negotiating things back and forth with .NET kills the performance.

Still, I made some optimizations and here is what I got:

Notice, how the "GPU-node-by-node" algorithm from being a loser actually advanced to the first place. This was due to a very small change.

At the end, I absolutely fell in love with GPUs and CUDA. It really demystifies parallel programming. All of the behaviors I encountered while experimenting with different implementations were immediately obvious and predictable. I also changed quite a few of my previous convictions about software development. I will talk about it next time (with code in hand).

```
```

```
```

```
```