Walking the Euler Path: GPU for the Road

Continuation of the previous posts:

  1. Intro
  2. Visualization

GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “alea iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite Aela.CUDA library.

GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) =
    use dSorted = sortGpu dArr
    use dGrouped = worker.Malloc<int>(dSorted.Length)

    let lp = LaunchParam(divup dSorted.Length blockSize, blockSize)
    worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr

    compactGpuWithKernel <@createDistinctMap @> dGrouped
  1. We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
  2. We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
  3. Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000))
GpuDistinct.distinct sample

Here is the comparison:

Length: 2,097,152
CPU distinct: 00:00:00.0262776
GPU distinct: 00:00:02.9162098
Length: 26,214,400
CPU distinct: 00:00:00.5622276
GPU distinct: 00:00:03.2298218
Length: 262,144,000
CPU distinct: 00:00:03.8712437
GPU distinct: 00:00:05.7540822

Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000
CPU distinct: 00:00:04.2019013
GPU distinct: 00:00:06.7728424

The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

Computing Graph Properties

Motivation

Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true)

euler.Visualize(spanningTree=true, washNonSpanning=false)

spanningoverlay

Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing IEnumerable<DirectedGraph> to accomplish this, right? Wrong! What we want is the AsEnumerable property. Makes things clean and easy. It uses Seq.init method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices 
    (fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])

Now we can also do ourselves a favor and decorate our class with the StructuredFormatDisplay("{AsEnumerable}") to enable F# Interactive pretty printing of our graph:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : 

Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph =
  seq
    [("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]);
     ("3", [|"5"; "6"; "7"|]); ...]

We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray

to completely actualize the sequence and see the textual representation of the entire graph.

“Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given G=(V, E), G_r=(V_r, E_r):\ V_r=V,\ E_r=\{(v_i, v_j) | (v_j, v_i) \in V \}

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple (v_i, v_j), we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse =
    lazy (
        let allExistingRows = [0..rowIndex.Length - 1]

        let subSeq =
            if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse
                let start, end' =
                    let dStart, dEnd = getEdgesGpu rowIndex colIndex
                    sortStartEnd dStart dEnd

                Seq.zip end' start
            else
                asOrdinalsEnumerable ()
                |> Seq.map (fun (i, verts) -> verts |> Seq.map (fun v -> (v, i)))
                |> Seq.collect id


        let grSeq =
            subSeq
            |> Seq.groupBy fst
            |> Seq.map (fun (key, sq) -> key, sq |> Seq.map snd |> Seq.toArray)

        let allRows : seq<int * int []> =
            allExistingRows.Except (grSeq |> Seq.map fst) |> Seq.map (fun e -> e, [||])
            |> fun col -> col.Union grSeq
            |> Seq.sortBy fst


        let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length
        let revColIndex = allRows |> Seq.collect snd

        DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal)
        )

    member this.Reverse = reverse.Force()

On line 35, .Force() will only call the computation once and cache the result. Each subsequent call to .Force() will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” grSeq will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|];

let gr = StrGraph.FromStrings s
gr.Visualize()

gr.Reverse.Visualize()
rays rays_rev

Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to |V| = 10 \cdot 2^{10} \cdot 2^{10}.

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: start and end' that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.


    [<Kernel;ReflectedDefinition>]
    let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) =
        let idx = blockIdx.x * blockDim.x + threadIdx.x
        if idx < len - 1 then
            for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do
                start.[vertex] <- idx
                end'.[vertex] <- colIndex.[vertex]

Here is the test:

let mutable N = 10 * 1024 * 1024
let k = 5

sw.Restart()
let gr = StrGraph.GenerateEulerGraph(N, k)
sw.Stop()

printfn "Graph: %s vertices, %s edges generated in %A" 
    (String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed

sw.Restart()
let starts, ends = getEdges gr.RowIndex gr.ColIndex

sw.Stop()
printfn "GPU edges: %A" sw.Elapsed

sw.Restart()

gr.OrdinalEdges

sw.Stop()
printfn "CPU edges: %A" sw.Elapsed

And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697
GPU edges: 00:00:01.5234606
CPU edges: 00:00:16.5161326

Finally. I’m happy to take the win!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.