Archive

Posts Tagged ‘Alea.CUDA’

Walking the Euler Path: GPU for the Road

September 25, 2016 Leave a comment

Continuation of the previous posts:

  1. Intro
  2. Visualization

GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “alea iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite Aela.CUDA library.

GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) =
    use dSorted = sortGpu dArr
    use dGrouped = worker.Malloc<int>(dSorted.Length)

    let lp = LaunchParam(divup dSorted.Length blockSize, blockSize)
    worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr

    compactGpuWithKernel <@createDistinctMap @> dGrouped
  1. We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
  2. We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
  3. Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000))
GpuDistinct.distinct sample

Here is the comparison:

Length: 2,097,152
CPU distinct: 00:00:00.0262776
GPU distinct: 00:00:02.9162098
Length: 26,214,400
CPU distinct: 00:00:00.5622276
GPU distinct: 00:00:03.2298218
Length: 262,144,000
CPU distinct: 00:00:03.8712437
GPU distinct: 00:00:05.7540822

Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000
CPU distinct: 00:00:04.2019013
GPU distinct: 00:00:06.7728424

The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

Computing Graph Properties

Motivation

Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true)

euler.Visualize(spanningTree=true, washNonSpanning=false)

spanningoverlay

Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing IEnumerable<DirectedGraph> to accomplish this, right? Wrong! What we want is the AsEnumerable property. Makes things clean and easy. It uses Seq.init method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices 
    (fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])

Now we can also do ourselves a favor and decorate our class with the StructuredFormatDisplay("{AsEnumerable}") to enable F# Interactive pretty printing of our graph:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : 

Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph =
  seq
    [("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]);
     ("3", [|"5"; "6"; "7"|]); ...]

We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray

to completely actualize the sequence and see the textual representation of the entire graph.

“Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given G=(V, E), G_r=(V_r, E_r):\ V_r=V,\ E_r=\{(v_i, v_j) | (v_j, v_i) \in V \}

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple (v_i, v_j), we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse =
    lazy (
        let allExistingRows = [0..rowIndex.Length - 1]

        let subSeq =
            if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse
                let start, end' =
                    let dStart, dEnd = getEdgesGpu rowIndex colIndex
                    sortStartEnd dStart dEnd

                Seq.zip end' start
            else
                asOrdinalsEnumerable ()
                |> Seq.map (fun (i, verts) -> verts |> Seq.map (fun v -> (v, i)))
                |> Seq.collect id


        let grSeq =
            subSeq
            |> Seq.groupBy fst
            |> Seq.map (fun (key, sq) -> key, sq |> Seq.map snd |> Seq.toArray)

        let allRows : seq<int * int []> =
            allExistingRows.Except (grSeq |> Seq.map fst) |> Seq.map (fun e -> e, [||])
            |> fun col -> col.Union grSeq
            |> Seq.sortBy fst


        let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length
        let revColIndex = allRows |> Seq.collect snd

        DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal)
        )

    member this.Reverse = reverse.Force()

On line 35, .Force() will only call the computation once and cache the result. Each subsequent call to .Force() will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” grSeq will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|];

let gr = StrGraph.FromStrings s
gr.Visualize()

gr.Reverse.Visualize()
rays rays_rev

Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to |V| = 10 \cdot 2^{10} \cdot 2^{10}.

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: start and end' that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.


    [<Kernel;ReflectedDefinition>]
    let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) =
        let idx = blockIdx.x * blockDim.x + threadIdx.x
        if idx < len - 1 then
            for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do
                start.[vertex] <- idx
                end'.[vertex] <- colIndex.[vertex]

Here is the test:

let mutable N = 10 * 1024 * 1024
let k = 5

sw.Restart()
let gr = StrGraph.GenerateEulerGraph(N, k)
sw.Stop()

printfn "Graph: %s vertices, %s edges generated in %A" 
    (String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed

sw.Restart()
let starts, ends = getEdges gr.RowIndex gr.ColIndex

sw.Stop()
printfn "GPU edges: %A" sw.Elapsed

sw.Restart()

gr.OrdinalEdges

sw.Stop()
printfn "CPU edges: %A" sw.Elapsed

And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697
GPU edges: 00:00:01.5234606
CPU edges: 00:00:16.5161326

Finally. I’m happy to take the win!

Categories: F#, CUDA, bioinformatics Tags: , , , ,

Visualizing Graphs

September 18, 2016 1 comment

Previously

Walking the Eule Path: Intro

Generating and Visualizing Graphs

I can hardly overemphasize the importance of visusalizations. Many a bug had been immediately spotted just by looking at a visual of a complex data structure. I therefore decided to add visuals to the project as soon as the DirectedGraph class was born.

Code & Prerequisits

Code is on GitHub.

  1. GraphViz: install and add the bin directory to the PATH
  2. EmguCV v3.1: install and add the bin directory to the PATH

DrawGraph

This is a small auxiliary component I wrote to make all future visualizations possible. And here is a sidebar. I didn’t want to write this component. I am not a fan of re-writing something that was written a hundred times before me, so the first thing I did was look for something similar I could use. Sure enough, I found a few things. How can I put it? Software engineering is great, but boy, do we tend to overengineer things! I know, I’m guilty of the same thing myself. All I wanted from the library was an ability to receive a text file written in GraphViz DSL, and get on the output a .png containing the picture of the graph. Just a very simple GraphViz driver, nothing more.

One library had me instantiate 3 (three!) classes, another developed a whole API of its own to build the GraphViz file… I ended up writing my own component, it has precisely 47 lines of code. the last 4 lines are aliasing a single function that does exactly what I wanted. It creates the png file and then immediately invokes the EmguCV image viewer to show it. After we’re done, it cleans up after itself, deleting the temporary png file. Here it is.

Taking it for a Ride

Just to see this work…
Another digression. Love the new feature that generates all the “#r” instructions for F# scripts and sticks them into one file! Yes, this one! Right-click on “References” in an F# project:

generaterefs.

And the generated scripts auto-update as you recompile with new references! A+ for the feature, thank you so much.

Comes with a small gotcha, though: sometimes it doesn’t get the order of references quite right and then errors complaining of references not being loaded appear in the interactive. I spent quite a few painful hours wondering how is it that this reference was not loaded, when here it is! Then I realized: it was being loaded AFTER it was required by references coming after it).

#load "load-project-release.fsx"
open DrawGraph

createGraph "digraph{a->b; b->c; 2->1; d->b; b->b; a->d}" "dot.exe" None

initial

Cool. Now I can take this and use my own function to generate a graph from a string adjacency list, visualize it, and even view some of its properties. Sort of make the graph “palpable”:

let sparse = ["a -> b, c, d"; "b -> a, c"; "d -> e, f"; "e -> f"; "1 -> 2, 3"; "3 -> 4, 5"; "x -> y, z"; "2 -> 5"]
let grs = StrGraph.FromStrings sparse

grs.Visualize(clusters = true)

clusters

StrGraph.FromStrings does exactly what it says: it generates a graph from a sequence of strings, formatted like the sparse list above.
My Visualize function is a kitchen sink for all kinds of visuals, driven by its parameters. In the above example, it invokes graph partitioning to clearly mark connected components.

It is important to note, that this functionality was added to the visualizer not because I wanted to see connected components more clearly, but as a quick way to ensure that my partitioning implementation was indeed working correctly.

Generating Data and Looking at It

Now we have a class that builds graphs and even lets us look at them, so where do we get these graphs? The easiest thing (seemed at the time) was to create them.

Enter FsCheck. It’s not the easiest library to use, there is a learning curve and getting used to things takes time, but it’s very helpful. Their documentation is quite good too. The idea is to write a generator for your type and then use that generator to create as many samples as you like:

#load "load-project-release.fsx"

open Graphs
open FsCheck
open System
open DataGen

let grGen = graphGen 3 50
let gr = grGen.Sample(15, 5).[2]

gr.Visualize(into=3, out= 3)

This produces something like:

gengraph

My function graphGen len num generates a graph of text vertices where len is the length of a vertex name and num is the number of vertices. It returns an FsCheck generator that can then be sampled to get actual graphs. This was a one-off kind of experiment, so it’s in a completely separate module:


//DataGen.fs

module DataGen
open FsCheck
open System
open Graphs

let nucl = Gen.choose(int 'A', int 'Z') |> Gen.map char

let genVertex len =  Gen.arrayOfLength len nucl |> Gen.map (fun c -> String(c))
let vertices len number = Gen.arrayOfLength number (genVertex len) |> Gen.map Array.distinct

let graphGen len number =
    let verts = vertices len number
    let rnd = Random(int DateTime.UtcNow.Ticks)
    let pickFrom = verts |> Gen.map (fun lst -> lst.[rnd.Next(lst.Length)])
    let pickTo = Gen.sized (fun n -> Gen.listOfLength (if n = 0 then 1 else n) pickFrom)

    Gen.sized
    <| 
    (fun n ->
        Gen.map2 
            (fun from to' -> 
                from, (to' |> Seq.reduce (fun acc v -> acc + ", " + v))) pickFrom pickTo
        |>
        Gen.arrayOfLength (if n = 0 then 1 else n)
        |> Gen.map (Array.distinctBy fst)
        |> Gen.map (fun arr ->  arr |> Array.map (fun (a, b) -> a + " -> " + b))
    )
    |> Gen.map StrGraph.FromStrings

This whole module cascades different FsCheck generators to create a random graph.
The simplest of them nucl, generates a random character. (Its name comes from the fact that originally I wanted to limit the alphabet to just four nucleotide characters A, C, G, T). Then this generator is used by genVertex to generate a random string vertex, and finally vertices creates an array of distinct random vertices.

graphGen creates a sequence of strings that FromStrings (above) understands. It first creates a string of “inbound” vertices and then adds an outbound vertex to each such string.

Sampling is a little tricky, for instance, the first parameter to the Sample function, which, per documentation, controls sample size, in this case is responsible for complexity and connectivity of the resulting graphs.

On to Euler…

The script above also specifies a couple of optional parameters to the visualizer: into will mark any vertex that has into or more inbound connections in green. And out will do the same for outbound connections and yellow. If the same vertex possesses both properties, it turns blue.

Inspired by all this success, I now want to write a function that would generate Eulerian graphs. The famous theorem states that being Eulerian (having an Euler cycle) for a directed graph is equivalent to being strongly connected and having in-degree of each vertex equal to its out-degree. Thus, the above properties of the visualizer are quite helpful in confirming that the brand new generator I have written for Eulerain graphs (GenerateEulerGraph) is at the very least on track:


let gre = StrGraph.GenerateEulerGraph(10, 5)
gre.Visualize(into=3, out=3)

eulerinout

Very encouraging! Whatever has at least 3 edges out, has at least 3 edges in. Not a definitive test, but the necessary condition of having only blue and transparent vertices in the case of an Eulerian graph is satisfied.

In the next post – more about Eulerian graphs, de Brujin sequences, building (and visualizing!) de Bruijn graphs, used for DNA sequence assembly.

Walking the Euler Path: Intro

September 17, 2016 2 comments

Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.

Representation

The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row i will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with INumeric included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs

/// <summary>
/// Sparse matrix implementation with CSR and CSC storage
/// </summary>
[<StructuredFormatDisplay("{PrintMatrix}")>]
type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) =

....

static member CreateMatrix (row : 'a []) (isCSR : bool) =
    let ops = GlobalAssociations.GetNumericAssociation<'a>()
    let colIdx, vals = 
        Array.zip [|0..row.Length - 1|] row 
        |> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0)
        |> Array.unzip

    SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)

The idea is to use the GlobalAssociations to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to CreateMatrix and then rows can be appended to it by calling AddValues row. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix |M| = m \cdot n, then the answer is given by the equation: m \cdot n > 2 \cdot e + m + 1, here e is the number of non-zero elements in the matrix.

For a graph G=(V, E) the set V takes a place of rows, and E – that of columns. The above inequality becomes: v \cdot e > e + v + 1 \ (v = |V|,\ e = |E|), so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this =
    let rowIndex  = rowIndex.ToArray()
    let colIndex = colIndex.ToArray()
    let nEdges = colIndex.Length
    let verticesNameToOrdinal = verticesNameToOrdinal
    let nVertices = verticesNameToOrdinal.Count

    // vertices connected to the ordinal vertex
    let getVertexConnections ordinal =
        let start = rowIndex.[ordinal]
        let end' = rowIndex.[ordinal + 1] - 1
        colIndex.[start..end']

This is not very useful, however, since it assumes that we already have rowIndex for the CSR type “R” and colIndex for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file Generators.fs.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.
In the next post I’ll talk about visualizing things, and vsiualization methods really have nothing to do with the graph itself. Nevertheless, it is natural to write:

myGraph.Visualize(euler=true)

instead of:

Visualize(myGraph, euler=true)

So we use type augmentations, for instance, going back to the generators:


//Generators.fs

type Graphs.DirectedGraph<'a when 'a:comparison> with
    /// <summary>
    /// Create the graph from a file
    /// </summary>
    /// <param name="fileName"></param>
    static member FromFile (fileName : string) =

        if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file"

        let lines = File.ReadLines(fileName)
        DirectedGraph<string>.FromStrings(lines)

which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.

GPU Split & Sort With Alea.CUDA

June 11, 2016 1 comment

Code

The complete source: here

Intro

Starting to play with graph algorithms on the GPU (who wants to wait for nvGraph from NVIDIA, right?) As one of the precursor problems – sorting large arrays of non-negative integers came up. Radix sort is a simple effective algorithm quite suitable for GPU implementation.

Least Significant Digit Radix Sort

The idea is simple. We take a set A of fixed size elements with lexicographical ordering defined. We represent each element in a numeric system of a given radix (e.g., 2). We then scan each element from right to left and group the elements based on the digit within the current scan window, preserving the original order of the elements. The algorithm is described here in more detail.

Here is a the pseudocode for radix = 2 (easy to extend for any radix):

k = sizeof(int)
n = array.length
for i = 0 to k
    all_0s = []
    all_1s = []
    for j = 0 to n
        if bit(i, array[j]) == 0
        then all_0s.add(array[j])
        else all_1s.add(array[j])
    array = all_0s + all_1s

GPU Implementation

This is a poster problem for the “split” GPU primitive. Split is what <code>GroupBy</code> is in LINQ (or <code>GROUP BY</code> in TSQL) where a sequence of entries are split (grouped) into a number of categories. The code above applies split k times to an array of length n, each time the categories are are “0” and “1”, and splitting/grouping are done based on the current digit of an entry.

The particular case of splitting into just two categories is easy to implement on the GPU. Chapter 39 of GPU Gems describes such implementation (search for “radix sort” on the page).

The algorithm described there shows how to implement the pseudocode above by computing the position of each member of the array in the new merged array (line 10 above) without an intermediary of accumulating lists. The new position of each member of the array is computed based on the exclusive scan where the value of scanned[i] = scanned[i – 1] + 1 when array[i-1] has 0 in the “current” position. (scanned[0] = 0). Thus, by the end of the scan, we know where in the new array the “1” category starts (it’s scanned[n – 1] + is0(array[n – 1]) – total length of the “0” category, and the new address of each member of the array is computed from the scan: for the “0” category – it is simply the value of scanned (each element of scanned is only increased when a 0 bit is encountered), and start_of_1 + (i – scanned[i]) for each member of the “1” category: its position in the original array minus the number of “0” category members up to this point, offset by the start of the “1” category.

The algorithm has two parts: sorting each block as described above and then merging the results. In our implementation we skip the second step, since Alea.CUDA DeviceScanModule does the merge for us (inefficiently so at each iteration, but makes for simpler, more intuitive code).

Coding with Alea.CUDA

The great thing about Alea.CUDA libray v2 is the Alea.CUDA.Unbound namespace that implements all kinds of scans (and reducers, since a reducer is just a scan that throws away almost all of its results).

let sort (arr : int []) =
    let len = arr.Length
    if len = 0 then [||]
    else
        let gridSize = divup arr.Length blockSize
        let lp = LaunchParam(gridSize, blockSize)
        
        // reducer to find the maximum number & get the number of iterations
        // from it.
        use reduceModule = new DeviceReduceModule<int>(target, <@ max @>)
        use reducer = reduceModule.Create(len)

        use scanModule = new DeviceScanModule<int>(target, <@ (+) @>)
        use scanner = scanModule.Create(len)

        use dArr = worker.Malloc(arr)
        use dBits = worker.Malloc(len)
        use numFalses = worker.Malloc(len)
        use dArrTemp = worker.Malloc(len)

        // Number of iterations = bit count of the maximum number
        let numIter = reducer.Reduce(dArr.Ptr, len) |> getBitCount

        let getArr i = if i &&& 1 = 0 then dArr else dArrTemp
        let getOutArr i = getArr (i + 1)

        for i = 0 to numIter - 1 do
            // compute significant bits
            worker.Launch <@ getNthSignificantReversedBit @> lp (getArr i).Ptr i len dBits.Ptr

            // scan the bits to compute starting positions further down
            scanner.ExclusiveScan(dBits.Ptr, numFalses.Ptr, 0, len)

            // scatter
            worker.Launch <@ scatter @> lp (getArr i).Ptr len numFalses.Ptr dBits.Ptr (getOutArr i).Ptr

        (getOutArr (numIter - 1)).Gather()

let generateRandomData n =
    if n <= 0 then failwith "n should be positive"
    let seed = uint32 DateTime.Now.Second

    // setup random number generator        
    use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32>
    use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n
    
    // create random numbers
    cudaRandom.Fill(0, n, prngBuffer)
    // transfer results from device to host 
    prngBuffer.Gather() |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x))


This is basically it. One small optimization - instead of looping 32 times (length of int in F#), we figure out the bit count the largest element and iterate fewer times.

Helper functions/kernels are straightforward enough:

let worker = Worker.Default
let target = GPUModuleTarget.Worker worker

let blockSize = 512

[<Kernel; ReflectedDefinition>]
let getNthSignificantReversedBit (arr : deviceptr<int>) (n : int) (len : int) (revBits : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then
        revBits.[idx] <- ((arr.[idx] >>> n &&& 1) ^^^ 1)

[<Kernel; ReflectedDefinition>]    
let scatter (arr : deviceptr<int>) (len: int) (falsesScan : deviceptr<int>) (revBits : deviceptr<int>) (out : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then

        let totalFalses = falsesScan.[len - 1] + revBits.[len - 1]

        // when the bit is equal to 1 - it will be offset by the scan value + totalFalses
        // if it's equal to 0 - just the scan value contains the right address
        let addr = if revBits.[idx] = 1 then falsesScan.[idx] else totalFalses + idx - falsesScan.[idx]
        out.[addr] <- arr.[idx]

let getBitCount n =
    let rec getNextPowerOfTwoRec n acc =
        if n = 0 then acc
        else getNextPowerOfTwoRec (n >>> 1) (acc + 1)

    getNextPowerOfTwoRec n 0

Generating Data and Testing with FsCheck

I find FsCheck quite handy for quick testing and generating data with minimal setup in FSharp scripts:

let genNonNeg = Arb.generate<int> |> Gen.filter ((<=) 0)

type Marker =
    static member arbNonNeg = genNonNeg |> Arb.fromGen
    static member ``Sorting Correctly`` arr =
        sort arr = Array.sort arr

Arb.registerByType(typeof<Marker>)
Check.QuickAll(typeof<Marker>)


Just a quick note: the first line may seem weird at first glance: looks like the filter condition is dropping non-positive numbers. But a more attentive second glance reveals that it is actually filtering out negative numbers (currying, prefix notation should help convince anyone in doubt).

Large sets test.

I have performance tested this on large datasets (10, 000, 000 - 300, 000, 000) integers, any more and I'm out of memory on my 6 Gb Titan GPU. The chart below goes up to 100,000,000.
I have generated data for these tests on the GPU as well, using Alea.CUDA random generators, since FsCheck generator above is awfully slow when it comes to large enough arrays.

The code comes almost verbatim from this blog post:

let generateRandomData n =
    if n <= 0 then failwith "n should be positive"
    let seed = uint32 DateTime.Now.Second

    // setup random number generator        
    use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32>
    use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n
    
    // create random numbers
    cudaRandom.Fill(0, n, prngBuffer)
    // transfer results from device to host 
    prngBuffer.Gather() |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x))

except I'm converting the generated array to integers in the [0; n] range. Since my radix sort work complexity is O(k * n) (step complexity O(k))  where k = bitCount(max input) I figured this would give an adequate (kinda) comparison with the native F# Array.orderBy .

sorting_chart

Optimization

It is clear from the above chart, that there is a lot of latency in our GPU implementation - the cost we incur for the coding nicety: by using DeviceScanModule and DeviceReduceModule we bracket the merges we would otherwise have to do by hand. Hence the first possible optimization: bite the bullet, do it right, with a single merge at the end of the process and perform a block-by-block sort with BlockScanModule

Categories: CUDA, F# Tags: ,

Look-and-say: [Alea.]CUDA

January 6, 2016 Leave a comment

Continuing the Advent of Code theme from the previous post. Figured since this year is going to be my year of CUDA, this would be a good opportunity to take it for a ride. A good April 1st post, but why wait?

So, how can we make this even faster than the already fast imperative solution?

Naive Approach

The complete script for this is on GitHub.

The kernel computes one iteration of the look-and-say sequence. Every instance looks at a single group of repeating digits and outputs the result into an array. The new array is twice the size of the original one (representing a previous member of the sequence), every digit or the first digit of a group will produce two number (n, c) and every repeating digit will produce two “0”s on the output:

1 1 1 3 1 2 2 1 1 3 -> 3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3

[<Kernel; ReflectedDefinition>]
let lookAndSayKernel (arr : deviceptr<int8>) (len : int) (out : deviceptr<int8>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x

    if ind < len then
        let c = arr.[ind]
        let idxOut = 2 * ind

        if ind = 0 || arr.[ind - 1] <> c then
            let mutable i = 1
            while ind + i < len && c = arr.[ind + i] do
                i <- i + 1
            out.[idxOut] <- int8 i
            out.[idxOut + 1] <- c
        else
            out.[idxOut] <- 0y
            out.[idxOut + 1] <- 0y

In the naive approach we bring the resulting sequence into the memory and repeat:

let lookAndSayGpu (s : string) n =
    let arr = s |> Seq.map (fun c -> (string>>int8) c) |> Seq.toArray

    let rec loop (arr : int8 []) n =
        if n = 0 then arr.Length
        else
            let blockSize = 512
            let gridSize = divup arr.Length blockSize
            let lp = LaunchParam(gridSize, blockSize)

            use dArr = worker.Malloc(arr)
            use dOut = worker.Malloc(arr.Length * 2)

            worker.Launch <@lookAndSayKernel@> lp dArr.Ptr arr.Length dOut.Ptr

            let out = dOut.Gather()

            let arr = 
                out
                |> Array.filter (fun b -> b > 0y) 

            loop arr (n - 1)

    loop arr n

This is almost not worth checking against the high performing imperative algorithm of the last post, but why not?
looksay_cpu_gpu_slow

Not unexpectedly we are still falling short of the imperative solution. Not a problem. There are a few very low hanging fruit here that are just waiting to be picked.

A Better Solution

The complete script for this is on GitHub.

There are two problems with our solution so far:

  1. Two much memory being moved between the GPU and RAM
  2. A significant part of the algorithm is executed on the CPU with memory allocated and destroyed in the process (that would be the array compacting part)

The following algorithm addresses both problems:

GenerateSequences(str, n) {
//Allocate the maximum amount of memory for all data structures used on the GPU
len = str.Length
initializeGpuMemory(str, len)
for i = 1 to n {
    lookAndSayKernel(len)
    compactResultStream(2 * len)
    len = getNewLength()
    }
}

Of the 3 steps in this iterative loop, there is only one where in the implementation we move data from the GPU. It would be easy enough to not do it if necessary, though.

At the core of this algorithm there is a stream compaction routine that is executed hopefully entirely on the GPU. Compaction is performed by copying only the values needed to the addresses in the output stream. These addresses are determined by running an inclusive scan on the array of integers that contains “1” in a meaningful position, and “0” in all the rest. The details of how scan is used for stream compaction and how scans are implemented on the GPU, are discussed in GPU Gems 3 (free from NVIDIA). I’m using a device scan that comes with Alea.CUDA, which, hopefully, minimizes memory copies between main memory and GPU.

Now that the stream has been compacted, we get the length of the new sequence as the last entry in the address map, obtained during compaction (and explained in the linked chapter of GPU Gems 3).

[<Kernel; ReflectedDefinition>]
let lookAndSayKernelSimple (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) (flags : deviceptr<int>)=
    let ind = blockIdx.x * blockDim.x + threadIdx.x

    if ind < len then
        let c = arr.[ind]
        let idxOut = 2 * ind
        let prevUnrepeated = if ind = 0 || arr.[ind - 1] <> c then 1 else 0
        flags.[idxOut] <- prevUnrepeated
        flags.[idxOut + 1] <- prevUnrepeated

        if prevUnrepeated = 1 then
            let mutable i = 1
            while ind + i < len && c = arr.[ind + i] do
                i <- i + 1
            out.[idxOut] <- i
            out.[idxOut + 1] <- c
        else
            out.[idxOut] <- 0
            out.[idxOut + 1] <- 0

[<Kernel; ReflectedDefinition>]
let copyScanned (arr : deviceptr<int>) (out : deviceptr<int>) (len : int) (flags : deviceptr<int>) (addrmap : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x

    if ind < len && flags.[ind] > 0 then out.[addrmap.[ind] - 1] <- arr.[ind]

In the modified kernel, we also fill out the “flags” array which are scanned to obtain the address map for compacting the stream. The second kernel uses the result of the scan performed on these “flags” to produce the new compacted sequence. So for the above example:

original: 1 1 1 3 1 2 2 1 1 3
new:     3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3
flags:     1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1
map:      1 2 2 2 2 2 3 4 5 6 7 8 8 8 9 10 10 10 11 12 (inclusive scan of the flags)

The last entry in the map is also the new length.

let lookAndSayGpuScan (s : string) n =
    let maxLen = 20 * 1024 * 1024
    let arr = s |> Seq.map (fun c -> (string>>int) c) |> Seq.toArray

    use dOut = worker.Malloc(maxLen)
    use dFlags = worker.Malloc(maxLen)
    use dAddressMap = worker.Malloc(maxLen)
    use dArr = worker.Malloc(maxLen)
    dArr.Scatter(arr)

    use scanModule = new DeviceScanModule<int>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    scanModule.Create(100) |> ignore

    let sw = Stopwatch()
    let rec loop n len =
        if n = 0 then len
        else
            let blockSize = 512
            let gridSize = divup len blockSize
            let lp = LaunchParam(gridSize, blockSize)
            use scan = scanModule.Create(2 * len)

            worker.Launch <@lookAndSayKernelSimple@> lp dArr.Ptr len dOut.Ptr dFlags.Ptr
            scan.InclusiveScan(dFlags.Ptr, dAddressMap.Ptr, 2 * len)

            let gridSize = divup (2 * len) blockSize
            let lp = LaunchParam(gridSize, blockSize)

            worker.Launch <@copyScanned@> lp dOut.Ptr dArr.Ptr (2 * len) dFlags.Ptr dAddressMap.Ptr
            let len = dAddressMap.GatherScalar(2 * len - 1)

            loop (n - 1) len

    sw.Start()
    let res = loop n s.Length
    sw.Stop()
    res, sw.ElapsedMilliseconds

The complete code is above. Notice, that we are not cheating when we are measuring the speed of the internal loop and not including allocations. These can be factored out completely, so the only experiment that matters is what actually runs.

Here are the comparisons:
looksay_fast

Now, we are finally fast! But it’s a mighty weird chart, probably because most of the time is spent in context switching and all other latency necessary to make this run…

So, takeaways: GPU programming makes us think differently, cool GPU algorithms exist, writing for the GPU is tough but rewarding, and we can really blow a small problem out of any proportion considered descent.

Categories: CUDA, F# Tags: , , ,

Non-linear Thinking with CUDA.

September 8, 2015 1 comment

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion in more than one sense of the word.

The Problem

Given a set A = \{a_1, a_2 \ldots a_n\}, output a set S_A = \{0,\ \sum\limits_{k=1}^{n} a_k,\ \sum\limits_{k=i}^{i + j \mod n} a_k | 1 \le i < n, 0 \le j \le n - 1\}

In other words, pretend our array is cyclical, and we want all partial sums of the array elements, in order, starting from element i, ending with i + j. (We will be looping through the end, since the array is cyclical.)

In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide.

So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

57 71 113 113 128
57 + 71 71 + 113 113 + 113 113 + 128 128 + 57
57 + 71 + 113 71 + 113 + 113 113 + 113 + 128 113 + 128 + 57 128 + 57 + 71

Code

On my GitHub.

Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is O(n^3):

// brute force O(n^3) solution
let cyclospec (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |&gt; Seq.toArray

    let mutable parsum = 0

    (seq {
        yield 0
        yield pepArr |&gt; Array.sum
        for i = 0 to len - 2 do
            yield! [0..len-1]
                |> Seq.map
                    (fun j -&gt;
                        //F# 4.0 Feature!
                        parsum &lt;- 0
                        for ind = j to j + i do
                            parsum &lt;- parsum + pepArr.[if ind >= len then ind - len else ind]
                        parsum
                        )
    }) |> Seq.sort |> Seq.toArray

Note the F# 4.0 feature: a variable mutated within a closure.

This will work, but slowly. To significantly speed it up we only need to notice that there is no need to recompute values that have been precomputed already over, and over, and over again. So, memoization yields an O(n^2) solution:

let cyclospecOpt (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray

    let output = Array.zeroCreate ((len - 1) * len)

    for i = 0 to len - 2 do
        [0..len-1] 
            |> Seq.iter 
                (fun j ->  
                    let ind = i + j
                    output.[i * len + j] <- 
                        if i = 0 then pepArr.[j] 
                        else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind]
                    )
    seq {yield 0; yield pepArr |> Array.sum; yield! output} |> Seq.sort |> Seq.toArray

Ok, so far so good.

Multi-dimensional thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of a_5 + a_6 + a_7, thus it’s a grid X of dimensions n x (n-1)( where x_{i, j} is the sum, starting from i-th element of the input set, and ranging over j elements.

This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).

Since we now have ALEA.Cuda, we can do this without leaving F#, and, even better, we can script it:

[<Kernel; ReflectedDefinition>]
let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    let lenElem = blockIdx.y * blockDim.y + threadIdx.y

    if ind < len && lenElem < len - 1 then
        let mutable parsum = 0
        for i = 0 to lenElem do
            let idx = ind + i
            parsum <- parsum + arr.[if idx >= len then idx - len else idx]
        out.[lenElem * len + ind] <- parsum

And we invoke it:

let cyclospecGpu (peptide : int []) =
    let blockSize = dim3(16, 16)
    let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y)
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    worker.Launch <@cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

Each kernel thread is computing its own element in the grid, which is flattened into a solution. Time has indeed merged with data.

But what about our optimization? The easy way to implement it is to streamline invocations of a kernel, where each consecutive run will compute the new sum using the results of previous kernel invocations. There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

[[]
let cyclospecOptKernel (arr : deviceptr) (len : int) (out : deviceptr) (lenElem : int)=
let ind = blockIdx.x * blockDim.x + threadIdx.x

if ind < len then
let idx = ind + lenElem
out.[lenElem * len + ind] = len then idx – len else idx]

Invocation:

let cyclospecOptGpu (peptide : int []) =
    let blockSize = 256
    let gridSize = divup peptide.Length blockSize
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    for i = 0 to peptide.Length - 2 do
        worker.Launch <@cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

So, how did we do? (Here comes the mandatory chart):
cyc

And so:

  1. A little algorithm is a beautiful thing
  2. A little CUDA is an even more beautiful thing
  3. CUDA optimization performs no worse and at larger sizes of the input starts performing better, at some point massive amounts of computation slow us down even in parallel. See #1.
  4. And what is the GPU algorithm performance in O notation? It’s definitely non-linear, is it possible to estimate analytically?

Here is another chart that only compares optimized algorithms. It’s not a semi-log chart for easier visualization:
cyc_nonlog

Fun with Alea.CUDA, F# Interactive, Charts

July 19, 2015 1 comment

Source code for this post can be found on my GitHub.

It’s great to see technologies evolving over the years. Alea.CUDA has done so in leaps and bounds since the first time I laid eyes on it a couple of years ago. At the time the name seemed unfortunate and hinted at the “aleatic” programming paradigm, meaning that you could not reliably predict the results your code would produce in consequent runs (alea is a Latin dice). It has all changed since.

I listened to an NVIDA Webinar on the subject and immediately wanted to take Alea.CUDA development for a test drive. I wanted to do it the same way I would in Python: just like we use Numba and IPython console, use F# scripting with F# Interactive.

For a test subject, I picked the same 1D LBP described in my post here. The actual algorithm for extracting the local binary pattern and its use in face recognition is described in this paper.

The algorithm consists of two parts:

  1. Sliding a window of size 2*n across the array and computing the pattern at each location
  2. Computing the histogram of these values.

Here is my F# way of doing it:

let lbp (arr : int[]) n =
    if arr.Length < n * 2 + 1 
        then failwith ("Length should be at least " + (n * 2 + 1).ToString())

    // computes a pattern around an array element i
    let calcOnePattern i (around : int [])=
        [0..around.Length - 1] 
        |> Seq.fold (fun st j -> if around.[j] > i then st ||| (1 <<< j) else st) 0 

    // moves sliding windoe of size '2*n' through the array and computes the pattern for each 
    // element at the center of a window
    let prep =
        [|n .. arr.Length - n - 1|]
        |> Array.map (fun i -> calcOnePattern arr.[i] (Array.concat [arr.[i - n..i - 1]; arr.[i + 1..i + n]]))
    
    // return histogram
    prep 
    |> Array.fold (fun (st : int[]) v -> st.[v] <- st.[v] + 1; st) 
          (Array.zeroCreate (1 <<< n * 2))

For comparison, here is the Python code (here p is 2 ** [0..(2n-1)]), in F# code I’m simply shifting left, so no such array is necessary.

def extract_1dlbp_cpu(input, neighborhood, p):
    hist = np.zeros(1 << (2 * neighborhood))
    for i in range(neighborhood, len(input) - neighborhood):
        left = input[i - neighborhood : i]
        right = input[i + 1 : i + neighborhood + 1]
        both = np.r_[left, right]
        hist[np.sum(p [both >= input[i]])] += 1
    return hist

Putting it on GPU

I took one of Alea.CUDA provided sample scripts and modified it slightly. This is why my code is wrapped in a class. Here it is:

type LBP(arr : int [], ?n : int) =
    inherit GPUModule(GPUModuleTarget.DefaultWorker)

    let mutable arr = arr
    let n = defaultArg n 4
    do
        if arr.Length < n * 2 + 1 then failwith ("Length should be at least " + (n * 2 + 1).ToString())

    member this.Arr 
        with get() = arr
        and set (value) = arr <- value

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (arr:deviceptr<int>) (n:int)  (hist:deviceptr<int>) (len : int)=
        let mutable i = blockIdx.x * blockDim.x + threadIdx.x
        let mutable res = 0
        if i < len - 2 * n then
            i <- i + n
            for j = i - n to i - 1 do
                if arr.[j] >= arr.[i] then res <- res ||| (1 <<< (j - (i - n)))

            for j = i + 1 to i + n do
                if arr.[j] >= arr.[i] then res <- res ||| (1 <<< (j - (i - n + 1)))

            __atomic_add (hist + res) 1 |> ignore

    member this.Compute () =
        let blockSize = 512
        let gridSize = divup (arr.Length - 2 * n) blockSize
        let lp = LaunchParam(gridSize, blockSize)

        let hist = Array.zeroCreate (1 <<< n * 2)
        use d_arr = this.GPUWorker.Malloc(arr)
        use d_hist = this.GPUWorker.Malloc(hist)
        this.GPULaunch <@ this.Kernel @> lp d_arr.Ptr n d_hist.Ptr (arr.Length)
        d_hist.Gather()

There is not much here, at least in spirit, that any CUDA developer is not used to. We define a kernel and then launch it in a very familiar way, albeit wrapped in a slightly unfamiliar form. Nothing that is not easy to understand or adopt, though. F# quotations are used to interface with nvcc, so makes sense. Pointer arithmetic is wrapped nicely, so code like the one on line 25 above (__atomic_add) is possible. Works great, easy to use.

Here the kernel, as one would expect, computes a pattern around a single array element, at the same time contributing its result to the final histogram. Definitely room for optimization as this atomic_add is an obvious bottleneck, but I didn’t go any further since the results are quite spectacular already.

Also, I really like the “divup” function above, that figures out the number of blocks. Normally it is done as:

gridSize = (total + blockSize) / blockSize

… a small thing, but a nice convenience.

Boilerplate

Since I was doing it in F# Interactive, some boilerplate was necessary. I installed my binaries through NuGet and here are the references. The “send to F# interactive” functionality that has been around for a while now is a great help to figure things out.

#r "System.Configuration.dll"
#I @"packages\Alea.Cuda.2.1.2.3274\lib\net40"
#I @"packages\NUnit.2.6.3\lib\"
#I @"packages\FsUnit.1.3.0.1\Lib\Net40"
#r "Alea.CUDA.dll"
#r "nunit.framework.dll"
#r "FsUnit.NUnit.dll"
#load @"packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx"

open System
open System.IO
open Alea.CUDA
open Alea.CUDA.Utilities
open FsUnit
open FSharp.Charting
open System.Diagnostics

Alea.CUDA.Settings.Instance.Resource.AssemblyPath <- Path.Combine(__SOURCE_DIRECTORY__, @"packages\Alea.Cuda.2.1.2.3274\private")
Alea.CUDA.Settings.Instance.Resource.Path <- Path.Combine(__SOURCE_DIRECTORY__, @"release")

The last two lines come straight from Alea tutorial sample and are necessary to hook up the jit compiler. That’s it!

Experiments & results

I used the following functions to drive the experiments:

// random array of length n
let generate n =
    let rng = Random(int DateTime.Now.Ticks)
    Array.init n (fun _ -> rng.Next())
    
// experiment: run from 10 ** low to 10 ** high array length
let experiment low high =
    if low >= high || low < 0 then failwith "must be: low < high, both non-negative"

    // salt it
    let arr = [|0..1000|]
    use lb = new LBP(arr)
    lb.Compute() |> ignore

    let sw = Stopwatch()
    let cpuTimes = Array.zeroCreate (high - low + 1)
    let gpuTimes = Array.zeroCreate (high - low + 1)


    for i = low to high do
        let range = int (10.0 ** float i)
        let arr = generate range
        lb.Arr <- arr

        // Run on CPU
        sw.Restart()
        printfn "Legnth: %d" range
        printfn "-----------"
        let h1 = lbp arr 4
        sw.Stop()

        let idx = i - low

        cpuTimes.[idx] <- range, sw.Elapsed.TotalSeconds
        printfn "Computed on CPU: %0.5f sec" (snd cpuTimes.[idx])

        //Run on GPU
        sw.Restart()
        let h2 = lb.Compute()
        sw.Stop()
        gpuTimes.[idx] <- range, float sw.Elapsed.TotalSeconds
        printfn "Computed on GPU: %0.5f sec" (snd gpuTimes.[idx])
        printfn ""

        // make sure we are ok
        should equal h1 h2

    Chart.Combine(
        [Chart.Line(cpuTimes, Name="CPU")
         Chart.Line(gpuTimes, Name="GPU")
        ] 
    )
        .WithYAxis(Log=true, Title = "sec")
        .WithXAxis(Min=10.**float low, Log=true, Title = "length")
        .WithLegend(InsideArea=true) 

The helper “generate” generates a random integer array of arbitrary length, and the “experiment” function runs the experiments on arrays of sizes 10 ** [low..high]
with

experiment 3 7

I got:


Legnth: 1000
-----------
Computed on CPU: 0.00937 sec
Computed on GPU: 0.00285 sec

Legnth: 10000
-----------
Computed on CPU: 0.02445 sec
Computed on GPU: 0.00308 sec

Legnth: 100000
-----------
Computed on CPU: 0.17697 sec
Computed on GPU: 0.00388 sec

Legnth: 1000000
-----------
Computed on CPU: 1.70085 sec
Computed on GPU: 0.00412 sec

Legnth: 10000000
-----------
Computed on CPU: 16.53772 sec
Computed on GPU: 0.03045 sec

Here the straight F# mode runs about twice as fast as pure Python:

Length: 1000
--------------
Finished on CPU: time: 0.02263s

Length: 10000
--------------
Finished on CPU: time: 0.28701s

Length: 100000
--------------
Finished on CPU: time: 2.88549s

Length: 1000000
--------------
Finished on CPU: time: 29.97346s

The GPU performances are about comparable between Python and Numba and F# with Alea.CUDA. Same card: GTX Titan with 6Gb RAM, 14 SMPs

Of course a mandatory chart produced with FSharpChart and the following small scriptlet:

    Chart.Combine(
        [Chart.Line(cpuTimes, Name="CPU")
         Chart.Line(gpuTimes, Name="GPU")
        ] 
    )
        .WithYAxis(Log=true, Title = "sec")
        .WithXAxis(Min=10.**float low, Log=true, Title = "length")
        .WithLegend(InsideArea=true) 

(Still not sure, why my X axis starts from 100 and not from 1000, but I guess it’s minor :)).

alea

A great experience doing all this overall: smooth and painless.

Categories: CUDA, F#, Parallel Tags: , , , , ,