This blog chronicles my experiences learning F# via implementing Push: a programming language designed for use in evolving populations of programs geared for solving symbolic regression problems (genetic programming). It has since evolved to contain bits and pieces reflecting my thorny twisted software development path.

Getting Emotional with Affectiva, F#, and Emgu

I’ve been playing with Affectiva emotion, demographics, and face detection SDK, found it excellent, however, their sample gallery lacks a sample in F#! So here we are to correct that.

I just wanted a simple F# script that would let me take all kinds of the SDK options for a ride. The script itself is 130 lines. Out of that about 30 lines is just a boilerplate to load all the relevant libraries, setup the environment, etc.

Finally, here I am goofing off in front of my webcam.

Setup

Not much in terms of setup. So, yes, regular things for downloading/installing EmguCV, OpenCV, and installing Affectiva SDK.

Then all this needs to be reflected in the script:

open System

Environment.CurrentDirectory <- @"C:\Program Files\Affectiva\Affdex SDK\bin\release"
#r "../packages/EmguCV.3.1.0.1/lib/net30/Emgu.CV.UI.dll"
#r "../packages/EmguCV.3.1.0.1/lib/net30/Emgu.CV.UI.GL.dll"
#r "../packages/EmguCV.3.1.0.1/lib/net30/Emgu.CV.World.dll"
#r "../packages/OpenTK.1.1.2225.0/lib/net20/OpenTK.dll"
#r "System.Drawing.dll"
#r "System.Windows.Forms.dll"
#r @"C:\Program Files\Affectiva\Affdex SDK\bin\release\Affdex.dll"

open Affdex
open Emgu.CV
open Emgu.CV.CvEnum
open System.IO
open System.Collections.Generic
open Emgu.CV.UI
open Emgu.CV.Structure
open System.Drawing
open System.Linq
open System.Diagnostics

let classifierPath = @"C:\Program Files\Affectiva\Affdex SDK\data"
let resources = Path.Combine(__SOURCE_DIRECTORY__, "Resources")



Just loading libraries, no big deal. Except we need to make sure Affdex.dll finds its dependencies, hence setting the current path at the beginning.

Initializing the Detector

let detector = new CameraDetector()
try
detector.setClassifierPath(classifierPath)

detector.setDetectAllEmotions(true);
detector.setDetectAllExpressions(false);
detector.setDetectAllEmojis(true);
detector.setDetectGender(true);
detector.setDetectGlasses(true);
detector.setDetectEngagement(true);
detector.setDetectValence(true);
detector.setDetectAttention(true);
detector.setDetectAge(true);

detector.setImageListener(imageListener)
detector.setProcessStatusListener(processStatusListener)

detector.start();

sw.Start()
while not finished do
sw.Stop()
finally
detector.Dispose()


Here setDetectGlasses is my favorite. Check it out in the video.

I’m using CameraDetector to capture video from the webcam, if I needed to capture a file video I’d use VideoDetector. Setting properties is easy, albeit slightly confusing at first – all these subtle differences between valence and attention… It makes sense when you get used to it. My favorite is setDetectAllEmojis. The SDK comes with quite a few emojis that can be used to reflect what’s going on in the video.

The VideoDetector is set up in a similar way, except you also need to issue detector.process() to start running, camera detector does it automatically.

I would also like to use use instead of let to instantiate the disposable detector, but cannot do it in the script, so true to an instinct for plugging memory leaks before they spring, I wrapped it in the try..finally – not at all necessary in a script, and I don’t do it for EmguCV elements anyway. This is not a production code practice.

Fun Part: Processing Results

As processed frames start coming in, we hook up to the detector image listener (detector.setImageListener()) which will feed us images and all kinds of fun stats as they come in. Also, setProcessStatusListener will tell us when things are done or errors occur.

let imageListener = {
new ImageListener with
member this.onImageCapture (frame : Affdex.Frame) = ()

member this.onImageResults(faces : Dictionary<int, Face>, frame : Affdex.Frame) =
let img = new Image<Rgb, byte>(frame.getWidth(), frame.getHeight());
img.Bytes <- frame.getBGRByteArray()

let faces = faces |> Seq.map (fun kvp -> kvp.Key, kvp.Value) |> Seq.toArray

// draw tracking points
faces.ToList().ForEach(fun (idx, face) ->
let points = face.FeaturePoints |> Array.map featurePointToPoint
let tl, br = Point(points.Min(fun p -> p.X), points.Min(fun p -> p.Y)), Point(points.Max(fun p -> p.X), points.Max(fun p -> p.Y))

let rect = Rectangle(tl, Size(Point(br.X - tl.X, br.Y - tl.Y)))
CvInvoke.Rectangle(img, rect, Bgr(Color.Green).MCvScalar, 2)

// tracking points
points.AsParallel().ForAll(fun p ->
CvInvoke.Circle(img, p, 2, Bgr(Color.Red).MCvScalar, -1)
)

// age
let age = string face.Appearance.Age
CvInvoke.PutText(img, age, Point(rect.Right + 5, rect.Top), FontFace.HersheyComplex, 0.5, Bgr(Color.BlueViolet).MCvScalar, 1)

// gender & appearance
let gender = int face.Appearance.Gender

// glasses
let glasses = int face.Appearance.Glasses

let appearanceFile = makeFileName gender glasses
loadIntoImage img appearanceFile (Point(rect.Right + 5, rect.Top + 15)) iconSize

// emoji
if face.Emojis.dominantEmoji <> Affdex.Emoji.Unknown then
let emofile = Path.ChangeExtension(Path.Combine(resources, (int >> string) face.Emojis.dominantEmoji), ".png")
loadIntoImage img emofile (Point(rect.Left, rect.Top - 50)) iconSize
)

viewer.Image <- img.Mat
}

let processStatusListener = {
new ProcessStatusListener with
member this.onProcessingException ex = ()
member this.onProcessingFinished () = finished <- true
}



Nothing all that tricky about this code. F# object expression comes in handy for quickly creating an object that implements an interface. onImageResults is the key function here. It processes everything and sends it to the EmguCV handy viewer, which is launched at the start of script execution and runs asynchronously (I like how it doesn’t force me to modify its UI elements on the same thread that created it. This is totally cheating and smells buggy, but it’s so convenient for scripting!)

// Create our simplistic UI
let viewer = new ImageViewer()
let sd =
async {
return (viewer.ShowDialog()) |> ignore
}
Async.Start(sd)


In the first couple of lines we transform the captured frame to EmguCV-understandable format. I am using Image rather than the recommended Mat class, because I want to splat emojis over the existing frames and as amazing as it is, the only way to do it that I know of in EmguCV is this counter-intuitive use of ROI. If anyone knows a better way of copying one image on top of another (should be easy, right?) please let me know.

The next few lines draw the statistics on the image: tracking points, emojis, and demographic data. Emojis are stored in files located in the resources path (see above, in my case I just copied them locally) with file names matching the SDK emoji codes. A simple function transforms these codes into file names. Finally, the modified frame is sent to the EmguCV viewer. That’s it!

let featurePointToPoint (fp : FeaturePoint) = Point(int fp.X, int fp.Y)
let mutable finished = false
let makeFileName i j = Path.ChangeExtension(Path.Combine(resources, String.Format("{0}{1}", i, j)), ".png")


Image Copy

The following two functions do the magic of copying emojis on top of the image:

let copyImage (src : Image<Bgr, byte>) (dest : Image<Rgb, byte>) (topLeft : Point) =
let prevRoi = dest.ROI
dest.ROI <- Rectangle(topLeft, src.Size)
CvInvoke.cvCopy(src.Ptr, dest.Ptr, IntPtr.Zero)
dest.ROI <- prevRoi

let loadIntoImage (img : Image<Rgb, byte>) (file : string) (topLeft : Point) (size : Size) =
let src = new Image<Bgr, byte>(size)
CvInvoke.Resize(new Image<Bgr, byte>(file), src, size)
copyImage src img topLeft


copyImage first sets the ROI of the destination, then issues a legacy cvCopy call. It operates on pointer structures which is so ugly! There really should be a better way.

Categories: F#

Zooming Through Euler Path: Supercharging with GPU

So, continuing where we left off:

For the Win

And finally I ran the GPU-enabled algorithm for finding the Euler path.

let sw = Stopwatch()
let N = 1024 * 1024
let k = 7
let avgedges k = [1..k] |> List.map float |> List.average
let gr = StrGraph.GenerateEulerGraph(N * 10, k)
printfn "Generated euler graph in %A, edges: %s" sw.Elapsed (String.Format("{0:N0}", gr.NumEdges))

let eulerCycle = findEulerTimed gr // GPU-based algorithm
sw.Restart()
let eulerVert = gr.FindEulerCycle() // Hierholzer algorithm
sw.Stop()
let cpu = float sw.ElapsedMilliseconds
printfn "CPU: Euler cycle generated in %A" sw.Elapsed


And the results:

Generating euler graph: vertices = 10,485,760; avg out/vertex: 4
Generated euler graph in 00:00:19.7520656, edges: 41,944,529
Euler graph: vertices - 10,485,760.00, edges - 41,944,529.00
1. Predecessors computed in 00:00:03.2146705
2. Partitioned linear graph in 00:00:06.4475982
Partitions of LG: 6
3. Circuit graph generated in 00:00:31.4655218
4. Swips implemented in 00:00:00.2189634

GPU: Euler cycle generated in 00:00:41.3474044
CPU: Euler cycle generated in 00:01:02.9022833


And I was like: WOW! Finally! Victory is mine! This is awesome! I’m awesome, etc. Victory dance, expensive cognac.
Then, after the euphoria subsided a little, I decided to make the mandatory chart:

Well, this was sobering!
While the CPU series line displays expected behavior, something is definitely not right with the GPU series: there is obviously some variable at work that I am not taking into account. So, from the beginning.

The Algorithm

I owe the algorithm to this master thesis, which actually implements the algorithm proposed by B. Awerbuch, A. Israeli and Y. Shiloach, “Finding euler circuits in logarithmic parallel time,” in Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 249-257.

The algorithm as I see it may be split into 4 stages (even 3, but 4 is slightly more convenient implementation-wise). Let’s illustrate.
Start with an Euler graph like the one below. It has 15 vertices with an average of 3 edges/vertex in one direction (maxOutOrInEdges = k, we have 44 edges in this case):

let N = 15
let k = 5
let gr = StrGraph.GenerateEulerGraph(N, k)
gr.Visualize(edges=true)


1. We walk it as we like, computing edge predecessors. For two edges $e_1 = (u_1, v_1), e_2 = (u_2, v_2),\ e_1$ is a predecessor of $e_2$ iff $v_1 \equiv u_2$, i.e. One edge begins where its predecessor ends. In our representation it’s easy to construct the array of predecessors:

let predecessors (gr : DirectedGraph<'a>) =

let rowIndex = arrayCopy gr.RowIndex
let ends = gr.ColIndex

let predecessors = Array.create gr.NumEdges -1

[|0..ends.Length - 1|]
|> Array.iter
(fun i ->
predecessors.[rowIndex.[ends.[i]]] <- i
rowIndex.[ends.[i]] <- rowIndex.[ends.[i]] + 1
)

predecessors


2. At this point, if we are lucky, we have the representation of an Euler cycle as edges of the graph. We just need to walk the array we have “backwards”, seeding the final list of edges with edge 0, constructing the list recursively like so: predecessors.[List.head result] :: result. Alternatively, we may generate a graph out of the result and reverse it. (directions of the arrows need to be reversed since this is a predecessor graph. Euler cycles of the graph, where all directions are reversed are the same as those of the original one, reversed.)

In case we aren’t lucky, we consider our predecessor array to be a graph, where each edge of the original graph becomes a vertex and identify partitions of the graph:

This is the weak point of the algorithm. Partitioning a graph is, in general, a hard problem (NP-complete, to be precise), however, in this case, due to a very simple structure of the predecessor graph, the complexity is linear in the number of edges of the original graph: O(|E|).

let partitionLinear (end' : int [])=
let allVertices = HashSet<int>(end')
let colors = Array.create end'.Length -1
let mutable color = 0

while allVertices.Count > 0 do
let mutable v = allVertices.First()
while colors.[v] < 0 do
allVertices.Remove v |> ignore
colors.[v] <- color
v <- end'.[v]
color <- color + 1
colors, color


So, now the goal is to join all the circles above into one circle, this is done in the crucial step 3

3. We further collapse the graph based on partitioning. Now, each partition becomes a vertex of the new graph. Edges of this new “circuit graph” are vertices of the original graph, such that each edge represents a vertex two partitions have in common.

This is the only part of the algorithm where the GPU is used and is very effective. Incidentally, I took the code almost verbatim from the original thesis, however, the author for some reason preferred not to implement this step on the GPU.

The idea is simple: we loop over the original graph vertex-by-vertex and try to figure out whether edges entering this vertex belong to different partitions (have different colors in the terminology of the code above). Each vertex is processed in a CUDA kernel:

let gcGraph, links, validity = generateCircuitGraph gr.RowIndex partition maxPartition
gcGraph.Visualize()


4. This graph is greatly over-determined: we don’t need ALL vertices that partitions have in common (represented by edges here). Also, it’s important to note that this graph is not directed: if partition 0 has a vertex in common with partition 1, then this is the same vertex partition 1 has in common with partition 0. In our implementation this un-directionality is reflected by over-directionality: every edge $(u_1, v_1)$ is repeated as $(v_1, u_1)$ All we actually need is a spanning tree of this graph:

gcGraph.Visualize(spanningTree=true)


Alright, this is much better – ignore directions. The output of step 3 gives us vertices of the original graph where our partitions intersect. We now need to swap edges of our original predecessor array around these vertices, so that each partition is not closed off on itself, but merges with its neighbor (it’s but a small correction to our original predecessor walk). We do this one-by-one, so partition 0 merges first with 1, then with 2. And 2 – with 3. And 1 with 4.

let fixedPredecessors = fixPredecessors gcGraph links edgePredecessors validity
let finalGraph = StrGraph.FromVectorOfInts fixedPredecessors
finalGraph.Reverse.Visualize()


And it’s a beautiful circle, we are done!

Why not Break out That Cognac?

let N = 1024 * 1024
let i = 1

let gr = StrGraph.GenerateEulerGraphAlt(N * i, 3 * i * N)
let eulerCycle = findEulerTimed gr

Euler graph: vertices - 1,048,575.00, edges - 3,145,727.00
1. Predecessors computed in 00:00:00.3479258
2. Partitioned linear graph in 00:02:48.3658898
Partitions of LG: 45514
# of partitions: 45514 (CPU generation of CG)
3. Circuit graph generated in 00:00:34.1632645
4. Swips implemented in 00:00:00.1707746
GPU: Euler cycle generated in 00:03:23.0505569


This is not very impressive. What’s happening? Unfortunately graph structure holds the key together with the HashSet implementation.

The deeper the graph the better it will fare in the new algorithm. The bottleneck is the partitioning stage. Even though its complexity is theoretically O(|E|), I am using a HashSet to restart partitioning when needed and that presents a problem, as accessing it is not always O(1)!

The methods for Euler graph generation are implemented as GenerateEulerGraph and GenerateEulerGraphAlt. The first one “pleases the code”, and generates graphs that are very deep even when the number of edges is large. Usually I get less than 10 partitions, which means that every time I generate predecessors, I’m pretty much guaranteed to be “almost there” as far as finding a cycle. The second method tends to generate very shallow graphs, as the example above shows: I got a fairly large number of partitions while the number of edges is only around 3 million. So while the rest of the algorithm performance is pretty descent, computing partitions just kills the whole thing.

Store the cognac for another time.

Categories: bioinformatics, CUDA, F# Tags: , , , ,

Walking the Euler Path: PIN Cracking and DNA Sequencing

Continuing on to some cool applications of Eulerian paths.

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) =
let mutable curVertex = defaultArg start 0

let stack = Stack<int>()
let connections = Dictionary<int, int []>()
let start = curVertex
let mutable cycle = []
let mutable first = true

while stack.Count > 0 || first do
first <- false
let connected = connections.[curVertex]
if connected.Length = 0 then
cycle <- curVertex :: cycle
curVertex <- stack.Pop()
else
stack.Push curVertex
connections.[curVertex] <- connected.[1..]
curVertex <- connected.[0]
if not (connections.ContainsKey curVertex) then

let path = start::cycle
if path.Length <> this.NumEdges + 1 then []
else
start::cycle |> List.map (fun i -> verticesOrdinalToNames.[i])


Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal =
let start = rowIndex.[ordinal]
let end' = rowIndex.[ordinal + 1] - 1
colIndex.[start..end']


De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

1. Contains all sub-sequences of length n, and
2. Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:

We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: 01011100. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2]
let suffix (s:string) = s.[1..]
let prefSuf s = prefix s, suffix s // shorthand

let numToBinary len n =
let rec numToBinaryRec n len acc =
if len = 0 then acc
else
numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc))
numToBinaryRec n len ""

let binaryDebruijnSeq n =
if n <= 0 then failwith "n should be positive"
let finish = pown 2 n
let gr =
[0..finish-1]
|> List.map (numToBinary n >> prefSuf)
|> List.groupBy fst
|> List.map (fun (v, prefSuf) -> v + " -> " + (prefSuf |> List.map snd |> List.reduce (fun st e -> st + "," + e )))
|> DirectedGraph<string>.FromStrings

let debruinSeq = gr.FindEulerPath()
let debruinNum =
debruinSeq
|> List.windowed 2
|> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"")

gr.Visualize(euler = true, eulerLabels = debruinNum)



Here the function binaryDeruijnSeq computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: $p\ ->\ s_1,\ s_2...s_n$, connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a DirectedGraph class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.

Categories: bioinformatics, F#, Graphs

Walking the Euler Path: GPU for the Road

Continuation of the previous posts:

GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “alea iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite Aela.CUDA library.

GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) =
use dSorted = sortGpu dArr
use dGrouped = worker.Malloc<int>(dSorted.Length)

let lp = LaunchParam(divup dSorted.Length blockSize, blockSize)
worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr

compactGpuWithKernel <@createDistinctMap @> dGrouped

1. We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
2. We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
3. Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000))
GpuDistinct.distinct sample


Here is the comparison:

Length: 2,097,152
CPU distinct: 00:00:00.0262776
GPU distinct: 00:00:02.9162098
Length: 26,214,400
CPU distinct: 00:00:00.5622276
GPU distinct: 00:00:03.2298218
Length: 262,144,000
CPU distinct: 00:00:03.8712437
GPU distinct: 00:00:05.7540822


Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000
CPU distinct: 00:00:04.2019013
GPU distinct: 00:00:06.7728424


The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

Computing Graph Properties

Motivation

Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true)

euler.Visualize(spanningTree=true, washNonSpanning=false)


Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing IEnumerable<DirectedGraph>  to accomplish this, right? Wrong! What we want is the AsEnumerable property. Makes things clean and easy. It uses Seq.init method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices
(fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])


Now we can also do ourselves a favor and decorate our class with the StructuredFormatDisplay("{AsEnumerable}") to enable F# Interactive pretty printing of our graph:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal :



Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph =
seq
[("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]);
("3", [|"5"; "6"; "7"|]); ...]


We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray


to completely actualize the sequence and see the textual representation of the entire graph.

“Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given $G=(V, E)$, $G_r=(V_r, E_r):\ V_r=V,\ E_r=\{(v_i, v_j) | (v_j, v_i) \in V \}$

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple $(v_i, v_j)$, we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse =
lazy (
let allExistingRows = [0..rowIndex.Length - 1]

let subSeq =
if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse
let start, end' =
let dStart, dEnd = getEdgesGpu rowIndex colIndex
sortStartEnd dStart dEnd

Seq.zip end' start
else
asOrdinalsEnumerable ()
|> Seq.map (fun (i, verts) -> verts |> Seq.map (fun v -> (v, i)))
|> Seq.collect id

let grSeq =
subSeq
|> Seq.groupBy fst
|> Seq.map (fun (key, sq) -> key, sq |> Seq.map snd |> Seq.toArray)

let allRows : seq<int * int []> =
allExistingRows.Except (grSeq |> Seq.map fst) |> Seq.map (fun e -> e, [||])
|> fun col -> col.Union grSeq
|> Seq.sortBy fst

let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length
let revColIndex = allRows |> Seq.collect snd

DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal)
)

member this.Reverse = reverse.Force()


On line 35, .Force() will only call the computation once and cache the result. Each subsequent call to .Force() will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” grSeq will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|];

let gr = StrGraph.FromStrings s
gr.Visualize()

gr.Reverse.Visualize()


Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to $|V| = 10 \cdot 2^{10} \cdot 2^{10}$.

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: start and end' that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.


[<Kernel;ReflectedDefinition>]
let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) =
let idx = blockIdx.x * blockDim.x + threadIdx.x
if idx < len - 1 then
for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do
start.[vertex] <- idx
end'.[vertex] <- colIndex.[vertex]



Here is the test:

let mutable N = 10 * 1024 * 1024
let k = 5

sw.Restart()
let gr = StrGraph.GenerateEulerGraph(N, k)
sw.Stop()

printfn "Graph: %s vertices, %s edges generated in %A"
(String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed

sw.Restart()
let starts, ends = getEdges gr.RowIndex gr.ColIndex

sw.Stop()
printfn "GPU edges: %A" sw.Elapsed

sw.Restart()

gr.OrdinalEdges

sw.Stop()
printfn "CPU edges: %A" sw.Elapsed



And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697
GPU edges: 00:00:01.5234606
CPU edges: 00:00:16.5161326


Finally. I’m happy to take the win!

Categories: bioinformatics, CUDA, F# Tags: , , , ,

Visualizing Graphs

September 18, 2016 1 comment

Previously

Walking the Eule Path: Intro

Generating and Visualizing Graphs

I can hardly overemphasize the importance of visusalizations. Many a bug had been immediately spotted just by looking at a visual of a complex data structure. I therefore decided to add visuals to the project as soon as the DirectedGraph class was born.

Code & Prerequisits

Code is on GitHub.

1. GraphViz: install and add the bin directory to the PATH
2. EmguCV v3.1: install and add the bin directory to the PATH

DrawGraph

This is a small auxiliary component I wrote to make all future visualizations possible. And here is a sidebar. I didn’t want to write this component. I am not a fan of re-writing something that was written a hundred times before me, so the first thing I did was look for something similar I could use. Sure enough, I found a few things. How can I put it? Software engineering is great, but boy, do we tend to overengineer things! I know, I’m guilty of the same thing myself. All I wanted from the library was an ability to receive a text file written in GraphViz DSL, and get on the output a .png containing the picture of the graph. Just a very simple GraphViz driver, nothing more.

One library had me instantiate 3 (three!) classes, another developed a whole API of its own to build the GraphViz file… I ended up writing my own component, it has precisely 47 lines of code. the last 4 lines are aliasing a single function that does exactly what I wanted. It creates the png file and then immediately invokes the EmguCV image viewer to show it. After we’re done, it cleans up after itself, deleting the temporary png file. Here it is.

Taking it for a Ride

Just to see this work…
Another digression. Love the new feature that generates all the “#r” instructions for F# scripts and sticks them into one file! Yes, this one! Right-click on “References” in an F# project:

.

And the generated scripts auto-update as you recompile with new references! A+ for the feature, thank you so much.

Comes with a small gotcha, though: sometimes it doesn’t get the order of references quite right and then errors complaining of references not being loaded appear in the interactive. I spent quite a few painful hours wondering how is it that this reference was not loaded, when here it is! Then I realized: it was being loaded AFTER it was required by references coming after it).

#load "load-project-release.fsx"
open DrawGraph

createGraph "digraph{a->b; b->c; 2->1; d->b; b->b; a->d}" "dot.exe" None



Cool. Now I can take this and use my own function to generate a graph from a string adjacency list, visualize it, and even view some of its properties. Sort of make the graph “palpable”:

let sparse = ["a -> b, c, d"; "b -> a, c"; "d -> e, f"; "e -> f"; "1 -> 2, 3"; "3 -> 4, 5"; "x -> y, z"; "2 -> 5"]
let grs = StrGraph.FromStrings sparse

grs.Visualize(clusters = true)


StrGraph.FromStrings does exactly what it says: it generates a graph from a sequence of strings, formatted like the sparse list above.
My Visualize function is a kitchen sink for all kinds of visuals, driven by its parameters. In the above example, it invokes graph partitioning to clearly mark connected components.

It is important to note, that this functionality was added to the visualizer not because I wanted to see connected components more clearly, but as a quick way to ensure that my partitioning implementation was indeed working correctly.

Generating Data and Looking at It

Now we have a class that builds graphs and even lets us look at them, so where do we get these graphs? The easiest thing (seemed at the time) was to create them.

Enter FsCheck. It’s not the easiest library to use, there is a learning curve and getting used to things takes time, but it’s very helpful. Their documentation is quite good too. The idea is to write a generator for your type and then use that generator to create as many samples as you like:

#load "load-project-release.fsx"

open Graphs
open FsCheck
open System
open DataGen

let grGen = graphGen 3 50
let gr = grGen.Sample(15, 5).[2]

gr.Visualize(into=3, out= 3)


This produces something like:

My function graphGen len num generates a graph of text vertices where len is the length of a vertex name and num is the number of vertices. It returns an FsCheck generator that can then be sampled to get actual graphs. This was a one-off kind of experiment, so it’s in a completely separate module:


//DataGen.fs

module DataGen
open FsCheck
open System
open Graphs

let nucl = Gen.choose(int 'A', int 'Z') |> Gen.map char

let genVertex len =  Gen.arrayOfLength len nucl |> Gen.map (fun c -> String(c))
let vertices len number = Gen.arrayOfLength number (genVertex len) |> Gen.map Array.distinct

let graphGen len number =
let verts = vertices len number
let rnd = Random(int DateTime.UtcNow.Ticks)
let pickFrom = verts |> Gen.map (fun lst -> lst.[rnd.Next(lst.Length)])
let pickTo = Gen.sized (fun n -> Gen.listOfLength (if n = 0 then 1 else n) pickFrom)

Gen.sized
<|
(fun n ->
Gen.map2
(fun from to' ->
from, (to' |> Seq.reduce (fun acc v -> acc + ", " + v))) pickFrom pickTo
|>
Gen.arrayOfLength (if n = 0 then 1 else n)
|> Gen.map (Array.distinctBy fst)
|> Gen.map (fun arr ->  arr |> Array.map (fun (a, b) -> a + " -> " + b))
)
|> Gen.map StrGraph.FromStrings



This whole module cascades different FsCheck generators to create a random graph.
The simplest of them nucl, generates a random character. (Its name comes from the fact that originally I wanted to limit the alphabet to just four nucleotide characters A, C, G, T). Then this generator is used by genVertex to generate a random string vertex, and finally vertices creates an array of distinct random vertices.

graphGen creates a sequence of strings that FromStrings (above) understands. It first creates a string of “inbound” vertices and then adds an outbound vertex to each such string.

Sampling is a little tricky, for instance, the first parameter to the Sample function, which, per documentation, controls sample size, in this case is responsible for complexity and connectivity of the resulting graphs.

On to Euler…

The script above also specifies a couple of optional parameters to the visualizer: into will mark any vertex that has into or more inbound connections in green. And out will do the same for outbound connections and yellow. If the same vertex possesses both properties, it turns blue.

Inspired by all this success, I now want to write a function that would generate Eulerian graphs. The famous theorem states that being Eulerian (having an Euler cycle) for a directed graph is equivalent to being strongly connected and having in-degree of each vertex equal to its out-degree. Thus, the above properties of the visualizer are quite helpful in confirming that the brand new generator I have written for Eulerain graphs (GenerateEulerGraph) is at the very least on track:


let gre = StrGraph.GenerateEulerGraph(10, 5)
gre.Visualize(into=3, out=3)



Very encouraging! Whatever has at least 3 edges out, has at least 3 edges in. Not a definitive test, but the necessary condition of having only blue and transparent vertices in the case of an Eulerian graph is satisfied.

In the next post – more about Eulerian graphs, de Brujin sequences, building (and visualizing!) de Bruijn graphs, used for DNA sequence assembly.

Categories: CUDA, data visualization, F#, Graphs

Walking the Euler Path: Intro

Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.

Representation

The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row $i$ will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with INumeric included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs

/// <summary>
/// Sparse matrix implementation with CSR and CSC storage
/// </summary>
[<StructuredFormatDisplay("{PrintMatrix}")>]
type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) =

....

static member CreateMatrix (row : 'a []) (isCSR : bool) =
let ops = GlobalAssociations.GetNumericAssociation<'a>()
let colIdx, vals =
Array.zip [|0..row.Length - 1|] row
|> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0)
|> Array.unzip

SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)


The idea is to use the GlobalAssociations to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to CreateMatrix and then rows can be appended to it by calling AddValues row. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix $|M| = m \cdot n$, then the answer is given by the equation: $m \cdot n > 2 \cdot e + m + 1$, here $e$ is the number of non-zero elements in the matrix.

For a graph $G=(V, E)$ the set V takes a place of rows, and E – that of columns. The above inequality becomes: $v \cdot e > e + v + 1 \ (v = |V|,\ e = |E|)$, so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this =
let rowIndex  = rowIndex.ToArray()
let colIndex = colIndex.ToArray()
let nEdges = colIndex.Length
let verticesNameToOrdinal = verticesNameToOrdinal
let nVertices = verticesNameToOrdinal.Count

// vertices connected to the ordinal vertex
let getVertexConnections ordinal =
let start = rowIndex.[ordinal]
let end' = rowIndex.[ordinal + 1] - 1
colIndex.[start..end']


This is not very useful, however, since it assumes that we already have rowIndex for the CSR type “R” and colIndex for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file Generators.fs.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.
In the next post I’ll talk about visualizing things, and vsiualization methods really have nothing to do with the graph itself. Nevertheless, it is natural to write:

myGraph.Visualize(euler=true)


Visualize(myGraph, euler=true)


So we use type augmentations, for instance, going back to the generators:


//Generators.fs

type Graphs.DirectedGraph<'a when 'a:comparison> with
/// <summary>
/// Create the graph from a file
/// </summary>
/// <param name="fileName"></param>
static member FromFile (fileName : string) =

if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file"

DirectedGraph<string>.FromStrings(lines)



which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.

GPU Split & Sort With Alea.CUDA

June 11, 2016 1 comment

Code

The complete source: here

Intro

Starting to play with graph algorithms on the GPU (who wants to wait for nvGraph from NVIDIA, right?) As one of the precursor problems – sorting large arrays of non-negative integers came up. Radix sort is a simple effective algorithm quite suitable for GPU implementation.

The idea is simple. We take a set A of fixed size elements with lexicographical ordering defined. We represent each element in a numeric system of a given radix (e.g., 2). We then scan each element from right to left and group the elements based on the digit within the current scan window, preserving the original order of the elements. The algorithm is described here in more detail.

Here is a the pseudocode for radix = 2 (easy to extend for any radix):

k = sizeof(int)
n = array.length
for i = 0 to k
all_0s = []
all_1s = []
for j = 0 to n
if bit(i, array[j]) == 0
array = all_0s + all_1s



GPU Implementation

This is a poster problem for the “split” GPU primitive. Split is what <code>GroupBy</code> is in LINQ (or <code>GROUP BY</code> in TSQL) where a sequence of entries are split (grouped) into a number of categories. The code above applies split k times to an array of length n, each time the categories are are “0” and “1”, and splitting/grouping are done based on the current digit of an entry.

The particular case of splitting into just two categories is easy to implement on the GPU. Chapter 39 of GPU Gems describes such implementation (search for “radix sort” on the page).

The algorithm described there shows how to implement the pseudocode above by computing the position of each member of the array in the new merged array (line 10 above) without an intermediary of accumulating lists. The new position of each member of the array is computed based on the exclusive scan where the value of scanned[i] = scanned[i – 1] + 1 when array[i-1] has 0 in the “current” position. (scanned[0] = 0). Thus, by the end of the scan, we know where in the new array the “1” category starts (it’s scanned[n – 1] + is0(array[n – 1]) – total length of the “0” category, and the new address of each member of the array is computed from the scan: for the “0” category – it is simply the value of scanned (each element of scanned is only increased when a 0 bit is encountered), and start_of_1 + (i – scanned[i]) for each member of the “1” category: its position in the original array minus the number of “0” category members up to this point, offset by the start of the “1” category.

The algorithm has two parts: sorting each block as described above and then merging the results. In our implementation we skip the second step, since Alea.CUDA DeviceScanModule does the merge for us (inefficiently so at each iteration, but makes for simpler, more intuitive code).

 Coding with Alea.CUDA The great thing about Alea.CUDA libray v2 is the Alea.CUDA.Unbound namespace that implements all kinds of scans (and reducers, since a reducer is just a scan that throws away almost all of its results). let sort (arr : int []) = let len = arr.Length if len = 0 then [||] else let gridSize = divup arr.Length blockSize let lp = LaunchParam(gridSize, blockSize) // reducer to find the maximum number & get the number of iterations // from it. use reduceModule = new DeviceReduceModule<int>(target, <@ max @>) use reducer = reduceModule.Create(len) use scanModule = new DeviceScanModule<int>(target, <@ (+) @>) use scanner = scanModule.Create(len) use dArr = worker.Malloc(arr) use dBits = worker.Malloc(len) use numFalses = worker.Malloc(len) use dArrTemp = worker.Malloc(len) // Number of iterations = bit count of the maximum number let numIter = reducer.Reduce(dArr.Ptr, len) |> getBitCount let getArr i = if i &&& 1 = 0 then dArr else dArrTemp let getOutArr i = getArr (i + 1) for i = 0 to numIter - 1 do // compute significant bits worker.Launch <@ getNthSignificantReversedBit @> lp (getArr i).Ptr i len dBits.Ptr // scan the bits to compute starting positions further down scanner.ExclusiveScan(dBits.Ptr, numFalses.Ptr, 0, len) // scatter worker.Launch <@ scatter @> lp (getArr i).Ptr len numFalses.Ptr dBits.Ptr (getOutArr i).Ptr (getOutArr (numIter - 1)).Gather() let generateRandomData n = if n <= 0 then failwith "n should be positive" let seed = uint32 DateTime.Now.Second // setup random number generator use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32> use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n // create random numbers cudaRandom.Fill(0, n, prngBuffer) // transfer results from device to host prngBuffer.Gather() |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x)) This is basically it. One small optimization - instead of looping 32 times (length of int in F#), we figure out the bit count the largest element and iterate fewer times. Helper functions/kernels are straightforward enough: let worker = Worker.Default let target = GPUModuleTarget.Worker worker let blockSize = 512 [<Kernel; ReflectedDefinition>] let getNthSignificantReversedBit (arr : deviceptr<int>) (n : int) (len : int) (revBits : deviceptr<int>) = let idx = blockIdx.x * blockDim.x + threadIdx.x if idx < len then revBits.[idx] <- ((arr.[idx] >>> n &&& 1) ^^^ 1) [<Kernel; ReflectedDefinition>] let scatter (arr : deviceptr<int>) (len: int) (falsesScan : deviceptr<int>) (revBits : deviceptr<int>) (out : deviceptr<int>) = let idx = blockIdx.x * blockDim.x + threadIdx.x if idx < len then let totalFalses = falsesScan.[len - 1] + revBits.[len - 1] // when the bit is equal to 1 - it will be offset by the scan value + totalFalses // if it's equal to 0 - just the scan value contains the right address let addr = if revBits.[idx] = 1 then falsesScan.[idx] else totalFalses + idx - falsesScan.[idx] out.[addr] <- arr.[idx] let getBitCount n = let rec getNextPowerOfTwoRec n acc = if n = 0 then acc else getNextPowerOfTwoRec (n >>> 1) (acc + 1) getNextPowerOfTwoRec n 0 Generating Data and Testing with FsCheck I find FsCheck quite handy for quick testing and generating data with minimal setup in FSharp scripts: let genNonNeg = Arb.generate<int> |> Gen.filter ((<=) 0) type Marker = static member arbNonNeg = genNonNeg |> Arb.fromGen static member Sorting Correctly arr = sort arr = Array.sort arr Arb.registerByType(typeof<Marker>) Check.QuickAll(typeof<Marker>) Just a quick note: the first line may seem weird at first glance: looks like the filter condition is dropping non-positive numbers. But a more attentive second glance reveals that it is actually filtering out negative numbers (currying, prefix notation should help convince anyone in doubt). Large sets test. I have performance tested this on large datasets (10, 000, 000 - 300, 000, 000) integers, any more and I'm out of memory on my 6 Gb Titan GPU. The chart below goes up to 100,000,000. I have generated data for these tests on the GPU as well, using Alea.CUDA random generators, since FsCheck generator above is awfully slow when it comes to large enough arrays. The code comes almost verbatim from this blog post: let generateRandomData n = if n <= 0 then failwith "n should be positive" let seed = uint32 DateTime.Now.Second // setup random number generator use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32> use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n // create random numbers cudaRandom.Fill(0, n, prngBuffer) // transfer results from device to host prngBuffer.Gather() |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x)) except I'm converting the generated array to integers in the [0; n] range. Since my radix sort work complexity is O(k * n) (step complexity O(k))  where k = bitCount(max input) I figured this would give an adequate (kinda) comparison with the native F# Array.orderBy . Optimization It is clear from the above chart, that there is a lot of latency in our GPU implementation - the cost we incur for the coding nicety: by using DeviceScanModule and DeviceReduceModule we bracket the merges we would otherwise have to do by hand. Hence the first possible optimization: bite the bullet, do it right, with a single merge at the end of the process and perform a block-by-block sort with BlockScanModule 
 Categories: CUDA, F# Tags: Alea.CUDA, sorting 
 Older Entries 
 RSS feed Google Youdao Xian Guo Zhua Xia My Yahoo! newsgator Bloglines iNezha Twitter Source Code The source code mentioned in this blog is available on my GitHub fierval A software developer passionate about functional programming, philosophy, languages, life. View Full Profile → Categories bioinformatics C# Clojure CoffeeScript computation expression CPP CUDA d3 d3js data visualization F# FParsec FSharpChart Graphs LINQ monad Parallel Push Puzzle TDD TPL Recent Posts Getting Emotional with Affectiva, F#, and Emgu Zooming Through Euler Path: Supercharging with GPU Walking the Euler Path: PIN Cracking and DNA Sequencing Walking the Euler Path: GPU for the Road Visualizing Graphs Search for: 
 Top Blog at WordPress.com. 
 
 //<![CDATA[ var infiniteScroll = {"settings":{"id":"main","ajaxurl":"https:\/\/viralfsharp.com\/?infinity=scrolling","type":"scroll","wrapper":true,"wrapper_class":"infinite-wrap","footer":true,"click_handle":"1","text":"Older posts","totop":"Scroll back to top","currentday":"05.01.17","order":"DESC","scripts":[],"styles":[],"google_analytics":false,"offset":0,"history":{"host":"viralfsharp.com","path":"\/page\/%d\/","use_trailing_slashes":true,"parameters":""},"query_args":{"error":"","m":"","p":0,"post_parent":"","subpost":"","subpost_id":"","attachment":"","attachment_id":0,"name":"","static":"","pagename":"","page_id":0,"second":"","minute":"","hour":"","day":0,"monthnum":0,"year":0,"w":0,"category_name":"","tag":"","cat":"","tag_id":"","author":"","author_name":"","feed":"","tb":"","paged":0,"meta_key":"","meta_value":"","preview":"","s":"","sentence":"","title":"","fields":"","menu_order":"","embed":"","category__in":[],"category__not_in":[],"category__and":[],"post__in":[],"post__not_in":[],"post_name__in":[],"tag__in":[],"tag__not_in":[],"tag__and":[],"tag_slug__in":[],"tag_slug__and":[],"post_parent__in":[],"post_parent__not_in":[],"author__in":[],"author__not_in":[],"posts_per_page":7,"ignore_sticky_posts":false,"suppress_filters":false,"cache_results":false,"update_post_term_cache":true,"lazy_load_term_meta":true,"update_post_meta_cache":true,"post_type":"","nopaging":false,"comments_per_page":"50","no_found_rows":false,"order":"DESC"},"last_post_date":"2016-06-11 11:49:56","stats":"blog=31838193&v=wpcom&tz=-8&user_id=0&subd=viralfsharp&x_pagetype=infinite"}}; //]]> /* <![CDATA[ */ var WPGroHo = {"my_hash":""}; /* ]]> */ //initialize and attach hovercards to all gravatars jQuery( document ).ready( function( $) { if (typeof Gravatar === "undefined"){ return; } if ( typeof Gravatar.init !== "function" ) { return; } Gravatar.profile_cb = function( hash, id ) { WPGroHo.syncProfileData( hash, id ); }; Gravatar.my_hash = WPGroHo.my_hash; Gravatar.init( 'body', '#wp-admin-bar-my-account' ); }); /* <![CDATA[ */ var HighlanderComments = {"loggingInText":"Logging In\u2026","submittingText":"Posting Comment\u2026","postCommentText":"Post Comment","connectingToText":"Connecting to %s","commentingAsText":"%1$s: You are commenting using your %2$s account.","logoutText":"Log Out","loginText":"Log In","connectURL":"https:\/\/viralfsharp.wordpress.com\/public.api\/connect\/?action=request","logoutURL":"https:\/\/viralfsharp.wordpress.com\/wp-login.php?action=logout&_wpnonce=84e8939476","homeURL":"https:\/\/viralfsharp.com\/","postID":"992","gravDefault":"identicon","enterACommentError":"Please enter a comment","enterEmailError":"Please enter your email address here","invalidEmailError":"Invalid email address","enterAuthorError":"Please enter your name here","gravatarFromEmail":"This picture will show whenever you leave a comment. Click to customize it.","logInToExternalAccount":"Log in to use details from one of these accounts.","change":"Change","changeAccount":"Change Account","comment_registration":"0","userIsLoggedIn":"","isJetpack":"0","text_direction":"ltr"}; /* ]]> */ Viral F# Blog at WordPress.com. /*<![CDATA[*/if(typeof(addLoadEvent)!='undefined'){addLoadEvent(function(){if(top==self){i=document.createElement('img');i.src='https://botd2.wordpress.com/botd.gif?blog=31838193&post=0&lang=en&date=1483653642&ip=54.211.226.90&url=https://viralfsharp.com/';i.id='botd2';i.style.width='0px';i.style.height='0px';i.style.overflow='hidden';document.body.appendChild(i);}});}/*]]>*/ Post to Cancel (function(){ var corecss = document.createElement('link'); var themecss = document.createElement('link'); var corecssurl = "https://s0.wp.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.9b"; if ( corecss.setAttribute ) { corecss.setAttribute( "rel", "stylesheet" ); corecss.setAttribute( "type", "text/css" ); corecss.setAttribute( "href", corecssurl ); } else { corecss.rel = "stylesheet"; corecss.href = corecssurl; } document.getElementsByTagName("head")[0].insertBefore( corecss, document.getElementById("syntaxhighlighteranchor") ); var themecssurl = "https://s0.wp.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?m=1363304414h&amp;ver=3.0.9b"; if ( themecss.setAttribute ) { themecss.setAttribute( "rel", "stylesheet" ); themecss.setAttribute( "type", "text/css" ); themecss.setAttribute( "href", themecssurl ); } else { themecss.rel = "stylesheet"; themecss.href = themecssurl; } //document.getElementById("syntaxhighlighteranchor").appendChild(themecss); document.getElementsByTagName("head")[0].insertBefore( themecss, document.getElementById("syntaxhighlighteranchor") ); })(); SyntaxHighlighter.config.strings.expandSource = '+ expand source'; SyntaxHighlighter.config.strings.help = '?'; SyntaxHighlighter.config.strings.alert = 'SyntaxHighlighter\n\n'; SyntaxHighlighter.config.strings.noBrush = 'Can\'t find brush for: '; SyntaxHighlighter.config.strings.brushNotHtmlScript = 'Brush wasn\'t configured for html-script option: '; SyntaxHighlighter.defaults['pad-line-numbers'] = false; SyntaxHighlighter.defaults['toolbar'] = false; SyntaxHighlighter.all(); // Infinite scroll support jQuery( function($ ) { \$( document.body ).on( 'post-load', function() { SyntaxHighlighter.highlight(); } ); } ); /* <![CDATA[ */ var actionbardata = {"siteID":"31838193","siteName":"Viral F#","siteURL":"https:\/\/viralfsharp.com","icon":"<img alt='' src='https:\/\/secure.gravatar.com\/blavatar\/22841278f0cb39a8bef48b164cf87ba0?s=36' class='avatar avatar-36' height='36' width='36' \/>","canManageOptions":"","canCustomizeSite":"","isFollowing":"","themeSlug":"pub\/inove","signupURL":"https:\/\/wordpress.com\/start\/","loginURL":"https:\/\/viralfsharp.wordpress.com\/wp-login.php?redirect_to=https%3A%2F%2Fviralfsharp.com%2F2017%2F01%2F05%2Fgetting-emotional-with-affectiva-f-and-emgu%2F","themeURL":"","xhrURL":"https:\/\/viralfsharp.com\/wp-admin\/admin-ajax.php","nonce":"7bf665727a","isSingular":"","isFolded":"","isLoggedIn":"","isMobile":"","subscribeNonce":"<input type=\"hidden\" id=\"_wpnonce\" name=\"_wpnonce\" value=\"d46506b88f\" \/>","referer":"https:\/\/viralfsharp.com\/","canFollow":"1","statusMessage":"","customizeLink":"https:\/\/viralfsharp.wordpress.com\/wp-admin\/customize.php?url=https%3A%2F%2Fviralfsharp.wordpress.com%2F","i18n":{"view":"View site","follow":"Follow","following":"Following","edit":"Edit","login":"Log in","signup":"Sign up","customize":"Customize","report":"Report this content","themeInfo":"Get theme: INove","shortlink":"Copy shortlink","copied":"Copied","followedText":"New posts from this site will now appear in your <a href=\"https:\/\/wordpress.com\/\">Reader<\/a>","foldBar":"Collapse this bar","unfoldBar":"Expand this bar","editFollows":"Manage sites I follow","editSubs":"Manage subscriptions","viewReader":"View site in the Reader","subscribe":"Sign me up","enterEmail":"Enter your email address","followers":"Join 28 other followers","alreadyUser":"Already have a WordPress.com account? <a href=\"https:\/\/viralfsharp.wordpress.com\/wp-login.php?redirect_to=https%3A%2F%2Fviralfsharp.com%2F2017%2F01%2F05%2Fgetting-emotional-with-affectiva-f-and-emgu%2F\">Log in now.<\/a>"}}; /* ]]> */ /* <![CDATA[ */ var jetpackCarouselStrings = {"widths":[370,700,1000,1200,1400,2000],"is_logged_in":"","lang":"en","ajaxurl":"https:\/\/viralfsharp.com\/wp-admin\/admin-ajax.php","nonce":"4bae25b893","display_exif":"1","display_geo":"1","single_image_gallery":"1","single_image_gallery_media_file":"","background_color":"black","comment":"Comment","post_comment":"Post Comment","write_comment":"Write a Comment...","loading_comments":"Loading Comments...","download_original":"View full size <span class=\"photo-size\">{0}<span class=\"photo-size-times\">\u00d7<\/span>{1}<\/span>","no_comment_text":"Please be sure to submit some text with your comment.","no_comment_email":"Please provide an email address to comment.","no_comment_author":"Please provide your name to comment.","comment_post_error":"Sorry, but there was an error posting your comment. Please try again later.","comment_approved":"Your comment was approved.","comment_unapproved":"Your comment is in moderation.","camera":"Camera","aperture":"Aperture","shutter_speed":"Shutter Speed","focal_length":"Focal Length","comment_registration":"0","require_name_email":"1","login_url":"https:\/\/viralfsharp.wordpress.com\/wp-login.php?redirect_to=https%3A%2F%2Fviralfsharp.com%2F2016%2F12%2F26%2Fzooming-through-euler-path-supercharging-with-gpu%2F","blog_id":"31838193","local_comments_commenting_as":"<fieldset><label for=\"email\">Email (Required)<\/label> <input type=\"text\" name=\"email\" class=\"jp-carousel-comment-form-field jp-carousel-comment-form-text-field\" id=\"jp-carousel-comment-form-email-field\" \/><\/fieldset><fieldset><label for=\"author\">Name (Required)<\/label> <input type=\"text\" name=\"author\" class=\"jp-carousel-comment-form-field jp-carousel-comment-form-text-field\" id=\"jp-carousel-comment-form-author-field\" \/><\/fieldset><fieldset><label for=\"url\">Website<\/label> <input type=\"text\" name=\"url\" class=\"jp-carousel-comment-form-field jp-carousel-comment-form-text-field\" id=\"jp-carousel-comment-form-url-field\" \/><\/fieldset>","reblog":"Reblog","reblogged":"Reblogged","reblog_add_thoughts":"Add your thoughts here... (optional)","reblogging":"Reblogging...","post_reblog":"Post Reblog","stats_query_args":"blog=31838193&v=wpcom&tz=-8&user_id=0&subd=viralfsharp","is_public":"1","reblog_enabled":""}; /* ]]> */ // <![CDATA[ (function() { try{ if ( window.external &&'msIsSiteMode' in window.external) { if (window.external.msIsSiteMode()) { var jl = document.createElement('script'); jl.type='text/javascript'; jl.async=true; jl.src='/wp-content/plugins/ie-sitemode/custom-jumplist.php'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(jl, s); } } }catch(e){} })(); // ]]> jQuery.extend( infiniteScroll.settings.scripts, ["jquery","jquery-core","jquery-migrate","mobile-useragent-info","postmessage","jquery_inview","jetpack_resize","spin","jquery.spin","swfobject","videopress","grofiles-cards","wpgroho","jquery.autoresize","highlander-comments","syntaxhighlighter-core","syntaxhighlighter-brush-fsharp","devicepx","jetpack_likes_queuehandler","the-neverending-homepage","wpcom-masterbar-js","wpcom-actionbar-bar","jetpack-carousel","twitter-widgets","twitter-widgets-infinity","twitter-widgets-pending","tiled-gallery"] ); jQuery.extend( infiniteScroll.settings.styles, ["wpcom-smileys","jetpack_likes","the-neverending-homepage","infinity-inove","wpcom-core-compat-playlist-styles","mp6hacks","wpcom-bbpress2-staff-css","noticons","geo-location-flair","reblogging","wpcom-actionbar-bar","h4-global","highlander-comments","highlander-comments-ie7","jetpack-carousel","gravatar-profile-widget","tiled-gallery","jetpack-carousel-ie8fix","gravatar-card-services"] ); var _comscore = _comscore || []; _comscore.push({ c1: "2", c2: "7518284" }); (function() { var s = document.createElement("script"), el = document.getElementsByTagName("script")[0]; s.defer = true; s.src = (document.location.protocol == "https:" ? "https://sb" : "http://b") + ".scorecardresearch.com/beacon.js"; el.parentNode.insertBefore(s, el); })(); _tkq = window._tkq || []; _stq = window._stq || []; _tkq.push(['storeContext', {'blog_id':'31838193','blog_tz':'-8','user_lang':'en','blog_lang':'en','user_id':'0'}]); _stq.push(['view', {'blog':'31838193','v':'wpcom','tz':'-8','user_id':'0','subd':'viralfsharp'}]); _stq.push(['extra', {'crypt':'UE40eW5QN0p8M2Y/RE1LVmwrVi5vQS5fVFtfdHBbPyw1VXIrU3hWLHhmcmw0bWUwNiVnR3N1V2dDZTM4MGFxVndRQl1HNkJtOEQ0dVBbbk9jP2g4T29lWFFNMGt1ei1xeTRYWlo9TUMwdldzWVdIRklHP05OPW93LWsraFImQU1RaUJiS1EwWSs0LF9XMUVfL1UsbDgrZG9bWy1RSmZoX0t4ZXpza01NL2lZRnJxazFbLVI4P2VuL0RUWU9iRXpqVHVxWnxUZj1beCtbW1ZKci1SfHdJZ2E3aCtGKzJmd05uJklMUGI4Py9uXUxpNy5Dd0RGakhmdGxnXy9bLi9KPXUvSW84cXA4WzBlUVNsYU9ldiU3YjNoT25YP1UyLGJtRnZndkZYdVAwUA=='}]); _stq.push([ 'clickTrackerInit', '31838193', '0' ]); if ( 'object' === typeof wpcom_mobile_user_agent_info ) { wpcom_mobile_user_agent_info.init(); var mobileStatsQueryString = ""; if( false !== wpcom_mobile_user_agent_info.matchedPlatformName ) mobileStatsQueryString += "&x_" + 'mobile_platforms' + '=' + wpcom_mobile_user_agent_info.matchedPlatformName; if( false !== wpcom_mobile_user_agent_info.matchedUserAgentName ) mobileStatsQueryString += "&x_" + 'mobile_devices' + '=' + wpcom_mobile_user_agent_info.matchedUserAgentName; if( wpcom_mobile_user_agent_info.isIPad() ) mobileStatsQueryString += "&x_" + 'ipad_views' + '=' + 'views'; if( "" != mobileStatsQueryString ) { new Image().src = document.location.protocol + '//pixel.wp.com/g.gif?v=wpcom-no-pv' + mobileStatsQueryString + '&baba=' + Math.random(); } }