## D3 Off-Label

#### Auto-writing Software

Good software writes itself. We all know the patterns and following them makes things easy. Not because we have memorized any monumental volume on design patterns, but because years of experience taught us what should be done on an almost intuitive level.

Also in a negative sense. We avoid duplication, globals, encapsulation violations. We know that planting a global variable somewhere to “facilitate” object interaction as innocent as it appears at the moment, throws a huge wrench into extensibility and new feature development. (This is known by a majestic name: legacy). Even if in a moment of weakness we allowed ourselves to relax unnecessarily the software itself would soon make us realize the error of our ways by resisting any attempt to extend it. And then with a sigh we go back regretting a momentary weakness and do things right.

In this case, however, software design dictated the user experience.

#### The Context

Last year, I did a brief stint on the DNA Storage Project at Microsoft Research. My task was to create an app that would streamline the process of encoding, synthesizing then decoding and analyzing the data on DNA molecules.

#### The UX Failure

Our intrepid UX designers sketched a pretty simple UX. The workflow appeared uncomplicated enough with just a few interacting entities. It was natural to split it into the “synthesis” and “analysis” parts, and then a few hierarchical grids or grids with links should have done the job. The MEAN stack fit the need and Azure was to host all the storage and backend processing.

Everything worked great until it came to the UX. No matter how much I huffed and puffed it would not yield. Whatever I did, I could not build the MVVM workflow. The model on the “synthesis” and the “analysis” side consisted of pretty much the same entities (files, pools, runs), however, their semantics while similar in some aspects were drastically different in others. While it seemed natural to have a model with a few abstract base classes wrapping the commonalities, the View-Model became an insurmountable obstacle: different View interactions demanded a lot of “similar but different” code in the VM; duplication was everywhere. The code was coming out entwined, and unmaintainable.

#### d3.select.enter

The inspiration came from the House episode Failure to Communicate. House is in Baltimore justifying some of his cases to a Medicare review officer. One of the cases involved prescribing Viagra to a woman with heart problems because she did not tolerate Nitroglycerin. Medicare was refusing to cover the off-label use.

Using D3 for UX development is an off-label use of D3. Perhaps I could have picked a different framework, but this was the one I was familiar with. The idea was to throw away MVVM and turn the entire experience in a data-driven “video-game” as one of the UX designers on the project called it. Then all the conceptual difficulties simply dissolved. With D3, I could easily streamline the peculiarities of the workflow, and, while new challenges emerged (entities interactions, drag & drop, etc), the software resumed its course “writing itself”.

Here is what this proof-of-concept prototype looked like.

## Detecting Faces with Dlib from F#. IFSharp Notebook

### The Choices

I have experimented with OpenCV and Dlib face detection in my computer vision pipeline. Both work well, but the Dlib one worked better: it is more sensitive with (in my case) almost no false positives right out of the box!

Dlib uses several HOG filters that account for profile as well as frontal views. HOG detectors train quickly and are very effective. OpenCV uses Haar Cascades and doesn’t have the same versatility out of the box: you need separate data files for profiles and frontal views. In the case of OpenCV, you also need to experiment with the parameters quite a bit to get it where you want it to be.

Both libraries allow for custom trained detectors, but in my case it did not come to that: Dlib detector was sufficient.

### Using Dlib in F# Code

Dlib is a C++ library, also available for Python. No love for .NET.

The step-by-step of calling dlib face detector from F# code is in the IFSharp Notebook, hosted on my GitHub. Here .NET, EmguCV (OpenCV), and Dlib all work happily together.

(Took Azure Notebooks for a spin, works pretty well).

## HashSet, Graph, Cognac

The post on applying GPU to finding Eulerian path mentioned a stumbling block: partitioning a very specific kind of graph.

In general, partitioning is a hard problem, NP-hard actually. The graphs we are dealing with are very specific, though, and may be partitioned in time (E – the set of graph edges). Our graphs are of the following form: If we are “lucky” it consists of just one or very few (directed) loops, if not – we have lots of such disconnected loops. And it is this unlucky case that kept the bottle of good cognac sealed all these months (only a descent solution to the partitioning problem would break it out).

To partition such a graph, i.e., color every loop in its distinct color, all we need to do is walk the graph from any vertex and once the loop is detected, we pick a vertex at random from the set of those we haven’t visited yet and start walking again. We repeat until everything is visited:

let partitionLinear (end' : int [])= let allVertices = HashSet<int>(end') let colors = Array.create end'.Length -1 let mutable color = 0 while allVertices.Count > 0 do let mutable v = allVertices.First() while colors.[v] < 0 do allVertices.Remove v |> ignore colors.[v] <- color v <- end'.[v] color <- color + 1 colors, color

This is great, except it doesn’t work. The problem is revealed when a graph is very large and very fragmented. This is when the code at line 7 fails us. The problem is, we expect it to work in O(1): how hard can it be to retrieve the “first” element?! Well, since we are dealing with a data structure that does not have an intrinsic notion of order, it may be quite hard. In fact, the complexity here is (actually , but for our graphs ), thus the complexity of the code above is .

The following code actually works:

let partitionLinear (end' : int [])= let colors = Array.create end'.Length -1 let mutable color = 0 let mutable num = end'.Length let mutable curIdx = 0 let mutable nextIdx = num - 1 while num > 0 && curIdx >= 0 do let mutable v = curIdx while colors.[v] < 0 do colors.[v] <- color v <- end'.[v] color <- color + 1 num <- num - 1 while nextIdx >= 0 && colors.[nextIdx] >= 0 do nextIdx <- nextIdx - 1 curIdx <- nextIdx colors, color

In the worst case it is still , however, this worst case is unlikely and we expect the performance close to in general.

This won’t cure the performance problems of the GPU algorithm, which still relies on the number of graph partitions to be somewhat reasonable, but it will enable it to run in some respectable time. Perhaps time to break out the cognac after all!

## Getting Emotional with Affectiva, F#, and Emgu

I’ve been playing with Affectiva emotion, demographics, and face detection SDK, found it excellent, however, their sample gallery lacks a sample in F#! So here we are to correct that.

I just wanted a simple F# script that would let me take all kinds of the SDK options for a ride. The script itself is 130 lines. Out of that about 30 lines is just a boilerplate to load all the relevant libraries, setup the environment, etc.

Finally, here I am goofing off in front of my webcam.

### Setup

Not much in terms of setup. So, yes, regular things for downloading/installing EmguCV, OpenCV, and installing Affectiva SDK.

Then all this needs to be reflected in the script:

open System Environment.CurrentDirectory <- @"C:\Program Files\Affectiva\Affdex SDK\bin\release" #r "../packages/EmguCV.3.1.0.1/lib/net30/Emgu.CV.UI.dll" #r "../packages/EmguCV.3.1.0.1/lib/net30/Emgu.CV.UI.GL.dll" #r "../packages/EmguCV.3.1.0.1/lib/net30/Emgu.CV.World.dll" #r "../packages/OpenTK.1.1.2225.0/lib/net20/OpenTK.dll" #r "System.Drawing.dll" #r "System.Windows.Forms.dll" #r @"C:\Program Files\Affectiva\Affdex SDK\bin\release\Affdex.dll" open Affdex open Emgu.CV open Emgu.CV.CvEnum open System.IO open System.Collections.Generic open Emgu.CV.UI open Emgu.CV.Structure open System.Drawing open System.Linq open System.Threading open System.Diagnostics let classifierPath = @"C:\Program Files\Affectiva\Affdex SDK\data" let resources = Path.Combine(__SOURCE_DIRECTORY__, "Resources")

Just loading libraries, no big deal. Except we need to make sure Affdex.dll finds its dependencies, hence setting the current path at the beginning.

### Initializing the Detector

let detector = new CameraDetector() try detector.setClassifierPath(classifierPath) detector.setDetectAllEmotions(true); detector.setDetectAllExpressions(false); detector.setDetectAllEmojis(true); detector.setDetectGender(true); detector.setDetectGlasses(true); detector.setDetectEngagement(true); detector.setDetectValence(true); detector.setDetectAttention(true); detector.setDetectAge(true); detector.setImageListener(imageListener) detector.setProcessStatusListener(processStatusListener) detector.start(); sw.Start() while not finished do Thread.Sleep(1000) sw.Stop() finally detector.Dispose()

Here `setDetectGlasses`

is my favorite. Check it out in the video.

I’m using `CameraDetector`

to capture video from the webcam, if I needed to capture a file video I’d use `VideoDetector`

. Setting properties is easy, albeit slightly confusing at first – all these subtle differences between valence and attention… It makes sense when you get used to it. My favorite is `setDetectAllEmojis`

. The SDK comes with quite a few emojis that can be used to reflect what’s going on in the video.

The `VideoDetector`

is set up in a similar way, except you also need to issue `detector.``process``()`

to start running, camera detector does it automatically.

I would also like to use `use`

instead of `let`

to instantiate the disposable detector, but cannot do it in the script, so true to an instinct for plugging memory leaks before they spring, I wrapped it in the `try..finally`

– not at all necessary in a script, and I don’t do it for EmguCV elements anyway. This is not a production code practice.

### Fun Part: Processing Results

As processed frames start coming in, we hook up to the detector image listener (`detector.setImageListener()`

) which will feed us images and all kinds of fun stats as they come in. Also, `setProcessStatusListener`

will tell us when things are done or errors occur.

let imageListener = { new ImageListener with member this.onImageCapture (frame : Affdex.Frame) = () member this.onImageResults(faces : Dictionary<int, Face>, frame : Affdex.Frame) = let img = new Image<Rgb, byte>(frame.getWidth(), frame.getHeight()); img.Bytes <- frame.getBGRByteArray() let faces = faces |> Seq.map (fun kvp -> kvp.Key, kvp.Value) |> Seq.toArray // draw tracking points faces.ToList().ForEach(fun (idx, face) -> let points = face.FeaturePoints |> Array.map featurePointToPoint let tl, br = Point(points.Min(fun p -> p.X), points.Min(fun p -> p.Y)), Point(points.Max(fun p -> p.X), points.Max(fun p -> p.Y)) let rect = Rectangle(tl, Size(Point(br.X - tl.X, br.Y - tl.Y))) CvInvoke.Rectangle(img, rect, Bgr(Color.Green).MCvScalar, 2) // tracking points points.AsParallel().ForAll(fun p -> CvInvoke.Circle(img, p, 2, Bgr(Color.Red).MCvScalar, -1) ) // age let age = string face.Appearance.Age CvInvoke.PutText(img, age, Point(rect.Right + 5, rect.Top), FontFace.HersheyComplex, 0.5, Bgr(Color.BlueViolet).MCvScalar, 1) // gender & appearance let gender = int face.Appearance.Gender // glasses let glasses = int face.Appearance.Glasses let appearanceFile = makeFileName gender glasses loadIntoImage img appearanceFile (Point(rect.Right + 5, rect.Top + 15)) iconSize // emoji if face.Emojis.dominantEmoji <> Affdex.Emoji.Unknown then let emofile = Path.ChangeExtension(Path.Combine(resources, (int >> string) face.Emojis.dominantEmoji), ".png") loadIntoImage img emofile (Point(rect.Left, rect.Top - 50)) iconSize ) viewer.Image <- img.Mat } let processStatusListener = { new ProcessStatusListener with member this.onProcessingException ex = () member this.onProcessingFinished () = finished <- true }

Nothing all that tricky about this code. F# object expression comes in handy for quickly creating an object that implements an interface. `onImageResults`

is the key function here. It processes everything and sends it to the EmguCV handy viewer, which is launched at the start of script execution and runs asynchronously (I like how it doesn’t force me to modify its UI elements on the same thread that created it. This is totally cheating and smells buggy, but it’s so convenient for scripting!)

// Create our simplistic UI let viewer = new ImageViewer() let sd = async { return (viewer.ShowDialog()) |> ignore } Async.Start(sd)

In the first couple of lines we transform the captured frame to EmguCV-understandable format. I am using `Image`

rather than the recommended `Mat`

class, because I want to splat emojis over the existing frames and as amazing as it is, the only way to do it that I know of in EmguCV is this counter-intuitive use of ROI. If anyone knows a **better way of copying one image on top of another (should be easy, right?) please let me know.**

The next few lines draw the statistics on the image: tracking points, emojis, and demographic data. Emojis are stored in files located in the `resources`

path (see above, in my case I just copied them locally) with file names matching the SDK emoji codes. A simple function transforms these codes into file names. Finally, the modified frame is sent to the EmguCV viewer. That’s it!

let featurePointToPoint (fp : FeaturePoint) = Point(int fp.X, int fp.Y) let mutable finished = false let makeFileName i j = Path.ChangeExtension(Path.Combine(resources, String.Format("{0}{1}", i, j)), ".png")

### Image Copy

The following two functions do the magic of copying emojis on top of the image:

let copyImage (src : Image<Bgr, byte>) (dest : Image<Rgb, byte>) (topLeft : Point) = let prevRoi = dest.ROI dest.ROI <- Rectangle(topLeft, src.Size) CvInvoke.cvCopy(src.Ptr, dest.Ptr, IntPtr.Zero) dest.ROI <- prevRoi let loadIntoImage (img : Image<Rgb, byte>) (file : string) (topLeft : Point) (size : Size) = let src = new Image<Bgr, byte>(size) CvInvoke.Resize(new Image<Bgr, byte>(file), src, size) copyImage src img topLeft

`copyImage`

first sets the ROI of the destination, then issues a legacy `cvCopy`

call. It operates on pointer structures which is so ugly! There really should be a better way.

## Zooming Through Euler Path: Supercharging with GPU

So, continuing where we left off:

- Walking the Euler Path: Intro
- Visualizing Graphs
- Walking the Euler Path: GPU for the Road
- Walking the Euler Path: PIN Cracking and DNA Sequencing

### For the Win

And finally I ran the GPU-enabled algorithm for finding the Euler path.

let sw = Stopwatch() let N = 1024 * 1024 let k = 7 let avgedges k = [1..k] |> List.map float |> List.average let gr = StrGraph.GenerateEulerGraph(N * 10, k) printfn "Generated euler graph in %A, edges: %s" sw.Elapsed (String.Format("{0:N0}", gr.NumEdges)) let eulerCycle = findEulerTimed gr // GPU-based algorithm sw.Restart() let eulerVert = gr.FindEulerCycle() // Hierholzer algorithm sw.Stop() let cpu = float sw.ElapsedMilliseconds printfn "CPU: Euler cycle generated in %A" sw.Elapsed

And the results:

Generating euler graph: vertices = 10,485,760; avg out/vertex: 4 Generated euler graph in 00:00:19.7520656, edges: 41,944,529 Euler graph: vertices - 10,485,760.00, edges - 41,944,529.00 1. Predecessors computed in 00:00:03.2146705 2. Partitioned linear graph in 00:00:06.4475982 Partitions of LG: 6 3. Circuit graph generated in 00:00:31.4655218 4. Swips implemented in 00:00:00.2189634 GPU: Euler cycle generated in00:00:41.3474044CPU: Euler cycle generated in00:01:02.9022833

And I was like: WOW! Finally! Victory is mine! This is awesome! I’m awesome, etc. Victory dance, expensive cognac.

Then, after the euphoria subsided a little, I decided to make the mandatory chart:

Well, this was sobering!

While the CPU series line displays expected behavior, something is definitely not right with the GPU series: there is obviously some variable at work that I am not taking into account. So, from the beginning.

### The Algorithm

I owe the algorithm to this master thesis, which actually implements the algorithm proposed by B. Awerbuch, A. Israeli and Y. Shiloach, “Finding euler circuits in logarithmic parallel time,” in Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 249-257.

The algorithm as I see it may be split into 4 stages (even 3, but 4 is slightly more convenient implementation-wise). Let’s illustrate.

Start with an Euler graph like the one below. It has 15 vertices with an average of 3 edges/vertex in one direction (maxOutOrInEdges = k, we have 44 edges in this case):

let N = 15 let k = 5 let gr = StrGraph.GenerateEulerGraph(N, k) gr.Visualize(edges=true)

1. We walk it as we like, computing edge predecessors. For two edges is a predecessor of iff , i.e. One edge begins where its predecessor ends. In our representation it’s easy to construct the array of predecessors:

let predecessors (gr : DirectedGraph<'a>) = let rowIndex = arrayCopy gr.RowIndex let ends = gr.ColIndex let predecessors = Array.create gr.NumEdges -1 [|0..ends.Length - 1|] |> Array.iter (fun i -> predecessors.[rowIndex.[ends.[i]]] <- i rowIndex.[ends.[i]] <- rowIndex.[ends.[i]] + 1 ) predecessors

2. At this point, if we are lucky, we have the representation of an Euler cycle as edges of the graph. We just need to walk the array we have “backwards”, seeding the final list of edges with edge 0, constructing the list recursively like so:` predecessors.[List.head result] :: result`

. Alternatively, we may generate a graph out of the result and reverse it. (directions of the arrows need to be reversed since this is a predecessor graph. Euler cycles of the graph, where all directions are reversed are the same as those of the original one, reversed.)

In case we aren’t lucky, we consider our predecessor array to be a graph, where each edge of the original graph becomes a vertex and identify partitions of the graph:

This is the weak point of the algorithm. Partitioning a graph is, in general, a hard problem (NP-complete, to be precise), however, in this case, due to a very simple structure of the predecessor graph, the complexity is linear in the number of edges of the original graph: O(|E|).

let partitionLinear (end' : int [])= let allVertices = HashSet<int>(end') let colors = Array.create end'.Length -1 let mutable color = 0 while allVertices.Count > 0 do let mutable v = allVertices.First() while colors.[v] < 0 do allVertices.Remove v |> ignore colors.[v] <- color v <- end'.[v] color <- color + 1 colors, color

So, now the goal is to join all the circles above into one circle, this is done in the crucial step 3

3. We further collapse the graph based on partitioning. Now, each partition becomes a vertex of the new graph. Edges of this new “circuit graph” are vertices of the original graph, such that each edge represents a **vertex two partitions have in common**.

This is the only part of the algorithm where the GPU is used and is very effective. Incidentally, I took the code almost verbatim from the original thesis, however, the author for some reason preferred not to implement this step on the GPU.

The idea is simple: we loop over the original graph vertex-by-vertex and try to figure out whether edges entering this vertex belong to different partitions (have different colors in the terminology of the code above). Each vertex is processed in a CUDA kernel:

let gcGraph, links, validity = generateCircuitGraph gr.RowIndex partition maxPartition gcGraph.Visualize()

4. This graph is greatly over-determined: we don’t need ALL vertices that partitions have in common (represented by edges here). Also, it’s important to note that this graph is not directed: if partition 0 has a vertex in common with partition 1, then this is the same vertex partition 1 has in common with partition 0. In our implementation this un-directionality is reflected by over-directionality: every edge is repeated as All we actually need is a spanning tree of this graph:

gcGraph.Visualize(spanningTree=true)

Alright, this is much better – ignore directions. The output of step 3 gives us vertices of the original graph where our partitions intersect. We now need to swap edges of our original predecessor array around these vertices, so that each partition is not closed off on itself, but merges with its neighbor (it’s but a small correction to our original predecessor walk). We do this one-by-one, so partition 0 merges first with 1, then with 2. And 2 – with 3. And 1 with 4.

let fixedPredecessors = fixPredecessors gcGraph links edgePredecessors validity let finalGraph = StrGraph.FromVectorOfInts fixedPredecessors finalGraph.Reverse.Visualize()

And it’s a beautiful circle, we are done!

### Why not Break out That Cognac?

let N = 1024 * 1024 let i = 1 let gr = StrGraph.GenerateEulerGraphAlt(N * i, 3 * i * N) let eulerCycle = findEulerTimed gr

Euler graph: vertices - 1,048,575.00, edges - 3,145,727.00 1. Predecessors computed in 00:00:00.3479258 2. Partitioned linear graph in00:02:48.3658898Partitions of LG: 45514 # of partitions:45514 (CPU generation of CG)3. Circuit graph generated in 00:00:34.1632645 4. Swips implemented in 00:00:00.1707746 GPU: Euler cycle generated in 00:03:23.0505569

This is not very impressive. What’s happening? Unfortunately graph structure holds the key together with the `HashSet`

implementation.

The deeper the graph the better it will fare in the new algorithm. The bottleneck is the partitioning stage. Even though its complexity is theoretically O(|E|), I am using a `HashSet`

to restart partitioning when needed and that presents a problem, as accessing it is not always O(1)!

The methods for Euler graph generation are implemented as `GenerateEulerGraph`

and `GenerateEulerGraphAlt`

. The first one “pleases the code”, and generates graphs that are very deep even when the number of edges is large. Usually I get less than 10 partitions, which means that every time I generate predecessors, I’m pretty much guaranteed to be “almost there” as far as finding a cycle. The second method tends to generate very shallow graphs, as the example above shows: I got a fairly large number of partitions while the number of edges is only around 3 million. So while the rest of the algorithm performance is pretty descent, computing partitions just kills the whole thing.

Store the cognac for another time.

## Walking the Euler Path: PIN Cracking and DNA Sequencing

Continuing on to some cool applications of Eulerian paths.

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

### Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) = let mutable curVertex = defaultArg start 0 let stack = Stack<int>() let connections = Dictionary<int, int []>() let start = curVertex let mutable cycle = [] connections.Add(curVertex, this.GetConnectedVertices curVertex) let mutable first = true while stack.Count > 0 || first do first <- false let connected = connections.[curVertex] if connected.Length = 0 then cycle <- curVertex :: cycle curVertex <- stack.Pop() else stack.Push curVertex connections.[curVertex] <- connected.[1..] curVertex <- connected.[0] if not (connections.ContainsKey curVertex) then connections.Add(curVertex, this.GetConnectedVertices curVertex) let path = start::cycle if path.Length <> this.NumEdges + 1 then [] else start::cycle |> List.map (fun i -> verticesOrdinalToNames.[i])

Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal = let start = rowIndex.[ordinal] let end' = rowIndex.[ordinal + 1] - 1 colIndex.[start..end']

### De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

- Contains all sub-sequences of length n, and
- Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:

We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: `01011100`

. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2] let suffix (s:string) = s.[1..] let prefSuf s = prefix s, suffix s // shorthand let numToBinary len n = let rec numToBinaryRec n len acc = if len = 0 then acc else numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc)) numToBinaryRec n len "" let binaryDebruijnSeq n = if n <= 0 then failwith "n should be positive" let finish = pown 2 n let gr = [0..finish-1] |> List.map (numToBinary n >> prefSuf) |> List.groupBy fst |> List.map (fun (v, prefSuf) -> v + " -> " + (prefSuf |> List.map snd |> List.reduce (fun st e -> st + "," + e ))) |> DirectedGraph<string>.FromStrings let debruinSeq = gr.FindEulerPath() let debruinNum = debruinSeq |> List.windowed 2 |> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"") gr.Visualize(euler = true, eulerLabels = debruinNum)

Here the function `binaryDeruijnSeq`

computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: , connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a `DirectedGraph`

class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

### PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

### DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.

## Walking the Euler Path: GPU for the Road

Continuation of the previous posts:

#### GPU Digression

I was going to talk about something else this week but figured I’d take advantage of the free-hand format and digress a bit.

Continuing the travel metaphor and remembering Julius Cesar’s “*alea* iacta”, we’ll talk about GPU algorithms, for which I invariably use my favorite *Aela*.CUDA library.

#### GPU Distinct

I have already talked about sorting & splitting non-negative integer arrays on the GPU. Another one in this small library is implementing distinct on the GPU. It is using the same ubiquitous scan algorithm as before:

let distinctGpu (dArr : DeviceMemory<int>) = use dSorted = sortGpu dArr use dGrouped = worker.Malloc<int>(dSorted.Length) let lp = LaunchParam(divup dSorted.Length blockSize, blockSize) worker.Launch <@ distinctSortedNums @> lp dSorted.Ptr dSorted.Length dGrouped.Ptr compactGpuWithKernel <@createDistinctMap @> dGrouped

- We first sort the array, so all non-distinct values get grouped together. (Using radix sort on the GPU), step complexity O(k), where k – maximum number of bits across all numbers in the array
- We then replace all values in the group except the first one with 0. One kernel invocation, so O(1) step complexity
- Compact: a variation on scan algorithm with O(log n) steps

So we have the O(log n) step and O(n) work complexity for this version of distinct. The regular linear distinct is O(n). So, is it worth it?

Here is how we test:

let mutable sample = Array.init N (fun i -> rnd.Next(0, 1000)) GpuDistinct.distinct sample

Here is the comparison:

Length: 2,097,152 CPU distinct: 00:00:00.0262776 GPU distinct: 00:00:02.9162098 Length: 26,214,400 CPU distinct: 00:00:00.5622276 GPU distinct: 00:00:03.2298218 Length: 262,144,000 CPU distinct: 00:00:03.8712437 GPU distinct: 00:00:05.7540822

Is all this complexity worth it? It’s hard to say, because as it is obvious from the above numbers, there is a lot of latency in the Alea.CUDA scan, which makes its application useful only once we have an array sufficiently large to hide this latency.

I could not do much in terms any further comparison – ran out of GPU memory before I ran out of .NET object size limitation.

The final comparison:

Length: 300,000,000 CPU distinct: 00:00:04.2019013 GPU distinct: 00:00:06.7728424

The CPU time increase ratio is 1.11, while the GPU increase was 1.18, while the increase in the size of our data is 1.14 – so not really informative: all we can see is that the work complexity is indeed O(n) in both cases, and that’s certainly nothing new. We could responsibly claim, however, that if it weren’t for the latency, our GPU implementation would be faster. Perhaps switching to C++ would confirm this statement.

### Computing Graph Properties

#### Motivation

Remember, for the visuals, we wanted to clearly identify vertices with certain numbers of incoming/outgoing edges. Another case: implementing the spanning tree algorithm, it is necessary to “convert” the directed graph to undirected. This is not a real conversion, we would just need to make sure that if (a -> b) exists in the graph, it means that (a b), i.e. – edges are connected no matter the direction. Our spanning tree should be using “weak” connectivity:

let euler = StrGraph.GenerateEulerGraph(8, 3, path=true) euler.Visualize(spanningTree=true, washNonSpanning=false)

Here red edges mark the “spanning” tree, this graph is “almost” strongly connected – it has an Euler path.

#### Graph as an Iterable

We need an ability to iterate over the vertices of our graph. So, we should be implementing `IEnumerable<DirectedGraph> `

to accomplish this, right? Wrong! What we want is the `AsEnumerable`

property. Makes things clean and easy. It uses `Seq.init`

method – which comes very handy any time we need to turn our data structure into an iterable quickly and cleanly.

member this.AsEnumerable = Seq.init nVertices (fun n -> nameFromOrdinal n, this.[nameFromOrdinal n])

Now we can also do ourselves a favor and decorate our class with the `StructuredFormatDisplay("{AsEnumerable}")`

to enable F# Interactive pretty printing of our graph:

[<StructuredFormatDisplay("{AsEnumerable}")>] type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal :

Now if we just type the name of an instantiated graph in the interactive, we’ll get something like:

val it : DirectedGraph = seq [("0", [|"2"|]); ("1", [|"2"|]); ("2", [|"3"; "4"; "5"|]); ("3", [|"5"; "6"; "7"|]); ...]

We can further improve on what we see by calling

gr.AsEnumerable |> Seq.toArray

to completely actualize the sequence and see the textual representation of the entire graph.

#### “Reverse” Graph

So, if we want all the above goodies (number of in/out edges per vertex, spanning tree), we need to extract the array of actual edges, as well as be able to compute the “reverse” graph. The “reverse” graph is defined as follows:

Given ,

That is for every edge of the original graph, the edges of the new one are created by reversing the original edges’ direction. In order to reverse the edges direction we must first obtain the edges themselves. If an edge is represented as a tuple , we can flip it, group by the first element, sort and thus obtain the two structures needed for the new, “reverse”, incidence matrix.

This can get time-consuming, that’s why we use F# lazy values to only invoke the computation once, when we actually need it:

let reverse = lazy ( let allExistingRows = [0..rowIndex.Length - 1] let subSeq = if hasCuda.Force() && rowIndex.Length >= gpuThresh then //use CUDA to reverse let start, end' = let dStart, dEnd = getEdgesGpu rowIndex colIndex sortStartEnd dStart dEnd Seq.zip end' start else asOrdinalsEnumerable () |> Seq.map (fun (i, verts) -> verts |> Seq.map (fun v -> (v, i))) |> Seq.collect id let grSeq = subSeq |> Seq.groupBy fst |> Seq.map (fun (key, sq) -> key, sq |> Seq.map snd |> Seq.toArray) let allRows : seq<int * int []> = allExistingRows.Except (grSeq |> Seq.map fst) |> Seq.map (fun e -> e, [||]) |> fun col -> col.Union grSeq |> Seq.sortBy fst let revRowIndex = allRows |> Seq.scan (fun st (key, v) -> st + v.Length) 0 |> Seq.take rowIndex.Length let revColIndex = allRows |> Seq.collect snd DirectedGraph(revRowIndex, revColIndex, verticesNameToOrdinal) ) member this.Reverse = reverse.Force()

On line 35, `.Force()`

will only call the computation once and cache the result. Each subsequent call to `.Force()`

will retrieve the cached value.

It’s worth mentioning what code on line 24 is doing. By now we have the array of all “terminal” vertices, which will become the new “outgoing” ones. However if the original graph had vertices with nothing going into them, they will have nothing going out of them in the current graph, and thus the new “reversed” `grSeq`

will be incomplete. We need to add another vertex with 0 outgoing edges:

let s = [|"a -> b, c, d, e"|]; let gr = StrGraph.FromStrings s gr.Visualize() gr.Reverse.Visualize()

#### Reversing on the GPU

The code above makes use of the GPU when it detects that the GPU is present and the graph is sufficiently large to warrant the GPU involvement. Right now, I am setting the threshold to .

I am only making this decision for generating the edges array, which is created on the GPU as two arrays: `start`

and `end'`

that hold the edge nodes. Further, this tuple of arrays in converted into the array of tuples – a data structure more suited for representing an edge.

It is possible to delegate more to the GPU if we know for sure we are not going to get into the situation handled on line 24 above. And we won’t, if we are dealing with Euler graphs. For now, let’s compare performance of just finding the edges part. The step complexity for the GPU implementation is O(1), this is a pleasantly parallel task, so things are easy.

[<Kernel;ReflectedDefinition>] let toEdgesKernel (rowIndex : deviceptr<int>) len (colIndex : deviceptr<int>) (start : deviceptr<int>) (end' : deviceptr<int>) = let idx = blockIdx.x * blockDim.x + threadIdx.x if idx < len - 1 then for vertex = rowIndex.[idx] to rowIndex.[idx + 1] - 1 do start.[vertex] <- idx end'.[vertex] <- colIndex.[vertex]

Here is the test:

let mutable N = 10 * 1024 * 1024 let k = 5 sw.Restart() let gr = StrGraph.GenerateEulerGraph(N, k) sw.Stop() printfn "Graph: %s vertices, %s edges generated in %A" (String.Format("{0:N0}", gr.NumVertices)) (String.Format("{0:N0}", gr.NumEdges)) sw.Elapsed sw.Restart() let starts, ends = getEdges gr.RowIndex gr.ColIndex sw.Stop() printfn "GPU edges: %A" sw.Elapsed sw.Restart() gr.OrdinalEdges sw.Stop() printfn "CPU edges: %A" sw.Elapsed

And the output:

Graph: 10,485,760 vertices, 31,458,372 edges generated in 00:00:18.9789697 GPU edges: 00:00:01.5234606 CPU edges: 00:00:16.5161326

Finally. I’m happy to take the win!