## HashSet, Graph, Cognac

The post on applying GPU to finding Eulerian path mentioned a stumbling block: partitioning a very specific kind of graph.

In general, partitioning is a hard problem, NP-hard actually. The graphs we are dealing with are very specific, though, and may be partitioned in time (E – the set of graph edges). Our graphs are of the following form: If we are “lucky” it consists of just one or very few (directed) loops, if not – we have lots of such disconnected loops. And it is this unlucky case that kept the bottle of good cognac sealed all these months (only a descent solution to the partitioning problem would break it out).

To partition such a graph, i.e., color every loop in its distinct color, all we need to do is walk the graph from any vertex and once the loop is detected, we pick a vertex at random from the set of those we haven’t visited yet and start walking again. We repeat until everything is visited:

let partitionLinear (end' : int [])= let allVertices = HashSet<int>(end') let colors = Array.create end'.Length -1 let mutable color = 0 while allVertices.Count > 0 do let mutable v = allVertices.First() while colors.[v] < 0 do allVertices.Remove v |> ignore colors.[v] <- color v <- end'.[v] color <- color + 1 colors, color

This is great, except it doesn’t work. The problem is revealed when a graph is very large and very fragmented. This is when the code at line 7 fails us. The problem is, we expect it to work in O(1): how hard can it be to retrieve the “first” element?! Well, since we are dealing with a data structure that does not have an intrinsic notion of order, it may be quite hard. In fact, the complexity here is (actually , but for our graphs ), thus the complexity of the code above is .

The following code actually works:

let partitionLinear (end' : int [])= let colors = Array.create end'.Length -1 let mutable color = 0 let mutable num = end'.Length let mutable curIdx = 0 let mutable nextIdx = num - 1 while num > 0 && curIdx >= 0 do let mutable v = curIdx while colors.[v] < 0 do colors.[v] <- color v <- end'.[v] color <- color + 1 num <- num - 1 while nextIdx >= 0 && colors.[nextIdx] >= 0 do nextIdx <- nextIdx - 1 curIdx <- nextIdx colors, color

In the worst case it is still , however, this worst case is unlikely and we expect the performance close to in general.

This won’t cure the performance problems of the GPU algorithm, which still relies on the number of graph partitions to be somewhat reasonable, but it will enable it to run in some respectable time. Perhaps time to break out the cognac after all!

## Walking the Euler Path: PIN Cracking and DNA Sequencing

Continuing on to some cool applications of Eulerian paths.

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

### Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) = let mutable curVertex = defaultArg start 0 let stack = Stack<int>() let connections = Dictionary<int, int []>() let start = curVertex let mutable cycle = [] connections.Add(curVertex, this.GetConnectedVertices curVertex) let mutable first = true while stack.Count > 0 || first do first <- false let connected = connections.[curVertex] if connected.Length = 0 then cycle <- curVertex :: cycle curVertex <- stack.Pop() else stack.Push curVertex connections.[curVertex] <- connected.[1..] curVertex <- connected.[0] if not (connections.ContainsKey curVertex) then connections.Add(curVertex, this.GetConnectedVertices curVertex) let path = start::cycle if path.Length <> this.NumEdges + 1 then [] else start::cycle |> List.map (fun i -> verticesOrdinalToNames.[i])

Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal = let start = rowIndex.[ordinal] let end' = rowIndex.[ordinal + 1] - 1 colIndex.[start..end']

### De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

- Contains all sub-sequences of length n, and
- Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:

We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: `01011100`

. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2] let suffix (s:string) = s.[1..] let prefSuf s = prefix s, suffix s // shorthand let numToBinary len n = let rec numToBinaryRec n len acc = if len = 0 then acc else numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc)) numToBinaryRec n len "" let binaryDebruijnSeq n = if n <= 0 then failwith "n should be positive" let finish = pown 2 n let gr = [0..finish-1] |> List.map (numToBinary n >> prefSuf) |> List.groupBy fst |> List.map (fun (v, prefSuf) -> v + " -> " + (prefSuf |> List.map snd |> List.reduce (fun st e -> st + "," + e ))) |> DirectedGraph<string>.FromStrings let debruinSeq = gr.FindEulerPath() let debruinNum = debruinSeq |> List.windowed 2 |> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"") gr.Visualize(euler = true, eulerLabels = debruinNum)

Here the function `binaryDeruijnSeq`

computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: , connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a `DirectedGraph`

class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

### PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

### DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.

## Visualizing Graphs

#### Previously

### Generating and Visualizing Graphs

I can hardly overemphasize the importance of visusalizations. Many a bug had been immediately spotted just by looking at a visual of a complex data structure. I therefore decided to add visuals to the project as soon as the `DirectedGraph`

class was born.

#### Code & Prerequisits

Code is on GitHub.

- GraphViz: install and add the bin directory to the PATH
- EmguCV v3.1: install and add the bin directory to the PATH

#### DrawGraph

This is a small auxiliary component I wrote to make all future visualizations possible. And here is a sidebar. I didn’t want to write this component. I am not a fan of re-writing something that was written a hundred times before me, so the first thing I did was look for something similar I could use. Sure enough, I found a few things. How can I put it? Software engineering is great, but boy, do we tend to overengineer things! I know, I’m guilty of the same thing myself. All I wanted from the library was an ability to receive a text file written in GraphViz DSL, and get on the output a .png containing the picture of the graph. Just a very simple GraphViz driver, nothing more.

One library had me instantiate 3 (three!) classes, another developed a whole API of its own to build the GraphViz file… I ended up writing my own component, it has precisely 47 lines of code. the last 4 lines are aliasing a single function that does exactly what I wanted. It creates the png file and then immediately invokes the EmguCV image viewer to show it. After we’re done, it cleans up after itself, deleting the temporary png file. Here it is.

#### Taking it for a Ride

Just to see this work…

Another digression. *Love* the new feature that generates all the “#r” instructions for F# scripts and sticks them into one file! Yes, this one! Right-click on “References” in an F# project:

And the generated scripts auto-update as you recompile with new references! A+ for the feature, thank you so much.

Comes with a small gotcha, though: sometimes it doesn’t get the order of references quite right and then errors complaining of references not being loaded appear in the interactive. I spent quite a few painful hours wondering how is it that this reference was not loaded, when here it is! Then I realized: it was being loaded AFTER it was required by references coming after it).

#load "load-project-release.fsx" open DrawGraph createGraph "digraph{a->b; b->c; 2->1; d->b; b->b; a->d}" "dot.exe" None

Cool. Now I can take this and use my own function to generate a graph from a string adjacency list, visualize it, and even view some of its properties. Sort of make the graph “palpable”:

let sparse = ["a -> b, c, d"; "b -> a, c"; "d -> e, f"; "e -> f"; "1 -> 2, 3"; "3 -> 4, 5"; "x -> y, z"; "2 -> 5"] let grs = StrGraph.FromStrings sparse grs.Visualize(clusters = true)

`StrGraph.FromStrings`

does exactly what it says: it generates a graph from a sequence of strings, formatted like the `sparse`

list above.

My `Visualize`

function is a kitchen sink for all kinds of visuals, driven by its parameters. In the above example, it invokes graph partitioning to clearly mark connected components.

It is important to note, that this functionality was added to the visualizer not because I wanted to see connected components more clearly, but as a quick way to ensure that my partitioning implementation was indeed working correctly.

#### Generating Data and Looking at It

Now we have a class that builds graphs and even lets us look at them, so where do we get these graphs? The easiest thing (seemed at the time) was to create them.

Enter FsCheck. It’s not the easiest library to use, there is a learning curve and getting used to things takes time, but it’s very helpful. Their documentation is quite good too. The idea is to write a generator for your type and then use that generator to create as many samples as you like:

#load "load-project-release.fsx" open Graphs open FsCheck open System open DataGen let grGen = graphGen 3 50 let gr = grGen.Sample(15, 5).[2] gr.Visualize(into=3, out= 3)

This produces something like:

My function `graphGen len num`

generates a graph of text vertices where `len`

is the length of a vertex name and `num`

is the number of vertices. It returns an FsCheck generator that can then be sampled to get actual graphs. This was a one-off kind of experiment, so it’s in a completely separate module:

//DataGen.fs module DataGen open FsCheck open System open Graphs let nucl = Gen.choose(int 'A', int 'Z') |> Gen.map char let genVertex len = Gen.arrayOfLength len nucl |> Gen.map (fun c -> String(c)) let vertices len number = Gen.arrayOfLength number (genVertex len) |> Gen.map Array.distinct let graphGen len number = let verts = vertices len number let rnd = Random(int DateTime.UtcNow.Ticks) let pickFrom = verts |> Gen.map (fun lst -> lst.[rnd.Next(lst.Length)]) let pickTo = Gen.sized (fun n -> Gen.listOfLength (if n = 0 then 1 else n) pickFrom) Gen.sized <| (fun n -> Gen.map2 (fun from to' -> from, (to' |> Seq.reduce (fun acc v -> acc + ", " + v))) pickFrom pickTo |> Gen.arrayOfLength (if n = 0 then 1 else n) |> Gen.map (Array.distinctBy fst) |> Gen.map (fun arr -> arr |> Array.map (fun (a, b) -> a + " -> " + b)) ) |> Gen.map StrGraph.FromStrings

This whole module cascades different FsCheck generators to create a random graph.

The simplest of them `nucl`

, generates a random character. (Its name comes from the fact that originally I wanted to limit the alphabet to just four nucleotide characters A, C, G, T). Then this generator is used by `genVertex`

to generate a random string vertex, and finally `vertices`

creates an array of distinct random vertices.

`graphGen`

creates a sequence of strings that `FromStrings`

(above) understands. It first creates a string of “inbound” vertices and then adds an outbound vertex to each such string.

Sampling is a little tricky, for instance, the first parameter to the `Sample`

function, which, per documentation, controls sample size, in this case is responsible for complexity and connectivity of the resulting graphs.

#### On to Euler…

The script above also specifies a couple of optional parameters to the visualizer: `into`

will mark any vertex that has `into`

or more inbound connections in green. And `out`

will do the same for outbound connections and yellow. If the same vertex possesses both properties, it turns blue.

Inspired by all this success, I now want to write a function that would generate Eulerian graphs. The famous theorem states that being Eulerian (having an Euler cycle) for a directed graph is equivalent to being strongly connected and having in-degree of each vertex equal to its out-degree. Thus, the above properties of the visualizer are quite helpful in confirming that the brand new generator I have written for Eulerain graphs (`GenerateEulerGraph`

) is at the very least on track:

let gre = StrGraph.GenerateEulerGraph(10, 5) gre.Visualize(into=3, out=3)

Very encouraging! Whatever has at least 3 edges out, has at least 3 edges in. Not a definitive test, but the necessary condition of having only blue and transparent vertices in the case of an Eulerian graph is satisfied.

In the next post – more about Eulerian graphs, de Brujin sequences, building (and visualizing!) de Bruijn graphs, used for DNA sequence assembly.

## Walking the Euler Path: Intro

### Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

### Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.

### Representation

The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

#### Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

#### Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with `INumeric`

included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs /// <summary> /// Sparse matrix implementation with CSR and CSC storage /// </summary> [<StructuredFormatDisplay("{PrintMatrix}")>] type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) = .... static member CreateMatrix (row : 'a []) (isCSR : bool) = let ops = GlobalAssociations.GetNumericAssociation<'a>() let colIdx, vals = Array.zip [|0..row.Length - 1|] row |> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0) |> Array.unzip SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)

The idea is to use the `GlobalAssociations`

to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to `CreateMatrix`

and then rows can be appended to it by calling `AddValues row`

. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

#### Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix , then the answer is given by the equation: , here is the number of non-zero elements in the matrix.

For a graph the set V takes a place of rows, and E – that of columns. The above inequality becomes: , so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:

[<StructuredFormatDisplay("{AsEnumerable}")>] type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this = let rowIndex = rowIndex.ToArray() let colIndex = colIndex.ToArray() let nEdges = colIndex.Length let verticesNameToOrdinal = verticesNameToOrdinal let nVertices = verticesNameToOrdinal.Count // vertices connected to the ordinal vertex let getVertexConnections ordinal = let start = rowIndex.[ordinal] let end' = rowIndex.[ordinal + 1] - 1 colIndex.[start..end']

This is not very useful, however, since it assumes that we already have `rowIndex`

for the CSR type “R” and `colIndex`

for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file `Generators.fs`

.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.

In the next post I’ll talk about visualizing things, and vsiualization methods *really* have nothing to do with the graph itself. Nevertheless, it is natural to write:

myGraph.Visualize(euler=true)

instead of:

Visualize(myGraph, euler=true)

So we use type augmentations, for instance, going back to the generators:

//Generators.fs type Graphs.DirectedGraph<'a when 'a:comparison> with /// <summary> /// Create the graph from a file /// </summary> /// <param name="fileName"></param> static member FromFile (fileName : string) = if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file" let lines = File.ReadLines(fileName) DirectedGraph<string>.FromStrings(lines)

which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.