Home > F#, CUDA, bioinformatics, Graphs > Walking the Euler Path: Intro

Walking the Euler Path: Intro

September 17, 2016 Leave a comment Go to comments

Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.

Representation

The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row i will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with INumeric included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs

/// <summary>
/// Sparse matrix implementation with CSR and CSC storage
/// </summary>
[<StructuredFormatDisplay("{PrintMatrix}")>]
type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) =

....

static member CreateMatrix (row : 'a []) (isCSR : bool) =
    let ops = GlobalAssociations.GetNumericAssociation<'a>()
    let colIdx, vals = 
        Array.zip [|0..row.Length - 1|] row 
        |> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0)
        |> Array.unzip

    SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)

The idea is to use the GlobalAssociations to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to CreateMatrix and then rows can be appended to it by calling AddValues row. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix |M| = m \cdot n, then the answer is given by the equation: m \cdot n > 2 \cdot e + m + 1, here e is the number of non-zero elements in the matrix.

For a graph G=(V, E) the set V takes a place of rows, and E – that of columns. The above inequality becomes: v \cdot e > e + v + 1 \ (v = |V|,\ e = |E|), so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this =
    let rowIndex  = rowIndex.ToArray()
    let colIndex = colIndex.ToArray()
    let nEdges = colIndex.Length
    let verticesNameToOrdinal = verticesNameToOrdinal
    let nVertices = verticesNameToOrdinal.Count

    // vertices connected to the ordinal vertex
    let getVertexConnections ordinal =
        let start = rowIndex.[ordinal]
        let end' = rowIndex.[ordinal + 1] - 1
        colIndex.[start..end']

This is not very useful, however, since it assumes that we already have rowIndex for the CSR type “R” and colIndex for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file Generators.fs.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.
In the next post I’ll talk about visualizing things, and vsiualization methods really have nothing to do with the graph itself. Nevertheless, it is natural to write:

myGraph.Visualize(euler=true)

instead of:

Visualize(myGraph, euler=true)

So we use type augmentations, for instance, going back to the generators:


//Generators.fs

type Graphs.DirectedGraph<'a when 'a:comparison> with
    /// <summary>
    /// Create the graph from a file
    /// </summary>
    /// <param name="fileName"></param>
    static member FromFile (fileName : string) =

        if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file"

        let lines = File.ReadLines(fileName)
        DirectedGraph<string>.FromStrings(lines)

which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.

  1. No comments yet.
  1. September 18, 2016 at 5:32 am
  2. September 21, 2016 at 3:08 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: