Archive

Posts Tagged ‘visualization’

Visualizing Graphs

September 18, 2016 1 comment

Previously

Walking the Eule Path: Intro

Generating and Visualizing Graphs

I can hardly overemphasize the importance of visusalizations. Many a bug had been immediately spotted just by looking at a visual of a complex data structure. I therefore decided to add visuals to the project as soon as the DirectedGraph class was born.

Code & Prerequisits

Code is on GitHub.

  1. GraphViz: install and add the bin directory to the PATH
  2. EmguCV v3.1: install and add the bin directory to the PATH

DrawGraph

This is a small auxiliary component I wrote to make all future visualizations possible. And here is a sidebar. I didn’t want to write this component. I am not a fan of re-writing something that was written a hundred times before me, so the first thing I did was look for something similar I could use. Sure enough, I found a few things. How can I put it? Software engineering is great, but boy, do we tend to overengineer things! I know, I’m guilty of the same thing myself. All I wanted from the library was an ability to receive a text file written in GraphViz DSL, and get on the output a .png containing the picture of the graph. Just a very simple GraphViz driver, nothing more.

One library had me instantiate 3 (three!) classes, another developed a whole API of its own to build the GraphViz file… I ended up writing my own component, it has precisely 47 lines of code. the last 4 lines are aliasing a single function that does exactly what I wanted. It creates the png file and then immediately invokes the EmguCV image viewer to show it. After we’re done, it cleans up after itself, deleting the temporary png file. Here it is.

Taking it for a Ride

Just to see this work…
Another digression. Love the new feature that generates all the “#r” instructions for F# scripts and sticks them into one file! Yes, this one! Right-click on “References” in an F# project:

generaterefs.

And the generated scripts auto-update as you recompile with new references! A+ for the feature, thank you so much.

Comes with a small gotcha, though: sometimes it doesn’t get the order of references quite right and then errors complaining of references not being loaded appear in the interactive. I spent quite a few painful hours wondering how is it that this reference was not loaded, when here it is! Then I realized: it was being loaded AFTER it was required by references coming after it).

#load "load-project-release.fsx"
open DrawGraph

createGraph "digraph{a->b; b->c; 2->1; d->b; b->b; a->d}" "dot.exe" None

initial

Cool. Now I can take this and use my own function to generate a graph from a string adjacency list, visualize it, and even view some of its properties. Sort of make the graph “palpable”:

let sparse = ["a -> b, c, d"; "b -> a, c"; "d -> e, f"; "e -> f"; "1 -> 2, 3"; "3 -> 4, 5"; "x -> y, z"; "2 -> 5"]
let grs = StrGraph.FromStrings sparse

grs.Visualize(clusters = true)

clusters

StrGraph.FromStrings does exactly what it says: it generates a graph from a sequence of strings, formatted like the sparse list above.
My Visualize function is a kitchen sink for all kinds of visuals, driven by its parameters. In the above example, it invokes graph partitioning to clearly mark connected components.

It is important to note, that this functionality was added to the visualizer not because I wanted to see connected components more clearly, but as a quick way to ensure that my partitioning implementation was indeed working correctly.

Generating Data and Looking at It

Now we have a class that builds graphs and even lets us look at them, so where do we get these graphs? The easiest thing (seemed at the time) was to create them.

Enter FsCheck. It’s not the easiest library to use, there is a learning curve and getting used to things takes time, but it’s very helpful. Their documentation is quite good too. The idea is to write a generator for your type and then use that generator to create as many samples as you like:

#load "load-project-release.fsx"

open Graphs
open FsCheck
open System
open DataGen

let grGen = graphGen 3 50
let gr = grGen.Sample(15, 5).[2]

gr.Visualize(into=3, out= 3)

This produces something like:

gengraph

My function graphGen len num generates a graph of text vertices where len is the length of a vertex name and num is the number of vertices. It returns an FsCheck generator that can then be sampled to get actual graphs. This was a one-off kind of experiment, so it’s in a completely separate module:


//DataGen.fs

module DataGen
open FsCheck
open System
open Graphs

let nucl = Gen.choose(int 'A', int 'Z') |> Gen.map char

let genVertex len =  Gen.arrayOfLength len nucl |> Gen.map (fun c -> String(c))
let vertices len number = Gen.arrayOfLength number (genVertex len) |> Gen.map Array.distinct

let graphGen len number =
    let verts = vertices len number
    let rnd = Random(int DateTime.UtcNow.Ticks)
    let pickFrom = verts |> Gen.map (fun lst -> lst.[rnd.Next(lst.Length)])
    let pickTo = Gen.sized (fun n -> Gen.listOfLength (if n = 0 then 1 else n) pickFrom)

    Gen.sized
    <| 
    (fun n ->
        Gen.map2 
            (fun from to' -> 
                from, (to' |> Seq.reduce (fun acc v -> acc + ", " + v))) pickFrom pickTo
        |>
        Gen.arrayOfLength (if n = 0 then 1 else n)
        |> Gen.map (Array.distinctBy fst)
        |> Gen.map (fun arr ->  arr |> Array.map (fun (a, b) -> a + " -> " + b))
    )
    |> Gen.map StrGraph.FromStrings

This whole module cascades different FsCheck generators to create a random graph.
The simplest of them nucl, generates a random character. (Its name comes from the fact that originally I wanted to limit the alphabet to just four nucleotide characters A, C, G, T). Then this generator is used by genVertex to generate a random string vertex, and finally vertices creates an array of distinct random vertices.

graphGen creates a sequence of strings that FromStrings (above) understands. It first creates a string of “inbound” vertices and then adds an outbound vertex to each such string.

Sampling is a little tricky, for instance, the first parameter to the Sample function, which, per documentation, controls sample size, in this case is responsible for complexity and connectivity of the resulting graphs.

On to Euler…

The script above also specifies a couple of optional parameters to the visualizer: into will mark any vertex that has into or more inbound connections in green. And out will do the same for outbound connections and yellow. If the same vertex possesses both properties, it turns blue.

Inspired by all this success, I now want to write a function that would generate Eulerian graphs. The famous theorem states that being Eulerian (having an Euler cycle) for a directed graph is equivalent to being strongly connected and having in-degree of each vertex equal to its out-degree. Thus, the above properties of the visualizer are quite helpful in confirming that the brand new generator I have written for Eulerain graphs (GenerateEulerGraph) is at the very least on track:


let gre = StrGraph.GenerateEulerGraph(10, 5)
gre.Visualize(into=3, out=3)

eulerinout

Very encouraging! Whatever has at least 3 edges out, has at least 3 edges in. Not a definitive test, but the necessary condition of having only blue and transparent vertices in the case of an Eulerian graph is satisfied.

In the next post – more about Eulerian graphs, de Brujin sequences, building (and visualizing!) de Bruijn graphs, used for DNA sequence assembly.

Walking the Euler Path: Intro

September 17, 2016 2 comments

Source Code

I’m thinking about a few posts in these series going very fast through the project. The source is on my GitHub, check out the tags since the master branch is still work in progress.

Experimenting with Graph Algorithms with F# and GPU

Graphs play their role in bioinformatics which is my favorite area of computer science and software engineering lately. This relationship was the biggest motivator behind this project.

I have been experimenting with a few graph algorithms trying to parallelize them. This is interesting because these algorithms usually resist parallelization since they are fast in their serial version running in O(|E|) or O(|E| + |V|) time (E – the set of edges, V – the set of vertices of the graph). And of course I use any excuse to further explore the F# language.

Representation

The object of this mini-study is a directed unweighted graph. The choice to represent it is simple: adjacency list or incidence matrix. Since I had CUDA in mind from the start, the latter was chosen, and since I had large graphs in mind, hundreds of millions, possibly billions of edges (limited only by the .NET object size: is it still a problem? I haven’t checked, and by the size of my GPU memory), sparse matrix data structure was picked.

Sparse Matrix Implementation

I first wrote a very bare-bones sparse matrix class, just to get my feet wet. Of all possible representations for a sparse matrix, I chose CSR (or CSC which is the transposition of CSR), the idea is intuitive and works great for a directed graph incidence matrix.

Briefly (taking CSR – Compressed Sparse Row as an example), we represent our matrix in 3 arrays: V, C, R. V – the array of non-zero values, written left-to-right, top-to-bottom. C – the array of column indices of the values in V. And C – the “boundary”, or “row index” array, built as follows: We start by recording the number of non-zero values per row in each element of R, starting with R[1]. R[0] = 0. Then we apply the scan operation (like the F# Seq.scan) to the row array, to produce the final result. The resulting array contains m + 1 (m – number of rows in the matrix) elements, its last entry equals the total number of non-zero values in the matrix). This array is used as a “slicer” or “indexer” into the column/value arrays: non-zero columns of row i will be located in arrays V and C at the indices starting from R[i] and ending at R[i + 1] – 1. This is all pretty intuitive.

Overcoming F# Strong Typing

F# is a combination of strong typing and dynamic generic resolution, which makes it a challenge when you need to write a template for which it is natural to be resolved at compile time. Then sweet memories of C++ or Python invade… There exists a way to overcome all that and it is not pretty. To implement it I needed the old F# PowerPack with INumeric included. Then I just coded the pattern explained in the blog post:

// SparseMatrix.fs

/// <summary>
/// Sparse matrix implementation with CSR and CSC storage
/// </summary>
[<StructuredFormatDisplay("{PrintMatrix}")>]
type SparseMatrix<'a> (ops : INumeric<'a>, row : 'a seq, rowIndex : int seq, colIndex : int seq, rowSize, isCSR : bool) =

....

static member CreateMatrix (row : 'a []) (isCSR : bool) =
    let ops = GlobalAssociations.GetNumericAssociation<'a>()
    let colIdx, vals = 
        Array.zip [|0..row.Length - 1|] row 
        |> Array.filter (fun (i, v) -> ops.Compare(v, ops.Zero) <> 0)
        |> Array.unzip

    SparseMatrix(ops, vals, [0; vals.Length], colIdx, row.Length, isCSR)

The idea is to use the GlobalAssociations to smooth-talk the compiler into letting you do what you want. The pattern is to not directly use the constructor to create your object, but a static method instead, by means of which this “compiler-whispering” is hidden from the user.

My sparse matrix is built dynamically: it is first created with a single row through a call to CreateMatrix and then rows can be appended to it by calling AddValues row. The idea is to allow creation and storage of huge matrices dynamically. These matrices may be stored in large files for which representation in dense format in memory may not be feasible.

Representing the graph

So, at which point does it make sense to use a sparse matrix instead of a dense one in CSR/CSC? It’s easy to figure out:

If we have a matrix |M| = m \cdot n, then the answer is given by the equation: m \cdot n > 2 \cdot e + m + 1, here e is the number of non-zero elements in the matrix.

For a graph G=(V, E) the set V takes a place of rows, and E – that of columns. The above inequality becomes: v \cdot e > e + v + 1 \ (v = |V|,\ e = |E|), so our sparse structure becomes very economical for large, not to mention “really huge” graphs. (We don’t have the values array anymore, since all our values are just 0s and 1s).

And so the graph is born:


[<StructuredFormatDisplay("{AsEnumerable}")>]
type DirectedGraph<'a when 'a:comparison> (rowIndex : int seq, colIndex : int seq, verticesNameToOrdinal : IDictionary<'a, int>) as this =
    let rowIndex  = rowIndex.ToArray()
    let colIndex = colIndex.ToArray()
    let nEdges = colIndex.Length
    let verticesNameToOrdinal = verticesNameToOrdinal
    let nVertices = verticesNameToOrdinal.Count

    // vertices connected to the ordinal vertex
    let getVertexConnections ordinal =
        let start = rowIndex.[ordinal]
        let end' = rowIndex.[ordinal + 1] - 1
        colIndex.[start..end']

This is not very useful, however, since it assumes that we already have rowIndex for the CSR type “R” and colIndex for the “C” arrays. It's like saying: "You want a graph? So, create a graph!". I would like to have a whole bunch of graph generators, and I do. I placed them all into the file Generators.fs.

This is a good case for using type augmentations. When we need to implement something that “looks good” on the object, but doesn’t really belong to it.
In the next post I’ll talk about visualizing things, and vsiualization methods really have nothing to do with the graph itself. Nevertheless, it is natural to write:

myGraph.Visualize(euler=true)

instead of:

Visualize(myGraph, euler=true)

So we use type augmentations, for instance, going back to the generators:


//Generators.fs

type Graphs.DirectedGraph<'a when 'a:comparison> with
    /// <summary>
    /// Create the graph from a file
    /// </summary>
    /// <param name="fileName"></param>
    static member FromFile (fileName : string) =

        if String.IsNullOrWhiteSpace fileName || not (File.Exists fileName) then failwith "Invalid file"

        let lines = File.ReadLines(fileName)
        DirectedGraph<string>.FromStrings(lines)

which creates a graph by reading a text file and calling another generator method at the end. This method actually calls the constructor to create an instance of the object. Keeps everything clean and separate.

This post was intended to briefly construct the skeleton. In the next we’ll put some meat on the bones and talk about visualizing stuff.

Visualizing Crime with d3: Hooking up Data and Colors, Part 2

June 17, 2013 1 comment

In the previous post, we derived a class from BubbleChart and this got us started on actually visualizing some meaningful data using bubbles.

There are a couple of things to iron out before a visual can appear.

Color Schemes

I am using Cynthia Brewer color schemes, available for download in colorbrewer.css. This file is available on my GitHub as well.

It consists of entries like:

.Spectral .q0-3{fill:rgb(252,141,89)}
.Spectral .q1-3{fill:rgb(255,255,191)}
.Spectral .q2-3{fill:rgb(153,213,148)}
.Spectral .q0-4{fill:rgb(215,25,28)}
.Spectral .q1-4{fill:rgb(253,174,97)}
.Spectral .q2-4{fill:rgb(171,221,164)}
.Spectral .q3-4{fill:rgb(43,131,186)}
.Spectral .q0-5{fill:rgb(215,25,28)}
.Spectral .q1-5{fill:rgb(253,174,97)}

Usage is simple: you pick a color scheme and add it to the class of your parent element that will contain the actual SVG elements displayed, e.g.: Spectral. Then, one of the “qin classes are assigned to these child elements to get the actual color.

So for instance:

The main SVG element on the Crime Explorer visual looks like this:

<svg class="Spectral" id="svg_vis">...</svg>

Then, each of the “circle” elements inside this SVG container will have one of the qi-9 (I am using 9 total colors to display this visualization so i ranges from 0..8).

<circle r="9.664713682964603" class="q2-9" stroke-width="2" stroke="#b17943" id="city_0" cx="462.4456905180483" cy="574.327856528298"></circle>

(Note the class=”q2-9″ attribute above).

All of this is supported by the BubbleChart class with some prodding.
You need to:

  1. Pass the color scheme to the constructor of the class derived from BubbleChart upon instantiation:
    allStates = new AllStates('vis', crime_data, 'Spectral')
    
  2. Implement a function called color_class in the derived class, that will produce a string of type “qi-n”, given an i. The default function supplied with the base class always returns “q1-6”.
    @color_class =
          d3.scale.threshold().domain(@domain).range(("q#{i}-9" for i in [8..0]))
    

    In my implementation, I am using the d3 threshold scale to map a domain of values to the colors I need based on certain thresholds. The range is reversed only because I want “blue” colors to come out on lower threshold values, and “red” – on higher ones (less crime is “better”, so I use red for higher values). See AllStates.coffe for a full listing.How this is hooked up tho the actual data is discussed in the next section.

Data Protocol

This is key: data you pass to the BubbleChart class must comply with the following requirements:

  1. It must be an array (not an associative array, a regular array). Each element of this array will be displayed as a circle (“bubble”) on the screen.
  2. Each element must contain the following fields:
    • id – this is a UNIQUE id of the element. It is used by BubbleChart to do joins (see d3 documentation for what these are)
    • value – this is what the “value” of each data element is, and it is used to compute the radius of each bubble
    • group – indicates the “color group” to which the bubble belongs. This is what is fed to the color_class function to determine the color of each individual bubble

With all these conditions satisfied, the array of data is now ready to be displayed.

Displaying It

Now that it is all done, showing the visual is simple:

allStates = new AllStates('vis', crime_data, 'Spectral')
allStates.create_vis()
allStates.display()

Next time: displaying the auxiliary elements: color and size legends, the search box.

Visualizing Crime with d3: Intro

April 18, 2013 1 comment

Figure a blog without pictures or conversations is just boring, so, here it is.

Robbery in Cali

Robbery in Cali

Lately, I have been dealing a lot with data visualization. This was a brand new area for me and while we do use F# for data extraction, all of the front end is done using d3, an amazing toolkit by Mike Bostock.

First of all, I owe the fact that my projects got off the ground to Mike and Jim Vallandigham.  Jim taught me all I know about how to draw bubble “charts” and use d3 force layouts. His blog is invaluable for anyone making first steps in the area of data visualization. Code snippets I am going to post here are due to Jim’s and Mike’s generosity. So, thank you Mike and Jim.

One may ask, if there are already tutorials on how to put together visuals, why assault the world with more musings (as a Candace Bushnell character once wrote). The answer is: my goal in these posts is not to exploring creation of visuals, but rather sharing experiences on how to put together projects that involve visualizations.

These are very different problems, since your task is not just to create a single document or web page for a single purpose, but to create something that can dynamically build these documents or pages, and maybe, within each such document provide different views of the same data. Questions of:

  • design
  • reuse
  • coding practices

come up right away, not to mention general problems:

  • What are data visualizations?
  • What are they used for?
  • Are they needed at all?

So, for these posts, we will be building a project that visualizes crime statistics in the US for the year 2008. The data source for this is found here and the complete solution will look like this.

The approximate plan for the next few posts:

  • Thinking about visualizations and what they are
  • Preparations
    • Getting Data (retrieving, massaging, formatting)
    • Getting the tools together (CofeeScript, d3, ColorBrewer, Knockout.js, Twitter Bootstrap, jQuery, jQuery bbq)
  • Building the visuals
    • Laying out “single” charts
    • Laying out multiple charts on the same page
    • A word about maps
  • Lessons learned: architecting for reuse, etc