Archive

Posts Tagged ‘dna sequencing’

Walking the Euler Path: PIN Cracking and DNA Sequencing

November 8, 2016 Leave a comment

Continuing on to some cool applications of Eulerian paths.

  1. Intro
  2. Visualization
  3. GPU/CUDA Injection

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) =
    let mutable curVertex = defaultArg start 0

    let stack = Stack<int>()
    let connections = Dictionary<int, int []>()
    let start = curVertex
    let mutable cycle = []
    connections.Add(curVertex, this.GetConnectedVertices curVertex)
    let mutable first = true

    while stack.Count > 0 || first do
        first <- false
        let connected = connections.[curVertex]
        if connected.Length = 0 then
            cycle <- curVertex :: cycle
            curVertex <- stack.Pop()
        else
            stack.Push curVertex
            connections.[curVertex] <- connected.[1..]
            curVertex <- connected.[0]
            if not (connections.ContainsKey curVertex) then
                connections.Add(curVertex, this.GetConnectedVertices curVertex)

    let path = start::cycle
    if path.Length <> this.NumEdges + 1 then []
    else
        start::cycle |> List.map (fun i -> verticesOrdinalToNames.[i])

Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal =
    let start = rowIndex.[ordinal]
    let end' = rowIndex.[ordinal + 1] - 1
    colIndex.[start..end']

De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

  1. Contains all sub-sequences of length n, and
  2. Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:

debruijn

We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: 01011100. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2]
let suffix (s:string) = s.[1..]
let prefSuf s = prefix s, suffix s // shorthand

let numToBinary len n =
    let rec numToBinaryRec n len acc =
        if len = 0 then acc
        else
           numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc))
    numToBinaryRec n len ""

let binaryDebruijnSeq n =
    if n <= 0 then failwith "n should be positive"
    let finish = pown 2 n
    let gr =
        [0..finish-1]
        |> List.map (numToBinary n >> prefSuf)
        |> List.groupBy fst
        |> List.map (fun (v, prefSuf) -> v + " -> " + (prefSuf |> List.map snd |> List.reduce (fun st e -> st + "," + e )))
        |> DirectedGraph<string>.FromStrings

    let debruinSeq = gr.FindEulerPath()
    let debruinNum = 
        debruinSeq 
        |> List.windowed 2 
        |> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"")

    gr.Visualize(euler = true, eulerLabels = debruinNum)

Here the function binaryDeruijnSeq computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: p\ ->\ s_1,\ s_2...s_n , connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a DirectedGraph class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.