Archive

Posts Tagged ‘bioinformatics’

Walking the Euler Path: PIN Cracking and DNA Sequencing

November 8, 2016 Leave a comment

Continuing on to some cool applications of Eulerian paths.

  1. Intro
  2. Visualization
  3. GPU/CUDA Injection

The goal of this little graph experiment remains exploration of accelerating Eulerian path finding on the GPU. This is the final introductory post.

Eulerian Path

Hierholzer algorithm works great. It’s linear in the number of edges, so as fast as we can possibly have. The idea is simple: pick a vertex, walk the graph, removing used edges from consideration and adding visited vertices to a stack, once we circle back to a vertex without edges – pop it from the stack and pre-pend it to the path. Once the stack is empty and all edges have been traversed – we have the path/cycle.

member this.FindEulerCycle (?start) =
    let mutable curVertex = defaultArg start 0

    let stack = Stack<int>()
    let connections = Dictionary<int, int []>()
    let start = curVertex
    let mutable cycle = []
    connections.Add(curVertex, this.GetConnectedVertices curVertex)
    let mutable first = true

    while stack.Count > 0 || first do
        first <- false
        let connected = connections.[curVertex]
        if connected.Length = 0 then
            cycle <- curVertex :: cycle
            curVertex <- stack.Pop()
        else
            stack.Push curVertex
            connections.[curVertex] <- connected.[1..]
            curVertex <- connected.[0]
            if not (connections.ContainsKey curVertex) then
                connections.Add(curVertex, this.GetConnectedVertices curVertex)

    let path = start::cycle
    if path.Length <> this.NumEdges + 1 then []
    else
        start::cycle |> List.map (fun i -> verticesOrdinalToNames.[i])

Here we don’t check for pre-conditions on whether the graph has an Eulerian path/cycle, since this check is expensive enough that failure of the algorithm can serve as such a check.

Getting connected vertices (outgoing edges) is as fast as getting a sub-range. We only do it once for every vertex, then these are stored in a dictionary and mutated as we remove “used” edges, so the graph itself remains immutable. In our representation, getting outgoing edges is easy:

let getVertexConnections ordinal =
    let start = rowIndex.[ordinal]
    let end' = rowIndex.[ordinal + 1] - 1
    colIndex.[start..end']

De Bruijn Sequence

On a seemingly unrelated, but actually intimately related topic. Given an alphabet of m characters, create a cyclical sequence which:

  1. Contains all sub-sequences of length n, and
  2. Does not have any repeating sub-sequences of length n

The sequence is cyclical in a sense that we recover all its subsequences of length n by sliding a cyclical window over the sequence. So, for example, for the binary alphabet and n=3:

debruijn

We can traverse the graph in order of the marked edges and record each edge label, thus getting the sequence: 01011100. This is a cyclical sequence, we just broke it in an arbitrary way. Sliding the n=3 length window we’ll get all the 3-digit binary numbers.

We get the sequence by first constructing the De Bruijn Graph from our sequence of numbers. The graph is constructed by taking all the sequences of length n – 1 and connecting them “prefix-to-suffix”, where for each sequence of length n, prefix (suffix) is the subsequence of the first (last) n – 1 characters of this sequence. So, for instance, in the above example, vertex ’00’ is a prefix of ‘001’, while ’01’ is its suffix. So while ’00’ and ’01’ are both vertices, they are linked with the edge that is labelled by the character necessary to create the entire number of length n (001) by moving from prefix to suffix: 00 -> 01, label: 1.

The resulting graph has a Eulerian cycle as it is easy enough to see by induction. We recover the sequence by traversing the cycle, and since we traverse all the edges only once, we’ll get exactly what we are looking for.

let prefix (s:string) = s.[..s.Length - 2]
let suffix (s:string) = s.[1..]
let prefSuf s = prefix s, suffix s // shorthand

let numToBinary len n =
    let rec numToBinaryRec n len acc =
        if len = 0 then acc
        else
           numToBinaryRec (n >>> 1) (len - 1) (String.Format("{0}{1}", n &&& 0x1, acc))
    numToBinaryRec n len ""

let binaryDebruijnSeq n =
    if n <= 0 then failwith "n should be positive"
    let finish = pown 2 n
    let gr =
        [0..finish-1]
        |> List.map (numToBinary n >> prefSuf)
        |> List.groupBy fst
        |> List.map (fun (v, prefSuf) -> v + " -> " + (prefSuf |> List.map snd |> List.reduce (fun st e -> st + "," + e )))
        |> DirectedGraph<string>.FromStrings

    let debruinSeq = gr.FindEulerPath()
    let debruinNum = 
        debruinSeq 
        |> List.windowed 2 
        |> List.mapi (fun i [p; s] -> "\"" + (i + 1).ToString() + ":" + s.[s.Length - 1].ToString() + "\"")

    gr.Visualize(euler = true, eulerLabels = debruinNum)

Here the function binaryDeruijnSeq computes a prefix and a suffix of all n-digit binary numbers, then groups prefixes together and builds a collection of graph strings in my notation: p\ ->\ s_1,\ s_2...s_n , connecting a prefix to all its suffixes. After that, the collection is converted into an instance of a DirectedGraph class, the Eulerian cycle is found and visualized in such a way, that starting from the green vertex, moving along the edges that mark the Eulerian cycle, we recover the De Bruijn sequnce by recording the edge labels.

PIN Cracking

If we have a device protected by a 4-digit pin, such that punching in a few numbers in a sequence will unlock the device as long as there is a correct subsequence punched, we can use the De Bruijn approach above to generate a 10,000 long sequence that will necessarily yield the correct PIN in only 10,000 punches, as opposed to 40,000. See this article that describes it in some detail.

DNA Sequencing

My favorite application, of course, is to DNA sequencing. DNA is sequenced from a bunch of reads. The reads are not very long – maybe around 300 nucleotides, maybe less. They are not always perfect either: some nucleotide or a few may not be produced correctly by the sequencer. Still, if we can gather enough of them together, align and error-correct, we could then build a De Bruijn graph much the same way as described above thus linking the reads together in a DNA sequence. This is of course a gross oversimplification, but it is the reason why I love Eulerian cycles and the source of my interest in speeding up algorithms of finding them.

In the future posts – more forays into the GPU-land in an attempt to speed up something already pretty fast and what came out of it.

Non-linear Thinking with CUDA.

September 8, 2015 1 comment

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion in more than one sense of the word.

The Problem

Given a set A = \{a_1, a_2 \ldots a_n\}, output a set S_A = \{0,\ \sum\limits_{k=1}^{n} a_k,\ \sum\limits_{k=i}^{i + j \mod n} a_k | 1 \le i < n, 0 \le j \le n - 1\}

In other words, pretend our array is cyclical, and we want all partial sums of the array elements, in order, starting from element i, ending with i + j. (We will be looping through the end, since the array is cyclical.)

In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide.

So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

57 71 113 113 128
57 + 71 71 + 113 113 + 113 113 + 128 128 + 57
57 + 71 + 113 71 + 113 + 113 113 + 113 + 128 113 + 128 + 57 128 + 57 + 71

Code

On my GitHub.

Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is O(n^3):

// brute force O(n^3) solution
let cyclospec (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |&gt; Seq.toArray

    let mutable parsum = 0

    (seq {
        yield 0
        yield pepArr |&gt; Array.sum
        for i = 0 to len - 2 do
            yield! [0..len-1]
                |> Seq.map
                    (fun j -&gt;
                        //F# 4.0 Feature!
                        parsum &lt;- 0
                        for ind = j to j + i do
                            parsum &lt;- parsum + pepArr.[if ind >= len then ind - len else ind]
                        parsum
                        )
    }) |> Seq.sort |> Seq.toArray

Note the F# 4.0 feature: a variable mutated within a closure.

This will work, but slowly. To significantly speed it up we only need to notice that there is no need to recompute values that have been precomputed already over, and over, and over again. So, memoization yields an O(n^2) solution:

let cyclospecOpt (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray

    let output = Array.zeroCreate ((len - 1) * len)

    for i = 0 to len - 2 do
        [0..len-1] 
            |> Seq.iter 
                (fun j ->  
                    let ind = i + j
                    output.[i * len + j] <- 
                        if i = 0 then pepArr.[j] 
                        else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind]
                    )
    seq {yield 0; yield pepArr |> Array.sum; yield! output} |> Seq.sort |> Seq.toArray

Ok, so far so good.

Multi-dimensional thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of a_5 + a_6 + a_7, thus it’s a grid X of dimensions n x (n-1)( where x_{i, j} is the sum, starting from i-th element of the input set, and ranging over j elements.

This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).

Since we now have ALEA.Cuda, we can do this without leaving F#, and, even better, we can script it:

[<Kernel; ReflectedDefinition>]
let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    let lenElem = blockIdx.y * blockDim.y + threadIdx.y

    if ind < len && lenElem < len - 1 then
        let mutable parsum = 0
        for i = 0 to lenElem do
            let idx = ind + i
            parsum <- parsum + arr.[if idx >= len then idx - len else idx]
        out.[lenElem * len + ind] <- parsum

And we invoke it:

let cyclospecGpu (peptide : int []) =
    let blockSize = dim3(16, 16)
    let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y)
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    worker.Launch <@cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

Each kernel thread is computing its own element in the grid, which is flattened into a solution. Time has indeed merged with data.

But what about our optimization? The easy way to implement it is to streamline invocations of a kernel, where each consecutive run will compute the new sum using the results of previous kernel invocations. There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

[[]
let cyclospecOptKernel (arr : deviceptr) (len : int) (out : deviceptr) (lenElem : int)=
let ind = blockIdx.x * blockDim.x + threadIdx.x

if ind < len then
let idx = ind + lenElem
out.[lenElem * len + ind] = len then idx – len else idx]

Invocation:

let cyclospecOptGpu (peptide : int []) =
    let blockSize = 256
    let gridSize = divup peptide.Length blockSize
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    for i = 0 to peptide.Length - 2 do
        worker.Launch <@cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i
    let output = dOutput.Gather()

    seq{yield 0; yield! output; yield peptide |> Seq.sum} |> Seq.toArray |> Array.sort

So, how did we do? (Here comes the mandatory chart):
cyc

And so:

  1. A little algorithm is a beautiful thing
  2. A little CUDA is an even more beautiful thing
  3. CUDA optimization performs no worse and at larger sizes of the input starts performing better, at some point massive amounts of computation slow us down even in parallel. See #1.
  4. And what is the GPU algorithm performance in O notation? It’s definitely non-linear, is it possible to estimate analytically?

Here is another chart that only compares optimized algorithms. It’s not a semi-log chart for easier visualization:
cyc_nonlog

Motif Finding with Gibbs Sampling (F#)

May 17, 2015 Leave a comment

The Problem

“Motif finding is a problem of finding common substrings of specified length in a set of strings. In bioinformatics, this is useful for finding transcription binding sites” (recap here). The problem is succinctly stated on Rosalind.

Given a set of strings DNA of size t, find “most common” substrings of length k. “Most common” means, that substrings should deviate from each other as little as possible, but they may not necessarily be the same. With this in mind, scoring is defined as follows:

let colScore (col : int []) =
    col.Length - col.GroupBy(fun c -> c).Select(fun gr -> gr.Count()).Max()

let score (motifs : int [,]) =
    [0..Array2D.length2 motifs - 1] 
    |> List.map (fun col -> motifs.[*,col] |> colScore) 
    |> List.sum

To illustrate with the DNA alphabet (we only have 4 letters in the alphabet = {A, C, G, T}), the following:

ACC
ACG
ACT
CCT

will be scored as: 1 + 0 + 2 = 3. We add up scores of each column, where the score of the column is defined as col_length – max_repeating_letters

So, column 1 score is 1 because A repeats 3 times – it is the maximum repeating letter, etc. The problem is then to find an arrangement of k-length strings (called k-mers) that minimize the score.
 

Profile

Given a collection (matrix) of k-mers, a profile is built in order to determine what would be “the most probable” next k-mer if we were to augment our collection. A profile is a 4xk matrix, where P(i, j) is the probability of a given letter alphabet.[i] in the column (k-mer position) j. For the matrix above:

3/4 0 0
1/4 4/4 1/4
0 0 1/4
0 0 2/4

(it’s the number of letters occurring in the column, divided by the total number of k-mers, t)

Laplace Rule

We amend our simple profile calculation with the Laplace rule of succession. This is done to compensate for 0 and 1 probabilities: keeping those will have a disastrous effect on our algorithm, where we will never be able to pick a k-mer that, like in the example above, has anything other than “C” in the second position.

We apply the rule by always adding 4 strings to the collection, each consisting of one letter of the alphabet repeated k times. So the collection above becomes:

ACC
ACG
ACT
CCT
AAA
CCC
GGG
TTT

And the profile becomes:

4/8 1/8 1/8
2/8 5/8 2/8
1/8 1/8 2/8
1/8 1/8 3/8
// borrowed from StackOverflow. Stack two 2D arrays vertically (I miss Python!)
let stackV (a1: 'a[,]) (a2: 'a[,]) =
    let a1l1,a1l2,a2l1,a2l2 = (Array2D.length1 a1),(Array2D.length2 a1),(Array2D.length1 a2),(Array2D.length2 a2)
    if a1l2 <> a2l2 then failwith "arrays have different column sizes"
    let result = Array2D.zeroCreate (a1l1 + a2l1) a1l2
    Array2D.blit a1 0 0 result 0 0 a1l1 a1l2
    Array2D.blit a2 0 0 result a1l1 0 a2l1 a2l2
    result

let createProfile (motifs : int[,]) =
    let k = Array2D.length2 motifs
    let t = Array2D.length1 motifs + 4

    let appendLaplace () =
        let As = Array.zeroCreate k
        let rest = 
           [|yield As 
             for i = 1 to 3 do 
                 yield As |> Array.map (fun a -> a + i) |]
           |> array2D
        stackV motifs rest
    
    let m = appendLaplace()
    let profile = 
        [|0..k - 1|] 
        |> Array.map 
            (fun col -> m.[0..,col]
                .GroupBy(fun c -> c)
                .OrderBy(fun gr -> gr.Key)
                .Select(fun gr -> float(gr.Count()) / float t)
                .ToArray())
            
    Array2D.init 4 k (fun i j -> profile.[j].[i])
    

Here we append the motif matrix (represented by a 2D integer array in F#, where a row consists of integers in the range [0..3], corresponding to the letters of our DNA alphabet) with 4 additional entries as described above and retrieve the resulting profile matrix.
 

Consensus k-mer

The profile matrix is used to pick a k-mer that fits the profile best. For instance, if we have built a profile based on motifs extracted from t – 1 DNA strings, and need to pick a k-mer from the t-th DNA string that fits the profile best, we first compute probabilities of all k-mers in string t based on the profile matrix:

Pr(k-mer|profile) = profile[k-mer[0], 0] * profile[k-mer[1], 1] *..* profile[k-mer[k-1], k-1]

(Here is where Laplace rule comes in handy: we would really be in trouble if there were 0 entries in the profilie matrix. Many k-mers would have Pr(k-mer|profile) = 0 and would short-circuit the algorithm).

and then pick the k-mer with the highest probability.

Code below computes probabilities of a set of k-mers given a profile.

let kmerProbs (kmers : string []) (profile : float [,]) =
    let k = Array2D.length2 profile
    let probs = 
        kmers 
        |> Array.map 
            (fun kmer -> 
                kmer, kmer.ToCharArray() 
                    |> Array.map (fun c -> alphabet.[c])
                    |> Array.mapi (fun i ind -> profile.[ind, i])
                    |> Array.fold (fun state p -> state * p) 1.)
    probs

 

Gibbs Sampling

Given a discrete distribution, we want to sample from it:

  1. Pick a sample s from the uniform distribution [0, n)
  2. Lookup its probability, ps
  3. Sample from a uniform [0, 1], pu
  4. If pu <= ps – accept the sample and return it, otherwise repeat.

One thing to note here is that our probabilities do not necessarily sum up to 1 by design. So we first normalize them, by dividing each probability value by the sum of all probabilities.

The function below returns a motif sampled from the distribution we built from a profile:

let gibbsSample (rnd : Random) (probs : (string * float) []) =
    
    let sumall = probs |> Array.sumBy (fun (s, p) -> p)
    let prs = probs |> Array.map (fun (k, v) -> k, v / sumall)
    let len = probs.Length

    let rec rollUntilAccepted value =
        let n = rnd.NextDouble()
        if n <= snd prs.[value] then value
        else
            rollUntilAccepted (rnd.Next(0, len))

    fst probs.[rollUntilAccepted (rnd.Next(0, len))]

The Algorithm

Since solving the problem exactly by brute force is not feasible in practice with long strings and huge collections, approximations are used to get solutions that are “close enough”.

The Gibbs sampling based solution starts from a collection of random k-mers, each picked from one string of DNA and tries to improve on this original pick, performing a sequence of iterations, using Gibbs sampling.

Since the solution evidently depends on the initial collection picked, this is re-run many times to pick the collection with the minimal score.

Briefly the algorithm is:
Given a collection of strings (DNA) of size t, length k,

  1. Start from a collection of motifs (k-mers) of size t randomly picked from DNA, one motif from one string of DNA
  2. Store the score of this collection
  3. Loop many times:
    1. Select a random number j in [0, t)
    2. Exclude j-th motif from the motif collection and build the profile of the remaining collection
    3. Compute probability distribution of motifs in DNA[j] using the profile matrix from the previous step
    4. Sample from the distribution and replace j-th motif with the obtained sample
    5. Memorize best score, best collection of motifs

Here are the above steps translated to F#:

let gibbsSamplingMotifSearch (dna : string []) k iters rndStarts =
    let t = dna.Length
    //prepopulate the list of k-mers for all DNA strings
    let kmers = 
       dna 
       |> Array.map(fun s -> 
                        [0..s.Length - k]
                         .Select(fun i -> s.Substring(i, k)).Distinct().ToArray())

    let len = dna.[0].Length - k

    // single pass of the algorithm (single "random start")
    let runSingle () = 
        let firstMotifs = 
            [|1..t|] 
            |> Array.map (fun i -> rnd.Next(0, len))
            |> Array.mapi (fun i p -> dna.[i].Substring(p, k) |> toInts)

        let bestMotifs = array2D firstMotifs
        let bestScore = score bestMotifs

        // repeat the sampling step n times
        let rec runSampler (motifs : int [,]) (bestMotifs : int [,]) bestScore n =
            if n = 0 then bestScore, bestMotifs
            else
                let except = rnd.Next(0, t)  // excluded motif
                // exclude from the collection
                let profileMotifs = 
                   stackV motifs.[0..except-1, *] motifs.[except+1.., *] 
                let profile = createProfile profileMotifs // create profile from the rest
                let probs = kmerProbs kmers.[except] profile // build probability distribution 
                let motif = array2D [|gibbsSample rnd probs |> toInts|] // sample to obtain a replacement
                let curMotifs = 
                    motifs.[except+1.., *] 
                    |> stackV motif  
                    |> stackV motifs.[0..except-1, *] // replace  
                let curScore = score motifs // obtain the current score

                //recursive step, memorizing best results
                runSampler curMotifs 
                     (if curScore < bestScore then curMotifs else bestMotifs) 
                     (if curScore < bestScore then curScore else bestScore) 
                     (n - 1) 
        runSampler bestMotifs bestMotifs bestScore iters

    let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()

 

Notes

Here is the header for this script:


open System
open System.Linq
open System.IO

#load "..\packages\MathNet.Numerics.FSharp.3.6.0\MathNet.Numerics.fsx"

open MathNet.Numerics.Random

let rndSeed = RandomSeed.Robust()
let rnd = Random.mersenneTwisterSeed rndSeed

I am using MathNet.Numerics random generator because it happens to be thread safe (which the .NET one is not). Thread safety is crucial because:

Line 46 above:

    let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()

Since all single runs of the algorithm are created equal, it is easy to parallelize and improve your running time up to as many times as you have processor cores. This comes almost for free from .NET if you care enough to add AsParallel().

I do freely mix LINQ and native F# functionality. Perhaps many a purist would scorn me, but I find it works great in practice. Sometimes generic .NET structures which LINQ likes so much (and which I am not using here, not explicitly anyway) are simply faster because they are mutable (another heresy in the functional world, I realize, however it is faster to add an element to a collection than to rebuild the entire collection in order to add one element. For long collections it adds up).

I don’t use the query{} syntax in F# 3.1. I do like it, it is just that I’m used to lambdas and chaining. A matter of taste, I suppose.

The code above only has one explicit “for” loop (Line 17 in Laplace Rule code), tail recursion rocks!

And the final observation: I have seen lots of folks note how F# lets you do things succinctly. I would like to add, that it also enables you to be “right” the first time much more often than many other languages. (Meaning, that the functions you write have a much better chance to have fewer bugs the very first time you launch them than if you were to write them in a non-functional language).