The Problem
“Motif finding is a problem of finding common substrings of specified length in a set of strings. In bioinformatics, this is useful for finding transcription binding sites” (recap here). The problem is succinctly stated on Rosalind.
Given a set of strings DNA of size t, find “most common” substrings of length k. “Most common” means, that substrings should deviate from each other as little as possible, but they may not necessarily be the same. With this in mind, scoring is defined as follows:
let colScore (col : int []) = col.Length - col.GroupBy(fun c -> c).Select(fun gr -> gr.Count()).Max() let score (motifs : int [,]) = [0..Array2D.length2 motifs - 1] |> List.map (fun col -> motifs.[*,col] |> colScore) |> List.sum
To illustrate with the DNA alphabet (we only have 4 letters in the alphabet = {A, C, G, T}), the following:
ACC
ACG
ACT
CCT
will be scored as: 1 + 0 + 2 = 3. We add up scores of each column, where the score of the column is defined as col_length – max_repeating_letters
So, column 1 score is 1 because A repeats 3 times – it is the maximum repeating letter, etc. The problem is then to find an arrangement of k-length strings (called k-mers) that minimize the score.
Profile
Given a collection (matrix) of k-mers, a profile is built in order to determine what would be “the most probable” next k-mer if we were to augment our collection. A profile is a 4xk matrix, where P(i, j) is the probability of a given letter alphabet.[i] in the column (k-mer position) j. For the matrix above:
3/4 | 0 | 0 |
1/4 | 4/4 | 1/4 |
0 | 0 | 1/4 |
0 | 0 | 2/4 |
(it’s the number of letters occurring in the column, divided by the total number of k-mers, t)
Laplace Rule
We amend our simple profile calculation with the Laplace rule of succession. This is done to compensate for 0 and 1 probabilities: keeping those will have a disastrous effect on our algorithm, where we will never be able to pick a k-mer that, like in the example above, has anything other than “C” in the second position.
We apply the rule by always adding 4 strings to the collection, each consisting of one letter of the alphabet repeated k times. So the collection above becomes:
ACC
ACG
ACT
CCT
AAA
CCC
GGG
TTT
And the profile becomes:
// borrowed from StackOverflow. Stack two 2D arrays vertically (I miss Python!) let stackV (a1: 'a[,]) (a2: 'a[,]) = let a1l1,a1l2,a2l1,a2l2 = (Array2D.length1 a1),(Array2D.length2 a1),(Array2D.length1 a2),(Array2D.length2 a2) if a1l2 <> a2l2 then failwith "arrays have different column sizes" let result = Array2D.zeroCreate (a1l1 + a2l1) a1l2 Array2D.blit a1 0 0 result 0 0 a1l1 a1l2 Array2D.blit a2 0 0 result a1l1 0 a2l1 a2l2 result let createProfile (motifs : int[,]) = let k = Array2D.length2 motifs let t = Array2D.length1 motifs + 4 let appendLaplace () = let As = Array.zeroCreate k let rest = [|yield As for i = 1 to 3 do yield As |> Array.map (fun a -> a + i) |] |> array2D stackV motifs rest let m = appendLaplace() let profile = [|0..k - 1|] |> Array.map (fun col -> m.[0..,col] .GroupBy(fun c -> c) .OrderBy(fun gr -> gr.Key) .Select(fun gr -> float(gr.Count()) / float t) .ToArray()) Array2D.init 4 k (fun i j -> profile.[j].[i])
Here we append the motif matrix (represented by a 2D integer array in F#, where a row consists of integers in the range [0..3], corresponding to the letters of our DNA alphabet) with 4 additional entries as described above and retrieve the resulting profile matrix.
Consensus k-mer
The profile matrix is used to pick a k-mer that fits the profile best. For instance, if we have built a profile based on motifs extracted from t – 1 DNA strings, and need to pick a k-mer from the t-th DNA string that fits the profile best, we first compute probabilities of all k-mers in string t based on the profile matrix:
Pr(k-mer|profile) = profile[k-mer[0], 0] * profile[k-mer[1], 1] *..* profile[k-mer[k-1], k-1]
(Here is where Laplace rule comes in handy: we would really be in trouble if there were 0 entries in the profilie matrix. Many k-mers would have Pr(k-mer|profile) = 0 and would short-circuit the algorithm).
and then pick the k-mer with the highest probability.
Code below computes probabilities of a set of k-mers given a profile.
let kmerProbs (kmers : string []) (profile : float [,]) = let k = Array2D.length2 profile let probs = kmers |> Array.map (fun kmer -> kmer, kmer.ToCharArray() |> Array.map (fun c -> alphabet.[c]) |> Array.mapi (fun i ind -> profile.[ind, i]) |> Array.fold (fun state p -> state * p) 1.) probs
Gibbs Sampling
Given a discrete distribution, we want to sample from it:
- Pick a sample s from the uniform distribution [0, n)
- Lookup its probability, ps
- Sample from a uniform [0, 1], pu
- If pu <= ps – accept the sample and return it, otherwise repeat.
One thing to note here is that our probabilities do not necessarily sum up to 1 by design. So we first normalize them, by dividing each probability value by the sum of all probabilities.
The function below returns a motif sampled from the distribution we built from a profile:
let gibbsSample (rnd : Random) (probs : (string * float) []) = let sumall = probs |> Array.sumBy (fun (s, p) -> p) let prs = probs |> Array.map (fun (k, v) -> k, v / sumall) let len = probs.Length let rec rollUntilAccepted value = let n = rnd.NextDouble() if n <= snd prs.[value] then value else rollUntilAccepted (rnd.Next(0, len)) fst probs.[rollUntilAccepted (rnd.Next(0, len))]
The Algorithm
Since solving the problem exactly by brute force is not feasible in practice with long strings and huge collections, approximations are used to get solutions that are “close enough”.
The Gibbs sampling based solution starts from a collection of random k-mers, each picked from one string of DNA and tries to improve on this original pick, performing a sequence of iterations, using Gibbs sampling.
Since the solution evidently depends on the initial collection picked, this is re-run many times to pick the collection with the minimal score.
Briefly the algorithm is:
Given a collection of strings (DNA) of size t, length k,
- Start from a collection of motifs (k-mers) of size t randomly picked from DNA, one motif from one string of DNA
- Store the score of this collection
- Loop many times:
- Select a random number j in [0, t)
- Exclude j-th motif from the motif collection and build the profile of the remaining collection
- Compute probability distribution of motifs in DNA[j] using the profile matrix from the previous step
- Sample from the distribution and replace j-th motif with the obtained sample
- Memorize best score, best collection of motifs
Here are the above steps translated to F#:
let gibbsSamplingMotifSearch (dna : string []) k iters rndStarts = let t = dna.Length //prepopulate the list of k-mers for all DNA strings let kmers = dna |> Array.map(fun s -> [0..s.Length - k] .Select(fun i -> s.Substring(i, k)).Distinct().ToArray()) let len = dna.[0].Length - k // single pass of the algorithm (single "random start") let runSingle () = let firstMotifs = [|1..t|] |> Array.map (fun i -> rnd.Next(0, len)) |> Array.mapi (fun i p -> dna.[i].Substring(p, k) |> toInts) let bestMotifs = array2D firstMotifs let bestScore = score bestMotifs // repeat the sampling step n times let rec runSampler (motifs : int [,]) (bestMotifs : int [,]) bestScore n = if n = 0 then bestScore, bestMotifs else let except = rnd.Next(0, t) // excluded motif // exclude from the collection let profileMotifs = stackV motifs.[0..except-1, *] motifs.[except+1.., *] let profile = createProfile profileMotifs // create profile from the rest let probs = kmerProbs kmers.[except] profile // build probability distribution let motif = array2D [|gibbsSample rnd probs |> toInts|] // sample to obtain a replacement let curMotifs = motifs.[except+1.., *] |> stackV motif |> stackV motifs.[0..except-1, *] // replace let curScore = score motifs // obtain the current score //recursive step, memorizing best results runSampler curMotifs (if curScore < bestScore then curMotifs else bestMotifs) (if curScore < bestScore then curScore else bestScore) (n - 1) runSampler bestMotifs bestMotifs bestScore iters let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()
Notes
Here is the header for this script:
open System open System.Linq open System.IO #load "..\packages\MathNet.Numerics.FSharp.3.6.0\MathNet.Numerics.fsx" open MathNet.Numerics.Random let rndSeed = RandomSeed.Robust() let rnd = Random.mersenneTwisterSeed rndSeed
I am using MathNet.Numerics random generator because it happens to be thread safe (which the .NET one is not). Thread safety is crucial because:
Line 46 above:
let scoresMotifs = [1..rndStarts].AsParallel().Select(fun _ -> runSingle()).ToArray()
Since all single runs of the algorithm are created equal, it is easy to parallelize and improve your running time up to as many times as you have processor cores. This comes almost for free from .NET if you care enough to add AsParallel().
I do freely mix LINQ and native F# functionality. Perhaps many a purist would scorn me, but I find it works great in practice. Sometimes generic .NET structures which LINQ likes so much (and which I am not using here, not explicitly anyway) are simply faster because they are mutable (another heresy in the functional world, I realize, however it is faster to add an element to a collection than to rebuild the entire collection in order to add one element. For long collections it adds up).
I don’t use the query{} syntax in F# 3.1. I do like it, it is just that I’m used to lambdas and chaining. A matter of taste, I suppose.
The code above only has one explicit “for” loop (Line 17 in Laplace Rule code), tail recursion rocks!
And the final observation: I have seen lots of folks note how F# lets you do things succinctly. I would like to add, that it also enables you to be “right” the first time much more often than many other languages. (Meaning, that the functions you write have a much better chance to have fewer bugs the very first time you launch them than if you were to write them in a non-functional language).