This blog chronicles my experiences learning F# by implementing Push: a programming language designed for evolving populations of programs that solve symbolic regression problems (genetic programming). It has since grown to contain bits and pieces reflecting my thorny, twisted software development path.

## GPU Split & Sort With Alea.CUDA

June 11, 2016 1 comment

### Code

The complete source: here

### Intro

Starting to play with graph algorithms on the GPU (who wants to wait for nvGraph from NVIDIA, right?). As one of the precursor problems, sorting large arrays of non-negative integers came up. Radix sort is a simple, effective algorithm quite suitable for GPU implementation.

### Least Significant Digit Radix Sort

The idea is simple. We take a set A of fixed-size elements with a lexicographical ordering defined. We represent each element in a positional numeral system of a given radix (e.g., 2). We then scan each element from right to left and group the elements based on the digit within the current scan window, preserving the original order of the elements. The algorithm is described here in more detail.
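To make the grouping concrete before moving to the GPU, here is a minimal CPU sketch in Python (the function name `lsd_radix_sort` is mine, not from the original code):

```python
def lsd_radix_sort(xs):
    """Binary LSD radix sort for non-negative ints: stable split on each bit."""
    if not xs:
        return []
    num_bits = max(xs).bit_length()  # iterate only over significant bits
    for bit in range(num_bits):
        zeros = [x for x in xs if not (x >> bit) & 1]  # stable: original order kept
        ones = [x for x in xs if (x >> bit) & 1]
        xs = zeros + ones
    return xs
```

Stability of each per-bit split is what makes the whole thing a correct sort: after processing bit i, the array is sorted on its i+1 lowest bits.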

Here is the pseudocode for radix = 2 (easy to extend to any radix):

```
k = number of bits in an int
n = array.length
for i = 0 to k - 1
    all_0s = []
    all_1s = []
    for j = 0 to n - 1
        if bit(i, array[j]) == 0
            all_0s.append(array[j])
        else
            all_1s.append(array[j])
    array = all_0s + all_1s
```



### GPU Implementation

This is a poster problem for the “split” GPU primitive. Split is to the GPU what <code>GroupBy</code> is in LINQ (or <code>GROUP BY</code> in T-SQL): a sequence of entries is split (grouped) into a number of categories. The code above applies split k times to an array of length n; each time the categories are “0” and “1”, and splitting/grouping is done based on the current digit of an entry.
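As a toy illustration of the split primitive's semantics (a Python sketch of stable grouping, not GPU code; the name `split` is mine):

```python
from collections import defaultdict

def split(xs, key):
    """Stable split: group entries into categories, preserving the original
    order within each category (what GroupBy does in LINQ)."""
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)
```

Splitting on the least significant bit, for example, produces exactly the “0” and “1” categories of one radix sort pass.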

The particular case of splitting into just two categories is easy to implement on the GPU. Chapter 39 of GPU Gems describes such an implementation (search for “radix sort” on the page).

The algorithm described there shows how to implement the pseudocode above by computing the position of each member of the array in the new merged array (the concatenation step of the pseudocode) without the intermediate accumulating lists. The new positions are computed with an exclusive scan of the “is the current bit 0” flags: scanned[0] = 0, and scanned[i] = scanned[i - 1] + 1 when array[i - 1] has 0 in the current position (scanned[i] = scanned[i - 1] otherwise). By the end of the scan we therefore know where the “1” category starts in the new array: it is scanned[n - 1] + is0(array[n - 1]), the total length of the “0” category. The new address of each member then falls out of the scan. For the “0” category it is simply scanned[i] (scanned only increases when a 0 bit is encountered). For the “1” category it is start_of_1 + (i - scanned[i]): the element’s position in the original array minus the number of “0”-category members up to that point, offset by the start of the “1” category.
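The address arithmetic is easy to check on the CPU. Here is a Python sketch of one split step under the scheme just described (the name `split_by_bit` is mine):

```python
def split_by_bit(xs, bit):
    """One split step: compute each element's target index from an exclusive
    scan of the 'is the current bit 0' flags, then scatter directly,
    with no intermediate lists."""
    n = len(xs)
    if n == 0:
        return []
    is0 = [((x >> bit) & 1) ^ 1 for x in xs]
    # exclusive scan: scanned[i] = number of 0-bit elements before position i
    scanned = [0] * n
    for i in range(1, n):
        scanned[i] = scanned[i - 1] + is0[i - 1]
    total_falses = scanned[n - 1] + is0[n - 1]  # where the "1" category starts
    out = [0] * n
    for i, x in enumerate(xs):
        addr = scanned[i] if is0[i] else total_falses + i - scanned[i]
        out[addr] = x
    return out
```

On the GPU, the scan runs as a parallel primitive and the final loop becomes the scatter kernel, one thread per element.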

The algorithm has two parts: sorting each block as described above and then merging the results. In our implementation we skip the second step, since the Alea.CUDA DeviceScanModule does the merge for us (inefficiently so, at each iteration, but it makes for simpler, more intuitive code).

### Coding with Alea.CUDA

The great thing about the Alea.CUDA library v2 is the Alea.CUDA.Unbound namespace, which implements all kinds of scans (and reducers, since a reducer is just a scan that throws away almost all of its results).

```fsharp
let sort (arr : int []) =
    let len = arr.Length
    if len = 0 then [||]
    else
        let gridSize = divup arr.Length blockSize
        let lp = LaunchParam(gridSize, blockSize)

        // reducer to find the maximum number & get the number of iterations from it
        use reduceModule = new DeviceReduceModule<int>(target, <@ max @>)
        use reducer = reduceModule.Create(len)

        use scanModule = new DeviceScanModule<int>(target, <@ (+) @>)
        use scanner = scanModule.Create(len)

        use dArr = worker.Malloc(arr)
        use dBits = worker.Malloc(len)
        use numFalses = worker.Malloc(len)
        use dArrTemp = worker.Malloc(len)

        // number of iterations = bit count of the maximum number
        let numIter = reducer.Reduce(dArr.Ptr, len) |> getBitCount

        let getArr i = if i &&& 1 = 0 then dArr else dArrTemp
        let getOutArr i = getArr (i + 1)

        for i = 0 to numIter - 1 do
            // compute significant bits
            worker.Launch <@ getNthSignificantReversedBit @> lp (getArr i).Ptr i len dBits.Ptr

            // scan the bits to compute starting positions further down
            scanner.ExclusiveScan(dBits.Ptr, numFalses.Ptr, 0, len)

            // scatter
            worker.Launch <@ scatter @> lp (getArr i).Ptr len numFalses.Ptr dBits.Ptr (getOutArr i).Ptr

        (getOutArr (numIter - 1)).Gather()
```

This is basically it.
One small optimization: instead of looping 32 times (the length of an int in F#), we figure out the bit count of the largest element and iterate fewer times. Helper functions/kernels are straightforward enough:

```fsharp
let worker = Worker.Default
let target = GPUModuleTarget.Worker worker
let blockSize = 512

[<Kernel; ReflectedDefinition>]
let getNthSignificantReversedBit (arr : deviceptr<int>) (n : int) (len : int) (revBits : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then
        revBits.[idx] <- ((arr.[idx] >>> n &&& 1) ^^^ 1)

[<Kernel; ReflectedDefinition>]
let scatter (arr : deviceptr<int>) (len : int) (falsesScan : deviceptr<int>) (revBits : deviceptr<int>) (out : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then
        let totalFalses = falsesScan.[len - 1] + revBits.[len - 1]

        // when the original bit is 1, the element is offset by totalFalses and the
        // number of "falses" before it; when it is 0 (revBits = 1), the scan value
        // already contains the right address
        let addr = if revBits.[idx] = 1 then falsesScan.[idx] else totalFalses + idx - falsesScan.[idx]
        out.[addr] <- arr.[idx]

let getBitCount n =
    let rec getNextPowerOfTwoRec n acc =
        if n = 0 then acc
        else getNextPowerOfTwoRec (n >>> 1) (acc + 1)
    getNextPowerOfTwoRec n 0
```

### Generating Data and Testing with FsCheck

I find FsCheck quite handy for quick testing and generating data with minimal setup in F# scripts:

```fsharp
let genNonNeg = Arb.generate<int> |> Gen.filter ((<=) 0)

type Marker =
    static member arbNonNeg = genNonNeg |> Arb.fromGen
    static member ``Sorting Correctly`` arr = sort arr = Array.sort arr

Arb.registerByType(typeof<Marker>)
Check.QuickAll(typeof<Marker>)
```

Just a quick note: the first line may seem weird at first glance, as if the filter condition were dropping non-positive numbers. A more attentive second glance reveals that it actually filters out negative numbers: `(<=) 0` is the partially applied function `fun x -> 0 <= x` (currying and prefix notation should convince anyone in doubt).

### Large Sets Test
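The property being checked ("my sort agrees with the library sort on arbitrary non-negative input") is easy to mimic in any language. A stdlib-only Python sketch of the same randomized property check (no FsCheck involved; the names are mine):

```python
import random

def check_sort_property(sort_fn, trials=100, seed=42):
    """FsCheck-style randomized property: on arbitrary non-negative inputs,
    sort_fn must agree with the built-in sorted()."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randrange(1 << 16) for _ in range(rng.randrange(50))]
        assert sort_fn(list(xs)) == sorted(xs), f"failed on {xs}"
    return True
```

A real property-based framework additionally shrinks failing inputs to minimal counterexamples, which this sketch does not attempt.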
I have performance-tested this on large datasets (10,000,000 to 300,000,000 integers; any more and I run out of memory on my 6 GB Titan GPU). The chart below goes up to 100,000,000. I generated the data for these tests on the GPU as well, using Alea.CUDA random generators, since the FsCheck generator above is awfully slow when it comes to large arrays. The code comes almost verbatim from this blog post:

```fsharp
let generateRandomData n =
    if n <= 0 then failwith "n should be positive"
    let seed = uint32 DateTime.Now.Second

    // set up the random number generator
    use cudaRandom =
        (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32>
    use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n

    // create random numbers
    cudaRandom.Fill(0, n, prngBuffer)

    // transfer results from device to host
    prngBuffer.Gather()
    |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x))
```

except that I am converting the generated array to integers in the [0; n] range. Since my radix sort's work complexity is O(k * n) (step complexity O(k)), where k = bitCount(max input), I figured this would give an adequate (kinda) comparison with the native F# Array.sort.

### Optimization

It is clear from the above chart that there is a lot of latency in our GPU implementation: the cost we incur for the coding nicety of using DeviceScanModule and DeviceReduceModule, which spare us the merges we would otherwise have to do by hand. Hence the first possible optimization: bite the bullet, do it right, performing a block-by-block sort with BlockScanModule and a single merge at the end of the process.
 Categories: CUDA, F# Tags: Alea.CUDA, sorting 
## Capture Video in 2 Lines of Code

March 21, 2016 3 comments

Literally. Well, almost: 2 meaningful lines + some boilerplate. This has got to be easier than even Python! Using EmguCV, a wrapper around OpenCV, from F# Interactive:

```fsharp
#I @"C:\<project directory>\packages\EmguCV.3.0.0\lib\net451"
#r "Emgu.CV.dll"
#r "Emgu.Util.dll"
#r "System.Windows.Forms"

open Emgu.CV
open System.Windows.Forms

let capture = new Capture()
Application.Idle.Add(fun _ -> CvInvoke.Imshow("Camera", capture.QueryFrame()))
```

Categories: F# Tags: emgucv, fsharp, imageprocessing, opencv, video

## Look-and-say: [Alea.]CUDA

January 6, 2016 Leave a comment

Continuing the Advent of Code theme from the previous post. I figured that since this year is going to be my year of CUDA, this would be a good opportunity to take it for a ride. A good April 1st post, but why wait? So, how can we make this even faster than the already fast imperative solution?

### Naive Approach

The complete script for this is on GitHub. The kernel computes one iteration of the look-and-say sequence. Every kernel thread looks at a single group of repeating digits and outputs the result into an array.
The new array is twice the size of the original one (which represents the previous member of the sequence): a lone digit or the first digit of a group produces two numbers (n, c), and every repeating digit produces two “0”s on the output:

```
1 1 1 3 1 2 2 1 1 3 -> 3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3
```

```fsharp
[<Kernel; ReflectedDefinition>]
let lookAndSayKernel (arr : deviceptr<int8>) (len : int) (out : deviceptr<int8>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    if ind < len then
        let c = arr.[ind]
        let idxOut = 2 * ind
        if ind = 0 || arr.[ind - 1] <> c then
            let mutable i = 1
            while ind + i < len && c = arr.[ind + i] do
                i <- i + 1
            out.[idxOut] <- int8 i
            out.[idxOut + 1] <- c
        else
            out.[idxOut] <- 0y
            out.[idxOut + 1] <- 0y
```

In the naive approach we bring the resulting sequence back into host memory and repeat:

```fsharp
let lookAndSayGpu (s : string) n =
    let arr = s |> Seq.map (fun c -> (string >> int8) c) |> Seq.toArray

    let rec loop (arr : int8 []) n =
        if n = 0 then arr.Length
        else
            let blockSize = 512
            let gridSize = divup arr.Length blockSize
            let lp = LaunchParam(gridSize, blockSize)

            use dArr = worker.Malloc(arr)
            use dOut = worker.Malloc(arr.Length * 2)

            worker.Launch <@ lookAndSayKernel @> lp dArr.Ptr arr.Length dOut.Ptr

            let out = dOut.Gather()
            let arr = out |> Array.filter (fun b -> b > 0y)
            loop arr (n - 1)

    loop arr n
```

This is almost not worth checking against the high-performing imperative algorithm of the last post, but why not? Not unexpectedly, we are still falling short of the imperative solution. Not a problem: there is some very low-hanging fruit here just waiting to be picked.

### A Better Solution

The complete script for this is on GitHub.
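The expansion scheme is easy to check on the CPU. A Python sketch of what every thread does (the loop plays the role of the thread grid; the name `expand` is mine):

```python
def expand(arr):
    """Data-parallel-style look-and-say expansion: each position writes two
    output slots; the first digit of a run writes (run length, digit),
    repeated digits write (0, 0)."""
    n = len(arr)
    out = [0] * (2 * n)
    for ind in range(n):  # on the GPU, each ind is a separate thread
        c = arr[ind]
        if ind == 0 or arr[ind - 1] != c:
            i = 1
            while ind + i < n and arr[ind + i] == c:
                i += 1
            out[2 * ind] = i
            out[2 * ind + 1] = c
    return out
```

Filtering out the zeros afterwards yields the next member of the sequence, which is exactly what the naive host-side loop does.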
There are two problems with our solution so far:

1. Too much memory is moved between the GPU and RAM.
2. A significant part of the algorithm is executed on the CPU, with memory allocated and destroyed in the process (that would be the array-compacting part).

The following algorithm addresses both problems:

```
GenerateSequences(str, n) {
    // allocate the maximum amount of memory for all data structures used on the GPU
    len = str.Length
    initializeGpuMemory(str, len)
    for i = 1 to n {
        lookAndSayKernel(len)
        compactResultStream(2 * len)
        len = getNewLength()
    }
}
```

Of the 3 steps in this iterative loop, only one (getting the new length) moves data off the GPU in our implementation, and even that would be easy enough to avoid if necessary. At the core of this algorithm is a stream compaction routine that executes, hopefully, entirely on the GPU.

Compaction is performed by copying only the values needed to their addresses in the output stream. These addresses are determined by running an inclusive scan on an array of integers that contains “1” in every meaningful position and “0” in all the rest. The details of how scan is used for stream compaction, and how scans are implemented on the GPU, are discussed in GPU Gems 3 (free from NVIDIA). I’m using a device scan that comes with Alea.CUDA, which, hopefully, minimizes memory copies between main memory and the GPU. Once the stream has been compacted, we get the length of the new sequence as the last entry of the address map obtained during compaction (as explained in the linked chapter of GPU Gems 3).
```fsharp
[<Kernel; ReflectedDefinition>]
let lookAndSayKernelSimple (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) (flags : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    if ind < len then
        let c = arr.[ind]
        let idxOut = 2 * ind
        let prevUnrepeated = if ind = 0 || arr.[ind - 1] <> c then 1 else 0
        flags.[idxOut] <- prevUnrepeated
        flags.[idxOut + 1] <- prevUnrepeated
        if prevUnrepeated = 1 then
            let mutable i = 1
            while ind + i < len && c = arr.[ind + i] do
                i <- i + 1
            out.[idxOut] <- i
            out.[idxOut + 1] <- c
        else
            out.[idxOut] <- 0
            out.[idxOut + 1] <- 0

[<Kernel; ReflectedDefinition>]
let copyScanned (arr : deviceptr<int>) (out : deviceptr<int>) (len : int) (flags : deviceptr<int>) (addrmap : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    if ind < len && flags.[ind] > 0 then
        out.[addrmap.[ind] - 1] <- arr.[ind]
```

In the modified kernel, we also fill out the “flags” array, which is scanned to obtain the address map for compacting the stream. The second kernel uses the result of the scan performed on these flags to produce the new, compacted sequence. So for the above example:

```
original: 1 1 1 3 1 2 2 1 1 3
new:      3 1 0 0 0 0 1 3 1 1 2 2 0 0 2 1 0 0 1 3
flags:    1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1
map:      1 2 2 2 2 2 3 4 5 6 7 8 8 8 9 10 10 10 11 12 (inclusive scan of the flags)
```

The last entry in the map is also the new length.
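The compaction step itself is easy to verify on the CPU. A Python sketch of the flags/scan/scatter scheme (the name `compact` is mine):

```python
def compact(values, flags):
    """Stream compaction: an inclusive scan of the flags gives each kept
    value a 1-based output address; the last scan entry is the new length."""
    addrmap, running = [], 0
    for f in flags:
        running += f
        addrmap.append(running)      # inclusive scan
    out = [0] * addrmap[-1]
    for i, v in enumerate(values):
        if flags[i]:
            out[addrmap[i] - 1] = v  # scatter, as in copyScanned
    return out, addrmap[-1]
```

On the example arrays above this yields the sequence 3 1 1 3 1 1 2 2 2 1 1 3 of length 12, i.e. "1113122113" read off as "311311222113".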
```fsharp
let lookAndSayGpuScan (s : string) n =
    let maxLen = 20 * 1024 * 1024
    let arr = s |> Seq.map (fun c -> (string >> int) c) |> Seq.toArray

    use dOut = worker.Malloc(maxLen)
    use dFlags = worker.Malloc(maxLen)
    use dAddressMap = worker.Malloc(maxLen)
    use dArr = worker.Malloc(maxLen)
    dArr.Scatter(arr)

    use scanModule = new DeviceScanModule<int>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    scanModule.Create(100) |> ignore

    let sw = Stopwatch()
    let rec loop n len =
        if n = 0 then len
        else
            let blockSize = 512
            let gridSize = divup len blockSize
            let lp = LaunchParam(gridSize, blockSize)
            use scan = scanModule.Create(2 * len)

            worker.Launch <@ lookAndSayKernelSimple @> lp dArr.Ptr len dOut.Ptr dFlags.Ptr
            scan.InclusiveScan(dFlags.Ptr, dAddressMap.Ptr, 2 * len)

            let gridSize = divup (2 * len) blockSize
            let lp = LaunchParam(gridSize, blockSize)

            worker.Launch <@ copyScanned @> lp dOut.Ptr dArr.Ptr (2 * len) dFlags.Ptr dAddressMap.Ptr
            let len = dAddressMap.GatherScalar(2 * len - 1)

            loop (n - 1) len

    sw.Start()
    let res = loop n s.Length
    sw.Stop()
    res, sw.ElapsedMilliseconds
```

The complete code is above. Notice that we are not cheating when we measure only the speed of the internal loop, excluding allocations: these can be factored out completely, so the only experiment that matters is what actually runs. Here are the comparisons:

Now we are finally fast! But it’s a mighty weird chart, probably because most of the time is spent in context switching and all the other latency necessary to make this run… So, the takeaways:

- GPU programming makes us think differently, which is cool.
- GPU algorithms exist; writing for the GPU is tough but rewarding.
- We can really blow a small problem out of any proportion considered decent.

Categories: CUDA, F# Tags: adventofcode, Alea.CUDA, CUDA, f#

## Look-and-say: F#

December 27, 2015 1 comment

This holiday season my brief indulgence was solving the Advent of Code puzzles. One was about the look-and-say sequence.
It starts with “1” and grows as follows:

- 1 is read off as “one 1” or 11.
- 11 is read off as “two 1s” or 21.
- 21 is read off as “one 2, then one 1” or 1211.
- 1211 is read off as “one 1, then one 2, then two 1s” or 111221.
- 111221 is read off as “three 1s, then two 2s, then one 1” or 312211.

This sequence has been analyzed. Members’ lengths grow exponentially: $|\mathcal{L}_{i+1}| \approx |\mathcal{L}_{i}| \times 1.3$.

Advent of Code Day 10 asks to output the length of the 41st member of this sequence, starting from 1113122113; the second part asks for the 51st. This is simple enough; however, if one is not careful, it may take a while.

I wrote a solution fast enough, but I wasn’t satisfied with it at all: it looked too generic in style, meaning it could have been written this way in any language, while the whole point of the indulgence was the aesthetic enjoyment of writing F#. I looked around and found the following post, which I am delighted to quote as an example of the author’s amazing style and command of the language:

```fsharp
let read (input : string) =
    input
    |> Seq.fold (fun acc x ->
        match acc with
        | (n, x')::tl when x = x' -> (n + 1, x')::tl
        | _ -> (1, x)::acc) []
    |> List.rev
    |> Seq.collect (fun (n, x) -> sprintf "%d%c" n x)
    |> fun xs -> System.String.Join("", xs)
```

If you take a moment to read through it, you will be admiring it as much as I was. Here is my boringly generic solution:

```fsharp
let genNext (s : List<byte>) =
    let mutable i = 0y
    let mutable c = s.[0]
    let reps = List<byte>()
    for ch in s do
        if c <> ch then
            reps.Add(byte i)
            reps.Add(c)
            c <- ch
            i <- 0y
        i <- i + 1y
    reps.Add(byte i)
    reps.Add(c)
    reps
```

If you look at mine, it is totally un-F#. It may as well have been written in C++! It is not motivated by a sense of style, or good taste, or, sadly, by a deep sense of the language. It is, however, motivated by performance. The key to great performance is, of course, knowing well ahead of time where your performance goes.
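For reference, the same single-pass, run-counting approach reads naturally in Python too (a sketch of mine, mirroring the mutable `genNext` above):

```python
def gen_next(s):
    """One look-and-say step: a single pass over a list of digits that
    emits (run length, digit) for every run of repeating digits."""
    out = []
    count, cur = 0, s[0]
    for ch in s:
        if ch != cur:
            out += [count, cur]  # close the previous run
            cur, count = ch, 0
        count += 1
    out += [count, cur]          # close the final run
    return out
```

Like the F# version, it allocates exactly one new structure per step and never serializes back to a string.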
In this case we do know it: since the structure holding the next sequence member grows exponentially in length, we would like to minimize all operations related to its manipulation. As far as reads go, we cannot escape O(n) of them; same for writes at every step. And now we are faced with a dilemma that a C++ (and possibly a C#) programmer would never face: are we being wasteful with our memory? So wasteful that it can bog us down?

The answer in the first case is: yes, of course. In the second: no. The first solution, which would satisfy any functional purist, creates and destroys its immutable structures at every step of the way. Not only that, its input is necessarily a string, so we also need to serialize the sequence we come up with into a string, for the sole purpose of deconstructing it again in the next iteration. The second solution, by contrast, uses an almost minimal amount of memory (it is provable that the sequence never contains any digits beyond 1, 2, and 3, so bytes are good) and exactly one new structure, which it returns. We don’t need to serialize it back to a string at all!

Let’s see how these compare time-wise (the number of iterations over the sequence is along the X axis). It is not unexpected, but still kinda staggering. Zooming into the boring solution (it is still exponential):

So this raises a bunch of questions: as our technology evolves, are we going to be able to write more and more aesthetically impeccable (and shorter!) code without caring much for the real world? Or are performance concerns going to be ever looming somewhere very close, forcing more sensible choices? Who is the developer of the future? Is s/he going to be required to write great code but hold more down-to-earth alternatives in mind, effectively cancelling her immunity to the trials and tribulations of machine-oriented languages like C++ or even C, something that Java set out to free us from ages ago?
I don’t know, but I believe the answer, for the nearest future at least, should be sought in the world of algorithms. Algorithmic and especially data-structural efficiency would help us write in style while still holding on to great performance gains.

Categories: F#, Puzzle Tags: adventofcode, algorithm, fsharp, performance, puzzle

## Non-linear Thinking with CUDA

September 8, 2015 1 comment

I love GPU programming for precisely this: it forces and enables you to think about a solution in a non-linear fashion, in more than one sense of the word.

### The Problem

Given a set $A = \{a_1, a_2 \ldots a_n\}$, output a set $S_A = \{0,\ \sum\limits_{k=1}^{n} a_k,\ \sum\limits_{k=i}^{i + j \mod n} a_k\ |\ 1 \le i < n,\ 0 \le j \le n - 1\}$.

In other words: pretend our array is cyclical, and we want all partial sums of the array elements, in order, starting from element i and ending with i + j (looping through the end, since the array is cyclical). In the bioinformatics course, this happens to be a toy problem of generating a cyclo-spectrum from the weights of a cyclic peptide. So for a peptide with individual amino-acid masses [57; 71; 113; 113; 128], the first 3 iterations will look like:

```
57            71            113           113           128
57+71         71+113        113+113       113+128       128+57
57+71+113     71+113+113    113+113+128   113+128+57    128+57+71
```

### Code

On my GitHub.

### Sequential Solutions

The above table immediately hints at a brute-force as well as an optimized solution. The brute-force solution is $O(n^3)$:

```fsharp
// brute force O(n^3) solution
let cyclospec (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray
    let mutable parsum = 0

    (seq {
        yield 0
        yield pepArr |> Array.sum
        for i = 0 to len - 2 do
            yield! [0..len-1]
                   |> Seq.map (fun j ->
                        // F# 4.0 feature: a variable mutated within a closure
                        parsum <- 0
                        for ind = j to j + i do
                            parsum <- parsum + pepArr.[if ind >= len then ind - len else ind]
                        parsum)
    })
    |> Seq.sort
    |> Seq.toArray
```

Note the F# 4.0 feature: a variable mutated within a closure.
This will work, but slowly. To speed it up significantly, we only need to notice that there is no need to recompute values that have already been computed over, and over, and over again. Memoization yields an $O(n^2)$ solution:

```fsharp
let cyclospecOpt (peptide : int seq) =
    let len = peptide.Count()
    let pepArr = peptide |> Seq.toArray

    let output = Array.zeroCreate ((len - 1) * len)
    for i = 0 to len - 2 do
        [0..len-1]
        |> Seq.iter (fun j ->
            let ind = i + j
            output.[i * len + j] <-
                if i = 0 then pepArr.[j]
                else output.[(i - 1) * len + j] + pepArr.[if ind >= len then ind - len else ind])

    seq { yield 0; yield pepArr |> Array.sum; yield! output }
    |> Seq.sort
    |> Seq.toArray
```

Ok, so far so good.

### Multi-dimensional Thinking with CUDA

So how do we look at this differently with CUDA? I believe solving this problem shows how data, for lack of a better term, fuses with time (represented by threads). If we take a closer look at the table above, we’ll see that the solution can be represented in a two-dimensional grid. For instance, at grid location (5, 3), we have a solution element that is constructed of $a_5 + a_6 + a_7$. Thus it’s a grid X of dimensions n x (n-1), where $x_{i, j}$ is the sum starting from the i-th element of the input set and ranging over j elements. This data structure maps easily into CUDA with its 3D thread model (this is the element of time/data fusion I was talking about).
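The same memoization fits in a few lines of Python (a sketch of mine, handy for checking the GPU results on small inputs):

```python
def cyclospectrum(peptide):
    """O(n^2) cyclospectrum: sums of length L reuse the sums of length L-1."""
    n = len(peptide)
    out = [0, sum(peptide)]
    prev = peptide[:]                  # sums of length 1, one per start index
    out.extend(prev)
    for length in range(2, n):         # sums of length 2 .. n-1
        cur = [prev[j] + peptide[(j + length - 1) % n] for j in range(n)]
        out.extend(cur)
        prev = cur
    return sorted(out)
```

Each row of the table above is computed from the previous one with a single addition per cell, exactly the recurrence the memoized F# version stores in `output`.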
Since we now have Alea.CUDA, we can do this without leaving F#, and, even better, we can script it:

```fsharp
[<Kernel; ReflectedDefinition>]
let cyclospecKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    let lenElem = blockIdx.y * blockDim.y + threadIdx.y

    if ind < len && lenElem < len - 1 then
        let mutable parsum = 0
        for i = 0 to lenElem do
            let idx = ind + i
            parsum <- parsum + arr.[if idx >= len then idx - len else idx]
        out.[lenElem * len + ind] <- parsum
```

And we invoke it:

```fsharp
let cyclospecGpu (peptide : int []) =
    let blockSize = dim3(16, 16)
    let gridSize = dim3(divup peptide.Length blockSize.x, divup peptide.Length blockSize.y)
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    worker.Launch <@ cyclospecKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr
    let output = dOutput.Gather()

    seq { yield 0; yield! output; yield peptide |> Seq.sum }
    |> Seq.toArray
    |> Array.sort
```

Each kernel thread computes its own element in the grid, which is flattened into a solution. Time has indeed merged with data. But what about our optimization? The easy way to implement it is to chain invocations of a kernel, where each consecutive run computes the new sums using the results of previous kernel invocations.
There are variations on this theme, but it lacks the aesthetic value of the “brute force” CUDA solution:

```fsharp
[<Kernel; ReflectedDefinition>]
let cyclospecOptKernel (arr : deviceptr<int>) (len : int) (out : deviceptr<int>) (lenElem : int) =
    let ind = blockIdx.x * blockDim.x + threadIdx.x
    if ind < len then
        let idx = ind + lenElem
        // mirrors the memoized CPU version: reuse the sums of the previous iteration
        out.[lenElem * len + ind] <-
            if lenElem = 0 then arr.[ind]
            else out.[(lenElem - 1) * len + ind] + arr.[if idx >= len then idx - len else idx]
```

Invocation:

```fsharp
let cyclospecOptGpu (peptide : int []) =
    let blockSize = 256
    let gridSize = divup peptide.Length blockSize
    let lp = LaunchParam(gridSize, blockSize)

    use dPeptide = worker.Malloc(peptide)
    use dOutput : DeviceMemory<int> = worker.Malloc(peptide.Length * (peptide.Length - 1))
    for i = 0 to peptide.Length - 2 do
        worker.Launch <@ cyclospecOptKernel @> lp dPeptide.Ptr peptide.Length dOutput.Ptr i
    let output = dOutput.Gather()

    seq { yield 0; yield! output; yield peptide |> Seq.sum }
    |> Seq.toArray
    |> Array.sort
```

So, how did we do? (Here comes the mandatory chart.) And so:

1. A little algorithm is a beautiful thing.
2. A little CUDA is an even more beautiful thing.
3. The CUDA optimization performs no worse, and at larger input sizes starts performing better; at some point massive amounts of computation slow us down even in parallel. See #1.
4. And what is the GPU algorithm’s performance in O notation? It’s definitely non-linear; is it possible to estimate analytically?

Here is another chart that only compares the optimized algorithms. It is not a semi-log chart, for easier visualization:

Categories: bioinformatics, CUDA, F# Tags: Alea.CUDA, bioinformatics, CUDA, f#

## Turnpike Problem (F#)

September 6, 2015 Leave a comment

### The Problem

This is a problem for which no polynomial-time solution is known, although a pseudo-polynomial one (polynomial in the magnitude of the maximum element) does exist. In this paper, the algorithm proposed for the sum-function with full information is simple enough. For the original problem with the difference function, the algorithm is simple as well.
Problem. Given a set $D_A=\{a_i-a_j\},\ 1\le i, j\le n$, reconstruct the original set $A=\{a_i\},\ 1 \le i \le n$. (Naturally, we can only consider the positive deltas, $a_i - a_j$ with $a_i > a_j$, and the set is sorted in descending order.)

A simple algorithm exists. It performs in $O(2^n\cdot\log{n})$ time, but in practice its performance is closer to $O(n^2\cdot\log{n})$, as it "probably" won't hit the worst case. We first reduce our set $D_A$ to only positive values and assume they are sorted in descending order. Then the pseudocode runs as follows:

```
solution = [0; Da[0]]
max_element = Da[0]
remove(Da, Da[0])
while (Da not empty) {
    if solution = [0]
        return []
    if distances(solution.Add(Da[0])) ⊆ Da && not explored_this_path()
        remove(Da, distances)
    else if distances(solution.Add(max_element - Da[0])) ⊆ Da && not explored_this_path()
        remove(Da, distances)
    else
        backtrack(solution, Da)
}
return solution
```

We start by adding 0 and the maximal element to the solution and removing the latter from $D_A$. At every step we have the current version of the solution and of $D_A$. We take $\max{D_{A_{cur}}}$ to be the possible next element of the set $A$, compute its pairwise distances to the current sub-solution, and see if they form a subset of $D_{A_{cur}}$. If that fails, we try $a_{max} - \max{D_{A_{cur}}}$ the same way. If that fails as well, we backtrack to the previous state of the solution and the previous $D_{A_{cur}}$, and keep backtracking until we reach a state where not all alternatives have been tried. Once the current $D_A$ is empty, we have a solution. If we have reached the root state and all alternatives have been tried, there is no solution.

Essentially, we are building a binary tree where each node contains the current solution, the current $D_A$, and an edge connecting it with one or both of the alternatives above. Hopefully, we do not need to keep building this tree for too long. Here is a nice illustration of this process.
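The backtracking scheme above fits in a few lines of Python (a sketch of mine, following the same favor-the-max rule; all names are hypothetical):

```python
from collections import Counter

def turnpike(deltas):
    """Reconstruct a point set from its multiset of positive pairwise
    distances by backtracking over the two candidates at each step."""
    d = Counter(deltas)
    width = max(d)
    d[width] -= 1
    solution = [0, width]

    def dists(x):
        return [abs(x - y) for y in solution]

    def fits(x):
        need = Counter(dists(x))
        return all(d[k] >= v for k, v in need.items())

    def place():
        if sum(d.values()) == 0:
            return True
        y = max(k for k, v in d.items() if v > 0)
        for x in (y, width - y):      # try max first, then its mirror
            if fits(x):
                used = Counter(dists(x))
                d.subtract(used)      # remove the distances x induces
                solution.append(x)
                if place():
                    return True
                solution.pop()        # backtrack
                d.update(used)
        return False

    return sorted(solution) if place() else []
```

The recursion stack plays the role of the decision tree: each frame remembers which of the two alternatives it has already tried.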
Note that at every step we favor the remaining maximal element of $D_{A_{cur}}$ over $a_{max} - \max{D_{A_{cur}}}$. Doing the latter would produce a symmetric solution.

### F# Implementation

The source for this implementation is here. At every step, we keep building the following tree:

```fsharp
type DecisionTree =
    | Empty
    | TreeNode of
        deltas : Dictionary<int, int> *
        solution : int list *
        visited : uint32 *
        maxElem : DecisionTree *
        maxDist : DecisionTree *
        prev : DecisionTree
```

Every node contains the current deltas, the current state of the solution, the number of times we have visited this node (for optimization), the two possible branches, and a pointer to the parent node. The function that implements each step of the algorithm:

```fsharp
let rec insert (deltas : Dictionary<int, int>) (res : int list) (node : DecisionTree) (prevNode : DecisionTree) maxSol =
    let insPrev () =
        let res, deltas, prevPrev = getPrev prevNode
        insert deltas res prevNode prevPrev maxSol

    match node with
    | Empty -> TreeNode(deltas, res, 0u, Empty, Empty, prevNode)
    | TreeNode(dct, rslt, visited, maxElem, maxDist, prev) when visited < 2u ->
        let curVisited = visit node
        let elem = if visited = 0u then keySeqMax deltas else maxSol - keySeqMax deltas
        let dists = stepOk elem res deltas
        if dists.Length > 0 then
            let newDeltas = removeDist deltas dists
            insert newDeltas (elem::res) (if visited = 0u then maxElem else maxDist) curVisited maxSol
        elif visited = 0u then insert deltas res curVisited prev maxSol
        else insPrev()
    | _ -> insPrev()
```

The insert function implements each step of the algorithm and returns a reference to the currently inserted node.
The driver is a loop, implemented naturally as a tail-recursive function:

```fsharp
let rec buildSolution (deltas : Dictionary<int, int>) (res : int list) (node : DecisionTree) (prev : DecisionTree) =
    if deltas.Count = 0 then res |> List.sort
    else
        let newNode = insert deltas res node prev maxSol
        let prevNode = node
        match newNode with
        | Empty -> [] // no solution
        | TreeNode(deltas, res, visited, maxElem, maxDist, prev) ->
            if visited >= 2u && prev = Empty then [] // came all the way back: no solution
            else buildSolution deltas res newNode prevNode
```

This algorithm is biased towards always selecting the maximum remaining element of $D_A$ as the next element of the solution. So, for the input set [2; 2; 3; 3; 4; 5; 6; 7; 8; 10], it produces [0; 3; 6; 8; 10]. Symmetrically, [0; 2; 4; 7; 10] is also a solution.

Categories: bioinformatics, F# Tags: binary tree, turnpike

## Decomposition Problem with F#, Dynamic Programming

August 20, 2015 Leave a comment

As my former boss and mentor used to say, in order to understand recursion one must first understand recursion. This is funny, ha-ha, but if we tweak it slightly we get a really useful statement: in order to understand dynamic programming, one must first understand recursion. Here is an example.

### Sources

GitHub: http://github.com/fierval/BioInfo

### Decomposition Problem

Given a number $m$ and an (ordered) set $\mathcal{N} = \{n_i\},\ n_i > 0,\ n_{i+1} \geq n_{i}$, output how many sets $\mathcal{A}=\{a_i\},\ a_i \geq 0$ there are such that $m=\sum\limits_i a_i \cdot n_i$.

### Sequencing Antibiotics

I was solving a "particular" case (the solution is nonetheless general; this just explains the function names appearing in the code) from Chapter 2, on antibiotics sequencing, of my bioinformatics book. Antibiotics sequencing is done through mass spectrum analysis, so if we have a spectrum of weight $m$, we may need to build the list of linear peptides of the same mass.
In a toy version of this problem, we have an integer value of $m$ and a list of amino acid weights ($\mathcal{N}$) out of which antibiotics can be built:

```fsharp
let weights = [|57; 71; 87; 97; 99; 101; 103; 113; 114; 115; 128; 129; 131; 137; 147; 156; 163; 186|]
```

### To Understand Recursion

Given a number $m$, we assume that the problem is already solved for $m-n_0, m-n_1, \dots, m-n_i$, and that we have the values $k_0, k_1, \dots, k_i$ for these solutions. Our solution, then, is simply $\sum\limits_i k_i$. We are done.

If we look a bit closer, however, we'll see that our recursion creates a tree with one root node (the solution for the case of $m$), branching exponentially: at the first level we get $l = |\mathcal{N}|$ new nodes, each of which also produces $l$ nodes at the next level. So this pretty quickly becomes infeasible. Even if we did have unlimited computing power, this would still not be an efficient use of it, since lots of quantities are recomputed multiple times as we descend down the recursion tree.

### Dynamic Programming

We therefore turn to dynamic programming, which is, in a sense, recursion run backwards. We still use the idea above, but instead of going back we go forward: we pre-compute all the necessary values, starting from $n_0$, the minimal member of $\mathcal{N}$, and marching up to $m$. At each step we perform exactly the recursive step described above, so once we reach the higher values of $m$, all the previous values are already available!

### The Solution

We allocate an array of zeros of length $m$ (this could actually be shrunk to some constant length of $O(l)$, perhaps just $l$; I didn't check, since old values become useless as we move along). We set all the entries corresponding to the $n_i$ to 1. The rest is pretty straightforward.
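For contrast, the exponential recursion from the previous section can be written down directly. This sketch is not from the original post (`countNaive` is a hypothetical name); like the dynamic programming version below, it counts ordered decompositions, i.e. linear peptides:

```fsharp
// array of amino acid weights
let weights = [|57; 71; 87; 97; 99; 101; 103; 113; 114; 115; 128; 129; 131; 137; 147; 156; 163; 186|]

/// Exponential recursion, for illustration only: the count for m is the
/// sum of the counts for m - w over all weights w; there is exactly one
/// way to decompose 0 (the empty peptide).
let rec countNaive m =
    if m = 0 then 1L
    elif m < 0 then 0L
    else weights |> Array.sumBy (fun w -> countNaive (m - w))

// countNaive 114 = 2L: the single weight 114, or 57 + 57
```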
```fsharp
open System

// array of amino acid weights
let weights = [|57; 71; 87; 97; 99; 101; 103; 113; 114; 115; 128; 129; 131; 137; 147; 156; 163; 186|]

/// Count the number of peptides of mass m
let countNumPeptides m =
    let allCounts : int64 [] = Array.zeroCreate (max m weights.[weights.Length - 1] + 1)

    // initialize the dynamic programming store
    Array.ForEach(weights, fun w -> allCounts.[w] <- 1L)

    // equivalent to a simple loop, but we are using tail recursion:
    // count(n) = count(n - weights.[0]) + count(n - weights.[1]) + ... + count(n - weights.[weights.Length - 1])
    let rec fillDynArray n =
        if n > m then allCounts.[m]
        else
            let step =
                weights
                |> Array.map (fun w -> if n - w < 0 then 0L else allCounts.[n - w])
                |> Array.sum

            allCounts.[n] <- allCounts.[n] + step
            fillDynArray (n + 1)

    fillDynArray weights.[0], allCounts
```

### …and a mandatory chart

I really enjoy working with FSharp.Charting (available through NuGet), and it would be cool to visualize how fast the number of solutions grows as $m$ increases linearly. Notice that above I am returning the entire array of counts we have generated, so since we are saving all of the solutions from $n_0$ to $m$, let's see what we get:

```fsharp
#load @"..\packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx"
open FSharp.Charting

let chartNumPeptides m =
    let _, counts = countNumPeptides m
    let series = Array.init m (fun i -> i, if counts.[i] > 0L then counts.[i] else 1L)
    Chart.Line series.[weights.[0]..]
    |> Chart.WithXAxis(Min=100., Max=1000.)
    |> Chart.WithYAxis(Log=true)
```

I am cheating a little here: where the values are 0, I set them to 1, since this is a semi-log chart (no log of 0). I think it's easier than matplotlib!

Categories: bioinformatics, F# Tags: dynamic programming, fsharp, FSharp.Charting
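As a postscript, the recurrence is easy to sanity-check on a tiny alphabet where the counts can be verified by hand. A sketch (not the post's code; `countWith` is a hypothetical name, and it folds the base case into an explicit count(0) = 1 instead of pre-seeding the weight entries, which yields the same counts):

```fsharp
// Same recurrence with an explicit base: count(0) = 1 (the empty decomposition),
// count(n) = sum over w of count(n - w).
let countWith (ws : int []) m =
    let counts : int64 [] = Array.zeroCreate (m + 1)
    counts.[0] <- 1L
    for n in 1 .. m do
        counts.[n] <- ws |> Array.sumBy (fun w -> if n - w < 0 then 0L else counts.[n - w])
    counts.[m]

// countWith [|2; 3|] 7 = 3L: 2+2+3, 2+3+2, 3+2+2
```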