## Fun with Alea.CUDA, F# Interactive, Charts

Source code for this post can be found on my GitHub.

It’s great to see technologies evolving over the years. Alea.CUDA has done so in leaps and bounds since the first time I laid eyes on it a couple of years ago. At the time the name seemed unfortunate and hinted at the “aleatic” programming paradigm, meaning that you could not reliably predict the results your code would produce in consequent runs (alea is a Latin dice). It has all changed since.

I listened to an NVIDA Webinar on the subject and immediately wanted to take Alea.CUDA development for a test drive. I wanted to do it the same way I would in Python: just like we use Numba and IPython console, use F# scripting with F# Interactive.

For a test subject, I picked the same 1D LBP described in my post here. The actual algorithm for extracting the local binary pattern and its use in face recognition is described in this paper.

The algorithm consists of two parts:

- Sliding a window of size 2*n across the array and computing the pattern at each location
- Computing the histogram of these values.

Here is my F# way of doing it:

let lbp (arr : int[]) n = if arr.Length < n * 2 + 1 then failwith ("Length should be at least " + (n * 2 + 1).ToString()) // computes a pattern around an array element i let calcOnePattern i (around : int [])= [0..around.Length - 1] |> Seq.fold (fun st j -> if around.[j] > i then st ||| (1 <<< j) else st) 0 // moves sliding windoe of size '2*n' through the array and computes the pattern for each // element at the center of a window let prep = [|n .. arr.Length - n - 1|] |> Array.map (fun i -> calcOnePattern arr.[i] (Array.concat [arr.[i - n..i - 1]; arr.[i + 1..i + n]])) // return histogram prep |> Array.fold (fun (st : int[]) v -> st.[v] <- st.[v] + 1; st) (Array.zeroCreate (1 <<< n * 2))

For comparison, here is the Python code (here p is 2 ** [0..(2n-1)]), in F# code I’m simply shifting left, so no such array is necessary.

def extract_1dlbp_cpu(input, neighborhood, p): hist = np.zeros(1 << (2 * neighborhood)) for i in range(neighborhood, len(input) - neighborhood): left = input[i - neighborhood : i] right = input[i + 1 : i + neighborhood + 1] both = np.r_[left, right] hist[np.sum(p [both >= input[i]])] += 1 return hist

## Putting it on GPU

I took one of Alea.CUDA provided sample scripts and modified it slightly. This is why my code is wrapped in a class. Here it is:

type LBP(arr : int [], ?n : int) = inherit GPUModule(GPUModuleTarget.DefaultWorker) let mutable arr = arr let n = defaultArg n 4 do if arr.Length < n * 2 + 1 then failwith ("Length should be at least " + (n * 2 + 1).ToString()) member this.Arr with get() = arr and set (value) = arr <- value [<Kernel;ReflectedDefinition>] member this.Kernel (arr:deviceptr<int>) (n:int) (hist:deviceptr<int>) (len : int)= let mutable i = blockIdx.x * blockDim.x + threadIdx.x let mutable res = 0 if i < len - 2 * n then i <- i + n for j = i - n to i - 1 do if arr.[j] >= arr.[i] then res <- res ||| (1 <<< (j - (i - n))) for j = i + 1 to i + n do if arr.[j] >= arr.[i] then res <- res ||| (1 <<< (j - (i - n + 1))) __atomic_add (hist + res) 1 |> ignore member this.Compute () = let blockSize = 512 let gridSize = divup (arr.Length - 2 * n) blockSize let lp = LaunchParam(gridSize, blockSize) let hist = Array.zeroCreate (1 <<< n * 2) use d_arr = this.GPUWorker.Malloc(arr) use d_hist = this.GPUWorker.Malloc(hist) this.GPULaunch <@ this.Kernel @> lp d_arr.Ptr n d_hist.Ptr (arr.Length) d_hist.Gather()

There is not much here, at least in spirit, that any CUDA developer is not used to. We define a kernel and then launch it in a very familiar way, albeit wrapped in a slightly unfamiliar form. Nothing that is not easy to understand or adopt, though. F# quotations are used to interface with nvcc, so makes sense. Pointer arithmetic is wrapped nicely, so code like the one on line 25 above (__atomic_add) is possible. Works great, easy to use.

Here the kernel, as one would expect, computes a pattern around a single array element, at the same time contributing its result to the final histogram. Definitely room for optimization as this atomic_add is an obvious bottleneck, but I didn’t go any further since the results are quite spectacular already.

Also, I really like the “divup” function above, that figures out the number of blocks. Normally it is done as:

gridSize = (total + blockSize) / blockSize

… a small thing, but a nice convenience.

### Boilerplate

Since I was doing it in F# Interactive, some boilerplate was necessary. I installed my binaries through NuGet and here are the references. The “send to F# interactive” functionality that has been around for a while now is a great help to figure things out.

#r "System.Configuration.dll" #I @"packages\Alea.Cuda.2.1.2.3274\lib\net40" #I @"packages\NUnit.2.6.3\lib\" #I @"packages\FsUnit.1.3.0.1\Lib\Net40" #r "Alea.CUDA.dll" #r "nunit.framework.dll" #r "FsUnit.NUnit.dll" #load @"packages\FSharp.Charting.0.90.12\FSharp.Charting.fsx" open System open System.IO open Alea.CUDA open Alea.CUDA.Utilities open FsUnit open FSharp.Charting open System.Diagnostics Alea.CUDA.Settings.Instance.Resource.AssemblyPath <- Path.Combine(__SOURCE_DIRECTORY__, @"packages\Alea.Cuda.2.1.2.3274\private") Alea.CUDA.Settings.Instance.Resource.Path <- Path.Combine(__SOURCE_DIRECTORY__, @"release")

The last two lines come straight from Alea tutorial sample and are necessary to hook up the jit compiler. That’s it!

### Experiments & results

I used the following functions to drive the experiments:

// random array of length n let generate n = let rng = Random(int DateTime.Now.Ticks) Array.init n (fun _ -> rng.Next()) // experiment: run from 10 ** low to 10 ** high array length let experiment low high = if low >= high || low < 0 then failwith "must be: low < high, both non-negative" // salt it let arr = [|0..1000|] use lb = new LBP(arr) lb.Compute() |> ignore let sw = Stopwatch() let cpuTimes = Array.zeroCreate (high - low + 1) let gpuTimes = Array.zeroCreate (high - low + 1) for i = low to high do let range = int (10.0 ** float i) let arr = generate range lb.Arr <- arr // Run on CPU sw.Restart() printfn "Legnth: %d" range printfn "-----------" let h1 = lbp arr 4 sw.Stop() let idx = i - low cpuTimes.[idx] <- range, sw.Elapsed.TotalSeconds printfn "Computed on CPU: %0.5f sec" (snd cpuTimes.[idx]) //Run on GPU sw.Restart() let h2 = lb.Compute() sw.Stop() gpuTimes.[idx] <- range, float sw.Elapsed.TotalSeconds printfn "Computed on GPU: %0.5f sec" (snd gpuTimes.[idx]) printfn "" // make sure we are ok should equal h1 h2 Chart.Combine( [Chart.Line(cpuTimes, Name="CPU") Chart.Line(gpuTimes, Name="GPU") ] ) .WithYAxis(Log=true, Title = "sec") .WithXAxis(Min=10.**float low, Log=true, Title = "length") .WithLegend(InsideArea=true)

The helper “generate” generates a random integer array of arbitrary length, and the “experiment” function runs the experiments on arrays of sizes 10 ** [low..high]

with

experiment 3 7

I got:

Legnth: 1000

-----------

Computed on CPU: 0.00937 sec

Computed on GPU: 0.00285 sec

```
```Legnth: 10000

-----------

Computed on CPU: 0.02445 sec

Computed on GPU: 0.00308 sec

Legnth: 100000

-----------

Computed on CPU: 0.17697 sec

Computed on GPU: 0.00388 sec

Legnth: 1000000

-----------

Computed on CPU: 1.70085 sec

Computed on GPU: 0.00412 sec

`Legnth: 10000000`

-----------

Computed on CPU: 16.53772 sec

Computed on GPU: 0.03045 sec

Here the straight F# mode runs about twice as fast as pure Python:

Length: 1000

--------------

Finished on CPU: time: 0.02263s

```
```Length: 10000

--------------

Finished on CPU: time: 0.28701s

Length: 100000

--------------

Finished on CPU: time: 2.88549s

`Length: 1000000`

--------------

Finished on CPU: time: 29.97346s

The GPU performances are about comparable between Python and Numba and F# with Alea.CUDA. Same card: GTX Titan with 6Gb RAM, 14 SMPs

Of course a mandatory chart produced with FSharpChart and the following small scriptlet:

Chart.Combine( [Chart.Line(cpuTimes, Name="CPU") Chart.Line(gpuTimes, Name="GPU") ] ) .WithYAxis(Log=true, Title = "sec") .WithXAxis(Min=10.**float low, Log=true, Title = "length") .WithLegend(InsideArea=true)

(Still not sure, why my X axis starts from 100 and not from 1000, but I guess it’s minor :)).

A great experience doing all this overall: smooth and painless.

## Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 2: Algorithms

In the previous post I spoke briefly about motivations for implementing self-organizing maps in F# using GPU with CUDA. I have finally been able to outperform a single threaded C++ implementation by a factor of about 1.5. This is quite modest, but on the other hand rather impressive since we started out by being 60 times slower. At this point I am bidding farewell to F# and switching to C++. It will be interesting to see how this works out, but here are my initial algorithms.

So, we are parallelizing the following:

member this.GetBMU (node : Node) = let min = ref Double.MaxValue let minI = ref -1 let minJ = ref -1 this.somMap |> Array2D.iteri (fun i j e -> let dist = getDistance e node this.Metric if dist < !min then min := dist; minI := i; minJ := j) !minI, !minJ

Here somMap is just a two-dimensional array of Node-s. And a Node is simply an array of float’s (double’s in C#), of the size equal to dimensionality of our space.

The code for getDistance is also simple:

let getDistanceEuclidian (x : Node) (y : Node) = Math.Sqrt([0..x.Dimension - 1] |> Seq.fold(fun sq i -> sq + (x.[i] - y.[i]) ** 2.) 0.) let getDistance (x : Node) (y : Node) metric = if x.Dimension <> y.Dimension then failwith "Dimensions must match" else match metric with | Euclidian -> getDistanceEuclidian x y | Taxicab -> getDistanceTaxicab x y

In my parallel version I am only using Euclidian distance for simplicity.

### Parallel Algorithms

As I have mentioned above, Alea.cuBase (my F#-to-CUDA framework) does not support multidimensional arrays (as well as shared memory or calling a kernel from a kernel), so this is the price I had to pay for sticking to my favorite language. Given these constraints, here is what I came up with:

#### Node-by-node

The very first idea (that proved also the very best) is quite intuitive. To find BMUs for the entire set of nodes, just take each individual node, find its BMU in parallel, repeat.

The map is first flattened into a single dimension, where if the cell coordinates were (i, j, k), they are mapped to (i * j * k + j * k + k). All distance sub-calculations can be performed in one shot, and then one thread finishes them up. By “sub-calculation” I mean computing (node(i) – map(j)) ** 2. These are then added up to the distance squared (I don’t calculate sqrt, since I don’t need it to find the minimum).

So, here is the implementation:

let pDistances = cuda { let! kernel = <@ fun nodeLen len (node : DevicePtr<float>) (map : DevicePtr<float>) (distances : DevicePtr<float>) (minDist : DevicePtr<float>) (minIndex : DevicePtr<int>) -> // index into the original map, assuming // a node is a single entity let mapI = blockIdx.x * blockDim.x // actual position of the node component in the map let i = mapI + threadIdx.x if i < len then // index into the node let j = threadIdx.x % nodeLen distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j]) if threadIdx.x = 0 then __syncthreads() let mutable thread = 0 minIndex.[blockIdx.x] <- -1 // find the minimum among threads while mapI + thread < len && thread < blockDim.x do let k = mapI + thread let mutable sum = 0. for j = 0 to nodeLen - 1 do sum <- sum + distances.[k + j] if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then minDist.[blockIdx.x] <- sum minIndex.[blockIdx.x] <- k / nodeLen thread <- thread + nodeLen @> |> defineKernelFunc let diagnose (stats:KernelExecutionStats) = printfn "gpu timing: %10.3f ms %6.2f%% threads(%d) reg(%d) smem(%d)" stats.TimeSpan (stats.Occupancy * 100.0) stats.LaunchParam.BlockDim.Size stats.Kernel.NumRegs stats.Kernel.StaticSharedMemBytes return PFunc(fun (m:Module) (nodes : float [] list) (map : float []) -> let kernel = kernel.Apply m let nodeLen = nodes.[0].Length let chunk = map.Length let nt = (256 / nodeLen) * nodeLen // number of threads divisible by nodeLen let nBlocks = (chunk + nt - 1)/ nt //map.Length is a multiple of nodeLen by construction use dMap = m.Worker.Malloc(map) use dDist = m.Worker.Malloc<float>(map.Length) use dMinIndices = m.Worker.Malloc<int>(nBlocks * nodes.Length) use dMinDists = m.Worker.Malloc<float>(nBlocks * nodes.Length) use dNodes = m.Worker.Malloc(nodes.SelectMany(fun n -> n :> float seq).ToArray()) let lp = LaunchParam(nBlocks, nt) //|> Engine.setDiagnoser diagnose nodes |> List.iteri (fun i node -> kernel.Launch lp nodeLen chunk (dNodes.Ptr + i * nodeLen) dMap.Ptr dDist.Ptr (dMinDists.Ptr + i * nBlocks) (dMinIndices.Ptr + i * nBlocks)) let minDists = dMinDists.ToHost() let indices = dMinIndices.ToHost() let mins = (Array.zeroCreate nodes.Length) for i = 0 to nodes.Length - 1 do let baseI = i * nBlocks let mutable min = minDists.[baseI] mins.[i] <- indices.[baseI] for j = 1 to nBlocks - 1 do if minDists.[baseI + j] < min then min <-minDists.[baseI + j] mins.[i] <- indices.[baseI + j] mins ) }

To get as much parallelism as possible, I start with 256 threads per block (maximum on Kepler is 1024, but 256 gives the best performance). Since each block of threads is going to compute as many distances as possible, I try to allocate the maximum number of threads divisible by node dimensionality. In my case: 256 / 12 * 256 = 252. Pretty good, we get almost an optimal number of threads per block.

The number of blocks are computed from the length of the map. I want to split this in an optimal way since each block is scheduled on one SP, and I have 192 of those I don’t want them to idle. The algorithm leans towards parallelizing calculations relative to the map (see picture above, I want all those “arrows” to be executed in parallel), so the number of blocks will be (somArray.length + nt – 1) / nt (nt is the number of threads – 252, somArray.length is 200 * 200 * 12 = 480000, the formula above takes into account the fact that this number may not be a multiple of 252, in which case we will need one more incomplete block). My block size is 1905 – pretty good, CUDA devs recommend that to be at least twice the number of multiprocessors. It is necessary to hide latency, which is killer in this development paradigm (you need to rely on your PCI slot to transfer data).

One weird thing here is that I have to allocate a relatively large dDist arrray. This is where temporary values of (map(i) – node(j))**2 go. In reality (if I were writing in C++) I would not do that. I would just use the super-fast shared memory for this throw-away array. I could not get that to work with Alea, although the documentation says it is supported.

Another thing: the call to __syncthreads() is issued under a conditional statement. This, in general, is a horrible idea, because in the SIMT (single instruction multiple threads), threads march “in sync” so to speak, instruction by instruction. Thus, doing what I have done may lead to things hanging as some threads may take different branches and other threads will wait forever. Here, however, we are good, because the only way to go if the condition evaluates to false is to simply quit.

The kernel is called in a loop. One call for each node: 10000 passes (vs 10000 * 40000) in the sequential case. I also make sure that all of my memory is allocated once and everything I need is copied there. Not doing that leads to disastrous consequences, since all you would have in that case is pure latency.

The code is self-explanatory. Once everyone has computed their part of the distance, these parts are assembled by the 0-th thread of the block. That thread is responsible for calculating the final square of the distance, comparing the result of what we already have in the minDist array for this map node and storing that result.

#### Node-by-node optimized

And here lies the mistake that makes this algorithm lose out on performance: there is no need to delegate this work to the 0-th thread (looping over all 252 threads of the block). It is enough to delegate that to each “threadIdx.x % nodeLen = 0″‘s thread of the block, so now 21 threads do this work in parallel, each for only 12 threads. Algorithm re-written this way outperforms everything else I could come up with.

Here is the re-write of the kernel function:

let! kernel = <@ fun nodeLen len (node : DevicePtr<float>) (map : DevicePtr<float>) (distances : DevicePtr<float>) (minDist : DevicePtr<float>) (minIndex : DevicePtr<int>) -> // index into the original map, assuming // a node is a single entity let mapI = blockIdx.x * blockDim.x // actual position of the node component in the map let i = mapI + threadIdx.x if i < len then // index into the node let j = threadIdx.x % nodeLen distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j]) __syncthreads() if threadIdx.x % nodeLen = 0 then minIndex.[blockIdx.x] <- -1 // find the minimum among threads let k = mapI + threadIdx.x let mutable sum = 0. for j = k to k + nodeLen - 1 do sum <- sum + distances.[j] if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then minDist.[blockIdx.x] <- sum minIndex.[blockIdx.x] <- k / nodeLen @> |> defineKernelFunc

This kernel function only stores “winning” (minimal) distances within each section of the map. Now they need to be reduced to one value per node. There are 1905 blocks and 10000 nodes. 10000 minimum values are computed in one pass over the 1905 * 10000-length array of accumulated minimal distances:

let minDists = dMinDists.ToHost() let indices = dMinIndices.ToHost() let mins = (Array.zeroCreate nodes.Length) for i = 0 to nodes.Length - 1 do let baseI = i * nBlocks let mutable min = minDists.[baseI] mins.[i] <- indices.[baseI] for j = 1 to nBlocks - 1 do if minDists.[baseI + j] < min then min <-minDists.[baseI + j] mins.[i] <- indices.[baseI + j] mins

And we are done. The improved version beats all the rest of my algorithms. Since all of these performance improvements looked so “juicy”, I decided it was high time to abandon managed code and go back to the C++ world. Especially since writing C++ code is not going to be so different: no annoying loops to write, just express it linearly and reap the massively parallel goodness.

## Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 1: F#

`By 2017, it is expected that GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture. Such a system eliminates some of accelerator architecturesâ€™ historical challenges, including requiring the programmer to manage multiple memory spaces, suffering from bandwidth limitations from an interface such as PCI Express for transfers between CPUs and GPUs, and the system-level energy overheads for both chip crossings and replicated chip infrastructure.`

```
```This is going to be all about parallelism, CUDA, performance, and new direction in software development. For me personally anyway. All the code mentioned below is here.

So, here we've got it. The need for SOMs. SOMs are wonderful for clustering and visualizing high dimensional data. Naturally, using the sacred principle of software development (best software engineers steal) I looked around for some existing code. Found this and this very fast. The first of these shows you very quickly what SOMs are and how to do build them step-by-step, while the second already has the C++ code that can just be taken as is and used for computing SOMs. Which was what I did.

In my experiments, I used around 12,000 12-dimensional nodes with a map of 200x200. On my puny i5 650 (3.2 Ghz), 8 Gb RAM, generating a SOM with these parameters takes around 4 hrs, maybe less. One "epoch" takes around 40 sec and I run 500 epochs, however, since the neighborhood of code vectors that gets trained in one epoch gradually diminishes, it is not a straight multiplication of 500 * 40.

These experiments have actually not yielded the results I was hoping for, perhaps because the training set is not large enough for the data I am trying to cluster. Be it as it may, more experiments are needed with a larger dataset, and I am already at the brink of feasibility as far as performance. It does increase linearly with the number of nodes. The C++ code that I have (stolen) is actually pretty good, but it is single-threaded, so doing things in parallel seems to be a logical next step.

### Step 0. CPU F#

At this point, I was actually interested more in performance improvements than in re-implementing SOM in F#. I did re-implement it in F#, but for my performance bench-marking I did not use the whole algorithm, just the first part where BMUs are calculated. Since BMU is a map vector that is closest to the given node, in order to compute one BMU it is necessary to iterate over the entire map. So computing BMUs for the entire set of nodes gets us the cost of O(n * d * m1 * m2) (m1, m2 are map dimensions, n is the length of the nodes array, d is dimensionality of each node vector). That's for one epoch only. And there are 500 of those. It adds up.

My F# implementation computed the BMU for one 12 dimensional node on a 200x200 SOM in a whopping 140 ms. Vs just 4ms for C++. I did expect a perfromance drop from C++, I just did not expect it to be that drastic.

member this.GetBMU (node : Node) =
let min = ref Double.MaxValue
let minI = ref -1
let minJ = ref -1
this.somMap |> Array2D.iteri (fun i j e ->
let dist = getDistance e node this.Metric
if dist < !min then min := dist; minI := i; minJ := j)
!minI, !minJ

Then I added parallelism. And Parallel.ForEach worked slightly better than Parallel.For.

member this.GetBMUParallel (node : Node) =
let monitor = new obj()
let minList = ref []
Parallel.ForEach(
Partitioner.Create(0, fst dims),
(fun () -> ref (Double.MaxValue, -1, -1)),
(fun range state local ->
let mutable(min, minI, minJ) =
match !local with
| m, i, j -> m, i, j
for i = fst range to snd range - 1 do
for j = 0 to snd this.Dimensions - 1 do
let dist = getDistance this.somMap.[i, j] node this.Metric
if dist < min then
min <- dist; minI <- i; minJ <- j
local := (min, minI, minJ)
local),
(fun local -> lock monitor (fun () ->
match !local with
| m, i, j when i > 0 ->
minList := (m, i, j) :: !minList
|_ -> ()
))) |> ignore
let minTuple = !minList |> List.minBy (fun (x, i, j) -> x)
match minTuple with
| x, i, j -> i, j

Nothing fancy here. Split the first dimension of the map into chunks and try processing them as much as possible in parallel, by utilizing all the 4 logical cores. The inner loop could also be re-written the same way (to use Parallel.For or Parallel.ForEach), but it would probably not do much good since we are already as parallel as we can be. (And in reality it did not. Do any good, that is). While I expected an at least 4-fold performance increase, I did not get. I did get a 2 times increase. Now it only took 70 ms for one node.

### Going massively parallel

At this point, things are really intuitive. If it were up to me, I'd do every calculation there is in parallel and then reduce them once they are done. If it takes, I dunno, say 0.001 mks to multiply 2 numbers in one processor thread, how long does it take to multiply 12000 * 12 * 2 numbers on 144000 processors? Obviously the same 0.001 ms. So the problem becomes almost constant in the number of nodes if we only could always have as many processors as the number of nodes * their dimensions.

Reality is of course vastly different but it does not have to be measured by the number 4 (of CPU cores). Thus, CUDA or OpenCL. I invested $166 in a Quadro K600 Graphics card, which has 192 cores and 1 Gb of on-board RAM. I still wanted to remain faithful to F#, so I looked for a .NET/F# CUDA framework. After briefly evaluating several such frameworks (and they are all in different stages of nascent at this point), I picked Alea.cuBase from QuantAlea.

#### The Framework

Alea.cuBase is pretty neat. I like the paradigm - using quotations to write CUDA code. The documentation is pretty good, gets you up and running very quickly, once the basic concepts of CUDA have been grasped. There are problems, though.

- I could not get shared memory to work despite claims that it is supported. Things just crashed.
- No support yet for multi-dimensional arrays. This was kind of a bummer, because it meant some preprocessing on my part to get things going on the GPU. Oh well. Whatever doesn't kill you...

So how did I do having re-written things to run massively in parallel (with one hand cuffed to the radiator, since I could not use multidimensional arrays)? Well, I did several implementations, I will describe them next time, but here are the charts.

#### Results

The results are averaged over 3 repetitions. Y-axis values are in ms, both axes are logarithmic. Experiments, using CPU are in dashed lines, they start fading at the point where I simply estimated the run-time based on previous performance (did not want to wait 168 sec). On the x-axis are the number of nodes (200x200 SOM, 12 dimensions). I finally did outperform a single-threaded C++ implementation with the "GPU-iterations" algorithm. (Why is its performance so staggeringly awful on small sizes of the dataset? I will talk about it in my next post). Although the gains are not that impressive. At the end I was able to shave about 13-16 seconds off of the "real" 12 000 strong dataset. Which, I guess, is not bad, although not quite there yet... Why? As it turns out, the parallel part of all of this has never exceeded 50% of the total run-time. Which means, the algorithms work in the "hurry-up-and-wait" mode. While parallel calculations do their part pretty fast, negotiating things back and forth with .NET kills the performance.

Still, I made some optimizations and here is what I got:

Notice, how the "GPU-node-by-node" algorithm from being a loser actually advanced to the first place. This was due to a very small change.

At the end, I absolutely fell in love with GPUs and CUDA. It really demystifies parallel programming. All of the behaviors I encountered while experimenting with different implementations were immediately obvious and predictable. I also changed quite a few of my previous convictions about software development. I will talk about it next time (with code in hand).

```
```

```
```## Outperforming MathNet with Task Parallel Library

February 25, 2012
3 comments
Math.NET is a great project that brings numerics to .NET the OS way. Perusing their blog, I found this post on the on-line algorithm for std calculation. A simple Wiki search revealed this article that describes a parallel algorithm due to Chan for calculating variance (std = sqrt(variance)). So, I set to figure out whether I could improve on Math.NET in the case of a large dataset.

First of all, my algorithm is slightly different from Math.NET. I have implemented the on-line algorithm from the Wikipedia article, since Chan’s parallel algorithm uses it to merge datasets calculated “on-line”:

(Full sources on my GitHub)

[<DebuggerDisplay("{Display}")>]
type auxValues =
{M2: float; mean : float; n : float
}
with
member private a.Display = "M2: " + a.M2.ToString() + "; mean: " + a.mean.ToString() + "; n: " + a.n.ToString()
let auxValuesOfSet (data : double []) start finish =
let mutable n = 0.
let mutable mean = 0.
let mutable M2 = 0.
for x = start to finish do
n <- n + 1.
let delta = data.[x] - mean
mean <- mean + delta/n
if n > 1. then M2 <- M2 + delta * (data.[x] - mean)
{M2 = M2; mean = mean; n = float(finish - start + 1)}
let Variance (data : double []) =
(auxValuesOfSet data 0 (data.Length - 1)).M2 / (float data.Length)

Math.NET implements the “naive” algorithm, and, as my experiments showed, it performs better than the on-line algorithm on large datasets. On a dataset of 100,000,000 doubles, Math.NET gives me about 3330 ms, while the on-line algorithm gives about 4260 ms. The driver program source is here.

Now the fun begins with parallel. I am using TPL (Task Parallel Library) with its promise to utilize all of the cores of my machine (Intel i5, 4 logical cores @ 3.2 GHz, 8Gb RAM).

My first crack at parallelizing was to simply split the dataset in two and run both as parallel tasks:

let combineM2s (r1 : auxValues) (r2 : auxValues) =
let delta = r1.mean - r2.mean
let deltaSq = delta * delta
let n = r1.n + r2.n
let M2 = r1.M2 + r2.M2 + deltaSq * r1.n *r2.n / n
let mean = (r1.n * r1.mean + r2.n * r2.mean) / n
{M2 = M2; mean = mean; n = n}
let Variance2(data : double []) =
let partition = data.Length / 2
let finish = data.Length - 1
let tasks =
[|
Task.Factory.StartNew(fun () -> auxValuesOfSet data 0 partition)
Task.Factory.StartNew(fun () -> auxValuesOfSet data (partition + 1) finish)
|]
let results = Task.Factory.ContinueWhenAll(tasks, fun tasks -> tasks |> Array.map(fun (v : Task<auxValues>) -> v.Result)).Result
let res = combineM2s results.[0] results.[1]
res.M2 / res.n

Surprisingly enough performance, instead of doubling as I expected, grew by about a factor of 8: 478ms for the on-line algorithm. Here combineM2s implements Chan’s parallel algorithm which joins to sets of variance calculations.

Calculations now flying, I decided to do it right and have TPL partition my dataset optimally between the cores. I then wrote the following that, naturally, does not work (although it took me some time to figure out why):

let VarianceForDoesntWork (data : double []) =
let monitor = new obj()
let m2 = ref {M2 = 0.; n= 0.; mean = 0.}
Parallel.ForEach(
Partitioner.Create(0, data.Length),
(fun () -> {M2 = 0.; n= 0.; mean = 0.}),
(fun range state local -> auxValuesOfSet data (fst range) ((snd range) - 1)
),
(fun (local : auxValues) -> lock monitor (fun () -> do m2:= combineM2s local !m2))) |> ignore
(!m2).M2 / (!m2).n

So why does this not work? Superficially, everything seems in order. We have a Parallel.For, which partitions our collection in chunks by creating tuples that stand for ranges of indices into the collection: [incluse, exclusive). Each iteration gets a local instantiation (line 07), then the on-line algorithm is called on the subset of the collection (line 08) and then Chan’s algorithm is called to keep accumulating results of individual on-line algorithms (line 10).

This, however, does not work, because ForEach accumulates local and final results not per-iteration, but per-task. A task may be composed of several iterations, however, which means, that this form of ForEach is in actuality an aggregator and needs to be written accordingly. The following implementation will work:

let VarianceForCummul (data : double []) =
let monitor = new obj()
let m2list : auxValues list ref = ref []
Parallel.ForEach(
Partitioner.Create(0, data.Length),
(fun () -> ref []),
(fun range state local ->
local := auxValuesOfSet data (fst range) ((snd range) - 1) :: !local; local),
(fun local -> lock monitor (fun () -> m2list := !m2list @ !local))) |> ignore
let res = !m2list |> List.reduce combineM2s
res.M2 / res.n

Here each task produces a list of what the Wiki article calls M2’s, calculated by its (the task) iterations of the body of the ForEach loop over the ranges, that this task gets to execute (it may produce several such “local” values, and they are all kept in a list), and then those are wrapped into a final result by calling List.reduce with Chan’s aggregation.

Finally a home-grown implementation of Parallel.For:

let VarianceForTask (data : double []) =
let monitor = new obj()
let partitioner = Partitioner.Create(0, data.Length)
let partitions = partitioner.GetDynamicPartitions()
let tasks =
[| for p in partitions ->
Task.Factory.StartNew(fun () -> auxValuesOfSet data (fst p) ((snd p) - 1))
|]
let results = Task.Factory.ContinueWhenAll(tasks, fun tasks -> tasks |> Array.map(fun (v : Task<auxValues>) -> v.Result)).Result
let res = results |> Array.reduce combineM2s
res.M2 / res.n

This works just as well as Parallel.For in my experiments: both give around 350 – 370ms. A roughly 20% improvement over the “naive parallel” algorithm.

Categories: F#, Parallel, TPL
calculating variance, parallel algorithm, parallel library

```
```