By 2017, it is expected that GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture. Such a system eliminates some of accelerator architectures’ historical challenges, including requiring the programmer to manage multiple memory spaces, suffering from bandwidth limitations from an interface such as PCI Express for transfers between CPUs and GPUs, and the system-level energy overheads for both chip crossings and replicated chip infrastructure.
This is going to be all about parallelism, CUDA, performance, and new direction in software development. For me personally anyway. All the code mentioned below is here.
So, here we've got it. The need for SOMs. SOMs are wonderful for clustering and visualizing high dimensional data. Naturally, using the sacred principle of software development (best software engineers steal) I looked around for some existing code. Found this and this very fast. The first of these shows you very quickly what SOMs are and how to do build them step-by-step, while the second already has the C++ code that can just be taken as is and used for computing SOMs. Which was what I did.
In my experiments, I used around 12,000 12-dimensional nodes with a map of 200x200. On my puny i5 650 (3.2 Ghz), 8 Gb RAM, generating a SOM with these parameters takes around 4 hrs, maybe less. One "epoch" takes around 40 sec and I run 500 epochs, however, since the neighborhood of code vectors that gets trained in one epoch gradually diminishes, it is not a straight multiplication of 500 * 40.
These experiments have actually not yielded the results I was hoping for, perhaps because the training set is not large enough for the data I am trying to cluster. Be it as it may, more experiments are needed with a larger dataset, and I am already at the brink of feasibility as far as performance. It does increase linearly with the number of nodes. The C++ code that I have (stolen) is actually pretty good, but it is single-threaded, so doing things in parallel seems to be a logical next step.
Step 0. CPU F#
At this point, I was actually interested more in performance improvements than in re-implementing SOM in F#. I did re-implement it in F#, but for my performance bench-marking I did not use the whole algorithm, just the first part where BMUs are calculated. Since BMU is a map vector that is closest to the given node, in order to compute one BMU it is necessary to iterate over the entire map. So computing BMUs for the entire set of nodes gets us the cost of O(n * d * m1 * m2) (m1, m2 are map dimensions, n is the length of the nodes array, d is dimensionality of each node vector). That's for one epoch only. And there are 500 of those. It adds up.
My F# implementation computed the BMU for one 12 dimensional node on a 200x200 SOM in a whopping 140 ms. Vs just 4ms for C++. I did expect a perfromance drop from C++, I just did not expect it to be that drastic.
member this.GetBMU (node : Node) = let min = ref Double.MaxValue let minI = ref -1 let minJ = ref -1 this.somMap |> Array2D.iteri (fun i j e -> let dist = getDistance e node this.Metric if dist < !min then min := dist; minI := i; minJ := j) !minI, !minJ
Then I added parallelism. And Parallel.ForEach worked slightly better than Parallel.For.
member this.GetBMUParallel (node : Node) = let monitor = new obj() let minList = ref  Parallel.ForEach( Partitioner.Create(0, fst dims), (fun () -> ref (Double.MaxValue, -1, -1)), (fun range state local -> let mutable(min, minI, minJ) = match !local with | m, i, j -> m, i, j for i = fst range to snd range - 1 do for j = 0 to snd this.Dimensions - 1 do let dist = getDistance this.somMap.[i, j] node this.Metric if dist < min then min <- dist; minI <- i; minJ <- j local := (min, minI, minJ) local), (fun local -> lock monitor (fun () -> match !local with | m, i, j when i > 0 -> minList := (m, i, j) :: !minList |_ -> () ))) |> ignore let minTuple = !minList |> List.minBy (fun (x, i, j) -> x) match minTuple with | x, i, j -> i, j
Nothing fancy here. Split the first dimension of the map into chunks and try processing them as much as possible in parallel, by utilizing all the 4 logical cores. The inner loop could also be re-written the same way (to use Parallel.For or Parallel.ForEach), but it would probably not do much good since we are already as parallel as we can be. (And in reality it did not. Do any good, that is). While I expected an at least 4-fold performance increase, I did not get. I did get a 2 times increase. Now it only took 70 ms for one node.
Going massively parallel
At this point, things are really intuitive. If it were up to me, I'd do every calculation there is in parallel and then reduce them once they are done. If it takes, I dunno, say 0.001 mks to multiply 2 numbers in one processor thread, how long does it take to multiply 12000 * 12 * 2 numbers on 144000 processors? Obviously the same 0.001 ms. So the problem becomes almost constant in the number of nodes if we only could always have as many processors as the number of nodes * their dimensions.
Reality is of course vastly different but it does not have to be measured by the number 4 (of CPU cores). Thus, CUDA or OpenCL. I invested $166 in a Quadro K600 Graphics card, which has 192 cores and 1 Gb of on-board RAM. I still wanted to remain faithful to F#, so I looked for a .NET/F# CUDA framework. After briefly evaluating several such frameworks (and they are all in different stages of nascent at this point), I picked Alea.cuBase from QuantAlea.
Alea.cuBase is pretty neat. I like the paradigm - using quotations to write CUDA code. The documentation is pretty good, gets you up and running very quickly, once the basic concepts of CUDA have been grasped. There are problems, though.
- I could not get shared memory to work despite claims that it is supported. Things just crashed.
- No support yet for multi-dimensional arrays. This was kind of a bummer, because it meant some preprocessing on my part to get things going on the GPU. Oh well. Whatever doesn't kill you...
So how did I do having re-written things to run massively in parallel (with one hand cuffed to the radiator, since I could not use multidimensional arrays)? Well, I did several implementations, I will describe them next time, but here are the charts.
The results are averaged over 3 repetitions. Y-axis values are in ms, both axes are logarithmic. Experiments, using CPU are in dashed lines, they start fading at the point where I simply estimated the run-time based on previous performance (did not want to wait 168 sec). On the x-axis are the number of nodes (200x200 SOM, 12 dimensions). I finally did outperform a single-threaded C++ implementation with the "GPU-iterations" algorithm. (Why is its performance so staggeringly awful on small sizes of the dataset? I will talk about it in my next post). Although the gains are not that impressive. At the end I was able to shave about 13-16 seconds off of the "real" 12 000 strong dataset. Which, I guess, is not bad, although not quite there yet... Why? As it turns out, the parallel part of all of this has never exceeded 50% of the total run-time. Which means, the algorithms work in the "hurry-up-and-wait" mode. While parallel calculations do their part pretty fast, negotiating things back and forth with .NET kills the performance.
Still, I made some optimizations and here is what I got:
Notice, how the "GPU-node-by-node" algorithm from being a loser actually advanced to the first place. This was due to a very small change.
At the end, I absolutely fell in love with GPUs and CUDA. It really demystifies parallel programming. All of the behaviors I encountered while experimenting with different implementations were immediately obvious and predictable. I also changed quite a few of my previous convictions about software development. I will talk about it next time (with code in hand).