Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 2: Algorithms

In the previous post I spoke briefly about my motivations for implementing self-organizing maps in F# on the GPU with CUDA. I have finally been able to outperform a single-threaded C++ implementation by a factor of about 1.5. This is quite modest but, on the other hand, rather impressive, since we started out 60 times slower. At this point I am bidding farewell to F# and switching to C++. It will be interesting to see how that works out, but here are my initial algorithms.

So, we are parallelizing the following:

    // Find the Best Matching Unit (BMU): the coordinates of the map cell
    // closest to the given node under the current metric.
    member this.GetBMU (node : Node) =
        let min = ref Double.MaxValue   // smallest distance seen so far
        let minI = ref -1               // row of the best cell so far
        let minJ = ref -1               // column of the best cell so far
        this.somMap |> Array2D.iteri (fun i j e -> 
            let dist = getDistance e node this.Metric
            if dist < !min then min := dist; minI := i; minJ := j)
        !minI, !minJ

Here somMap is just a two-dimensional array of Nodes, and a Node is simply an array of floats (doubles in C#) whose length equals the dimensionality of our space.
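A minimal sketch of what such a Node might look like (an assumption for illustration; the real type in the project may differ, but this is all the interface the code above relies on):

    // A node is just a thin wrapper over its weight vector.
    type Node(weights : float []) =
        member this.Dimension = weights.Length          // dimensionality of the space
        member this.Item with get (i) = weights.[i]     // x.[i] component access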

The code for getDistance is also simple:

    // Euclidean distance between two nodes
    let getDistanceEuclidian (x : Node) (y : Node) =
        Math.Sqrt([0..x.Dimension - 1] |> Seq.fold (fun sq i -> sq + (x.[i] - y.[i]) ** 2.) 0.)

    let getDistance (x : Node) (y : Node) metric =
        if x.Dimension <> y.Dimension then failwith "Dimensions must match"
        else
            match metric with
            | Euclidian ->
                getDistanceEuclidian x y
            | Taxicab -> 
                getDistanceTaxicab x y

In my parallel version I am only using the Euclidean distance, for simplicity.
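Since only the relative order of distances matters when searching for a minimum, the square root can be dropped altogether; this is exactly the quantity the kernels below accumulate. A minimal sketch (my illustration, not code from the project), assuming plain float arrays:

    // Squared Euclidean distance: preserves the ordering of distances,
    // so it is sufficient for finding the closest map cell.
    let getDistanceSqEuclidean (x : float []) (y : float []) =
        Array.fold2 (fun acc a b -> acc + (a - b) * (a - b)) 0. x y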

Parallel Algorithms

As I have mentioned above, Alea.cuBase (the F#-to-CUDA framework I am using) does not support multidimensional arrays, shared memory, or calling a kernel from a kernel; this is the price I had to pay for sticking with my favorite language. Given these constraints, here is what I came up with:

Node-by-node

The very first idea (that proved also the very best) is quite intuitive. To find BMUs for the entire set of nodes, just take each individual node, find its BMU in parallel, repeat.

[Figure: Node-by-node algorithm]
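For contrast, the sequential baseline is just a loop over the input nodes, each call scanning the entire map; a sketch (assuming a som value exposing the GetBMU member shown earlier):

    // Sequential baseline (sketch): one full map scan per input node.
    // For a 200 x 200 map and 10000 input nodes this is 10000 * 40000
    // distance computations per pass.
    let findAllBMUs (nodes : Node list) =
        nodes |> List.map (fun node -> som.GetBMU node)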

The map is first flattened into a single dimension: a cell at coordinates (i, j, k), in a map of dimensions (I, J, K), maps to the flat index i * J * K + j * K + k. All the distance sub-calculations can be performed in one shot, and then one thread finishes them up. By “sub-calculation” I mean computing (node(i) - map(j)) ** 2. These are then added up to the squared distance (I don't calculate the square root, since I don't need it to find the minimum).
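To illustrate the indexing (a sketch under the assumption that each node is stored as a plain float array), flattening the 2D map of nodes into the one-dimensional array the kernel consumes could look like this:

    // Flatten a 2D map of nodes into one contiguous float array.
    // Cell (i, j), component k lands at i * w * nodeLen + j * nodeLen + k,
    // where w is the map width and nodeLen the node dimensionality.
    let flattenMap (map : float [] [,]) =
        let w, nodeLen = Array2D.length2 map, map.[0, 0].Length
        let flat = Array.zeroCreate (Array2D.length1 map * w * nodeLen)
        map |> Array2D.iteri (fun i j node ->
            for k = 0 to nodeLen - 1 do
                flat.[i * w * nodeLen + j * nodeLen + k] <- node.[k])
        flat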

So, here is the implementation:

    let pDistances = 
        cuda {
            let! kernel =
                <@ fun nodeLen len
                    (node : DevicePtr<float>) 
                    (map :  DevicePtr<float>)
                    (distances : DevicePtr<float>)
                    (minDist : DevicePtr<float>)
                    (minIndex : DevicePtr<int>) 
                    ->

                    // index into the original map, assuming
                    // a node is a single entity
                    let mapI = blockIdx.x * blockDim.x

                    // actual position of the node component in the map
                    let i = mapI + threadIdx.x  

                    if i < len then                    
                        // index into the node
                        let j = threadIdx.x % nodeLen

                        distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j])
                        if threadIdx.x = 0 then
                            // NOTE: a barrier inside a branch; see the discussion below
                            __syncthreads()
                            let mutable thread = 0
                            minIndex.[blockIdx.x] <- -1
                             
                            // find the minimum among threads
                            while mapI + thread < len && thread < blockDim.x do
                                let k = mapI + thread
                                let mutable sum = 0.
                                for j = 0 to nodeLen - 1 do
                                    sum <- sum + distances.[k + j]
                                if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then
                                    minDist.[blockIdx.x] <- sum
                                    minIndex.[blockIdx.x] <- k / nodeLen
                                thread <- thread + nodeLen
                                    
                    @> |> defineKernelFunc

            let diagnose (stats:KernelExecutionStats) =
               printfn "gpu timing: %10.3f ms %6.2f%% threads(%d) reg(%d) smem(%d)"
                   stats.TimeSpan
                   (stats.Occupancy * 100.0)
                   stats.LaunchParam.BlockDim.Size
                   stats.Kernel.NumRegs
                   stats.Kernel.StaticSharedMemBytes

            return PFunc(fun (m:Module) (nodes : float [] list) (map : float []) ->
                let kernel = kernel.Apply m
                let nodeLen = nodes.[0].Length
                let chunk = map.Length
                let nt = (256 / nodeLen) * nodeLen // number of threads divisible by nodeLen
                let nBlocks = (chunk + nt - 1)/ nt //map.Length is a multiple of nodeLen by construction
                use dMap = m.Worker.Malloc(map)
                use dDist = m.Worker.Malloc<float>(map.Length)
                use dMinIndices = m.Worker.Malloc<int>(nBlocks * nodes.Length)
                use dMinDists = m.Worker.Malloc<float>(nBlocks * nodes.Length)
                use dNodes = m.Worker.Malloc(nodes.SelectMany(fun n -> n :> float seq).ToArray())
                
                let lp = LaunchParam(nBlocks, nt) //|> Engine.setDiagnoser diagnose
                nodes |> List.iteri (fun i node ->
                    kernel.Launch lp nodeLen chunk (dNodes.Ptr + i * nodeLen) dMap.Ptr dDist.Ptr (dMinDists.Ptr + i * nBlocks) (dMinIndices.Ptr + i * nBlocks))
              
                let minDists = dMinDists.ToHost()                                        
                let indices = dMinIndices.ToHost()
                let mins = (Array.zeroCreate nodes.Length)

                for i = 0 to nodes.Length - 1 do
                    let baseI = i * nBlocks
                    let mutable min = minDists.[baseI]
                    mins.[i] <- indices.[baseI]
                    for j = 1 to nBlocks - 1 do
                        if minDists.[baseI + j] < min then
                            min <- minDists.[baseI + j]
                            mins.[i] <- indices.[baseI + j]
                mins
                )
        }

To get as much parallelism as possible, I start with 256 threads per block (the maximum on Kepler is 1024, but 256 gives the best performance). Since each block of threads is going to compute as many distances as possible, I allocate the largest number of threads divisible by the node dimensionality. In my case, with integer division: 256 / 12 * 12 = 252. Pretty good; we get almost the optimal number of threads per block.

The number of blocks is computed from the length of the flattened map. I want to split the work optimally: each block is scheduled on a single multiprocessor, and with 192 cores on my card I don't want any of them to idle. The algorithm leans towards parallelizing the calculations relative to the map (see the picture above: I want all those “arrows” to be executed in parallel), so the number of blocks is (somArray.Length + nt - 1) / nt, where nt is the number of threads (252) and somArray.Length = 200 * 200 * 12 = 480000. The formula accounts for the possibility that this length is not a multiple of 252, in which case one more, incomplete, block is needed. That gives a grid of 1905 blocks, which is pretty good: CUDA developers recommend at least twice as many blocks as multiprocessors. This is necessary to hide latency, which is a killer in this development paradigm (you have to rely on the PCIe bus to transfer data).
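Putting the numbers together (just a restatement of the arithmetic above, not new code from the project):

    let nodeLen = 12                          // dimensionality of a node
    let mapLen  = 200 * 200 * nodeLen         // flattened map length: 480000
    let nt      = (256 / nodeLen) * nodeLen   // threads per block: 21 * 12 = 252
    let nBlocks = (mapLen + nt - 1) / nt      // blocks in the grid: 1905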

One weird thing here is that I have to allocate a relatively large dDist array. This is where the temporary values of (map(i) - node(j)) ** 2 go. In reality (if I were writing in C++) I would not do that; I would just use the super-fast shared memory for this throw-away array. I could not get that to work with Alea, although the documentation says it is supported.

Another thing: the call to __syncthreads() is issued under a conditional statement. In general this is a horrible idea: in the SIMT (single instruction, multiple threads) model, the threads of a block march “in sync”, so to speak, instruction by instruction, and a barrier placed inside a branch may hang the kernel, since threads that take a different branch never reach it while the others wait forever. Here, however, we are good, because the only thing a thread can do if the condition evaluates to false is simply quit.
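The safer pattern, which the optimized kernel below follows, is to reach the barrier on every thread of the block and branch only afterwards; schematically (a sketch in the same quotation style, not a complete kernel):

    <@ fun (data : DevicePtr<float>) ->
        // every thread writes its slot...
        data.[threadIdx.x] <- float threadIdx.x
        // ...every thread reaches the barrier...
        __syncthreads()
        // ...and divergence happens only after the barrier
        if threadIdx.x = 0 then
            data.[0] <- data.[0] + data.[1] @>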

The kernel is called in a loop, one call for each node: 10000 kernel launches (versus 10000 * 40000 distance computations in the sequential case). I also make sure that all of my memory is allocated once and everything I need is copied to the device up front. Not doing that leads to disastrous consequences, since all you would be measuring then is transfer latency.

The code is self-explanatory. Once every thread has computed its part of the distance, those parts are assembled by the 0-th thread of the block. That thread is responsible for computing the final squared distance for each map cell covered by the block, comparing it against what is already stored in the block's minDist slot, and storing the smaller of the two.

Node-by-node optimized

And here lies the mistake that makes this algorithm lose out on performance: there is no need to delegate all of this work to the 0-th thread (which loops over all 252 threads of the block). It is enough to delegate it to every thread for which threadIdx.x % nodeLen = 0, so now 21 threads do the work in parallel, each summing the results of only 12 threads. The algorithm re-written this way outperforms everything else I could come up with.

Here is the re-write of the kernel function:

            let! kernel =
                <@ fun nodeLen len
                    (node : DevicePtr<float>) 
                    (map :  DevicePtr<float>)
                    (distances : DevicePtr<float>)
                    (minDist : DevicePtr<float>)
                    (minIndex : DevicePtr<int>) 
                    ->

                    // index into the original map, assuming
                    // a node is a single entity
                    let mapI = blockIdx.x * blockDim.x

                    // actual position of the node component in the map
                    let i = mapI + threadIdx.x  

                    if i < len then                    
                        // index into the node
                        let j = threadIdx.x % nodeLen

                        distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j])
                        __syncthreads()
                        if threadIdx.x % nodeLen = 0 then
                            minIndex.[blockIdx.x] <- -1
                            
                            // find the minimum among threads
                            let k = mapI + threadIdx.x
                            let mutable sum = 0.
                            for j = k to k + nodeLen - 1 do
                                sum <- sum + distances.[j]
                            if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then
                                minDist.[blockIdx.x] <- sum
                                minIndex.[blockIdx.x] <- k / nodeLen
                                    
                    @> |> defineKernelFunc

This kernel function only stores the “winning” (minimal) distances within each block's section of the map. These still need to be reduced to one value per node. There are 1905 blocks and 10000 nodes, so the 10000 minimum values are computed in one pass over the 1905 * 10000-long array of accumulated per-block minima:

                let minDists = dMinDists.ToHost()                                        
                let indices = dMinIndices.ToHost()
                let mins = (Array.zeroCreate nodes.Length)

                for i = 0 to nodes.Length - 1 do
                    let baseI = i * nBlocks
                    let mutable min = minDists.[baseI]
                    mins.[i] <- indices.[baseI]
                    for j = 1 to nBlocks - 1 do
                        if minDists.[baseI + j] < min then
                            min <- minDists.[baseI + j]
                            mins.[i] <- indices.[baseI + j]
                mins

And we are done. The improved version beats all the rest of my algorithms. Since all of these performance improvements looked so “juicy”, I decided it was high time to abandon managed code and go back to the C++ world. Especially since writing C++ code is not going to be so different: no annoying loops to write, just express it linearly and reap the massively parallel goodness.
