This blog chronicles my experiences learning F# via implementing Push: a programming language designed for use in evolving populations of programs geared for solving symbolic regression problems (genetic programming).

Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 2

August 22, 2014 Leave a comment

Compute the mathces

Here is a simple, purely brute-force algorithm for computing the join mentioned in Part 1.

Here is the entirely “CPU” implementation of the algorithm:

std::vector<bool> matches(size);
std::fill(matches.begin(), matches.end(), false);

for (int i = 0; i < size; i++)
    for (int j = 0; j < classes.size(); j++)
        bool match = classes[j] == classCode[i];
        if (match)
            matches[i] = match;

Loop over both datasets, compare them one-by-one, if there is a match – flag it. The only thing to note is that while it’s tempting to write the inner loop like this:

matches[i] = classes[j] == classCode[i];

it would be an obvious bug to do so.

This has a loathsome O(m*n) complexity (m, n – sizes of the datasets), and, in general this code is criminal.

In an attempt to decriminalize it, we delegate this work to the GPU as follows:

__global__ void joinValue(const int left, const int * const right, int rightSize, bool * const matches)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rightSize)
        bool match = left == right[idx];
        if (match)
            matches[idx] = match;

__global__ void join(const int * const left, const int * const right, int leftSize, int rightSize, int childBlocks, int childThreadsPerBlock, bool * const matches)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < leftSize)
        joinValue<<<childBlocks, childThreadsPerBlock>>>(left[idx], right, rightSize, matches);


Pretty simple. The “join” kernel is used to “loop” over the filtering map one entry per thread. This calls a child kernel to “loop” over the “database table”. The output array of “matches” is originally instantiated to “false”. We are still O(m*n) in terms of work complexity, but step complexity is O(1) – just two steps are necessary.

Thrust is used to extract raw pointers to the data and pass them to the kernels. No need for CUDA boiler-plate to allocate, copy, and free the memory:

const int * const right = thrust::raw_pointer_cast(;
bool * const matchesRaw = thrust::raw_pointer_cast(;
const int * const left = thrust::raw_pointer_cast(;

//nBlocks: calculated so we have one thread per element of the "left" dataset
int nBlocks = (size + threadCount) / threadCount;
//the child kernel has nChildBlocks: one thread per element of the "right" dataset
int nChildBlocks = ((int)d_patentInfo.classCode.size() + threadCount) / threadCount;

// launch join kernel
int rightSize = (int)d_patentInfo.classCode.size();
join<<<nBlocks, threadCount>>>(left, right, size, rightSize, nChildBlocks, threadCount, matchesRaw);
//... and wait for it to finish

Filter the data

From this point on, everything is done with Thrust.

We count the number of matches and allocate a structure of arrays on the GPU to receive them. The constructor for the structure allocates every one of its arrays.

int nMatches = (int)thrust::count(matches.begin(), matches.end(), true);
PatentInfo relevantPatents(nMatches);

We copy all of the filtered data into the new location in one shot, using the Thrust zip iterators:

// output iterator. "matches" at the end has a larger size than nMatches.
PatentZipIterator relevantPatentsIter = 
                              relevantPatents.classCode.begin(), relevantPatents.appYear.begin(), 
                              relevantPatents.issueYear.begin(), relevantPatents.companyId.begin(), 
                              relevantPatents.isImportant.begin(), matches.begin()));

//input iterator
PatentZipIterator patentZipIter = 
    thrust::make_zip_iterator(thrust::make_tuple(d_patentInfo.patentId.begin(), d_patentInfo.classCode.begin(), 
                              d_patentInfo.appYear.begin(), d_patentInfo.issueYear.begin(), 
                              d_patentInfo.companyId.begin(), d_patentInfo.isImportant.begin(), matches.begin())); 

PatentZipIterator patentZipIterEnd = 
    thrust::make_zip_iterator(thrust::make_tuple(d_patentInfo.patentId.end(), d_patentInfo.classCode.end(), 
                              d_patentInfo.appYear.end(), d_patentInfo.issueYear.end(), 
                              d_patentInfo.companyId.end(),  d_patentInfo.isImportant.end(), matches.end()));

//copy to the output if we are matching
thrust::copy_if(patentZipIter, patentZipIterEnd, relevantPatentsIter, PatentInfo::is_matching());

Here PatentZipIterator is defined as follows:

typedef thrust::tuple<IntIterator, IntIterator, IntIterator, IntIterator, IntIterator, BoolIterator, BoolIterator> PatentIteratorTuple;
typedef thrust::zip_iterator<PatentIteratorTuple> PatentZipIterator;

A zip iterator in Thrust is similar to Zip() function in LINQ. In LINQ, Zip() is limited to two sequences, while in Thrust zip iterator can be used with as many sequences as the thrust::tuple type can handle. In large part, the purpose of this exercise was to take this iterator for a ride: an obvious optimization (discussed later) would not really require this complexity.

In this implementation, we zip up all the arrays in our PatentInfo structure together with the array of matches in a septuple. thrust::make_zip_iterator() takes a tuple of iterators and creates a single iterator that allows us to march thrust algorithms on seven arrays in lock-step. So, thrust::copy_if() – our gathering algorithm can examine one element at a time using PatentInfo::is_matching functor and see if this is a “matched” element. The predicate functor is implemented as follows:

typedef thrust::tuple<int, int, int, int, int, bool, bool> PatentTuple;

struct is_matching : public thrust::unary_function < PatentTuple, bool >
    __host__ __device__
        bool operator() (const PatentTuple& patentTuple)
        return thrust::get<6>(patentTuple);

Each element fed to the predicate is a septuple wrapped into the PatentTuple type, the last member of each septuple is an element of the matches[] array that determines the return value of the predicate.

After the gathering algorithm is done, we are ready to move the results to the host memory, which is done again painlessly by Thrust.




Performance is, unfortunately, ridiculously high. “Unfortunately” because bad performance is a great drive force of innovation and improvement. Here is the summary (in seconds, avg) for a dataset of 20,000,000 “patent ids” joined over 4000 classes.

7(actually for 5,000,000 over 3000 classes) 80 0.012

GPU performance does not include moving the data back into the host memory – this maybe takes a second. Deploying the original data to the GPU is done “off-line”, so is of little interest, but it also takes a second or so.


One optimization is so glaring, that I cannot pass it by. Leave the first step intact, but then, instead of moving data to/from the GPU, convert the array of “matches” to the array of indices matched and return it to the user. No need for zip iterators, no need to shovel lots of data to/from the GPU:

  1. Deploy the list of “keys” (original list of patentIds in this case) to the GPU
  2. Find matches as before
  3. Convert the matches array into a compacted array of indices
  4. return the array to the calling application

Then the calling application can scan the array of indices on the CPU in O(nMatches) time, and return the actual filtered dataset.
Besides the fact, that much less data needs to be moved between different spaces (managed, native, GPU), we also free the GPU implementation of any data complexity: no need to know what the table we are filtering actually looks like, as long as we are filtering on a single integer key!

Categories: CPP, CUDA

Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 1

August 21, 2014 1 comment

This is a post in two parts:
Part 1 – The problem, solution setup, the algorithm.
Part 2 – (The juicy) Implementation details, discussion.

Suppose at the heart of the data layer of a web application there is a join like this:

select distinct p.PatentID, p.ClassId, p.CompanyId, p.IssueYear, p.AppYear, p.IsImportant
from @topClasses t
join Patents p on p.ClassId = t.ClassId

This join filters patents belonging to a set of classes from the Patents table.

Semantics are not that important, but the fact that a table of classes is on the order of a few thousands, while that of patents – a few million, and so is the resulting filtered dataset. Classes table variable contains unique entries for classes, and patents belonging to these classes are extracted. This join takes around 6 – 8 sec on SQL Server 2014 (timing SSMS, probably even less time in reality), not a whole lot of time, but a lifetime in some contexts. Such as a web application’s, for example.

Enter GTX Titan, fun begins.

Brute Force, Massively Parallel Solution

We would like to deploy an equivalent of the Patents table to the GPU and see if we can speed up the join. For this we need:

  1. Define the data structures and deploy the data
  2. Filter (“join”) the data based on our set of keys
  3. Gather the result of this join
  4. Move it out from the GPU back to main memory

Note: There is more, of course: somehow the data needs to persist on the GPU between web server calls, requests need to be synchronized, and data has to be marshalled back and forth between managed and native space. These are all pretty standard problems I am not addressing in this post.

Representing the data.

The data has been quite well normalized, so we are dealing with a bunch of 4-byte integers (and a boolean).  This makes – 21 bytes per record, assuming booleans are stored as bytes. Since our table size is in the millions of records, we have in the order of tens of megabytes of data we would like to store on the GPU. With GTX Titan and its 6GB of memory, we are good!

The most  intuitive way to represent this data on the device would be an array of structures:

struct PatentInfo
    int patentId;
    int classCode;
    int appYear;
    int issueYear;
    int companyId;
    bool isImportant;


Defining an optimized structure with Thrust

However, one of the basics of GPU optimization is coalesced global memory access. We are joining on just one field:  classCode, and we want coalesced access to that field in particular.

So, we define our structure as a structure of arrays:

#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <thrust/transform_scan.h>
#include <thrust/copy.h>
#include <thrust/count.h>

typedef thrust::device_vector<int> IntVector;
typedef thrust::device_vector<bool> BoolVector;
typedef thrust::device_vector<int>::iterator IntIterator; 
typedef thrust::device_vector<bool>::iterator BoolIterator;

struct PatentInfo
    IntVector patentId;
    IntVector classCode;
    IntVector appYear;
    IntVector issueYear;
    IntVector companyId;
    BoolVector isImportant;

Here NVIDIA Thrust library is used to wrap device containers. I found this library to be quite helpful. Essential, really. In particular it’s extremely easy to deploy data to the GPU:

IntVector d_vector = h_vector;

The above copies the thrust::host_vector<int> to the device. Or, if the data came from the managed space as an array:

IntVector d_vector(size);
thrust::copy(h_data, h_data + size, d_vector.begin());

will copy the unmanaged host data stored in h_data to the device.
NOTE: It is important to keep in mind that all of the Thrust code should be placed in .cu files (or files included in .cu files), so NVCC compiler can get to it.

Now that the data is safely on the device…

The Algorithm

This is a simple, brute force algorithm. I am not a believer in brute force, so I would prefer to optimize what I currently have, unfortunately, it performs incredibly well and thus earns its right to exist.

Here it is in a nutshell (with the help of .NET LINQ notation in C#):
Given the PatentInfo patentInfo structure (where each array is of the size num_patents), and int const * const classes array of size num_classes , of distinct (preferably, but not necessarily) classes:

  1. (In a massively parallel fashion), build an array bool matches[num_of_patents] where matches[i] = true iff there exists j < num_classes: classes[j] == patentInfo[i]
  2. Allocate a structure (on the device) PatentInfo relevantPatents of the size == matches.Count(m => m) – “true” matches.
  3. Copy all the arrays from the original structure into the new one, such that: array.Where((elem, i) => matches[i]) – index of the element corresponds to a “true” match.
  4. Copy the structure above to the host memory and return it


Categories: CPP, CUDA

Compiling CUDA Projects with Dynamic Parallelism (VS 2012/13)

August 17, 2014 Leave a comment

Just a quick note.
If you are starting from a template C++ CUDA project in VS 2012/2013, calling a kernel from a kernel (dynamic parallelism) would not compile:

error : kernel launch from __device__ or __global__ functions requires separate compilation mode

To fix this, first make sure your hardware supports it (cc 3.5 or higher) and compute capability is set correctly in the compiler options (“compute_35,sm_35″)


Then set -rdc=true in CUDA C++ options:


Then add cudadevrt.lib to the libraries that are statically linked in the Linker options:


Categories: CUDA

D3 Fisheye Distortion for Bar Charts

February 25, 2014 Leave a comment


Focus + context visualizations are quite useful when we want to zoom into some part of a visualization, but are unwilling to give up the “bird’s eye” view of the entire picture. In this case, we distort the part of the visual on which we want to focus, while preserving the entire view.

In D3, this task is accomplished through “fisheye distortion”. Mike Bostock has examples of usage of his fisheye plugin on his site.

The problem is this plugin does not support bar charts, where the application could also be quite useful. I believe lack of support is explained by the fact that the ordinal scale rangeBand() function does not take any parameters, which is fine for any bar chart: all you need is to split your range into equal regions. When applying a distortion, however, the chunk of the range devoted to a particular part of the input depends on the position of this particular chunk. So extending the plugin is really trivial, just need to provide a new signature for the rangeBand() function.

Here is what a finished example looks like:


The complete code for it, including the plugin can be found on JSFiddle.

In order to create it, I have modified an existing example by a StackOverflow user (thank you very much!).


The actual fisheye plugin spans lines 1 – 144 in the jsfiddle cited above. Cut it, save to a file.

  1. Create your ordinal scale for the bar chart using the plugin:
    var x = d3.fisheye.ordinal().rangeRoundBands([0, w - p[1] - p[3]])
  2. Replace all calls to rangeBand() with calls to rangeBand(d.x):

    // Add a rect for each date.
    var rect = cause.selectAll("rect")
    .attr("x", function(d) { return x(d.x); })
    .attr("y", function(d) { return -y(d.y0) - y(d.y); })
    .attr("height", function(d) { return y(d.y); })
    .attr("width", function(d) {return x.rangeBand(d.x);});
    // Add a label per date.
    var label = svg.selectAll("text")
    .attr("x", function(d) { return x(d) + x.rangeBand(d.x) / 2; })
    .attr("y", 6)
    .attr("text-anchor", "middle")
    .attr("dy", ".71em")
  3. Add mouse interaction to your main container, to update the focus of the distortion, and redraw the affected elements:

    //respond to the mouse and distort where necessary
    svg.on("mousemove", function() {
        var mouse = d3.mouse(this);
        //refocus the distortion
        //redraw the bars
        .attr("x", function(d) { return x(d.x); })
        .attr("width", function(d) {return x.rangeBand(d.x);});
        //redraw the text
        label.attr("x", function(d) { return x(d) + x.rangeBand(d.x) / 2; });

Et voilà!

Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 2: Algorithms

September 25, 2013 1 comment

In the previous post I spoke briefly about motivations for implementing self-organizing maps in F# using GPU with CUDA. I have finally been able to outperform a single threaded C++ implementation by a factor of about 1.5. This is quite modest, but on the other hand rather impressive since we started out by being 60 times slower. At this point I am bidding farewell to F# and switching to C++. It will be interesting to see how this works out, but here are my initial algorithms.

So, we are parallelizing the following:

    member this.GetBMU (node : Node) =
        let min = ref Double.MaxValue
        let minI = ref -1
        let minJ = ref -1
        this.somMap |> Array2D.iteri (fun i j e -> 
            let dist = getDistance e node this.Metric
            if dist < !min then min := dist; minI := i; minJ := j)
        !minI, !minJ

Here somMap is just a two-dimensional array of Node-s. And a Node is simply an array of float’s (double’s in C#), of the size equal to dimensionality of our space.

The code for getDistance is also simple:

    let getDistanceEuclidian (x : Node) (y : Node) =
        Math.Sqrt([0..x.Dimension - 1] |> Seq.fold(fun sq i -> sq + (x.[i] - y.[i]) ** 2.) 0.)

    let getDistance (x : Node) (y : Node) metric =
        if x.Dimension <> y.Dimension then failwith "Dimensions must match"
            match metric with
            | Euclidian ->
                getDistanceEuclidian x y
            | Taxicab -> 
                getDistanceTaxicab x y

In my parallel version I am only using Euclidian distance for simplicity.

Parallel Algorithms

As I have mentioned above, Alea.cuBase (my F#-to-CUDA framework) does not support multidimensional arrays (as well as shared memory or calling a kernel from a kernel), so this is the price I had to pay for sticking to my favorite language. Given these constraints, here is what I came up with:


The very first idea (that proved also the very best) is quite intuitive. To find BMUs for the entire set of nodes, just take each individual node, find its BMU in parallel, repeat.

Node-by-node algorithm

Node-by-node algorithm

The map is first flattened into a single dimension, where if the cell coordinates were (i, j, k), they are mapped to (i * j * k + j * k + k). All distance sub-calculations can be performed in one shot, and then one thread finishes them up. By “sub-calculation” I mean computing (node(i) – map(j)) ** 2. These are then added up to the distance squared (I don’t calculate sqrt, since I don’t need it to find the minimum).

So, here is the implementation:

    let pDistances = 
        cuda {
            let! kernel =
                <@ fun nodeLen len
                    (node : DevicePtr<float>) 
                    (map :  DevicePtr<float>)
                    (distances : DevicePtr<float>)
                    (minDist : DevicePtr<float>)
                    (minIndex : DevicePtr<int>) 

                    // index into the original map, assuming
                    // a node is a single entity
                    let mapI = blockIdx.x * blockDim.x

                    // actual position of the node component in the map
                    let i = mapI + threadIdx.x  

                    if i < len then                    
                        // index into the node
                        let j = threadIdx.x % nodeLen

                        distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j])
                        if threadIdx.x = 0 then
                            let mutable thread = 0
                            minIndex.[blockIdx.x] <- -1
                            // find the minimum among threads
                            while mapI + thread < len && thread < blockDim.x do
                                let k = mapI + thread
                                let mutable sum = 0.
                                for j = 0 to nodeLen - 1 do
                                    sum <- sum + distances.[k + j]
                                if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then
                                    minDist.[blockIdx.x] <- sum
                                    minIndex.[blockIdx.x] <- k / nodeLen
                                thread <- thread + nodeLen
                    @> |> defineKernelFunc

            let diagnose (stats:KernelExecutionStats) =
               printfn "gpu timing: %10.3f ms %6.2f%% threads(%d) reg(%d) smem(%d)"
                   (stats.Occupancy * 100.0)

            return PFunc(fun (m:Module) (nodes : float [] list) (map : float []) ->
                let kernel = kernel.Apply m
                let nodeLen = nodes.[0].Length
                let chunk = map.Length
                let nt = (256 / nodeLen) * nodeLen // number of threads divisible by nodeLen
                let nBlocks = (chunk + nt - 1)/ nt //map.Length is a multiple of nodeLen by construction
                use dMap = m.Worker.Malloc(map)
                use dDist = m.Worker.Malloc<float>(map.Length)
                use dMinIndices = m.Worker.Malloc<int>(nBlocks * nodes.Length)
                use dMinDists = m.Worker.Malloc<float>(nBlocks * nodes.Length)
                use dNodes = m.Worker.Malloc(nodes.SelectMany(fun n -> n :> float seq).ToArray())
                let lp = LaunchParam(nBlocks, nt) //|> Engine.setDiagnoser diagnose
                nodes |> List.iteri (fun i node ->
                    kernel.Launch lp nodeLen chunk (dNodes.Ptr + i * nodeLen) dMap.Ptr dDist.Ptr (dMinDists.Ptr + i * nBlocks) (dMinIndices.Ptr + i * nBlocks))
                let minDists = dMinDists.ToHost()                                        
                let indices = dMinIndices.ToHost()
                let mins = (Array.zeroCreate nodes.Length)

                for i = 0 to nodes.Length - 1 do
                    let baseI = i * nBlocks
                    let mutable min = minDists.[baseI]
                    mins.[i] <- indices.[baseI]
                    for j = 1 to nBlocks - 1 do
                        if minDists.[baseI + j] < min then
                            min <-minDists.[baseI + j]
                            mins.[i] <- indices.[baseI + j]

To get as much parallelism as possible, I start with 256 threads per block (maximum on Kepler is 1024, but 256 gives the best performance). Since each block of threads is going to compute as many distances as possible, I try to allocate the maximum number of threads divisible by node dimensionality. In my case: 256 / 12 * 256 = 252. Pretty good, we get almost an optimal number of threads per block.

The number of blocks are computed from the length of the map. I want to split this in an optimal way since each block is scheduled on one SP, and I have 192 of those I don’t want them to idle. The algorithm leans towards parallelizing calculations relative to the map (see picture above, I want all those “arrows” to be executed in parallel), so the number of blocks will be (somArray.length + nt – 1) / nt (nt is the number of threads – 252, somArray.length is 200 * 200 * 12 = 480000, the formula above takes into account the fact that this number may not be a multiple of 252, in which case we will need one more incomplete block). My block size is 1905 – pretty good, CUDA devs recommend that to be at least twice the number of multiprocessors. It is necessary to hide latency, which is killer in this development paradigm (you need to rely on your PCI slot to transfer data).

One weird thing here is that I have to allocate a relatively large dDist arrray. This is where temporary values of (map(i) – node(j))**2 go. In reality (if I were writing in C++) I would not do that. I would just use the super-fast shared memory for this throw-away array. I could not get that to work with Alea, although the documentation says it is supported.

Another thing: the call to __syncthreads() is issued under a conditional statement. This, in general, is a horrible idea, because in the SIMT (single instruction multiple threads), threads march “in sync” so to speak, instruction by instruction. Thus, doing what I have done may lead to things hanging as some threads may take different branches and other threads will wait forever. Here, however, we are good, because the only way to go if the condition evaluates to false is to simply quit.

The kernel is called in a loop. One call for each node: 10000 passes (vs 10000 * 40000) in the sequential case. I also make sure that all of my memory is allocated once and everything I need is copied there. Not doing that leads to disastrous consequences, since all you would have in that case is pure latency.

The code is self-explanatory. Once everyone has computed their part of the distance, these parts are assembled by the 0-th thread of the block. That thread is responsible for calculating the final square of the distance, comparing the result of what we already have in the minDist array for this map node and storing that result.

Node-by-node optimized

And here lies the mistake that makes this algorithm lose out on performance: there is no need to delegate this work to the 0-th thread (looping over all 252 threads of the block). It is enough to delegate that to each “threadIdx.x % nodeLen = 0″‘s thread of the block, so now 21 threads do this work in parallel, each for only 12 threads. Algorithm re-written this way outperforms everything else I could come up with.

Here is the re-write of the kernel function:

            let! kernel =
                <@ fun nodeLen len
                    (node : DevicePtr<float>) 
                    (map :  DevicePtr<float>)
                    (distances : DevicePtr<float>)
                    (minDist : DevicePtr<float>)
                    (minIndex : DevicePtr<int>) 

                    // index into the original map, assuming
                    // a node is a single entity
                    let mapI = blockIdx.x * blockDim.x

                    // actual position of the node component in the map
                    let i = mapI + threadIdx.x  

                    if i < len then                    
                        // index into the node
                        let j = threadIdx.x % nodeLen

                        distances.[i] <- (map.[i] - node.[j]) * (map.[i] - node.[j])
                        if threadIdx.x % nodeLen = 0 then
                            minIndex.[blockIdx.x] <- -1
                            // find the minimum among threads
                            let k = mapI + threadIdx.x
                            let mutable sum = 0.
                            for j = k to k + nodeLen - 1 do
                                sum <- sum + distances.[j]
                            if minDist.[blockIdx.x] > sum || minIndex.[blockIdx.x] < 0 then
                                minDist.[blockIdx.x] <- sum
                                minIndex.[blockIdx.x] <- k / nodeLen
                    @> |> defineKernelFunc

This kernel function only stores “winning” (minimal) distances within each section of the map. Now they need to be reduced to one value per node. There are 1905 blocks and 10000 nodes. 10000 minimum values are computed in one pass over the 1905 * 10000-length array of accumulated minimal distances:

                let minDists = dMinDists.ToHost()                                        
                let indices = dMinIndices.ToHost()
                let mins = (Array.zeroCreate nodes.Length)

                for i = 0 to nodes.Length - 1 do
                    let baseI = i * nBlocks
                    let mutable min = minDists.[baseI]
                    mins.[i] <- indices.[baseI]
                    for j = 1 to nBlocks - 1 do
                        if minDists.[baseI + j] < min then
                            min <-minDists.[baseI + j]
                            mins.[i] <- indices.[baseI + j]

And we are done. The improved version beats all the rest of my algorithms. Since all of these performance improvements looked so “juicy”, I decided it was high time to abandon managed code and go back to the C++ world. Especially since writing C++ code is not going to be so different: no annoying loops to write, just express it linearly and reap the massively parallel goodness.

Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 1: F#

September 24, 2013 1 comment

By 2017, it is expected that GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture. Such a system eliminates some of accelerator architectures’ historical challenges, including requiring the programmer to manage multiple memory spaces, suffering from bandwidth limitations from an interface such as PCI Express for transfers between CPUs and GPUs, and the system-level energy overheads for both chip crossings and replicated chip infrastructure.

Alan Tatourian.

This is going to be all about parallelism, CUDA, performance, and new direction in software development. For me personally anyway. All the code mentioned below is here.

So, here we've got it. The need for SOMs. SOMs are wonderful for clustering and visualizing high dimensional data. Naturally, using the sacred principle of software development (best software engineers steal) I looked around for some existing code. Found this and this very fast. The first of these shows you very quickly what SOMs are and how to do build them step-by-step, while the second already has the C++ code that can just be taken as is and used for computing SOMs. Which was what I did.

In my experiments, I used around 12,000 12-dimensional nodes with a map of 200x200. On my puny i5 650 (3.2 Ghz), 8 Gb RAM, generating a SOM with these parameters takes around 4 hrs, maybe less. One "epoch" takes around 40 sec and I run 500 epochs, however, since the neighborhood of code vectors that gets trained in one epoch gradually diminishes, it is not a straight multiplication of 500 * 40.

These experiments have actually not yielded the results I was hoping for, perhaps because the training set is not large enough for the data I am trying to cluster. Be it as it may, more experiments are needed with a larger dataset, and I am already at the brink of feasibility as far as performance. It does increase linearly with the number of nodes. The C++ code that I have (stolen) is actually pretty good, but it is single-threaded, so doing things in parallel seems to be a logical next step.

Step 0. CPU F#

At this point, I was actually interested more in performance improvements than in re-implementing SOM in F#. I did re-implement it in F#, but for my performance bench-marking I did not use the whole algorithm, just the first part where BMUs are calculated. Since BMU is a map vector that is closest to the given node, in order to compute one BMU it is necessary to iterate over the entire map. So computing BMUs for the entire set of nodes gets us the cost of O(n * d * m1 * m2) (m1, m2 are map dimensions, n is the length of the nodes array, d is dimensionality of each node vector). That's for one epoch only. And there are 500 of those. It adds up.

My F# implementation computed the BMU for one 12 dimensional node on a 200x200 SOM in a whopping 140 ms. Vs just 4ms for C++. I did expect a perfromance drop from C++, I just did not expect it to be that drastic.

    member this.GetBMU (node : Node) =
        let min = ref Double.MaxValue
        let minI = ref -1
        let minJ = ref -1
        this.somMap |> Array2D.iteri (fun i j e -> 
            let dist = getDistance e node this.Metric
            if dist < !min then min := dist; minI := i; minJ := j)
        !minI, !minJ

Then I added parallelism. And Parallel.ForEach worked slightly better than Parallel.For.

    member this.GetBMUParallel (node : Node) =
        let monitor = new obj()
        let minList = ref []

            Partitioner.Create(0, fst dims), 
            (fun () -> ref (Double.MaxValue, -1, -1)), 
            (fun range state local -> 
                let mutable(min, minI, minJ) = 
                    match !local with
                    | m, i, j -> m, i, j
                for i = fst range to snd range - 1 do
                    for j = 0 to snd this.Dimensions - 1 do
                        let dist = getDistance this.somMap.[i, j] node this.Metric
                        if dist < min then 
                            min <- dist; minI <- i; minJ <- j
                local := (min, minI, minJ)
            (fun local -> lock monitor (fun () ->
                match !local with
                | m, i, j when i > 0 -> 
                    minList := (m, i, j) :: !minList
                |_ -> ()
                ))) |> ignore

        let minTuple = !minList |> List.minBy (fun (x, i, j) -> x)
        match minTuple with
        | x, i, j -> i, j

Nothing fancy here. Split the first dimension of the map into chunks and try processing them as much as possible in parallel, by utilizing all the 4 logical cores. The inner loop could also be re-written the same way (to use Parallel.For or Parallel.ForEach), but it would probably not do much good since we are already as parallel as we can be. (And in reality it did not. Do any good, that is). While I expected an at least 4-fold performance increase, I did not get. I did get a 2 times increase. Now it only took 70 ms for one node.

Going massively parallel

At this point, things are really intuitive. If it were up to me, I'd do every calculation there is in parallel and then reduce them once they are done. If it takes, I dunno, say 0.001 mks to multiply 2 numbers in one processor thread, how long does it take to multiply 12000 * 12 * 2 numbers on 144000 processors? Obviously the same 0.001 ms. So the problem becomes almost constant in the number of nodes if we only could always have as many processors as the number of nodes * their dimensions.

Reality is of course vastly different but it does not have to be measured by the number 4 (of CPU cores). Thus, CUDA or OpenCL. I invested $166 in a Quadro K600 Graphics card, which has 192 cores and 1 Gb of on-board RAM. I still wanted to remain faithful to F#, so I looked for a .NET/F# CUDA framework. After briefly evaluating several such frameworks (and they are all in different stages of nascent at this point), I picked Alea.cuBase from QuantAlea.

The Framework

Alea.cuBase is pretty neat. I like the paradigm - using quotations to write CUDA code. The documentation is pretty good, gets you up and running very quickly, once the basic concepts of CUDA have been grasped. There are problems, though.

  • I could not get shared memory to work despite claims that it is supported. Things just crashed.
  • No support yet for multi-dimensional arrays. This was kind of a bummer, because it meant some preprocessing on my part to get things going on the GPU. Oh well. Whatever doesn't kill you...

So how did I do having re-written things to run massively in parallel (with one hand cuffed to the radiator, since I could not use multidimensional arrays)? Well, I did several implementations, I will describe them next time, but here are the charts.


Iteration 1.

Iteration 1.

The results are averaged over 3 repetitions. Y-axis values are in ms, both axes are logarithmic. Experiments, using CPU are in dashed lines, they start fading at the point where I simply estimated the run-time based on previous performance (did not want to wait 168 sec). On the x-axis are the number of nodes (200x200 SOM, 12 dimensions). I finally did outperform a single-threaded C++ implementation with the "GPU-iterations" algorithm. (Why is its performance so staggeringly awful on small sizes of the dataset? I will talk about it in my next post). Although the gains are not that impressive. At the end I was able to shave about 13-16 seconds off of the "real" 12 000 strong dataset. Which, I guess, is not bad, although not quite there yet... Why? As it turns out, the parallel part of all of this has never exceeded 50% of the total run-time. Which means, the algorithms work in the "hurry-up-and-wait" mode. While parallel calculations do their part pretty fast, negotiating things back and forth with .NET kills the performance.

Still, I made some optimizations and here is what I got:

Iteration 2

Iteration 2

Notice, how the "GPU-node-by-node" algorithm from being a loser actually advanced to the first place. This was due to a very small change.

At the end, I absolutely fell in love with GPUs and CUDA. It really demystifies parallel programming. All of the behaviors I encountered while experimenting with different implementations were immediately obvious and predictable. I also changed quite a few of my previous convictions about software development. I will talk about it next time (with code in hand).

Categories: CUDA, F#, Parallel Tags: , , ,

Visualizing Crime with d3: Hooking up Data and Colors, Part 2

June 17, 2013 1 comment

In the previous post, we derived a class from BubbleChart and this got us started on actually visualizing some meaningful data using bubbles.

There are a couple of things to iron out before a visual can appear.

Color Schemes

I am using Cynthia Brewer color schemes, available for download in colorbrewer.css. This file is available on my GitHub as well.

It consists of entries like:

.Spectral .q0-3{fill:rgb(252,141,89)}
.Spectral .q1-3{fill:rgb(255,255,191)}
.Spectral .q2-3{fill:rgb(153,213,148)}
.Spectral .q0-4{fill:rgb(215,25,28)}
.Spectral .q1-4{fill:rgb(253,174,97)}
.Spectral .q2-4{fill:rgb(171,221,164)}
.Spectral .q3-4{fill:rgb(43,131,186)}
.Spectral .q0-5{fill:rgb(215,25,28)}
.Spectral .q1-5{fill:rgb(253,174,97)}

Usage is simple: you pick a color scheme and add it to the class of your parent element that will contain the actual SVG elements displayed, e.g.: Spectral. Then, one of the “qi-n classes are assigned to these child elements to get the actual color.

So for instance:

The main SVG element on the Crime Explorer visual looks like this:

<svg class="Spectral" id="svg_vis">...</svg>

Then, each of the “circle” elements inside this SVG container will have one of the qi-9 (I am using 9 total colors to display this visualization so i ranges from 0..8).

<circle r="9.664713682964603" class="q2-9" stroke-width="2" stroke="#b17943" id="city_0" cx="462.4456905180483" cy="574.327856528298"></circle>

(Note the class=”q2-9″ attribute above).

All of this is supported by the BubbleChart class with some prodding.
You need to:

  1. Pass the color scheme to the constructor of the class derived from BubbleChart upon instantiation:
    allStates = new AllStates('vis', crime_data, 'Spectral')
  2. Implement a function called color_class in the derived class, that will produce a string of type “qi-n”, given an i. The default function supplied with the base class always returns “q1-6″.
    @color_class =
          d3.scale.threshold().domain(@domain).range(("q#{i}-9" for i in [8..0]))

    In my implementation, I am using the d3 threshold scale to map a domain of values to the colors I need based on certain thresholds. The range is reversed only because I want “blue” colors to come out on lower threshold values, and “red” – on higher ones (less crime is “better”, so I use red for higher values). See AllStates.coffe for a full listing.How this is hooked up tho the actual data is discussed in the next section.

Data Protocol

This is key: data you pass to the BubbleChart class must comply with the following requirements:

  1. It must be an array (not an associative array, a regular array). Each element of this array will be displayed as a circle (“bubble”) on the screen.
  2. Each element must contain the following fields:
    • id – this is a UNIQUE id of the element. It is used by BubbleChart to do joins (see d3 documentation for what these are)
    • value – this is what the “value” of each data element is, and it is used to compute the radius of each bubble
    • group – indicates the “color group” to which the bubble belongs. This is what is fed to the color_class function to determine the color of each individual bubble

With all these conditions satisfied, the array of data is now ready to be displayed.

Displaying It

Now that it is all done, showing the visual is simple:

allStates = new AllStates('vis', crime_data, 'Spectral')

Next time: displaying the auxiliary elements: color and size legends, the search box.


Get every new post delivered to your Inbox.