Archive for the ‘CPP’ Category

Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 2

August 22, 2014 1 comment


All this code is now on GitHub.

Compute the mathces

Here is a simple, purely brute-force algorithm for computing the join mentioned in Part 1.

Here is the entirely “CPU” implementation of the algorithm:

std::vector<bool> matches(size);
std::fill(matches.begin(), matches.end(), false);

for (int i = 0; i < size; i++)
    for (int j = 0; j < classes.size(); j++)
        bool match = classes[j] == classCode[i];
        if (match)
            matches[i] = match;

Loop over both datasets, compare them one-by-one, if there is a match – flag it. The only thing to note is that while it’s tempting to write the inner loop like this:

matches[i] = classes[j] == classCode[i];

it would be an obvious bug to do so.

This has a loathsome O(m*n) complexity (m, n – sizes of the datasets), and, in general this code is criminal.

In an attempt to decriminalize it, we delegate this work to the GPU as follows:

__global__ void joinValue(const int left, const int * const right, int rightSize, bool * const matches)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx &lt; rightSize)
        bool match = left == right[idx];
        if (match)
            matches[idx] = match;

__global__ void join(const int * const left, const int * const right, int leftSize, int rightSize, int childBlocks, int childThreadsPerBlock, bool * const matches)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx &lt; leftSize)
        joinValue&lt;&lt;&lt;childBlocks, childThreadsPerBlock&gt;&gt;&gt;(left[idx], right, rightSize, matches);


Pretty simple. The “join” kernel is used to “loop” over the filtering map one entry per thread. This calls a child kernel to “loop” over the “database table”. The output array of “matches” is originally instantiated to “false”. We are still O(m*n) in terms of work complexity, but step complexity is O(1) – just two steps are necessary.

Thrust is used to extract raw pointers to the data and pass them to the kernels. No need for CUDA boiler-plate to allocate, copy, and free the memory:

const int * const right = thrust::raw_pointer_cast(;
bool * const matchesRaw = thrust::raw_pointer_cast(;
const int * const left = thrust::raw_pointer_cast(;

//nBlocks: calculated so we have one thread per element of the &quot;left&quot; dataset
int nBlocks = (size + threadCount) / threadCount;
//the child kernel has nChildBlocks: one thread per element of the &quot;right&quot; dataset
int nChildBlocks = ((int)d_patentInfo.classCode.size() + threadCount) / threadCount;

// launch join kernel
int rightSize = (int)d_patentInfo.classCode.size();
join&lt;&lt;&lt;nBlocks, threadCount&gt;&gt;&gt;(left, right, size, rightSize, nChildBlocks, threadCount, matchesRaw);
//... and wait for it to finish

Filter the data

From this point on, everything is done with Thrust.

We count the number of matches and allocate a structure of arrays on the GPU to receive them. The constructor for the structure allocates every one of its arrays.

int nMatches = (int)thrust::count(matches.begin(), matches.end(), true);
PatentInfo relevantPatents(nMatches);

We copy all of the filtered data into the new location in one shot, using the Thrust zip iterators:

// output iterator. &quot;matches&quot; at the end has a larger size than nMatches.
PatentZipIterator relevantPatentsIter = 
                              relevantPatents.classCode.begin(), relevantPatents.appYear.begin(), 
                              relevantPatents.issueYear.begin(), relevantPatents.companyId.begin(), 
                              relevantPatents.isImportant.begin(), matches.begin()));

//input iterator
PatentZipIterator patentZipIter = 
    thrust::make_zip_iterator(thrust::make_tuple(d_patentInfo.patentId.begin(), d_patentInfo.classCode.begin(), 
                              d_patentInfo.appYear.begin(), d_patentInfo.issueYear.begin(), 
                              d_patentInfo.companyId.begin(), d_patentInfo.isImportant.begin(), matches.begin())); 

PatentZipIterator patentZipIterEnd = 
    thrust::make_zip_iterator(thrust::make_tuple(d_patentInfo.patentId.end(), d_patentInfo.classCode.end(), 
                              d_patentInfo.appYear.end(), d_patentInfo.issueYear.end(), 
                              d_patentInfo.companyId.end(),  d_patentInfo.isImportant.end(), matches.end()));

//copy to the output if we are matching
thrust::copy_if(patentZipIter, patentZipIterEnd, relevantPatentsIter, PatentInfo::is_matching());

Here PatentZipIterator is defined as follows:

typedef thrust::tuple&lt;IntIterator, IntIterator, IntIterator, IntIterator, IntIterator, BoolIterator, BoolIterator&gt; PatentIteratorTuple;
typedef thrust::zip_iterator&lt;PatentIteratorTuple&gt; PatentZipIterator;

A zip iterator in Thrust is similar to Zip() function in LINQ. In LINQ, Zip() is limited to two sequences, while in Thrust zip iterator can be used with as many sequences as the thrust::tuple type can handle. In large part, the purpose of this exercise was to take this iterator for a ride: an obvious optimization (discussed later) would not really require this complexity.

In this implementation, we zip up all the arrays in our PatentInfo structure together with the array of matches in a septuple. thrust::make_zip_iterator() takes a tuple of iterators and creates a single iterator that allows us to march thrust algorithms on seven arrays in lock-step. So, thrust::copy_if() – our gathering algorithm can examine one element at a time using PatentInfo::is_matching functor and see if this is a “matched” element. The predicate functor is implemented as follows:

typedef thrust::tuple&lt;int, int, int, int, int, bool, bool&gt; PatentTuple;

struct is_matching : public thrust::unary_function &lt; PatentTuple, bool &gt;
    __host__ __device__
        bool operator() (const PatentTuple&amp; patentTuple)
        return thrust::get&lt;6&gt;(patentTuple);

Each element fed to the predicate is a septuple wrapped into the PatentTuple type, the last member of each septuple is an element of the matches[] array that determines the return value of the predicate.

After the gathering algorithm is done, we are ready to move the results to the host memory, which is done again painlessly by Thrust.




Performance is, unfortunately, ridiculously high. “Unfortunately” because bad performance is a great drive force of innovation and improvement. Here is the summary (in seconds, avg) for a dataset of 20,000,000 “patent ids” joined over 4000 classes.

7(actually for 5,000,000 over 3000 classes) 80 0.012

GPU performance does not include moving the data back into the host memory – this maybe takes a second. Deploying the original data to the GPU is done “off-line”, so is of little interest, but it also takes a second or so.


One optimization is so glaring, that I cannot pass it by. Leave the first step intact, but then, instead of moving data to/from the GPU, convert the array of “matches” to the array of indices matched and return it to the user. No need for zip iterators, no need to shovel lots of data to/from the GPU:

  1. Deploy the list of “keys” (original list of patentIds in this case) to the GPU
  2. Find matches as before
  3. Convert the matches array into a compacted array of indices
  4. return the array to the calling application

Then the calling application can scan the array of indices on the CPU in O(nMatches) time, and return the actual filtered dataset.
Besides the fact, that much less data needs to be moved between different spaces (managed, native, GPU), we also free the GPU implementation of any data complexity: no need to know what the table we are filtering actually looks like, as long as we are filtering on a single integer key!

Categories: CPP, CUDA

Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 1

August 21, 2014 1 comment

This is a post in two parts:
Part 1 – The problem, solution setup, the algorithm.
Part 2 – (The juicy) Implementation details, discussion.

Suppose at the heart of the data layer of a web application there is a join like this:

select distinct p.PatentID, p.ClassId, p.CompanyId, p.IssueYear, p.AppYear, p.IsImportant
from @topClasses t
join Patents p on p.ClassId = t.ClassId

This join filters patents belonging to a set of classes from the Patents table.

Semantics are not that important, but the fact that a table of classes is on the order of a few thousands, while that of patents – a few million, and so is the resulting filtered dataset. Classes table variable contains unique entries for classes, and patents belonging to these classes are extracted. This join takes around 6 – 8 sec on SQL Server 2014 (timing SSMS, probably even less time in reality), not a whole lot of time, but a lifetime in some contexts. Such as a web application’s, for example.

Enter GTX Titan, fun begins.

Brute Force, Massively Parallel Solution

We would like to deploy an equivalent of the Patents table to the GPU and see if we can speed up the join. For this we need:

  1. Define the data structures and deploy the data
  2. Filter (“join”) the data based on our set of keys
  3. Gather the result of this join
  4. Move it out from the GPU back to main memory

Note: There is more, of course: somehow the data needs to persist on the GPU between web server calls, requests need to be synchronized, and data has to be marshalled back and forth between managed and native space. These are all pretty standard problems I am not addressing in this post.

Representing the data.

The data has been quite well normalized, so we are dealing with a bunch of 4-byte integers (and a boolean).  This makes – 21 bytes per record, assuming booleans are stored as bytes. Since our table size is in the millions of records, we have in the order of tens of megabytes of data we would like to store on the GPU. With GTX Titan and its 6GB of memory, we are good!

The most  intuitive way to represent this data on the device would be an array of structures:

struct PatentInfo
    int patentId;
    int classCode;
    int appYear;
    int issueYear;
    int companyId;
    bool isImportant;


Defining an optimized structure with Thrust

However, one of the basics of GPU optimization is coalesced global memory access. We are joining on just one field:  classCode, and we want coalesced access to that field in particular.

So, we define our structure as a structure of arrays:

#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <thrust/transform_scan.h>
#include <thrust/copy.h>
#include <thrust/count.h>

typedef thrust::device_vector<int> IntVector;
typedef thrust::device_vector<bool> BoolVector;
typedef thrust::device_vector<int>::iterator IntIterator; 
typedef thrust::device_vector<bool>::iterator BoolIterator;

struct PatentInfo
    IntVector patentId;
    IntVector classCode;
    IntVector appYear;
    IntVector issueYear;
    IntVector companyId;
    BoolVector isImportant;

Here NVIDIA Thrust library is used to wrap device containers. I found this library to be quite helpful. Essential, really. In particular it’s extremely easy to deploy data to the GPU:

IntVector d_vector = h_vector;

The above copies the thrust::host_vector<int> to the device. Or, if the data came from the managed space as an array:

IntVector d_vector(size);
thrust::copy(h_data, h_data + size, d_vector.begin());

will copy the unmanaged host data stored in h_data to the device.
NOTE: It is important to keep in mind that all of the Thrust code should be placed in .cu files (or files included in .cu files), so NVCC compiler can get to it.

Now that the data is safely on the device…

The Algorithm

This is a simple, brute force algorithm. I am not a believer in brute force, so I would prefer to optimize what I currently have, unfortunately, it performs incredibly well and thus earns its right to exist.

Here it is in a nutshell (with the help of .NET LINQ notation in C#):
Given the PatentInfo patentInfo structure (where each array is of the size num_patents), and int const * const classes array of size num_classes , of distinct (preferably, but not necessarily) classes:

  1. (In a massively parallel fashion), build an array bool matches[num_of_patents] where matches[i] = true iff there exists j < num_classes: classes[j] == patentInfo[i]
  2. Allocate a structure (on the device) PatentInfo relevantPatents of the size == matches.Count(m => m) – “true” matches.
  3. Copy all the arrays from the original structure into the new one, such that: array.Where((elem, i) => matches[i]) – index of the element corresponds to a “true” match.
  4. Copy the structure above to the host memory and return it


Categories: CPP, CUDA