Home > CPP, CUDA > Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 1

Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 1

This is a post in two parts:
Part 1 – The problem, solution setup, the algorithm.
Part 2 – (The juicy) Implementation details, discussion.

Suppose at the heart of the data layer of a web application there is a join like this:

select distinct p.PatentID, p.ClassId, p.CompanyId, p.IssueYear, p.AppYear, p.IsImportant
from @topClasses t
join Patents p on p.ClassId = t.ClassId

This join filters patents belonging to a set of classes from the Patents table.

Semantics are not that important, but the fact that a table of classes is on the order of a few thousands, while that of patents – a few million, and so is the resulting filtered dataset. Classes table variable contains unique entries for classes, and patents belonging to these classes are extracted. This join takes around 6 – 8 sec on SQL Server 2014 (timing SSMS, probably even less time in reality), not a whole lot of time, but a lifetime in some contexts. Such as a web application’s, for example.

Enter GTX Titan, fun begins.

Brute Force, Massively Parallel Solution

We would like to deploy an equivalent of the Patents table to the GPU and see if we can speed up the join. For this we need:

  1. Define the data structures and deploy the data
  2. Filter (“join”) the data based on our set of keys
  3. Gather the result of this join
  4. Move it out from the GPU back to main memory

Note: There is more, of course: somehow the data needs to persist on the GPU between web server calls, requests need to be synchronized, and data has to be marshalled back and forth between managed and native space. These are all pretty standard problems I am not addressing in this post.

Representing the data.

The data has been quite well normalized, so we are dealing with a bunch of 4-byte integers (and a boolean).  This makes – 21 bytes per record, assuming booleans are stored as bytes. Since our table size is in the millions of records, we have in the order of tens of megabytes of data we would like to store on the GPU. With GTX Titan and its 6GB of memory, we are good!

The most  intuitive way to represent this data on the device would be an array of structures:

struct PatentInfo
    int patentId;
    int classCode;
    int appYear;
    int issueYear;
    int companyId;
    bool isImportant;


Defining an optimized structure with Thrust

However, one of the basics of GPU optimization is coalesced global memory access. We are joining on just one field:  classCode, and we want coalesced access to that field in particular.

So, we define our structure as a structure of arrays:

#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <thrust/transform_scan.h>
#include <thrust/copy.h>
#include <thrust/count.h>

typedef thrust::device_vector<int> IntVector;
typedef thrust::device_vector<bool> BoolVector;
typedef thrust::device_vector<int>::iterator IntIterator; 
typedef thrust::device_vector<bool>::iterator BoolIterator;

struct PatentInfo
    IntVector patentId;
    IntVector classCode;
    IntVector appYear;
    IntVector issueYear;
    IntVector companyId;
    BoolVector isImportant;

Here NVIDIA Thrust library is used to wrap device containers. I found this library to be quite helpful. Essential, really. In particular it’s extremely easy to deploy data to the GPU:

IntVector d_vector = h_vector;

The above copies the thrust::host_vector<int> to the device. Or, if the data came from the managed space as an array:

IntVector d_vector(size);
thrust::copy(h_data, h_data + size, d_vector.begin());

will copy the unmanaged host data stored in h_data to the device.
NOTE: It is important to keep in mind that all of the Thrust code should be placed in .cu files (or files included in .cu files), so NVCC compiler can get to it.

Now that the data is safely on the device…

The Algorithm

This is a simple, brute force algorithm. I am not a believer in brute force, so I would prefer to optimize what I currently have, unfortunately, it performs incredibly well and thus earns its right to exist.

Here it is in a nutshell (with the help of .NET LINQ notation in C#):
Given the PatentInfo patentInfo structure (where each array is of the size num_patents), and int const * const classes array of size num_classes , of distinct (preferably, but not necessarily) classes:

  1. (In a massively parallel fashion), build an array bool matches[num_of_patents] where matches[i] = true iff there exists j < num_classes: classes[j] == patentInfo[i]
  2. Allocate a structure (on the device) PatentInfo relevantPatents of the size == matches.Count(m => m) – “true” matches.
  3. Copy all the arrays from the original structure into the new one, such that: array.Where((elem, i) => matches[i]) – index of the element corresponds to a “true” match.
  4. Copy the structure above to the host memory and return it


Categories: CPP, CUDA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: