Note: All this code is now on GitHub.
Compute the matches
Here is a simple, purely brute-force algorithm for computing the join mentioned in Part 1.
First, the pure CPU implementation:
    std::vector<bool> matches(size);
    std::fill(matches.begin(), matches.end(), false);
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < classes.size(); j++)
        {
            bool match = classes[j] == classCode[i];
            if (match)
            {
                matches[i] = match;
            }
        }
    }
Loop over both datasets, compare the entries one by one, and if there is a match, flag it. The only thing to note is that while it is tempting to write the inner loop like this:
    matches[i] = classes[j] == classCode[i];
it would be an obvious bug to do so: a later, non-matching value of j would overwrite a previously recorded match with false.
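For reference, here is a sketch of the same inner scan written with std::find (assuming classes is a standard container of int), which makes the flag-once semantics explicit and stops at the first match:

    #include <algorithm> // std::find

    // equivalent scan: std::find stops at the first match, so nothing
    // can overwrite a previously recorded "true"
    for (int i = 0; i < size; i++)
    {
        matches[i] = std::find(classes.begin(), classes.end(), classCode[i]) != classes.end();
    }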
Either way, this has a loathsome O(m*n) complexity (m and n being the sizes of the two datasets), and, in general, this code is criminal.
In an attempt to decriminalize it, we delegate this work to the GPU as follows:
    __global__ void joinValue(const int left, const int * const right, int rightSize, bool * const matches)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < rightSize)
        {
            bool match = left == right[idx];
            if (match)
            {
                matches[idx] = match;
            }
        }
    }

    __global__ void join(const int * const left, const int * const right, int leftSize, int rightSize, int childBlocks, int childThreadsPerBlock, bool * const matches)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < leftSize)
        {
            joinValue<<<childBlocks, childThreadsPerBlock>>>(left[idx], right, rightSize, matches);
        }
    }
Pretty simple. The "join" kernel "loops" over the filtering map, one entry per thread, and calls a child kernel to "loop" over the "database table". The output array of "matches" is initialized to false up front. We are still O(m*n) in terms of work complexity, but the step complexity is O(1): just two steps are necessary. (Launching a kernel from a kernel relies on CUDA dynamic parallelism, which requires a device of compute capability 3.5 or higher and compiling with relocatable device code, i.e. nvcc -rdc=true.)
Thrust is used to extract raw pointers to the data and pass them to the kernels, so there is no need for CUDA boilerplate to allocate, copy, and free the memory:
    const int * const right = thrust::raw_pointer_cast(d_patentInfo.classCode.data());
    bool * const matchesRaw = thrust::raw_pointer_cast(matches.data());
    const int * const left = thrust::raw_pointer_cast(classes.data());

    // nBlocks: calculated so we have one thread per element of the "left" dataset
    int nBlocks = (size + threadCount) / threadCount;

    // the child kernel has nChildBlocks: one thread per element of the "right" dataset
    int nChildBlocks = ((int)d_patentInfo.classCode.size() + threadCount) / threadCount;

    // launch the join kernel...
    int rightSize = (int)d_patentInfo.classCode.size();
    join<<<nBlocks, threadCount>>>(left, right, size, rightSize, nChildBlocks, threadCount, matchesRaw);

    // ... and wait for it to finish
    cudaDeviceSynchronize();
Filter the data
From this point on, everything is done with Thrust.
We count the number of matches and allocate a structure of arrays on the GPU to receive them. The constructor for the structure allocates every one of its arrays.
    int nMatches = (int)thrust::count(matches.begin(), matches.end(), true);
    PatentInfo relevantPatents(nMatches);
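The PatentInfo structure itself is not shown in this post. A plausible sketch, assuming one thrust::device_vector per column (the field names and the nested is_matching functor come from the code in this post; everything else is an assumption):

    #include <thrust/device_vector.h>

    // a sketch of the structure of arrays: the constructor sizes every
    // column, which is what PatentInfo relevantPatents(nMatches) relies on
    struct PatentInfo
    {
        thrust::device_vector<int>  patentId;
        thrust::device_vector<int>  classCode;
        thrust::device_vector<int>  appYear;
        thrust::device_vector<int>  issueYear;
        thrust::device_vector<int>  companyId;
        thrust::device_vector<bool> isImportant;

        explicit PatentInfo(int n)
            : patentId(n), classCode(n), appYear(n),
              issueYear(n), companyId(n), isImportant(n) {}

        struct is_matching; // the predicate shown later, nested here
    };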
We copy all of the filtered data into the new location in one shot, using Thrust zip iterators:
    // output iterator. "matches" at the end has a larger size than nMatches.
    PatentZipIterator relevantPatentsIter = thrust::make_zip_iterator(thrust::make_tuple(
        relevantPatents.patentId.begin(),
        relevantPatents.classCode.begin(),
        relevantPatents.appYear.begin(),
        relevantPatents.issueYear.begin(),
        relevantPatents.companyId.begin(),
        relevantPatents.isImportant.begin(),
        matches.begin()));

    // input iterator
    PatentZipIterator patentZipIter = thrust::make_zip_iterator(thrust::make_tuple(
        d_patentInfo.patentId.begin(),
        d_patentInfo.classCode.begin(),
        d_patentInfo.appYear.begin(),
        d_patentInfo.issueYear.begin(),
        d_patentInfo.companyId.begin(),
        d_patentInfo.isImportant.begin(),
        matches.begin()));

    PatentZipIterator patentZipIterEnd = thrust::make_zip_iterator(thrust::make_tuple(
        d_patentInfo.patentId.end(),
        d_patentInfo.classCode.end(),
        d_patentInfo.appYear.end(),
        d_patentInfo.issueYear.end(),
        d_patentInfo.companyId.end(),
        d_patentInfo.isImportant.end(),
        matches.end()));

    // copy to the output if we are matching
    thrust::copy_if(patentZipIter, patentZipIterEnd, relevantPatentsIter, PatentInfo::is_matching());
Here PatentZipIterator is defined as follows:
    typedef thrust::tuple<IntIterator, IntIterator, IntIterator, IntIterator, IntIterator, BoolIterator, BoolIterator> PatentIteratorTuple;
    typedef thrust::zip_iterator<PatentIteratorTuple> PatentZipIterator;
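The IntIterator and BoolIterator typedefs are not shown in the post; presumably they are the device-vector iterator types, along these lines:

    // assumed definitions: iterators over the device vectors that make up PatentInfo
    typedef thrust::device_vector<int>::iterator  IntIterator;
    typedef thrust::device_vector<bool>::iterator BoolIterator;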
A zip iterator in Thrust is similar to the Zip() function in LINQ. In LINQ, Zip() is limited to two sequences, while in Thrust a zip iterator can combine as many sequences as the thrust::tuple type can handle. In large part, the purpose of this exercise was to take this iterator for a ride: an obvious optimization (discussed later) would not really require this complexity.
In this implementation, we zip up all the arrays in our PatentInfo structure, together with the array of matches, into a septuple. thrust::make_zip_iterator() takes a tuple of iterators and creates a single iterator that lets Thrust algorithms march over all seven arrays in lock-step. So thrust::copy_if(), our gathering algorithm, can examine one element at a time through the PatentInfo::is_matching functor and see whether it is a "matched" element. The predicate functor is implemented as follows:
    typedef thrust::tuple<int, int, int, int, int, bool, bool> PatentTuple;

    struct is_matching : public thrust::unary_function<PatentTuple, bool>
    {
        __host__ __device__
        bool operator() (const PatentTuple& patentTuple)
        {
            return thrust::get<6>(patentTuple);
        }
    };
Each element fed to the predicate is a septuple wrapped into the PatentTuple type; the last member of each septuple is an element of the matches[] array, which determines the return value of the predicate.
After the gathering algorithm is done, we are ready to move the results back to host memory, which Thrust again makes painless.
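For example, a minimal sketch of pulling one filtered column back to the host (the host-side vector name is made up for illustration):

    #include <thrust/copy.h>
    #include <thrust/host_vector.h>

    // device-to-host transfer: thrust::copy sees that the source lives
    // on the GPU and issues the cudaMemcpy for us
    thrust::host_vector<int> patentIdsOnHost(relevantPatents.patentId.size());
    thrust::copy(relevantPatents.patentId.begin(),
                 relevantPatents.patentId.end(),
                 patentIdsOnHost.begin());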
Discussion
Performance
Performance is, unfortunately, ridiculously good. "Unfortunately" because bad performance is a great driving force of innovation and improvement. Here is the summary (average, in seconds) for a dataset of 20,000,000 "patent ids" joined over 4,000 classes.
| SQL Server | CPU | GPU |
| --- | --- | --- |
| 7 (actually for 5,000,000 over 3,000 classes) | 80 | 0.012 |
GPU performance does not include moving the data back into host memory; that takes maybe a second. Deploying the original data to the GPU is done "off-line", so it is of little interest, but it also takes a second or so.
Optimizations
One optimization is so glaring that I cannot pass it by. Leave the first step intact, but then, instead of moving data to/from the GPU, convert the array of "matches" into a compacted array of matched indices and return that to the user. No need for zip iterators, and no need to shovel lots of data to/from the GPU:
- Deploy the list of “keys” (original list of patentIds in this case) to the GPU
- Find matches as before
- Convert the matches array into a compacted array of indices (see the sketch after this list)
- Return the array to the calling application
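A minimal sketch of the compaction step, assuming matches is the thrust::device_vector<bool> produced by the join kernels and nMatches was obtained with thrust::count() as before:

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/counting_iterator.h>

    // compact the boolean "matches" stencil into a list of matched
    // indices, entirely on the GPU
    thrust::device_vector<int> matchedIndices(nMatches);
    thrust::copy_if(thrust::counting_iterator<int>(0),                    // indices 0..n-1
                    thrust::counting_iterator<int>((int)matches.size()),
                    matches.begin(),                                      // stencil: the match flags
                    matchedIndices.begin(),
                    thrust::identity<bool>());                            // keep where the flag is true

The compacted index array then goes back to the host with thrust::copy, exactly like any other device vector.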
Then the calling application can scan the array of indices on the CPU in O(nMatches) time, and return the actual filtered dataset.
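For completeness, a sketch of that final CPU-side gather, where patents, Patent, and result are hypothetical host-side names and hostIndices is the returned index array:

    // O(nMatches): pick the matched rows out of the original host-side table
    std::vector<Patent> result;
    result.reserve(hostIndices.size());
    for (size_t k = 0; k < hostIndices.size(); k++)
    {
        result.push_back(patents[hostIndices[k]]);
    }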
Besides the fact that much less data needs to be moved between the different spaces (managed, native, GPU), we also free the GPU implementation of any data complexity: no need to know what the table we are filtering actually looks like, as long as we are filtering on a single integer key!