Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 2

Note: All this code is now on GitHub. Compute the matches: Here is a simple, purely brute-force algorithm for computing the join mentioned in Part 1. Here is the entirely "CPU" implementation of the algorithm: loop over both datasets, compare them one by one, and flag each match. The only thing to note …
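For readers skimming the excerpt, the brute-force idea can be sketched roughly as follows. This is a minimal illustration, not the code from the post or the GitHub repository: the function name `flag_matches` and the parameters `keys` (the join column of the larger table) and `wanted` (the values to match against) are hypothetical, and flagging each row of `keys` rather than each pair is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a brute-force, CPU-only match step.
// keys   : join-column values of the larger dataset (assumed to be ints)
// wanted : values from the smaller dataset we are joining against
// Returns one flag per element of keys: true if it equals any element of wanted.
std::vector<bool> flag_matches(const std::vector<int>& keys,
                               const std::vector<int>& wanted) {
    std::vector<bool> flags(keys.size(), false);
    for (std::size_t i = 0; i < keys.size(); ++i) {
        for (std::size_t j = 0; j < wanted.size(); ++j) {
            if (keys[i] == wanted[j]) {
                flags[i] = true;  // flag the match
                break;            // one match is enough to flag this row
            }
        }
    }
    return flags;
}
```

The cost is proportional to `keys.size() * wanted.size()` comparisons. Each row's flag is computed independently of every other row, which is the property that makes this kind of loop a natural candidate for one GPU thread per row.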

Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 1

This is a post in two parts: Part 1 - The problem, solution setup, the algorithm. Part 2 - (The juicy) Implementation details, discussion. Suppose at the heart of the data layer of a web application there is a join like this: This join filters patents belonging to a set of classes from the Patents …

Compiling CUDA Projects with Dynamic Parallelism (VS 2012/13)

Just a quick note. If you are starting from a template C++ CUDA project in VS 2012/2013, calling a kernel from a kernel (dynamic parallelism) will not compile: error : kernel launch from __device__ or __global__ functions requires separate compilation mode. To fix this, first make sure your hardware supports it (cc 3.5 or higher) …

Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 2: Algorithms

In the previous post I spoke briefly about the motivations for implementing self-organizing maps in F# on the GPU with CUDA. I have finally been able to outperform a single-threaded C++ implementation by a factor of about 1.5. This is quite modest, but on the other hand rather impressive since we started out by being 60 …

Computing Self-Organizing Maps in a Massively Parallel Way with CUDA. Part 1: F#

By 2017, it is expected that GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture. Such a system eliminates some of accelerator architectures’ historical challenges, including requiring the programmer to manage multiple memory spaces, suffering from bandwidth …