## Modelling Stochastically Independent Processes with F# Computation Expressions: Part 2

Code for this entry: here.

In this first part, we started ~~shamelessly plagiarizing~~ creatively reproducing a Clojure monad that does a great job describing stochastically independent processes.

After a few trials I became convinced, that the approach in the blog post mentioned above was indeed optimal: a “natural” approach of modelling an event as a (value, probability) tuple is flawed in a sense that the probability needs to always be carried around and that gets in the way of describing the process in a declarative way.

### Auxiliaries

I have defined a couple of auxiliary functions, just like in the Clojure case:

type Microsoft.FSharp.Collections.Map<'Key, 'Value when 'Key : comparison> with static member mapValues (f : 'Value -> 'U) x = x |> Map.toSeq |> Seq.map (fun (k, v) -> (k, f v)) let uniform (v : int seq) = let v = v |> Seq.toList v |> Seq.map (fun i -> i, 1N / BigRational.FromInt v.Length) |> Map.ofSeq

I have extended the Map type (because I can), to make the code below even more readable.

### Computation expression

type IndEventBuilder () = member this.Zero () = Map.empty.Add(0, 0.) member this.Return (e : 'a) : Map<'a, BigRational> = Map.empty.Add(e, 1N) member this.Bind (m : Map<'a, BigRational>, fn : 'a -> Map<'b, BigRational>) : Map<'b, BigRational> = m |> Map.toSeq |> Seq.map (fun (key, value) -> (fn key) |> Map.mapValues (fun x -> x * value)) |> Seq.concat |> Seq.groupBy(fun (key, value) -> key) |> Seq.map(fun (key, sq) -> (key, sq |> Seq.map (fun (key, value) -> value) |> Seq.sum)) |> Map.ofSeq let independent = IndEventBuilder()

Zero and Return functions are necessary: Zero because I want to use “if” statements within the computation expression, and Return is always necessary as a terminator of binding chains: Bind(.., Bind(.., (.., Return))). In that precise sense, Return makes the event marked by its argument occur with the probability 1.

Bind takes a distribution and unwraps it by making each event in the distribution occur together with what comes next:

- For each event e in the distribution
- Take the event and map it to the distribution that comes afterwards by multiplying the probability of this event by the probability of every event in the distribution that follows
- Take the sequence of tuple sequences produced in 2 and flatten them
- Isolate actual events and add up their individual probabilities, to compute the actual probability of each distinct event
- Turn the resulting sequence into a map.

### Red Dog

Here is a slightly more interesting application: modelling the Red Dog game. Here is the “naive” version:

let simpleRedDog = independent { let! card1 = cardProb let! card2 = cardProb let! card3 = cardProb let minCard = min card1 card2 let maxCard = max card1 card2 if card1 = card2 then if card2 = card3 then return 10 else return 0 elif minCard + 1 = maxCard then return 0 elif minCard < card3 && card3 < maxCard then return 1 else return -1 }

This is the version from the Clojure blog (the rules are explained there as well). All we need to do is understand the rules and write them out. The only problem is that this implementation assumes cards are being replaced in the deck while the hand is played. Easy to fix, however, it actually gives close to correct results (with probabilities being small enough, rigorously accounting for replacements does not matter much):

val simpleRedDog : Map<int,BigRational> = map [(-1, 88/169N); (0, 36/169N); (1, 44/169N); (10, 1/169N)]

Here is the fixed version, modelling the process without replacements:

let redDog = let removeCard (v : BigRational) (i : BigInteger) = if v = 0N then v else let mult = BigInteger range / v.Denominator BigRational.FromBigInt (v.Numerator * mult - i) / BigRational.FromBigInt (v.Denominator * mult - BigInteger.One) let distWithRemoved card dist = dist |> Map.map (fun key v -> if key <> card then removeCard v BigInteger.Zero else removeCard v BigInteger.One) independent { let! card1 = cardProb let removedCard = distWithRemoved card1 cardProb let! card2 = removedCard let! card3 = distWithRemoved card2 removedCard let minCard = min card1 card2 let maxCard = max card1 card2 if card1 = card2 then if card2 = card3 then return 10 else return 0 elif minCard + 1 = maxCard then return 0 elif minCard < card3 && card3 < maxCard then return 1 else return -1 } let expect dist = dist |> Map.toSeq |> Seq.sumBy (fun (k, v) -> float k * BigRational.ToDouble v)

We simply adjust for the removed card in the distWithRemoved function. (Expressing this in Clojure is harder because monads are not part of the language, and so succinctness characteristic of Clojure will be affected).

The results:

val redDog : Map<int,BigRational> = map [(-1, 8624/16575N); (0, 1112/5525N); (1, 352/1275N); (10, 1/425N)]

These are not all that different from the approximation. Even though the chance of an 11-fold gain appears to be about 3 times higher in the simplified version, the real difference between 0.59% and 0.23% (actual) is probably not all that great. Expected gains are also pretty close:

> expect redDog;; val it : float = -0.220693816 > expect simpleRedDog;; val it : float = -0.201183432

### BUGS/JAGS

BUGS & JAGS are famous enough with Bayesian statisticians to not require detailed description. Usage is based on a very cool principle of declarative programming: simply describe the distributions and let the system do the sampling behind the scenes. While the creators of these packages probably were not thinking in terms of functional programming, the results, in view of everything described above, are if not close, at least strikingly similar: we extend the language to model the process chain in a declarative way.

## Modelling Stochastically Independent Processes with F# Computation Expressions: Part 1

The idea for doing this is not new. There is an excellent series of posts closely tracing an article on applications of functional programming to probability.

A colleague of mine has recently called my attention to his own post of two years ago, where he describes a monad that models stochastically independent events in Clojure. I thought the latter implementation actually went to the heart of the whole idea of monads and that is why, once I started writing my own in F# from scratch, event though I naturally/brute-force-like, chose something described below and which corresponds exactly to the series of posts above, I eventually was compelled to do it my colleague’s way.

Here is the natural choice: a distribution (p.m.f) is just a sequence of events. Each event is simply a record/tuple/map of two values: the signifier of the event and its probability.

//underlying type type 'a Outcome = { Value: 'a Probability : BigRational } //Monadic type type 'a Distribution = 'a Outcome seq

This is great, and it works very well, as the posts show. My slight dissatisfaction with this approach is due to the fact that the competing one appears much more transparent, extremely easy to read, easy to use. In fact, anyone who knows Clojure syntax can use it right away to model independent processes simply by writing them out in Clojure, AND there is no need for any helper functions to extract the results. It just works (from the post, with some minor corrections):

(domonad probability-m [die1 (uniform [1 2 3 4 5 6]) die2 (uniform [1 2 3 4 5 6])] (+ die1 die2))

This code models an experiment of rolling two dice and returns a probability mass function that describes the experiment. All we had to do as users was describe the experiment in the most natural way possible: what is the probability of getting an exact value as the sum of both values of 2 rolled dice. Hence the expression: extract the value from the first pmf, add it to the value extracted from the second. Don’t even need to think about probabilities, as the magic happens behind the monadic scene.

The code above gives us the result: {2 1/36, 3 1/18, 4 1/12, 5 1/9, 6 5/36, 7 1/6, 8 5/36, 9 1/9, 10 1/12, 11 1/18, 12 1/36} – which is a map of the sum of dice values to the probability of getting them. In F#, I believe, the Algol-like syntax actually works to our advantage (I call the computation expression “independent” for obvious reasons):

independent { let! x = uniform [1..6] let! y = uniform [1..6] return x + y }

When executed in F# interactive, and using the FSharp.PowerPack:

val mp : Map<int,BigRational> = map [(2, 1/36N); (3, 1/18N); (4, 1/12N); (5, 1/9N); (6, 5/36N); (7, 1/6N); (8, 5/36N); (9, 1/9N); (10, 1/12N); ...]

We, amazingly enough, get the same answers. Next time: better examples and nuts & bolts of the code.

## Generating Permutations: Clojure or F#: Part 2

Marching on from the last post.

### Lazy Sequences

This is my favorite feature ever. If I want to generate just a few of 10! (nobody even knows how much that is) permutations, I could:

(take 10 (permute [1 2 3 4 5 6 7 8 9 10]))

provided, the function is defined (as described in the first post):

(defn permute [v] (when-let [[pos2 pos1] (findStartingPos v)] (let [nxt (sort-remainder (swapDigits v pos2 pos1) (inc pos1))] (cons nxt (lazy-seq (permute nxt))))))

Here I am not sure which language I like more. Clojure has easier syntax: everything fits nicely within the recursive function call. Returning nil terminates the loop, while in F# you need to know to return an option type where None terminates iteration. On the other hand, I like the fact that everything is neatly wrapped in the “unfold” function: seems more “natural” to me: fold/unfold – there is a certain symmetry here. Also, everything exists nicely in this LINQ-like world…

let permute (v : 'a array when 'a: comparison) = Seq.unfold (fun prev -> match findStartingPos prev with | None -> None | Some (cur, pos) -> Some(prev, sortRemainder (swapPositions prev cur pos) (pos + 1))) v

### Weak Typing

I really like Clojure weak typing. And I like the F# strong type system:

let sortRemainder (v : 'a array) pos = if v.Length - 1 = pos then v else [| yield! v.[0..pos - 1] yield! Array.sort v.[pos..v.Length - 1]; |]

F# type system requires that the first argument be qualified, but it is happy with this abbreviation, while the full qualification should be:

let sortRemainder (v : 'a array when 'a: comparison) pos =

Since we are sorting a subvector, the array has to be of a “comparable” type. Which is the condition of the applicability of the algorithm.

In Clojure it looks simpler, but it’s essentially the same:

(defn sort-remainder [v pos1] (if (= (dec (count v)) pos1) v (into (subvec v 0 pos1) (sort (subvec v pos1)))))

### Tail Recursion

One more cool feature of functional languages. I think it’s another tie once you use it, although the “loop” construct that demands it is very nice.

The following function returns a tuple (current, found) of two positions within the array: one of the element that is being “promoted” up (current), and the other – of the smaller element being pushed back. (So, current > found && v[current] < v[found]). Or nil/None if no such pair can be found. This is the key function of the algorithm:

(defn findStartingPos [v] (loop [cur (dec (count v)) acc [-1 -1]] (let [maxPos (second acc)] (if (or (< cur maxPos) (< cur 0)) (if (= maxPos -1) nil acc) (if-let [pos (findFirstLessThan v cur)] (recur (dec cur) (if (< maxPos pos) [cur pos] acc)) (recur (dec cur) acc))))))

And F#:

let findStartingPos v = let rec findStartingPosRec cur acc = let maxPos = snd acc if cur < 0 || cur < maxPos then if maxPos < 0 then None else Some acc else let pos = findFirstLessThan v cur match pos with | Some pos -> findStartingPosRec (cur - 1) (if maxPos < pos then (cur, pos) else acc) | None -> findStartingPosRec (cur - 1) acc findStartingPosRec (v.Length - 1) (-1, -1)

It’s nice that we have a “loop” keyword in Clojure to provide cleaner syntax and more discipline for defining tail-recursive functions, but I am not appalled with the way we do it in F# either.

(The above functions contain obvious optimizations: we stop scanning once we have a pair of “swappable” elements and we have moved beyond the “found” position. Also, we discard a valid pair if we already have a pair where “found” position is larger than the “found” position of the current iteration).

## Doing it in a Massively Parallel Way

Of course, I like everything parallel… So what about doing it on a GPU, using, say CUDA? It is definitely possible, although probably not very practical. Even if we only have an array of 10 distinct elements, the number of permutations is already ridiculously large (although who knows what we are going to be using them for)… In any event, this is solvable if we can get “random access” to permutations. Instead of unfolding them as a lazy sequence, generate them all at once in a massively parallel fashion.

This is possible because permutations are neatly tied to factoradic numbers, as this Wikipedia article explains. So, it is always possible to generate “permutation #10″ to be guaranteed different from “permutation #5″ for distinct, fully ordered sets. (Any sets where ordering relationship is not defined can still be easily permuted as long as its elements are stored in indexed data structures, such as arrays, by simply generating permutations of indices). Thus, taking CUDA “single data multiple threads” computation model it is easy to generate all (or however many) permutations in parallel. Naturally, if we are not just outputting the results but need to store them, the exponential nature of the problem memory growth, as well as the number of threads required, and the limited amount of GPU memory (a single computer RAM for that matter) will quickly become a problem. I guess the CUDA C++ version of this will have to wait until the next job interview…

## Generating Permutations: Clojure or F#: Part 1

## The Alogirthm

Recently, I have entered a brave (new?) world of Clojure and was looking for a small project to take it for a ride. I stopped on a popular/interesting enough little problem, that subsumed a certain interview question which I was once unfortunate enough to stumble through with half my brain tied behind my back (long story).

The algorithm to generate permutations in sequence is actually quite simple and intuitive, I believe it originates somewhere in India thousands of years ago. At least that was the impression I got from a Wikipedia article. Anyway, given a fully ordered set S, generate all permutations of it. The algorithm works for sets with repetition, but we can assume no repetitions for simplicity:

- (Since the set is fully ordered), sort it in ascending order, represent it as a sequence. Output it.
- Find the next sequence consisting of the elements of S and following the current one immediately based on the ordering relationship. Output it.
- Repeat 2, until no such sequence can be found.

The juice here is of course 2. To illustrate with an example. Suppose our sequence is just numbers “{}” here mean sequence, order matters: S ={1, 2, 3, 4}. The next one in lexicographical order is: {1, 2, 4, 3}. The one after it: {1, 3, 2, 4}. The one after: {1, 3, 4, 2}. Shall we do one more? {1, 4, 2, 3} Maybe just one more, so it becomes even clearer: {1, 4, 3, 2}.

So the key here, in case you have not caught on yet:

- Starting from the end of the sequence, going backwards, find the first smallest position in the sequence, if such exists, where the element in current position can be swapped with the element in the found position to make the sequence “larger” than the previous one. This is accomplished if the element at the current position is less than the element at the found position. Swap elements in the current and the found positions.
- Sort the sub-sequence starting at the found + 1 position, ascending. So, in the above example we get: {1, 3, 2, 4}

For instance, looking at {1, 2, 4, 3} we find: (2 < 3), (2 < 4), (1 < 2). But 2 < 3 is the very first pair that fits the bill (we are starting from the back), so we don't go any further and swap them. We get {1, 3, 4, 2}

Et c’est tout!

(How simple is that? I bet the brute force recursive “solution” feels very stupid right about now).

## Implementation

At this point it’s all but unnecessary to actually show the implementations. However, I just had fun with some of the Clojure concepts/structures, and wanted to compare them to what I am used to in F#. So this is the point. Meta-programming. The Clojure code is here and F# – here.

Observations:

### Immutability

The way the algorithm is structured, immutability is intuitive: we are producing a new sequence in each step, and old sequences should also be preserved and output. Should have been a no-brainer, since we are dealing with functional languages with immutable structures. And here was a slight jolt. Consider a function that swaps elements in the sequence (the algorithm calls for swapping, so…):

Clojure:

(defn swapDigits [v pos1 pos2] (assoc v pos2 (v pos1) pos1 (v pos2)))

(I was using strings of digits as input sets in my initial awkward steps through the language, hence the function name).

Now, I wanted something as short and as sweet as this for F#, so I cheated! .NET arrays are mutable and used I .NET arrays instead of F# lists (hope functional programming purists will commute a very harsh sentence I deserve for writing the following, and not condemn me to Java programming for life):

let swapPositions (v : 'a array) pos1 pos2 = let res = Array.copy v res.[pos1] <- v.[pos2] res.[pos2] <- v.[pos1] res

So yes. The code at line 2 is pure sin and I should have used F# lists which are immutable, and the code should have been:

let swapPositions (v : 'a list) pos1 pos2 = v |> List.permute (fun i -> if i = pos1 then pos2 elif i = pos2 then pos1 else i)

but frankly, this is not the (very) first thing that comes to mind if you haven’t been using the language for a long time, and not as clear as the Clojure solution, which was close enough to the surface for a noob like myself to scoop it up.

Alright. This has gone on long enough, more soon…

## Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 2

### Note:

All this code is now on GitHub.

### Compute the mathces

Here is a simple, purely brute-force algorithm for computing the join mentioned in Part 1.

Here is the entirely “CPU” implementation of the algorithm:

std::vector<bool> matches(size); std::fill(matches.begin(), matches.end(), false); for (int i = 0; i < size; i++) { for (int j = 0; j < classes.size(); j++) { bool match = classes[j] == classCode[i]; if (match) { matches[i] = match; } } }

Loop over both datasets, compare them one-by-one, if there is a match – flag it. The only thing to note is that while it’s tempting to write the inner loop like this:

matches[i] = classes[j] == classCode[i];

it would be an obvious bug to do so.

This has a loathsome O(m*n) complexity (m, n – sizes of the datasets), and, in general this code is criminal.

In an attempt to decriminalize it, we delegate this work to the GPU as follows:

__global__ void joinValue(const int left, const int * const right, int rightSize, bool * const matches) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < rightSize) { bool match = left == right[idx]; if (match) { matches[idx] = match; } } } __global__ void join(const int * const left, const int * const right, int leftSize, int rightSize, int childBlocks, int childThreadsPerBlock, bool * const matches) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < leftSize) { joinValue<<<childBlocks, childThreadsPerBlock>>>(left[idx], right, rightSize, matches); } }

Pretty simple. The “join” kernel is used to “loop” over the filtering map one entry per thread. This calls a child kernel to “loop” over the “database table”. The output array of “matches” is originally instantiated to “false”. We are still O(m*n) in terms of work complexity, but step complexity is O(1) – just two steps are necessary.

Thrust is used to extract raw pointers to the data and pass them to the kernels. No need for CUDA boiler-plate to allocate, copy, and free the memory:

const int * const right = thrust::raw_pointer_cast(d_patentInfo.classCode.data()); bool * const matchesRaw = thrust::raw_pointer_cast(matches.data()); const int * const left = thrust::raw_pointer_cast(classes.data()); //nBlocks: calculated so we have one thread per element of the "left" dataset int nBlocks = (size + threadCount) / threadCount; //the child kernel has nChildBlocks: one thread per element of the "right" dataset int nChildBlocks = ((int)d_patentInfo.classCode.size() + threadCount) / threadCount; // launch join kernel int rightSize = (int)d_patentInfo.classCode.size(); join<<<nBlocks, threadCount>>>(left, right, size, rightSize, nChildBlocks, threadCount, matchesRaw); //... and wait for it to finish cudaDeviceSynchronize();

### Filter the data

From this point on, everything is done with Thrust.

We count the number of matches and allocate a structure of arrays on the GPU to receive them. The constructor for the structure allocates every one of its arrays.

int nMatches = (int)thrust::count(matches.begin(), matches.end(), true); PatentInfo relevantPatents(nMatches);

We copy all of the filtered data into the new location in one shot, using the Thrust zip iterators:

// output iterator. "matches" at the end has a larger size than nMatches. PatentZipIterator relevantPatentsIter = thrust::make_zip_iterator(thrust::make_tuple(relevantPatents.patentId.begin(), relevantPatents.classCode.begin(), relevantPatents.appYear.begin(), relevantPatents.issueYear.begin(), relevantPatents.companyId.begin(), relevantPatents.isImportant.begin(), matches.begin())); //input iterator PatentZipIterator patentZipIter = thrust::make_zip_iterator(thrust::make_tuple(d_patentInfo.patentId.begin(), d_patentInfo.classCode.begin(), d_patentInfo.appYear.begin(), d_patentInfo.issueYear.begin(), d_patentInfo.companyId.begin(), d_patentInfo.isImportant.begin(), matches.begin())); PatentZipIterator patentZipIterEnd = thrust::make_zip_iterator(thrust::make_tuple(d_patentInfo.patentId.end(), d_patentInfo.classCode.end(), d_patentInfo.appYear.end(), d_patentInfo.issueYear.end(), d_patentInfo.companyId.end(), d_patentInfo.isImportant.end(), matches.end())); //copy to the output if we are matching thrust::copy_if(patentZipIter, patentZipIterEnd, relevantPatentsIter, PatentInfo::is_matching());

Here `PatentZipIterator`

is defined as follows:

typedef thrust::tuple<IntIterator, IntIterator, IntIterator, IntIterator, IntIterator, BoolIterator, BoolIterator> PatentIteratorTuple; typedef thrust::zip_iterator<PatentIteratorTuple> PatentZipIterator;

A zip iterator in Thrust is similar to Zip() function in LINQ. In LINQ, Zip() is limited to two sequences, while in Thrust zip iterator can be used with as many sequences as the `thrust::tuple`

type can handle. In large part, the purpose of this exercise was to take this iterator for a ride: an obvious optimization (discussed later) would not really require this complexity.

In this implementation, we zip up all the arrays in our PatentInfo structure together with the array of matches in a septuple. `thrust::make_zip_iterator()`

takes a tuple of iterators and creates a single iterator that allows us to march thrust algorithms on seven arrays in lock-step. So, `thrust::copy_if()`

– our gathering algorithm can examine one element at a time using `PatentInfo::is_matching`

functor and see if this is a “matched” element. The predicate functor is implemented as follows:

typedef thrust::tuple<int, int, int, int, int, bool, bool> PatentTuple; struct is_matching : public thrust::unary_function < PatentTuple, bool > { __host__ __device__ bool operator() (const PatentTuple& patentTuple) { return thrust::get<6>(patentTuple); } };

Each element fed to the predicate is a septuple wrapped into the PatentTuple type, the last member of each septuple is an element of the `matches[]`

array that determines the return value of the predicate.

After the gathering algorithm is done, we are ready to move the results to the host memory, which is done again painlessly by Thrust.

## Discussion

### Performance

Performance is, unfortunately, ridiculously high. “Unfortunately” because bad performance is a great drive force of innovation and improvement. Here is the summary (in seconds, avg) for a dataset of 20,000,000 “patent ids” joined over 4000 classes.

SQL Server | CPU | GPU |

7(actually for 5,000,000 over 3000 classes) | 80 | 0.012 |

GPU performance does not include moving the data back into the host memory – this maybe takes a second. Deploying the original data to the GPU is done “off-line”, so is of little interest, but it also takes a second or so.

### Optimizations

One optimization is so glaring, that I cannot pass it by. Leave the first step intact, but then, instead of moving data to/from the GPU, convert the array of “matches” to the array of indices matched and return it to the user. No need for zip iterators, no need to shovel lots of data to/from the GPU:

- Deploy the list of “keys” (original list of patentIds in this case) to the GPU
- Find matches as before
- Convert the matches array into a compacted array of indices
- return the array to the calling application

Then the calling application can scan the array of indices on the CPU in O(nMatches) time, and return the actual filtered dataset.

Besides the fact, that much less data needs to be moved between different spaces (managed, native, GPU), we also free the GPU implementation of any data complexity: no need to know what the table we are filtering actually looks like, as long as we are filtering on a single integer key!

## Supercharging SQL Join with GTX Titan, CUDA C++, and Thrust: Part 1

This is a post in two parts:

Part 1 – The problem, solution setup, the algorithm.

Part 2 – (The juicy) Implementation details, discussion.

Suppose at the heart of the data layer of a web application there is a join like this:

select distinct p.PatentID, p.ClassId, p.CompanyId, p.IssueYear, p.AppYear, p.IsImportant from @topClasses t join Patents p on p.ClassId = t.ClassId

This join filters patents belonging to a set of classes from the Patents table.

Semantics are not that important, but the fact that a table of classes is on the order of a few thousands, while that of patents – a few million, and so is the resulting filtered dataset. Classes table variable contains unique entries for classes, and patents belonging to these classes are extracted. This join takes around 6 – 8 sec on SQL Server 2014 (timing SSMS, probably even less time in reality), not a whole lot of time, but a lifetime in some contexts. Such as a web application’s, for example.

Enter GTX Titan, fun begins.

## Brute Force, Massively Parallel Solution

We would like to deploy an equivalent of the Patents table to the GPU and see if we can speed up the join. For this we need:

- Define the data structures and deploy the data
- Filter (“join”) the data based on our set of keys
- Gather the result of this join
- Move it out from the GPU back to main memory

**Note: **There is more, of course: somehow the data needs to persist on the GPU between web server calls, requests need to be synchronized, and data has to be marshalled back and forth between managed and native space. These are all pretty standard problems I am not addressing in this post.

### Representing the data.

The data has been quite well normalized, so we are dealing with a bunch of 4-byte integers (and a boolean). This makes – 21 bytes per record, assuming booleans are stored as bytes. Since our table size is in the millions of records, we have in the order of tens of megabytes of data we would like to store on the GPU. With GTX Titan and its 6GB of memory, we are good!

The most intuitive way to represent this data on the device would be an array of structures:

struct PatentInfo { int patentId; int classCode; int appYear; int issueYear; int companyId; bool isImportant; }

**However…**

### Defining an optimized structure with Thrust

However, one of the basics of GPU optimization is coalesced global memory access. We are joining on just one field: ` classCode`

, and we want coalesced access to that field in particular.

So, we define our structure as a structure of arrays:

#include <thrust/device_vector.h> #include <thrust/remove.h> #include <thrust/transform_scan.h> #include <thrust/copy.h> #include <thrust/count.h> typedef thrust::device_vector<int> IntVector; typedef thrust::device_vector<bool> BoolVector; typedef thrust::device_vector<int>::iterator IntIterator; typedef thrust::device_vector<bool>::iterator BoolIterator; struct PatentInfo { IntVector patentId; IntVector classCode; IntVector appYear; IntVector issueYear; IntVector companyId; BoolVector isImportant; }

Here NVIDIA Thrust library is used to wrap device containers. I found this library to be quite helpful. Essential, really. In particular it’s extremely easy to deploy data to the GPU:

IntVector d_vector = h_vector;

The above copies the `thrust::host_vector<int>`

to the device. Or, if the data came from the managed space as an array:

IntVector d_vector(size); thrust::copy(h_data, h_data + size, d_vector.begin());

will copy the unmanaged host data stored in `h_data`

to the device.

**NOTE: It is important to keep in mind that all of the Thrust code should be placed in .cu files (or files included in .cu files), so NVCC compiler can get to it.**

Now that the data is safely on the device…

### The Algorithm

This is a simple, brute force algorithm. I am not a believer in brute force, so I would prefer to optimize what I currently have, unfortunately, it performs incredibly well and thus earns its right to exist.

Here it is in a nutshell (with the help of .NET LINQ notation in C#):

Given the `PatentInfo patentInfo `

structure (where each array is of the size `num_patents`

), and ` int const * const classes `

array of size ` num_classes `

, of distinct (preferably, but not necessarily) classes:

- (In a massively parallel fashion), build an array
`bool matches[num_of_patents]`

where`matches[i] = true`

iff there exists`j < num_classes: classes[j] == patentInfo[i]`

- Allocate a structure (on the device)
`PatentInfo relevantPatents`

of the`size == matches.Count(m => m)`

– “true” matches. - Copy all the arrays from the original structure into the new one, such that:
`array.Where((elem, i) => matches[i])`

– index of the element corresponds to a “true” match. - Copy the structure above to the host memory and return it

## Compiling CUDA Projects with Dynamic Parallelism (VS 2012/13)

Just a quick note.

If you are starting from a template C++ CUDA project in VS 2012/2013, calling a kernel from a kernel (dynamic parallelism) would not compile:

`error : kernel launch from __device__ or __global__ functions requires separate compilation mode`

To fix this, first make sure your hardware supports it (cc 3.5 or higher) and compute capability is set correctly in the compiler options (“compute_35,sm_35″)

Then set -rdc=true in CUDA C++ options:

Then add cudadevrt.lib to the libraries that are statically linked in the Linker options: