Code
The complete source: here
Intro
I have started playing with graph algorithms on the GPU (who wants to wait for nvGraph from NVIDIA, right?). One of the precursor problems that came up was sorting large arrays of non-negative integers. Radix sort is a simple, effective algorithm quite suitable for GPU implementation.
Least Significant Digit Radix Sort
The idea is simple. We take a set A of fixed-size elements with a lexicographical ordering defined. We represent each element in a numeral system of a given radix (e.g., 2). We then scan each element from right to left and group the elements based on the digit within the current scan window, preserving the original order of the elements. The algorithm is described here in more detail.
Here is the pseudocode for radix = 2 (easy to extend to any radix):
k = 8 * sizeof(int)   // number of bits
n = array.length
for i = 0 to k - 1
    all_0s = []
    all_1s = []
    for j = 0 to n - 1
        if bit(i, array[j]) == 0 then all_0s.add(array[j])
        else all_1s.add(array[j])
    array = all_0s + all_1s
GPU Implementation
This is a poster problem for the “split” GPU primitive. Split is what <code>GroupBy</code> is in LINQ (or <code>GROUP BY</code> in T-SQL): a sequence of entries is split (grouped) into a number of categories. The pseudocode above applies split k times to an array of length n; each time the categories are “0” and “1”, and splitting/grouping is done based on the current digit of an entry.
The particular case of splitting into just two categories is easy to implement on the GPU. Chapter 39 of GPU Gems describes such an implementation (search for “radix sort” on the page).
The algorithm described there shows how to implement the pseudocode above by computing the position of each member of the array in the new merged array (the final concatenation in the pseudocode above) directly, without the intermediary of accumulating lists. We first mark each element with a flag: 1 if it has 0 in the “current” bit position, 0 otherwise. The new positions are then computed from an exclusive scan of these flags: scanned[0] = 0 and scanned[i] = scanned[i - 1] + flag[i - 1], i.e., scanned[i] counts the “0”-category elements before position i. By the end of the scan we know where the “1” category starts: it is totalFalses = scanned[n - 1] + flag[n - 1], the total length of the “0” category. The new address of each member of the array follows from the scan: for the “0” category it is simply scanned[i] (the scan value only increases when a 0 bit is encountered), and for the “1” category it is totalFalses + (i - scanned[i]): the element's position in the original array minus the number of “0”-category members before it, offset by the start of the “1” category.
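One split pass can be modeled on the CPU in Python (a sketch for illustration; the flag/scan/scatter names mirror the description above, not any Alea.CUDA API):

```python
def split_by_bit(array, bit):
    """One split pass: stable-partition by the given bit using scan-computed addresses."""
    n = len(array)
    flags = [((x >> bit) & 1) ^ 1 for x in array]   # 1 where the bit is 0
    # exclusive scan: scanned[i] = number of "0"-category elements before position i
    scanned = [0] * n
    for i in range(1, n):
        scanned[i] = scanned[i - 1] + flags[i - 1]
    total_falses = scanned[n - 1] + flags[n - 1]    # where the "1" category starts
    out = [None] * n
    for i, x in enumerate(array):
        # "0" category goes to its scan value; "1" category goes after all the zeros
        addr = scanned[i] if flags[i] == 1 else total_falses + (i - scanned[i])
        out[addr] = x
    return out

print(split_by_bit([5, 2, 7, 4, 1], 0))
# [2, 4, 5, 7, 1]  -- evens first, odds after, both in original order
```

Note that the pass is stable in both categories, which is exactly what the correctness of LSD radix sort requires.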
The full algorithm has two parts: sorting each block as described above, then merging the results. In our implementation we skip the second step, since the Alea.CUDA DeviceScanModule does the merge for us (inefficiently, at every iteration, but it makes for simpler, more intuitive code).
Coding with Alea.CUDA
The great thing about the Alea.CUDA library v2 is the Alea.CUDA.Unbound namespace, which implements all kinds of scans (and reducers, since a reducer is just a scan that throws away almost all of its results).
let sort (arr : int []) =
    let len = arr.Length
    if len = 0 then [||]
    else
        let gridSize = divup arr.Length blockSize
        let lp = LaunchParam(gridSize, blockSize)

        // reducer to find the maximum number & get the number of iterations from it
        use reduceModule = new DeviceReduceModule<int>(target, <@ max @>)
        use reducer = reduceModule.Create(len)

        use scanModule = new DeviceScanModule<int>(target, <@ (+) @>)
        use scanner = scanModule.Create(len)

        use dArr = worker.Malloc(arr)
        use dBits = worker.Malloc(len)
        use numFalses = worker.Malloc(len)
        use dArrTemp = worker.Malloc(len)

        // Number of iterations = bit count of the maximum number
        let numIter = reducer.Reduce(dArr.Ptr, len) |> getBitCount

        let getArr i = if i &&& 1 = 0 then dArr else dArrTemp
        let getOutArr i = getArr (i + 1)

        for i = 0 to numIter - 1 do
            // compute significant bits
            worker.Launch <@ getNthSignificantReversedBit @> lp (getArr i).Ptr i len dBits.Ptr
            // scan the bits to compute starting positions further down
            scanner.ExclusiveScan(dBits.Ptr, numFalses.Ptr, 0, len)
            // scatter
            worker.Launch <@ scatter @> lp (getArr i).Ptr len numFalses.Ptr dBits.Ptr (getOutArr i).Ptr

        (getOutArr (numIter - 1)).Gather()
This is basically it. One small optimization: instead of looping 32 times (the size of int in F#), we figure out the bit count of the largest element and iterate only that many times.
Helper functions/kernels are straightforward enough:
let worker = Worker.Default
let target = GPUModuleTarget.Worker worker
let blockSize = 512

[<Kernel; ReflectedDefinition>]
let getNthSignificantReversedBit (arr : deviceptr<int>) (n : int) (len : int) (revBits : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then
        revBits.[idx] <- ((arr.[idx] >>> n &&& 1) ^^^ 1)

[<Kernel; ReflectedDefinition>]
let scatter (arr : deviceptr<int>) (len : int) (falsesScan : deviceptr<int>) (revBits : deviceptr<int>) (out : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then
        let totalFalses = falsesScan.[len - 1] + revBits.[len - 1]
        // when the original bit is 1, the element is offset by its position minus the scan value, plus totalFalses;
        // when it is 0, the scan value itself is the right address
        let addr = if revBits.[idx] = 1 then falsesScan.[idx] else totalFalses + idx - falsesScan.[idx]
        out.[addr] <- arr.[idx]

let getBitCount n =
    let rec getNextPowerOfTwoRec n acc =
        if n = 0 then acc
        else getNextPowerOfTwoRec (n >>> 1) (acc + 1)
    getNextPowerOfTwoRec n 0
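The getBitCount helper returns the number of bits needed to represent its argument, which is exactly what Python's built-in int.bit_length computes. A quick CPU check of that equivalence (the loop mirrors the F# recursion):

```python
def get_bit_count(n):
    """Count right-shifts until n becomes 0: the number of significant bits in n."""
    acc = 0
    while n != 0:
        n >>= 1
        acc += 1
    return acc

for n in [0, 1, 5, 255, 256, 2**31 - 1]:
    assert get_bit_count(n) == n.bit_length()
print("ok")
```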
Generating Data and Testing with FsCheck
I find FsCheck quite handy for quick testing and generating data with minimal setup in F# scripts:
let genNonNeg = Arb.generate<int> |> Gen.filter ((<=) 0)

type Marker =
    static member arbNonNeg = genNonNeg |> Arb.fromGen
    static member ``Sorting Correctly`` arr = sort arr = Array.sort arr

Arb.registerByType(typeof<Marker>)
Check.QuickAll(typeof<Marker>)
Just a quick note: the first line may seem weird at first glance: it looks like the filter condition is dropping non-positive numbers. A more attentive second glance reveals that it is actually filtering out negative numbers: <code>(<=) 0</code> is a partially applied operator, so <code>((<=) 0) x</code> evaluates to <code>0 <= x</code> (currying and prefix notation should convince anyone in doubt).
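The same property-based check can be mimicked in plain Python with a hand-rolled randomized loop (a sketch, with a compact CPU radix sort standing in for the GPU <code>sort</code>):

```python
import random

def lsd_radix_sort(array):
    """Compact binary LSD radix sort for non-negative ints (CPU stand-in for the GPU sort)."""
    num_iter = max(array, default=0).bit_length()  # iterate only over significant bits
    for i in range(num_iter):
        zeros = [x for x in array if (x >> i) & 1 == 0]
        ones = [x for x in array if (x >> i) & 1 == 1]
        array = zeros + ones
    return array

# "Sorting Correctly": the property holds on randomly generated non-negative arrays
for _ in range(100):
    arr = [random.randrange(1 << 31) for _ in range(random.randrange(200))]
    assert lsd_radix_sort(arr) == sorted(arr)
print("property holds")
```

FsCheck adds shrinking and generator registration on top of this pattern, which is why it needs so little setup in the F# version.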
Large Sets Test
I have performance-tested this on large datasets (10,000,000 to 300,000,000 integers); any more and I run out of memory on my 6 GB Titan GPU. The chart below goes up to 100,000,000.
I generated data for these tests on the GPU as well, using the Alea.CUDA random generators, since the FsCheck generator above is awfully slow on large arrays.
The code comes almost verbatim from this blog post:
let generateRandomData n =
    if n <= 0 then failwith "n should be positive"
    let seed = uint32 DateTime.Now.Second
    // set up the random number generator
    use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32>
    use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n
    // create random numbers
    cudaRandom.Fill(0, n, prngBuffer)
    // transfer results from device to host
    prngBuffer.Gather()
    |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x))
except that I'm converting the generated array to integers in the [0; n] range. Since my radix sort's work complexity is O(k * n) (step complexity O(k)), where k = bitCount(max input), I figured this would give an adequate (kinda) comparison with the native F# Array.sort.
Optimization
It is clear from the above chart that there is a lot of latency in our GPU implementation - the cost we incur for the coding nicety: by using DeviceScanModule and DeviceReduceModule we bracket the merges we would otherwise have to do by hand. Hence the first possible optimization: bite the bullet, do it right, with a single merge at the end of the process, and perform a block-by-block sort with BlockScanModule.