Home > CUDA, F# > GPU Split & Sort With Alea.CUDA

GPU Split & Sort With Alea.CUDA

Code

The complete source: here

Intro

Starting to play with graph algorithms on the GPU (who wants to wait for nvGraph from NVIDIA, right?) As one of the precursor problems – sorting large arrays of non-negative integers came up. Radix sort is a simple effective algorithm quite suitable for GPU implementation.

Least Significant Digit Radix Sort

The idea is simple. We take a set A of fixed size elements with lexicographical ordering defined. We represent each element in a numeric system of a given radix (e.g., 2). We then scan each element from right to left and group the elements based on the digit within the current scan window, preserving the original order of the elements. The algorithm is described here in more detail.

Here is a the pseudocode for radix = 2 (easy to extend for any radix):

k = sizeof(int)
n = array.length
for i = 0 to k
    all_0s = []
    all_1s = []
    for j = 0 to n
        if bit(i, array[j]) == 0
        then all_0s.add(array[j])
        else all_1s.add(array[j])
    array = all_0s + all_1s

GPU Implementation

This is a poster problem for the “split” GPU primitive. Split is what <code>GroupBy</code> is in LINQ (or <code>GROUP BY</code> in TSQL) where a sequence of entries are split (grouped) into a number of categories. The code above applies split k times to an array of length n, each time the categories are are “0” and “1”, and splitting/grouping are done based on the current digit of an entry.

The particular case of splitting into just two categories is easy to implement on the GPU. Chapter 39 of GPU Gems describes such implementation (search for “radix sort” on the page).

The algorithm described there shows how to implement the pseudocode above by computing the position of each member of the array in the new merged array (line 10 above) without an intermediary of accumulating lists. The new position of each member of the array is computed based on the exclusive scan where the value of scanned[i] = scanned[i – 1] + 1 when array[i-1] has 0 in the “current” position. (scanned[0] = 0). Thus, by the end of the scan, we know where in the new array the “1” category starts (it’s scanned[n – 1] + is0(array[n – 1]) – total length of the “0” category, and the new address of each member of the array is computed from the scan: for the “0” category – it is simply the value of scanned (each element of scanned is only increased when a 0 bit is encountered), and start_of_1 + (i – scanned[i]) for each member of the “1” category: its position in the original array minus the number of “0” category members up to this point, offset by the start of the “1” category.

The algorithm has two parts: sorting each block as described above and then merging the results. In our implementation we skip the second step, since Alea.CUDA DeviceScanModule does the merge for us (inefficiently so at each iteration, but makes for simpler, more intuitive code).

Coding with Alea.CUDA

The great thing about Alea.CUDA libray v2 is the Alea.CUDA.Unbound namespace that implements all kinds of scans (and reducers, since a reducer is just a scan that throws away almost all of its results).

let sort (arr : int []) =
    let len = arr.Length
    if len = 0 then [||]
    else
        let gridSize = divup arr.Length blockSize
        let lp = LaunchParam(gridSize, blockSize)
        
        // reducer to find the maximum number & get the number of iterations
        // from it.
        use reduceModule = new DeviceReduceModule<int>(target, <@ max @>)
        use reducer = reduceModule.Create(len)

        use scanModule = new DeviceScanModule<int>(target, <@ (+) @>)
        use scanner = scanModule.Create(len)

        use dArr = worker.Malloc(arr)
        use dBits = worker.Malloc(len)
        use numFalses = worker.Malloc(len)
        use dArrTemp = worker.Malloc(len)

        // Number of iterations = bit count of the maximum number
        let numIter = reducer.Reduce(dArr.Ptr, len) |> getBitCount

        let getArr i = if i &&& 1 = 0 then dArr else dArrTemp
        let getOutArr i = getArr (i + 1)

        for i = 0 to numIter - 1 do
            // compute significant bits
            worker.Launch <@ getNthSignificantReversedBit @> lp (getArr i).Ptr i len dBits.Ptr

            // scan the bits to compute starting positions further down
            scanner.ExclusiveScan(dBits.Ptr, numFalses.Ptr, 0, len)

            // scatter
            worker.Launch <@ scatter @> lp (getArr i).Ptr len numFalses.Ptr dBits.Ptr (getOutArr i).Ptr

        (getOutArr (numIter - 1)).Gather()

let generateRandomData n =
    if n <= 0 then failwith "n should be positive"
    let seed = uint32 DateTime.Now.Second

    // setup random number generator        
    use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32>
    use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n
    
    // create random numbers
    cudaRandom.Fill(0, n, prngBuffer)
    // transfer results from device to host 
    prngBuffer.Gather() |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x))


This is basically it. One small optimization - instead of looping 32 times (length of int in F#), we figure out the bit count the largest element and iterate fewer times.

Helper functions/kernels are straightforward enough:

let worker = Worker.Default
let target = GPUModuleTarget.Worker worker

let blockSize = 512

[<Kernel; ReflectedDefinition>]
let getNthSignificantReversedBit (arr : deviceptr<int>) (n : int) (len : int) (revBits : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then
        revBits.[idx] <- ((arr.[idx] >>> n &&& 1) ^^^ 1)

[<Kernel; ReflectedDefinition>]    
let scatter (arr : deviceptr<int>) (len: int) (falsesScan : deviceptr<int>) (revBits : deviceptr<int>) (out : deviceptr<int>) =
    let idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < len then

        let totalFalses = falsesScan.[len - 1] + revBits.[len - 1]

        // when the bit is equal to 1 - it will be offset by the scan value + totalFalses
        // if it's equal to 0 - just the scan value contains the right address
        let addr = if revBits.[idx] = 1 then falsesScan.[idx] else totalFalses + idx - falsesScan.[idx]
        out.[addr] <- arr.[idx]

let getBitCount n =
    let rec getNextPowerOfTwoRec n acc =
        if n = 0 then acc
        else getNextPowerOfTwoRec (n >>> 1) (acc + 1)

    getNextPowerOfTwoRec n 0

Generating Data and Testing with FsCheck

I find FsCheck quite handy for quick testing and generating data with minimal setup in FSharp scripts:

let genNonNeg = Arb.generate<int> |> Gen.filter ((<=) 0)

type Marker =
    static member arbNonNeg = genNonNeg |> Arb.fromGen
    static member ``Sorting Correctly`` arr =
        sort arr = Array.sort arr

Arb.registerByType(typeof<Marker>)
Check.QuickAll(typeof<Marker>)


Just a quick note: the first line may seem weird at first glance: looks like the filter condition is dropping non-positive numbers. But a more attentive second glance reveals that it is actually filtering out negative numbers (currying, prefix notation should help convince anyone in doubt).

Large sets test.

I have performance tested this on large datasets (10, 000, 000 - 300, 000, 000) integers, any more and I'm out of memory on my 6 Gb Titan GPU. The chart below goes up to 100,000,000.
I have generated data for these tests on the GPU as well, using Alea.CUDA random generators, since FsCheck generator above is awfully slow when it comes to large enough arrays.

The code comes almost verbatim from this blog post:

let generateRandomData n =
    if n <= 0 then failwith "n should be positive"
    let seed = uint32 DateTime.Now.Second

    // setup random number generator        
    use cudaRandom = (new XorShift7.CUDA.DefaultNormalRandomModuleF32(target)).Create(1, 1, seed) :> IRandom<float32>
    use prngBuffer = cudaRandom.AllocCUDAStreamBuffer n
    
    // create random numbers
    cudaRandom.Fill(0, n, prngBuffer)
    // transfer results from device to host 
    prngBuffer.Gather() |> Array.map (((*) (float32 n)) >> int >> (fun x -> if x = Int32.MinValue then Int32.MaxValue else abs x))

except I'm converting the generated array to integers in the [0; n] range. Since my radix sort work complexity is O(k * n) (step complexity O(k))  where k = bitCount(max input) I figured this would give an adequate (kinda) comparison with the native F# Array.orderBy .

sorting_chart

Optimization

It is clear from the above chart, that there is a lot of latency in our GPU implementation - the cost we incur for the coding nicety: by using DeviceScanModule and DeviceReduceModule we bracket the merges we would otherwise have to do by hand. Hence the first possible optimization: bite the bullet, do it right, with a single merge at the end of the process and perform a block-by-block sort with BlockScanModule

Categories: CUDA, F# Tags: ,
  1. No comments yet.
  1. June 12, 2016 at 11:59 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s