Marching on from the last post.

### Lazy Sequences

This is my favorite feature ever. If I want to generate just a few of 10! (nobody even knows how much that is) permutations, I could:

(take 10 (permute [1 2 3 4 5 6 7 8 9 10]))

provided, the function is defined (as described in the first post):

(defn permute [v] (when-let [[pos2 pos1] (findStartingPos v)] (let [nxt (sort-remainder (swapDigits v pos2 pos1) (inc pos1))] (cons nxt (lazy-seq (permute nxt))))))

Here I am not sure which language I like more. Clojure has easier syntax: everything fits nicely within the recursive function call. Returning nil terminates the loop, while in F# you need to know to return an option type where None terminates iteration. On the other hand, I like the fact that everything is neatly wrapped in the “unfold” function: seems more “natural” to me: fold/unfold – there is a certain symmetry here. Also, everything exists nicely in this LINQ-like world…

let permute (v : 'a array when 'a: comparison) = Seq.unfold (fun prev -> match findStartingPos prev with | None -> None | Some (cur, pos) -> Some(prev, sortRemainder (swapPositions prev cur pos) (pos + 1))) v

### Weak Typing

I really like Clojure weak typing. And I like the F# strong type system:

let sortRemainder (v : 'a array) pos = if v.Length - 1 = pos then v else [| yield! v.[0..pos - 1] yield! Array.sort v.[pos..v.Length - 1]; |]

F# type system requires that the first argument be qualified, but it is happy with this abbreviation, while the full qualification should be:

let sortRemainder (v : 'a array when 'a: comparison) pos =

Since we are sorting a subvector, the array has to be of a “comparable” type. Which is the condition of the applicability of the algorithm.

In Clojure it looks simpler, but it’s essentially the same:

(defn sort-remainder [v pos1] (if (= (dec (count v)) pos1) v (into (subvec v 0 pos1) (sort (subvec v pos1)))))

### Tail Recursion

One more cool feature of functional languages. I think it’s another tie once you use it, although the “loop” construct that demands it is very nice.

The following function returns a tuple (current, found) of two positions within the array: one of the element that is being “promoted” up (current), and the other – of the smaller element being pushed back. (So, current > found && v[current] < v[found]). Or nil/None if no such pair can be found. This is the key function of the algorithm:

(defn findStartingPos [v] (loop [cur (dec (count v)) acc [-1 -1]] (let [maxPos (second acc)] (if (or (< cur maxPos) (< cur 0)) (if (= maxPos -1) nil acc) (if-let [pos (findFirstLessThan v cur)] (recur (dec cur) (if (< maxPos pos) [cur pos] acc)) (recur (dec cur) acc))))))

And F#:

let findStartingPos v = let rec findStartingPosRec cur acc = let maxPos = snd acc if cur < 0 || cur < maxPos then if maxPos < 0 then None else Some acc else let pos = findFirstLessThan v cur match pos with | Some pos -> findStartingPosRec (cur - 1) (if maxPos < pos then (cur, pos) else acc) | None -> findStartingPosRec (cur - 1) acc findStartingPosRec (v.Length - 1) (-1, -1)

It’s nice that we have a “loop” keyword in Clojure to provide cleaner syntax and more discipline for defining tail-recursive functions, but I am not appalled with the way we do it in F# either.

(The above functions contain obvious optimizations: we stop scanning once we have a pair of “swappable” elements and we have moved beyond the “found” position. Also, we discard a valid pair if we already have a pair where “found” position is larger than the “found” position of the current iteration).

## Doing it in a Massively Parallel Way

Of course, I like everything parallel… So what about doing it on a GPU, using, say CUDA? It is definitely possible, although probably not very practical. Even if we only have an array of 10 distinct elements, the number of permutations is already ridiculously large (although who knows what we are going to be using them for)… In any event, this is solvable if we can get “random access” to permutations. Instead of unfolding them as a lazy sequence, generate them all at once in a massively parallel fashion.

This is possible because permutations are neatly tied to factoradic numbers, as this Wikipedia article explains. So, it is always possible to generate “permutation #10” to be guaranteed different from “permutation #5” for distinct, fully ordered sets. (Any sets where ordering relationship is not defined can still be easily permuted as long as its elements are stored in indexed data structures, such as arrays, by simply generating permutations of indices). Thus, taking CUDA “single data multiple threads” computation model it is easy to generate all (or however many) permutations in parallel. Naturally, if we are not just outputting the results but need to store them, the exponential nature of the problem memory growth, as well as the number of threads required, and the limited amount of GPU memory (a single computer RAM for that matter) will quickly become a problem. I guess the CUDA C++ version of this will have to wait until the next job interview…

Why only C++? There’s also: http://fsharp.org/use/gpu/

Of all the options mentioned there Alea.cuDbase is the only viable one (statistics package doesn’t apply, the rest are either abandoned or not there yet). The main problem with Alea is that it invokes Nvidia compiler at run-time, and since that compiler is very slow, the application takes a relatively large hit negating many GPU advantages. Also, a necessity to shovel memory across managed/native boundary adds an extra overhead to moving it to the GPU. And third, this option makes it impossible to use libraries like Thrust, which is great at wrapping GPU functionality and of which do not have anything analogous in .NET. So basically not quite there yet…