F# vs C#. Fold and Aggregate
Suppose you need to write a script that finds n files, all named according to some pattern, say “c:\temp\my_file_x.txt”, where “x” is replaced by a number from a range, [1..30] for instance, reads the contents of these files and glues them together. Suppose also that the files are very small, so you can keep them in memory all at once. Also, it should be solved in one line (except for auxiliaries: defining variables, writing out the results).
One-line solutions exist both in F# and C#. Which one is prettier? I vote for F#.
Here is the C# code:
string templ = @"C:\temp\my_file_";
var content =
    Enumerable.Range(1, 30)
        .Aggregate(
            new List<string>(),
            (a, e) =>
            {
                a.AddRange(File.ReadAllLines(templ + e.ToString() + ".txt"));
                return a;
            });
File.WriteAllLines(templ + ".txt", content);
And here is the F# version (of just the relevant part):
let content =
    [1..30]
    |> List.fold
        (fun content i ->
            content @ (File.ReadAllLines(templ + i.ToString() + ".txt") |> Array.toList))
        []
You can accomplish almost anything with fold() and its C# Linq equivalent Aggregate().
So first we create a range, [1..30]. Note that although [1..30] and Enumerable.Range(1, 30) both generate the numbers from 1 to 30, their semantics are different: the F# range takes inclusive bounds, while the second argument of Enumerable.Range is a count. So [0..30] and Enumerable.Range(0, 30) generate different sequences: the latter yields the numbers 0..29.
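The difference between the two range constructs is easy to check in F# Interactive. The following is a minimal sketch; the value names are chosen only for illustration.

```fsharp
open System.Linq

// F# range expressions take inclusive bounds: first..last
let fsharpRange = [0..30]

// Enumerable.Range takes (start, count), not (first, last)
let linqRange = Enumerable.Range(0, 30) |> List.ofSeq

printfn "%d items, last = %d" fsharpRange.Length (List.last fsharpRange)   // 31 items, last = 30
printfn "%d items, last = %d" linqRange.Length (List.last linqRange)       // 30 items, last = 29

// To mirror [a..b] with LINQ, pass b - a + 1 as the count
let mirrored = Enumerable.Range(0, 30 - 0 + 1) |> List.ofSeq
```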
Then we fold the range of numbers into a list of lines by reading the files and gluing the results together. (We could have just kept appending the text rather than lines, but it is not all that important for this macro, and we want to make sure each new addition starts on a new line.)
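To try the fold without touching c:\temp, here is a small self-contained sketch. It assumes a scratch directory under the system temp path and only three files; the directory name and the file contents are made up for the example.

```fsharp
open System.IO

// Hypothetical scratch directory and file prefix, used only for this sketch
let dir = Path.Combine(Path.GetTempPath(), "fold_demo_" + Path.GetRandomFileName())
Directory.CreateDirectory dir |> ignore
let templ = Path.Combine(dir, "my_file_")

// Create three tiny input files: my_file_1.txt .. my_file_3.txt
for i in 1..3 do
    File.WriteAllLines(templ + string i + ".txt",
                       [| sprintf "line A of %d" i; sprintf "line B of %d" i |])

// Fold the range into one list of lines; the accumulator starts out empty
let content =
    [1..3]
    |> List.fold
        (fun acc i -> acc @ (File.ReadAllLines(templ + string i + ".txt") |> Array.toList))
        []

// Write the glued result next to the inputs
File.WriteAllLines(templ + ".txt", content)
```

Appending with @ copies the accumulator on every step, which is fine for a handful of small files. For larger inputs, List.collect would probably be the more natural choice.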
Data and Visualization.
As the three of us embarked on this new data-mining project, we were the data scientist, the manager and the developer, who knew nothing about visualizations. We didn’t even want to do any visuals at first.
Then someone stumbled across the New York Times Obama Budget visual and the wheels started spinning. Pretty soon we had something like this of our own, and then it snowballed into a real project with quite a few interactive charts and visuals, all d3 based.
While developing all this, I started to wonder: why are the right visuals so incredibly effective in presenting data? Exactly what do the bubbles have that the tables don’t? It is the same data after all. I called upon phenomenology as it was first presented in Logical Investigations by Edmund Husserl (because I haven’t made it any further in Husserlian literature yet) to help me understand what is happening.
Husserl and Data Intuition
The core idea of Logical Investigations is that meanings in the broadest sense of the word (either what I “mean” when I express a thought, or simply say: “This is blue”, “His name is Neal”), exist as a class in itself. Not quite like entities in the platonic heaven of Ideas, but they are a class of some kind of entities, “logical entities” to be exact, in a sense that, just like logical constructs they exist independently of human perception or imagination of any kind.
This seems rather far-fetched at first, after all, through the entire history of philosophy we seem to have always started from sensory perception as the stepping stone towards
When in a presentation I write: “Should yellow patent classes intersect with the green ones?” a person out of context with my project, one without knowledge of patent taxonomy of any kind, can nevertheless have a basic grasp of what I mean: obviously I have somehow separated groups of patents into larger groups, assigned colors to them, and now I want to know something about the properties of these groups. Again, the meaning does not seem to depend on perception or experience at all. In fact, most of the 1000 pages of Logical Investigations is spent combating those views. It is not as incredible as it sounds, though. Surely when I say “Paris is beautiful” or “Bed bugs are something you should never experience”, my listener, if she understands the English language, understands what I mean, even if she has never been to Paris, or, God forbid, been bitten by bed bugs. (In fact, when I had my first and I hope only encounter with them, it took me very little time to realize what was going on, even though nobody had warned me and I had never been bitten before that time).
According to Husserl our grasp of meaning is an act that has nothing to do with generating the meaning itself, and occurs when we direct ourselves towards the meaning. The word he uses is “intendieren”, to intend. Expression and understanding are “intentional” acts in the sense that out of the entire universe of meanings we direct ourselves (“intend”) to a particular one (or a particular cluster) and bring it into focus.
The question still remains: what is the role of perception, or even imagination, in all of this? After all, we do seem to think in pictures of sorts, and there is no denying it: I understand “Paris is beautiful” or “Bed bugs suck” on a very different level if I have been to Paris or had the misfortune to sleep in the wrong bed.
So, Husserl distinguishes two classes of acts: signifikativen (or signitiven) and intuitiven (erfüllenden). Signifying and intuitive (fulfilling). Signifying are all the acts where meaning is simply expressed, and intuitive are the acts where perception or imagination is used to “fill” the meaning with some content. When I say “Paris is beautiful”, or “This tree is green”, or “This is Neal”, my expressions are purely signitive, i.e. they just point in the direction of the meanings, “signify” them (from the root “sign”). If I show pictures of Paris (or rely on your imagination to picture Paris), point out of the window at the tree, introduce Neal, I am now “filling” these pure meanings with intuitive content. Now what I mean actually takes shape. I don’t gain any more understanding; what I gain is insight: internal-sight.
The distinction is important. While all of meaning is expressed in signifying acts, it does not come to a full grasp, until it is intuited, seen in the mind’s eye.
I think these concepts are illustrated par excellence in the field of data visualization. In Husserl’s terminology we may have called it “data intuition”, or “data fulfillment”, or even “data insight”. There is enough meaning in the data itself, especially once data scientists go to work on it and extract trends, make predictions, etc. However, there is no “intuition” in all that. And without this intuition, it so happens, you cannot have a meaningful conversation with your user who may be a layman in the area of statistics, machine learning, data mining: your ideas are empty. You need to “fill” them with pictures. Moving and interactive pictures – better still.
And so we arrive at the definition of “data visualization” (according to Kant it is lucky in philosophical discourse to ever arrive at a definition, in a blog entry it must be nearly impossible):
Data visualization is an act of creating/perceiving presentations of certain aspects signified by data in an intuitive way.
Visualizing Crime with d3: Intro
I figure a blog without pictures or conversations is just boring, so, here it is.
Lately, I have been dealing a lot with data visualization. This was a brand new area for me and while we do use F# for data extraction, all of the front end is done using d3, an amazing toolkit by Mike Bostock.
First of all, I owe the fact that my projects got off the ground to Mike and Jim Vallandigham. Jim taught me all I know about how to draw bubble “charts” and use d3 force layouts. His blog is invaluable for anyone making first steps in the area of data visualization. Code snippets I am going to post here are due to Jim’s and Mike’s generosity. So, thank you Mike and Jim.
One may ask, if there are already tutorials on how to put together visuals, why assault the world with more musings (as a Candace Bushnell character once wrote). The answer is: my goal in these posts is not to explore the creation of visuals, but rather to share experiences on how to put together projects that involve visualizations.
These are very different problems, since your task is not just to create a single document or web page for a single purpose, but to create something that can dynamically build these documents or pages, and maybe, within each such document provide different views of the same data. Questions of:
- design
- reuse
- coding practices
come up right away, not to mention general problems:
- What are data visualizations?
- What are they used for?
- Are they needed at all?
So, for these posts, we will be building a project that visualizes crime statistics in the US for the year 2008. The data source for this is found here and the complete solution will look like this.
The approximate plan for the next few posts:
- Thinking about visualizations and what they are
- Preparations
- Getting Data (retrieving, massaging, formatting)
- Getting the tools together (CoffeeScript, d3, ColorBrewer, Knockout.js, Twitter Bootstrap, jQuery, jQuery BBQ)
- Building the visuals
- Laying out “single” charts
- Laying out multiple charts on the same page
- A word about maps
- Lessons learned: architecting for reuse, etc
Disposable Objects with Computation Expressions
The last post contains the description of a sqlMonad. It also happens to contain a silly and obvious (aren’t they all in hindsight) bug. The bug is in having the containing CmdSqlBuilder implement IDisposable.
While the intent was good (the class wraps resources that should be promptly disposed of – SqlCommand and SqlConnection):
let connection = new SqlConnection(connectionString)
let cmd = new SqlCommand(name, connection)
there is no real opportunity to use it in this way, since the underlying object is statically created in advance and so cannot be used as disposable objects normally are!
The fix is to clean things up after each run like so:
member this.Run( m : CmdSqlMonad<'a>) =
    try
        m cmd
    finally
        dispose()
Here, after each run, the dispose() function should do its work, closing the connection and disposing of the command object. The disposable pattern should not be implemented at all, as applying it in this case makes no sense.
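The clean-up-after-Run idea can be tried without a database. The sketch below is only an illustration under stated assumptions: FakeCommand is a hypothetical stand-in for SqlCommand, and FakeBuilder is a stripped-down builder, not the CmdSqlBuilder from the post.

```fsharp
open System

// Hypothetical stand-in for SqlCommand, so the sketch runs without a database
type FakeCommand() =
    member val Disposed = false with get, set
    member this.Execute() =
        if this.Disposed then failwith "already disposed" else 42
    interface IDisposable with
        member this.Dispose() = this.Disposed <- true

type Step<'a> = FakeCommand -> 'a

type FakeBuilder() =
    let cmd = new FakeCommand()
    member this.Command = cmd
    member this.ReturnFrom(m : Step<'a>) = m
    member this.Delay(f : unit -> Step<'a>) = f ()
    // Run executes the workflow, then always releases the resource,
    // even if the workflow throws
    member this.Run(m : Step<'a>) =
        try
            m cmd
        finally
            (cmd :> IDisposable).Dispose()

let builder = FakeBuilder()
let result = builder { return! (fun (c : FakeCommand) -> c.Execute()) }

printfn "result = %d, disposed = %b" result builder.Command.Disposed
```

One consequence of this design is that a builder instance is good for a single run, which matches the way sqlCommand(connectionString, name) creates a fresh builder for every workflow.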
Exploring Monadic Landscape: Sql Command Computation Expression
Most developers have dealt with calling SQL Server stored procedures from their applications at least once or twice. In my last project, where intense data mining is done on the SQL side, this is basically all I am doing. There is always a desire to wrap and abstract the ever-repetitive code that gets the connection, builds an instance of the SqlCommand class, and reads in the returned dataset. And it never comes out quite as succinct as expected.
Again, this is a perfect situation for using computation expressions, as we can clearly see the workflow:
- Connect to the database
- Set command text
- Set command parameters (if necessary)
- Set other command options
- Execute the command of a necessary type
So at this point, it is easy to figure out how to write the builder for the command-oriented workflow.
Defining the Monadic Type
The gist of this workflow is that we take an instance of SqlCommand and run with it every step of our workflow. Hence, the step is defined like this:
type CmdSqlMonad<'a> = SqlCommand -> 'a
let sqlMonad<'a> (f : SqlCommand -> 'a) : CmdSqlMonad<'a> = f
(the sqlMonad function on the second line is defined for convenience and to guide the type system).
We can also define some auxiliary methods:
type sqlParams = (string * obj) []
let setParameters (sqlParameters : sqlParams) =
    sqlMonad(fun (cmd : SqlCommand) -> sqlParameters |> Seq.iter(fun (name, value) -> cmd.Parameters.AddWithValue(name, value) |> ignore))
let setType (tp : CommandType) = sqlMonad (fun cmd -> cmd.CommandType <- tp)
let execReader () =
    sqlMonad(fun cmd -> cmd.ExecuteReader())
let execNonQuery() =
    sqlMonad(fun cmd -> cmd.ExecuteNonQuery())
let execScalar() =
    sqlMonad (fun cmd -> cmd.ExecuteScalar())
let setTimeout(sec) = sqlMonad(fun cmd -> cmd.CommandTimeout <- sec)
Each of these, except for the three exec functions, is of the type CmdSqlMonad<unit>, as they simply set some properties on our SqlCommand object. This object is propagated all the way through the workflow by our Bind() function:
member this.Bind(c : CmdSqlMonad<'a>, f : 'a -> CmdSqlMonad<'b>) =
    sqlMonad(fun cmd ->
        let value = c cmd
        f value cmd)
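To see how Bind threads one shared object through every step, here is a minimal sketch that swaps SqlCommand for a StringBuilder so that it runs without a database. The names Step, append, length and StepBuilder are invented for the illustration; only the shape of Bind follows the definition above.

```fsharp
open System.Text

// Stand-in for SqlCommand -> 'a: every step receives the same StringBuilder
type Step<'a> = StringBuilder -> 'a

let append (s : string) : Step<unit> =
    fun (sb : StringBuilder) -> sb.Append(s) |> ignore

let length () : Step<int> =
    fun (sb : StringBuilder) -> sb.Length

type StepBuilder(shared : StringBuilder) =
    member this.Bind(c : Step<'a>, f : 'a -> Step<'b>) : Step<'b> =
        fun (sb : StringBuilder) ->
            let value = c sb   // run the first step against the shared object
            f value sb         // pass its result on, along with the same object
    member this.Return(x : 'a) : Step<'a> = fun _ -> x
    member this.ReturnFrom(m : Step<'a>) = m
    member this.Delay(f : unit -> Step<'a>) = f ()
    // Run supplies the shared object, so the workflow executes right away
    member this.Run(m : Step<'a>) = m shared

let sb = StringBuilder()
let steps = StepBuilder(sb)

let n =
    steps {
        do! append "select "
        do! append "1"
        return! length ()
    }

printfn "%s (%d chars)" (sb.ToString()) n   // select 1 (8 chars)
```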
We can start defining the builder now. This builder is parameterized. It takes the connection string and the command name (or any query for that matter):
type CmdSqlBuilder (connectionString, name) =
    do
        if String.IsNullOrWhiteSpace(connectionString) then invalidArg "connectionString" "connection string must be supplied"
    let connection = new SqlConnection(connectionString)
    let cmd = new SqlCommand(name, connection)
    do
        (retry {
            return connection.Open()
        }) defaultRetryParams
    let dispose () =
        cmd.Dispose()
    interface IDisposable with
        member this.Dispose () =
            dispose()
            GC.SuppressFinalize(this)
    override this.Finalize() = dispose()
(Note the use of “retry” computation expression).
The rest of the stuff is pretty standard:
member this.Return ( x : 'a) : CmdSqlMonad<'a> = fun cmd -> x
member this.Run( m : CmdSqlMonad<'a>) = m cmd
member this.Delay(f : unit -> CmdSqlMonad<'a>) = f()
member this.ReturnFrom(m : CmdSqlMonad<'a>) = m
We define the Run method to execute the workflow right away with the command that is created in the constructor.
Finally, to define the computation expression:
let sqlCommand(connectionString, name) = new CmdSqlBuilder(connectionString, name)
At this point, wrapping sprocs is easy:
let args : sqlParams = [|("@param1", val1 :> obj); ("@param2", val2 :> obj)|]
sqlCommand (connectionString, name) {
    do! setParameters(args)
    do! setTimeout(10 * 60)
    do! setType(CommandType.StoredProcedure)
    return! execNonQuery()
}
Or calling a function:
let args : sqlParams = [|("@param", value :> obj)|]
sqlCommand(connectionString, "select dbo.MyFunc(@param)") {
    do! setParameters args
    return! execScalar()
}
|> string
Or even a simple query:
let rd =
    sqlCommand(connectionString, "select * from someTable") {
        return! execReader()
    }
The code is concise and easy to understand.
Here is the complete source:
module CommandBuilder =
    open System
    open System.Data.SqlClient
    open System.Data

    type sqlParams = (string * obj) []
    type CmdSqlMonad<'a> = SqlCommand -> 'a
    let sqlMonad<'a> (f : SqlCommand -> 'a) : CmdSqlMonad<'a> = f

    let setParameters (sqlParameters : sqlParams) =
        sqlMonad(fun (cmd : SqlCommand) -> sqlParameters |> Seq.iter(fun (name, value) -> cmd.Parameters.AddWithValue(name, value) |> ignore))
    let setType (tp : CommandType) = sqlMonad (fun cmd -> cmd.CommandType <- tp)
    let execReader () =
        sqlMonad(fun cmd -> cmd.ExecuteReader())
    let execNonQuery() =
        sqlMonad(fun cmd -> cmd.ExecuteNonQuery())
    let execScalar() =
        sqlMonad (fun cmd -> cmd.ExecuteScalar())
    let command(text) = sqlMonad(fun cmd -> cmd.CommandText <- text)
    let setTimeout(sec) = sqlMonad(fun cmd -> cmd.CommandTimeout <- sec)

    type CmdSqlBuilder (connectionString, name) =
        do
            if String.IsNullOrWhiteSpace(connectionString) then invalidArg "connectionString" "connection string must be supplied"
        let connection = new SqlConnection(connectionString)
        let cmd = new SqlCommand(name, connection)
        do
            cmd.CommandTimeout <- 60 * 20
            (retry {
                return connection.Open()
            }) defaultRetryParams
        let dispose () =
            cmd.Dispose()
        interface IDisposable with
            member this.Dispose () =
                dispose()
                GC.SuppressFinalize(this)
        override this.Finalize() = dispose()
        member this.Command = cmd
        member this.Return ( x : 'a) : CmdSqlMonad<'a> = fun cmd -> x
        member this.Run( m : CmdSqlMonad<'a>) = m cmd
        member this.Delay(f : unit -> CmdSqlMonad<'a>) = f()
        member this.ReturnFrom(m : CmdSqlMonad<'a>) = m
        member this.Bind(c : CmdSqlMonad<'a>, f : 'a -> CmdSqlMonad<'b>) =
            sqlMonad(fun cmd ->
                let value = c cmd
                f value cmd)

    let sqlCommand(connection, name) = new CmdSqlBuilder(connection, name)
The Push Monad: Introduction
Chapter 5 of Friendly F# has a great practical explanation of F# computation expressions often called “monads” from their use in computer science and Haskell. The material in Chapter 5 of the book does a lot to demystify the concept, theoretical coverage of which is done well in this Wikipedia article.
Monads are an example of things best grasped by actually doing. So I set out to implement one in my project.
What Friendly F# discussion instills above all (confirmed by the Wiki article) is that a monad in functional programming is a great way to subsume some common side effects or patterns under an explicit syntax that serves several purposes:
- Unclutter the code
- Make the pattern visible thus improving readability, while at the same time
- Avoid the “action at a distance” anti-pattern, where things seem to happen magically but it is extremely hard to figure out what is actually responsible for the magic
In this particular case, the monadic pattern is implied by the Push language: all programs are 100% robust, i.e. all syntactically correct programs execute without throwing an exception and the state of the system is preserved. This means that every time something occurs that makes execution of an operation impossible, we need to “unwind” the system and return it to its state before execution had started. It would be nice to factor all of that out of the implementation so we can concentrate exclusively on semantics of the operations.
So, while implementing Push operations the following must be done:
- See if there are enough arguments on the stack(s). If there are too few, exit.
- Start executing the operation. If the operation cannot be completed, return everything to the stack(s) and exit. Otherwise:
- Push result to the appropriate stack.
For instance, here is an implementation of one of Push operations written without the use of monads:
[<PushOperation("%")>]
static member Mod() =
    match processArgs2 Float.Me.MyType with
    | [a1; a2] ->
        if a2.Raw<float>() = 0.
        then
            pushResult a1
            pushResult a2
        else
            let quot = Math.Floor(Math.Floor(a1.Raw<float>()) / Math.Floor(a2.Raw<float>()))
            let res = a1.Raw<float>() - quot * a2.Raw<float>()
            pushResult(Float(res))
    | _ -> ()
Here all the steps are recognizable:
- Pop two arguments from the FLOAT stack using processArgs2. If it returns anything but a list of two values, exit.
- Check if the second argument is 0. If so, return arguments back to the stack and exit, otherwise execute the operation.
- Push the result back to the FLOAT stack
Here is the monadic version:
[<PushOperation("%")>]
static member Mod() =
    let getMod stack =
        push {
            let! right = popOne stack
            let! left = popOne<float> stack
            if right <> 0. then
                let quot = Math.Floor(Math.Floor(left) / Math.Floor(right))
                return! result stack (left - quot * right)
        }
    getMod Float.Me.MyType
We no longer need to explicitly handle the pattern mentioned above. All the steps and branches are contained within our definition of the “push” monad, so no magic here. The reader of the code knows where to look for explanation of the side effects.
If there are fewer than two values on top of the FLOAT stack, execution will not go forward and any arguments already popped will be returned to the stack.
If the right argument is 0, “unwinding” of the state will also happen automatically without any need to handle this case explicitly.
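One possible way to get this automatic unwinding is sketched below. It is an assumption made for illustration, not the implementation used in the project: the stack is modelled as an immutable float list, a step returns None when it cannot proceed, and the hypothetical run helper keeps the original stack in that case. The popOne and result functions here are simplified stand-ins that take no stack argument.

```fsharp
// A step takes the current stack and returns None on failure,
// or Some (value, remaining stack) on success
type Step<'a> = float list -> ('a * float list) option

let popOne : Step<float> =
    function
    | x :: rest -> Some (x, rest)
    | [] -> None

let result (x : float) : Step<unit> =
    fun stack -> Some ((), x :: stack)

type PushBuilder() =
    member this.Bind(m : Step<'a>, f : 'a -> Step<'b>) : Step<'b> =
        fun stack ->
            match m stack with
            | Some (value, rest) -> f value rest
            | None -> None              // stop at the first step that fails
    member this.ReturnFrom(m : Step<'a>) = m
    // An "if" without "else" that produces nothing counts as a failed operation
    member this.Zero() : Step<unit> = fun _ -> None

let push = PushBuilder()

// Run a step; on failure the original stack is returned unchanged,
// which is all the "unwinding" an immutable stack needs
let run (m : Step<unit>) (stack : float list) =
    match m stack with
    | Some ((), newStack) -> newStack
    | None -> stack

let modulo : Step<unit> =
    push {
        let! right = popOne
        let! left = popOne
        if right <> 0. then
            let quot = floor (floor left / floor right)
            return! result (left - quot * right)
    }

printfn "%A" (run modulo [3.; 7.])   // [1.0]
printfn "%A" (run modulo [0.; 7.])   // [0.0; 7.0]
printfn "%A" (run modulo [5.])       // [5.0]
```

Because nothing is mutated, there is no explicit "push the arguments back" step: a failed operation simply discards its intermediate stacks.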
One other convenience: we can now factor out extracting the value from an object we get from the top of a stack (by calling its Raw<'a>() function). This is done inside the implementation of the monad and surfaced through the compiler sugar of the “let!” binding. A great improvement in maintainability and ease of implementation.
“Under the hood” details to be discussed in the next post.
