Tuesday, October 23, 2012

Talk in Dublin: Adventures in Functional Big Data

I had an opportunity to speak at the IE Group's Big Data Innovation Summit in Dublin last Friday. There was a good mix of people from several different industries - I think I learned more from the audience than they did from me. In any case, I thought I'd post my notes and slides here, in case someone finds them useful or interesting. These are my raw notes, however, so I apologize for any grammatical or spelling errors.

Adventures in Functional Big Data:

How data science drives non-technology companies to make some cutting-edge technology decisions.

Main idea: Functional programming (when combined with research experience and core business knowledge) can extract value from big data and resolve the technical challenges by turning the data into an interactive model.

The massive amounts of data available to today's businesses afford us many opportunities to optimize our business processes and reveal new ways to create value. But doing so presents a number of research and technology challenges. I want to tell a story to explain these problems, and to show how functional programming, research experience, and core business knowledge can be combined to resolve them and make one's business more effective.

Hi everyone. I'm Matthew. I'm the Senior Data Scientist with Universal Pictures International. I make machine-learning models to help our business run better.

I want to talk to you today about our adventures learning from big data. In particular, I want to tell you about the problems we have in getting the company on board with big data, and how I've used some unorthodox technologies (functional programming, e.g., Lisp and Racket) to bring business managers closer to the data.

To start off, I want to explain the relationship between "big data" and "the business", and then tell you a story to explain the problems these two have in getting along, and how functional programming helped bring them together.

"Big data" and "data science" are still new, vaguely defined terms. In fact, without some context of your business, "big data" and "data science" are just interesting academic ideas involving statistics and computing. It's important to keep in mind that it's the business domain that makes it come alive.

Now, if you're a technology company, like Google or Zynga, your "big data" likely comes from web and mobile statistics, A/B testing, et cetera. Your "data scientists" are combing through it, looking for usage patterns and trends to exploit, so you can improve your existing technology products or create new ones.

But not all of us are technology companies.

Our company isn't quite like that. We're not trying to make a better product, we're trying to learn which of our products the market wants and when the market wants them. We don't have web and mobile statistics. Instead, we have a variety of different data coming from hundreds of [dirty] sources. Once we have it all cleaned up, we're hoping it can tell us how to make better business decisions. I expect that some of your businesses are similar.

But there's a problem here! There isn't a lot of "experiential overlap" between the folks who know and run the business, and the folks [like me] who know and can interpret the data.

Indeed, the directors and senior execs at our company are, of course, intelligent people who know the business inside and out. But when it comes to the statistics and mathematics behind "big data", well, they think "the standard deviation" could be the title of an upcoming film.

Even so, they're aware that this data is out there, and that it holds useful nuggets of value. It's a vast dark forest, like Mirkwood from The Hobbit. It's too big, mysterious, and dangerous for them to venture into.

But they know that there are these wizards, these Data Scientists, who can help them navigate the Data Mirkwood and find the treasure. The challenge here is for the Data Science Wizards to use all their sorcery, Hadoop and machine learning algorithms and the like, to shed some light on the Data Mirkwood and present it in a way that makes sense to the business. We'll see soon how one extra bit of magic, functional programming, can help them do that job. But before that, how does this relationship usually play out?

Well, not like an adventure at all. It's a mess!

On the one side of the equation we've got these execs and directors who have been hearing about all this "big data" and "machine learning" stuff, and reading articles about how "data scientists" are "sexy". They're smelling ways to maximize value and want to put it to use.

Our best example of this is our Pointy-Haired Boss, let's call him Paul. Paul practically lives on data. But he doesn't go into the Data Mirkwood himself. He just asks for reports from it. And he's always asking for reports. He wants his Excel spreadsheets on this and his PDFs for that. There's a lot of redundancy, and none of these bits of data are very intelligent. He doesn't want to get too deep into the Data Mirkwood. He just wants something relevant that makes sense.

On the other hand, you've got the analysts trying to prepare all these reports. Just when they have a process for generating Excel spreadsheets on one thing, Paul the Pointy-Haired Boss asks for a PDF on another. They're busy jumping around like data bunnies in our big data, trying to clean up this data source and that data source to keep up with it all.

It's all fairly superficial stuff - the data bunnies are cleaning up and reporting pieces of the large set of data, but they don't want to see the whole thing. And they certainly can't see large scale patterns and trends that are emerging from it. Paul the Pointy-Haired Boss doesn't even ask; he doesn't know or care how Bayesian clustering algorithms would help him make better decisions, and the bunnies don't know how to do that anyway.

This is where our Data Scientist comes in. We'll call him Duncan.

Duncan the Data Scientist is this mystical wizard. He knows the Data Mirkwood well, he's got deep sorcery to make it come alive. He can Hadoop and MapReduce and Machine learn the data backwards and forwards. But he also knows the business, and can direct Paul the Pointy-Haired boss to the right questions.

Duncan the Data Scientist is tasked with guiding Paul the Pointy-Haired Boss through Data Mirkwood. Paul needs to know the trends and patterns hiding in Data Mirkwood, but he lacks the computational and statistical know-how to even ask the right questions. Paul knows the business, but Duncan knows the statistics.

What does Duncan need to be successful at this?

With a CV like this, I have to say that Duncan is sounding less like a scientist and more like a unicorn! In reality, Duncan has the maths, stats, and programming background to dig deeper into Data Mirkwood than do the data bunnies, but his knowledge of the business isn't nearly as deep as Paul's.

So how can Duncan get Paul into Data Mirkwood? How can he get Paul into the data in a way that he can understand?

Duncan needs to create a usable interface over the data for Paul to use. By "usable interface" I don't mean "data visualization". I mean something that Paul can interact with: a way for him to pose questions within the context of his business. Id est, Duncan needs to use the data to create a place where Paul can ask, "what does the data say about this aspect of my business?" Duncan needs to create something like a "business-domain-specific language" for querying the data.

At my company, the interface we built was a simulation. We cleaned up our big data, and we continually feed it through a variety of regression and classification algorithms. These algorithms then tune various parameters of our simulation, and enable us to ask questions like "what happens if this product flops?" or "when will there be room in the market for this product to survive?". The nature of your usable interface would be specific to your business, naturally, and would come from a collaboration between your Duncans and your Pauls.
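
To give a flavour of what I mean, here's a toy sketch in Racket. This is not our actual system; the function names and numbers are all made up, and in the real thing the parameters come out of our regression and classification models rather than being hard-coded. The point is the shape of the interface: a handful of business-level questions Paul can ask directly.

#lang racket
;; Toy sketch only: a pretend "what-if" interface over a pretend market model.
;; In a real system these parameters would be tuned by the learning algorithms.

(define baseline-weekly-sales 100000)   ; hypothetical units per week
(define flop-multiplier 0.35)           ; hypothetical effect of a flop

;; "What happens if this product flops?" -- projected weekly sales.
(define (what-if-flops? weekly-sales)
  (* flop-multiplier weekly-sales))

;; "Is there room in the market for this product?"
;; Pretend the market opens up once competing releases drop below a threshold.
(define (room-in-market? competing-releases)
  (< competing-releases 3))

(what-if-flops? baseline-weekly-sales)  ; => 35000.0
(room-in-market? 5)                     ; => #f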

But building such an interface is no walk in the park either. Say Paul and Duncan have a good working relationship. They understand each other, and Duncan is able to translate Paul's concerns into technical questions to ask about the data. Duncan still has to turn these questions into programming experiments, and then into a usable interface for Paul. He's got to do his "data science/machine learning" programming, and somehow feed that into his "application programming".

Typically, when one is programming with big data, the languages at one's disposal fall into a sort of grid:

On the horizontal axis we have expressive languages that let Duncan express his statistical and mathematical analysis of the data. But these don't scale well. R is fine for small data sets, but it chokes when you try to feed it petabytes of data.

On the vertical axis you have your high-powered technologies. These are the sorts of things you'd use for MapReduce over your Hadoop cluster. These are the languages you use to take Paul's questions and turn them into massively parallel computations that find patterns in the data.

There's a gap here: few languages can do both. Often, Duncan finds himself prototyping and experimenting with solutions on the horizontal axis, only to have to re-write them for the vertical one.

And this is where functional programming comes in. Functional programming languages like Lisp/Racket fill that void:

Functional languages, like Racket, Clojure, Haskell, Scala, et cetera, are incredibly expressive. They differ from other languages in that you don't write step-by-step instructions telling the computer what to do.

For instance, in a functional language, you don't do things like "set the variable x to 5. Add 3 to x and store it in y".

Rather, in functional languages, everything is an expression that evaluates to a value. You describe a program by writing down a mathematical transformation from your initial data to your desired result.
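
A toy illustration of the difference, in Racket (since that's the language I'll keep coming back to):

#lang racket
;; Imperative style: mutate a variable step by step.
(define x 5)
(set! x (+ x 3))          ; x is now 8

;; Functional style: no mutation, just expressions that evaluate to values.
(define (add-three n)
  (+ n 3))

(add-three 5)             ; => 8
(map add-three '(1 2 3))  ; => '(4 5 6)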

To do your data analysis, you'd describe the mathematics and statistics behind it, defining your input data and how the output should look.

To do Paul's interface, you'd describe the information you want from him, and how to turn that into an answer from the data, and return it to him.

Racket is a modern functional programming language, and it's incredibly expressive. It lets Duncan express his machine learning algorithms and his statistical analysis, but it's also powerful enough for him to launch parallel computations over a large cluster (e.g. Amazon EC2).

But functional programming itself, and languages like Racket, aren't actually that new. Even the holy grail of "big data", MapReduce, the programming model that makes so much of this analysis possible, is based on ideas from functional programming. The name comes from a common idiom in functional languages: map over a data set, then reduce it to something simpler.
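
Here's that idiom in miniature, in Racket, on a toy data set (made-up numbers, and obviously a single machine rather than a Hadoop cluster):

#lang racket
;; A toy "map" step followed by a "reduce" (fold) step.

(define ticket-sales '(12000 8500 43000 27000))        ; made-up admissions

;; map: transform every record (pretend each ticket is worth 9.50).
(define revenues
  (map (λ (tickets) (* tickets 9.50)) ticket-sales))

;; reduce: fold the mapped results down to a single summary value.
(define total-revenue (foldl + 0 revenues))

total-revenue   ; => 859750.0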

The benefit of using Racket is that Duncan can do everything in the same stack: he can experiment and find solutions in Racket, he can do his large-scale computations over the data in Racket, and he can build an interface in Racket. Paul gets access to Duncan's work more quickly.

Gone are the data bunnies. Paul is working much more closely inside Data Mirkwood. By using Racket, Duncan has built an interface on top of the data and given Paul access to much more sophisticated insight, and everyone's happy.

We use functional programming, especially Racket, on a day-to-day basis at my company. Our simulation and machine learning algorithms are running on it now. I'd be happy to answer any questions you may have about it.

Post-talk thoughts

The above talk is slightly incoherent, and doesn't do a good job of delivering on its promises. In particular, I could have done the following better:

  1. It doesn't explain how functional programming is useful for building these interactive models; I just state that it's better without evidence or examples. I could have talked more about our specific setup and highlighted where judicious use of Racket made our lives easier. But maybe that's best left for another talk or blog post.
  2. Additionally, the audience of the talk (and the panel discussion afterwards) were very interested in the talent shortage problem that I alluded to. Another gentleman, Jean-Christophe Desplat, Director at ICHEC, spoke about a solution in his talk: hire a team of talented people who can work together. I fully agree. That might make a good talk/blog, too.
  3. Finally, I should have discussed the analogy between a "usable interface on the data" and a "business-domain-specific language for querying the data". That's really the core bit of the talk, and I barely mentioned it!

Wednesday, October 3, 2012

Haskell Exchange at Skills Matter 10 Oct

I'm going to be at the Haskell eXchange at Skills Matter on 10 Oct. If you've been following my github (or my meetup groups), you'll know that I've been learning Haskell. Thus far, I haven't written anything to take advantage of its type system, but I'm already getting excited at the possibilities. I'm really looking forward to learning more and meeting folks at the one-day conference!

Wednesday, February 8, 2012

Metaclasses, first class objects, and a lesson from SICP.

This week my SICP Study Group is looking at functions as first-class objects. This means, among other things, that functions can be passed as arguments and returned as values of other functions. A few members of my group feel that the material covered is too theoretical and has little real-world value. I'm always a bit bemused by this, because I use ideas from SICP nearly every day, especially when it comes to higher-order functions. So I thought I'd use an example from my day job to show how this section of SICP is particularly useful.

Part of my job includes writing data-entry applications so that other team members can interact with my statistical models. I write these applications in Python, using a library called web.py. In this library, we can assign to each web request URI a class to handle it.

What's a Class?


It's something we haven't seen before in our study group! I don't want to explain it in detail (you can check Wikipedia), but I hope it'll suffice to say this:

A class can be thought of as a way to capture, isolate, and reuse state between a group of functions. "State" is another concept we haven't seen in my group! We've only written stateless, functional programs. "State" here means data that exists independently of the functions, but is sometimes used or modified by them as they perform their computations (exempli gratia, the balance of a bank account is state shared between the group of functions that act on that account).

For instance, say we have a web server handling requests from the internet. Each request needs to perform a series of computations related to that request, but the data for each request, id est, the state of the request, should be kept separate from the others. This is exactly why web.py uses classes for requests. We can define what data or state our group of functions needs to share, and we can make copies of that group (called "instances" of the class) that won't interfere with each other's state.

What's a Class look like?


Classes in Python have a name, some data members, and some functions. Classes to handle web requests with web.py generally look something like this:
class RequestHandler:
    request_var1 = "data..."
    request_var2 = "more data..."

    def GET(self):
        i = web.input()
        # code to handle the request...
        return html_output

web.py will look for a function called GET inside the class when the web request uses the GET method. Each time the application gets a web request, it creates a new instance of the class, id est, a fresh copy of the data and functions, and calls the new GET method with the new data.

In this specific example, the class is really just a special "wrapping" around the GET function so web.py can use it. It won't hurt the reader who is unfamiliar with classes to regard them as functions for the remainder of this post.

Now, classes that handle web requests are often a lot more complex than that. In my case, I have specific requests that should respond with data containing the current parameters of my statistical model, formatted for easy entry into a web application (exempli gratia, in JSON). I have about two dozen of these requests that ask for different aspects of the same object. Hence, if I had to write them all out, each one would look something like this:
class YetAnotherRequestHandler:
    def GET(self):
        # code to process incoming request data and check for errors
        # (5-10 lines, ***the same for every request***)
        ...
        # code that's specific to this request (2-3 lines)
        ...
        # code to transform the output into a JSON-ready format
        # (another 5-10 lines, ***the same for every request***)
        return json_text

I don't want to burden the reader with those 20 lines of code that are the same for each request (nor do I want to show my employer's source code or think up my own contrived example!), but I hope the point is clear: these two dozen request handlers share a significant amount of code. If I wrote them all out by hand, I'd have over 80% duplicated code. Boring. Hard to read. Messy to change all at once.

Higher-order functions...erm, Classes


What if, instead of writing two dozen classes that share the same code, we could somehow abstract out the similarities and only write the differences? Say, we could write a series of functions that do the specific-to-request bit:
def processor_for_request1(*args):
    # 2 lines of code here
    return data

def processor_for_request2(*args):
    # another 2 lines of code here
    return data

# and so on...

And pass those functions into some machine that will turn them into classes? Well, we can! Luckily, in Python, functions and classes are first-class citizens, so all the techniques in our section of SICP apply here. We can write a function that takes these processors as input and returns a class that uses them, like so:

def request_class_creator(processing_function):
    class Abstract_Request:
        def GET(self):
            # code to process incoming request data and check for errors
            # (5-10 lines, ***the same for every request***)
            data = processing_function(...)  # called with whatever this request needs
            # code to transform the output into a JSON-ready format
            # (another 5-10 lines, ***the same for every request***)
            return json_text
    return Abstract_Request

Request_Handler1 = request_class_creator(processor_for_request1)
Request_Handler2 = request_class_creator(processor_for_request2)
# ...and so on


In this way, we write the shared code only once and write out only the parts that differ; we can reuse the components that work, and we can easily see what the important parts of our program are. Additionally, we can easily write separate unit tests for each processing function and for the Abstract_Request handler. We obviously aren't covering unit testing in the SICP study group, but it's an important part of writing good software nevertheless.

While the language and the example are different, this is the exact same technique described in Sec 1.3 of SICP. In fact, it's where I learned it from. Python programmers will recognize it as a close cousin of the metaclass: a way of building classes programmatically. Python includes a few constructs that expand on how one might use this method, but I think it's safe to say that the basic ideas are well-explained in SICP.
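
For anyone in the study group who'd like to see the same shape of solution in the language we've been using, here's a rough Racket sketch. The helper names are hypothetical and this has nothing to do with web.py itself; it just shows a higher-order procedure that takes the request-specific processing function and hands back a complete handler.

#lang racket
;; The shared plumbing is written once; each handler is built by passing in
;; only the part that differs.

(define (make-handler processing-function)
  (λ (request)
    ;; shared "parse and validate the request" step (a stand-in here)
    (define input (string-trim request))
    ;; the request-specific bit, supplied by the caller
    (define data (processing-function input))
    ;; shared "format the output" step (a stand-in for real JSON encoding)
    (format "{\"result\": ~s}" data)))

(define handler-1 (make-handler string-upcase))
(define handler-2 (make-handler string-length))

(handler-1 "  hello  ")   ; => "{\"result\": \"HELLO\"}"
(handler-2 "  hello  ")   ; => "{\"result\": 5}"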

Sunday, February 5, 2012

Perceptrons in Lisp (A simple machine learning exercise)

So having missed Stanford's Machine Learning course (mostly out of laziness - I'm sure it was great) I'm trying to learn this stuff on my own. I'm going through MIT's Machine Learning notes on OpenCourseWare. They're easy [for me] to digest without being insulting, and they help me avoid searching for "The right book" to learn from (a task that would delay my learning anything but make me feel busy).

After reading the first two lectures I decided I should stop and practice what I'd learned by implementing a simple perceptron learning algorithm.

What's a Perceptron anyway?

It sounds like a Transformer. It's actually a method of deciding whether an object belongs to a certain class based on a linear combination of that object's characteristics. Id est, say we have some data on atmospheric conditions, exempli gratia humidity $H$, wind speed $W$, et cetera, and we want to classify those conditions by whether or not they will produce a storm. We would form a linear term:

$a_1 H + a_2 W + \cdots$

We want to choose the variables $a_i$ so that the above term is positive when we'll have a storm, and negative otherwise.

More generally, say we have a vector of characteristics $\textbf{x}$. We want to choose another vector $\boldsymbol\sigma$ so that the dot product $\textbf{x} \cdot \boldsymbol\sigma$ is positive when $\textbf{x}$ belongs to our class and negative otherwise. The perceptron is the function

$ f (\textbf{x}) = \text{sign}(\boldsymbol\sigma \cdot \textbf{x}) $

How do we find $\boldsymbol\sigma$?

Our learning algorithm will tell us how to choose that $\boldsymbol\sigma$. To do this, we need some training data. Id est, some vectors $\textbf{x}_1, \ldots, \textbf{x}_n$ along with labels $y_i$ where $y_i = 1$ if $\textbf{x}_i$ belongs to our class and $y_i = -1$ otherwise.

We're going to start with any ol' $\boldsymbol\sigma$, say just a vector with 1's in all positions. For $\textbf{x}_1$, we look at $f(\textbf{x}_1)$. If it equals $y_1$, we do nothing. If not, then we update $\boldsymbol\sigma$ with $\boldsymbol\sigma = \boldsymbol\sigma + y_1 \textbf{x}_1$. One can prove that this update pushes $\boldsymbol\sigma$ toward classifying $\textbf{x}_1$ correctly, and that (for data that really can be separated) repeating the rule eventually yields an $f$ that classifies all of the training data correctly. We repeat this for all of our training data, and then we have a perceptron that's ready to classify some samples!

Let's see it in practice.

This is a pretty simple algorithm and won't take us long to implement, so let's try it. For our implementation, we're going to look at 25x25 pixel black and white images. We'll train a perceptron to identify which images are of the letter A and which aren't.

I [rather foolishly] created my own training data. I used GIMP to create 25x25 pixel images and saved them as HTML tables. Each cell of a table corresponded to a pixel, and GIMP saved them so that each td tag included a BGCOLOR attribute. I then used sed to convert each HTML file into a list of colors, one per pixel. You can see the sed command on the github for this post.

I saved the resulting files with a ".img" extension, and it was pretty easy to write a Racket function to convert such a file into a vector:

(define (img->vector filename)
  ;; Read each line of the .img file into a list, then convert the hex color
  ;; strings into numbers and pack them into a vector. (hexstr->number is a
  ;; small helper in the full source on github.)
  (define filelist
    (with-input-from-file (string->path filename)
      (λ ()
        (define (iter file-list line)
          (if (eof-object? line)
              file-list
              (iter (append file-list (list line)) (read-line))))
        (iter '() (read-line)))))
  (list->vector (map hexstr->number filelist)))

The perceptron itself is not even interesting. It's just a function that takes the sign of a vector dot product:

(define (linear-classifier img)
  (sign (dot sigma img)))

And the dot product function is just an application of the built-in vector map function:

(define (dot v1 v2)
  (apply + (vector->list (vector-map * v1 v2))))

Of course, the real "machine learning" bit is in the learning algorithm. This, too, is just a straightforward translation of the description above into Lisp:

(define (train-perceptron img-name)
  ;; filename->label derives the true label (1 or -1) from the file name;
  ;; it's another small helper in the full source on github.
  (define label (filename->label img-name))
  (define vec (img->vector img-name))
  (cond ((= (linear-classifier vec) label)
         (display "good!\n"))
        (else
         (set! sigma (vector-add sigma (scalar-mult label vec)))
         (display "bad! updated sigma, though.\n"))))

Where vector-add and scalar-mult are more applications of the vector-map function in Racket's library:

(define (vector-add v1 v2)
  (vector-map + v1 v2))

(define (scalar-mult s v)
  (vector-map * v (build-vector (vector-length v) (λ (x) s))))

One final detail: I defined sigma as a 625-element vector (one entry per pixel of a 25x25 image), initialized to all 1's:

 (define sigma (build-vector 625 (λ (x) 1)))  

I had 25 training images for this perceptron, and it still wrongly identified some of my unseen images. I noticed that $\boldsymbol\sigma$ kept changing when I trained it repeatedly on the same data, and it didn't stabilize until around the tenth pass. In hindsight that's expected: the standard perceptron algorithm makes repeated passes over the training data until it stops making mistakes, so a single pass is rarely enough. Maybe it also needs more training data, or maybe the perceptron is just a poor fit for this task; I really don't know. In any case, after repeated training it improved slightly, but still produced a false negative. Still, we have our first simple supervised machine learning algorithm. Code for this, including training data, is on my github.

There's no particular reason I used Racket/Lisp for this: it would have been just as straightforward to write it in Ruby, Python, JavaScript, et cetera. I just happen to like writing in Racket.


Edit:


An intelligent commenter on Hacker News pointed out that there are some theoretical guarantees for the accuracy of your perceptron when it encounters new data. I probably should have covered this here, but because I'm out of time and eager to do other things, I'll just mention that one can prove the learning algorithm will always settle on a specific $\boldsymbol\sigma$ after a finite number of updates $k$ (provided the two classes really can be separated by a plane). Additionally, one can think of $\boldsymbol\sigma$ as a plane dividing points in space, where on one side of the plane are the points in our class, and on the other side are all the other points. Theoretically, there will always be a distance $d$ between the plane and the closest point in our class. $k$, of course, will depend on $d$; id est, the smaller your $d$, the harder the learning problem, and the more training you'll need. (To the best of my understanding!) The proof of all these things is pretty straightforward, so long as one is comfortable with basic geometry and vector-based arguments. No higher maths needed. In fact, it's covered in the second set of lecture notes!
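
For completeness (and again, to the best of my understanding), the bound goes roughly like this: if every training point satisfies $\|\textbf{x}_i\| \le R$, and some unit vector $\boldsymbol\sigma^*$ separates the classes with margin $d$, id est $y_i (\boldsymbol\sigma^* \cdot \textbf{x}_i) \ge d$ for all $i$, then the number of updates the algorithm makes satisfies

$ k \le \left( \frac{R}{d} \right)^2 $

which is exactly the "smaller $d$, harder problem" relationship described above.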

Wednesday, January 18, 2012

Don't Censor the Web.

Congress is considering two Orwellian-named laws, SOPA and PIPA, that are threatening free speech, internet security, and innovation.

This is a reminder to call your representatives in Congress and/or donate to the EFF today to help stop internet censorship.

Thursday, January 12, 2012

Reading group for Structure and Interpretation of Computer Programs.

I'm leading a study group in the computer science/software engineering classic: Structure and Interpretation of Computer Programs. The study group is part of the UCL Undergrad Maths Colloquium but students and professionals of any level are welcome to join. We start meeting on Thursday, 19 January at 7pm in London. You can find out more at the Study Group Page at the Colloquium's website.