Recently, I saw a really cool study of the costs that companies incur when using public clouds in a "Big Data" context. The cost data is certainly interesting, but I was more intrigued by what the presentation had to say about cloud usage, particularly with respect to big data. For example, on slide 11, it looks like 89% of the monthly public cloud instance usage is at relatively small scale. That is, along with "Big Data" there must be a little code.
Managing data at scale is a challenge, to be sure. Latency, integrity, and privacy all become much more difficult to manage as data scales. Just determining when it is truly safe to delete a datum can be a chore.
However, if Big Data is "The Answer," then a little code is necessary to ask the question.
Herein lies the relationship between Big Data and Infrastructure as a Service (IaaS). We hear about users who run Big Data technologies (Hadoop, Cassandra, etc.) in Eucalyptus and often I'm asked to explain why. These technologies run perfectly well on their own (e.g. on "bare metal") and in a public cloud there is no alternative but to run them in virtual machines. However, in a private cloud setting, why would an administrator or user choose to use Eucalyptus (or any other private cloud) to host a big data technology?
I thought I'd try to illustrate the answer (which can be summarized as "because of a little code") with a little code. A few years ago, a colleague and I taught a course at UCSB in which we discussed steganography — loosely, the art of hiding one message inside another.
Many Big Data applications, such as web analytics, have a steganographic character: one uses them to search for a hidden "message" in a large corpus of data using a "decoding algorithm" that is often statistical in nature.
In other words, Big Data can be like a mechanized Easter egg hunt.
Moreover, while the data is large and may change from day to day (or moment to moment), the code is (by comparison) small and relatively static.
Analytics are not the only use for Big Data technologies, to be sure. For example, Adrian Cockcroft (@adrianco) visited Eucalyptus the other day and gave a great talk on NetflixOSS. Netflix content certainly qualifies as Big Data and yet Adrian spent the better part of 75 minutes talking about the code.
Often, though, there is a "needle in a haystack" type problem or a large-scale analytics problem that calls for a Big Data solution. As a scaled-down example, I've asked Marten Mickos (@martenmickos), our CEO, to provide a pearl of wisdom, on any subject he prefers, that he will not otherwise make public, and I've used a form of steganography to encode it in this image.
Marten is an expert on many subjects and his opinions are often much in demand. In short, I've asked him to provide an Easter egg.
The image, the source code, and the build scripts for extracting Marten's hidden wisdom are available to download. To extract the message on a Linux or OSX system, save the file and run the following commands:
tar -xzf sw.tgz ; make all ; ./extract-message
The code will produce a file called "martens-message.png" in that directory; when viewed, the image will show the decoded message.
Did You Try It? Of Course Not.
If you have read this far, chances are you didn't even attempt to run the three commands listed above. Most likely you are too busy, but even the curious among you are probably viewing this message on a non-computing device (smart phone, tablet, etc.) or on a computing device that runs Windows. While I've tested the procedure on both Linux and OSX, I don't have the development expertise to port it to Windows or to more modern platforms (my programming skill set, as that code makes evident, is rather dated).
Reason 1: Code Portability — This hurdle illustrates the first reason why IaaS serves an important function in a Big Data setting. Often, the Big Data technology is focused on — uh — Big Data and not on system portability. Not only is it difficult to make the functionality system-independent; ensuring performance (a key requirement for many Big Data applications) across platforms is an even more complex undertaking.
IaaS makes the software environment necessary to run the application part of the application itself.
The more complex the application, the greater the affinity (typically) between its functionality, its performance, and its software environment. IaaS allows the user to ensure that the application runs in the environment for which it is designed.
Did You Try It? Of Course You Did.
For the more intrepid who managed to get the software onto a Linux or OSX platform, did it work? Maybe. First, the build process uses both gcc and make, which on OSX requires Xcode. Both utilities are Linux staples, but some Linux distributions do not install them by default, so even on Linux the build may fail. That is,
Reason 2: Dependency Management — Big Data deployments can often have significant software dependencies. In this example, the code has both build and runtime dependencies (see below). Another function that IaaS serves in the Big Data context is to allow the user to ensure and manage any dependencies the application may have under programmatic control.
If you have made it past those hurdles, the code calls an open source library for manipulating PNG files called libpng. The current version of libpng is 1.5.X. The libpng project is brilliantly managed as an open source project. So well managed, in fact, that the 1.5.X series is maintained to be backwards compatible with the 1.4.X series — the sign of an active and vibrant user community.
However, I wrote the steganography code in 2006 using libpng version 1.2.50, and I don't know whether it is compatible with the latest version or the one before it. Ubuntu 10.04 (Lucid) includes libpng12-dev, which appears to work with my original code. Figuring out whether and how my old code could be updated to use this dependency, or whether the newer versions are backwards compatible enough, is where I stopped the coding exercise. I have a copy of the library I was using in 2006 and it works just fine with this application. Updating to a dependency that is seven years newer, for no additional functionality, is a burden I did not wish to incur. This laziness on my part leads to
Reason 3: Legacy Support — A third role that IaaS plays for Big Data is that of providing support for the code base as it becomes a legacy. While the data may change quickly, the code (except in very special circumstances) will enjoy a lifecycle that has a longer time horizon. IaaS allows the user to speed-match the aging of the data with the aging of the code in a Big Data setting.
The libpng project is really well maintained. They have actually archived version 1.2.50, the version I used, on their downloads page. If you are really stubborn and want to get this exercise to work, you'll probably need a copy of that library, and you'll need to edit the makefiles in the code I've posted so that they find the library in the appropriate place. Make sure, though, that you don't accidentally cause a conflict with a more modern version that is already installed.
Or you can just get a free account on the Eucalyptus Community Cloud, dump all of the code into an instance, and go. In other words,
Reason 4: Sandboxing — The last function that IaaS can play for Big Data is to allow for the sandboxing of incompatible code bases. The same Big Data assets may be useful in conjunction with very different code, each of which may have its own set of potentially conflicting dependencies. In this example, the potential conflict is caused by the dependence on a legacy version of libpng. These types of version conflicts become more frequent as the complexity of the code scales. Thus, an IaaS platform allows different code bases to co-exist without conflict so that they can be used against the same data assets.
Big Data and IaaS-style cloud computing are sometimes portrayed as being related and other times as being independent. My own view is that the software engineering and runtime support that IaaS makes possible under programmatic control (in addition to the scalable data management support) will enable Big Data applications and technologies to become commonplace. Big Data will be the answer, but we will all need a little code to ask the questions. After all, the Easter Bunny ultimately knows where the eggs are hidden.