Adventures in NoSQL, part 1

By Tom Ellis | February 04, 2013

You’ve deployed and set up a private cloud platform, but now what? You need an application!

I’ve been experimenting with a number of technologies to generate workloads and give some demos to
prospective Eucalyptus customers. A NoSQL database seems like a great use case to demo: the technology is
designed for scale-out workloads, which happens to be exactly what an IaaS cloud does best.

There is an abundance of NoSQL implementations (Cassandra, MongoDB, Couchbase, Neo4j…), written in different programming
languages and with slightly different takes on which two of the three CAP theorem guarantees (Consistency, Availability, Partition tolerance)
they choose to provide and which method they use to store and display data.

For this post I’m going to be using MongoDB, which is in the “CP” camp: it handles Consistency and Partition Tolerance
whilst forgoing Availability (not every request is guaranteed a response), although MongoDB still provides some great availability.

MongoDB is supported by 10gen, seems fairly mature and has a large community of users with
modules for a ton of different programming languages. Cassandra also interests me and I’ll tackle that in a later post.

We also need a bunch of data. Whilst there are large datasets available on the internet, last week I read a
post on using the Twitter streaming API
with Ruby and storing that data in MongoDB, and thought it would be cool to try the same thing, albeit with Python instead of Ruby.

Creating an ssh keypair and application security group

To start, let’s set up a keypair and security group for MongoDB so that nobody else can access it:

Ensure we have our Eucalyptus or Amazon credentials in the environment

source ~/eucarc

Create an ssh keypair

euca-add-keypair mongodb > ~/mongodb.key
chmod 400 ~/mongodb.key

Add SSH, MongoDB and MongoDB admin interface ports to mongodb security group

euca-create-group mongodb -d "MongoDB databases"

Replace with your IP e.g. to restrict it to just your system

euca-authorize -P tcp -p 22 -s mongodb
euca-authorize -P tcp -p 27017 -s mongodb
euca-authorize -P tcp -p 28017 -s mongodb

Run an instance

We can now spin up an instance running Ubuntu 12.04 LTS x86_64 and install MongoDB on our private cloud:

euca-run-instances -k mongodb -g mongodb -t c1.xlarge emi-87F63CE5

If you are using AWS or your own cloud you’ll need to substitute the EMI ID I’ve used with
the AMI ID of an Ubuntu image or your own image ID.
You will also need to use your own keypair.

After a few moments our instance should show as ‘running’:
$ euca-describe-instances
RESERVATION r-AB3F4645 985725263417 mongodb
INSTANCE i-D89D40E2 emi-87F63CE5 running mongodb 0 c1.xlarge 2013-02-03T22:40:26.743Z cluster1 eki-222540D6 eri-A5753DBE monitoring-disabled instance-store
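
Rather than re-running the command by hand, the state can be picked out of that output programmatically. The following is a minimal sketch and not part of the original walkthrough: `instance_states` is a hypothetical helper name, and because the number of columns can differ between clouds it looks for a known state word on each INSTANCE line instead of assuming a fixed column position.

```python
# Hypothetical helper: extract instance IDs and states from the text
# printed by `euca-describe-instances` (column positions are not assumed).
KNOWN_STATES = {"pending", "running", "shutting-down",
                "terminated", "stopping", "stopped"}


def instance_states(output):
    """Return a dict mapping each instance ID to its state (or None)."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "INSTANCE":
            continue
        instance_id = fields[1]
        # Take the first later field that is a recognised state word, if any
        state = next((f for f in fields[2:] if f in KNOWN_STATES), None)
        states[instance_id] = state
    return states


# The output shown above, pasted in as sample input
sample = """RESERVATION r-AB3F4645 985725263417 mongodb
INSTANCE i-D89D40E2 emi-87F63CE5 running mongodb 0 c1.xlarge 2013-02-03T22:40:26.743Z cluster1 eki-222540D6 eri-A5753DBE monitoring-disabled instance-store"""

print(instance_states(sample))
```

In practice the text could come from `subprocess.check_output(["euca-describe-instances"])`, decoded to a string on Python 3, and be polled until the instance reports ‘running’.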

Install MongoDB

Let’s connect to the instance and install MongoDB:

The MongoDB documentation
goes into the installation of MongoDB in more detail.

Ubuntu 12.04 LTS has version 2.0.4 of MongoDB in its repositories. 2.2.3 is the current stable version
upstream, so we’ll use the repository from 10gen to install the latest package.
ssh -i mongodb.key ubuntu@ #replace with your instance IP!
sudo apt-key adv --keyserver --recv 7F0CEB10
echo "deb dist 10gen"| sudo tee -a /etc/apt/sources.list.d/10gen.list
sudo apt-get update
sudo apt-get install -y mongodb-10gen

At this point we have an instance with MongoDB installed and running.
You should be able to navigate to the MongoDB admin interface, on port 28017 of your instance, in your web browser.
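
If the admin interface does not load, it helps to check whether the ports opened in the security group are reachable from your machine. This is a small sketch using only the Python standard library; the function names `port_open` and `check_mongodb_ports` are made up for illustration, and the port list simply mirrors the three ports authorised earlier (22, 27017 and 28017).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers refused connections, timeouts and name resolution failures
        return False
    sock.close()
    return True


def check_mongodb_ports(host, ports=(22, 27017, 28017)):
    """Map each port from the security group to whether it is reachable."""
    return {port: port_open(host, port) for port in ports}


# Example usage (replace the placeholder with your instance IP):
#   for port, ok in sorted(check_mongodb_ports("INSTANCE_IP").items()):
#       print("%d %s" % (port, "open" if ok else "unreachable"))
```

A port that shows as unreachable points at either the security group rules or the service on the instance, which narrows down where to look next.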

Installing Tweetstream

Now that we have MongoDB running, we need to import some Twitter data. Twitter has a streaming API
that is publicly accessible (as long as you have a Twitter account!) and there are a number of
modules for the programming language of your choice.

Tweetstream is a Python module that provides easy access to the streaming API and we can use it
in combination with pymongo to store tweets into MongoDB.

Tweetstream isn’t packaged for Ubuntu, so I’ll use the source:
sudo apt-get install -y python-setuptools
wget -c
tar -zxvf tweetstream-1.1.1.tar.gz
cd tweetstream-1.1.1 && sudo python setup.py install

Installing pyMongo

pyMongo is the official MongoDB Python driver and is available
from the Ubuntu archive.
sudo apt-get install -y python-pymongo

Writing a Python script to save tweets into MongoDB

There are a number of articles detailing how to do this via curl or tweetstream, and it’s surprisingly simple to do.

The following script is based on some of those examples. It connects to MongoDB and stores tweets in a collection called ‘twitterstream’.
It stores the whole tweet, which includes a lot of metadata; that metadata might be useful later for sorting tweets or indexing the particular
fields we are interested in querying. It’s important to note that the streaming API does not give us all tweets on Twitter: it provides only a small
percentage, as the “Firehose” API that contains all tweets is not public.

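``` python
import tweetstream
import pymongo

# Twitter account credentials for the streaming API (placeholders, use your own)
username = "your_twitter_username"
password = "your_twitter_password"

mongohost = "localhost"

# Connect to MongoDB and select the 'twitterstream' database
connection = pymongo.Connection(mongohost, 27017)
db = connection.twitterstream

with tweetstream.TweetStream(username, password) as stream:
    for tweet in stream:
        # The stream also carries non-tweet messages (e.g. delete notices); skip them
        if 'text' not in tweet:
            continue
        # Save the whole tweet in the 'twitterstream' collection
        # but only show certain fields on screen
        db.twitterstream.insert(tweet)
        print tweet['created_at'], tweet['id'], "Username: ", tweet['user']['screen_name'], ':', tweet['text'].encode('utf-8')
```
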

If we run this, we should see a stream of tweets printed out, with the whole tweets stored in MongoDB.

Use the mongo shell to see if there are entries in the database:
``` bash
$ mongo
MongoDB shell version: 2.2.3
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see

Questions? Try the support group

> show dbs
admin (empty)
local (empty)
twitterstream 0.203125GB
> use twitterstream
switched to db twitterstream
> show collections
```

The final command should output a portion of the tweets in the JSON document format in which MongoDB displays query results.
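
Once tweets are stored, the metadata mentioned earlier becomes useful for simple analysis. As a rough sketch, the function below tallies the most active users from tweet-shaped dictionaries. The name `top_users` and the sample documents are invented for illustration, and plain dictionaries stand in for stored tweets so that the example runs without a database.

```python
from collections import Counter


def top_users(tweets, n=3):
    """Return the n most frequent screen names among tweet-shaped dicts."""
    counts = Counter(
        tweet["user"]["screen_name"]
        for tweet in tweets
        if "user" in tweet  # skip documents without user metadata
    )
    return counts.most_common(n)


# Invented sample documents standing in for stored tweets
sample_tweets = [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice"}},
    {"id": 2, "text": "mongo!", "user": {"screen_name": "bob"}},
    {"id": 3, "text": "again", "user": {"screen_name": "alice"}},
]

print(top_users(sample_tweets))
```

With a live connection, the same function could be fed the cursor returned by pymongo’s `find()` on the collection holding the tweets, since a cursor yields one dictionary per document.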

That’s it, we’re now streaming tweets into MongoDB via Python tweetstream!

In part 2, I’ll investigate scaling out the MongoDB database by spinning up new Eucalyptus instances and configuring replication and sharding.
