You’ve deployed and set up a private cloud platform, but now what? You need an application!
I’ve been experimenting with a number of technologies to generate workloads and give some demos to
prospective Eucalyptus customers. A NoSQL database seems like a great use-case to demo as the technology benefits from
being designed for scale-out workloads and this happens to be exactly what an IaaS Cloud does best.
There is an abundance of NoSQL implementations (Cassandra, MongoDB, Couchbase, Neo4j…), written in different programming
languages and with slightly different takes on which two guarantees of the CAP theorem
they choose to prioritise and which method they use to store and display data.
For this post I’m going to be using MongoDB, which is in the “CP” camp: it favours Consistency and Partition Tolerance
over Availability (not every request is guaranteed a response), although in practice MongoDB still provides good availability.
MongoDB is supported by 10gen, seems fairly mature and has a large community of users with
modules for a ton of different programming languages. Cassandra also interests me and I’ll tackle that in a later post.
We also need a bunch of data and whilst there are large datasets available on the internet, last week I read a
post on using the Twitter streaming API
with Ruby and storing that data in MongoDB and thought it would be cool to use it, albeit with Python instead of Ruby.
Creating an ssh keypair and application security group
To start, let’s set up a keypair and security group for MongoDB so that we can ensure it is not going to be accessed by anyone else:
Ensure we have our Eucalyptus or Amazon credentials in the environment
Create an ssh keypair
euca-add-keypair mongodb > ~/mongodb.key
chmod 400 ~/mongodb.key
Add SSH, MongoDB and MongoDB admin interface ports to mongodb security group
euca-create-group mongodb -d "MongoDB databases"
Replace 0.0.0.0/0 with your IP e.g. 126.96.36.199/32 to restrict it to just your system
euca-authorize -P tcp -p 22 -s 0.0.0.0/0 mongodb
euca-authorize -P tcp -p 27017 -s 0.0.0.0/0 mongodb
euca-authorize -P tcp -p 28017 -s 0.0.0.0/0 mongodb
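To avoid leaving the ports open to 0.0.0.0/0 by accident, the three rules above can be generated for a specific source address. This is only a minimal Python 3 sketch: `authorize_commands` is a hypothetical helper (not part of euca2ools), 203.0.113.7 is a documentation-range placeholder, and the function only prints the commands rather than running them.

```python
import ipaddress


def authorize_commands(group, ports, source_cidr):
    """Build one euca-authorize command per TCP port for the given source CIDR."""
    # Validate the CIDR first; a malformed value raises ValueError
    network = ipaddress.ip_network(source_cidr, strict=True)
    return [
        "euca-authorize -P tcp -p {} -s {} {}".format(port, network, group)
        for port in ports
    ]


if __name__ == "__main__":
    # Placeholder address: substitute your own public IP with a /32 suffix
    for command in authorize_commands("mongodb", [22, 27017, 28017], "203.0.113.7/32"):
        print(command)
```

The printed lines can be pasted into a shell that has the Eucalyptus or Amazon credentials loaded.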
Run an instance
We can now spin up an instance running Ubuntu 12.04 LTS x86_64 and install MongoDB on our private cloud:
euca-run-instances -k mongodb -g mongodb -t c1.xlarge emi-87F63CE5
If you are using AWS or your own cloud you’ll need to substitute the EMI ID I’ve used with
an Ubuntu AMI or your own image ID.
You will also need to use your own keypair.
After a few moments our instance should show as ‘running’:
RESERVATION r-AB3F4645 985725263417 mongodb
INSTANCE i-D89D40E2 emi-87F63CE5 188.8.131.52 184.108.40.206 running mongodb 0 c1.xlarge 2013-02-03T22:40:26.743Z cluster1 eki-222540D6 eri-A5753DBE monitoring-disabled 220.127.116.11 18.104.22.168 instance-store
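If you want to script the wait for the ‘running’ state, an INSTANCE line like the one above can be split into fields. This is a rough Python 3 sketch: `parse_instance_line` is a hypothetical helper, and the field positions are an assumption taken from the sample output shown here, so check them against what your own tools print.

```python
def parse_instance_line(line):
    """Extract a few fields from an INSTANCE line of instance-listing output."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "INSTANCE":
        raise ValueError("not an INSTANCE line: %r" % line)
    # Positions below are assumed from the sample output in this post
    return {
        "instance_id": fields[1],
        "image_id": fields[2],
        "public_ip": fields[3],
        "state": fields[5],
        "keypair": fields[6],
        "instance_type": fields[8],
    }


sample = (
    "INSTANCE i-D89D40E2 emi-87F63CE5 188.8.131.52 184.108.40.206 "
    "running mongodb 0 c1.xlarge 2013-02-03T22:40:26.743Z cluster1"
)
info = parse_instance_line(sample)
print(info["instance_id"], info["state"])
```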
Let’s connect to the instance and install MongoDB:
The MongoDB documentation
goes into the installation of MongoDB in more detail.
Ubuntu 12.04 LTS has version 2.0.4 of MongoDB in its repositories, but 2.2.3 is the current stable version
upstream, so we’ll use the repository from 10gen to install the latest package.
ssh -i mongodb.key email@example.com #replace 22.214.171.124 with your instance IP!
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen"| sudo tee -a /etc/apt/sources.list.d/10gen.list
sudo apt-get update
sudo apt-get install -y mongodb-10gen
At this point we have a running instance with MongoDB installed and started.
You should be able to navigate to the MongoDB admin interface in your web browser, on port 28017 of the instance’s public IP.
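If the page does not load, it is worth checking whether the ports opened in the security group are actually reachable. A minimal Python 3 sketch using only the standard library follows; `port_open` is a hypothetical helper and `localhost` is a placeholder for your instance’s public IP.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    host = "localhost"  # placeholder: use your instance's public IP
    for port in (27017, 28017):
        print(port, "open" if port_open(host, port) else "closed")
```

A closed port here usually points at either the security group rules or the mongod service not running, which narrows down where to look.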
Now that we have MongoDB running, we need to import some Twitter data. Twitter has a streaming API
that is publicly accessible (as long as you have a Twitter account!) and there are a number of
modules for the programming language of your choice.
Tweetstream isn’t packaged for Ubuntu, so I’ll use the source:
sudo apt-get install -y python-setuptools
wget -c http://pypi.python.org/packages/source/t/tweetstream/tweetstream-1.1.1.tar.gz
tar -zxvf tweetstream-1.1.1.tar.gz
cd tweetstream-1.1.1 && sudo python setup.py install
PyMongo is the official MongoDB Python driver and is available
from the Ubuntu archive.
sudo apt-get install -y python-pymongo
Writing a python script to save tweets into MongoDB
The following script is based on some of those examples. It connects to MongoDB and stores tweets in the ‘tweets’ collection of a database called ‘twitterstream’.
It stores the whole tweet, which includes a lot of metadata; this metadata might be useful later to sort tweets or to index the particular
fields we are interested in querying. It’s important to note that the streaming API does not give us all tweets on Twitter: it’s merely a small
percentage, as the “Firehose” API that contains all tweets is not public.
``` python tweet2mongo.py
import pymongo
import tweetstream

username = "TWITTER_USERNAME"
password = "TWITTER_PASSWORD"
mongohost = "localhost"

# Connect to MongoDB and select the 'twitterstream' database
connection = pymongo.Connection(mongohost, 27017)
db = connection.twitterstream

with tweetstream.TweetStream(username, password) as stream:
    for tweet in stream:
        try:
            # Save the whole tweet but only show certain fields on screen
            db.tweets.save(tweet)
            print tweet['created_at'], tweet['id'], "Username: ", tweet['user']['screen_name'], ':', tweet['text'].encode('utf-8')
        except Exception:
            # Skip stream messages that are missing these fields
            pass
```
If you run this, you should see a stream of tweets printed out and the whole tweets stored within MongoDB:
Use the mongo shell to see if there are entries in the database:
MongoDB shell version: 2.2.3
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type “help”.
For more comprehensive documentation, see
Questions? Try the support group
switched to db twitterstream
The final command should output a portion of the tweets in the JSON document format that MongoDB uses to display query results.
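Because each stored document is the whole tweet, it can help to reduce it to the handful of fields the script prints. The sketch below is Python 3 (the script itself is Python 2) and works on plain dictionaries, so it needs no database. `summarise_tweet` is a hypothetical helper, and the sample documents are hand-written for illustration, not real Twitter data.

```python
def summarise_tweet(tweet):
    """Return the fields the script prints, or None for messages that are not tweets."""
    try:
        return {
            "created_at": tweet["created_at"],
            "id": tweet["id"],
            "screen_name": tweet["user"]["screen_name"],
            "text": tweet["text"],
        }
    except (KeyError, TypeError):
        # Some stream messages (for example delete notices) lack these fields
        return None


# Hand-written sample documents, not real Twitter data
sample_tweet = {
    "created_at": "Sun Feb 03 22:40:26 +0000 2013",
    "id": 1,
    "user": {"screen_name": "example_user", "followers_count": 10},
    "text": "Streaming tweets into MongoDB",
    "retweet_count": 0,
}
sample_notice = {"delete": {"status": {"id": 1}}}

print(summarise_tweet(sample_tweet))
print(summarise_tweet(sample_notice))
```

Returning None for incomplete messages makes the skipped cases explicit, instead of silently swallowing every exception in the loop.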
That’s it, we’re now streaming tweets into MongoDB via Python tweetstream!
In part 2, I’ll investigate scaling out the MongoDB database by spinning up new Eucalyptus instances and configuring replication and sharding.