Dumbo on Amazon EMR

A while ago, I received an email from Andrew in which he wrote:

Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so:

$ elastic-mapreduce --create --alive

SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo with the following two commands:

$ wget -O ez_setup.py http://bit.ly/ezsetup
$ sudo python ez_setup.py dumbo

Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command.

$ dumbo start ipcount.py -hadoop /home/hadoop \
-input s3://anhi-test-data/wordcount/input/ \
-output s3://anhi-test-data/output/dumbo/wc/

The -hadoop option is important. At this point I haven’t created an automatic Dumbo install script, so you’ll have to install Dumbo by hand each time you launch the cluster. Fortunately installation is easy.

There was a minor hiccup that required the Amazon guys to pull the AMI with Dumbo support, but it’s back now and they seem to be confident that Dumbo support is going to remain available from now on. They are also still planning to make things even easier by providing an automatic Dumbo installation script.

As an aside, it’s worth mentioning that a bug in Hadoop Streaming got fixed in the process of adding Dumbo support to EMR. I can’t wait to see what else the Amazon guys have up their sleeves.

5 Responses to Dumbo on Amazon EMR

  1. […] Dumbo on Amazon EMR « Dumbotics This entry was posted in Uncategorized. Bookmark the permalink. ← Does Technology Live Forever? Submit Your Dead Tech […]

  2. This code snipped is broken.
    $ wget http://bit.ly/ezsetup
    $ sudo python ez_setup.py dumbo

    This code snipped works.
    $ wget http://bit.ly/ezsetup
    $ sudo python ezsetup dumbo

    Also, it would be great if these instructions were also included in the github dumbo wiki. Thanks for writing such useful code!

  3. You can simplify the setup of dumbo when you start the EC2 instances by using a bootstrap actions script ( http://aws.typepad.com/aws/2010/04/new-elastic-mapreduce-feature-bootstrap-actions.html ).
    I’ve put the dumbo egg on S3. The bootstrap script copy and install it on each started instance. This process can be used to install other python packages that are need by your code.

  4. […] Dumbo on Amazon EMR « Dumbotics Amazon.com: DumboA list of products including, Dumbo Picture Book, Dumbo: My First Disney Story (Pictureboard), Walt Disney's Dumbo Pop-Up, Walt Disney's Dumbo and His New Act, Bambi … […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: