Nitin Madnani gave a talk at PyCon this weekend about how Dumbo and Amazon EC2 allowed him to process very large text corpora using the machinery provided by NLTK. Unfortunately, I wasn’t there, but I heard that his talk was very well received, and his slides are definitely pretty awesome.
A while ago, I received an email from Andrew in which he wrote:
Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client like so:
$ elastic-mapreduce --create --alive
SSH into the cluster as user hadoop using your EC2 keypair, and install Dumbo with the following two commands:
$ wget -O ez_setup.py http://bit.ly/ezsetup
$ sudo python ez_setup.py dumbo
Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command:
$ dumbo start ipcount.py -hadoop /home/hadoop \
-input s3://anhi-test-data/wordcount/input/ \
Note that the -hadoop option is important. At this point I haven’t created an automatic Dumbo install script, so you’ll have to install Dumbo by hand each time you launch the cluster. Fortunately, installation is easy.
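In case you’ve never seen a Dumbo program before, an ipcount script could look roughly like the sketch below. This is not Nitin’s actual demo code, just a minimal illustration that assumes the IP address is the first whitespace-separated token on each log line:

```python
# Minimal ipcount-style Dumbo script (a sketch, not the actual demo):
# counts how often each IP address occurs in the input logs.

def mapper(key, value):
    # 'value' is a raw log line; emit (ip, 1) for its first token
    parts = value.split()
    if parts:
        yield parts[0], 1

def reducer(key, values):
    # sum the counts emitted for each IP address
    yield key, sum(values)

if __name__ == "__main__":
    # dumbo is only importable where it's installed, so the import is
    # guarded to keep the functions above testable on their own
    try:
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)
    except ImportError:
        pass
```

Since the reducer is both commutative and associative, it can double as a combiner, which cuts down the amount of intermediate data shuffled across the cluster.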
There was a minor hiccup that required the Amazon guys to pull the AMI with Dumbo support, but it’s back now and they seem to be confident that Dumbo support is going to remain available from now on. They are also still planning to make things even easier by providing an automatic Dumbo installation script.
As an aside, it’s worth mentioning that a bug in Hadoop Streaming got fixed in the process of adding Dumbo support to EMR. I can’t wait to see what else the Amazon guys have up their sleeves.