Integration with Java code

June 16, 2009

Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while, flexibility and speed being the most likely reasons. Thanks to a recent enhancement, this is now easily achievable. Here’s a version that uses the example mapper and reducer from the feathers project (and thus requires -libjar feathers.jar):

import dumbo

def runner(job):
    # Java classes from feathers.jar; the mapper class name here is
    # illustrative, the reducer is the one discussed below
    job.additer("fm.last.feathers.map.Words",
                "fm.last.feathers.reduce.Sum",
                combiner="fm.last.feathers.reduce.Sum")

if __name__ == "__main__":
    dumbo.main(runner)

You can mix Python and Java in basically any way you like. There’s only one minor restriction: you cannot use a Python combiner when you specify a Java mapper. Things should still work in that case, but they’ll be slow since the combiner won’t actually run. In theory, this limitation could be avoided by relying on HADOOP-4842, but personally I don’t think it’s worth the trouble.
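For comparison, the pure-Python side of such a mix is just a pair of generator functions following Dumbo’s conventions — a minimal sketch of the usual wordcount pair:

```python
# Dumbo-style wordcount in pure Python: plain generator functions, which
# are interchangeable with Java class names in the job definition.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

# Being ordinary generators, they are easy to test without Hadoop:
print(list(reducer("Brian", [1] * 6)))  # [('Brian', 6)]
```

Swapping either function for a Java class name is all it takes to go hybrid.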

The source code for these example classes, including fm.last.feathers.reduce.Sum, is just as straightforward as the code for the OutputFormat classes discussed in my previous post. All you have to keep in mind is that only the mapper input keys and values can be arbitrary writables; every other key or value has to be a TypedBytesWritable. Writing a custom Java partitioner for Dumbo programs is equally easy, by the way. The fm.last.feathers.partition.Prefix class is a simple example; it can be used by specifying -partitioner fm.last.feathers.partition.Prefix.
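The idea behind such a prefix partitioner can be sketched in a few lines of Python — this illustrates the concept only, not the actual Java implementation:

```python
# Conceptual sketch of prefix partitioning: records whose composite keys
# share a first component are routed to the same reducer.
def prefix_partition(key, num_partitions):
    return hash(key[0]) % num_partitions

# All keys with the same prefix land in the same partition:
p = prefix_partition(("node1", 1), 4)
print(p == prefix_partition(("node1", 99), 4))  # True
```

The Java version does the same thing on TypedBytesWritable keys, which is why it stays so short.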

As you probably expected, none of this works for local runs on UNIX, but you can still test things locally fairly easily by running on Hadoop in standalone mode.

Dumbo on Cloudera’s distribution

May 31, 2009

Over the last couple of days, I picked up some rumors that all of the patches on which Dumbo relies had been included in the most recent version of Cloudera’s Hadoop distribution. Todd confirmed this to me yesterday, so the time was right to finally have a look at Cloudera’s nicely packaged and patched-up Hadoop.

I started from a chrooted Debian server, on which I installed the Cloudera distribution, Python 2.5, and Dumbo as follows:

# cat /etc/apt/sources.list
deb etch main contrib non-free
deb etch-backports main contrib non-free
deb etch contrib
deb-src etch contrib
# wget -O - | apt-key add -
# wget -O - | apt-key add -
# apt-get update
# apt-get install hadoop python2.5 python2.5-dev
# wget
# python2.5 dumbo

Then, I created a user for myself and confirmed that the program runs properly on Cloudera’s distribution in standalone mode:

# adduser klaas
# su - klaas
$ wget
$ dumbo start -input brian.txt -output brianwc \
-python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian   6

Unsurprisingly, it also worked perfectly in pseudo-distributed mode:

$ exit
# apt-get install hadoop-conf-pseudo
# /etc/init.d/hadoop-namenode start
# /etc/init.d/hadoop-secondarynamenode start
# /etc/init.d/hadoop-datanode start
# /etc/init.d/hadoop-jobtracker start
# /etc/init.d/hadoop-tasktracker start
# su - klaas
$ dumbo start -input brian.txt -output brianwc \
-python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm brianwc/_logs -hadoop /usr/lib/hadoop/
Deleted hdfs://localhost/user/klaas/brianwc/_logs
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian   6

Note that I removed the _logs directory first because dumbo cat would’ve complained about it otherwise. You can avoid this minor annoyance by disabling the creation of _logs directories.

I also verified that HADOOP-5528 got included by running the join example successfully:

$ wget
$ wget
$ dumbo put hostnames.txt hostnames.txt -hadoop /usr/lib/hadoop/
$ dumbo put logs.txt logs.txt -hadoop /usr/lib/hadoop/
$ dumbo start -input hostnames.txt -input logs.txt \
-output joined -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm joined/_logs -hadoop /usr/lib/hadoop
$ dumbo cat joined -hadoop /usr/lib/hadoop | grep node1
node1   5

And while I was at it, I did a quick typedbytes versus ctypedbytes comparison as well:

$ zcat /usr/share/man/man1/python2.5.1.gz >
$ for i in `seq 100000`; do cat >> python.txt; done
$ du -h python.txt
1.2G    python.txt
$ dumbo put python.txt python.txt -hadoop /usr/lib/hadoop/
$ time dumbo start -input python.txt -output pywc \
-python python2.5 -hadoop /usr/lib/hadoop/
real    17m45.473s
user    0m1.380s
sys     0m0.224s
$ exit
# apt-get install gcc libc6-dev
# su - klaas
$ python2.5 -zmaxd. ctypedbytes
$ time dumbo start -input python.txt -output pywc2 \
-python python2.5 -hadoop /usr/lib/hadoop/ \
-libegg ctypedbytes-0.1.5-py2.5-linux-i686.egg
real    13m22.420s
user    0m1.320s
sys     0m0.216s
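Converted to seconds, those wall-clock times work out as follows (pure arithmetic, run anywhere):

```python
# Wall-clock times from the two runs above, in seconds
plain = 17 * 60 + 45.473   # typedbytes:  17m45.473s
fast  = 13 * 60 + 22.420   # ctypedbytes: 13m22.420s

# Relative reduction in running time, as a percentage
print(round(100 * (plain - fast) / plain, 1))  # 24.7
```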

In this particular case, ctypedbytes appears to be 25% faster. Your mileage may vary since the running times depend on many factors, but in any case I’d always expect ctypedbytes to lead to significant speed improvements.
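For context, typedbytes implements the simple length-prefixed binary format from HADOOP-1722. A minimal pure-Python sketch of the encoding — the type codes follow the typed bytes spec, but these helper functions are illustrative, not the typedbytes module’s actual API:

```python
import struct

# Typed bytes encoding sketch: a 1-byte type code followed by the payload
# (type code 3 = 32-bit int, 7 = length-prefixed UTF-8 string).
def encode_int(n):
    return struct.pack(">bi", 3, n)

def encode_string(s):
    data = s.encode("utf-8")
    return struct.pack(">bi", 7, len(data)) + data

print(encode_int(6).hex())           # 0300000006
print(encode_string("Brian").hex())  # 0700000005427269616e
```

ctypedbytes speeds things up simply by doing this packing and unpacking in C instead of Python.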

Powered by Dumbo?

May 9, 2009

I’ve slowly started taking on the slightly daunting task of writing my Ph.D. dissertation, and I’m considering including a chapter about Dumbo and Hadoop. However, thinking about this made me realize that I’m pretty clueless as to how many people are using Dumbo, and what it’s being used for outside of Last.fm. I know for a fact that CBSi started using it recently, and there are a few other companies like Lookery that appear to be making use of it as well, but I don’t really know exactly what they’re using it for, and judging from the number of questions I keep getting, there must be more people out there using Dumbo for non-toy projects. So, if you aren’t just reading this blog out of personal interest, please drop me a line at klaas at last dot fm or add a comment to this post. It’ll make my day, and you might get an honorable mention in my dissertation. If the list gets long enough, I might even devote an entire wiki page to it.

Dumbo IP count in C

May 3, 2009

It doesn’t comply very well with the goal of making it as easy as possible to write MapReduce programs, but Dumbo mappers and reducers can also be written in C instead of Python. I just put an example on GitHub to illustrate this. Although it’s nowhere near as convenient as using Python, writing a mapper or reducer in C is not that hard since you get to use the nifty Python C API, and in some specific cases the speed gains might be worth the extra effort. Moreover, setuptools nicely takes care of all the building and compiling, and you can limit the C code to computationally expensive parts and still use Python for the rest.
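On the build side, it really can be as small as pointing setuptools at the C source; a sketch with illustrative names (the actual GitHub example may differ):

```python
from setuptools import Extension

# setuptools compiles cwordcount.c into an importable extension module
# when this Extension is passed to setup(); both names are illustrative.
ext = Extension("cwordcount", sources=["cwordcount.c"])
print(ext.name)  # cwordcount
```

In a real setup.py you would pass ext_modules=[ext] to setuptools.setup(), build an egg with python setup.py bdist_egg, and ship it to the nodes via Dumbo’s -libegg option.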

Talks mentioning Dumbo

April 28, 2009

Presumably, most of you have already seen the slides from my lightning talk about Dumbo at the first HUGUK, since they’ve been featured fairly prominently on the wiki for quite a while now. However, if you’re eager to find out more about Hadoop in general, how exactly Dumbo relates to it, and why and in what ways Dumbo is currently being used at Last.fm, you might also want to have a look at the following talks:

  • “Hadoop at Yahoo!” by Owen O’Malley [slides]
  • “Hadoop Ecosystem Tour” by Aaron Kimball [slides, video]
  • “Practical MapReduce” by Tom White [slides, video]
  • “Lots of Data, Little Money” by Martin Dittus [slides, video]

If you’ve still not had enough after going through all these slides and videos, you could also have a peek at the slides from my HUGUK #2 lightning talk, in which I briefly explained why we’ve recently been putting some effort into making Dumbo programs run faster.

How to contribute

March 12, 2009

As part of the attempt to get more organized, this post outlines how to contribute to Dumbo. I might turn this into a wiki page eventually, but a blog post will probably get more attention, and people who are not planning to contribute to Dumbo any time soon might still be interested in the process described in the remainder of this post.

If you haven’t done so already, create a GitHub account and fork the master tree. Then go through the following steps:

  1. Either create a new ticket for the changes you have in mind, or add a comment to the corresponding existing ticket to inform everyone that you started working on it.
  2. Make the necessary changes (and add unit tests for them).
  3. Run all unit tests:

    $ python setup.py test
  4. If none of the tests fail, commit your changes using a commit message that GitHub’s issue tracker understands:

    $ git commit -a -m "Closes GH-<ticket number>"
  5. Push your commit to GitHub:

    $ git push
  6. Send a pull request to me.

If you want to be able to easily get back to it later, you can also create a separate branch for the ticket and push that branch to GitHub instead. This might, for instance, save you some time when I spot a bug in your changes and ask you to send another pull request after fixing it.

Getting organized

March 11, 2009

I created Assembla issue trackers for both Dumbo and Typedbytes a few moments ago. They aren’t quite as powerful as JIRA, but they integrate nicely with GitHub and should be more than sufficient for the time being.

UPDATE: We now use GitHub’s native issue trackers.