Dumbo backends

August 12, 2010

I released Dumbo 0.21.26 the other day. As usual, various bugs were fixed, but this release also incorporates an enhancement that makes it a bit more special: some refactoring that can be regarded as a first but important step towards pluggable backends.

Dumbo currently has two backends: one that runs locally on UNIX and one that runs on Hadoop Streaming. The code for both used to be interwoven with the core Dumbo logic, but it has now been abstracted away behind a proper backend interface, which will hopefully make it easier to add more backends in the future.

Personally, I would very much like Dumbo to get a backend for Avro Tether at some point. The two main starting points for making this happen would probably be my main refactoring commit and the Java implementation of a Tether client in the Avro unit tests.
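To give an idea of the shape such an interface could take, here is a minimal, hypothetical sketch. The class and method names below are illustrative only, not Dumbo’s actual API:

```python
# Hypothetical sketch of a pluggable backend interface;
# names are illustrative, not Dumbo's actual API.

class Backend:
    """Everything a backend would need to provide."""

    def create_iteration(self, opts):
        """Return an object that runs one map/reduce iteration."""
        raise NotImplementedError

    def create_filesystem(self, opts):
        """Return an object for cat/ls/rm-style file operations."""
        raise NotImplementedError


class UnixBackend(Backend):
    """Local UNIX backend stub."""

    def create_iteration(self, opts):
        return "unix iteration"  # placeholder

    def create_filesystem(self, opts):
        return "unix filesystem"  # placeholder


def get_backend(opts):
    """Pick a backend based on the options, defaulting to the local one."""
    if opts.get("hadoop"):
        raise NotImplementedError("streaming backend not sketched here")
    return UnixBackend()
```

An Avro Tether backend would then just be one more subclass behind the same interface.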

Moving to Hadoop 0.20

November 23, 2009

We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process:

  • We’re probably going to base our 0.20 build on Cloudera’s 0.20 distribution, and I found out the hard way that Dumbo doesn’t work on version 0.20.1+133 of this distribution, because it includes a patch for MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed this in 0.20.1+152 by reverting the patch, but if you’re still trying to get Dumbo to work on the 0.20.1+133 distribution for some reason, you can expect NullPointerExceptions and errors like “module wordcount not found” in your tasks’ stderr logs.
  • Also, the Cloudera guys apparently haven’t added the patch for MAPREDUCE-764 to their distribution yet, so you’ll still have to apply it yourself if you want to avoid strange encoding problems in certain corner cases. The patch was reviewed and accepted for Hadoop 0.21 quite a while ago, though, so maybe we can be hopeful about it being included in Cloudera’s 0.20 distribution soon.
  • The Twitter guys put together a pretty awesome patched and backported version of hadoop-gpl-compression for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you’re interested in this stuff, you might want to have a look at this guest post from Kevin and Eric on the Cloudera blog.


July 15, 2009

Unfortunately, the list of Hadoop patches required to make Dumbo work properly just expanded a bit, since I traced a strange encoding bug to an issue in Streaming’s typed bytes code. Hence, you might want to apply the MAPREDUCE-764 patch to your Hadoop build if you use Dumbo, even though the bug only causes problems in very specific cases and usually isn’t hard to work around. Hopefully this patch will make it into Hadoop 0.21.

This isn’t all bad news, however. The encoding bug was initially reported on the dumbo-user mailing list, which apparently has 12 subscribers already and is starting to attract fairly regular traffic. I haven’t promoted this mailing list much so far and, to be honest, never really expected that people would actually start using it, but obviously I was wrong. Everyone who reads this blog should consider subscribing; I’m sure you won’t regret it!

Talks mentioning Dumbo

April 28, 2009

Presumably, most of you have seen the slides from my lightning talk about Dumbo at the first HUGUK already, since they’ve been featured fairly prominently on the wiki for quite a while now. However, if you’re eager to find out more about Hadoop in general, how Dumbo relates to it exactly, and why and in what ways Dumbo is currently being used at Last.fm, you might also want to have a look at the following talks:

  • “Hadoop at Yahoo!” by Owen O’Malley [slides]
  • “Hadoop Ecosystem Tour” by Aaron Kimball [slides, video]
  • “Practical MapReduce” by Tom White [slides, video]
  • “Lots of Data, Little Money” by Martin Dittus [slides, video]

If you’ve still not had enough after going through all these slides and videos, you could also have a peek at the slides from my HUGUK #2 lightning talk, in which I briefly explained why we’ve recently been putting some effort into making Dumbo programs run faster.

Fast Python module for typed bytes

April 13, 2009

Over the past few days, I spent some time implementing a typed bytes Python module in C. It’s probably not quite ready for production use yet, and it still falls back to the pure Python module for floats, but it seems to work fine and already leads to substantial speedups.

For example, the Python program

from typedbytes import Output
Output(open("test.tb", "wb")).writes(xrange(10**7))

needs 18.8 secs to finish on this laptop, whereas it requires only 0.9 secs after replacing typedbytes with ctypedbytes. Similarly, the running time for

from typedbytes import Input
for item in Input(open("test.tb", "rb")).reads(): pass

can be reduced from 22.9 to merely 1.7 secs by using ctypedbytes instead of typedbytes.

Obviously, Dumbo programs can benefit from this faster typed bytes module as well, but the gains probably won’t be as spectacular as for the simple test programs above. To give it a go, make sure you’re using the latest version of Dumbo, build an egg for the ctypedbytes module, and add the following option to your start command:

-libegg <path to ctypedbytes egg>

From what I’ve seen so far, this can speed up Dumbo programs by 30%, which definitely makes it worth the effort if you ask me. In fact, the Dumbo program would now probably beat the Java program in the benchmark discussed here, but, unfortunately, this wouldn’t be a very fair comparison. Johan recently made me aware that it’s better to avoid Java’s split() method for strings when you don’t need regular expression support, and using a combination of substring() and indexOf() instead seems to make the Java program about 40% faster. So we’re not quite as fast as Java yet, but at least the gap has narrowed some more.
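If you want your own scripts to take advantage of the C module when it happens to be installed, without hard-depending on it, a simple fallback pattern does the trick. This is just an illustrative sketch, not part of Dumbo itself:

```python
import importlib


def load_typedbytes():
    """Prefer the C extension (ctypedbytes) when it is installed,
    fall back to the pure Python typedbytes module, and return None
    when neither is available."""
    for name in ("ctypedbytes", "typedbytes"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


tb = load_typedbytes()
# If tb is not None, use tb.Output and tb.Input exactly as before.
```

Since ctypedbytes mirrors the typedbytes API, the rest of the code doesn’t need to change at all.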


April 5, 2009

It looks like my complaining might’ve paid off, since HADOOP-5450 got committed on Friday, which has the fortunate consequence that Hadoop 0.21 won’t require any patching to make Dumbo work. Although having to apply a few patches is far from the end of the world, it might still be a show-stopper for some people, and using Dumbo on Cloudera’s distribution or Amazon’s Elastic MapReduce might only become feasible when Hadoop supports it “out of the box”.

I didn’t mean to suggest that Hadoop is a badly-organized open source project or anything like that, by the way. On the contrary, it’s far better organized than many of the other projects I’m familiar with. The only message I wanted to get across is that it would make sense to look for ways to get patches reviewed and committed more quickly. I heard some rumours about organizing commit fests, for instance, which sounds like a great potential solution to me.


April 1, 2009

HADOOP-5528 got committed yesterday. From Hadoop 0.21 onwards, join keys will work “out of the box”, without requiring any patching. Since the patch evolved somewhat before it got committed, it no longer works with Dumbo 0.20.3 though. Therefore, I released Dumbo 0.21.4 this morning, which, among other changes, fixes the incompatibility with the final HADOOP-5528 patch.

So far, my luck with getting Hadoop patches reviewed and committed has varied quite a bit. From my limited personal experience, it seems more difficult to get a committer to look at a bugfix or an important enhancement than at a new feature, even though such contributions can arguably be considered more important than new features. It is of course possible that these particular issues just happened to get overlooked somehow, or maybe there’s a procedure for attracting the committers’ attention that I’m not aware of, but nevertheless I’m still under the impression that Hadoop’s patch handling currently is not as smooth and efficient as it could be. The fact that, as of this writing, no fewer than 47 issues are in the “Patch available” state seems to confirm this impression.

Mapper and reducer interfaces

March 31, 2009

In Dumbo 0.21.3, an alternative interface for mappers and reducers was added. Using this interface, the “wordcount” example

def mapper(key, value):
    for word in value.split(): 
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(mapper, reducer, combiner=reducer)

can be written as follows:

def mapper(data):
    for key, value in data:
        for word in value.split(): 
            yield word, 1

def reducer(data):
    for key, values in data:
        yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(mapper, reducer, combiner=reducer)

Dumbo automatically detects which interface a function uses and calls it appropriately. In theory, the alternative version is faster since it involves fewer function calls, but the real reason the new interface was added is that it is lower-level and can make integration with existing Python code easier in some cases.
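The detection can be done by counting a function’s arguments; the sketch below illustrates the idea, though it is not Dumbo’s actual implementation:

```python
import inspect


def uses_iterator_interface(func):
    """Heuristic sketch of interface detection: a one-argument
    mapper/reducer takes the iterator of (key, value) pairs, while a
    two-argument one takes key and value directly. This mirrors the
    idea, not Dumbo's actual detection code."""
    return len(inspect.signature(func).parameters) == 1


def kv_mapper(key, value):
    for word in value.split():
        yield word, 1


def iter_mapper(data):
    for key, value in data:
        for word in value.split():
            yield word, 1
```

Given this, the framework can wrap a (key, value) function in a loop over the pairs, or hand the iterator straight to an iterator-style function.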

Just like the original interface, the alternative one also works for mapper and reducer classes. Adapting the first example above such that a class is used for both the mapper and reducer results in:

class Mapper:
    def __call__(self, key, value):
        for word in value.split(): 
            yield word, 1

class Reducer:
    def __call__(self, key, values):
        yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(Mapper, Reducer, combiner=Reducer)

Applying the same transformation to the version using the alternative interface leads to:

class Mapper:
    def __call__(self, data):
        for key, value in data:
            for word in value.split(): 
                yield word, 1

class Reducer:
    def __call__(self, data):
        for key, values in data:
            yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(Mapper, Reducer, combiner=Reducer)

Mapper and reducer functions that use the alternative interface are called only once, so you don’t need classes to add initialization or cleanup logic when using it, but classes can still be useful if you want to access the fields that Dumbo automatically adds to them.
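The two interfaces are interchangeable in the sense that a (key, value) function can always be wrapped into the iterator form. A small illustrative adapter, not part of Dumbo itself:

```python
def as_iterator_style(kv_func):
    """Wrap a (key, value) mapper/reducer so that it accepts the
    iterator of (key, value) pairs instead."""
    def wrapped(data):
        for key, value in data:
            for output in kv_func(key, value):
                yield output
    return wrapped


def mapper(key, value):
    for word in value.split():
        yield word, 1


data = [(0, "a b"), (1, "b")]
print(list(as_iterator_style(mapper)(data)))  # [('a', 1), ('b', 1), ('b', 1)]
```

Going the other way is not possible in general, which is exactly why the iterator form is the more low-level of the two.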

Join keys

March 20, 2009

Earlier today, I released Dumbo 0.21.3, which adds support for so-called “join keys” (amongst other things). Here’s an example of a Dumbo program that uses such keys:

def mapper(key, value):
    key.isprimary = "hostnames" in key.body[0]
    key.body, value = value.split("\t", 1)
    yield key, value

class Reducer:
    def __init__(self):
        self.key = None
    def __call__(self, key, values):
        if key.isprimary:
            self.key = key.body
            self.hostname = values.next()
        elif self.key == key.body:
            key.body = self.hostname
            for value in values:
                yield key, value

def runner(job):
    job.additer(mapper, Reducer)

def starter(prog):
    prog.addopt("addpath", "yes")
    prog.addopt("joinkeys", "yes")

if __name__ == "__main__":
    from dumbo import main
    main(runner, starter)

When you put this code in join.py, you can join the files hostnames.txt and logs.txt as follows:

$ wget http://users.ugent.be/~klbostee/files/hostnames.txt
$ wget http://users.ugent.be/~klbostee/files/logs.txt
$ dumbo join.py -input hostnames.txt -input logs.txt \
-output joined.code
$ dumbo cat joined.code > joined.txt

In order to make join keys work for non-local runs, however, you need to apply the patch from HADOOP-5528, which requires Hadoop 0.20 or higher. More precisely, Dumbo relies on the BinaryPartitioner from HADOOP-5528 to make sure that:

  1. All keys that differ only in the .isprimary attribute are passed to the same reducer.
  2. The primary keys are always reduced before the non-primary ones.
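These two guarantees can be simulated locally. The following sketch is my own illustration, not Dumbo’s code: sorting records on (body, not isprimary) mimics the partitioner and sort order, so that records with the same key body end up in the same group with the primary record first, after which the same reduce logic as in the example program applies:

```python
def local_join(records):
    """records: iterable of (body, isprimary, value) tuples.
    Sorting on (body, not isprimary) mimics the partitioner/sort
    guarantees: same body -> same group, primary record first."""
    hostname = {}
    for body, isprimary, value in sorted(records,
                                         key=lambda r: (r[0], not r[1])):
        if isprimary:
            hostname[body] = value       # remember the primary record
        elif body in hostname:
            yield hostname[body], value  # replace the key by the hostname


records = [
    ("10.0.0.1", True, "hosta"),
    ("10.0.0.1", False, "GET /index.html"),
    ("10.0.0.2", False, "GET /other.html"),  # no primary record: dropped
]
print(list(local_join(records)))  # [('hosta', 'GET /index.html')]
```

On a cluster, the BinaryPartitioner provides exactly these grouping and ordering guarantees, so the reducer can get away with remembering only the most recent primary record.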

If you want to find out how this works exactly, you might want to watch Cloudera’s “MapReduce Algorithms” lecture, since joining by means of a custom partitioner is one of the common idioms discussed in this lecture.

UPDATE: Dumbo 0.21.3 is not compatible with the evolved version of the patch for HADOOP-5528. To make things work with the final patch, you need to upgrade to Dumbo 0.21.4.

How to contribute

March 12, 2009

As part of the attempt to get more organized, this post outlines how to contribute to Dumbo. I might turn this into a wiki page eventually, but a blog post will probably get more attention, and people who are not planning to contribute to Dumbo any time soon might still be interested in the process described in the remainder of this post.

If you haven’t done so already, create a GitHub account and fork the master tree. Then go through the following steps:

  1. Either create a new ticket for the changes you have in mind, or add a comment to the corresponding existing ticket to inform everyone that you started working on it.
  2. Make the necessary changes (and add unit tests for them).
  3. Run all unit tests:

    $ python setup.py test
  4. If none of the tests fail, commit your changes using a commit message that GitHub’s issue tracker understands:

    $ git commit -a -m "Closes GH-<ticket number>"
  5. Push your commit to GitHub:

    $ git push
  6. Send a pull request to me.

If you want to be able to easily get back to it later, you can also create a separate branch for the ticket and push that branch to GitHub instead. This might, for instance, save you some time when I spot a bug in your changes and ask you to send another pull request after fixing it.
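As a concrete sketch of that branch-per-ticket workflow: the ticket number 123 and the branch name below are hypothetical, and the commands create a throwaway repository purely for illustration:

```shell
# Demonstrate the branch-per-ticket workflow in a throwaway repository.
git init -q demo && cd demo
git config user.email "you@example.com" && git config user.name "You"
git checkout -q -b gh-123            # topic branch named after the ticket
echo "fix" > fix.txt && git add fix.txt
git commit -q -m "Closes GH-123"     # message GitHub's tracker understands
git log -1 --pretty=%s               # prints: Closes GH-123
```

In a real fork you’d then push the topic branch with `git push origin gh-123` and send the pull request from that branch.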

