Dumbo backends

August 12, 2010

I released Dumbo 0.21.26 the other day. As usual, we fixed various bugs, but this release also incorporates an enhancement that makes it a bit more special: some refactoring that can be regarded as a first but important step towards pluggable backends.

Dumbo currently has two different backends: one that runs locally on UNIX and one that runs on Hadoop Streaming. The code for both of these backends used to be interwoven with the core Dumbo logic, but it has now been abstracted away behind a proper backend interface, which will hopefully make it easier to add more backends in the future.
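To make the idea concrete, a pluggable backend interface could look roughly like the sketch below. This is purely illustrative: the names `Backend`, `LocalBackend`, and `get_backend` are my invention for this post and not Dumbo's actual API.

```python
# Illustrative sketch of a pluggable backend interface; the names are
# hypothetical and do not correspond to Dumbo's real internals.
class Backend(object):
    """Abstract interface that each execution backend implements."""
    def run(self, mapper, reducer, data):
        raise NotImplementedError

class LocalBackend(Backend):
    """Toy 'local UNIX' backend: runs the whole job in-process."""
    def run(self, mapper, reducer, data):
        # Map phase: apply the mapper to every input record
        pairs = [kv for record in data for kv in mapper(record)]
        # Shuffle phase: group the emitted values by key
        groups = {}
        for key, value in pairs:
            groups.setdefault(key, []).append(value)
        # Reduce phase: one reducer call per key, in sorted key order
        return [reducer(key, values) for key, values in sorted(groups.items())]

def get_backend(name):
    # A registry like this would let new backends (Hadoop Streaming,
    # Avro Tether, ...) plug in without touching the core logic.
    backends = {"local": LocalBackend}
    return backends[name]()
```

With such an interface, a word count could be run locally via `get_backend("local").run(...)`, and switching backends would be a one-line change.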

Personally, I would very much like Dumbo to get a backend for Avro Tether at some point. The two main starting points for making this happen would probably be my main refactoring commit and the Java implementation of a Tether client in the Avro unit tests.

Moving to Hadoop 0.20

November 23, 2009

We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process:

  • We’re probably going to base our 0.20 build on Cloudera’s 0.20 distribution, and I found out the hard way that Dumbo doesn’t work on version 0.20.1+133 of this distribution because it includes a patch for MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed it in 0.20.1+152 by reverting this patch, but if you’re still trying to get Dumbo to work on Cloudera’s 0.20.1+133 distribution for some reason, you can expect to get NullPointerExceptions and errors like “module wordcount not found” in your tasks’ stderr logs.
  • Also, the Cloudera guys apparently haven’t added the patch for MAPREDUCE-764 to their distribution yet, so you’ll still have to apply this patch yourself if you want to avoid strange encoding problems in certain corner cases. The patch has been reviewed and accepted for Hadoop 0.21 for quite a while now, though, so maybe we can be hopeful about it getting included in Cloudera’s 0.20 distribution soon.
  • The Twitter guys put together a pretty awesome patched and backported version of hadoop-gpl-compression for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you’re interested in this stuff, you might want to have a look at this guest post from Kevin and Eric on the Cloudera blog.


July 15, 2009

Unfortunately, the list of Hadoop patches required to make Dumbo work properly just expanded a bit, since I tracked down a strange encoding bug to an issue in Streaming’s typed bytes code. Hence, you might want to apply the MAPREDUCE-764 patch to your Hadoop build if you use Dumbo, even though the bug only leads to problems in very specific cases and usually isn’t hard to work around. Hopefully this patch will make it into Hadoop 0.21.

This isn’t all bad news, however. The encoding bug was initially reported on the dumbo-user mailing list, which apparently has 12 subscribers already and is starting to attract fairly regular traffic. I haven’t promoted this mailing list much so far and, to be honest, never really expected that people would actually start using it, but obviously I was wrong. Everyone who reads this blog should consider subscribing; I’m sure you won’t regret it!

Talks mentioning Dumbo

April 28, 2009

Presumably, most of you have seen the slides from my lightning talk about Dumbo at the first HUGUK already, since they’ve been featured fairly prominently on the wiki for quite a while now. However, if you’re eager to find out more about Hadoop in general, how Dumbo relates to it exactly, and why and in what ways Dumbo is currently being used at Last.fm, you might also want to have a look at the following talks:

  • “Hadoop at Yahoo!” by Owen O’Malley [slides]
  • “Hadoop Ecosystem Tour” by Aaron Kimball [slides, video]
  • “Practical MapReduce” by Tom White [slides, video]
  • “Lots of Data, Little Money” by Martin Dittus [slides, video]

If you’ve still not had enough after going through all these slides and videos, you could also have a peek at the slides from my HUGUK #2 lightning talk, in which I briefly explained why we’ve recently been putting some effort into making Dumbo programs run faster.

Fast Python module for typed bytes

April 13, 2009

Over the past few days, I spent some time implementing a typed bytes Python module in C. It’s probably not quite ready for production use yet, and it still falls back to the pure Python module for floats, but it seems to work fine and already leads to substantial speedups.

For example, the Python program

from typedbytes import Output
Output(open("test.tb", "wb")).writes(xrange(10**7))

needs 18.8 secs to finish on this laptop, whereas it requires only 0.9 secs after replacing typedbytes with ctypedbytes. Similarly, the running time for

from typedbytes import Input
for item in Input(open("test.tb", "rb")).reads(): pass

can be reduced from 22.9 to merely 1.7 secs by using ctypedbytes instead of typedbytes.
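For readers curious about what actually ends up in that test.tb file: typed bytes is a simple tagged binary format in which every value is a one-byte type code followed by a big-endian payload. The sketch below handles only 32-bit integers, assuming type code 3 as in the Hadoop Streaming typed bytes specification; the real typedbytes and ctypedbytes modules of course cover the full range of types.

```python
import struct
from io import BytesIO

# Minimal sketch of the typed bytes wire format, for 32-bit ints only:
# a one-byte type code (3 for int) followed by the value as a 4-byte
# big-endian integer. Real typedbytes supports many more type codes.
def write_int(stream, value):
    stream.write(struct.pack(">bi", 3, value))

def read_int(stream):
    code, value = struct.unpack(">bi", stream.read(5))
    if code != 3:
        raise ValueError("expected an int type code")
    return value

# Round-trip a few values through an in-memory buffer
buf = BytesIO()
for i in range(5):
    write_int(buf, i)
buf.seek(0)
values = [read_int(buf) for _ in range(5)]
```

Since every record carries this fixed, regular framing, a C implementation can serialize and parse it with little more than a few `memcpy`s, which is where the speedups come from.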

Obviously, Dumbo programs can benefit from this faster typed bytes module as well, but the gains probably won’t be as spectacular as for the simple test programs above. To give it a go, make sure you’re using the latest version of Dumbo, build an egg for the ctypedbytes module, and add the following option to your start command:

-libegg <path to ctypedbytes egg>

From what I’ve seen so far, this can speed up Dumbo programs by 30%, which definitely makes it worth the effort if you ask me. In fact, the Dumbo program would now probably beat the Java program in the benchmark discussed here, but, unfortunately, this wouldn’t be a very fair comparison. Johan recently made me aware of the fact that it’s better to avoid Java’s split() method for strings when you don’t need regular expression support; using a combination of substring() and indexOf() instead seems to make the Java program about 40% faster. So we’re not quite as fast as Java yet, but at least the gap has narrowed some more.
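If you cannot ship the egg to every node, a graceful-degradation import keeps your program working (just more slowly) wherever the C extension is missing. This is a suggestion on my part, not something Dumbo does for you; it is the same idiom Python programmers have long used for pickle/cPickle, which is the pair the runnable demonstration below uses since those modules ship with Python.

```python
# Prefer the C implementation when it is importable, and fall back to
# the pure-Python one otherwise. With the modules from this post the
# idiom would read:
#
#     try:
#         import ctypedbytes as typedbytes
#     except ImportError:
#         import typedbytes
#
# Demonstrated here with pickle/cPickle, a pair that ships with Python:
try:
    import cPickle as pickle  # C implementation (Python 2 only)
except ImportError:
    import pickle  # pure-Python module (merged with the C code later on)
```

Code written against the slow module then picks up the fast one automatically whenever the egg happens to be on the path.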


April 5, 2009

It looks like my complaining might’ve paid off, since HADOOP-5450 got committed on Friday, which has the fortunate consequence that Hadoop 0.21 won’t require any patching to make Dumbo work. Although having to apply a few patches is far from the end of the world, it might still be a show-stopper for some people, and using Dumbo on Cloudera’s distribution or Amazon’s Elastic MapReduce might only become feasible once Hadoop supports it “out of the box”.

I didn’t mean to suggest that Hadoop is a badly-organized open source project or anything like that, by the way. On the contrary, it’s far better organized than many of the other projects I’m familiar with. The only message I wanted to get across is that it would make sense to look for ways to get patches reviewed and committed more quickly. I heard some rumours about organizing commit fests, for instance, which sounds like a great potential solution to me.


April 1, 2009

HADOOP-5528 got committed yesterday. From Hadoop 0.21 onwards, join keys will work “out of the box”, without requiring any patching. Since the patch evolved somewhat before it got committed, it no longer works with Dumbo 0.20.3, though. Therefore, I released Dumbo 0.21.4 this morning; among other changes, it fixes the incompatibility with the final HADOOP-5528 patch.

So far, my luck with getting Hadoop patches reviewed and committed has varied quite a bit. From my limited personal experience, it seems that it’s more difficult to get a committer to look at a bugfix or an important enhancement, while such contributions can actually be considered more important than new features. It is of course possible that these particular issues just happened to get overlooked somehow, or maybe there’s a procedure for attracting the committers’ attention that I’m not aware of, but nevertheless I’m still under the impression that Hadoop’s patch handling currently is not as smooth and efficient as it could be. The fact that, as of this writing, not less than 47 issues are in the “Patch available” state, seems to confirm this impression.