Moving to Hadoop 0.20

We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process:

  • We’re probably going to base our 0.20 build on Cloudera’s 0.20 distribution, and I found out the hard way that Dumbo doesn’t work on version 0.20.1+133 of this distribution because it includes a patch for MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed it in 0.20.1+152 by reverting this patch, but if you’re still trying to get Dumbo to work on Cloudera’s 0.20.1+133 distribution for some reason, you can expect NullPointerExceptions and errors such as “module wordcount not found” in your tasks’ stderr logs.
  • Also, the Cloudera guys apparently haven’t added the patch for MAPREDUCE-764 to their distribution yet, so you’ll still have to apply this patch yourself if you want to avoid strange encoding problems in certain corner cases. The patch was reviewed and accepted for Hadoop 0.21 quite a while ago, though, so hopefully it will make its way into Cloudera’s 0.20 distribution soon.
  • The Twitter guys put together a pretty awesome patched and backported version of hadoop-gpl-compression for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you’re interested in this stuff, you might want to have a look at this guest post from Kevin and Eric on the Cloudera blog.
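For readers who haven’t used Dumbo before, the “module wordcount not found” errors mentioned above come from ordinary Dumbo programs like the classic wordcount example. A minimal sketch (the filename and the paths in the launch command are illustrative, not from our setup):

```python
# wordcount.py -- a minimal Dumbo program (hypothetical filename)

def mapper(key, value):
    # value is one line of input text; emit (word, 1) for every token
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values is an iterator over the counts emitted for this word
    yield key, sum(values)

# When run under Dumbo, the module is wired into Hadoop Streaming via:
#   import dumbo
#   dumbo.run(mapper, reducer)
# and launched with something like:
#   dumbo start wordcount.py -hadoop /path/to/hadoop \
#       -input input.txt -output wordcounts
```

Since Dumbo ships Python modules to the tasks through Hadoop Streaming, a broken Streaming build (like the one in 0.20.1+133 above) shows up as exactly these “module not found” failures.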

7 Responses to Moving to Hadoop 0.20

  1. Todd Lipcon says:

    Hey Klaas,

    Glad to hear you’re considering our distribution. I’ve added MAPREDUCE-764 to our patch queue. I can’t promise an exact timeline at this point, but it should probably be included within a few weeks.

    Thanks
    -Todd

  2. Klaas says:

    Awesome, thanks Todd!

  3. Sorry for OT, but what happened with Dumbo getting into Hadoop contrib? I see it didn’t happen yet, but do you plan on doing that in the future?

    • Klaas says:

      I no longer have any plans to add Dumbo itself to Hadoop. Now that all the required Java code has been generalized and accepted into Hadoop 0.21 as general features and enhancements, installing Dumbo itself is very easy (just run python ez_setup.py dumbo or easy_install dumbo and you’re done), so getting it into Hadoop contrib wouldn’t buy us much anymore. Moreover, development would probably become harder and slower, since we’d have to rely on Hadoop committers to get changes or additions included in the codebase.

  4. Thanks. Does getting Dumbo in the contrib buy you something like “automatic new version compatibility” (I assume Hadoop devs would ensure that if some Dumbo unit tests suddenly broke because of a new Hadoop change, they would update Dumbo and make sure tests pass again).

    Also, if you worked on Dumbo in Hadoop, either as a contrib or a sub-project, you could become a committer yourself, so you wouldn’t depend on others. We do this regularly in the Lucene TLP.

  5. […] in the real world. But perhaps we’re using, say, Hadoop streaming, and we read something like this, or any one of a dozen comments on the mailing list, which tell us we might need a patch that […]

