April 5, 2009
It looks like my complaining might’ve paid off: HADOOP-5450 got committed on Friday, which has the fortunate consequence that Hadoop 0.21 won’t require any patching to make Dumbo work. Although having to apply a few patches is far from the end of the world, it might still be a show-stopper for some people, and using Dumbo on Cloudera’s distribution or Amazon’s Elastic MapReduce might only become feasible once Hadoop supports Dumbo “out of the box”.
I didn’t mean to suggest that Hadoop is a badly-organized open source project or anything like that, by the way. On the contrary, it’s far better organized than many of the other projects I’m familiar with. The only message I wanted to get across is that it would make sense to look for ways to get patches reviewed and committed more quickly. I heard some rumours about organizing commit fests, for instance, which sounds like a great potential solution to me.
March 11, 2009
I finally took the time to release Dumbo 0.21. All Java code is gone now, and the Python code got split up into submodules to make things clearer and more maintainable. Installing Dumbo itself is now very easy, but until Hadoop 0.21 gets released it requires a patched Hadoop, since it relies on HADOOP-1722 to replace the functionality provided by the removed Java code.
Thanks to Daniel Lescohier, the typedbytes Python module — which is now a required dependency for Dumbo — also had its version number increased. Version 0.3 of this module incorporates various improvements and optimizations, but one of these improvements requires another patch to be applied to Hadoop, unfortunately.
So, for now, you have to apply the patches from both HADOOP-1722 and HADOOP-5450 to use the latest Dumbo offerings. This minor annoyance will hopefully be remedied by Hadoop 0.21, since the plan is to get HADOOP-5450 into that release, just like HADOOP-1722.