The Hadoop guys recently accepted a patch that adds support for binary communication formats to Streaming, the Hadoop component on which Dumbo is based. As described in the comments for the corresponding JIRA issue (HADOOP-1722), this patch basically abstracts away the used communication format by introducing the classes InputWriter and OutputReader. It provides extensions of these classes that implement three different communication formats, including the so called typed bytes format, which happens to be a great alternative to the functionality provided by the Java code that is currently part of Dumbo.
A while ago, Johan and I ran four versions of the IP count program on about 300 gigs of weblogs, which resulted in the following timings:
- Java: 8 minutes
- Dumbo with typed bytes: 10 minutes
- Hive: 13 minutes
- Dumbo without typed bytes: 16 minutes
Hence, thanks to typed bytes, Java is only 20% faster than Dumbo anymore, and there might still be some room for narrowing this gap even further. For instance, I wouldn’t be surprised if another minute could be chopped off by implementing the typed bytes serialization at the Python side in C instead of pure Python, just like the Thrift guys did with their fastbinary Python module. Since typed bytes are also a cleaner and more maintainable solution, the Java code will probably be removed in Dumbo 0.21, making typed bytes the only supported approach.
Unfortunately, all Hadoop releases before 0.21.0 will not yet include typed bytes support, but I do plan to attach backported patches to HADOOP-1722. The patch for 0.18 is already there, and backports for more recent versions of Hadoop will follow as soon as we start using those versions at Last.fm. In order to use typed bytes with Dumbo 0.20, you just have to make sure the typedbytes module is installed on the machine from which you run your Dumbo programs, add typedbytes: all and type: typedbytes to the sections [streaming] and [cat], respectively, of your /etc/dumbo.conf or ~/.dumborc file, and apply the HADOOP-1722 patch to your Hadoop build:
$ cd /path/to/hadoop-0.18
$ wget 'https://issues.apache.org/jira/secure/attachment/'\
$ patch -p0 < HADOOP-1722-branch-0.18.patch
$ ant package
For now, Dumbo will still look for its jar though, so you cannot get rid of src/contrib/dumbo in your Hadoop directory just yet.