HADOOP-1722 and typed bytes

The Hadoop guys recently accepted a patch that adds support for binary communication formats to Streaming, the Hadoop component on which Dumbo is based. As described in the comments for the corresponding JIRA issue (HADOOP-1722), this patch basically abstracts away the used communication format by introducing the classes InputWriter and OutputReader. It provides extensions of these classes that implement three different communication formats, including the so called typed bytes format, which happens to be a great alternative to the functionality provided by the Java code that is currently part of Dumbo.

A while ago, Johan and I ran four versions of the IP count program on about 300 gigs of weblogs, which resulted in the following timings:

  1. Java: 8 minutes
  2. Dumbo with typed bytes: 10 minutes
  3. Hive: 13 minutes
  4. Dumbo without typed bytes: 16 minutes

Hence, thanks to typed bytes, Java is only 20% faster than Dumbo anymore, and there might still be some room for narrowing this gap even further. For instance, I wouldn’t be surprised if another minute could be chopped off by implementing the typed bytes serialization at the Python side in C instead of pure Python, just like the Thrift guys did with their fastbinary Python module. Since typed bytes are also a cleaner and more maintainable solution, the Java code will probably be removed in Dumbo 0.21, making typed bytes the only supported approach.

Unfortunately, all Hadoop releases before 0.21.0 will not yet include typed bytes support, but I do plan to attach backported patches to HADOOP-1722. The patch for 0.18 is already there, and backports for more recent versions of Hadoop will follow as soon as we start using those versions at Last.fm. In order to use typed bytes with Dumbo 0.20, you just have to make sure the typedbytes module is installed on the machine from which you run your Dumbo programs, add typedbytes: all and type: typedbytes to the sections [streaming] and [cat], respectively, of your /etc/dumbo.conf or ~/.dumborc file, and apply the HADOOP-1722 patch to your Hadoop build:

$ cd /path/to/hadoop-0.18
$ wget 'https://issues.apache.org/jira/secure/attachment/'\
$ patch -p0 < HADOOP-1722-branch-0.18.patch
$ ant package

For now, Dumbo will still look for its jar though, so you cannot get rid of src/contrib/dumbo in your Hadoop directory just yet.

6 Responses to HADOOP-1722 and typed bytes

  1. Johan Oskarsson says:

    It’s worth mentioning that Dumbo’s combiner is was outputting significantly fewer entries then the Java combiner, suggesting that the Hadoop combiner can be made more efficient.

  2. […] Posts HADOOP-1722 and typed bytes Mapper and reducer classesRandom samplingAboutRecommending JIRA […]

  3. […] Top Posts Recommending JIRA issuesHADOOP-1722 and typed bytes […]

  4. […] me. In fact, the Dumbo program would now probably beat the Java program in the benchmark discussed here, but, unfortunately, this wouldn’t be a very fair comparison. Johan recently made me aware of […]

  5. Most people are resorting to many different types of this, as conventional methods are getting more complicated and displaying more side effects. your posting explores a few of these different sorts of methods and just how the benefit us, thanks! thanks

  6. Gabriela says:

    each time i used to read smaller content that also clear their motive, and that is also happening with this paragraph which I
    am reading at this time.

Leave a Reply to Johan Oskarsson Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: