<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Dumbotics &#187; Explanations</title>
	<atom:link href="http://dumbotics.com/category/explanations/feed/" rel="self" type="application/rss+xml" />
	<link>http://dumbotics.com</link>
	<description>Pseudo-random ramblings about Dumbo and Hadoop</description>
	<lastBuildDate>Mon, 30 Apr 2012 21:13:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='dumbotics.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Dumbotics &#187; Explanations</title>
		<link>http://dumbotics.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://dumbotics.com/osd.xml" title="Dumbotics" />
	<atom:link rel='hub' href='http://dumbotics.com/?pushpress=hub'/>
		<item>
		<title>Dumbo backends</title>
		<link>http://dumbotics.com/2010/08/12/dumbo-backends/</link>
		<comments>http://dumbotics.com/2010/08/12/dumbo-backends/#comments</comments>
		<pubDate>Thu, 12 Aug 2010 13:21:51 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[backends]]></category>
		<category><![CDATA[tether]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1354</guid>
		<description><![CDATA[I released Dumbo 0.21.26 the other day. As usual we fixed various bugs, but this release also incorporates an enhancement that makes it a bit more special, namely, some refactoring that can be regarded as a first but important step towards pluggable backends. Dumbo currently has two different backends, one that runs locally on UNIX and [...]]]></description>
			<content:encoded><![CDATA[<p>I released <a href="http://github.com/klbostee/dumbo/downloads">Dumbo 0.21.26</a> the other day. As usual we <a href="http://github.com/klbostee/dumbo/issues/closed">fixed various bugs</a>, but this release also incorporates an enhancement that makes it a bit more special, namely, <a href="http://github.com/klbostee/dumbo/issues/closed#issue/8">some refactoring</a> that can be regarded as a first but important step towards pluggable backends.</p>
<p>Dumbo currently has two different backends, one that runs locally on UNIX and another that runs on <a href="http://hadoop.apache.org/common/docs/r0.20.2/streaming.html">Hadoop Streaming</a>. The code for both of these backends used to be interwoven with the core Dumbo logic, but we have now abstracted it away behind a proper backend interface, which will hopefully make it easier to add more backends in the future.</p>
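<p>To give a rough idea of what such a backend interface could look like, here is a minimal sketch. It is not Dumbo&#8217;s actual code: the class names, the <i>matches()</i> and <i>run()</i> methods, the <i>get_backend()</i> helper and the convention that a <i>hadoop</i> option selects the Streaming backend are all assumptions made up for illustration.</p>

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface (illustrative, not Dumbo's real class)."""

    @abstractmethod
    def matches(self, opts):
        """Return True if this backend should handle the given options."""

    @abstractmethod
    def run(self, prog, opts):
        """Run the program and return a short description of what happened."""


class UnixBackend(Backend):
    def matches(self, opts):
        # Assumed convention: no "hadoop" option means a local run.
        return "hadoop" not in opts

    def run(self, prog, opts):
        return "local run of " + prog


class StreamingBackend(Backend):
    def matches(self, opts):
        return "hadoop" in opts

    def run(self, prog, opts):
        return "streaming run of " + prog + " on " + opts["hadoop"]


# Adding a new backend would only mean appending another instance here.
BACKENDS = [StreamingBackend(), UnixBackend()]


def get_backend(opts):
    """Pick the first registered backend that accepts the options."""
    for backend in BACKENDS:
        if backend.matches(opts):
            return backend
    raise ValueError("no backend matches the given options")
```

<p>With a structure along these lines, the core logic only ever calls <i>get_backend(opts).run(prog, opts)</i> and never needs to know which backend it is talking to.</p>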
<p>Personally, I would very much like Dumbo to get a backend for <a href="http://svn.apache.org/viewvc/avro/trunk/lang/java/src/java/org/apache/avro/mapred/tether/">Avro Tether</a> at some point. The two main starting points for making this happen would probably be <a href="http://github.com/klbostee/dumbo/commit/535ae797ba53b86dad4bffa51d1838e9a1c04018">my main refactoring commit</a> and the <a href="http://svn.apache.org/viewvc/avro/trunk/lang/java/src/test/java/org/apache/avro/mapred/tether">Java implementation of a Tether client in the Avro unit tests</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2010/08/12/dumbo-backends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Moving to Hadoop 0.20</title>
		<link>http://dumbotics.com/2009/11/23/moving-to-hadoop-0-20/</link>
		<comments>http://dumbotics.com/2009/11/23/moving-to-hadoop-0-20/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 09:26:29 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop 0.20]]></category>
		<category><![CDATA[hadoop-gpl-compression]]></category>
		<category><![CDATA[hadoop-lzo]]></category>
		<category><![CDATA[mapreduce-764]]></category>
		<category><![CDATA[mapreduce-967]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1233</guid>
		<description><![CDATA[We&#8217;ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process: We&#8217;re probably going to base our 0.20 build on Cloudera&#8217;s 0.20 distribution, and I found out the hard way that Dumbo doesn&#8217;t work on [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve finally started looking into moving from Hadoop 0.18 to 0.20 at <a href="http://last.fm">Last.fm</a>, and I thought it might be useful to share a few Dumbo-related things I learned in the process:</p>
<ul>
<li>We&#8217;re probably going to base our 0.20 build on <a href="http://cloudera.com">Cloudera</a>&#8217;s <a href="http://archive.cloudera.com/cdh/testing/">0.20 distribution</a>, and I found out the hard way that Dumbo doesn&#8217;t work on version 0.20.1+133 of this distribution because it includes a patch for <a href="http://issues.apache.org/jira/browse/MAPREDUCE-967">MAPREDUCE-967</a> that <a href="http://issues.apache.org/jira/browse/MAPREDUCE-967?focusedCommentId=12770121&amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12770121">breaks</a> some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed it in 0.20.1+152 by reverting this patch, but if you&#8217;re still trying to get Dumbo to work on Cloudera&#8217;s 0.20.1+133 distribution for some reason then you can expect to get NullPointerExceptions and errors such as &#8220;module wordcount not found&#8221; in your tasks&#8217; stderr logs.</li>
<li>Also, the Cloudera guys apparently haven&#8217;t added the patch for <a href="http://issues.apache.org/jira/browse/MAPREDUCE-764">MAPREDUCE-764</a> to their distribution yet, so you&#8217;ll still have to apply this patch yourself if you want to avoid <a href="http://dumbotics.com/2009/07/15/mapreduce-764/">strange encoding problems</a> in certain corner cases. This patch was reviewed and accepted for Hadoop 0.21 quite a while ago though, so maybe we can be hopeful about it getting included in Cloudera&#8217;s 0.20 distribution soon.</li>
<li>The <a href="http://twitter.com">Twitter</a> guys put together a pretty awesome <a href="http://github.com/kevinweil/hadoop-lzo">patched and backported version</a> of <a href="http://code.google.com/p/hadoop-gpl-compression/">hadoop-gpl-compression</a> for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you&#8217;re interested in this stuff, you might want to have a look at <a href="http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/">this</a> guest post from <a href="http://twitter.com/kevinWeil">Kevin</a> and <a href="http://twitter.com/emaland">Eric</a> on the Cloudera blog.</li>
</ul>
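<p>If you want to guard a deployment script against the broken build mentioned in the first item, a small check along the following lines might help. It is only a sketch: the function names are made up, and it assumes that builds later than 0.20.1+152 keep the revert. Anything it cannot classify from the two build numbers above is reported as unknown.</p>

```python
def parse_cdh_version(version):
    """Split a string such as "0.20.1+133" into ((0, 20, 1), 133)."""
    base, _, patch = version.partition("+")
    numbers = tuple(int(part) for part in base.split("."))
    return numbers, (int(patch) if patch else None)


def mapreduce_967_status(version):
    """Classify a Cloudera 0.20.1 build as "broken", "fixed" or "unknown"."""
    numbers, patch = parse_cdh_version(version)
    if numbers != (0, 20, 1) or patch is None:
        return "unknown"
    if patch == 133:
        # The build that ships the MAPREDUCE-967 patch.
        return "broken"
    if patch >= 152:
        # Assumption: builds after the revert stay fixed.
        return "fixed"
    return "unknown"
```

<p>Calling <i>mapreduce_967_status()</i> with the version string of your distribution before submitting a job is then enough to fail early with a clear message.</p>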
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/11/23/moving-to-hadoop-0-20/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>MAPREDUCE-764</title>
		<link>http://dumbotics.com/2009/07/15/mapreduce-764/</link>
		<comments>http://dumbotics.com/2009/07/15/mapreduce-764/#comments</comments>
		<pubDate>Wed, 15 Jul 2009 10:09:20 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[dumbo-user]]></category>
		<category><![CDATA[mailing lists]]></category>
		<category><![CDATA[mapreduce-764]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1189</guid>
		<description><![CDATA[Unfortunately, the list of Hadoop patches required for making Dumbo work properly just expanded a bit, since I traced down a strange encoding bug to an issue in Streaming&#8217;s typed bytes code. Hence, you might want to apply the MAPREDUCE-764 patch to your Hadoop build if you use Dumbo, even though the bug only leads [...]]]></description>
			<content:encoded><![CDATA[<p>Unfortunately, the <a href="http://wiki.github.com/klbostee/dumbo/building-and-installing">list of Hadoop patches</a> required for making Dumbo work properly just expanded a bit, since I traced down a <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/54">strange encoding bug</a> to an <a href="http://issues.apache.org/jira/browse/MAPREDUCE-764">issue</a> in Streaming&#8217;s typed bytes code. Hence, you might want to apply the <a href="https://issues.apache.org/jira/browse/MAPREDUCE-764">MAPREDUCE-764</a> patch to your Hadoop build if you use Dumbo, even though the bug only leads to problems in very specific cases and usually isn&#8217;t hard to work around. Hopefully this patch will make it into Hadoop 0.21.</p>
<p>This isn&#8217;t all bad news, however. The encoding bug was initially <a href="http://groups.google.com/group/dumbo-user/browse_thread/thread/535e87a015a3ff44">reported</a> on the <a href="http://groups.google.com/group/dumbo-user">dumbo-user mailing list</a>, which apparently has 12 subscribers already and is starting to attract fairly regular traffic. To be honest, I haven&#8217;t promoted this mailing list much so far and never really expected that people would actually start using it, but obviously I was wrong. Everyone who reads this blog should consider subscribing; I&#8217;m sure you won&#8217;t regret it!</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/07/15/mapreduce-764/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Talks mentioning Dumbo</title>
		<link>http://dumbotics.com/2009/04/28/talks-mentioning-dumbo/</link>
		<comments>http://dumbotics.com/2009/04/28/talks-mentioning-dumbo/#comments</comments>
		<pubDate>Tue, 28 Apr 2009 21:52:00 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[dumbo]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[slides]]></category>
		<category><![CDATA[talks]]></category>
		<category><![CDATA[videos]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=796</guid>
		<description><![CDATA[Presumably, most of you have seen the slides from my lightning talk about Dumbo at the first HUGUK already, since they&#8217;ve been featured fairly prominently on the wiki for quite a while now. However, if you&#8217;re eager to find out more about Hadoop in general, how Dumbo relates to it exactly, and why and in [...]]]></description>
			<content:encoded><![CDATA[<p>Presumably, most of you have seen the <a href="http://skillsmatter.com/custom/presentations/dumbo_hadoop_streaming_made_elegant_and_easy-klaas_bosteels.pdf">slides</a> from my lightning talk about Dumbo at the first <a href="http://huguk.org">HUGUK</a> already, since they&#8217;ve been featured fairly prominently on the <a href="http://wiki.github.com/klbostee/dumbo">wiki</a> for quite a while now. However, if you&#8217;re eager to find out more about Hadoop in general, how Dumbo relates to it exactly, and why and in what ways Dumbo is currently being used at <a href="http://last.fm">Last.fm</a>, you might also want to have a look at the following talks:</p>
<ul>
<li><i>&#8220;Hadoop at Yahoo!&#8221;</i> by <i>Owen O&#8217;Malley</i> [<a href="http://www.slideshare.net/acarlos1000/hadoop-basics-presentation">slides</a>]</li>
<li><i>&#8220;Hadoop Ecosystem Tour&#8221;</i> by <i>Aaron Kimball</i> [<a href="http://www.cloudera.com/sites/default/files/3-HadoopEcosystem.pdf">slides</a>, <a href="http://www.cloudera.com/hadoop-training-ecosystem-tour">video</a>]</li>
<li><i>&#8220;Practical MapReduce&#8221;</i> by <i>Tom White</i> [<a href="http://static.last.fm/johan/huguk-20090414/tom_white-practical_map_reduce.pdf">slides</a>, <a href="http://skillsmatter.com/podcast/cloud-grid/practical-mapreduce">video</a>]</li>
<li><i>&#8220;Lots of Data, Little Money&#8221;</i> by <i>Martin Dittus</i> [<a href="http://playground.audioscrobbler.com/martind/talks/2009-04-23%20big%20data%20little%20money.pdf">slides</a>, <a href="http://skillsmatter.com/podcast/erlang/what-to-do-when-the-data-you-have-to-analyse-keeps-growing-but-your-budget-doesnt-692">video</a>]</li>
</ul>
<p>If you&#8217;ve still not had enough after going through all these slides and videos, you could also have a peek at the <a href="http://static.last.fm/johan/huguk-20090414/klaas-hadoop-1722.pdf">slides</a> from my <a href="http://huguk.org/2009/04/huguk-2-wrap-up.html">HUGUK #2</a> lightning talk, in which I briefly explained why we&#8217;ve recently been putting some effort into making Dumbo programs run faster.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/04/28/talks-mentioning-dumbo/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Fast Python module for typed bytes</title>
		<link>http://dumbotics.com/2009/04/13/fast-python-module-for-typed-bytes/</link>
		<comments>http://dumbotics.com/2009/04/13/fast-python-module-for-typed-bytes/#comments</comments>
		<pubDate>Mon, 13 Apr 2009 16:42:37 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[ctypedbytes]]></category>
		<category><![CDATA[typed bytes]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=710</guid>
		<description><![CDATA[Over the past few days, I spent some time implementing a typed bytes Python module in C. It&#8217;s probably not quite ready for production use yet, and it still falls back to the pure Python module for floats, but it seems to work fine and already leads to substantial speedups. For example, the Python program [...]]]></description>
			<content:encoded><![CDATA[<p>Over the past few days, I spent some time implementing a <a href="http://github.com/klbostee/ctypedbytes">typed bytes Python module in C</a>.  It&#8217;s probably not quite ready for production use yet, and it still falls back to the <a href="http://github.com/klbostee/typedbytes">pure Python module</a> for floats, but it seems to work fine and already leads to substantial speedups.</p>
<p>For example, the Python program</p>
<blockquote><pre>
from typedbytes import Output
Output(open("test.tb", "wb")).writes(xrange(10**7))
</pre>
</blockquote>
<p>needs <i>18.8</i> secs to finish on this laptop, whereas it requires only <i>0.9</i> secs after replacing <i>typedbytes</i> with <i>ctypedbytes</i>. Similarly, the running time for</p>
<blockquote><pre>
from typedbytes import Input
for item in Input(open("test.tb", "rb")).reads(): pass
</pre>
</blockquote>
<p>can be reduced from <i>22.9</i> to merely <i>1.7</i> secs by using <i>ctypedbytes</i> instead of <i>typedbytes</i>.</p>
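<p>Since switching between the two modules only requires changing the import, one way to prefer the C module whenever it happens to be installed is a small import fallback. The helper names below are made up for this sketch, and it assumes that <i>ctypedbytes</i> exposes the same names as <i>typedbytes</i>, as in the test programs above.</p>

```python
import importlib


def import_first(*names):
    """Return the first module in names that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: " + ", ".join(names))


def load_typedbytes():
    # Prefer the C implementation and fall back to the pure Python one.
    return import_first("ctypedbytes", "typedbytes")
```

<p>A program would then do <i>typedbytes = load_typedbytes()</i> once and use <i>typedbytes.Input</i> and <i>typedbytes.Output</i> as before.</p>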
<p>Obviously, Dumbo programs can benefit from this faster typed bytes module as well, but the gains probably won&#8217;t be as spectacular as for the simple test programs above. To give it a go, make sure you&#8217;re using the <a href="http://github.com/klbostee/dumbo/tree/release-0.21.5">latest version</a> of Dumbo, <a href="http://peak.telecommunity.com/DevCenter/PythonEggs#building-eggs">build an egg</a> for the <i>ctypedbytes</i> module, and add the following option to your start command:</p>
<blockquote><pre>
-libegg <i>&lt;path to ctypedbytes egg&gt;</i>
</pre>
</blockquote>
<p>From what I&#8217;ve seen so far, this can speed up Dumbo programs by 30%, which definitely makes it worth the effort if you ask me. In fact, the Dumbo program would now probably beat the Java program in the benchmark discussed <a href="http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/">here</a>, but, unfortunately, this wouldn&#8217;t be a very fair comparison. <a href="http://blog.oskarsson.nu">Johan</a> recently made me aware of the fact that it&#8217;s better to avoid Java&#8217;s <i>split()</i> method for strings when you don&#8217;t need regular expression support, and using a combination of <i>substring()</i> and <i>indexOf()</i> instead seems to make the Java program about 40% faster. So we&#8217;re not quite as fast as Java yet, but at least the gap got narrowed down some more.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/04/13/fast-python-module-for-typed-bytes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>HADOOP-5450</title>
		<link>http://dumbotics.com/2009/04/05/hadoop-5450/</link>
		<comments>http://dumbotics.com/2009/04/05/hadoop-5450/#comments</comments>
		<pubDate>Sun, 05 Apr 2009 15:58:43 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[commit fests]]></category>
		<category><![CDATA[hadoop-5450]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=688</guid>
		<description><![CDATA[It looks like my complaining might&#8217;ve paid off, since HADOOP-5450 got committed on Friday, which has the fortunate consequence that Hadoop 0.21 won&#8217;t require any patching to make Dumbo work. Although having to apply a few patches is far from the end of the world, it might still be a show-stopper for some people, and [...]]]></description>
			<content:encoded><![CDATA[<p>It looks like my <a href="http://dumbotics.com/2009/04/01/hadoop-5528/">complaining</a> might&#8217;ve paid off, since <a href="http://issues.apache.org/jira/browse/HADOOP-5450">HADOOP-5450</a> got committed on Friday, which has the fortunate consequence that Hadoop 0.21 won&#8217;t require any patching to make Dumbo work. Although having to apply a few patches is far from the end of the world, it might still be a show-stopper for some people, and using Dumbo on Cloudera&#8217;s <a href="http://www.cloudera.com/blog/2009/03/15/cloudera-distribution-for-hadoop/">distribution</a> or Amazon&#8217;s <a href="http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop">Elastic MapReduce</a> might only become feasible when Hadoop supports it &#8220;out of the box&#8221;. </p>
<p>I didn&#8217;t mean to suggest that Hadoop is a badly-organized open source project or anything like that, by the way. On the contrary, it&#8217;s far better organized than many of the other projects I&#8217;m familiar with. The only message I wanted to get across is that it would make sense to look for ways to get patches reviewed and committed more quickly. I heard some rumours about organizing commit fests, for instance, which sounds like a great potential solution to me.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/04/05/hadoop-5450/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>HADOOP-5528</title>
		<link>http://dumbotics.com/2009/04/01/hadoop-5528/</link>
		<comments>http://dumbotics.com/2009/04/01/hadoop-5528/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 11:23:15 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[hadoop-5528]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=659</guid>
		<description><![CDATA[HADOOP-5528 got committed yesterday. From Hadoop 0.21 onwards, join keys will work &#8220;out of the box&#8221;, without requiring any patching. Since the patch evolved somewhat before it got committed, it won&#8217;t work anymore with Dumbo 0.21.3 though. Therefore, I released Dumbo 0.21.4 this morning, for which the list of changes includes fixing the incompatibility with [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://issues.apache.org/jira/browse/HADOOP-5528">HADOOP-5528</a> got committed yesterday. From Hadoop 0.21 onwards, <a href="http://dumbotics.com/2009/03/20/join-keys/">join keys</a> will work &#8220;out of the box&#8221;, without requiring any patching. Since the patch evolved somewhat before it got committed, it won&#8217;t work anymore with <a href="http://github.com/klbostee/dumbo/tree/release-0.21.3">Dumbo 0.21.3</a> though. Therefore, I released <a href="http://github.com/klbostee/dumbo/tree/release-0.21.4">Dumbo 0.21.4</a> this morning, for which the <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/custom_report/2082">list of changes</a> includes <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/11-adapt-for-HADOOP-5528-changes">fixing the incompatibility with the final HADOOP-5528 patch</a>.</p>
<p>So far, my luck with getting Hadoop patches reviewed and committed has varied quite a bit. From my limited personal experience, it seems that it&#8217;s more difficult to get a committer to look at <a href="http://issues.apache.org/jira/browse/HADOOP-5252">a bugfix</a> or <a href="http://issues.apache.org/jira/browse/HADOOP-5450">an important enhancement</a> than at a new feature, even though such contributions can actually be considered more important than new features. It is of course possible that these particular issues just happened to get overlooked somehow, or maybe there&#8217;s a procedure for attracting the committers&#8217; attention that I&#8217;m not aware of, but nevertheless I&#8217;m still under the impression that Hadoop&#8217;s patch handling is currently not as smooth and efficient as it could be. The fact that, as of this writing, no fewer than <a href="http://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;pid=12310240&amp;status=10002">47 issues are in the &#8220;Patch available&#8221; state</a> seems to confirm this impression.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/04/01/hadoop-5528/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Mapper and reducer interfaces</title>
		<link>http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/</link>
		<comments>http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 18:51:39 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Explanations]]></category>
		<category><![CDATA[interfaces]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=613</guid>
		<description><![CDATA[In Dumbo 0.21.3, an alternative interface for mappers and reducers got added. Using this interface, the &#8220;wordcount&#8221; example def mapper(key, value): for word in value.split(): yield word, 1 def reducer(key, values): yield key, sum(values) if __name__ == "__main__": from dumbo import run run(mapper, reducer, combiner=reducer) can be written as follows: def mapper(data): for key, value [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=dumbotics.com&#038;blog=6701349&#038;post=613&#038;subd=dumbotics&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://github.com/klbostee/dumbo/tree/release-0.21.3">Dumbo 0.21.3</a>, an <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/7-alternative-interface-to-mappers-reducers">alternative interface for mappers and reducers</a> got added. Using this interface, the &#8220;wordcount&#8221; example</p>
<blockquote><pre>
def mapper(key, value):
    for word in value.split(): 
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(mapper, reducer, combiner=reducer)
</pre>
</blockquote>
<p>can be written as follows:</p>
<blockquote><pre>
def mapper(data):
    for key, value in data:
        for word in value.split(): 
            yield word, 1

def reducer(data):
    for key, values in data:
        yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(mapper, reducer, combiner=reducer)
</pre>
</blockquote>
<p>Dumbo automatically detects which interface a function uses and calls it appropriately. In theory, the alternative version is faster since it involves fewer function calls, but the real reason the new interface got added is that it is more low-level and can make <a href="https://dumbo.assembla.com/spaces/dumbo/tickets/7-alternative-interface-to-mappers-reducers">integration with existing Python code</a> easier in some cases.</p>
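<p>To give an idea of how such detection can work, here is a minimal sketch that simply counts the parameters of a function using Python 3&#8217;s <i>inspect</i> module. It is not Dumbo&#8217;s actual implementation, it only handles plain functions, and the names <i>uses_alternative_interface</i> and <i>call_mapper</i> are made up for this illustration:</p>
<blockquote><pre>
import inspect

def uses_alternative_interface(func):
    # Assumption: one parameter means "gets the whole iterator",
    # two parameters mean "gets one key and one value per call".
    return len(inspect.signature(func).parameters) == 1

def call_mapper(mapper, records):
    # records is an iterable of (key, value) pairs.
    if uses_alternative_interface(mapper):
        # Alternative interface: a single call that consumes everything.
        return mapper(iter(records))
    # Original interface: one call per record, outputs chained together.
    return (pair for key, value in records for pair in mapper(key, value))
</pre>
</blockquote>
<p>With a helper like this, both versions of the wordcount mapper shown above yield the same pairs for the same records.</p>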
<p>Just like the original interface, the alternative one also works for <a href="http://dumbotics.com/2009/02/26/mapper-and-reducer-classes/">mapper and reducer classes</a>. Adapting the first example above such that a class is used for both the mapper and reducer results in:</p>
<blockquote><pre>
class Mapper:
    def __call__(self, key, value):
        for word in value.split(): 
            yield word, 1

class Reducer:
    def __call__(self, key, values):
        yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(Mapper, Reducer, combiner=Reducer)
</pre>
</blockquote>
<p>Applying the same transformation to the version using the alternative interface leads to:</p>
<blockquote><pre>
class Mapper:
    def __call__(self, data):
        for key, value in data:
            for word in value.split(): 
                yield word, 1

class Reducer:
    def __call__(self, data):
        for key, values in data:
            yield key, sum(values)

if __name__ == "__main__":
    from dumbo import run
    run(Mapper, Reducer, combiner=Reducer)
</pre>
</blockquote>
<p>Since mapper and reducer functions that use the alternative interface are called only once, you don&#8217;t need classes to add initialization or cleanup logic when using this interface, but they can still be useful if you want to access <a href="http://dumbotics.com/2009/02/26/mapper-and-reducer-classes/">the fields that Dumbo automatically adds to them</a>.</p>
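<p>As an illustration, the sketch below keeps a dictionary of partial counts in the wordcount mapper: the code before the loop acts as initialization and the code after it as cleanup. This is only an example of the pattern, and it assumes that the distinct words seen by a single mapper fit in memory:</p>
<blockquote><pre>
def mapper(data):
    # Initialization: executed once, before the first record is consumed.
    counts = {}
    for key, value in data:
        for word in value.split():
            counts[word] = counts.get(word, 0) + 1
    # Cleanup: executed once, after the last record has been consumed.
    for word, count in counts.items():
        yield word, count

def reducer(data):
    for key, values in data:
        yield key, sum(values)
</pre>
</blockquote>
<p>These functions can be passed to <i>run</i> in the same way as in the examples above.</p>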
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/03/31/mapper-and-reducer-interfaces/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Join keys</title>
		<link>http://dumbotics.com/2009/03/20/join-keys/</link>
		<comments>http://dumbotics.com/2009/03/20/join-keys/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 13:56:49 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Explanations]]></category>
		<category><![CDATA[hadoop-5528]]></category>
		<category><![CDATA[join keys]]></category>
		<category><![CDATA[joining]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=581</guid>
		<description><![CDATA[Earlier today, I released Dumbo 0.21.3, which adds support for so called &#8220;join keys&#8221; (amongst other things). Here&#8217;s an example of a Dumbo program that uses such keys: def mapper(key, value): key.isprimary = "hostnames" in key.body[0] key.body, value = value.split("\t", 1) yield key, value class Reducer: def __init__(self): self.key = None def __call__(self, key, values): [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=dumbotics.com&#038;blog=6701349&#038;post=581&#038;subd=dumbotics&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Earlier today, I released <a href="http://github.com/klbostee/dumbo/tree/release-0.21.3">Dumbo 0.21.3</a>, which adds support for so-called &#8220;<a href="https://dumbo.assembla.com/spaces/dumbo/tickets/6-Join-keys">join keys</a>&#8221; (amongst <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/custom_report/2081">other things</a>). Here&#8217;s an example of a Dumbo program that uses such keys:</p>
<blockquote><pre>
def mapper(key, value):
    key.isprimary = "hostnames" in key.body[0]
    key.body, value = value.split("\t", 1)
    yield key, value
    
class Reducer:
    def __init__(self):
        self.key = None
    def __call__(self, key, values):
        if key.isprimary:
            self.key = key.body
            self.hostname = next(values)
        elif self.key == key.body:
            key.body = self.hostname
            for value in values:
                yield key, value
    
def runner(job):
    job.additer(mapper, Reducer)
    
def starter(prog):
    prog.addopt("addpath", "yes")
    prog.addopt("joinkeys", "yes")

if __name__ == "__main__":
    from dumbo import main
    main(runner, starter)
</pre>
</blockquote>
<p>When you put this code in <i>join.py</i>, you can join the files <a href="http://users.ugent.be/~klbostee/files/hostnames.txt">hostnames.txt</a> and <a href="http://users.ugent.be/~klbostee/files/logs.txt">logs.txt</a> as follows:</p>
<blockquote><pre>
$ wget http://users.ugent.be/~klbostee/files/hostnames.txt
$ wget http://users.ugent.be/~klbostee/files/logs.txt
$ dumbo join.py -input hostnames.txt -input logs.txt \
-output joined.code
$ dumbo cat joined.code &gt; joined.txt
</pre>
</blockquote>
<p>In order to make join keys work for non-local runs, however, you need to apply the patch from <a href="https://issues.apache.org/jira/browse/HADOOP-5528">HADOOP-5528</a>, which requires Hadoop 0.20 or higher. More precisely, Dumbo relies on the <i>BinaryPartitioner</i> from <a href="https://issues.apache.org/jira/browse/HADOOP-5528">HADOOP-5528</a> to make sure that:</p>
<ol>
<li>All keys that differ only in the <i>.isprimary</i> attribute are passed to the same reducer.</li>
<li>The primary keys are always reduced before the non-primary ones.</li>
</ol>
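<p>The effect of these two guarantees can be mimicked in plain Python. The sketch below is only a local simulation and not how Hadoop or the <i>BinaryPartitioner</i> is implemented; <i>JoinKey</i>, <i>partition</i>, <i>sortkey</i> and <i>shuffle</i> are hypothetical names introduced for the example:</p>
<blockquote><pre>
from collections import namedtuple

# Hypothetical stand-in for a join key; not Dumbo's real key class.
JoinKey = namedtuple("JoinKey", ["body", "isprimary"])

def partition(key, numreducers):
    # Ignore .isprimary, so keys that differ only in it share a reducer.
    return hash(key.body) % numreducers

def sortkey(key):
    # Order by body and put the primary key first (False sorts before True).
    return (key.body, not key.isprimary)

def shuffle(pairs, numreducers):
    # Group (key, value) pairs per reducer, in the order they get reduced.
    buckets = [[] for _ in range(numreducers)]
    for key, value in pairs:
        buckets[partition(key, numreducers)].append((key, value))
    return [sorted(b, key=lambda pair: sortkey(pair[0])) for b in buckets]
</pre>
</blockquote>
<p>Because the partitioning only looks at <i>.body</i> and the sorting puts the primary key first, a reducer like the one above always sees the hostname before the matching log lines.</p>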
<p>If you want to find out how this works exactly, you might want to watch Cloudera&#8217;s <a href="http://vimeo.com/3591404">&#8220;MapReduce Algorithms&#8221; lecture</a>, since joining by means of a custom partitioner is one of the common idioms discussed in this lecture.</p>
<p><em>UPDATE: <a href="http://github.com/klbostee/dumbo/tree/release-0.21.3">Dumbo 0.21.3</a> is not compatible with the evolved version of the patch for <a href="http://dumbotics.com/2009/04/01/hadoop-5528/">HADOOP-5528</a>. To make things work with the final patch, you need to upgrade to <a href="http://github.com/klbostee/dumbo/tree/release-0.21.4">Dumbo 0.21.4</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/03/20/join-keys/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>How to contribute</title>
		<link>http://dumbotics.com/2009/03/12/how-to-contribute/</link>
		<comments>http://dumbotics.com/2009/03/12/how-to-contribute/#comments</comments>
		<pubDate>Thu, 12 Mar 2009 12:00:47 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[assembla]]></category>
		<category><![CDATA[dumbo]]></category>
		<category><![CDATA[github]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=529</guid>
		<description><![CDATA[As part of the attempt to get more organized, this post outlines how to contribute to Dumbo. I might turn this into a wiki page eventually, but a blog post will probably get more attention, and people who are not planning to contribute to Dumbo any time soon might still be interested in the process [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=dumbotics.com&#038;blog=6701349&#038;post=529&#038;subd=dumbotics&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>As part of the attempt to <a href="http://dumbotics.com/2009/03/11/getting-organized/">get more organized</a>, this post outlines how to contribute to <a href="http://wiki.github.com/klbostee/dumbo">Dumbo</a>. I might turn this into a wiki page eventually, but a blog post will probably get more attention, and people who are not planning to contribute to Dumbo any time soon might still be interested in the process described in the remainder of this post.</p>
<p>If you haven&#8217;t done so already, create a <a href="https://github.com/">GitHub</a> account and <a href="http://github.com/guides/fork-a-project-and-submit-your-modifications">fork</a> the <a href="http://github.com/klbostee/dumbo/tree/master">master tree</a>. Then go through the following steps:</p>
<ol>
<li>
Either create a new <a href="http://github.com/klbostee/dumbo/issues">ticket</a> for the changes you have in mind, or add a comment to the corresponding existing ticket to inform everyone that you started working on it.
</li>
<li>
Make the necessary changes (and add unit tests for them).
</li>
<li>
Run all unit tests:
<blockquote><pre>$ python setup.py test</pre>
</blockquote>
</li>
<li>
If none of the tests fail, commit your changes using a <a href="http://github.com/blog/411-github-issue-tracker">commit message that GitHub&#8217;s issue tracker understands</a>:
<blockquote><pre>$ git commit -a -m "Closes GH-&lt;ticket number&gt;"</pre>
</blockquote>
</li>
<li>
Push your commit to GitHub:
<blockquote><pre>$ git push</pre>
</blockquote>
</li>
<li>
Send a <a href="http://github.com/guides/pull-requests">pull request</a> to <a href="http://github.com/klbostee">me</a>.
</li>
</ol>
<p>If you want to be able to easily get back to it later, you can also create a separate branch for the ticket and push this branch to GitHub instead. This might, for instance, save you some time when I spot a bug in your changes, and ask you to send another pull request after fixing this bug.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/03/12/how-to-contribute/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
	</channel>
</rss>
