<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Dumbotics &#187; Tips and tricks</title>
	<atom:link href="http://dumbotics.com/category/tips-and-tricks/feed/" rel="self" type="application/rss+xml" />
	<link>http://dumbotics.com</link>
	<description>Pseudo-random ramblings about Dumbo and Hadoop</description>
	<lastBuildDate>Sun, 05 Feb 2012 13:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='dumbotics.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Dumbotics &#187; Tips and tricks</title>
		<link>http://dumbotics.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://dumbotics.com/osd.xml" title="Dumbotics" />
	<atom:link rel='hub' href='http://dumbotics.com/?pushpress=hub'/>
		<item>
		<title>Consuming Dumbo output with Pig</title>
		<link>http://dumbotics.com/2010/02/05/consuming-dumbo-output-with-pig/</link>
		<comments>http://dumbotics.com/2010/02/05/consuming-dumbo-output-with-pig/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 10:39:51 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[pigtail]]></category>
		<category><![CDATA[typed bytes]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1295</guid>
		<description><![CDATA[Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time. Personally, I think that Dumbo still occupies a useful space within the Hadoop ecosystem, but in some cases [...]]]></description>
			<content:encoded><![CDATA[<p>Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time. Personally, I think that Dumbo still occupies a useful space within the Hadoop ecosystem, but in some cases it makes sense to work at an even higher level and use something like <a href="http://hadoop.apache.org/pig">Pig</a> or <a href="http://hadoop.apache.org/hive">Hive</a>. In fact, sometimes it makes sense to combine the two and do some parts of your data flow in Dumbo and others in Pig. To make this possible, I recently wrote a Pig <a href="http://github.com/klbostee/pigtail/blob/master/src/main/java/fm/last/pigtail/storage/TypedBytesSequenceFileLoader.java">loader function for sequence files that contain <tt>TypedBytesWritable</tt>s</a>, which is the file format Dumbo uses by default to store all its output on Hadoop. Here&#8217;s an example of a Pig script that reads Dumbo output:</p>
<blockquote><pre>
register pigtail.jar;  -- http://github.com/klbostee/pigtail

a = load '/hdfs/path/to/dumbo/output'
    using fm.last.pigtail.storage.TypedBytesSequenceFileLoader()
    as (artist:int, val:(listeners:int, listens:int));
b = foreach a generate artist, val.listeners as listeners;
c = order b by listeners;
d = limit c 100;

dump d;
</pre>
</blockquote>
<p>You basically just have to specify names and types for the components of the key/value pairs and you&#8217;re good to go.</p>
<p>A useful side effect of this loader is that Pig can now read any file format that Dumbo can read. All you have to do is write a simple Dumbo script that converts the data to typed bytes sequence files:</p>
<blockquote><pre>
from dumbo import run
from dumbo.lib import identitymapper

if __name__ == "__main__":
    run(identitymapper)
</pre>
</blockquote>
<p>The proper solution is of course to write custom Pig loaders, but this gets the job done too and doesn&#8217;t slow things down that much.</p>
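<p>For intuition, an identity mapper passes every key/value pair through unchanged, so the format conversion presumably happens in how Dumbo reads the input and writes its output. The sketch below mimics that locally in plain Python; <code>identity_mapper</code> and <code>run_map</code> are illustrative stand-ins and not Dumbo&#8217;s own implementation.</p>
<blockquote><pre>
```python
def identity_mapper(key, value):
    # Emit each record exactly as it was received.
    yield key, value

def run_map(mapper, records):
    # Hypothetical local driver: apply the mapper to every (key, value) pair.
    for key, value in records:
        for pair in mapper(key, value):
            yield pair

# Made-up sample records, e.g. (byte offset, line) pairs from a text file.
records = [(0, "first line"), (11, "second line")]
converted = list(run_map(identity_mapper, records))
print(converted)
```
</pre>
</blockquote>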
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2010/02/05/consuming-dumbo-output-with-pig/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Reading Hadoop records in Python</title>
		<link>http://dumbotics.com/2009/12/23/reading-hadoop-records-in-python/</link>
		<comments>http://dumbotics.com/2009/12/23/reading-hadoop-records-in-python/#comments</comments>
		<pubDate>Wed, 23 Dec 2009 20:32:45 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[ctypedbytes]]></category>
		<category><![CDATA[hadoop records]]></category>
		<category><![CDATA[hadoop_record]]></category>
		<category><![CDATA[jute]]></category>
		<category><![CDATA[recordio]]></category>
		<category><![CDATA[typedbytes]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1272</guid>
		<description><![CDATA[At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows: Hadoop records &#160;&#160;&#160;&#160; &#8594; CsvRecordInput &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#8594; hadoop_record Python module Although it&#8217;s a nice and very systematic solution, I couldn&#8217;t resist blogging about an already existing alternative solution [...]]]></description>
			<content:encoded><![CDATA[<p>At the <a href="http://developer.yahoo.net/blogs/hadoop/2009/11/1118_hadoop_bay_area_user_grou.html">11/18 Bay Area HUG</a>, <a href="http://blog.paulisageek.com/">Paul Tarjan</a> apparently <a href="http://www.slideshare.net/hadoopusergroup/hadoop-record-reader-in-python-2635453?src=embed">presented</a> an approach for reading <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/record/package-summary.html">Hadoop records</a> in Python. In summary, his approach seems to work as follows:</p>
<blockquote><p>
Hadoop records<br />
&nbsp;&nbsp;&nbsp;&nbsp; &rarr; <a href="http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/record/CsvRecordInput.java?view=markup"><code>CsvRecordInput</code></a><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &rarr; <a href="http://github.com/ptarjan/hadoop_record"><code>hadoop_record</code> Python module</a>
</p></blockquote>
<p>Although it&#8217;s a nice and very systematic solution, I couldn&#8217;t resist blogging about an already existing alternative solution for this problem: </p>
<blockquote><p>
Hadoop records<br />
&nbsp;&nbsp;&nbsp;&nbsp; &rarr; <a href="http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/streaming/src/java/org/apache/hadoop/typedbytes/TypedBytesRecordInput.java?view=markup"><code>TypedBytesRecordInput</code></a><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &rarr; <a href="http://github.com/klbostee/typedbytes"><code>typedbytes</code> Python module</a>
</p></blockquote>
<p>Not only would this have saved Paul a lot of work, it probably also would&#8217;ve been more efficient, especially when using <a href="http://github.com/klbostee/ctypedbytes">ctypedbytes</a>, the speedy variant of the <a href="http://github.com/klbostee/typedbytes">typedbytes module</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/12/23/reading-hadoop-records-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Dumbo on Amazon EMR</title>
		<link>http://dumbotics.com/2009/12/23/dumbo-on-amazon-emr/</link>
		<comments>http://dumbotics.com/2009/12/23/dumbo-on-amazon-emr/#comments</comments>
		<pubDate>Wed, 23 Dec 2009 09:24:56 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[mapreduce-1293]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1253</guid>
		<description><![CDATA[A while ago, I received an email from Andrew in which he wrote: Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so: $ elastic-mapreduce --create --alive SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo [...]]]></description>
			<content:encoded><![CDATA[<p>A while ago, I received an email from <a href="http://andrewhitchcock.org/">Andrew</a> in which he wrote:</p>
<blockquote><p>
Now you should be able to run Dumbo jobs on <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a>. To start a cluster, you can use the Ruby client as so:</p>
<p><code>$ elastic-mapreduce --create --alive</code></p>
<p>SSH into the cluster using your <a href="http://aws.amazon.com/ec2/">EC2</a> keypair as user <code>hadoop</code> and install Dumbo with the following two commands:</p>
<p><code>$ wget -O ez_setup.py http://bit.ly/ezsetup</code><br />
<code>$ sudo python ez_setup.py dumbo</code></p>
<p>Then you can run your Dumbo scripts. I was able to run the <code>ipcount.py</code> demo with the following command.</p>
<p><code>$ dumbo start ipcount.py -hadoop /home/hadoop \<br />
-input s3://anhi-test-data/wordcount/input/ \<br />
-output s3://anhi-test-data/output/dumbo/wc/</code></p>
<p>The <code>-hadoop</code> option is important. At this point I haven&#8217;t created an automatic Dumbo install script, so you&#8217;ll have to install Dumbo by hand each time you launch the cluster. Fortunately installation is easy.
</p></blockquote>
<p>There was a <a href="http://groups.google.com/group/dumbo-user/msg/70c910f1250b1d63">minor hiccup</a> that required the Amazon guys to pull the AMI with Dumbo support, but it&#8217;s back now and they seem to be confident that Dumbo support is going to remain available from now on. They are also still planning to make things even easier by providing an automatic Dumbo installation script.</p>
<p>As an aside, it&#8217;s worth mentioning that a bug in Hadoop Streaming <a href="http://issues.apache.org/jira/browse/MAPREDUCE-1293">got fixed</a> in the process of adding Dumbo support to EMR. I can&#8217;t wait to see what else the Amazon guys have up their sleeves.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/12/23/dumbo-on-amazon-emr/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Moving to Hadoop 0.20</title>
		<link>http://dumbotics.com/2009/11/23/moving-to-hadoop-0-20/</link>
		<comments>http://dumbotics.com/2009/11/23/moving-to-hadoop-0-20/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 09:26:29 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Explanations]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop 0.20]]></category>
		<category><![CDATA[hadoop-gpl-compression]]></category>
		<category><![CDATA[hadoop-lzo]]></category>
		<category><![CDATA[mapreduce-764]]></category>
		<category><![CDATA[mapreduce-967]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1233</guid>
		<description><![CDATA[We&#8217;ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process: We&#8217;re probably going to base our 0.20 build on Cloudera&#8217;s 0.20 distribution, and I found out the hard way that Dumbo doesn&#8217;t work on [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve finally started looking into moving from Hadoop 0.18 to 0.20 at <a href="http://last.fm">Last.fm</a>, and I thought it might be useful to share a few Dumbo-related things I learned in the process:</p>
<ul>
<li>We&#8217;re probably going to base our 0.20 build on <a href="http://cloudera.com">Cloudera</a>&#8217;s <a href="http://archive.cloudera.com/cdh/testing/">0.20 distribution</a>, and I found out the hard way that Dumbo doesn&#8217;t work on version 0.20.1+133 of this distribution, because it includes a patch for <a href="http://issues.apache.org/jira/browse/MAPREDUCE-967">MAPREDUCE-967</a> that <a href="http://issues.apache.org/jira/browse/MAPREDUCE-967?focusedCommentId=12770121&amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12770121">breaks</a> some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed this in 0.20.1+152 by reverting the patch. If you&#8217;re still trying to get Dumbo to work on 0.20.1+133 for some reason, expect NullPointerExceptions and errors such as &#8220;module wordcount not found&#8221; in your tasks&#8217; stderr logs.</li>
<li>Also, the Cloudera guys apparently haven&#8217;t added the patch for <a href="http://issues.apache.org/jira/browse/MAPREDUCE-764">MAPREDUCE-764</a> to their distribution yet, so you&#8217;ll still have to apply it yourself if you want to avoid <a href="http://dumbotics.com/2009/07/15/mapreduce-764/">strange encoding problems</a> in certain corner cases. The patch was reviewed and accepted for Hadoop 0.21 quite a while ago though, so hopefully it will be included in Cloudera&#8217;s 0.20 distribution soon.</li>
<li>The <a href="http://twitter.com">Twitter</a> guys put together a pretty awesome <a href="http://github.com/kevinweil/hadoop-lzo">patched and backported version</a> of <a href="http://code.google.com/p/hadoop-gpl-compression/">hadoop-gpl-compression</a> for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you&#8217;re interested in this stuff, you might want to have a look at <a href="http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/">this</a> guest post from <a href="http://twitter.com/kevinWeil">Kevin</a> and <a href="http://twitter.com/emaland">Eric</a> on the Cloudera blog.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/11/23/moving-to-hadoop-0-20/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Dumbo over HBase</title>
		<link>http://dumbotics.com/2009/07/31/dumbo-over-hbase/</link>
		<comments>http://dumbotics.com/2009/07/31/dumbo-over-hbase/#comments</comments>
		<pubDate>Fri, 31 Jul 2009 13:46:37 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[inputformat]]></category>
		<category><![CDATA[outputformat]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1205</guid>
		<description><![CDATA[This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or producing HBase tables from Dumbo programs. Here&#8217;s a silly but illustrative example: from dumbo import opt, run @opt("inputformat", "fm.last.hbase.mapred.TypedBytesTableInputFormat") @opt("hadoopconf", "hbase.mapred.tablecolumns=testfamily:testqualifier") [...]]]></description>
			<content:encoded><![CDATA[<p>This should be <a href="http://groups.google.com/group/dumbo-user/browse_thread/thread/fb74b3be4600e85b">old news</a> for <a href="http://groups.google.com/group/dumbo-user">dumbo-user</a> subscribers, but <a href="http://nectarius.net/">Tim</a> has, <a href="http://dumbotics.com/2009/05/06/binarypartitioner-backported-to-018/">once again</a>, put his Java coding skills to <a href="http://twitter.com/roserpens/statuses/2891376424">good use</a>. This time around he created nifty <a href="http://github.com/tims/lasthbase/blob/81c3b3410f0609c7a899d462d27ce18597ccffea/src/java/fm/last/hbase/mapred/TypedBytesTableInputFormat.java">input</a> and <a href="http://github.com/tims/lasthbase/blob/81c3b3410f0609c7a899d462d27ce18597ccffea/src/java/fm/last/hbase/mapred/TypedBytesTableOutputFormat.java">output</a> formats for consuming and/or producing <a href="http://hadoop.apache.org/hbase/">HBase</a> tables from Dumbo programs. Here&#8217;s a silly but illustrative example: </p>
<blockquote><pre>
from dumbo import opt, run

@opt("inputformat", "fm.last.hbase.mapred.TypedBytesTableInputFormat")
@opt("hadoopconf", "hbase.mapred.tablecolumns=testfamily:testqualifier")
def mapper(key, columns):
    for family, column in columns.iteritems():
        for qualifier, value in column.iteritems():
            yield key, (family, qualifier, value)

@opt("outputformat", "fm.last.hbase.mapred.TypedBytesTableOutputFormat")
@opt("hadoopconf", "hbase.mapred.outputtable=output_table")
def reducer(key, values):
    columns = {}
    for family, qualifier, value in values:
        column = columns.setdefault(family, {})
        column[qualifier] = value
    yield key, columns

if __name__ == "__main__":
    run(mapper, reducer)
</pre>
</blockquote>
<p>Have a look at the <a href="http://github.com/tims/lasthbase/blob/81c3b3410f0609c7a899d462d27ce18597ccffea/README.txt">readme</a> for more information.</p>
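<p>To see what shape of data flows through the example above, here is a minimal local sketch in plain Python 3 that needs neither Dumbo nor HBase. It assumes that each row reaches the mapper as a nested dict of the form {family: {qualifier: value}}, as the example suggests. The sample row and the helper <code>roundtrip</code>, which stands in for the shuffle phase, are made up for illustration.</p>
<blockquote><pre>
```python
from itertools import groupby
from operator import itemgetter

def mapper(key, columns):
    # Flatten one row into (key, (family, qualifier, value)) pairs.
    for family, column in columns.items():
        for qualifier, value in column.items():
            yield key, (family, qualifier, value)

def reducer(key, values):
    # Regroup the flattened cells into {family: {qualifier: value}}.
    columns = {}
    for family, qualifier, value in values:
        columns.setdefault(family, {})[qualifier] = value
    yield key, columns

def roundtrip(rows):
    # Hypothetical stand-in for the shuffle: sort by key, then group by key.
    pairs = sorted((kv for k, c in rows for kv in mapper(k, c)),
                   key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        for out in reducer(key, (v for _, v in group)):
            yield out

# Made-up sample row with two cells in one column family.
rows = [("row1", {"testfamily": {"testqualifier": "a", "other": "b"}})]
result = list(roundtrip(rows))
print(result)
```
</pre>
</blockquote>
<p>Flattening and regrouping should give back the original rows, which is a quick way to check the mapper and reducer logic before running against a real table.</p>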
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/07/31/dumbo-over-hbase/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Integration with Java code</title>
		<link>http://dumbotics.com/2009/06/16/integration-with-java-code/</link>
		<comments>http://dumbotics.com/2009/06/16/integration-with-java-code/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 22:15:04 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[dumbo]]></category>
		<category><![CDATA[feathers]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1108</guid>
		<description><![CDATA[Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while. Flexibility and speed are probably the most likely potential reasons. Thanks to a recent enhancement, this is now easily achievable. Here&#8217;s a version of wordcount.py that uses the example mapper and reducer [...]]]></description>
			<content:encoded><![CDATA[<p>Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while, most likely for flexibility or speed. Thanks to a <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/46">recent enhancement</a>, this is now easily achievable. Here&#8217;s a version of <a href="http://github.com/klbostee/dumbo/blob/962900f7d2d006d537d4daf50f31c425e7ad77ae/examples/wordcount.py">wordcount.py</a> that uses the example mapper and reducer from the <a href="http://github.com/klbostee/feathers">feathers project</a> (and thus requires <i>-libjar feathers.jar</i>):</p>
<blockquote><pre>
import dumbo
dumbo.run("fm.last.feathers.map.Words",
          "fm.last.feathers.reduce.Sum",
          combiner="fm.last.feathers.reduce.Sum")
</pre>
</blockquote>
<p>You can mix Python and Java in any way you like, with one minor restriction: you cannot use a Python combiner when you specify a Java mapper. Things should still work in this case, but they&#8217;ll be slower since the combiner won&#8217;t actually run. In theory, this limitation could be avoided by relying on <a href="http://issues.apache.org/jira/browse/HADOOP-4842">HADOOP-4842</a>, but personally I don&#8217;t think it&#8217;s worth the trouble.</p>
<p>The source code for <a href="http://github.com/klbostee/feathers/blob/8c215323b0d7db6e5975a29396b9660b2d47e1dd/src/map/Words.java"><i>fm.last.feathers.map.Words</i></a> and <a href="http://github.com/klbostee/feathers/blob/8c215323b0d7db6e5975a29396b9660b2d47e1dd/src/reduce/Sum.java"><i>fm.last.feathers.reduce.Sum</i></a> is just as straightforward as the code for the <i>OutputFormat</i> classes discussed in my <a href="http://dumbotics.com/2009/06/08/multiple-outputs/">previous post</a>. All you have to keep in mind is that only the mapper input keys and values can be arbitrary writables. Every other key or value has to be a <a href="http://hudson.zones.apache.org/hudson/view/Hadoop/job/Hadoop-trunk/javadoc/org/apache/hadoop/typedbytes/TypedBytesWritable.html"><i>TypedBytesWritable</i></a>. Writing a custom Java partitioner for Dumbo programs is equally easy, by the way. The <a href="http://github.com/klbostee/feathers/blob/ae854f6b4f78fc42e8b3fbb8e216319cbdae1343/src/partition/Prefix.java"><i>fm.last.feathers.partition.Prefix</i></a> class is a simple example. It can be used by specifying <i>-partitioner fm.last.feathers.partition.Prefix</i>.</p>
<p>As you probably expected already, none of this will work for local runs on UNIX, but you can still test things locally fairly easily by running on Hadoop in <a href="http://www.cloudera.com/hadoop-deb#installing_hadoop__standalone_mode_">standalone mode</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/06/16/integration-with-java-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Multiple outputs</title>
		<link>http://dumbotics.com/2009/06/08/multiple-outputs/</link>
		<comments>http://dumbotics.com/2009/06/08/multiple-outputs/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 12:11:21 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[feathers]]></category>
		<category><![CDATA[getpath]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[outputformat]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1063</guid>
		<description><![CDATA[Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here&#8217;s an example: from dumbo import run, sumreducer, opt def mapper(key, value): for word in value.split(): yield word, 1 @opt("getpath", "yes") def reducer(key, values): yield (key[0].upper(), key), sum(values) if __name__ == "__main__": run(mapper, reducer, combiner=sumreducer) Running this splitwordcount.py program on my chrooted Cloudera-flavored [...]]]></description>
			<content:encoded><![CDATA[<p>Dumbo <a href="http://dumbo.assembla.com/spaces/dumbo/milestones/92665-0-21-20">0.21.20</a> adds support for multiple outputs by providing a <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/48"><i>-getpath</i> option</a>. Here&#8217;s an example:</p>
<blockquote><pre>
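from dumbo import run, sumreducer, opt

def mapper(key, value):
    # emit (word, 1) for every whitespace-separated token in the line
    for word in value.split():
        yield word, 1

# "getpath" makes Dumbo treat the first key component as the output path
@opt("getpath", "yes")
def reducer(key, values):
    # key[0].upper() names the subdirectory, key is the word itself
    yield (key[0].upper(), key), sum(values)

if __name__ == "__main__":
    run(mapper, reducer, combiner=sumreducer)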
</pre>
</blockquote>
<p>Running this <i>splitwordcount.py</i> program on my <a href="http://dumbotics.com/2009/05/31/dumbo-on-clouderas-distribution/">chrooted Cloudera-flavored Hadoop server</a> (after updating Dumbo and building <a href="http://github.com/klbostee/feathers"><i>feathers.jar</i></a>) gave me the following results:</p>
<blockquote><pre>
$ dumbo splitwordcount.py -input brian.txt -output brianwc \
-hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
[...]
$ dumbo ls brianwc -hadoop /usr/lib/hadoop/
Found 17 items
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/A
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/B
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/C
[...]
$ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
be      2
boy     1
Brian   6
became  2
</pre>
</blockquote>
<p>So each <i>((&lt;path&gt;, &lt;key&gt;), &lt;value&gt;)</i> pair got stored as <i>(&lt;key&gt;, &lt;value&gt;)</i> in <i>&lt;outputdir&gt;/&lt;path&gt;</i>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.</p>
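<p>To make the routing rule concrete, here is a tiny in-memory model of it in plain Python. It is only a sketch of the behavior described above, not the way the <i>feathers</i> output formats are implemented; the helper name <i>split_outputs</i> and the count for the word &#8220;a&#8221; are made up for illustration.</p>

```python
from collections import defaultdict

def split_outputs(pairs):
    """Group ((path, key), value) pairs into {path: [(key, value), ...]}."""
    outputs = defaultdict(list)
    for (path, key), value in pairs:
        outputs[path].append((key, value))
    return dict(outputs)

# Reducer output in the shape produced by splitwordcount.py.
reducer_output = [
    (("B", "be"), 2),
    (("B", "boy"), 1),
    (("B", "Brian"), 6),
    (("A", "a"), 3),  # made-up count, for illustration only
]

outputs = split_outputs(reducer_output)
for path in sorted(outputs):
    print(path, outputs[path])
```

<p>Each dictionary key plays the role of a subdirectory of the output directory, and the path component is dropped from the stored pairs.</p>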
<p>Under the hood, <i>-getpath yes</i> basically just makes sure that <i>-outputformat sequencefile</i> (which is the default when running on Hadoop) and <i>-outputformat text</i> get translated to <i>-outputformat <a href="http://github.com/klbostee/feathers/blob/e703c9c0948232a0f483497b4358e7417990ded8/src/output/MultipleSequenceFiles.java">fm.last.feathers.output.MultipleSequenceFiles</a></i> and <i>-outputformat <a href="http://github.com/klbostee/feathers/blob/e703c9c0948232a0f483497b4358e7417990ded8/src/output/MultipleTextFiles.java">fm.last.feathers.output.MultipleTextFiles</a></i>, respectively. These <i>OutputFormat</i> implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new <a href="http://github.com/klbostee/feathers">feathers project</a> already provides a few other Java classes that can also easily be used by Dumbo programs, including a <a href="http://github.com/klbostee/feathers/blob/8c215323b0d7db6e5975a29396b9660b2d47e1dd/src/map/Words.java">mapper</a> and a <a href="http://github.com/klbostee/feathers/blob/e703c9c0948232a0f483497b4358e7417990ded8/src/reduce/Sum.java">reducer</a>. I&#8217;ll try to find some time to ramble a bit about those as well, but that&#8217;s for another post.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/06/08/multiple-outputs/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Dumbo on Cloudera&#8217;s distribution</title>
		<link>http://dumbotics.com/2009/05/31/dumbo-on-clouderas-distribution/</link>
		<comments>http://dumbotics.com/2009/05/31/dumbo-on-clouderas-distribution/#comments</comments>
		<pubDate>Sun, 31 May 2009 15:49:53 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[dumbo]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=1019</guid>
		<description><![CDATA[Over the last couple of days, I picked up some rumors concerning the inclusion of all patches on which Dumbo relies in the most recent version of Cloudera&#8217;s Hadoop distribution. Todd confirmed this to me yesterday, so the time was right to finally have a look at Cloudera&#8217;s nicely packaged and patched-up Hadoop. I started [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last couple of days, I picked up some rumors concerning the inclusion of all patches on which Dumbo relies in the most recent version of <a href="http://www.cloudera.com/hadoop">Cloudera&#8217;s Hadoop distribution</a>. <a href="http://twitter.com/tlipcon">Todd</a> confirmed this to me yesterday, so the time was right to finally have a look at Cloudera&#8217;s nicely packaged and patched-up Hadoop. </p>
<p>I started from a <a href="http://wiki.freaks-unidos.net/chrooted-debian-server">chrooted Debian server</a>, on which I installed the Cloudera distribution, Python 2.5, and Dumbo as follows:</p>
<blockquote><pre>
# cat /etc/apt/sources.list
deb http://ftp.be.debian.org/debian etch main contrib non-free
deb http://www.backports.org/debian etch-backports main contrib non-free
deb http://archive.cloudera.com/debian etch contrib
deb-src http://archive.cloudera.com/debian etch contrib
# wget -O - http://backports.org/debian/archive.key | apt-key add -
# wget -O - http://archive.cloudera.com/debian/archive.key | apt-key add -
# apt-get update
# apt-get install hadoop python2.5 python2.5-dev
# wget http://peak.telecommunity.com/dist/ez_setup.py
# python2.5 ez_setup.py dumbo
</pre>
</blockquote>
<p>Then, I created a user for myself and confirmed that the <a href="http://bit.ly/wordcountpy"><i>wordcount.py</i></a> program runs properly on Cloudera&#8217;s distribution in <a href="http://www.cloudera.com/hadoop-deb#installing_hadoop__standalone_mode_">standalone mode</a>:</p>
<blockquote><pre>
# adduser klaas
# su - klaas
$ wget http://bit.ly/wordcountpy http://bit.ly/briantxt
$ dumbo start wordcount.py -input brian.txt -output brianwc \
-python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian   6
</pre>
</blockquote>
<p>Unsurprisingly, it also worked perfectly in <a href="http://www.cloudera.com/hadoop-deb#installing_hadoop__pseudodistributed_mode_">pseudo-distributed mode</a>:</p>
<blockquote><pre>
$ exit
# apt-get install hadoop-conf-pseudo
# /etc/init.d/hadoop-namenode start
# /etc/init.d/hadoop-secondarynamenode start
# /etc/init.d/hadoop-datanode start
# /etc/init.d/hadoop-jobtracker start
# /etc/init.d/hadoop-tasktracker start
# su - klaas
$ dumbo start wordcount.py -input brian.txt -output brianwc \
-python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm brianwc/_logs -hadoop /usr/lib/hadoop/
Deleted hdfs://localhost/user/klaas/brianwc/_logs
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian   6
</pre>
</blockquote>
<p>Note that I removed the <i>_logs</i> directory first because <i>dumbo cat</i> would&#8217;ve complained about it otherwise. You can avoid this minor annoyance by <a href="http://hadoop.apache.org/core/docs/r0.19.1/cluster_setup.html#Logging">disabling the creation of <i>_logs</i> directories</a>.</p>
<p>I also verified that <a href="http://issues.apache.org/jira/browse/HADOOP-5528">HADOOP-5528</a> got included by running the <a href="http://bit.ly/joinpy"><i>join.py</i></a> example successfully:</p>
<blockquote><pre>
$ wget http://bit.ly/joinpy
$ wget http://bit.ly/hostnamestxt http://bit.ly/logstxt
$ dumbo put hostnames.txt hostnames.txt -hadoop /usr/lib/hadoop/
$ dumbo put logs.txt logs.txt -hadoop /usr/lib/hadoop/
$ dumbo start join.py -input hostnames.txt -input logs.txt \
-output joined -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm joined/_logs -hadoop /usr/lib/hadoop
$ dumbo cat joined -hadoop /usr/lib/hadoop | grep node1
node1   5
</pre>
</blockquote>
<p>And while I was at it, I did a quick <a href="http://github.com/klbostee/typedbytes"><i>typedbytes</i></a> versus <a href="http://github.com/klbostee/ctypedbytes"><i>ctypedbytes</i></a> comparison as well:</p>
<blockquote><pre>
$ zcat /usr/share/man/man1/python2.5.1.gz &gt; python.man
$ for i in `seq 100000`; do cat python.man &gt;&gt; python.txt; done
$ du -h python.txt
1.2G    python.txt
$ dumbo put python.txt python.txt -hadoop /usr/lib/hadoop/
$ time dumbo start wordcount.py -input python.txt -output pywc \
-python python2.5 -hadoop /usr/lib/hadoop/
real    17m45.473s
user    0m1.380s
sys     0m0.224s
$ exit
# apt-get install gcc libc6-dev
# su - klaas
$ python2.5 ez_setup.py -zmaxd. ctypedbytes
$ time dumbo start wordcount.py -input python.txt -output pywc2 \
-python python2.5 -hadoop /usr/lib/hadoop/ \
-libegg ctypedbytes-0.1.5-py2.5-linux-i686.egg
real    13m22.420s
user    0m1.320s
sys     0m0.216s
</pre>
</blockquote>
<p>In this particular case, <i>ctypedbytes</i> cut the running time by about 25% (13m22s versus 17m45s). Your mileage may vary since the running times depend on many factors, but in any case I&#8217;d always expect <i>ctypedbytes</i> to lead to significant speed improvements.</p>
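<p>For the record, this is the arithmetic behind that percentage, using the two &#8220;real&#8221; times printed above. The helper <i>to_seconds</i> is just something I made up for this snippet.</p>

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair as printed by `time` into seconds."""
    return minutes * 60 + seconds

typedbytes_time = to_seconds(17, 45.473)   # "real" time of the typedbytes run
ctypedbytes_time = to_seconds(13, 22.420)  # "real" time of the ctypedbytes run

# Fraction of the original running time that was saved.
reduction = (typedbytes_time - ctypedbytes_time) / typedbytes_time
# How many times faster the second run finished.
speedup = typedbytes_time / ctypedbytes_time

print("running time reduced by %.1f%%" % (100 * reduction))
print("speedup factor: %.2f" % speedup)
```

<p>So the running time dropped by roughly a quarter, which corresponds to a speedup factor of about 1.33.</p>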
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/05/31/dumbo-on-clouderas-distribution/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>Virtual Python environments</title>
		<link>http://dumbotics.com/2009/05/24/virtual-pythonenvironments/</link>
		<comments>http://dumbotics.com/2009/05/24/virtual-pythonenvironments/#comments</comments>
		<pubDate>Sun, 24 May 2009 16:30:18 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[virtualenv]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=983</guid>
		<description><![CDATA[Judging from some of the questions about Dumbo development that keep popping up, virtual Python environments are apparently not that widely known and used yet. Therefore, I thought it made sense to write a quick post about them. The virtualenv tool can be installed as follows: $ wget http://peak.telecommunity.com/dist/ez_setup.py $ python ez_setup.py virtualenv While you&#8217;re [...]]]></description>
			<content:encoded><![CDATA[<p>Judging from some of the questions about Dumbo development that keep popping up, virtual Python environments are apparently not that widely known and used yet. Therefore, I thought it made sense to write a quick post about them. </p>
<p>The <a href="http://pypi.python.org/pypi/virtualenv"><i>virtualenv</i></a> tool can be installed as follows:</p>
<blockquote><pre>
$ wget http://peak.telecommunity.com/dist/ez_setup.py
$ python ez_setup.py virtualenv
</pre>
</blockquote>
<p>While you&#8217;re at it, you might also want to install <a href="http://code.google.com/p/python-nose/"><i>nose</i></a> by doing</p>
<blockquote><pre>
$ python ez_setup.py nose
</pre>
</blockquote>
<p>since running unit tests sometimes doesn&#8217;t work if you don&#8217;t install this module manually. Once you have <i>virtualenv</i> installed, you can create and activate a virtual Python environment as follows:</p>
<blockquote><pre>
$ mkdir ~/envs
$ virtualenv ~/envs/dumbo
$ source ~/envs/dumbo/bin/activate
</pre>
</blockquote>
<p>You then get a slightly different prompt to remind you that you&#8217;re using the isolated virtual environment:</p>
<blockquote><pre>
(dumbo)$ which python
/home/username/envs/dumbo/bin/python
(dumbo)$ deactivate
$ which python
/usr/bin/python
</pre>
</blockquote>
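<p>If you ever want to check from within Python itself whether the interpreter belongs to a virtual environment, something along the following lines should do. This is a sketch written for a current Python 3 interpreter rather than the Python 2.5 used elsewhere on this blog, and the function name <i>in_virtualenv</i> is my own.</p>

```python
import sys

def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Classic virtualenv releases set sys.real_prefix; newer virtualenv
    # releases and the standard venv module make sys.base_prefix differ
    # from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix

print(sys.executable)
print("virtual environment:", in_virtualenv())
```

<p>Run inside an activated environment, the first line should point into the environment&#8217;s <i>bin</i> directory, just like <i>which python</i> does.</p>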
<p>Such a virtual environment can be very convenient for developing and debugging Dumbo:</p>
<blockquote><pre>
$ source ~/envs/dumbo/bin/activate
(dumbo)$ git clone git://github.com/klbostee/dumbo.git
(dumbo)$ cd dumbo
(dumbo)$ python setup.py test  # run unit tests
(dumbo)$ python setup.py develop  # install symlinks
(dumbo)$ which dumbo
/home/username/envs/dumbo/bin/dumbo
(dumbo)$ cd examples
(dumbo)$ dumbo start wordcount.py -input brian.txt -output out.code
(dumbo)$ dumbo cat out.code | head -n 2
A       2
And     4
</pre>
</blockquote>
<p>Any change you make to the source code will then immediately affect the behavior of <i>dumbo</i> in the virtual environment, and none of this interferes with your global Python installation in any way. Soon enough, you&#8217;ll start wondering how you ever managed to live without virtual Python environments.</p>
<p>Note that running on Hadoop won&#8217;t work if you installed Dumbo via <i>python setup.py develop</i>. The develop command installs symlinks to the source files (so that you don&#8217;t have to rerun it after each change while you&#8217;re developing), but in order to be able to run on Hadoop an egg needs to be generated and installed, which is precisely what <i>python setup.py install</i> does.</p>
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/05/24/virtual-pythonenvironments/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
		<item>
		<title>TF-IDF revisited</title>
		<link>http://dumbotics.com/2009/05/17/tf-idf-revisited/</link>
		<comments>http://dumbotics.com/2009/05/17/tf-idf-revisited/#comments</comments>
		<pubDate>Sun, 17 May 2009 08:54:28 +0000</pubDate>
		<dc:creator>Klaas</dc:creator>
				<category><![CDATA[Examples]]></category>
		<category><![CDATA[Tips and tricks]]></category>
		<category><![CDATA[buffering]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[tf-idf]]></category>

		<guid isPermaLink="false">http://dumbotics.com/?p=948</guid>
		<description><![CDATA[Remember the buffering problems for the TF-IDF program discussed in a previous post as well as the lecture about MapReduce algorithms from Cloudera&#8217;s free Hadoop training? Thanks to the new joining abstraction (and the minor fixes and enhancements in Dumbo 0.21.13 and 0.21.14), these problems can now easily be avoided: from dumbo import * from [...]]]></description>
			<content:encoded><![CDATA[<p>Remember the buffering problems for the TF-IDF program discussed in a <a href="http://dumbotics.com/2009/03/15/computing-tf-idf-weights/">previous post</a> as well as the <a href="http://vimeo.com/3591404">lecture about MapReduce algorithms</a> from <a href="http://cloudera.com">Cloudera</a>&#8217;s <a href="http://www.cloudera.com/blog/2009/03/13/clouderas-basic-hadoop-training-now-free-online/">free Hadoop training</a>? Thanks to the <a href="http://dumbotics.com/2009/05/15/hiding-join-keys/">new joining abstraction</a> (and the minor fixes and enhancements in Dumbo <a href="http://dumbo.assembla.com/spaces/dumbo/milestones/88557-0-21-13">0.21.13</a> and <a href="http://dumbo.assembla.com/spaces/dumbo/milestones/88923-0-21-14">0.21.14</a>), these problems can now easily be avoided:</p>
<blockquote><pre>
from dumbo import *
from math import log

@opt("addpath", "yes")
def mapper1(key, value):
    for word in value.split():
        yield (key[0], word), 1

@primary
def mapper2a(key, value):
    yield key[0], value

@secondary
def mapper2b(key, value):
    yield key[0], (key[1], value)

@primary
def mapper3a(key, value):
    yield value[0], 1

@secondary
def mapper3b(key, value):
    yield value[0], (key, value[1])

class Reducer(JoinReducer):
    def __init__(self):
        self.sum = 0
    def primary(self, key, values):
        self.sum = sum(values)

class Combiner(JoinCombiner):
    def primary(self, key, values):
        yield key, sum(values)

class Reducer1(Reducer):
    def secondary(self, key, values):
        # key is the document; the values are (word, count) pairs from mapper2b
        for (word, n) in values:
            yield key, (word, float(n) / self.sum)

class Reducer2(Reducer):
    def __init__(self):
        Reducer.__init__(self)
        self.doccount = float(self.params["doccount"])
    def secondary(self, key, values):
        idf = log(self.doccount / self.sum)
        for (doc, tf) in values:
            yield (key, doc), tf * idf

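def runner(job):
    # iteration 1: count the occurrences of each (document, word) pair
    job.additer(mapper1, sumreducer, combiner=sumreducer)
    # iteration 2: divide each count by the document's total word count (tf)
    multimapper = MultiMapper()
    multimapper.add("", mapper2a)
    multimapper.add("", mapper2b)
    job.additer(multimapper, Reducer1, Combiner)
    # iteration 3: multiply each tf by log(doccount / document frequency)
    multimapper = MultiMapper()
    multimapper.add("", mapper3a)
    multimapper.add("", mapper3b)
    job.additer(multimapper, Reducer2, Combiner)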

if __name__ == "__main__":
    main(runner)
</pre>
</blockquote>
<p>Most of this Dumbo program shouldn&#8217;t be hard to understand if you had a peek at the posts about <a href="http://dumbotics.com/2009/05/15/hiding-join-keys/">hiding join keys</a> and <a href="http://dumbotics.com/2009/05/13/the-opt-decorator/">the <i>@opt</i> decorator</a>, except maybe for the following things:</p>
<ul>
<li>The first argument supplied to <i>MultiMapper</i>&#8217;s <i>add</i> method is a string corresponding to the pattern that has to occur in the file path in order for the added mapper to run on the key/value pairs in a given file. Since the empty string <i>&#8220;&#8221;</i> is considered to occur in every possible path string, all added mappers run on each input file in this example program.</li>
<li>It is <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/33">possible (but not necessary)</a> to yield key/value pairs in a <i>JoinReducer</i>&#8217;s <i>primary</i> method, as illustrated by the <i>Combiner</i> class in this example.</li>
<li>The <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/34">default implementations</a> of <i>JoinReducer</i>&#8217;s <i>primary</i> and <i>secondary</i> methods are identity operations, so <i>Combiner</i> combines the primary pairs and just passes on the secondary ones.</li>
</ul>
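<p>The path-pattern rule from the first bullet can be mimicked in a few lines of plain Python. The <i>TinyMultiMapper</i> class below is a hypothetical stand-in that only models the substring test; its call signature (with the path passed in explicitly) and the two toy mappers are assumptions of mine and not Dumbo&#8217;s actual implementation.</p>

```python
class TinyMultiMapper(object):
    """Hypothetical stand-in that only models the path-pattern dispatch."""

    def __init__(self):
        self.mappers = []

    def add(self, pattern, mapper):
        self.mappers.append((pattern, mapper))

    def __call__(self, path, key, value):
        for pattern, mapper in self.mappers:
            # Plain substring test: "" is contained in every string.
            if pattern in path:
                for pair in mapper(key, value):
                    yield pair

def tag_a(key, value):
    yield "a", value

def tag_b(key, value):
    yield "b", value

multimapper = TinyMultiMapper()
multimapper.add("", tag_a)      # runs for every path
multimapper.add("logs", tag_b)  # runs only when "logs" occurs in the path

logs_pairs = list(multimapper("/data/logs/part-00000", None, 1))
hosts_pairs = list(multimapper("/data/hostnames/part-00000", None, 1))
print(logs_pairs)
print(hosts_pairs)
```

<p>Both mappers fire for the first path, whereas only the one registered with the empty pattern fires for the second.</p>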
<p>Writing this program went surprisingly smoothly and didn&#8217;t take much effort at all. Apparently, the &#8220;primary/secondary abstraction&#8221; works really well for me.</p>
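<p>If you want to sanity-check the numbers such a program produces, the same three steps can be replayed in memory on a toy corpus. The snippet below is a plain Python sketch without any Dumbo involved; the three tiny documents are invented, and everything is held in dictionaries, which is exactly the kind of buffering that the join-based version avoids on real data.</p>

```python
from collections import Counter
from math import log

docs = {
    "d1": "brian was a boy",
    "d2": "brian became a man",
    "d3": "the boy became brian",
}

# Iteration 1: count each (doc, word) occurrence.
counts = Counter()
for doc, text in docs.items():
    for word in text.split():
        counts[(doc, word)] += 1

# Iteration 2: divide by the number of words per document to get tf.
doc_totals = Counter()
for (doc, word), n in counts.items():
    doc_totals[doc] += n
tf = {}
for (doc, word), n in counts.items():
    tf[(word, doc)] = float(n) / doc_totals[doc]

# Iteration 3: count documents per word, then weight tf by idf.
doc_freq = Counter(word for (word, doc) in tf)
doccount = float(len(docs))
tfidf = {}
for (word, doc), value in tf.items():
    tfidf[(word, doc)] = value * log(doccount / doc_freq[word])

for pair in sorted(tfidf.items()):
    print(pair)
```

<p>A word that occurs in every document, like &#8220;brian&#8221; here, ends up with weight zero because its idf is log(1).</p>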
]]></content:encoded>
			<wfw:commentRss>http://dumbotics.com/2009/05/17/tf-idf-revisited/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Klaas</media:title>
		</media:content>
	</item>
	</channel>
</rss>
