Dumbo 0.21.30 got released this week. Apart from several bugfixes, it includes some cool new functionality that allows you to output Tokyo Cabinet or Constant DB files directly by using a special reducer in combination with the nifty output formats that got added to Feathers a while ago. Many thanks to Daniel Graña and Bruno Rezende for contributing these awesome new features!
DAG jobs and mapredtest
December 17, 2010
Dumbo 0.21.29 went out the other day and it includes two exciting new features you might be interested in:
- Support for jobs that are DAGs instead of just chains, by David Chiang.
- A neat unit testing module inspired by Cloudera’s MRUnit, by Adam Ever-Hadani.
It’s always great to get such high-quality contributions. Please keep them coming. I promise I’ll do everything I can to get them into my master branch, and eventually into a release, as quickly as possible.
Dumbo backends
August 12, 2010
I released Dumbo 0.21.26 the other day. As usual we fixed various bugs, but this release also incorporates an enhancement that makes it a bit more special: some refactoring that can be regarded as a first but important step towards pluggable backends.
Dumbo currently has two different backends, one that runs locally on UNIX and another that runs on Hadoop Streaming. The code for both of these backends used to be interwoven with the core Dumbo logic, but now we abstracted it away behind a proper backend interface which will hopefully make it easier to add more backends in the future.
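To make the idea of a backend interface a bit more concrete, here is a minimal sketch of how such an abstraction could look. All class and function names below (Backend, LocalBackend, StreamingBackend, get_backend) and the use of a "hadoop" option to choose between backends are assumptions made for illustration; they are not taken from the actual Dumbo code.

```python
class Backend(object):
    """Hypothetical interface that every backend would implement."""

    def matches(self, opts):
        # Should return True if this backend can handle the given options.
        raise NotImplementedError

    def run(self, job_name, opts):
        # Should execute the job; here it only returns a description.
        raise NotImplementedError


class LocalBackend(Backend):
    def matches(self, opts):
        # Assumption: run locally when no Hadoop location is given.
        return "hadoop" not in opts

    def run(self, job_name, opts):
        return "local:" + job_name


class StreamingBackend(Backend):
    def matches(self, opts):
        return "hadoop" in opts

    def run(self, job_name, opts):
        return "streaming:%s@%s" % (job_name, opts["hadoop"])


# Adding a new backend would only require adding an instance to this list.
BACKENDS = [StreamingBackend(), LocalBackend()]


def get_backend(opts):
    """Return the first registered backend that claims these options."""
    for backend in BACKENDS:
        if backend.matches(opts):
            return backend
    raise ValueError("no backend matches the given options")
```

With a structure like this, the core logic only ever talks to whatever get_backend returns, so a new backend does not require changes to the core.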
Personally, I would very much like Dumbo to get a backend for Avro Tether at some point. The two main starting points for making this happen would probably be my main refactoring commit and the Java implementation of a Tether client in the Avro unit tests.
HUGUK #4
May 18, 2010
In response to Johan’s desperate request I’ve decided to organize a 4th HUGUK meetup. More info will follow on the official HUGUK blog soon, but since it’s going to be fairly short notice I thought it made sense to already share some details now:
- Date: Thursday 3rd of June
- Time: 18.30
- Place: new Skills Matter building
The two main talks will be:
“Introduction to Sqoop” by Aaron Kimball
– Synopsis –
This talk introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop’s distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.
After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We’ll also cover some deeper technical details of Sqoop’s architecture, and take a look at some upcoming aspects of Sqoop’s development roadmap.
– Bio –
Aaron Kimball has been working with Hadoop since early 2007. Aaron has worked with the NSF and several other universities nationally and internationally to advance education in the field of large-scale data-intensive computing. He helped create and deliver academic course materials first used at the University of Washington (and later adopted by many other academic institutions) as well as Hadoop training materials used by several industry partners. Aaron has also worked as an independent consultant focusing on Hadoop and Amazon EC2-based systems. At Cloudera, he continues to actively develop Hadoop and related tools, as well as focus on training and user education. Aaron holds a B.S. in Computer Science from Cornell University, and an M.S. in Computer Science and Engineering from the University of Washington.
“Hive at Last.fm” by Tim Sell
– Synopsis –
This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
– Bio –
Tim Sell is a Data Engineer at Last.fm who works with Hive and Hadoop on a daily basis.
As usual we’ll try to provide some free beer at the end and anyone is welcome to give a short lightning talk after the main presentations.
Dumbo at PyCon
February 22, 2010
Nitin Madnani gave a talk at PyCon this weekend about how Dumbo and Amazon EC2 allowed him to process very large text corpora using the machinery provided by NLTK. Unfortunately I wasn’t there but I heard that his talk was very well received, and his slides definitely are pretty awesome.
Consuming Dumbo output with Pig
February 5, 2010
Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time. Personally, I think that Dumbo still occupies a useful space within the Hadoop ecosystem, but in some cases it makes sense to work at an even higher level and use something like Pig or Hive. In fact, sometimes it makes sense to combine the two and do some parts of your data flow in Dumbo and others in Pig. To make this possible, I recently wrote a Pig loader function for sequence files that contain TypedBytesWritables, which is the file format Dumbo uses by default to store all its output on Hadoop. Here’s an example of a Pig script that reads Dumbo output:
register pigtail.jar; -- http://github.com/klbostee/pigtail
a = load '/hdfs/path/to/dumbo/output'
    using fm.last.pigtail.storage.TypedBytesSequenceFileLoader()
    as (artist:int, val:(listeners:int, listens:int));
b = foreach a generate artist, val.listeners as listeners;
c = order b by listeners;
d = limit c 100;
dump d;
You basically just have to specify names and types for the components of the key/value pairs and you’re good to go.
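For reference, the schema in the script above expects records shaped like (artist, (listeners, listens)). A Dumbo reducer that emits output in this shape could look roughly like the sketch below. It assumes, purely for illustration, that each incoming value is a (user, listens) pair; that input layout is hypothetical and not something the loader requires.

```python
def reducer(artist, values):
    """Emit (artist, (listeners, listens)) from (user, listens) pairs."""
    users = set()
    total = 0
    for user, listens in values:
        users.add(user)   # each distinct user counts as one listener
        total += listens  # sum up the listens across all users
    yield artist, (len(users), total)


# Quick local check without Hadoop:
result = list(reducer(42, [("a", 3), ("b", 1), ("a", 2)]))
```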
A possibly useful side effect of writing this loader is that Pig can now read all sorts of file formats. Everything that Dumbo can read can now also be consumed by Pig scripts; all you have to do is write a simple Dumbo script that converts it to typed bytes sequence files:
from dumbo import run
from dumbo.lib import identitymapper

if __name__ == "__main__":
    # Map-only job that passes every record through unchanged.
    run(identitymapper)
The proper solution is of course to write custom Pig loaders, but this gets the job done too and doesn’t slow things down that much.
Reading Hadoop records in Python
December 23, 2009
At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows:
Hadoop records → CsvRecordInput → hadoop_record Python module
Although it’s a nice and very systematic solution, I couldn’t resist blogging about an already existing alternative solution for this problem:
Hadoop records → TypedBytesRecordInput → typedbytes Python module
Not only would this have saved Paul a lot of work, it probably also would’ve been more efficient, especially when using ctypedbytes, the speedy variant of the typedbytes module.
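To give an idea of why typed bytes are cheap to parse, here is a minimal sketch of a decoder for a handful of types. It is not the typedbytes module itself, which already handles all of this for you; it only illustrates the wire format, assuming the type codes documented for Hadoop’s typedbytes package (a one-byte type code followed by big-endian data). Only a subset of the types is covered.

```python
import struct
from io import BytesIO

# Type codes as documented for Hadoop's typedbytes package
# (only a subset is handled in this sketch).
INT, LONG, DOUBLE, STRING, VECTOR = 3, 4, 6, 7, 8


def read_value(stream):
    """Decode one typed bytes value from a binary stream."""
    (code,) = struct.unpack(">B", stream.read(1))
    if code == INT:
        return struct.unpack(">i", stream.read(4))[0]
    if code == LONG:
        return struct.unpack(">q", stream.read(8))[0]
    if code == DOUBLE:
        return struct.unpack(">d", stream.read(8))[0]
    if code == STRING:
        # Strings are a 4-byte length followed by UTF-8 bytes.
        (length,) = struct.unpack(">i", stream.read(4))
        return stream.read(length).decode("utf-8")
    if code == VECTOR:
        # Vectors are a 4-byte element count followed by typed elements.
        (length,) = struct.unpack(">i", stream.read(4))
        return tuple(read_value(stream) for _ in range(length))
    raise ValueError("type code %d not handled in this sketch" % code)


# A hand-built record: the string "artist" followed by the vector (3, 7).
data = (
    struct.pack(">Bi", STRING, 6) + b"artist"
    + struct.pack(">Bi", VECTOR, 2)
    + struct.pack(">Bi", INT, 3)
    + struct.pack(">Bi", INT, 7)
)
stream = BytesIO(data)
key, value = read_value(stream), read_value(stream)
```

Every value carries its own type and length, so no delimiter scanning or escaping is needed.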
Dumbo on Amazon EMR
December 23, 2009
A while ago, I received an email from Andrew in which he wrote:
Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so:

$ elastic-mapreduce --create --alive

SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo with the following two commands:

$ wget -O ez_setup.py http://bit.ly/ezsetup
$ sudo python ez_setup.py dumbo

Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command.

$ dumbo start ipcount.py -hadoop /home/hadoop \
    -input s3://anhi-test-data/wordcount/input/ \
    -output s3://anhi-test-data/output/dumbo/wc/

The -hadoop option is important. At this point I haven’t created an automatic Dumbo install script, so you’ll have to install Dumbo by hand each time you launch the cluster. Fortunately installation is easy.
There was a minor hiccup that required the Amazon guys to pull the AMI with Dumbo support, but it’s back now and they seem to be confident that Dumbo support is going to remain available from now on. They are also still planning to make things even easier by providing an automatic Dumbo installation script.
As an aside, it’s worth mentioning that a bug in Hadoop Streaming got fixed in the process of adding Dumbo support to EMR. I can’t wait to see what else the Amazon guys have up their sleeves.
Moving to Hadoop 0.20
November 23, 2009
We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process:
- We’re probably going to base our 0.20 build on Cloudera’s 0.20 distribution, and I found out the hard way that Dumbo doesn’t work on version 0.20.1+133 of this distribution because it includes a patch for MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed it in 0.20.1+152 by reverting this patch, but if you’re still trying to get Dumbo to work on Cloudera’s 0.20.1+133 distribution for some reason then you can expect to get NullPointerExceptions and errors such as “module wordcount not found” in your tasks’ stderr logs.
- Also, the Cloudera guys apparently haven’t added the patch for MAPREDUCE-764 to their distribution yet, so you’ll still have to apply this patch yourself if you want to avoid strange encoding problems in certain corner cases. This patch was reviewed and accepted for Hadoop 0.21 quite a while ago, though, so maybe we can be hopeful about it getting included in Cloudera’s 0.20 distribution soon.
- The Twitter guys put together a pretty awesome patched and backported version of hadoop-gpl-compression for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you’re interested in this stuff, you might want to have a look at this guest post from Kevin and Eric on the Cloudera blog.
Dumbo over HBase
July 31, 2009
This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or producing HBase tables from Dumbo programs. Here’s a silly but illustrative example:
from dumbo import opt, run

@opt("inputformat", "fm.last.hbase.mapred.TypedBytesTableInputFormat")
@opt("hadoopconf", "hbase.mapred.tablecolumns=testfamily:testqualifier")
def mapper(key, columns):
    # Flatten the nested {family: {qualifier: value}} dictionary of each row.
    for family, column in columns.iteritems():
        for qualifier, value in column.iteritems():
            yield key, (family, qualifier, value)

@opt("outputformat", "fm.last.hbase.mapred.TypedBytesTableOutputFormat")
@opt("hadoopconf", "hbase.mapred.outputtable=output_table")
def reducer(key, values):
    # Rebuild the nested dictionary; setdefault stores each new family dict.
    columns = {}
    for family, qualifier, value in values:
        columns.setdefault(family, {})[qualifier] = value
    yield key, columns

if __name__ == "__main__":
    run(mapper, reducer)
Have a look at the readme for more information.
Posted by Klaas