Outputting Tokyo Cabinet or Constant DB files

April 29, 2011

Dumbo 0.21.30 got released this week. Apart from several bugfixes, it includes some cool new functionality that allows you to output Tokyo Cabinet or Constant DB files directly by using a special reducer in combination with the nifty output formats that got added to Feathers a while ago. Many thanks to Daniel Graña and Bruno Rezende for contributing these awesome new features!

DAG jobs and mapredtest

December 17, 2010

Dumbo 0.21.29 went out the other day and it includes two exciting new features you might be interested in: DAG jobs and mapredtest.

It’s always great to get such high-quality contributions. Please keep them coming – I promise I’ll do everything I can to get them into my master branch, and eventually into a release, as quickly as possible.

Dumbo backends

August 12, 2010

I released Dumbo 0.21.26 the other day. As usual we fixed various bugs, but this release also incorporates an enhancement that makes it a bit more special: some refactoring that can be regarded as a first but important step towards pluggable backends.

Dumbo currently has two different backends: one that runs locally on UNIX and another that runs on Hadoop Streaming. The code for both used to be interwoven with the core Dumbo logic, but it is now abstracted away behind a proper backend interface, which will hopefully make it easier to add more backends in the future.
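For the curious, the general shape of such a backend interface can be sketched as follows. This is a hypothetical illustration with made-up names like `Backend` and `LocalBackend`; it is not Dumbo’s actual internal API, just a minimal picture of what “pluggable backends” means here:

```python
from abc import ABC, abstractmethod
from itertools import groupby
from operator import itemgetter

class Backend(ABC):
    # Hypothetical interface sketch; the names are illustrative,
    # not Dumbo's actual internal API.
    @abstractmethod
    def run(self, mapper, reducer, records):
        """Run one map/shuffle/reduce pass over (key, value) records."""

class LocalBackend(Backend):
    # In-process stand-in for a backend that runs locally on UNIX.
    def run(self, mapper, reducer, records):
        # Map phase: each input pair may yield any number of output pairs.
        mapped = [pair for k, v in records for pair in mapper(k, v)]
        # Shuffle phase: sort by key so equal keys become adjacent.
        mapped.sort(key=itemgetter(0))
        # Reduce phase: hand each key and its values to the reducer.
        return [out for key, group in groupby(mapped, key=itemgetter(0))
                    for out in reducer(key, (v for _, v in group))]
```

A Hadoop Streaming backend would implement the same `run` contract by launching a streaming job instead of iterating in-process, which is what makes swapping backends transparent to the rest of the code.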

Personally, I would very much like Dumbo to get a backend for Avro Tether at some point. The two main starting points for making this happen would probably be my main refactoring commit and the Java implementation of a Tether client in the Avro unit tests.

HUGUK #4

May 18, 2010

In response to Johan’s desperate request I’ve decided to organize a 4th HUGUK meetup. More info will follow on the official HUGUK blog soon, but since it’s going to be fairly short notice, I thought it made sense to share some details now:

The two main talks will be:

“Introduction to Sqoop” by Aaron Kimball

— Synopsis —

This talk introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop’s distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.

After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We’ll also cover some deeper technical details of Sqoop’s architecture, and take a look at some upcoming aspects of Sqoop’s development roadmap.

— Bio —

Aaron Kimball has been working with Hadoop since early 2007. Aaron has worked with the NSF and several other universities nationally and internationally to advance education in the field of large-scale data-intensive computing. He helped create and deliver academic course materials first used at the University of Washington (and later adopted by many other academic institutions) as well as Hadoop training materials used by several industry partners. Aaron has also worked as an independent consultant focusing on Hadoop and Amazon EC2-based systems. At Cloudera, he continues to actively develop Hadoop and related tools, as well as focus on training and user education. Aaron holds a B.S. in Computer Science from Cornell University, and an M.S. in Computer Science and Engineering from the University of Washington.

“Hive at Last.fm” by Tim Sell

— Synopsis —

This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.

— Bio —

Tim Sell is a Data Engineer at Last.fm who works with Hive and Hadoop on a daily basis.

As usual we’ll try to provide some free beer at the end, and anyone is welcome to give a short lightning talk after the main presentations.

Dumbo at PyCon

February 22, 2010

Nitin Madnani gave a talk at PyCon this weekend about how Dumbo and Amazon EC2 allowed him to process very large text corpora using the machinery provided by NLTK. Unfortunately I wasn’t there but I heard that his talk was very well received, and his slides definitely are pretty awesome.

Consuming Dumbo output with Pig

February 5, 2010

Although Dumbo abstracts and simplifies things quite a bit, it still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time. Personally, I think Dumbo still occupies a useful space within the Hadoop ecosystem, but in some cases it makes sense to work at an even higher level and use something like Pig or Hive. In fact, sometimes it makes sense to combine the two and do some parts of your data flow in Dumbo and others in Pig. To make this possible, I recently wrote a Pig loader function for sequence files that contain TypedBytesWritables, which is the file format Dumbo uses by default to store its output on Hadoop. Here’s an example of a Pig script that reads Dumbo output:

register pigtail.jar;  -- http://github.com/klbostee/pigtail

a = load '/hdfs/path/to/dumbo/output'
    using fm.last.pigtail.storage.TypedBytesSequenceFileLoader()
    as (artist:int, val:(listeners:int, listens:int));
b = foreach a generate artist, val.listeners as listeners;
c = order b by listeners;
d = limit c 100;

dump d;

You basically just have to specify names and types for the components of the key/value pairs and you’re good to go.

A possibly useful side-effect of writing this loader is that it makes all sorts of file formats readable with Pig. Everything that Dumbo can read can now also be consumed by Pig scripts; all you have to do is write a simple Dumbo script that converts it to typed bytes sequence files:

from dumbo import run
from dumbo.lib import identitymapper

if __name__ == "__main__":
    run(identitymapper)

The proper solution is of course to write custom Pig loaders, but this gets the job done too and doesn’t slow things down that much.
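Such a conversion script is launched like any other Dumbo program. Assuming it was saved as tobytes.py (a name picked purely for illustration, as are the paths), the invocation would look something along these lines:

```shell
# Run the conversion on Hadoop; the script name and paths are
# illustrative, so adjust them to your own cluster and data.
dumbo start tobytes.py \
    -hadoop $HADOOP_HOME \
    -input /hdfs/path/to/textinput \
    -output /hdfs/path/to/converted/output \
    -inputformat text
```

The resulting output directory can then be pointed at directly from the Pig script above.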

Reading Hadoop records in Python

December 23, 2009

At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows:

Hadoop records
     → CsvRecordInput
         → hadoop_record Python module

Although it’s a nice and very systematic solution, I couldn’t resist blogging about an alternative that already existed:

Hadoop records
     → TypedBytesRecordInput
         → typedbytes Python module

Not only would this have saved Paul a lot of work, it probably also would’ve been more efficient, especially when using ctypedbytes, the speedy variant of the typedbytes module.