August 12, 2010
I released Dumbo 0.21.26 the other day. As usual we fixed various bugs, but this release also incorporates an enhancement that makes it a bit more special, namely, some refactoring that can be regarded as a first but important step towards pluggable backends.
Dumbo currently has two different backends, one that runs locally on UNIX and another that runs on Hadoop Streaming. The code for both of these backends used to be interwoven with the core Dumbo logic, but we have now abstracted it away behind a proper backend interface, which will hopefully make it easier to add more backends in the future.
Personally, I would very much like Dumbo to get a backend for Avro Tether at some point. The two main starting points for making this happen would probably be my main refactoring commit and the Java implementation of a Tether client in the Avro unit tests.
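To give a rough idea of what such a backend abstraction can look like, here is a minimal sketch. The class names, the method names, the use of a plain dict for the options and the rule for picking a backend are all assumptions made for illustration; this is not a copy of Dumbo's actual interface, and the run methods only return a description instead of launching anything.

```python
class Backend(object):
    """Hypothetical interface: the core only calls these two methods."""

    def matches(self, opts):
        """Return True if this backend should handle a job with these options."""
        raise NotImplementedError

    def run(self, prog, opts):
        """Run the program; this sketch just returns a description string."""
        raise NotImplementedError


class UnixBackend(Backend):
    def matches(self, opts):
        # Assumption: a job runs locally when no Hadoop home is given.
        return "hadoop" not in opts

    def run(self, prog, opts):
        return "unix pipeline: %s" % prog


class StreamingBackend(Backend):
    def matches(self, opts):
        return "hadoop" in opts

    def run(self, prog, opts):
        return "hadoop streaming (%s): %s" % (opts["hadoop"], prog)


# Registry of available backends; a new backend would be appended here.
BACKENDS = [StreamingBackend(), UnixBackend()]


def get_backend(opts, backends=BACKENDS):
    """Return the first registered backend that claims the options."""
    for backend in backends:
        if backend.matches(opts):
            return backend
    raise ValueError("no backend matches the given options")
```

With a layout like this, the core logic never needs to know which backends exist. Adding, say, an Avro Tether backend would come down to writing one more class and registering it.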
Explanations | Tagged: backends, avro, tether |
Posted by Klaas
May 18, 2010
In response to Johan’s desperate request I’ve decided to organize a 4th HUGUK meetup. More info will follow on the official HUGUK blog soon, but since it’s going to be fairly short notice I thought it made sense to already share some details now:
The two main talks will be:
“Introduction to Sqoop” by Aaron Kimball
– Synopsis –
This talk introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop’s distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.
After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We’ll also cover some deeper technical details of Sqoop’s architecture, and take a look at some upcoming aspects of Sqoop’s development roadmap.
– Bio –
Aaron Kimball has been working with Hadoop since early 2007. Aaron has worked with the NSF and several other universities nationally and internationally to advance education in the field of large-scale data-intensive computing. He helped create and deliver academic course materials first used at the University of Washington (and later adopted by many other academic institutions) as well as Hadoop training materials used by several industry partners. Aaron has also worked as an independent consultant focusing on Hadoop and Amazon EC2-based systems. At Cloudera, he continues to actively develop Hadoop and related tools, as well as focus on training and user education. Aaron holds a B.S. in Computer Science from Cornell University, and an M.S. in Computer Science and Engineering from the University of Washington.
“Hive at Last.fm” by Tim Sell
– Synopsis –
This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
– Bio –
Tim Sell is a Data Engineer at Last.fm who works with Hive and Hadoop on a daily basis.
As usual we’ll try to provide some free beer at the end and anyone is welcome to give a short lightning talk after the main presentations.
Uncategorized |
February 5, 2010
Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time. Personally, I think that Dumbo still occupies a useful space within the Hadoop ecosystem, but in some cases it makes sense to work at an even higher level and use something like Pig or Hive. In fact, sometimes it makes sense to combine the two and do some parts of your data flow in Dumbo and others in Pig. To make this possible, I recently wrote a Pig loader function for sequence files that contain TypedBytesWritables, which is the file format Dumbo uses by default to store all its output on Hadoop. Here’s an example of a Pig script that reads Dumbo output:
register pigtail.jar; -- http://github.com/klbostee/pigtail
a = load '/hdfs/path/to/dumbo/output'
    using fm.last.pigtail.storage.TypedBytesSequenceFileLoader()
    as (artist:int, val:(listeners:int, listens:int));
b = foreach a generate artist, val.listeners as listeners;
c = order b by listeners;
d = limit c 100;
dump d;
You basically just have to specify names and types for the components of the key/value pairs and you’re good to go.
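For context, Dumbo output with the shape used in the schema above (an integer artist key and a two-integer tuple as value) could come from a program along the following lines. This is only a sketch: the tab-separated "user, artist id" input format is a made-up assumption, and run_locally is a tiny stand-in for the framework so that the mapper and reducer can be tried without Hadoop or Dumbo.

```python
def mapper(key, value):
    # Assumption: each input value is a "user<TAB>artist_id" line.
    user, artist = value.split("\t")
    yield int(artist), user


def reducer(artist, users):
    users = list(users)
    # The value tuple mirrors val:(listeners:int, listens:int) in the Pig schema.
    yield artist, (len(set(users)), len(users))


def run_locally(lines):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = {}
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups.setdefault(key, []).append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output
```

Calling run_locally(["u1\t7", "u2\t7", "u1\t7", "u3\t9"]) gives one record per artist. On a cluster the same mapper and reducer would be handed to Dumbo's run function instead.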
A possibly useful side effect of writing this loader is that Pig can now read all sorts of file formats. Everything that Dumbo can read can also be consumed by Pig scripts now; all you have to do is write a simple Dumbo script that converts it to typed bytes sequence files:
from dumbo import run
from dumbo.lib import identitymapper

if __name__ == "__main__":
    run(identitymapper)
The proper solution is of course to write custom Pig loaders, but this gets the job done too and doesn’t slow things down that much.
Examples, Tips and tricks | Tagged: pig, pigtail, typed bytes |
December 23, 2009
A while ago, I received an email from Andrew in which he wrote:
Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so:
$ elastic-mapreduce --create --alive
SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo with the following two commands:
$ wget -O ez_setup.py http://bit.ly/ezsetup
$ sudo python ez_setup.py dumbo
Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command.
$ dumbo start ipcount.py -hadoop /home/hadoop \
    -input s3://anhi-test-data/wordcount/input/ \
    -output s3://anhi-test-data/output/dumbo/wc/
The -hadoop option is important. At this point I haven’t created an automatic Dumbo install script, so you’ll have to install Dumbo by hand each time you launch the cluster. Fortunately installation is easy.
There was a minor hiccup that required the Amazon guys to pull the AMI with Dumbo support, but it’s back now and they seem to be confident that Dumbo support is going to remain available from now on. They are also still planning to make things even easier by providing an automatic Dumbo installation script.
As an aside, it’s worth mentioning that a bug in Hadoop Streaming got fixed in the process of adding Dumbo support to EMR. I can’t wait to see what else the Amazon guys have up their sleeves.
Examples, Tips and tricks | Tagged: amazon, elastic mapreduce, ec2, mapreduce-1293 |
November 23, 2009
We’ve finally started looking into moving from Hadoop 0.18 to 0.20 at Last.fm, and I thought it might be useful to share a few Dumbo-related things I learned in the process:
- We’re probably going to base our 0.20 build on Cloudera’s 0.20 distribution, and I found out the hard way that Dumbo doesn’t work on version 0.20.1+133 of this distribution because it includes a patch for MAPREDUCE-967 that breaks some of the Hadoop Streaming functionality on which Dumbo relies. Luckily, the Cloudera guys fixed it in 0.20.1+152 by reverting this patch, but if you’re still trying to get Dumbo to work on Cloudera’s 0.20.1+133 distribution for some reason, then you can expect to get NullPointerExceptions and errors like “module wordcount not found” in your tasks’ stderr logs.
- Also, the Cloudera guys apparently haven’t added the patch for MAPREDUCE-764 to their distribution yet, so you’ll still have to apply this patch yourself if you want to avoid strange encoding problems in certain corner cases. This patch was reviewed and accepted for Hadoop 0.21 quite a while ago though, so maybe we can be hopeful about it getting included in Cloudera’s 0.20 distribution soon.
- The Twitter guys put together a pretty awesome patched and backported version of hadoop-gpl-compression for Hadoop 0.20. It includes several bugfixes and it also provides an InputFormat for the old API, which is useful for Hadoop Streaming (and hence also Dumbo) users since Streaming has not been converted to the new API yet. If you’re interested in this stuff, you might want to have a look at this guest post from Kevin and Eric on the Cloudera blog.
Explanations, Tips and tricks | Tagged: cloudera, mapreduce-764, hadoop 0.20, hadoop-lzo, hadoop-gpl-compression, mapreduce-967 |
July 31, 2009
This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or producing HBase tables from Dumbo programs. Here’s a silly but illustrative example:
from dumbo import opt, run

@opt("inputformat", "fm.last.hbase.mapred.TypedBytesTableInputFormat")
@opt("hadoopconf", "hbase.mapred.tablecolumns=testfamily:testqualifier")
def mapper(key, columns):
    for family, column in columns.iteritems():
        for qualifier, value in column.iteritems():
            yield key, (family, qualifier, value)

@opt("outputformat", "fm.last.hbase.mapred.TypedBytesTableOutputFormat")
def reducer(key, values):
    columns = {}
    for family, qualifier, value in values:
        # setdefault stores the per-family dict in columns on first use
        column = columns.setdefault(family, {})
        column[qualifier] = value
    yield key, columns

if __name__ == "__main__":
    run(mapper, reducer)
Have a look at the readme for more information.
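The data shapes in this example can be tried out without HBase or Hadoop. The sketch below assumes, as the loops in the mapper suggest, that each row arrives as a dictionary mapping column families to dictionaries of qualifier/value pairs. The function names flatten and regroup and the sample row are made up for illustration.

```python
def flatten(key, columns):
    """Turn one row's nested dict into (key, (family, qualifier, value)) pairs."""
    for family, column in columns.items():
        for qualifier, value in column.items():
            yield key, (family, qualifier, value)


def regroup(key, values):
    """Rebuild the nested {family: {qualifier: value}} dict for one row."""
    columns = {}
    for family, qualifier, value in values:
        columns.setdefault(family, {})[qualifier] = value
    yield key, columns


# Hypothetical sample row with one family and two qualifiers.
row = {"testfamily": {"testqualifier": "v1", "other": "v2"}}
flat = [value for _, value in flatten("row1", row)]
rebuilt = list(regroup("row1", flat))
```

Flattening a row and regrouping the resulting triples gives back the original nested dictionary, which is the structure the reducer has to emit for each row key.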
Examples, Tips and tricks | Tagged: hbase, inputformat, outputformat |
July 15, 2009
Unfortunately, the list of Hadoop patches required for making Dumbo work properly just expanded a bit, since I traced a strange encoding bug to an issue in Streaming’s typed bytes code. Hence, you might want to apply the MAPREDUCE-764 patch to your Hadoop build if you use Dumbo, even though the bug only leads to problems in very specific cases and usually isn’t hard to work around. Hopefully this patch will make it into Hadoop 0.21.
This isn’t all bad news, however. The encoding bug was initially reported on the dumbo-user mailing list, which apparently has 12 subscribers already and is starting to attract fairly regular traffic. To be honest, I haven’t promoted this mailing list much so far and never really expected that people would actually start using it, but obviously I was wrong. Everyone who reads this blog should consider subscribing. I’m sure you won’t regret it!
Explanations | Tagged: dumbo-user, mailing lists, mapreduce-764 |
June 18, 2009
The Cloudera guys blogged about using Pig for examining Apache logs yesterday. Although it nicely illustrates several lesser-known Pig features, I’m not overly impressed with the described program to be honest. Having to resort to three different scripting languages to do some GeoIP lookups just complicates things too much if you ask me. Personally, I’d much prefer writing something like:
class Mapper:
    def __init__(self):
        from re import compile
        self.regex = compile(r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) ' \
                             r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] ' \
                             r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) ' \
                             r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" ' \
                             r'"(?P<agent>[^"]+)"')
        from pygeoip import GeoIP, MEMORY_CACHE
        self.geoip = GeoIP(self.params["geodata"], flags=MEMORY_CACHE)
    def __call__(self, key, value):
        mo = self.regex.match(value)
        if mo:
            request, bytes = mo.group("request"), mo.group("bytes")
            if request.startswith("GET") and bytes != "-":
                rec = self.geoip.record_by_addr(mo.group("ip"))
                country = rec["country_code"] if rec else "-"
                yield country, (1, int(bytes))

if __name__ == "__main__":
    from dumbo import run, sumsreducer
    run(Mapper, sumsreducer, combiner=sumsreducer)
After installing Python 2.6, I tested this hits_by_country.py program on my chrooted Cloudera-flavored Hadoop server as follows:
$ wget http://pygeoip.googlecode.com/files/pygeoip-0.1.1-py2.6.egg
$ wget http://bit.ly/geolitecity
$ wget http://bit.ly/randomapachelog # found via Google
$ dumbo put access.log access.log -hadoop /usr/lib/hadoop
$ dumbo start hits_by_country.py -hadoop /usr/lib/hadoop \
    -input access.log -output hits_by_country \
    -python python2.6 -libegg pygeoip-0.1.1-py2.6.egg \
    -file GeoLiteCity.dat -param geodata=GeoLiteCity.dat
$ dumbo cat hits_by_country/part-00000 -hadoop /usr/lib/hadoop/ | \
    sort -k2,2nr | head -n 5
US 9400 388083137
KR 6714 2655270
DE 1859 32131992
RU 1838 44073038
CA 1055 23035208
At Last.fm, we use the GeoIP Python bindings instead of the pure-Python pygeoip module. The two are nearly identical API-wise, but pygeoip might be a bit slower. Also, we abstract away the format of our Apache logs by using a parser class and we have some library code for identifying hits from robots as well, much like the IsBotUA() method in the Pig example.
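The parsing and summing logic of the program above can be checked locally without Hadoop, Dumbo or a GeoIP database. In the sketch below the regular expression is the one from the program, while the COUNTRY_BY_IP table is a made-up stand-in for the GeoIP lookup, the sample addresses come from the documentation ranges, and sum_tuples is a local approximation of what sumsreducer is assumed to do for one key, namely summing the value tuples position by position.

```python
import re

LOG_RE = re.compile(
    r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) '
    r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) '
    r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" '
    r'"(?P<agent>[^"]+)"')

# Hypothetical stand-in for the GeoIP database lookup.
COUNTRY_BY_IP = {"192.0.2.1": "US"}


def parse_hit(line):
    """Return (country, (1, bytes)) for a GET hit with a byte count, else None."""
    mo = LOG_RE.match(line)
    if not mo:
        return None
    request, nbytes = mo.group("request"), mo.group("bytes")
    if not request.startswith("GET") or nbytes == "-":
        return None
    return COUNTRY_BY_IP.get(mo.group("ip"), "-"), (1, int(nbytes))


def sum_tuples(values):
    """Sum equally sized tuples position by position."""
    return tuple(sum(column) for column in zip(*values))


LINES = [
    '192.0.2.1 - - [18/Jun/2009:10:00:00 +0000] "GET /a HTTP/1.1" 200 100 "-" "agent"',
    '192.0.2.1 - - [18/Jun/2009:10:00:01 +0000] "GET /b HTTP/1.1" 200 250 "-" "agent"',
    '198.51.100.7 - - [18/Jun/2009:10:00:02 +0000] "POST /c HTTP/1.1" 200 50 "-" "agent"',
    '198.51.100.7 - - [18/Jun/2009:10:00:03 +0000] "GET /d HTTP/1.1" 304 - "-" "agent"',
]
hits = [hit for hit in map(parse_hit, LINES) if hit is not None]
```

Only the first two sample lines survive the filter, since the third is a POST and the fourth has no byte count. Summing their value tuples gives the hit count and the total number of bytes for that country, which correspond to the two numeric columns in the output shown earlier.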
Examples | Tagged: apache logs, pig, geoip, pygeoip |