Nitin Madnani gave a talk at PyCon this weekend about how Dumbo and Amazon EC2 allowed him to process very large text corpora using the machinery provided by NLTK. Unfortunately I wasn’t there but I heard that his talk was very well received, and his slides definitely are pretty awesome.
Consuming Dumbo output with Pig
February 5, 2010
Although it abstracts and simplifies it all quite a bit, Dumbo still forces you to think in MapReduce, which might not be ideal if you want to implement complex data flows in a limited amount of time. Personally, I think that Dumbo still occupies a useful space within the Hadoop ecosystem, but in some cases it makes sense to work at an even higher level and use something like Pig or Hive. In fact, sometimes it makes sense to combine the two and do some parts of your data flow in Dumbo and others in Pig. To make this possible, I recently wrote a Pig loader function for sequence files that contain TypedBytesWritables, which is the file format Dumbo uses by default to store all its output on Hadoop. Here’s an example of a Pig script that reads Dumbo output:
register pigtail.jar; -- http://github.com/klbostee/pigtail
a = load '/hdfs/path/to/dumbo/output'
    using fm.last.pigtail.storage.TypedBytesSequenceFileLoader()
    as (artist:int, val:(listeners:int, listens:int));
b = foreach a generate artist, val.listeners as listeners;
c = order b by listeners;
d = limit c 100;
dump d;
You basically just have to specify names and types for the components of the key/value pairs and you’re good to go.
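For readers who think in Python rather than Pig Latin, the data flow of the script can be mimicked on a small in-memory list. This is only a sketch: the sample records are made up, and `top_by_listeners` is a hypothetical helper that belongs to neither pigtail nor Dumbo. Note that `order b by listeners` sorts in ascending order, which the sketch reproduces.

```python
# Made-up records shaped like the Dumbo output that the Pig script loads:
# key = artist id, value = (listeners, listens).
records = [
    (1, (250, 1200)),
    (2, (40, 90)),
    (3, (975, 8800)),
]

def top_by_listeners(pairs, n=100):
    """Mimic the foreach / order / limit steps of the Pig script."""
    # foreach ... generate artist, val.listeners
    projected = [(artist, val[0]) for artist, val in pairs]
    # order ... by listeners (ascending, as in Pig's default)
    ordered = sorted(projected, key=lambda row: row[1])
    # limit ... n
    return ordered[:n]

result = top_by_listeners(records, n=2)
print(result)
```

The names and types in the `as (...)` clause of the Pig script play the same role as the tuple unpacking in the list comprehension here.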
A possibly useful side effect of this loader is that Pig can now read all sorts of file formats. Everything that Dumbo can read can now also be consumed by Pig scripts; all you have to do is write a simple Dumbo script that converts it to typed bytes sequence files:
from dumbo import run
from dumbo.lib import identitymapper

if __name__ == "__main__":
    run(identitymapper)
The proper solution is of course to write custom Pig loaders, but this gets the job done too and doesn’t slow things down that much.
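Conceptually, the conversion script does nothing more than pass every pair through unchanged, leaving the change of file format to Dumbo's input and output handling. The sketch below shows that behaviour locally; `identity` and `run_locally` are stand-ins written for illustration, not Dumbo's own implementation, and the input pairs are made up.

```python
def identity(key, value):
    """Pass a key/value pair through unchanged."""
    yield key, value

def run_locally(mapper, pairs):
    """Apply a mapper to every pair of an iterable and collect what it yields."""
    for key, value in pairs:
        for out in mapper(key, value):
            yield out

# Hypothetical input: (offset, line) pairs as a text input might provide them.
pairs = [(0, "first line"), (11, "second line")]
converted = list(run_locally(identity, pairs))
print(converted)
```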
Dumbo on Amazon EMR
December 23, 2009
A while ago, I received an email from Andrew in which he wrote:
Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client as so:
$ elastic-mapreduce --create --alive
SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo with the following two commands:
$ wget http://bit.ly/ezsetup
$ sudo python ez_setup.py dumbo
Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command.
$ dumbo start ipcount.py -hadoop /home/hadoop \
    -input s3://anhi-test-data/wordcount/input/ \
    -output s3://anhi-test-data/output/dumbo/wc/
The -hadoop option is important. At this point I haven’t created an automatic Dumbo install script, so you’ll have to install Dumbo by hand each time you launch the cluster. Fortunately installation is easy.
There was a minor hiccup that required the Amazon guys to pull the AMI with Dumbo support, but it’s back now and they seem to be confident that Dumbo support is going to remain available from now on. They are also still planning to make things even easier by providing an automatic Dumbo installation script.
As an aside, it’s worth mentioning that a bug in Hadoop Streaming got fixed in the process of adding Dumbo support to EMR. I can’t wait to see what else the Amazon guys have up their sleeves.
Dumbo over HBase
July 31, 2009
This should be old news for dumbo-user subscribers, but Tim has, once again, put his Java coding skills to good use. This time around he created nifty input and output formats for consuming and/or producing HBase tables from Dumbo programs. Here’s a silly but illustrative example:
from dumbo import opt, run

@opt("inputformat", "fm.last.hbase.mapred.TypedBytesTableInputFormat")
@opt("hadoopconf", "hbase.mapred.tablecolumns=testfamily:testqualifier")
def mapper(key, columns):
    for family, column in columns.iteritems():
        for qualifier, value in column.iteritems():
            yield key, (family, qualifier, value)

@opt("outputformat", "fm.last.hbase.mapred.TypedBytesTableOutputFormat")
def reducer(key, values):
    columns = {}
    for family, qualifier, value in values:
        # setdefault stores a newly created dict, so no qualifier gets lost
        column = columns.setdefault(family, {})
        column[qualifier] = value
    yield key, columns

if __name__ == "__main__":
    run(mapper, reducer)
Have a look at the readme for more information.
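To see what shape of data flows through such a program, the mapper and reducer logic can be exercised on a plain nested dictionary, with no HBase or Hadoop involved. The sketch below is written for Python 3 (hence `items()` rather than `iteritems()`), the row and its contents are made up, and the values are grouped by hand. It uses `setdefault` so that a newly created per-family dictionary is actually stored in `columns`.

```python
def mapper(key, columns):
    # Flatten {family: {qualifier: value}} into (family, qualifier, value) triples.
    for family, column in columns.items():
        for qualifier, value in column.items():
            yield key, (family, qualifier, value)

def reducer(key, values):
    # Rebuild the nested {family: {qualifier: value}} structure for one row.
    columns = {}
    for family, qualifier, value in values:
        columns.setdefault(family, {})[qualifier] = value
    yield key, columns

# Made-up row: a row key plus its columns.
row = ("row1", {"testfamily": {"testqualifier": "42", "other": "7"}})
triples = [value for _, value in mapper(*row)]
rebuilt = list(reducer(row[0], triples))
print(rebuilt)
```

Flattening and regrouping a row this way should give back the original row, which makes for a quick local check before involving a real table.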
Analysing Apache logs
June 18, 2009
The Cloudera guys blogged about using Pig for examining Apache logs yesterday. Although it nicely illustrates several lesser-known Pig features, I’m not overly impressed with the described program, to be honest. Having to resort to three different scripting languages to do some GeoIP lookups just complicates things too much if you ask me. Personally, I’d much prefer writing something like:
class Mapper:
    def __init__(self):
        from re import compile
        self.regex = compile(r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) ' \
                             r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] ' \
                             r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) ' \
                             r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" ' \
                             r'"(?P<agent>[^"]+)"')
        from pygeoip import GeoIP, MEMORY_CACHE
        self.geoip = GeoIP(self.params["geodata"], flags=MEMORY_CACHE)
    def __call__(self, key, value):
        mo = self.regex.match(value)
        if mo:
            request, bytes = mo.group("request"), mo.group("bytes")
            if request.startswith("GET") and bytes != "-":
                rec = self.geoip.record_by_addr(mo.group("ip"))
                country = rec["country_code"] if rec else "-"
                yield country, (1, int(bytes))

if __name__ == "__main__":
    from dumbo import run, sumsreducer
    run(Mapper, sumsreducer, combiner=sumsreducer)
After installing Python 2.6, I tested this hits_by_country.py program on my chrooted Cloudera-flavored Hadoop server as follows:
$ wget http://pygeoip.googlecode.com/files/pygeoip-0.1.1-py2.6.egg
$ wget http://bit.ly/geolitecity
$ wget http://bit.ly/randomapachelog  # found via Google
$ dumbo put access.log access.log -hadoop /usr/lib/hadoop
$ dumbo start hits_by_country.py -hadoop /usr/lib/hadoop \
    -input access.log -output hits_by_country \
    -python python2.6 -libegg pygeoip-0.1.1-py2.6.egg \
    -file GeoLiteCity.dat -param geodata=GeoLiteCity.dat
$ dumbo cat hits_by_country/part-00000 -hadoop /usr/lib/hadoop/ | \
    sort -k2,2nr | head -n 5
US 9400 388083137
KR 6714 2655270
DE 1859 32131992
RU 1838 44073038
CA 1055 23035208
At Last.fm, we use the GeoIP Python bindings instead of the pure-Python pygeoip module, which is nearly identical API-wise but might be a bit slower. Also, we abstract away the format of our Apache logs by using a parser class and we have some library code for identifying hits from robots as well, much like the IsBotUA() method in the Pig example.
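The parsing and summing logic can be tried out without Hadoop or a GeoIP database. The sketch below reuses the regular expression from the program, replaces the GeoIP lookup with a made-up dictionary (`FAKE_GEO` is purely hypothetical), and uses a hand-written `sum_tuples` helper to add up the (hits, bytes) tuples per country, which is an assumption about what a sums-style reducer amounts to. The log lines are invented.

```python
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) '
    r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) '
    r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" '
    r'"(?P<agent>[^"]+)"'
)

# Hypothetical stand-in for a GeoIP database lookup.
FAKE_GEO = {"192.0.2.1": "US", "198.51.100.7": "DE"}

def mapper(line):
    """Emit (country, (1, bytes)) for GET requests with a known size."""
    mo = LOG_LINE.match(line)
    if mo and mo.group("request").startswith("GET") and mo.group("bytes") != "-":
        country = FAKE_GEO.get(mo.group("ip"), "-")
        yield country, (1, int(mo.group("bytes")))

def sum_tuples(pairs):
    """Add up (hits, bytes) tuples element by element for each key."""
    totals = defaultdict(lambda: (0, 0))
    for key, (hits, size) in pairs:
        old_hits, old_size = totals[key]
        totals[key] = (old_hits + hits, old_size + size)
    return dict(totals)

lines = [
    '192.0.2.1 - - [18/Jun/2009:10:00:00 +0000] "GET /a HTTP/1.1" 200 100 "-" "curl/7.19"',
    '192.0.2.1 - - [18/Jun/2009:10:00:01 +0000] "GET /b HTTP/1.1" 200 250 "-" "curl/7.19"',
    '198.51.100.7 - - [18/Jun/2009:10:00:02 +0000] "POST /c HTTP/1.1" 200 50 "-" "curl/7.19"',
    '198.51.100.7 - - [18/Jun/2009:10:00:03 +0000] "GET /d HTTP/1.1" 304 - "-" "curl/7.19"',
]
pairs = [pair for line in lines for pair in mapper(line)]
hits_by_country = sum_tuples(pairs)
print(hits_by_country)
```

The POST request and the GET without a byte count are filtered out by the mapper, so only the two hits from the first address are counted.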
Integration with Java code
June 16, 2009
Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while, most likely for flexibility or speed. Thanks to a recent enhancement, this is now easily achievable. Here’s a version of wordcount.py that uses the example mapper and reducer from the feathers project (and thus requires -libjar feathers.jar):
import dumbo

dumbo.run("fm.last.feathers.map.Words",
          "fm.last.feathers.reduce.Sum",
          combiner="fm.last.feathers.reduce.Sum")
You can basically mix up Python with Java in any way you like. There’s only one minor restriction: you cannot use a Python combiner when you specify a Java mapper. Things should still work in this case, but they’ll be slower since the combiner won’t actually run. In theory, this limitation could be avoided by relying on HADOOP-4842, but personally I don’t think it’s worth the trouble.
The source code for fm.last.feathers.map.Words and fm.last.feathers.reduce.Sum is just as straightforward as the code for the OutputFormat classes discussed in my previous post. All you have to keep in mind is that only the mapper input keys and values can be arbitrary writables. Every other key or value has to be a TypedBytesWritable. Writing a custom Java partitioner for Dumbo programs is equally easy, by the way. The fm.last.feathers.partition.Prefix class is a simple example. It can be used by specifying -partitioner fm.last.feathers.partition.Prefix.
As you probably expected already, none of this will work for local runs on UNIX, but you can still test things locally fairly easily by running on Hadoop in standalone mode.
Multiple outputs
June 8, 2009
Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:
from dumbo import run, sumreducer, opt

def mapper(key, value):
    for word in value.split():
        yield word, 1

@opt("getpath", "yes")
def reducer(key, values):
    yield (key[0].upper(), key), sum(values)

if __name__ == "__main__":
    run(mapper, reducer, combiner=sumreducer)
Running this splitwordcount.py program on my chrooted Cloudera-flavored Hadoop server (after updating Dumbo and building feathers.jar) gave me the following results:
$ dumbo splitwordcount.py -input brian.txt -output brianwc \
    -hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
[...]
$ dumbo ls brianwc -hadoop /usr/lib/hadoop/
Found 17 items
drwxr-xr-x - klaas [...] /user/klaas/brianwc/A
drwxr-xr-x - klaas [...] /user/klaas/brianwc/B
drwxr-xr-x - klaas [...] /user/klaas/brianwc/C
[...]
$ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
be 2
boy 1
Brian 6
became 2
So each ((<path>, <key>), <value>) pair got stored as (<key>, <value>) in <outputdir>/<path>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.
Under the hood, -getpath yes basically just makes sure that -outputformat sequencefile (which is the default when running on Hadoop) and -outputformat text get translated to -outputformat fm.last.feathers.output.MultipleSequenceFiles and -outputformat fm.last.feathers.output.MultipleTextFiles, respectively. These OutputFormat implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new feathers project already provides a few other Java classes that can also easily be used by Dumbo programs, including a mapper and a reducer. I’ll try to find some time to ramble a bit about those as well, but that’s for another post.
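The routing of ((<path>, <key>), <value>) pairs to per-path outputs can be simulated locally with a dictionary of lists. This is only a sketch of the idea: `route` is a hypothetical helper standing in for what the output formats do on Hadoop, the grouping step is done by hand, and the input text is made up.

```python
from collections import defaultdict

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # The first element of the output key is the path, the second the real key.
    yield (key[0].upper(), key), sum(values)

def route(pairs):
    """Store each ((path, key), value) pair as (key, value) under its path."""
    outputs = defaultdict(list)
    for (path, key), value in pairs:
        outputs[path].append((key, value))
    return dict(outputs)

text = "be boy Brian be"
counts = defaultdict(list)
for word, one in mapper(None, text):
    counts[word].append(one)  # stand-in for the shuffle and group step
reduced = [pair for word in sorted(counts) for pair in reducer(word, counts[word])]
outputs = route(reduced)
print(outputs)
```

Every word in the sample starts with "b" or "B", so everything ends up under the single path "B", just like the brianwc/B directory in the session above.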
Dumbo on Cloudera’s distribution
May 31, 2009
Over the last couple of days, I picked up some rumors concerning the inclusion of all patches on which Dumbo relies in the most recent version of Cloudera’s Hadoop distribution. Todd confirmed this to me yesterday, so the time was right to finally have a look at Cloudera’s nicely packaged and patched-up Hadoop.
I started from a chrooted Debian server, on which I installed the Cloudera distribution, Python 2.5, and Dumbo as follows:
# cat /etc/apt/sources.list
deb http://ftp.be.debian.org/debian etch main contrib non-free
deb http://www.backports.org/debian etch-backports main contrib non-free
deb http://archive.cloudera.com/debian etch contrib
deb-src http://archive.cloudera.com/debian etch contrib
# wget -O - http://backports.org/debian/archive.key | apt-key add -
# wget -O - http://archive.cloudera.com/debian/archive.key | apt-key add -
# apt-get update
# apt-get install hadoop python2.5 python2.5-dev
# wget http://peak.telecommunity.com/dist/ez_setup.py
# python2.5 ez_setup.py dumbo
Then, I created a user for myself and confirmed that the wordcount.py program runs properly on Cloudera’s distribution in standalone mode:
# adduser klaas
# su - klaas
$ wget http://bit.ly/wordcountpy http://bit.ly/briantxt
$ dumbo start wordcount.py -input brian.txt -output brianwc \
    -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian 6
Unsurprisingly, it also worked perfectly in pseudo-distributed mode:
$ exit
# apt-get install hadoop-conf-pseudo
# /etc/init.d/hadoop-namenode start
# /etc/init.d/hadoop-secondarynamenode start
# /etc/init.d/hadoop-datanode start
# /etc/init.d/hadoop-jobtracker start
# /etc/init.d/hadoop-tasktracker start
# su - klaas
$ dumbo start wordcount.py -input brian.txt -output brianwc \
    -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm brianwc/_logs -hadoop /usr/lib/hadoop/
Deleted hdfs://localhost/user/klaas/brianwc/_logs
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian 6
Note that I removed the _logs directory first because dumbo cat would’ve complained about it otherwise. You can avoid this minor annoyance by disabling the creation of _logs directories.
I also verified that HADOOP-5528 got included by running the join.py example successfully:
$ wget http://bit.ly/joinpy
$ wget http://bit.ly/hostnamestxt http://bit.ly/logstxt
$ dumbo put hostnames.txt hostnames.txt -hadoop /usr/lib/hadoop/
$ dumbo put logs.txt logs.txt -hadoop /usr/lib/hadoop/
$ dumbo start join.py -input hostnames.txt -input logs.txt \
    -output joined -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm joined/_logs -hadoop /usr/lib/hadoop
$ dumbo cat joined -hadoop /usr/lib/hadoop | grep node1
node1 5
And while I was at it, I did a quick typedbytes versus ctypedbytes comparison as well:
$ zcat /usr/share/man/man1/python2.5.1.gz > python.man
$ for i in `seq 100000`; do cat python.man >> python.txt; done
$ du -h python.txt
1.2G python.txt
$ dumbo put python.txt python.txt -hadoop /usr/lib/hadoop/
$ time dumbo start wordcount.py -input python.txt -output pywc \
    -python python2.5 -hadoop /usr/lib/hadoop/
real 17m45.473s
user 0m1.380s
sys 0m0.224s
$ exit
# apt-get install gcc libc6-dev
# su - klaas
$ python2.5 ez_setup.py -zmaxd. ctypedbytes
$ time dumbo start wordcount.py -input python.txt -output pywc2 \
    -python python2.5 -hadoop /usr/lib/hadoop/ \
    -libegg ctypedbytes-0.1.5-py2.5-linux-i686.egg
real 13m22.420s
user 0m1.320s
sys 0m0.216s
In this particular case, ctypedbytes cuts the running time by about 25% (from 17m45s to 13m22s). Your mileage may vary since the running times depend on many factors, but in any case I’d always expect ctypedbytes to lead to significant speed improvements.
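For those who want to check the arithmetic, the two "real" times reported by `time` can be compared with a few lines of Python. The helper name `seconds` is just for illustration; the numbers are the ones from the session above.

```python
def seconds(minutes, secs):
    """Convert a 'XmY.Zs' style timing into seconds."""
    return minutes * 60 + secs

typedbytes_time = seconds(17, 45.473)   # "real" time with plain typedbytes
ctypedbytes_time = seconds(13, 22.420)  # "real" time with ctypedbytes

# Fraction of the running time that was saved.
time_saved = 1 - ctypedbytes_time / typedbytes_time
# How many times faster the second run was.
speedup = typedbytes_time / ctypedbytes_time

print("running time reduced by %.1f%%" % (100 * time_saved))
print("speedup factor: %.2f" % speedup)
```

The two figures describe the same measurement from different angles: a reduction in running time of roughly a quarter corresponds to a speedup factor of roughly one and a third.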
TF-IDF revisited
May 17, 2009
Remember the buffering problems for the TF-IDF program discussed in a previous post as well as the lecture about MapReduce algorithms from Cloudera’s free Hadoop training? Thanks to the new joining abstraction (and the minor fixes and enhancements in Dumbo 0.21.13 and 0.21.14), these problems can now easily be avoided:
from dumbo import *
from math import log

@opt("addpath", "yes")
def mapper1(key, value):
    for word in value.split():
        yield (key[0], word), 1

@primary
def mapper2a(key, value):
    yield key[0], value

@secondary
def mapper2b(key, value):
    yield key[0], (key[1], value)

@primary
def mapper3a(key, value):
    yield value[0], 1

@secondary
def mapper3b(key, value):
    yield value[0], (key, value[1])

class Reducer(JoinReducer):
    def __init__(self):
        self.sum = 0
    def primary(self, key, values):
        self.sum = sum(values)

class Combiner(JoinReducer):
    def primary(self, key, values):
        yield key, sum(values)

class Reducer1(Reducer):
    def secondary(self, key, values):
        for (doc, n) in values:
            yield key, (doc, float(n) / self.sum)

class Reducer2(Reducer):
    def __init__(self):
        Reducer.__init__(self)
        self.doccount = float(self.params["doccount"])
    def secondary(self, key, values):
        idf = log(self.doccount / self.sum)
        for (doc, tf) in values:
            yield (key, doc), tf * idf

def runner(job):
    job.additer(mapper1, sumreducer, combiner=sumreducer)
    multimapper = MultiMapper()
    multimapper.add("", mapper2a)
    multimapper.add("", mapper2b)
    job.additer(multimapper, Reducer1, Combiner)
    multimapper = MultiMapper()
    multimapper.add("", mapper3a)
    multimapper.add("", mapper3b)
    job.additer(multimapper, Reducer2, Combiner)

if __name__ == "__main__":
    main(runner)
Most of this Dumbo program shouldn’t be hard to understand if you had a peek at the posts about hiding join keys and the @opt decorator, except maybe for the following things:
- The first argument supplied to MultiMapper’s add method is a string corresponding to the pattern that has to occur in the file path in order for the added mapper to run on the key/value pairs in a given file. Since the empty string “” is considered to occur in every possible path string, all added mappers run on each input file in this example program.
- It is possible (but not necessary) to yield key/value pairs in a JoinReducer’s primary method, as illustrated by the Combiner class in this example.
- The default implementations of JoinReducer’s primary and secondary methods are identity operations, so Combiner combines the primary pairs and just passes on the secondary ones.
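The path-pattern rule from the first point can be illustrated with a tiny stand-in class. `PathDispatcher` below is hypothetical and only mimics the behaviour described above (a mapper runs when its pattern occurs in the file path); it is not how MultiMapper is actually implemented, and the paths are made up.

```python
class PathDispatcher(object):
    """Hypothetical stand-in that mimics the described pattern matching."""

    def __init__(self):
        self.mappers = []

    def add(self, pattern, mapper):
        self.mappers.append((pattern, mapper))

    def run(self, path, key, value):
        # A mapper runs when its pattern occurs somewhere in the file path.
        for pattern, mapper in self.mappers:
            if pattern in path:
                for pair in mapper(key, value):
                    yield pair

def tag_a(key, value):
    yield "a", value

def tag_b(key, value):
    yield "b", value

dispatcher = PathDispatcher()
dispatcher.add("", tag_a)      # the empty string occurs in every path
dispatcher.add("logs", tag_b)  # only paths that contain "logs"

from_logs = list(dispatcher.run("/data/logs/part-00000", 0, "x"))
from_other = list(dispatcher.run("/data/hostnames/part-00000", 0, "x"))
print(from_logs, from_other)
```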
Writing this program went surprisingly smoothly and didn’t take much effort at all. Apparently, the “primary/secondary abstraction” works really well for me.
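To sanity-check the numbers such a program should produce, the same three iterations can be traced in plain Python on a couple of tiny documents. This is a sketch under assumptions: the function `tfidf` and the sample documents are made up, everything fits in memory, and no joining machinery is involved.

```python
from collections import Counter, defaultdict
from math import log

def tfidf(docs):
    """Compute {(word, doc): tf * idf} for a {doc_name: text} mapping."""
    # Iteration 1: count the occurrences of each (doc, word) pair.
    counts = Counter((doc, word)
                     for doc, text in docs.items()
                     for word in text.split())

    # Iteration 2: divide each count by the number of words in its document.
    doc_totals = defaultdict(int)
    for (doc, word), n in counts.items():
        doc_totals[doc] += n
    tf = {(doc, word): float(n) / doc_totals[doc]
          for (doc, word), n in counts.items()}

    # Iteration 3: count the documents containing each word, then apply idf.
    doc_freq = Counter(word for (doc, word) in tf)
    doccount = float(len(docs))
    return {(word, doc): value * log(doccount / doc_freq[word])
            for (doc, word), value in tf.items()}

docs = {"a.txt": "spam spam eggs", "b.txt": "spam ham"}
scores = tfidf(docs)
for key in sorted(scores):
    print(key, scores[key])
```

A word that occurs in every document, like "spam" here, gets an idf of log(1) = 0 and therefore a score of zero, which is a handy property to test against.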
Hiding join keys
May 15, 2009
Dumbo 0.21.12 further simplifies joining by hiding the join keys from the “end-programmer”. For instance, the example discussed here can now be written as follows:
import dumbo
from dumbo.lib import MultiMapper, JoinReducer
from dumbo.decor import primary, secondary

def mapper(key, value):
    yield value.split("\t", 1)

class Reducer(JoinReducer):
    def primary(self, key, values):
        self.hostname = values.next()
    def secondary(self, key, values):
        key = self.hostname
        for value in values:
            yield key, value

if __name__ == "__main__":
    multimapper = MultiMapper()
    multimapper.add("hostnames", primary(mapper))
    multimapper.add("logs", secondary(mapper))
    dumbo.run(multimapper, Reducer)
These are the things to note in this fancier version:
- The mapping is implemented by combining decorated mappers into one MultiMapper.
- The reducing is implemented by extending JoinReducer.
- There is no direct interaction with the join keys. In fact, the -joinkeys yes option doesn’t even get specified explicitly (the decorated mappers and JoinReducer automatically make sure this option gets added via their opts attributes).
The primary and secondary decorators can, of course, also be applied using the @decorator syntax, i.e., I could also have written
@primary
def mapper1(key, value):
    yield value.split("\t", 1)

@secondary
def mapper2(key, value):
    yield value.split("\t", 1)
and
multimapper.add("hostnames", mapper1)
multimapper.add("logs", mapper2)
This is less convenient for this particular example, but it might be preferable when your primary and secondary mapper have different implementations.
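What the primary/secondary join computes can be reproduced locally with a dictionary. The sketch below is a plain-Python stand-in, not Dumbo code: `join` is a hypothetical helper, the tab-separated input lines are made up, and skipping keys without a hostname is an assumption made for the sketch.

```python
from collections import defaultdict

def join(hostnames, logs):
    """Reduce-side join: per key, the primary value is seen before the secondary ones."""
    grouped = defaultdict(lambda: {"primary": [], "secondary": []})
    for line in hostnames:
        key, value = line.split("\t", 1)
        grouped[key]["primary"].append(value)
    for line in logs:
        key, value = line.split("\t", 1)
        grouped[key]["secondary"].append(value)

    for key in sorted(grouped):
        group = grouped[key]
        if not group["primary"]:
            continue  # assumption: drop log entries without a hostname
        hostname = group["primary"][0]    # plays the role of Reducer.primary
        for value in group["secondary"]:  # plays the role of Reducer.secondary
            yield hostname, value

# Made-up tab-separated input lines.
hostnames = ["1\tnode1", "2\tnode2"]
logs = ["1\tstarted", "2\tstarted", "1\tstopped"]
joined = list(join(hostnames, logs))
print(joined)
```

The join key itself (the part before the tab) never shows up in the output, which mirrors how the join keys stay hidden in the Dumbo version.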
Posted by Klaas