June 16, 2009
Although Python has many advantages, you might still want to write some of your mappers or reducers in Java once in a while. Flexibility and speed are probably the most likely potential reasons. Thanks to a recent enhancement, this is now easily achievable. Here’s a version of wordcount.py that uses the example mapper and reducer from the feathers project (and thus requires -libjar feathers.jar):
You can basically mix up Python with Java in any way you like. There’s only one minor restriction: You cannot use a Python combiner when you specify a Java mapper. Things should still work in this case though, it’ll just be slow since the combiner won’t actually run. In theory, this limitation could be avoided by relying on HADOOP-4842, but personally I don’t think it’s worth the trouble.
The source code for fm.last.feathers.map.Words and fm.last.feathers.reduce.Sum is just as straightforward as the code for the OutputFormat classes discussed in my previous post. All you have to keep in mind is that only the mapper input keys and values can be arbitrary writables. Every other key or value has to be a TypedBytesWritable. Writing a custom Java partitioner for Dumbo programs is equally easy by the way. The fm.last.feather.partition.Prefix class is a simple example. It can be used by specifying -partitioner fm.last.feather.partition.Prefix.
As you probably expected already, none of this will work for local runs on UNIX, but you can still test things locally fairly easily by running on Hadoop in standalone mode.
June 8, 2009
Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:
from dumbo import run, sumreducer, opt
def mapper(key, value):
for word in value.split():
yield word, 1
def reducer(key, values):
yield (key.upper(), key), sum(values)
if __name__ == "__main__":
run(mapper, reducer, combiner=sumreducer)
Running this splitwordcount.py program on my chrooted Cloudera-flavored Hadoop server (after updating Dumbo and building feathers.jar) gave me the following results:
$ dumbo splitwordcount.py -input brian.txt -output brianwc \
-hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
$ dumbo ls brianwc -hadoop /usr/lib/hadoop/
Found 17 items
drwxr-xr-x - klaas [...] /user/klaas/brianwc/A
drwxr-xr-x - klaas [...] /user/klaas/brianwc/B
drwxr-xr-x - klaas [...] /user/klaas/brianwc/C
$ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
So each ((<path>, <key>), <value>) pair got stored as (<key>, <value>) in <outputdir>/<path>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.
Under the hood, -getpath yes basically just makes sure that -outputformat sequencefile (which is the default when running on Hadoop) and -outputformat text get translated to -outputformat fm.last.feathers.output.MultipleSequenceFiles and -outputformat fm.last.feathers.output.MultipleTextFiles, respectively. These OutputFormat implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new feathers project already provides a few other Java classes that can also easily be used by Dumbo programs, including a mapper and a reducer. I’ll try to find some time to ramble a bit about those as well, but that’s for another post.