Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:
from dumbo import run, sumreducer, opt

def mapper(key, value):
    for word in value.split():
        yield word, 1

@opt("getpath", "yes")
def reducer(key, values):
    yield (key[0].upper(), key), sum(values)

if __name__ == "__main__":
    run(mapper, reducer, combiner=sumreducer)
Running this splitwordcount.py program on my chrooted Cloudera-flavored Hadoop server (after updating Dumbo and building feathers.jar) gave me the following results:
$ dumbo splitwordcount.py -input brian.txt -output brianwc \
    -hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
[...]
$ dumbo ls brianwc -hadoop /usr/lib/hadoop/
Found 17 items
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/A
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/B
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/C
[...]
$ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
be      2
boy     1
Brian   6
became  2
So each ((<path>, <key>), <value>) pair got stored as (<key>, <value>) in <outputdir>/<path>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.
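To make the routing concrete, here is a minimal pure-Python sketch of what happens to the reducer output. It does not use Dumbo or Hadoop at all; split_by_path is a hypothetical helper that only mimics the grouping, and the word "and" with its count is made up so that more than one path shows up.

```python
from collections import defaultdict

def reducer(key, values):
    # Same logic as the reducer in splitwordcount.py:
    # the uppercased first letter becomes the output path.
    yield (key[0].upper(), key), sum(values)

def split_by_path(pairs):
    # Hypothetical helper (not part of Dumbo): store each
    # ((path, key), value) pair as (key, value) under its path.
    outputs = defaultdict(list)
    for (path, key), value in pairs:
        outputs[path].append((key, value))
    return dict(outputs)

# Grouped mapper output; "and" is invented for illustration.
grouped = {"be": [1, 1], "boy": [1], "Brian": [1] * 6,
           "became": [1, 1], "and": [1, 1, 1]}

pairs = [out for word in sorted(grouped)
         for out in reducer(word, grouped[word])]
outputs = split_by_path(pairs)
```

Here outputs["B"] plays the role of <outputdir>/B and holds plain (word, count) pairs, with the path component stripped off again.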
Under the hood, -getpath yes basically just makes sure that -outputformat sequencefile (which is the default when running on Hadoop) and -outputformat text get translated to -outputformat fm.last.feathers.output.MultipleSequenceFiles and -outputformat fm.last.feathers.output.MultipleTextFiles, respectively. These OutputFormat implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new feathers project already provides a few other Java classes that can also easily be used by Dumbo programs, including a mapper and a reducer. I’ll try to find some time to ramble a bit about those as well, but that’s for another post.
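The translation described above can be pictured roughly as follows. This is only a sketch of the idea, not Dumbo's actual code: translate_outputformat and MULTIPLE_OUTPUT_FORMATS are hypothetical names, and passing any other output format through unchanged is an assumption on my part.

```python
# Hypothetical lookup table; the class names are the ones mentioned above.
MULTIPLE_OUTPUT_FORMATS = {
    "sequencefile": "fm.last.feathers.output.MultipleSequenceFiles",
    "text": "fm.last.feathers.output.MultipleTextFiles",
}

def translate_outputformat(outputformat, getpath):
    # Only rewrite the shorthand names when -getpath yes is given.
    if getpath == "yes":
        # Assumption: anything else is left as-is.
        return MULTIPLE_OUTPUT_FORMATS.get(outputformat, outputformat)
    return outputformat
```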