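    from dumbo import run, sumreducer, opt

    def mapper(key, value):
        # Emit (word, 1) for every whitespace-separated word in the line.
        for word in value.split():
            yield word, 1

    @opt("getpath", "yes")
    def reducer(key, values):
        # The output key is a (path, key) tuple; the path here is the
        # uppercased first letter of the word, e.g. "B" for "boy".
        yield (key[0].upper(), key), sum(values)

    if __name__ == "__main__":
        run(mapper, reducer, combiner=sumreducer)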
    $ dumbo splitwordcount.py -input brian.txt -output brianwc \
        -hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
    [...]
    $ dumbo ls brianwc -hadoop /usr/lib/hadoop/
    Found 17 items
    drwxr-xr-x   - klaas [...] /user/klaas/brianwc/A
    drwxr-xr-x   - klaas [...] /user/klaas/brianwc/B
    drwxr-xr-x   - klaas [...] /user/klaas/brianwc/C
    [...]
    $ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
    be      2
    boy     1
    Brian   6
    became  2
So each ((<path>, <key>), <value>) pair got stored as (<key>, <value>) in <outputdir>/<path>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.
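To make the routing concrete, here is a minimal pure-Python sketch of what happens to the reducer output. It is not Dumbo or feathers code: split_by_path is a hypothetical helper, and the dictionary it returns only stands in for the per-path directories that Hadoop would actually write.

```python
from collections import defaultdict

def split_by_path(pairs):
    """Group ((path, key), value) pairs into {path: [(key, value), ...]}.

    Hypothetical helper that mimics, in memory, how each pair ends up
    under <outputdir>/<path> with the path stripped from its key.
    """
    outputs = defaultdict(list)
    for (path, key), value in pairs:
        outputs[path].append((key, value))
    return dict(outputs)

# The "B" pairs match the listing shown for brianwc/B above.
reducer_output = [
    (("B", "be"), 2),
    (("B", "boy"), 1),
    (("B", "Brian"), 6),
    (("B", "became"), 2),
    (("A", "a"), 3),  # made-up pair, for illustration only
]

files = split_by_path(reducer_output)
for path in sorted(files):
    for key, value in files[path]:
        print("%s\t%s\t%s" % (path, key, value))
```

Each entry in files corresponds to one output directory, and the keys inside it no longer carry the path component.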
Under the hood, -getpath yes basically just makes sure that -outputformat sequencefile (which is the default when running on Hadoop) and -outputformat text get translated to -outputformat fm.last.feathers.output.MultipleSequenceFiles and -outputformat fm.last.feathers.output.MultipleTextFiles, respectively. These OutputFormat implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new feathers project already provides a few other Java classes that can also easily be used by Dumbo programs, including a mapper and a reducer. I’ll try to find some time to ramble a bit about those as well, but that’s for another post.
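The option translation described above can be pictured roughly as follows. This is only a sketch of the idea, not Dumbo's actual implementation: translate_outputformat and the plain dict of options are assumptions made for illustration, and only the two class names are taken from the text.

```python
# Mapping from the short format names to the feathers OutputFormat classes.
MULTIPLE_FORMATS = {
    "sequencefile": "fm.last.feathers.output.MultipleSequenceFiles",
    "text": "fm.last.feathers.output.MultipleTextFiles",
}

def translate_outputformat(opts):
    """Return a copy of opts with -outputformat rewritten when getpath is on.

    Hypothetical helper: opts is a plain dict of option name -> value.
    A missing outputformat is treated as "sequencefile", the default
    when running on Hadoop.
    """
    opts = dict(opts)
    if opts.get("getpath") == "yes":
        fmt = opts.get("outputformat", "sequencefile")
        # Formats without a "multiple" counterpart are left untouched.
        opts["outputformat"] = MULTIPLE_FORMATS.get(fmt, fmt)
    return opts

print(translate_outputformat({"getpath": "yes"}))
print(translate_outputformat({"getpath": "yes", "outputformat": "text"}))
print(translate_outputformat({"outputformat": "text"}))
```

Without getpath the options pass through unchanged, which is why a program that does not use the decorator keeps its usual single output.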