Multiple outputs

June 8, 2009

Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:

from dumbo import run, sumreducer, opt

def mapper(key, value):
    for word in value.split():
        yield word, 1

@opt("getpath", "yes")
def reducer(key, values):
    yield (key[0].upper(), key), sum(values)

if __name__ == "__main__":
    run(mapper, reducer, combiner=sumreducer)

Running this splitwordcount.py program on my chrooted Cloudera-flavored Hadoop server (after updating Dumbo and building feathers.jar) gave me the following results:

$ dumbo splitwordcount.py -input brian.txt -output brianwc \
-hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
[...]
$ dumbo ls brianwc -hadoop /usr/lib/hadoop/
Found 17 items
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/A
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/B
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/C
[...]
$ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
be      2
boy     1
Brian   6
became  2

So each ((<path>, <key>), <value>) pair got stored as (<key>, <value>) in <outputdir>/<path>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.

Under the hood, -getpath yes basically just makes sure that -outputformat sequencefile (which is the default when running on Hadoop) and -outputformat text get translated to -outputformat fm.last.feathers.output.MultipleSequenceFiles and -outputformat fm.last.feathers.output.MultipleTextFiles, respectively. These OutputFormat implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new feathers project already provides a few other Java classes that can also easily be used by Dumbo programs, including a mapper and a reducer. I’ll try to find some time to ramble a bit about those as well, but that’s for another post.

Advertisements