Multiple outputs

Dumbo 0.21.20 adds support for multiple outputs by providing a -getpath option. Here’s an example:

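from dumbo import run, sumreducer, opt

def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the input line.
    for word in value.split():
        yield word, 1

# With getpath enabled, the reducer's output keys are (path, key) tuples.
@opt("getpath", "yes")
def reducer(key, values):
    # Use the uppercased first letter of the word as the output path.
    yield (key[0].upper(), key), sum(values)

if __name__ == "__main__":
    # sumreducer acts as the combiner and pre-sums the (word, 1) pairs.
    run(mapper, reducer, combiner=sumreducer)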

Running this splitwordcount.py program on my chrooted Cloudera-flavored Hadoop server (after updating Dumbo and building feathers.jar) gave me the following results:

$ dumbo splitwordcount.py -input brian.txt -output brianwc \
-hadoop /usr/lib/hadoop/ -python python2.5 -libjar feathers.jar
[...]
$ dumbo ls brianwc -hadoop /usr/lib/hadoop/
Found 17 items
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/A
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/B
drwxr-xr-x   - klaas [...] /user/klaas/brianwc/C
[...]
$ dumbo cat brianwc/B -hadoop /usr/lib/hadoop/
be      2
boy     1
Brian   6
became  2

So each ((<path>, <key>), <value>) pair got stored as (<key>, <value>) in <outputdir>/<path>. This only works when running on Hadoop, by the way. For a local run on UNIX everything would still end up in one file.
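
To make that routing a bit more concrete, here is a minimal plain-Python sketch of what conceptually happens to the reducer output. It uses neither Dumbo nor Hadoop: the route_by_path helper is hypothetical and only mimics the effect in memory by grouping pairs in a dictionary keyed on the path. The "B" entries reuse counts from the listing above, while the "A" entry is invented purely for illustration.

```python
from collections import defaultdict

def route_by_path(pairs):
    """Group ((path, key), value) pairs into {path: [(key, value), ...]}.

    Hypothetical helper for illustration only; it is not part of Dumbo
    or feathers, which write to directories on HDFS instead.
    """
    outputs = defaultdict(list)
    for (path, key), value in pairs:
        # The path component picks the output; only (key, value) is stored.
        outputs[path].append((key, value))
    return dict(outputs)

# Pairs shaped like the output of the reducer above.
reduced = [
    (("B", "be"), 2),
    (("B", "boy"), 1),
    (("B", "Brian"), 6),
    (("A", "a"), 3),  # invented sample entry, not taken from the real run
]

routed = route_by_path(reduced)
for path in sorted(routed):
    print(path, routed[path])
```

Each dictionary key here plays the role of a subdirectory such as brianwc/B, and each list holds the (<key>, <value>) pairs that would end up in it.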

Under the hood, -getpath yes basically just makes sure that -outputformat sequencefile (which is the default when running on Hadoop) and -outputformat text get translated to -outputformat fm.last.feathers.output.MultipleSequenceFiles and -outputformat fm.last.feathers.output.MultipleTextFiles, respectively. These OutputFormat implementations are nice illustrations of how easy it can be to integrate Java code with Dumbo programs. The brand-new feathers project already provides a few other Java classes that can also easily be used by Dumbo programs, including a mapper and a reducer. I’ll try to find some time to ramble a bit about those as well, but that’s for another post.
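
The translation described above can be sketched as a simple lookup. This is not Dumbo's actual option-handling code: the translate_outputformat function and the MULTIPLE_OUTPUT_FORMATS table are hypothetical names, and leaving any other output format untouched is an assumption on my part. Only the two class names come from the description above.

```python
# Mapping as described in the post: plain format -> feathers OutputFormat class.
MULTIPLE_OUTPUT_FORMATS = {
    "sequencefile": "fm.last.feathers.output.MultipleSequenceFiles",
    "text": "fm.last.feathers.output.MultipleTextFiles",
}

def translate_outputformat(outputformat, getpath):
    """Return the output format to use for a given -getpath setting.

    Hypothetical function for illustration; Dumbo's real logic may differ.
    """
    if getpath == "yes":
        # Assumption: formats without a "multiple" counterpart pass through.
        return MULTIPLE_OUTPUT_FORMATS.get(outputformat, outputformat)
    return outputformat

print(translate_outputformat("sequencefile", "yes"))
print(translate_outputformat("text", "yes"))
print(translate_outputformat("text", "no"))
```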

2 Responses to Multiple outputs

  1. [...] is just as straightforward as the code for the OutputFormat classes discussed in my previous post. All you have to keep in mind is that only the mapper input keys and values can be arbitrary [...]

  2. lramos85 says:

    Thanks for Feathers! I was looking for this.
