import dumbo
from dumbo.lib import MultiMapper, JoinReducer
from dumbo.decor import primary, secondary

def mapper(key, value):
    yield value.split("\t", 1)

class Reducer(JoinReducer):
    def primary(self, key, values):
        self.hostname = values.next()
    def secondary(self, key, values):
        key = self.hostname
        for value in values:
            yield key, value

if __name__ == "__main__":
    multimapper = MultiMapper()
    multimapper.add("hostnames", primary(mapper))
    multimapper.add("logs", secondary(mapper))
    dumbo.run(multimapper, Reducer)
These are the things to note in this fancier version:
- The mapping is implemented by combining decorated mappers into one MultiMapper.
- The reducing is implemented by extending JoinReducer.
- There is no direct interaction with the join keys. In fact, the -joinkeys yes option doesn’t even get specified explicitly (the decorated mappers and JoinReducer automatically make sure this option gets added via their opts attributes).
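To make the join semantics concrete, here is a minimal plain-Python sketch of what the job computes. It does not use dumbo or Hadoop. The function name simulate_join, the 0/1 tags and the sample lines are all made up for illustration, and dropping log lines that have no matching hostname is a choice made for this sketch, not a statement about how JoinReducer behaves.

```python
from itertools import groupby

def mapper(key, value):
    # Same split as in the listing: "ip<TAB>rest" becomes (ip, rest).
    yield tuple(value.split("\t", 1))

def simulate_join(hostname_lines, log_lines):
    """Toy in-memory join (hypothetical helper, not part of dumbo)."""
    tagged = []
    for line in hostname_lines:
        for k, v in mapper(None, line):
            tagged.append((k, 0, v))  # tag 0 marks a primary value
    for line in log_lines:
        for k, v in mapper(None, line):
            tagged.append((k, 1, v))  # tag 1 marks a secondary value

    # Sort by (key, tag) so the primary value comes first for every key.
    # The sort is stable, so log lines keep their input order per key.
    tagged.sort(key=lambda t: (t[0], t[1]))

    joined = []
    for _, group in groupby(tagged, key=lambda t: t[0]):
        hostname = None
        for _, tag, value in group:
            if tag == 0:
                hostname = value
            elif hostname is not None:
                # Assumption of this sketch: skip logs without a hostname.
                joined.append((hostname, value))
    return joined

if __name__ == "__main__":
    hostnames = ["1.2.3.4\thost-a", "5.6.7.8\thost-b"]
    logs = ["5.6.7.8\tGET /x", "1.2.3.4\tGET /y"]
    for pair in simulate_join(hostnames, logs):
        print(pair)
```

The sort on (key, tag) stands in for the ordering that the join keys provide in the real job: the reducer sees the hostname before any of the log values for the same key.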
The primary and secondary decorators can, of course, also be applied using the @decorator syntax, i.e., I could also have written:
@primary
def mapper1(key, value):
    yield value.split("\t", 1)

@secondary
def mapper2(key, value):
    yield value.split("\t", 1)

multimapper.add("hostnames", mapper1)
multimapper.add("logs", mapper2)
This is less convenient for this particular example, but it might be preferable when your primary and secondary mappers have different implementations.
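As a rough illustration of that last case, the sketch below gives the two mappers different parsing logic. The log format (IP address as the first space-separated field) and the names parse_hostname_line, parse_log_line, hostname_mapper and log_mapper are hypothetical. The identity stand-ins for the decorators are only there so the snippet runs without dumbo installed.

```python
try:
    from dumbo.decor import primary, secondary
except ImportError:
    # Stand-ins so the sketch runs without dumbo; they do nothing.
    def primary(func):
        return func
    secondary = primary

def parse_hostname_line(line):
    # Hostname file as in the post: "ip<TAB>hostname".
    ip, hostname = line.split("\t", 1)
    return ip, hostname

def parse_log_line(line):
    # Hypothetical log format: the IP is the first space-separated field.
    ip, rest = line.split(" ", 1)
    return ip, rest

@primary
def hostname_mapper(key, value):
    yield parse_hostname_line(value)

@secondary
def log_mapper(key, value):
    yield parse_log_line(value)
```

These would then be registered as in the listing above, with multimapper.add("hostnames", hostname_mapper) and multimapper.add("logs", log_mapper).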