As explained in the short tutorial, Dumbo lets you use a class as mapper and/or reducer (and also as combiner, of course, but a combiner is really just a reducer with a slightly different purpose). Up until now, the main benefit of this was that the constructor could be used for initializations such as loading the contents of a file into memory. Version 0.20.27 introduces another reason to use classes instead of functions, however, as illustrated by the following extended version of the sampling program from my previous post:
    class Mapper:
        def __init__(self):
            self.status = "Initialization started"
            from random import Random
            self.randgen = Random()
            self.cutoff = float(self.params["percentage"]) / 100
            self.samplesize = self.counters["Sample size"]
            self.status = "Initialization done"

        def __call__(self, key, value):
            if self.randgen.random() < self.cutoff:
                self.samplesize += 1
                yield key, value

    if __name__ == "__main__":
        from dumbo import run
        run(Mapper)
The things to note in this code are the attributes params, counters, and status, which seem to come out of nowhere. From version 0.20.27 onwards, Dumbo instantiates a dynamically generated class that inherits from both the class supplied by the programmer and dumbo.MapRedBase, resulting in the seemingly magical addition of the following fields:
- params: A dictionary that contains the parameters specified using -param <key>=<value> options.
- counters: You can think of this field as a defaultdict containing dumbo.Counter objects. When a given key has no corresponding counter yet, a new counter is created using the key as name for it. You can still change the name afterwards by assigning a different string to the counter’s name field, but by using a suitable key you can put out two candles with one blow.
- status: Strings assigned to this field will show up as the status message for the task in Hadoop’s web interface.
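
To make the trick less magical, the mechanism can be mimicked in plain Python. The sketch below is not Dumbo's actual code: FakeMapRedBase, Counter, Counters, and instantiate are hypothetical stand-ins written only for illustration, and the "limit" parameter and "Kept records" counter name are made up as well. It assumes the fields are attached before the programmer's constructor runs, which is what allows __init__ to read self.params.

```python
class Counter:
    """Hypothetical stand-in for dumbo.Counter: a named running total."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def __iadd__(self, amount):
        # Returning self keeps the same object bound after `counter += n`.
        self.value += amount
        return self


class Counters(dict):
    """Creates a Counter named after the key the first time it is looked up."""

    def __missing__(self, key):
        counter = self[key] = Counter(key)
        return counter


class FakeMapRedBase:
    """Hypothetical stand-in for dumbo.MapRedBase; only holds the fields."""
    params = {}
    counters = None
    status = ""


def instantiate(user_class, params):
    # Generate a class that inherits from both the user's class and the base,
    # attach the fields, and only then call the user's constructor.
    combined = type("Combined" + user_class.__name__,
                    (user_class, FakeMapRedBase), {})
    combined.params = params
    combined.counters = Counters()
    return combined()


class Mapper:
    def __init__(self):
        # params and counters are already available here.
        self.limit = int(self.params["limit"])
        self.kept = self.counters["Kept records"]

    def __call__(self, key, value):
        if key < self.limit:
            self.kept += 1
            yield key, value


mapper = instantiate(Mapper, {"limit": "2"})
output = [pair for key in range(5) for pair in mapper(key, "x")]
```

Because __iadd__ returns the counter itself, self.kept and mapper.counters["Kept records"] remain the same object after each increment, so the total is visible through either name.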
And as if this were not enough, you can now also use the
counter += amount
syntax to increment counters, instead of the less fancy (and harder to remember)