In an attempt to come up with a more interesting and elaborate example of a Dumbo program than the ones that can be found here, I wrote a simple recommender for Hadoop JIRA issues yesterday. This is how I obtained the data for this recommender:
$ wget -O jira_comments.xml 'http://issues.apache.org/jira/sr/jira.issueviews:'\ 'searchrequest-comments-rss/temp/SearchRequest.xml?pid=12310240' $ grep '<guid>' jira_comments.xml | sed 's/^.*HADOOP-\([0-9]*\).*$/\1/' > issues.txt $ grep '<author>' jira_comments.xml | sed 's/^.*\(.*\).*$/\1/' > authors.txt $ paste issues.txt authors.txt | egrep -v "Hudson|Hadoop QA" > comments.txt
These commands lead to a file comments.txt that lists the comment authors for each JIRA issue. From this file, we can compute Amazon-style recommendations of the form “people who commented on HADOOP-X, also commented on HADOOP-Y and HADOOP-Z” using the following Dumbo program:
from dumbo import sumreducer, nlargestreducer, nlargestcombiner, main from heapq import nlargest from math import sqrt def mapper1(key, value): issuenr, commenter = value.split("\t") yield (int(issuenr), commenter), 1 def mapper2(key, value): yield key, (value, key) def reducer2(key, values): values = nlargest(1000, values) norm = sqrt(sum(value**2 for value in values)) for value in values: yield (value, norm, key), value def mapper3(key, value): yield value, key def mapper4(key, value): for left, right in ((l, r) for l in value for r in value if l != r): yield (left[1:], right[1:]), left*right def mapper5(key, value): left, right = key yield left, (value / (left*right), right) def runner(job): job.additer(mapper1, sumreducer, combiner=sumreducer) job.additer(mapper2, reducer2) job.additer(mapper3, nlargestreducer(10000), nlargestcombiner(10000)) job.additer(mapper4, sumreducer, combiner=sumreducer) job.additer(mapper5, nlargestreducer(5), nlargestcombiner(5)) if __name__ == "__main__": main(runner)
Summing it up in one sentence, this program generates a similarity value for each pair of issues that have at least one comment author in common by computing the cosine similarity between the vectors consisting of the comment counts for each possible comment author, and then retains the five most similar issues for each issue. It might look a bit complicated at first, but it really is very simple. None of the code in this program should be difficult to understand for people who are already somewhat familiar with Dumbo (because they read the short tutorial, for instance), except for the probably not very well known functions sumreducer, nlargestreducer, and nlargestcombiner maybe. The names of these functions should be rather self-explanatory though, and if that’s not enough than you can still have a look at their nice and short definitions:
def sumreducer(key, values): yield key, sum(values) def nlargestreducer(n, key=None): def reducer(key_, values): yield key_, heapq.nlargest(n, itertools.chain(*values), key=key) return reducer def nlargestcombiner(n, key=None): def combiner(key_, values): yield key_, heapq.nlargest(n, values, key=key) return combiner
These are not the only convenient reducers and combiners defined in dumbo.py by the way. You might also want to have a look at sumsreducer, nsmallestreducer, nsmallestcombiner, statsreducer, and statscombiner, for instance.
A local run of the program on a single UNIX machine required 14 minutes to complete, while it took only 4 minutes on a modest 8-node Hadoop cluster. Hence, this example already illustrates the benefits of Hadoop quite nicely, even though the comsumed dataset is rather tiny. After using dumbo cat to save the computed output to recs.txt, you can then generate recommendations as follows:
$ grep '^4304' recs.txt | \ sed "s/^\([0-9]*\)/If you're interested in HADOOP-\1, you might also like:/" | \ sed "s/(\([^,]*\), \([0-9]*\))/\nHADOOP-\2\t(\1)/g" If you're interested in HADOOP-4304, you might also like: HADOOP-5252 (0.92827912163291404) HADOOP-4842 (0.92827912163291404) HADOOP-1722 (0.86128713072409002) HADOOP-150 (0.371884218998905) HADOOP-567 (0.37146329728054306)
In this case, the recommender claims that people who are interested HADOOP-4304 (“Add Dumbo to contrib”), might also be interested in
- HADOOP-5252 (“Streaming overrides -inputformat option”),
- HADOOP-4842 (“Streaming combiner should allow command, not just JavaClass”), and
- HADOOP-1722 (“Make streaming to handle non-utf8 byte array”),
which makes sense to me. Since the comments data is very sparse, the recommendations might not always make this much sense though, but that’s not the point of this post really. Instead, what you should remember from all this is that Dumbo allows you to write a scalable item-to-item collaborative filtering recommender by implementing six very simple generator functions, the longest one consisting of merely four lines of code.