Recommending JIRA issues

In an attempt to come up with a more interesting and elaborate example of a Dumbo program than the ones that can be found here, I wrote a simple recommender for Hadoop JIRA issues yesterday. This is how I obtained the data for this recommender:

$ wget -O jira_comments.xml 'http://issues.apache.org/jira/sr/jira.issueviews:'\
'searchrequest-comments-rss/temp/SearchRequest.xml?pid=12310240'
$ grep '<guid>' jira_comments.xml | sed 's/^.*HADOOP-\([0-9]*\).*$/\1/' > issues.txt
$ grep '<author>' jira_comments.xml | sed 's/^.*\(.*\).*$/\1/' > authors.txt
$ paste issues.txt authors.txt | egrep -v "Hudson|Hadoop QA" > comments.txt

These commands lead to a file comments.txt that lists the comment authors for each JIRA issue. From this file, we can compute Amazon-style recommendations of the form “people who commented on HADOOP-X, also commented on HADOOP-Y and HADOOP-Z” using the following Dumbo program:

from dumbo import sumreducer, nlargestreducer, nlargestcombiner, main
from heapq import nlargest
from math import sqrt

def mapper1(key, value):
    issuenr, commenter = value.split("\t")
    yield (int(issuenr), commenter), 1

def mapper2(key, value):
    yield key[0], (value, key[1])

def reducer2(key, values):
    values = nlargest(1000, values)
    norm = sqrt(sum(value[0]**2 for value in values))
    for value in values:
        yield (value[0], norm, key), value[1]

def mapper3(key, value):
    yield value, key

def mapper4(key, value):
    for left, right in ((l, r) for l in value for r in value if l != r):
        yield (left[1:], right[1:]), left[0]*right[0]

def mapper5(key, value):
    left, right = key
    yield left[1], (value / (left[0]*right[0]), right[1])

def runner(job):
    job.additer(mapper1, sumreducer, combiner=sumreducer)
    job.additer(mapper2, reducer2)
    job.additer(mapper3, nlargestreducer(10000), nlargestcombiner(10000))
    job.additer(mapper4, sumreducer, combiner=sumreducer)
    job.additer(mapper5, nlargestreducer(5), nlargestcombiner(5))

if __name__ == "__main__":
    main(runner)

Summing it up in one sentence, this program generates a similarity value for each pair of issues that have at least one comment author in common by computing the cosine similarity between the vectors consisting of the comment counts for each possible comment author, and then retains the five most similar issues for each issue. It might look a bit complicated at first, but it really is very simple. None of the code in this program should be difficult to understand for people who are already somewhat familiar with Dumbo (because they read the short tutorial, for instance), except for the probably not very well known functions sumreducer, nlargestreducer, and nlargestcombiner maybe. The names of these functions should be rather self-explanatory though, and if that’s not enough than you can still have a look at their nice and short definitions:

def sumreducer(key, values):
    yield key, sum(values)

def nlargestreducer(n, key=None):
    def reducer(key_, values):
        yield key_, heapq.nlargest(n, itertools.chain(*values), key=key)
    return reducer

def nlargestcombiner(n, key=None):
    def combiner(key_, values):
        yield key_, heapq.nlargest(n, values, key=key)
    return combiner

These are not the only convenient reducers and combiners defined in dumbo.py by the way. You might also want to have a look at sumsreducer, nsmallestreducer, nsmallestcombiner, statsreducer, and statscombiner, for instance.

A local run of the program on a single UNIX machine required 14 minutes to complete, while it took only 4 minutes on a modest 8-node Hadoop cluster. Hence, this example already illustrates the benefits of Hadoop quite nicely, even though the comsumed dataset is rather tiny. After using dumbo cat to save the computed output to recs.txt, you can then generate recommendations as follows:

$ grep '^4304' recs.txt | \
sed "s/^\([0-9]*\)/If you're interested in HADOOP-\1, you might also like:/" | \
sed "s/(\([^,]*\), \([0-9]*\))/\nHADOOP-\2\t(\1)/g"
If you're interested in HADOOP-4304, you might also like:
HADOOP-5252	(0.92827912163291404)
HADOOP-4842	(0.92827912163291404)
HADOOP-1722	(0.86128713072409002)
HADOOP-150	(0.371884218998905)
HADOOP-567	(0.37146329728054306)

In this case, the recommender claims that people who are interested HADOOP-4304 (“Add Dumbo to contrib”), might also be interested in

  • HADOOP-5252 (“Streaming overrides -inputformat option”),
  • HADOOP-4842 (“Streaming combiner should allow command, not just JavaClass”), and
  • HADOOP-1722 (“Make streaming to handle non-utf8 byte array”),

which makes sense to me. Since the comments data is very sparse, the recommendations might not always make this much sense though, but that’s not the point of this post really. Instead, what you should remember from all this is that Dumbo allows you to write a scalable item-to-item collaborative filtering recommender by implementing six very simple generator functions, the longest one consisting of merely four lines of code.

About these ads

9 Responses to Recommending JIRA issues

  1. [...] Consider the file commentcounts.txt, derived as follows from the comments.txt file that was generated as part of an earlier post: [...]

  2. [...] Top Posts HADOOP-1722 and typed bytes Mapper and reducer classesRandom samplingAboutRecommending JIRA issues [...]

  3. [...] Posts Recommending JIRA issuesHADOOP-1722 and typed [...]

  4. Just wanted to thank you guys for the great work on Dumbo. I did a similar Python example, but I didn’t use Dumbo because it isn’t supported yet on AWS. I’m pushing to get Dumbo included in the next release of Elastic MapReduce, it should be easier when AWS moves beyond 18.3,

    -Pete

    • Klaas says:

      Glad you like Dumbo, Pete. It would definitely be great if AWS would support it, and future versions of Hadoop (>= 0.21) should indeed make that easier since they won’t need any patching to make Dumbo work.

  5. Apple iPad says:

    great blog, you deserve a free iPad: http://bit.ly/freeipad6

  6. I don’t really like it when folks just copy and paste comments…A lot of people just comment to acquire a backlink, however, I can’t blame them because backlinking is one of the fundamentals to get a good pagerank and visitors. However if the comment is relevant to the topic, then it’s actually a far better backlink than a simple copy and paste and the importance of having loads of comments is so big. Thanks for your post.

  7. Sergey says:

    Actually for those interested in finding a professional solution for automatic issue recommendation for JIRA there is a SuggestiMate plugin for JIRA which smoothly integrates with JIRA and does exactly this: finds and recommends potentially duplicate or similar issues.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: