The HADOOP-1722 graph

I really enjoyed reading Paul’s fascinating blog post about generating artist similarity graphs. Personally, I would’ve used a different API and programming language, but his end result is just amazing. Since I computed similarities between Hadoop JIRA issues the other day, I started wondering whether it would be possible to generate such a graph for HADOOP-1722.

In addition to the similarities, Paul’s graph generation algorithm also requires familiarity/popularity scores, which conveniently allows me to add the obligatory Dumbo flavor to this post. Even if comments.txt were a huge file consisting of millions of (issue, commenter) pairs, the following Dumbo program could still be used to compute both the number of comments and the number of comment authors for each issue:

from dumbo import sumreducer, sumsreducer, main

def mapper1(key, value):
    # each input line is an "issue<TAB>commenter" pair
    issuenr, commenter = value.split("\t")
    yield (int(issuenr), commenter), 1

def mapper2(key, value):
    # key is (issue, commenter), value is that commenter's number of comments
    yield key[0], (value, 1)

def runner(job):
    # first iteration: number of comments per (issue, commenter) pair
    job.additer(mapper1, sumreducer, combiner=sumreducer)
    # second iteration: (total comments, number of commenters) per issue
    job.additer(mapper2, sumsreducer, combiner=sumsreducer)

if __name__ == "__main__":
    main(runner)

In my experience, such two-step counting is a rather common pattern in MapReduce programs. It might remind you of the number of plays and listeners shown on Last.fm's artist, album, and track pages, for example. Usually, the second count is a better measure of popularity, so that's what I'll be using to generate my graph.
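To make the pattern a bit more concrete, here's a rough plain-Python sketch of what the two iterations compute, ignoring all the distributed machinery (the toy pairs below are made up purely for illustration):

from collections import defaultdict

pairs = [(1722, "alice"), (1722, "bob"), (1722, "alice"), (496, "carol")]

# first step: number of comments per (issue, commenter) pair
percommenter = defaultdict(int)
for issue, commenter in pairs:
    percommenter[(issue, commenter)] += 1

# second step: sum those counts and count the commenters per issue
counts = defaultdict(lambda: (0, 0))
for (issue, commenter), ncomments in percommenter.iteritems():
    total, ncommenters = counts[issue]
    counts[issue] = (total + ncomments, ncommenters + 1)

print dict(counts)   # e.g. {1722: (3, 2), 496: (1, 1)}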

The file pops.code is what you obtain when you run the program above locally on comments.txt. It’s in a convenient human-readable text format that preserves the type information. By using the -ascode yes option with dumbo cat, you can save files from HDFS in this format as well, which is how I obtained the file recs.code. The following Python program generates the necessary Graphviz instructions from these files:

from dumbo import loadcode

# pops maps each issue number to (number of comments, number of commenters),
# recs maps each issue number to a list of (similarity score, issue) pairs
pops = dict(loadcode(open("pops.code")))
recs = dict(loadcode(open("recs.code")))

todos = [1722]
plotted = set(todos)
print 'digraph {'
print '"%s" [label="HADOOP-%s"]' % (todos[0], todos[0])
while todos:
    todo = todos.pop(0)
    for score, issue in recs[todo]:
        # only draw an edge when the similarity is high enough and the
        # recommended issue is less popular than the current one
        if score > 0.5 and pops[issue][1] < pops[todo][1]:
            print '"%s" -> "%s"' % (todo, issue)
            if issue not in plotted:
                todos.append(issue)
                plotted.add(issue)
                print '"%s" [label="HADOOP-%s"]' % (issue, issue)
print '}'
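If you want to turn these instructions into an actual image yourself, piping the output through Graphviz should do the trick, something like python graph.py | dot -Tpng -o hadoop-1722.png (with graph.py being whatever name you saved the script above under).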

Here’s the resulting graph, which seems to be a nice graphical overview of the four Hadoop issues I commented on:

[Figure: the HADOOP-1722 graph]

Not too bad a result, I guess, given the small amount of time and effort it required, but it’s not nearly as cool as Paul’s Led Zeppelin graph. I strongly encourage his wife to watch a few more episodes of Dr. House in the future.
