Indexing typed bytes

In order to be able to do something useful with the output of a Dumbo program, you often need to index it first. Suppose, for instance, that you ran a program that computes JIRA issue recommendations on lots and lots of issues, saving the output to jira/recs on the DFS. If the HADOOP-1722 patch has been applied to your Hadoop build, you can then dump this output to a fairly compact typed bytes file recs.tb as follows:

$ /path/to/hadoop/bin/hadoop jar \
/path/to/hadoop/build/contrib/streaming/*.jar dumptb jira/recs > recs.tb

Such a typed bytes file isn’t very convenient for doing lookups though, since you basically have to go through the whole thing to find the recommendations corresponding to a particular issue. Keeping it completely in memory as a hash table might be an option if you have tons of RAM, but indexing it and putting only the index in memory can do the trick as well.

Here’s a very simple, but nevertheless quite effective, way of generating an index for a typed bytes file:

import sys
import typedbytes

infile = open(sys.argv[1], "rb")
outfile = open(sys.argv[2], "wb")

input = typedbytes.PairedInput(infile)
output = typedbytes.PairedOutput(outfile)
pos = 0
for key, value in input:
    output.write((key, pos))
    pos = infile.tell()

infile.close()
outfile.close()

This Python program relies on the typedbytes module to read a given typed bytes file and generate a second one containing (key, position) pairs. More concretely, running it with the arguments recs.tb recs.tb.idx leads to an index file recs.tb.idx for recs.tb. If you put

file = open("recs.tb")
input = typedbytes.PairedInput(file)
index = dict(typedbytes.PairedInput(open("recs.tb.idx")))

in the initialization part of a Python program, you can then lookup the recommendations for a particular issuenr as follows:

file.seek(index[issuenr])
readnr, recs = input.read()
if readnr != issuenr:
    raise ValueError("corrupt index")

When dealing with huge output files, you might have to use a second-level index on (a sorted version of) the index obtained in this way, but in many cases this simple approach gets the job done just fine.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: