In order to be able to do something useful with the output of a Dumbo program, you often need to index it first. Suppose, for instance, that you ran a program that computes JIRA issue recommendations on lots and lots of issues, saving the output to jira/recs on the DFS. If the HADOOP-1722 patch has been applied to your Hadoop build, you can then dump this output to a fairly compact typed bytes file recs.tb as follows:
$ /path/to/hadoop/bin/hadoop jar \ /path/to/hadoop/build/contrib/streaming/*.jar dumptb jira/recs > recs.tb
Such a typed bytes file isn’t very convenient for doing lookups though, since you basically have to go through the whole thing to find the recommendations corresponding to a particular issue. Keeping it completely in memory as a hash table might be an option if you have tons of RAM, but indexing it and putting only the index in memory can do the trick as well.
Here’s a very simple, but nevertheless quite effective, way of generating an index for a typed bytes file:
import sys import typedbytes infile = open(sys.argv, "rb") outfile = open(sys.argv, "wb") input = typedbytes.PairedInput(infile) output = typedbytes.PairedOutput(outfile) pos = 0 for key, value in input: output.write((key, pos)) pos = infile.tell() infile.close() outfile.close()
This Python program relies on the typedbytes module to read a given typed bytes file and generate a second one containing (key, position) pairs. More concretely, running it with the arguments recs.tb recs.tb.idx leads to an index file recs.tb.idx for recs.tb. If you put
file = open("recs.tb") input = typedbytes.PairedInput(file) index = dict(typedbytes.PairedInput(open("recs.tb.idx")))
in the initialization part of a Python program, you can then lookup the recommendations for a particular issuenr as follows:
file.seek(index[issuenr]) readnr, recs = input.read() if readnr != issuenr: raise ValueError("corrupt index")
When dealing with huge output files, you might have to use a second-level index on (a sorted version of) the index obtained in this way, but in many cases this simple approach gets the job done just fine.