## TF-IDF revisited

May 17, 2009

Remember the buffering problems of the TF-IDF program discussed in a previous post, which also came up in the lecture about MapReduce algorithms from Cloudera's free Hadoop training? Thanks to the new joining abstraction (and the minor fixes and enhancements in Dumbo 0.21.13 and 0.21.14), these problems can now easily be avoided, as the Dumbo program below shows.
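
To make the idea concrete before looking at the Dumbo version, here is a minimal plain-Python sketch of why a join-style ordering removes the need to buffer. It does not use Dumbo at all: the `PRIMARY`/`SECONDARY` tags, `tag_records`, and `streaming_tf` are hypothetical names made up for this illustration, and the in-memory `sorted` call merely stands in for the sort that Hadoop performs between map and reduce. The assumption is that, for each key, all primary records reach the reducer before any secondary record.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tags for this sketch: PRIMARY sorts before SECONDARY within a key.
PRIMARY, SECONDARY = 0, 1

def tag_records(counts):
    """Emit one primary and one secondary record per ((doc, word), n) pair."""
    for (doc, word), n in counts:
        yield doc, PRIMARY, n             # contributes to the document total
        yield doc, SECONDARY, (word, n)   # normalized once the total is known

def streaming_tf(records):
    """Yield ((doc, word), tf) without holding a document's words in a list."""
    # Stand-in for the framework's shuffle/sort: order by (doc, tag) only.
    ordered = sorted(records, key=itemgetter(0, 1))
    for doc, group in groupby(ordered, key=itemgetter(0)):
        total = 0
        for _, tag, payload in group:
            if tag == PRIMARY:
                total += payload          # all primaries arrive first
            else:
                word, n = payload
                yield (doc, word), n / total

counts = [(("d1", "a"), 3), (("d1", "b"), 1), (("d2", "a"), 1)]
tf = dict(streaming_tf(tag_records(counts)))
# tf == {("d1", "a"): 0.75, ("d1", "b"): 0.25, ("d2", "a"): 1.0}
```

Because the total is complete before the first secondary record is seen, each term frequency can be emitted as soon as its record arrives. The program below appears to rely on the same ordering through its `primary` and `secondary` methods.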

```python
from dumbo import *
from math import log

# Iteration 1: emit ((doc, word), 1) for every word occurrence.
def mapper1(key, value):
    for word in value.split():
        yield (key[0], word), 1

# Iteration 2: join per-document totals (primary) with per-word counts (secondary).
@primary
def mapper2a(key, value):
    yield key[0], value

@secondary
def mapper2b(key, value):
    yield key[0], (key[1], value)

# Iteration 3: join per-word document counts (primary) with (doc, tf) pairs (secondary).
@primary
def mapper3a(key, value):
    yield value[0], 1

@secondary
def mapper3b(key, value):
    yield value[0], (key, value[1])

class Reducer(JoinReducer):
    def __init__(self):
        self.sum = 0

    def primary(self, key, values):
        # Remember the total so that secondary() can use it.
        self.sum = sum(values)

class Combiner(JoinCombiner):
    def primary(self, key, values):
        yield key, sum(values)

class Reducer1(Reducer):
    def secondary(self, key, values):
        # key is the document; each value is a (word, count) pair.
        for (word, n) in values:
            yield key, (word, float(n) / self.sum)

class Reducer2(Reducer):
    def __init__(self):
        Reducer.__init__(self)
        self.doccount = float(self.params["doccount"])

    def secondary(self, key, values):
        # key is the word; self.sum is the number of documents containing it.
        idf = log(self.doccount / self.sum)
        for (doc, tf) in values:
            yield (key, doc), tf * idf

def runner(job):
    multimapper = MultiMapper()