The Cloudera guys blogged about using Pig for examining Apache logs yesterday. Although it nicely illustrates several lesser-known Pig features, I’m not overly impressed with the described program to be honest. Having to revert to three different scripting languages to do some GeoIP lookups just complicates things too much if you ask me. Personally, I’d much prefer writing something like:
class Mapper:
def __init__(self):
from re import compile
self.regex = compile(r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) ' \
r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] ' \
r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) ' \
r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" ' \
r'"(?P<agent>[^"]+)"')
from pygeoip import GeoIP, MEMORY_CACHE
self.geoip = GeoIP(self.params["geodata"], flags=MEMORY_CACHE)
def __call__(self, key, value):
mo = self.regex.match(value)
if mo:
request, bytes = mo.group("request"), mo.group("bytes")
if request.startswith("GET") and bytes != "-":
rec = self.geoip.record_by_addr(mo.group("ip"))
country = rec["country_code"] if rec else "-"
yield country, (1, int(bytes))
if __name__ == "__main__":
from dumbo import run, sumsreducer
run(Mapper, sumsreducer, combiner=sumsreducer)
After installing Python 2.6, I tested this hits_by_country.py program on my chrooted Cloudera-flavored Hadoop server as follows:
$ wget http://pygeoip.googlecode.com/files/pygeoip-0.1.1-py2.6.egg $ wget http://bit.ly/geolitecity $ wget http://bit.ly/randomapachelog # found via Google $ dumbo put access.log access.log -hadoop /usr/lib/hadoop $ dumbo start hits_by_country.py -hadoop /usr/lib/hadoop \ -input access.log -output hits_by_country \ -python python2.6 -libegg pygeoip-0.1.1-py2.6.egg \ -file GeoLiteCity.dat -param geodata=GeoLiteCity.dat $ dumbo cat hits_by_country/part-00000 -hadoop /usr/lib/hadoop/ | \ sort -k2,2nr | head -n 5 US 9400 388083137 KR 6714 2655270 DE 1859 32131992 RU 1838 44073038 CA 1055 23035208
At Last.fm, we use the GeoIP Python bindings instead of the pure-Python pygeoip module, which is nearly identical API-wise but might be a bit slower. Also, we abstract away the format of our Apache logs by using a parser class and we have some library code for identifying hits from robots as well, much like the IsBotUA() method in the Pig example.