Analysing Apache logs

June 18, 2009

The Cloudera guys blogged about using Pig for examining Apache logs yesterday. Although it nicely illustrates several lesser-known Pig features, I’m not overly impressed with the described program, to be honest. Having to resort to three different scripting languages to do some GeoIP lookups just complicates things too much, if you ask me. Personally, I’d much prefer writing something like:

class Mapper:
    def __init__(self):
        from re import compile
        self.regex = compile(r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) ' \
                             r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] ' \
                             r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) ' \
                             r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" ' \
                             r'"(?P<agent>[^"]+)"')
        from pygeoip import GeoIP, MEMORY_CACHE
        self.geoip = GeoIP(self.params["geodata"], flags=MEMORY_CACHE)
    def __call__(self, key, value):
        mo = self.regex.match(value)
        if mo:
            request, bytes = mo.group("request"), mo.group("bytes")
            if request.startswith("GET") and bytes != "-":
                rec = self.geoip.record_by_addr(mo.group("ip"))
                country = rec["country_code"] if rec else "-"
                yield country, (1, int(bytes))

if __name__ == "__main__":
    from dumbo import run, sumsreducer
    run(Mapper, sumsreducer, combiner=sumsreducer)

After installing Python 2.6, I tested this program on my chrooted Cloudera-flavored Hadoop server as follows:

$ wget
$ wget
$ wget  # found via Google
$ dumbo put access.log access.log -hadoop /usr/lib/hadoop
$ dumbo start -hadoop /usr/lib/hadoop \
-input access.log -output hits_by_country \
-python python2.6 -libegg pygeoip-0.1.1-py2.6.egg \
-file GeoLiteCity.dat -param geodata=GeoLiteCity.dat
$ dumbo cat hits_by_country/part-00000 -hadoop /usr/lib/hadoop/ | \
sort -k2,2nr | head -n 5
US      9400    388083137
KR      6714    2655270
DE      1859    32131992
RU      1838    44073038
CA      1055    23035208
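
The three output columns are the country code, the number of GET hits, and the summed response bytes, i.e. the key plus the element-wise sum of the (1, bytes) tuples the mapper emits. For anyone who wants to poke at that logic without a Hadoop cluster, here's a rough local sketch in plain Python. It's not how Dumbo runs things: `fake_country`, `map_line`, `sum_tuples` and the sample log lines are all made up for illustration, the IP addresses come from the documentation ranges, and the lookup table merely stands in for a real GeoIP database. The pattern assumes the combined log format with a trailing user-agent field.

```python
import re
from collections import defaultdict

# Assumption: combined log format, ending with a quoted user-agent field.
LOG_RE = re.compile(
    r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) '
    r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) '
    r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" '
    r'"(?P<agent>[^"]+)"')

def fake_country(ip):
    """Hypothetical stand-in for the GeoIP lookup; not a real database."""
    return {"192.0.2.1": "US", "198.51.100.7": "DE"}.get(ip, "-")

def map_line(line):
    """Emit (country, (1, bytes)) for GET requests that report a byte count."""
    mo = LOG_RE.match(line)
    if mo:
        request, nbytes = mo.group("request"), mo.group("bytes")
        if request.startswith("GET") and nbytes != "-":
            yield fake_country(mo.group("ip")), (1, int(nbytes))

def sum_tuples(pairs):
    """Sum the (hits, bytes) tuples element-wise per key."""
    totals = defaultdict(lambda: (0, 0))
    for key, (hits, nbytes) in pairs:
        old_hits, old_bytes = totals[key]
        totals[key] = (old_hits + hits, old_bytes + nbytes)
    return dict(totals)

# Made-up sample lines: two GETs, one POST, one GET without a byte count,
# and one more GET.
SAMPLE = [
    '192.0.2.1 - - [18/Jun/2009:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [18/Jun/2009:10:00:01 +0000] "GET /a HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [18/Jun/2009:10:00:02 +0000] "POST /b HTTP/1.1" 200 300 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [18/Jun/2009:10:00:03 +0000] "GET /c HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    '198.51.100.7 - - [18/Jun/2009:10:00:04 +0000] "GET /d HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

result = sum_tuples(pair for line in SAMPLE for pair in map_line(line))
```

The POST request and the hit with "-" as its byte count get dropped by the mapper logic, so only three of the five lines end up in `result`.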

At, we use the GeoIP Python bindings instead of the pure-Python pygeoip module, which is nearly identical API-wise but might be a bit slower. Also, we abstract away the format of our Apache logs by using a parser class and we have some library code for identifying hits from robots as well, much like the IsBotUA() method in the Pig example.
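
To give an idea of what such an abstraction could look like, here's a minimal sketch. It's purely illustrative and not our actual library code: the `CombinedLogParser` class, the `is_bot_ua` helper and the `BOT_MARKERS` list are hypothetical names, and real robot detection would need a far more complete list than these few substrings.

```python
import re

class CombinedLogParser(object):
    """Hypothetical parser that hides the log format behind parse()."""

    # Assumption: combined log format with a trailing user-agent field.
    regex = re.compile(
        r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) '
        r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) '
        r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" '
        r'"(?P<agent>[^"]+)"')

    def parse(self, line):
        """Return the fields as a dict, or None if the line doesn't match."""
        mo = self.regex.match(line)
        return mo.groupdict() if mo else None

# Hypothetical and deliberately tiny list of substrings that hint at a robot.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_bot_ua(agent):
    """Rough check: does the user-agent string contain a known bot marker?"""
    agent = agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

parser = CombinedLogParser()
record = parser.parse(
    '192.0.2.1 - - [18/Jun/2009:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
```

With something like this in place, a mapper only has to call `parser.parse(value)` and skip records for which `is_bot_ua(record["agent"])` is true, without knowing anything about the regular expression.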