Analysing Apache logs

The Cloudera guys blogged about using Pig for examining Apache logs yesterday. Although it nicely illustrates several lesser-known Pig features, I’m not overly impressed with the described program to be honest. Having to revert to three different scripting languages to do some GeoIP lookups just complicates things too much if you ask me. Personally, I’d much prefer writing something like:

class Mapper:
    def __init__(self):
        from re import compile
        self.regex = compile(r'(?P<ip>[\d\.\-]+) (?P<id>[\w\-]+) ' \
                             r'(?P<user>[\w\-]+) \[(?P<time>[^\]]+)\] ' \
                             r'"(?P<request>[^"]+)" (?P<status>[\d\-]+) ' \
                             r'(?P<bytes>[\d\-]+) "(?P<referer>[^"]+)" ' \
                             r'"(?P<agent>[^"]+)"')
        from pygeoip import GeoIP, MEMORY_CACHE
        self.geoip = GeoIP(self.params["geodata"], flags=MEMORY_CACHE)
    def __call__(self, key, value):
        mo = self.regex.match(value)
        if mo:
            request, bytes = mo.group("request"), mo.group("bytes")
            if request.startswith("GET") and bytes != "-":
                rec = self.geoip.record_by_addr(mo.group("ip"))
                country = rec["country_code"] if rec else "-"
                yield country, (1, int(bytes))

if __name__ == "__main__":
    from dumbo import run, sumsreducer
    run(Mapper, sumsreducer, combiner=sumsreducer)

After installing Python 2.6, I tested this hits_by_country.py program on my chrooted Cloudera-flavored Hadoop server as follows:

$ wget http://pygeoip.googlecode.com/files/pygeoip-0.1.1-py2.6.egg
$ wget http://bit.ly/geolitecity
$ wget http://bit.ly/randomapachelog  # found via Google
$ dumbo put access.log access.log -hadoop /usr/lib/hadoop
$ dumbo start hits_by_country.py -hadoop /usr/lib/hadoop \
-input access.log -output hits_by_country \
-python python2.6 -libegg pygeoip-0.1.1-py2.6.egg \
-file GeoLiteCity.dat -param geodata=GeoLiteCity.dat
$ dumbo cat hits_by_country/part-00000 -hadoop /usr/lib/hadoop/ | \
sort -k2,2nr | head -n 5
US      9400    388083137
KR      6714    2655270
DE      1859    32131992
RU      1838    44073038
CA      1055    23035208

At Last.fm, we use the GeoIP Python bindings instead of the pure-Python pygeoip module, which is nearly identical API-wise but might be a bit slower. Also, we abstract away the format of our Apache logs by using a parser class and we have some library code for identifying hits from robots as well, much like the IsBotUA() method in the Pig example.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: