Over the last couple of days, I heard rumors that the most recent version of Cloudera’s Hadoop distribution includes all the patches on which Dumbo relies. Todd confirmed this to me yesterday, so the time was right to finally have a look at Cloudera’s nicely packaged and patched-up Hadoop.
I started from a chrooted Debian server, on which I installed the Cloudera distribution, Python 2.5, and Dumbo as follows:
    # cat /etc/apt/sources.list
    deb http://ftp.be.debian.org/debian etch main contrib non-free
    deb http://www.backports.org/debian etch-backports main contrib non-free
    deb http://archive.cloudera.com/debian etch contrib
    deb-src http://archive.cloudera.com/debian etch contrib
    # wget -O - http://backports.org/debian/archive.key | apt-key add -
    # wget -O - http://archive.cloudera.com/debian/archive.key | apt-key add -
    # apt-get update
    # apt-get install hadoop python2.5 python2.5-dev
    # wget http://peak.telecommunity.com/dist/ez_setup.py
    # python2.5 ez_setup.py dumbo
    # adduser klaas
    # su - klaas
    $ wget http://bit.ly/wordcountpy http://bit.ly/briantxt
    $ dumbo start wordcount.py -input brian.txt -output brianwc \
        -python python2.5 -hadoop /usr/lib/hadoop/
    $ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
    Brian 6
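The contents of wordcount.py are not shown here. As a rough sketch, assuming it follows the usual Dumbo word count pattern (a generator-based mapper and reducer), it would look something like the code below. The `run_locally` helper is hypothetical and not part of Dumbo; it only simulates the map, sort, and reduce steps in-process so the logic can be tried without Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # key: position of the line in the input (ignored); value: the line itself.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values: all the counts emitted for this word.
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical helper: simulate map -> sort -> reduce without Hadoop."""
    mapped = [pair
              for offset, line in enumerate(lines)
              for pair in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    counts = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for out_key, total in reducer(word, (v for _, v in group)):
            counts[out_key] = total
    return counts


# With Dumbo installed, the real script would presumably hand these two
# functions to Dumbo instead of calling run_locally, e.g.:
#     import dumbo
#     dumbo.run(mapper, reducer, combiner=reducer)
```

Calling `run_locally(["Brian is Brian", "not Brian"])` on a couple of sample lines is enough to check that the word counts come out as expected before submitting a real job.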
Unsurprisingly, it also worked perfectly in pseudo-distributed mode:
    $ exit
    # apt-get install hadoop-conf-pseudo
    # /etc/init.d/hadoop-namenode start
    # /etc/init.d/hadoop-secondarynamenode start
    # /etc/init.d/hadoop-datanode start
    # /etc/init.d/hadoop-jobtracker start
    # /etc/init.d/hadoop-tasktracker start
    # su - klaas
    $ dumbo start wordcount.py -input brian.txt -output brianwc \
        -python python2.5 -hadoop /usr/lib/hadoop/
    $ dumbo rm brianwc/_logs -hadoop /usr/lib/hadoop/
    Deleted hdfs://localhost/user/klaas/brianwc/_logs
    $ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
    Brian 6
Note that I removed the _logs directory first because dumbo cat would’ve complained about it otherwise. You can avoid this minor annoyance by disabling the creation of _logs directories.
    $ wget http://bit.ly/joinpy
    $ wget http://bit.ly/hostnamestxt http://bit.ly/logstxt
    $ dumbo put hostnames.txt hostnames.txt -hadoop /usr/lib/hadoop/
    $ dumbo put logs.txt logs.txt -hadoop /usr/lib/hadoop/
    $ dumbo start join.py -input hostnames.txt -input logs.txt \
        -output joined -python python2.5 -hadoop /usr/lib/hadoop/
    $ dumbo rm joined/_logs -hadoop /usr/lib/hadoop
    $ dumbo cat joined -hadoop /usr/lib/hadoop | grep node1
    node1 5
    $ zcat /usr/share/man/man1/python2.5.1.gz > python.man
    $ for i in `seq 100000`; do cat python.man >> python.txt; done
    $ du -h python.txt
    1.2G python.txt
    $ dumbo put python.txt python.txt -hadoop /usr/lib/hadoop/
    $ time dumbo start wordcount.py -input python.txt -output pywc \
        -python python2.5 -hadoop /usr/lib/hadoop/
    real 17m45.473s
    user 0m1.380s
    sys 0m0.224s
    $ exit
    # apt-get install gcc libc6-dev
    # su - klaas
    $ python2.5 ez_setup.py -zmaxd. ctypedbytes
    $ time dumbo start wordcount.py -input python.txt -output pywc2 \
        -python python2.5 -hadoop /usr/lib/hadoop/ \
        -libegg ctypedbytes-0.1.5-py2.5-linux-i686.egg
    real 13m22.420s
    user 0m1.320s
    sys 0m0.216s
In this particular case, ctypedbytes cut the running time by about 25% (from 17m45s to 13m22s). Your mileage may vary since the running times depend on many factors, but in any case I’d always expect ctypedbytes to lead to significant speed improvements.
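For anyone who wants to check the arithmetic, the two "real" values reported by time can be compared with a few lines of Python. This is just a sketch; the `to_seconds` helper is my own and assumes the `XmY.YYYs` format shown in the output above.

```python
def to_seconds(real):
    # Convert a "17m45.473s" style duration into seconds.
    minutes, seconds = real.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


without_ctypedbytes = to_seconds("17m45.473s")
with_ctypedbytes = to_seconds("13m22.420s")

# Fraction of the original running time that was saved.
reduction = (without_ctypedbytes - with_ctypedbytes) / without_ctypedbytes

# How many times faster the second run was.
speedup = without_ctypedbytes / with_ctypedbytes

print("running time reduced by %.1f%%" % (reduction * 100))
print("speedup factor: %.2fx" % speedup)
```

Note that a running time that is about a quarter shorter corresponds to a throughput that is roughly a third higher, so which percentage you quote depends on which of the two you are measuring.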