Over the last couple of days, I heard rumors that the most recent version of Cloudera’s Hadoop distribution includes all the patches Dumbo relies on. Todd confirmed this to me yesterday, so the time was right to finally have a look at Cloudera’s nicely packaged and patched-up Hadoop.
I started from a chrooted Debian server, on which I installed the Cloudera distribution, Python 2.5, and Dumbo as follows:
# cat /etc/apt/sources.list
deb http://ftp.be.debian.org/debian etch main contrib non-free
deb http://www.backports.org/debian etch-backports main contrib non-free
deb http://archive.cloudera.com/debian etch contrib
deb-src http://archive.cloudera.com/debian etch contrib
# wget -O - http://backports.org/debian/archive.key | apt-key add -
# wget -O - http://archive.cloudera.com/debian/archive.key | apt-key add -
# apt-get update
# apt-get install hadoop python2.5 python2.5-dev
# wget http://peak.telecommunity.com/dist/ez_setup.py
# python2.5 ez_setup.py dumbo
Then, I created a user for myself and confirmed that the wordcount.py program runs properly on Cloudera’s distribution in standalone mode:
# adduser klaas
# su - klaas
$ wget http://bit.ly/wordcountpy http://bit.ly/briantxt
$ dumbo start wordcount.py -input brian.txt -output brianwc \
    -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian 6
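For readers who haven't seen a Dumbo program before, here is a minimal sketch of the map and reduce steps behind a word count. It is not the wordcount.py downloaded above: the run_local helper and the sample lines are hypothetical stand-ins so that the logic can be tried without Hadoop or Dumbo installed.

```python
from collections import defaultdict


def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Sum all counts that were emitted for one word.
    yield key, sum(values)


def run_local(lines):
    # Hypothetical stand-in for the framework: map, group by key, reduce.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, count in mapper(offset, line):
            groups[word].append(count)
    results = {}
    for word in sorted(groups):
        for out_key, out_value in reducer(word, groups[word]):
            results[out_key] = out_value
    return results


# Made-up input lines, only for illustration.
counts = run_local(["Brian is here", "not Brian", "Brian"])
print(counts["Brian"])
```

In a real Dumbo script the two functions would be handed to dumbo.run(mapper, reducer, combiner=reducer) instead of a local driver, and Hadoop would take care of the grouping.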
Unsurprisingly, it also worked perfectly in pseudo-distributed mode:
$ exit
# apt-get install hadoop-conf-pseudo
# /etc/init.d/hadoop-namenode start
# /etc/init.d/hadoop-secondarynamenode start
# /etc/init.d/hadoop-datanode start
# /etc/init.d/hadoop-jobtracker start
# /etc/init.d/hadoop-tasktracker start
# su - klaas
$ dumbo start wordcount.py -input brian.txt -output brianwc \
    -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm brianwc/_logs -hadoop /usr/lib/hadoop/
Deleted hdfs://localhost/user/klaas/brianwc/_logs
$ dumbo cat brianwc -hadoop /usr/lib/hadoop/ | grep Brian
Brian 6
Note that I removed the _logs directory first because dumbo cat would’ve complained about it otherwise. You can avoid this minor annoyance by disabling the creation of _logs directories.
I also verified that HADOOP-5528 got included by running the join.py example successfully:
$ wget http://bit.ly/joinpy
$ wget http://bit.ly/hostnamestxt http://bit.ly/logstxt
$ dumbo put hostnames.txt hostnames.txt -hadoop /usr/lib/hadoop/
$ dumbo put logs.txt logs.txt -hadoop /usr/lib/hadoop/
$ dumbo start join.py -input hostnames.txt -input logs.txt \
    -output joined -python python2.5 -hadoop /usr/lib/hadoop/
$ dumbo rm joined/_logs -hadoop /usr/lib/hadoop
$ dumbo cat joined -hadoop /usr/lib/hadoop | grep node1
node1 5
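To give an idea of what a join over two inputs involves, here is a rough sketch of a reduce-side join in plain Python. It is not the join.py used above, and the file formats are assumptions: I pretend that hostnames.txt maps an address to a hostname and that logs.txt starts each line with an address. The function names and sample records are hypothetical as well. The point to notice is that records are grouped on the join key alone while still being ordered by a tag, which, as far as I understand it, is the kind of partitioning such a join needs from Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def map_hostnames(line):
    # Assumed format: "<address>\t<hostname>".
    address, hostname = line.split("\t")
    # Tag 0 marks the primary side so it sorts ahead of the log records.
    yield (address, 0), hostname


def map_logs(line):
    # Assumed format: "<address>\t<request>"; only the address matters here.
    address = line.split("\t")[0]
    # Tag 1 marks the secondary side.
    yield (address, 1), 1


def reduce_join(address, tagged_values):
    # tagged_values is ordered by tag, so a hostname (if any) arrives first.
    hostname = None
    hits = 0
    for tag, value in tagged_values:
        if tag == 0:
            hostname = value
        else:
            hits += value
    if hostname is not None and hits:
        yield hostname, hits


def run_local(hostname_lines, log_lines):
    records = []
    for line in hostname_lines:
        records.extend(map_hostnames(line))
    for line in log_lines:
        records.extend(map_logs(line))
    # Sorting on (address, tag) stands in for the shuffle: grouping happens
    # on the address alone, ordering within a group follows the tag.
    records.sort(key=itemgetter(0))
    joined = {}
    for address, group in groupby(records, key=lambda r: r[0][0]):
        tagged = [(key[1], value) for key, value in group]
        for hostname, hits in reduce_join(address, tagged):
            joined[hostname] = hits
    return joined


# Made-up sample data, not the contents of the downloaded files.
hostnames = ["10.0.0.1\tnode1", "10.0.0.2\tnode2"]
logs = ["10.0.0.1\t/index.html", "10.0.0.2\t/a", "10.0.0.1\t/b"]
joined = run_local(hostnames, logs)
print(joined)
```

Because the primary record is seen before any log record of the same address, the reducer never has to buffer the log side in memory.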
And while I was at it, I did a quick typedbytes versus ctypedbytes comparison as well:
$ zcat /usr/share/man/man1/python2.5.1.gz > python.man
$ for i in `seq 100000`; do cat python.man >> python.txt; done
$ du -h python.txt
1.2G python.txt
$ dumbo put python.txt python.txt -hadoop /usr/lib/hadoop/
$ time dumbo start wordcount.py -input python.txt -output pywc \
    -python python2.5 -hadoop /usr/lib/hadoop/
real 17m45.473s
user 0m1.380s
sys 0m0.224s
$ exit
# apt-get install gcc libc6-dev
# su - klaas
$ python2.5 ez_setup.py -zmaxd. ctypedbytes
$ time dumbo start wordcount.py -input python.txt -output pywc2 \
    -python python2.5 -hadoop /usr/lib/hadoop/ \
    -libegg ctypedbytes-0.1.5-py2.5-linux-i686.egg
real 13m22.420s
user 0m1.320s
sys 0m0.216s
In this particular case, the ctypedbytes run finished in about 25% less wall-clock time (13m22s versus 17m45s). Your mileage may vary since the running times depend on many factors, but in any case I’d always expect ctypedbytes to lead to significant speed improvements.
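The percentage can be checked directly from the two "real" values reported by time. The small helper below is only an illustration (to_seconds is a name I made up); it converts both timings to seconds and computes the fraction of time saved as well as the corresponding speedup factor.

```python
def to_seconds(real):
    # Parse a `time`-style value such as "17m45.473s" into seconds.
    minutes, seconds = real.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


typedbytes = to_seconds("17m45.473s")   # run without ctypedbytes
ctypedbytes = to_seconds("13m22.420s")  # run with the ctypedbytes egg

# Fraction of wall-clock time saved by the ctypedbytes run.
time_saved = 1 - ctypedbytes / typedbytes
# How many times faster the ctypedbytes run completed.
speedup = typedbytes / ctypedbytes

print("time saved: %.1f%%" % (time_saved * 100))
print("speedup: %.2fx" % speedup)
```

This works out to roughly 24.7% less running time, or a speedup of about 1.33x, for this single pair of runs.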