Virtual Python environments

May 24, 2009

Judging from some of the questions about Dumbo development that keep popping up, virtual Python environments are apparently not that widely known and used yet. Therefore, I thought it made sense to write a quick post about them.

The virtualenv tool can be installed as follows:

$ wget http://peak.telecommunity.com/dist/ez_setup.py
$ python ez_setup.py virtualenv

While you’re at it, you might also want to install nose by doing

$ python ez_setup.py nose

since running unit tests sometimes doesn’t work if you don’t install this module manually. Once you have virtualenv installed, you can create and activate a virtual Python environment as follows:

$ mkdir ~/envs
$ virtualenv ~/envs/dumbo
$ source ~/envs/dumbo/bin/activate

You then get a slightly different prompt to remind you that you’re using the isolated virtual environment:

(dumbo)$ which python
/home/username/envs/dumbo/bin/python
(dumbo)$ deactivate
$ which python
/usr/bin/python
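
If you want to check from within Python itself whether an isolated environment is active, something along the following lines should work. This is only a minimal sketch: the function name in_virtual_env is made up for illustration, and it relies on the sys.real_prefix attribute set by classic virtualenv as well as on the sys.base_prefix attribute used by newer tools.

```python
import sys


def in_virtual_env():
    """Best-effort check for an active virtual Python environment."""
    # Classic virtualenv stores the original prefix in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer tools (the standard venv module, recent virtualenv releases)
    # leave sys.base_prefix pointing at the global installation instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print(sys.prefix)        # root of the interpreter currently in use
print(in_virtual_env())  # whether that root belongs to a virtual environment
```

Run with the environment activated, sys.prefix should point somewhere under ~/envs/dumbo rather than at the global installation.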

Such a virtual environment can be very convenient for developing and debugging Dumbo:

$ source ~/envs/dumbo/bin/activate
(dumbo)$ git clone git://github.com/klbostee/dumbo.git
(dumbo)$ cd dumbo
(dumbo)$ python setup.py test  # run unit tests
(dumbo)$ python setup.py develop  # install symlinks
(dumbo)$ which dumbo
/home/username/envs/dumbo/bin/dumbo
(dumbo)$ cd examples
(dumbo)$ dumbo start wordcount.py -input brian.txt -output out.code
(dumbo)$ dumbo cat out.code | head -n 2
A       2
And     4

Any change you make to the source code will then immediately affect the behavior of dumbo in the virtual environment, and none of this interferes with your global Python installation in any way. Soon enough, you’ll start wondering how you ever managed to live without virtual Python environments.

Note that running on Hadoop won’t work if you installed Dumbo via python setup.py develop. The develop command installs symlinks to the source files (so that you don’t have to rerun it after each change while developing), but running on Hadoop requires an egg to be generated and installed, which is precisely what python setup.py install does.
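
A rough way to tell which of the two installs you currently have is to look at where the dumbo module gets loaded from. The helper below is a hypothetical sketch, not part of Dumbo: it assumes that an egg install shows up as a path component ending in .egg, whereas a develop install points back into the source checkout. The two example paths are made up.

```python
import os


def loaded_from_egg(module_file):
    """Return True if any component of the given path ends in '.egg'."""
    parts = os.path.normpath(module_file).split(os.sep)
    return any(part.endswith(".egg") for part in parts)


# Hypothetical locations, for illustration only.
egg_path = "/home/username/envs/dumbo/lib/site-packages/dumbo-X.Y.egg/dumbo/__init__.py"
dev_path = "/home/username/dumbo/dumbo/__init__.py"

print(loaded_from_egg(egg_path))
print(loaded_from_egg(dev_path))
```

In practice you would pass it dumbo.__file__ after importing dumbo in the activated environment.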


Dumbo IP count in C

May 3, 2009

It doesn’t fit very well with the goal of making it as easy as possible to write MapReduce programs, but Dumbo mappers and reducers can also be written in C instead of Python. I just put an example on GitHub to illustrate this. Although it’s nowhere near as convenient as using Python, writing a mapper or reducer in C is not that hard since you get to use the nifty Python C API, and in some specific cases the speed gains might be worth the extra effort. Moreover, setuptools nicely takes care of all the building and compiling, and you can limit the C code to the computationally expensive parts and still use Python for the rest.
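
For comparison, here is roughly what the pure Python side of an IP count looks like, which is the part a C implementation would replace. This is a sketch and not the code from the GitHub example: it assumes every input line is an access log entry that starts with the client IP, and run_locally is a made-up stand-in for the framework so the snippet can be tried without Dumbo or Hadoop.

```python
from collections import defaultdict


def mapper(key, value):
    # Assumption: the client IP is the first space-separated field of the line.
    yield value.split(" ")[0], 1


def reducer(key, values):
    # Sum the ones emitted by the mapper for this IP.
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical in-memory driver (not part of Dumbo): map, group, reduce."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for ip, count in mapper(offset, line):
            grouped[ip].append(count)
    result = {}
    for ip in sorted(grouped):
        for out_key, out_value in reducer(ip, grouped[ip]):
            result[out_key] = out_value
    return result


# Made-up sample input.
sample = [
    "10.0.0.1 - - GET /index.html",
    "10.0.0.2 - - GET /about.html",
    "10.0.0.1 - - GET /contact.html",
]
print(run_locally(sample))
```

With Dumbo itself, the same mapper and reducer functions would be handed to dumbo.run instead of the local driver.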