Reading Hadoop records in Python

December 23, 2009

At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows:

Hadoop records
     → CsvRecordInput
         → hadoop_record Python module

Although it’s a nice and very systematic solution, I couldn’t resist blogging about an alternative that already exists for this problem:

Hadoop records
     → TypedBytesRecordInput
         → typedbytes Python module

Not only would this have saved Paul a lot of work, it probably also would’ve been more efficient, especially when using ctypedbytes, the speedy variant of the typedbytes module.
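
To give a rough idea of what reading typed bytes involves on the Python side, here is a minimal sketch of a decoder for a small subset of the format. It is not the code of the typedbytes or ctypedbytes modules. It assumes the layout documented for Hadoop's typed bytes package: a one-byte type code followed by big-endian data, with codes 3, 4, 6, 7 and 8 standing for int, long, double, string and vector. The names read_value and read_pairs, as well as the sample bytes, are made up for this illustration.

```python
import io
import struct

# Type codes as documented for Hadoop's typed bytes format (subset only).
INT, LONG, DOUBLE, STRING, VECTOR = 3, 4, 6, 7, 8


def read_value(stream):
    """Decode one typed bytes value from a binary stream, or raise EOFError."""
    code = stream.read(1)
    if not code:
        raise EOFError
    t = code[0]
    if t == INT:
        return struct.unpack(">i", stream.read(4))[0]
    if t == LONG:
        return struct.unpack(">q", stream.read(8))[0]
    if t == DOUBLE:
        return struct.unpack(">d", stream.read(8))[0]
    if t == STRING:
        # 4-byte length, then that many UTF-8 bytes.
        (length,) = struct.unpack(">i", stream.read(4))
        return stream.read(length).decode("utf-8")
    if t == VECTOR:
        # 4-byte element count, then that many nested values.
        (count,) = struct.unpack(">i", stream.read(4))
        return [read_value(stream) for _ in range(count)]
    raise ValueError("type code %d not handled in this sketch" % t)


def read_pairs(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        try:
            key = read_value(stream)
        except EOFError:
            return
        yield key, read_value(stream)


# Hand-built sample: the string key "a" followed by the int value 1.
sample = b"\x07" + struct.pack(">i", 1) + b"a" + b"\x03" + struct.pack(">i", 1)
pairs = list(read_pairs(io.BytesIO(sample)))
```

In practice you would let the typedbytes module do this work; the sketch only shows why a binary, type-tagged stream can be cheaper to parse than CSV text.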

Getting organized

March 11, 2009

I created Assembla issue trackers for both Dumbo and Typedbytes a few moments ago. They aren’t quite as powerful as JIRA, but they integrate nicely with GitHub and should be more than sufficient for the time being.

UPDATE: We now use GitHub’s native issue trackers:

Dumbo 0.21 and typedbytes 0.3

March 11, 2009

I finally took the time to release Dumbo 0.21. All Java code is gone now, and the Python code got split up into submodules to make things clearer and more maintainable. Installing Dumbo itself is now very easy, but until Hadoop 0.21 gets released it requires a patched Hadoop, since it relies on HADOOP-1722 to replace the functionality provided by the removed Java code.

Thanks to Daniel Lescohier, the typedbytes Python module — which is now a required dependency for Dumbo — also had its version number increased. Version 0.3 of this module incorporates various improvements and optimizations, but one of these improvements requires another patch to be applied to Hadoop, unfortunately.

So, for now, you have to apply the patches from both HADOOP-1722 and HADOOP-5450 to be able to use the latest Dumbo offerings. Hopefully Hadoop 0.21 will remedy this minor annoyance, since the plan is to get HADOOP-5450 into that release, just like HADOOP-1722.