At the 11/18 Bay Area HUG, Paul Tarjan apparently presented an approach for reading Hadoop records in Python. In summary, his approach seems to work as follows:
Hadoop records → CsvRecordInput → hadoop_record Python module
Although it’s a nice and very systematic solution, I couldn’t resist blogging about an already existing alternative solution for this problem:
Hadoop records → TypedBytesRecordInput → typedbytes Python module
Not only would this have saved Paul a lot of work, it probably also would’ve been more efficient, especially when using ctypedbytes, the speedy variant of the typedbytes module.
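To give a feel for what the Python side of that second pipeline has to do, here is a minimal sketch of a typed bytes decoder. It is not the typedbytes module itself, which already does this for you and covers every type. It assumes the usual typed bytes layout (a one-byte type code followed by a big-endian payload) and handles only a handful of type codes; the function names `read_typed_bytes` and `_read_value` are made up for illustration.

```python
import io
import struct


def _read_value(stream, code):
    """Decode one value whose type code has already been read (subset of types)."""
    if code == 2:  # boolean: one byte, zero means False
        return stream.read(1) != b"\x00"
    if code == 3:  # integer: 4 bytes, big-endian, signed
        return struct.unpack(">i", stream.read(4))[0]
    if code == 4:  # long: 8 bytes, big-endian, signed
        return struct.unpack(">q", stream.read(8))[0]
    if code == 6:  # double: 8 bytes, big-endian
        return struct.unpack(">d", stream.read(8))[0]
    if code == 7:  # string: 4-byte length followed by UTF-8 bytes
        (length,) = struct.unpack(">i", stream.read(4))
        return stream.read(length).decode("utf-8")
    raise ValueError("type code %d is not handled in this sketch" % code)


def read_typed_bytes(stream):
    """Yield decoded values from a binary stream until it is exhausted."""
    while True:
        code = stream.read(1)
        if not code:
            return
        yield _read_value(stream, code[0])


# Hand-built sample: the string "hello" followed by the integer 42.
sample = (
    b"\x07" + struct.pack(">i", 5) + b"hello"
    + b"\x03" + struct.pack(">i", 42)
)
values = list(read_typed_bytes(io.BytesIO(sample)))
```

Because every value announces its own type up front, no text parsing or unescaping is needed, which is a plausible reason for the binary route to be cheaper than going through CSV.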