HUGUK #4

May 18, 2010

In response to Johan’s desperate request, I’ve decided to organize a 4th HUGUK meetup. More info will follow on the official HUGUK blog soon, but since it’s going to be fairly short notice, I thought it made sense to share some details here already:

The two main talks will be:

“Introduction to Sqoop” by Aaron Kimball

— Synopsis —

This talk introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop’s distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.

After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We’ll also cover some deeper technical details of Sqoop’s architecture, and take a look at some upcoming aspects of Sqoop’s development roadmap.

— Bio —

Aaron Kimball has been working with Hadoop since early 2007. Aaron has worked with the NSF and several universities, both nationally and internationally, to advance education in the field of large-scale data-intensive computing. He helped create and deliver academic course materials first used at the University of Washington (and later adopted by many other academic institutions) as well as Hadoop training materials used by several industry partners. Aaron has also worked as an independent consultant focusing on Hadoop and Amazon EC2-based systems. At Cloudera, he continues to actively develop Hadoop and related tools, as well as focus on training and user education. Aaron holds a B.S. in Computer Science from Cornell University, and an M.S. in Computer Science and Engineering from the University of Washington.

“Hive at Last.fm” by Tim Sell

— Synopsis —

This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.

— Bio —

Tim Sell is a Data Engineer at Last.fm who works with Hive and Hadoop on a daily basis.

As usual, we’ll try to provide some free beer at the end, and anyone is welcome to give a short lightning talk after the main presentations.


Powered by Dumbo?

May 9, 2009

I’ve slowly started taking on the slightly daunting task of writing my Ph.D. dissertation, and I’m considering including a chapter about Dumbo and Hadoop. However, thinking about this made me realize that I’m pretty clueless as to how many people are using Dumbo, and for what purposes it’s being used outside of Last.fm. I know for a fact that CBSi started using it recently, and there are a few other companies like Lookery that appear to be making use of it as well, but I don’t really know what exactly they’re using it for, and judging from the number of questions I keep getting, there must be more people out there who are using Dumbo for non-toy projects. So, if you aren’t just reading this blog out of personal interest, please drop me a line at klaas at last dot fm or add a comment to this post. It’ll make my day, and you might get an honorable mention in my dissertation. When the list is long enough, I might even devote an entire wiki page to it as well.


BinaryPartitioner backported to 0.18

May 6, 2009

Today, Tim put a smile on the faces of many Dumbo users at Last.fm by backporting HADOOP-5528 to Hadoop 0.18. Now that his backported patch has been deployed, we finally get to use join keys on our production clusters, which lets us join datasets more easily and avoid the memory-hungry alternatives.
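
To give an idea of what join keys buy us, here’s a minimal conceptual sketch in plain Python (deliberately not Dumbo’s real join API; the reducer, data, and names below are made up for illustration). The point is that the partitioning and sorting guarantee that, for each join key, the value from the primary dataset reaches the reducer before the values from the secondary one, so the secondary side can simply be streamed instead of buffered in memory:

    # Conceptual sketch of a reduce-side join with join keys (plain Python,
    # not Dumbo's real API). The framework is assumed to deliver, for each
    # join key, (source, value) pairs sorted so that source 0 (the primary
    # dataset) comes first.
    def join_reducer(key, tagged_values):
        primary = None
        for source, value in tagged_values:
            if source == 0:
                primary = value                  # e.g. the hostname for an IP
            elif primary is not None:
                yield key, (primary, value)      # stream the secondary side

    if __name__ == "__main__":
        # tiny in-memory demo of the idea
        values = [(0, "host.example.com"), (1, "GET /index.html"), (1, "GET /about")]
        for pair in join_reducer("10.0.0.1", iter(values)):
            print(pair)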


First post

February 23, 2009

Welcome to Dumbotics, the place where I plan to post regularly about Dumbo, the Python module that allows you to easily write and run Hadoop programs (not the flying circus elephant after which this module was named, that is). Hopefully, this will lead to more up-to-date and extensive documentation, and maybe some posts will even be (somewhat) appealing to people who are interested in Hadoop and MapReduce in general.
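
For those who have never seen any Dumbo code, here’s a rough sketch of what a word count program looks like (a minimal, illustrative example; the file name wordcount.py is just for the sake of the example):

    # wordcount.py: a minimal Dumbo word count sketch
    def mapper(key, value):
        # each value is a line of text; emit (word, 1) for every word in it
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # sum up the counts emitted for each word
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)

You would then run it with something along the lines of “dumbo start wordcount.py -input <input path> -output <output path>”, adding a -hadoop option to run it on an actual Hadoop cluster instead of locally.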