Deploying Hadoop on Traditional Supercomputers

Some time ago I posted a guide on my website that outlined some of the details involved in letting users of SDSC's Gordon supercomputer and FutureGrid's cloud clusters deploy semi-persistent Hadoop clusters via job scripts submitted to the existing batch system.  Although I discussed the concepts involved in letting users spin up Hadoop clusters on their own compute nodes, that whole guide assumes that Hadoop and myHadoop are already installed on the reader's cluster.

A number of people have contacted me and expressed interest in a guide that starts a little earlier in the process: how do you first install Hadoop and myHadoop on a cluster, and how do users then interact with that installation to submit jobs that spawn Hadoop clusters?  +Dana Brunson was kind enough to remind me that I had promised some people such a guide and never delivered it, so I set out to write one up this weekend.

It turned out that writing a sufficiently simple guide required a substantial rewrite of the myHadoop framework itself, so I wound up doing that as well.  I made a lot of changes (and broke backwards compatibility, sorry!), but the resulting guide on how to deploy Hadoop on a traditional high-performance computing cluster is now pretty simple.  Here it is:


In addition, I released a new version of myHadoop to accompany this guide:


I'm pretty excited about this, as it represents a very easy way to get people playing with the technology without having to invest in new infrastructure.  I was able to deploy it on a number of XSEDE resources entirely from userland, and there's no reason it wouldn't be just as simple on any machine with local disks and a batch system.

3-Step Install and 3-Step Spin-up

Although the details are in the guide I linked above, deploying Hadoop on a traditional cluster using myHadoop takes only three easy steps (a shell-command sketch follows this list):
  1. Download a stock Apache Hadoop 1.x binary distribution and myHadoop 0.30b
  2. Unpack both Hadoop and myHadoop into an installation directory (no actual installation required)
  3. Apply a single patch shipped with myHadoop to Apache Hadoop
  4. There is no step #4
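
For concreteness, the whole installation might look something like the shell session below.  The download URLs, version numbers, and patch file name are placeholders for whatever Hadoop 1.x tarball and myHadoop release you actually grab; myHadoop's own README describes the exact patch command to use.

    # Example installation directory
    cd $HOME/apps

    # Step 1: download a stock Apache Hadoop 1.x binary tarball and myHadoop 0.30b
    #         (URLs and version numbers below are examples)
    wget http://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
    wget https://github.com/glennklockwood/myhadoop/archive/v0.30b.tar.gz

    # Step 2: unpack both -- no compilation or "make install" needed
    tar xzf hadoop-1.2.1-bin.tar.gz
    tar xzf v0.30b.tar.gz

    # Step 3: apply the patch that ships with myHadoop to the Hadoop tree
    #         (the patch file name and -p level shown here are illustrative)
    cd hadoop-1.2.1
    patch -p0 < ../myhadoop-0.30b/myhadoop-1.2.1.patch
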
Spinning up a Hadoop cluster is similarly simple for users; an example job script follows this list.  Once Hadoop and myHadoop are installed, users can just
  1. Export a few environment variables ($PATH, $HADOOP_HOME, $JAVA_HOME, and $HADOOP_CONF_DIR)
  2. Run the myhadoop-configure.sh command and pass it $HADOOP_CONF_DIR and the location of each compute node's local disk filesystem (e.g., /scratch/$USER)
  3. Run Hadoop's "start-all.sh" script (no input needed)
  4. Like I said, no step #4
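
To make that concrete, here is a minimal sketch of what a Torque job script might look like.  All of the paths, versions, and #PBS resource requests are placeholders for your own site, and the myhadoop-configure.sh and myhadoop-cleanup.sh invocations are my assumption of the syntax; check the usage notes in your copy of myHadoop.

    #!/bin/bash
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:30:00

    # Step 1: export the environment (example paths)
    export HADOOP_HOME=$HOME/apps/hadoop-1.2.1
    export JAVA_HOME=/usr/java/latest
    export HADOOP_CONF_DIR=$HOME/hadoop-conf.$PBS_JOBID
    export PATH=$HADOOP_HOME/bin:$HOME/apps/myhadoop-0.30b/bin:$PATH

    # Step 2: generate a per-job Hadoop configuration that points HDFS and
    #         MapReduce at each compute node's local scratch disk
    myhadoop-configure.sh -c $HADOOP_CONF_DIR -s /scratch/$USER

    # Step 3: start the Hadoop cluster
    start-all.sh

    # ...run your Hadoop jobs here, e.g. "hadoop jar ..."...

    # shut the cluster down and clean up local scratch before the job exits
    stop-all.sh
    myhadoop-cleanup.sh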

Requirements

Hadoop clusters have fundamentally different architectures from traditional supercomputers because their workloads are data-intensive rather than compute-intensive.  However, any compute cluster that has
  • a local scratch disk in each node,
  • some form of TCP-capable interconnect (1gig, 10gig, or InfiniBand), and
  • a supported resource manager (currently Torque, SLURM, and SGE)
can run Hadoop in some capacity or another with myHadoop.

That first requirement is a bit of a lie.  I reimplemented myHadoop's "persistent mode" support so that you can also run HDFS off of a durable networked parallel filesystem like Lustre and keep your HDFS data around even when you don't have a Hadoop cluster spun up.  This undercuts the data-locality benefits of HDFS, though, so it's not the recommended approach.
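
If you want to try persistent mode anyway, the invocation might look something like the sketch below.  The -p flag, the other arguments, and the Lustre path are assumptions on my part for illustration; check the options your copy of myhadoop-configure.sh actually accepts before relying on this.

    # Hypothetical persistent-mode setup: keep HDFS block data on Lustre
    # instead of node-local scratch (flags and paths are illustrative)
    export HADOOP_CONF_DIR=$HOME/hadoop-conf.$PBS_JOBID
    myhadoop-configure.sh -p -c $HADOOP_CONF_DIR -s /lustre/scratch/$USER/hdfs_data
    start-all.sh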