Setting up Hadoop on OSX Mountain Lion

Everyone I know who deals with large amounts of data has been looking more closely at Hadoop as it has matured. With tools like Hive in particular, old data warehouse hands are taking a serious look at it as a better kind of long-term data archive and storage. You probably should too.

For most real work, it makes more sense to spin up Amazon’s EC2, Elastic MapReduce or another flavour of virtualized Hadoop instance in the cloud for the clustering and crunching benefits. Even so, it’s very good to have a local install for development and testing.

This walkthrough should get Hadoop up and running in about 15 minutes.

So What Is Hadoop?

Basically, what it allows you to do is take huge amounts of data, or computationally difficult problems, chop them into little bits, solve the little bits and then recombine them all to arrive at an answer for the original big problem.

While it uses open source implementations of Google’s MapReduce and the Google File System (GFS) to do that, Hadoop is also the open source set of tools built around this incredible ability to run data-intensive, distributed applications.

Perhaps most importantly, those tools let you run those applications on large clusters of cheap computers, bringing problems that were formerly the domain of supercomputers within the reach of us normal mortals, and in some cases making them solvable even more quickly.

Hadoop and all its tools do have a fairly steep learning curve, so a nice little HOWTO to get it installed locally is a Good Thing™.

Here we go.

Prerequisites

There are a couple. Just for reference, at the time of writing I’m running OSX 10.8.2 with all system updates, and Hadoop is at 1.1.1. The amazing Homebrew needs to be installed, as that’s what we’ll be using to get the package (if you’re not using it, you really should be).

Java

You do need Java installed, and it needs to be a 1.6.x version, so make sure you have it and that it’s working:

$ java -version
java version "1.6.0_41"
Java(TM) SE Runtime Environment (build 1.6.0_41-b02-445-11M4107)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-445, mixed mode)

If you don’t have Java installed, checking for the version should get you a prompt asking if you’d like it installed.
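If you'd rather script that check than eyeball it, here is a minimal sketch. The function name is_java_16 is my own invention, and it assumes the first line of the version output looks like the sample above.

```shell
# Hypothetical helper: succeeds if the first line of the captured
# `java -version` output reports a 1.6.x release.
is_java_16() {
  # $1 is the captured output of: java -version 2>&1
  first_line=$(printf '%s\n' "$1" | head -n 1)
  case "$first_line" in
    *'version "1.6.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (java -version writes to stderr, hence the 2>&1):
#   is_java_16 "$(java -version 2>&1)" && echo "Java 1.6 found"
```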

ssh

Hadoop nodes are managed via ssh. So there are two things here:

  1. You need to have Remote Login checked in the System Preferences | Sharing panel
  2. You need an ssh public/private key pair to be able to ssh into your node

I would normally assume anyone attempting to install Hadoop has already been exposed to ssh and probably uses it in their dev work (if you do, skip to the last two commands below, log into your local box and say yes to the authorization prompt). Just in case you don’t, though, type:

$ ssh-keygen -t rsa -P ""

and let the keys be saved to their default locations. When that’s done, you need to make sure your public key is authorized. The easiest CLI-fu to do that is:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Then ssh into localhost and into your actual host name to make sure it works:

$ ssh localhost
$ ssh Gunter.local

(replace Gunter with your machine’s network name, of course).
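One catch with the cat >> approach is that running it twice leaves the same key in authorized_keys twice. A small sketch of a repeatable version follows; authorize_key is a hypothetical name of mine, and the chmod is there on the assumption that your sshd is picky about file permissions.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if that exact line is not already present.
authorize_key() {
  pub_key_file=$1   # e.g. $HOME/.ssh/id_rsa.pub
  auth_file=$2      # e.g. $HOME/.ssh/authorized_keys
  touch "$auth_file"
  # -x whole line, -F fixed string, -f read the pattern from the key file
  if ! grep -qxF -f "$pub_key_file" "$auth_file"; then
    cat "$pub_key_file" >> "$auth_file"
  fi
  # keep the file private to the owner
  chmod 600 "$auth_file"
}

# Example use:
#   authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```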

Installing on Mountain Lion

We’re just going to install a single-node Hadoop running in pseudo-distributed mode, which is where each Hadoop daemon runs in a separate Java process.

The wonders of homebrew never cease. This is how you install hadoop with homebrew:

$ brew install hadoop

Note that the Homebrew install sets $JAVA_HOME to the output of /usr/libexec/java_home. Keep that in mind if you run into trouble with a custom configuration or have been using other Javas installed elsewhere.

Configuring Hadoop

So, it’s not quite that easy. Now you have to do a bit of fiddling with conf files before you can fire up your own data crunching elephant of fury.

There are four config files that need to be modified, and they are all located in /usr/local/Cellar/hadoop/1.1.1/libexec/conf:

  1. hadoop-env.sh
  2. core-site.xml
  3. hdfs-site.xml
  4. mapred-site.xml
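Before fiddling with those four files, it doesn't hurt to keep copies of the originals. This is just a sketch of how I'd do it; backup_conf and the .orig suffix are my own choices, not anything Hadoop or Homebrew provides.

```shell
# Hypothetical helper: copy each of the four config files to <name>.orig,
# keeping the first backup if the function is run again later.
backup_conf() {
  conf_dir=$1
  for f in hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml; do
    if [ -f "$conf_dir/$f" ] && [ ! -f "$conf_dir/$f.orig" ]; then
      cp -p "$conf_dir/$f" "$conf_dir/$f.orig"
    fi
  done
}

# Example use:
#   backup_conf /usr/local/Cellar/hadoop/1.1.1/libexec/conf
```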

hadoop-env.sh

With Homebrew, most of your work is already done here. However, a bug introduced in OSX 10.7 Lion appears to still affect Mountain Lion, and it may give you an Unable to load realm info from SCDynamicStore error.

Add the following line to the file with your text editor of choice. Personally, I put it at line 19, where HADOOP_OPTS is set.

export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
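If you'd rather not open an editor, the same change can be sketched as an append that is safe to re-run. The function name add_krb5_opts is made up, and the line goes at the end of the file rather than at line 19, which I'm assuming works just as well because the file is simply sourced. The two -D options are passed as separate, space-delimited flags with empty values.

```shell
# Hypothetical helper: append the krb5 workaround to hadoop-env.sh
# unless a krb5.realm setting is already mentioned in the file.
add_krb5_opts() {
  env_file=$1
  line='export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="'
  grep -qF 'java.security.krb5.realm' "$env_file" ||
    printf '%s\n' "$line" >> "$env_file"
}

# Example use:
#   add_krb5_opts /usr/local/Cellar/hadoop/1.1.1/libexec/conf/hadoop-env.sh
```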

core-site.xml

This file controls site-specific property overrides. You’ll have a few.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>

Two things:

  1. I’ve used localhost here because I’m assuming that only I will use this, for development. If you want to make it available to others on the network, change this to the .local name for your machine. For example, I would replace hdfs://localhost:9000 with hdfs://Gunter.local:9000 for users across my network.
  2. Hadoop must be able to write to the tmp directories to work.

hdfs-site.xml

This file controls the configuration of the Hadoop distributed file system. Since we’re only using a single node here, we just need to let it know it’s a single node and to keep one copy. You can optionally also tell it where to keep its data, in Hadoop-writable directories, but that appears to be unnecessary here.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml

Unsurprisingly, this file controls MapReduce overrides. The maximum values for the map and reduce tasks are completely optional, though wise. Note that in newer betas of Hadoop you can also choose between MapReduce 1 and MapReduce 2 in this file by setting a property named mapreduce.framework.name to classic or yarn.

A good starting point for the maximum number of mappers seems to be one per virtual core, and for reducers one per physical disk, or two per SSD (thanks to @andykent for that tip, though he says to use it as a starting point and then experiment). On my 2012 i5 MacBook Air, I’ve got 4 virtual cores thanks to Hyper-Threading (one chip with two physical cores).

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>

Again, as with the core-site file, I’ve simply used localhost here. To make this available across the network, you could put in the network name for your machine, like Gunter.local, and you should be able to access it.
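The rule of thumb for the mapper and reducer maximums can be sketched as a tiny calculation. The function name suggest_task_limits and its three arguments are mine, and counting spinning disks and SSDs separately is just my reading of the tip, so treat the numbers as a starting point to experiment from.

```shell
# Hypothetical helper: print starting values for the two tasktracker limits.
# Mappers: one per virtual core. Reducers: one per disk plus two per SSD.
suggest_task_limits() {
  virtual_cores=$1
  disks=$2
  ssds=$3
  echo "mapred.tasktracker.map.tasks.maximum=$virtual_cores"
  echo "mapred.tasktracker.reduce.tasks.maximum=$(( disks + 2 * ssds ))"
}

# Example use on OS X, where sysctl reports the number of logical cores
# (here assuming no spinning disks and one SSD):
#   suggest_task_limits "$(sysctl -n hw.ncpu)" 0 1
```

With 4 virtual cores, no spinning disks and one SSD, this gives the 4 and 2 used in the config above.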

Spooling up the FTLs - Initializing

HDFS needs to be initialized before you can use it. Formatting also ensures that Hadoop can write to the directories it needs.

$ hadoop namenode -format

This should give you a nice few lines of output which, if successful, should end with a message to the effect that the storage directory (in my case /tmp/hadoop-daryl/dfs/name) has been successfully formatted, followed by a shutdown message.

Congratulations! Your hadoop is ready to rock!

Jump! - Running Hadoop

You can get Hadoop started simply enough with:

$ /usr/local/Cellar/hadoop/1.1.1/libexec/bin/start-all.sh

and a few entries of your password if you haven’t gone for the passphraseless ssh key option above (I always use a password. I also recommend a ln -s to shorten the above command, as it’s handy).

Just to check that Hadoop is running, you can use the handy jps command. You should see output very similar to this:

$ jps
49770 TaskTracker
49678 JobTracker
49430 NameNode
49522 DataNode
49615 SecondaryNameNode
49823 Jps

That shows that Hadoop is ready and waiting to accept jobs.
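If you want that check in script form, here is a rough sketch that reads jps output and lists any of the five daemons that are missing. The function name missing_daemons is hypothetical, and it assumes the two-column "pid name" format shown above.

```shell
# Hypothetical helper: read jps output on stdin and print the name of
# each expected Hadoop daemon that does not appear in it.
missing_daemons() {
  jps_output=$(cat)
  for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # match the second column exactly so that NameNode does not
    # accidentally match SecondaryNameNode
    printf '%s\n' "$jps_output" |
      awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$d"
  done
}

# Example use (prints nothing when all five daemons are up):
#   jps | missing_daemons
```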

Find Pi to Test Hadoop

There are some nice example jobs available to make sure your system is actually running Hadoop and working. Try this one:

$ hadoop jar /usr/local/Cellar/hadoop/1.1.1/libexec/hadoop-examples-*.jar pi 10 100

It should take a while (and not get all that close to π), but you should see an output result at the end stating:

Estimated value of Pi is 3.14800000000000000000
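For a scripted smoke test, you could pull just the number out of that final line. This is only a sketch: extract_pi is a name I made up, and it assumes the result line looks exactly like the one above and arrives on standard output.

```shell
# Hypothetical helper: read the pi job's output on stdin and print only
# the estimate from the "Estimated value of Pi is ..." line.
extract_pi() {
  sed -n 's/^Estimated value of Pi is \([0-9.]*\).*$/\1/p'
}

# Example use:
#   hadoop jar /usr/local/Cellar/hadoop/1.1.1/libexec/hadoop-examples-*.jar pi 10 100 | extract_pi
```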

Anyhow, that’s it. You’re now running hadoop and it can process jobs.

TaskTracker, HDFS and MapReduce Admin

You also have access to your hadoop instance’s web-based administration panels for HDFS and for MapReduce for monitoring.

HDFS Administrator @ http://localhost:50070

MapReduce Administrator @ http://localhost:50030

Task Tracker @ http://localhost:50060

Shutting Down Hadoop

Why you ever would is beyond me, but just in case you don’t like all those background processes lying around, you can run:

$ /usr/local/Cellar/hadoop/1.1.1/libexec/bin/stop-all.sh

Again, this is something I’d personally ln -s, and you’ll need to type your passphrase a few times if you haven’t set up the passphraseless ssh key from above.
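As an alternative to symlinking the start and stop scripts, a small shell function in your profile does the same job. This is a sketch, not anything that ships with Hadoop: hadoopctl and the HADOOP_BIN variable are names I've invented, and the default path assumes the Homebrew 1.1.1 install used throughout this post.

```shell
# Assumed location of the Hadoop control scripts; override if yours differs.
HADOOP_BIN=${HADOOP_BIN:-/usr/local/Cellar/hadoop/1.1.1/libexec/bin}

# Hypothetical wrapper: "hadoopctl start" or "hadoopctl stop".
hadoopctl() {
  case "$1" in
    start) "$HADOOP_BIN/start-all.sh" ;;
    stop)  "$HADOOP_BIN/stop-all.sh" ;;
    *)     echo "usage: hadoopctl start|stop" >&2; return 2 ;;
  esac
}
```

Because the function calls the scripts by their full path, they still find the rest of the Hadoop install relative to where they actually live.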

Conclusion

And that’s about it. I intend to tweak this install guide as necessary as I find out more about the best ways to run Hadoop and whether there are better ways to configure the defaults (so if you see anything amiss, please let me know. Happy to hear from you).

In later posts, I’ll be going over installing and using Pig and some of the other tools in the Hadoop stack, as well as sharing some knowledge on Wukong, the Ruby on Hadoop DSL (um, as soon as I learn something about it myself).