Thursday, September 10, 2009

Building Hadoop and Hypertable on Debian Lenny

Hadoop and Hypertable on Debian

Environment:

Systems: (2) Sun VirtualBox 3.0.6 virtual machines, both Debian Lenny x86
Hypervisor system: Debian Lenny AMD64
Networking: Bridged adapter (not NAT) to eth0
g++: 4.3.2

*Note: The fact that this test deployment is on virtual machines is irrelevant to the configuration of Hadoop or Hypertable. Since the (2) virtual machines are on a shared disk, high-performance IO is not expected. However, should the testbed prove promising, dedicated systems will be deployed and high-performance tuning of both Hadoop and Hypertable will be explored.

Background:

At NSLS-II we are toying with the idea of back-ending the next generation of the Channel Archiver with a distributed database atop a distributed filesystem. This is for a myriad of reasons beyond it being a "cool project". For example, combining Hypertable (a high-performance distributed data storage system) with the MapReduce functionality of Hadoop promises performance, redundancy, reliability, and scalability.

Goals:

Build a (2) node Hadoop and Hypertable cluster. The (2) nodes are "systemA" (master) and "systemB" (slave).

(1) Prerequisites:

*Note: Building Hadoop and Hypertable in this document requires adding the unstable and (optionally) testing repositories to /etc/apt/sources.list.
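
For reference, the additional lines in /etc/apt/sources.list look something like the following (the mirror shown is just an example; use whichever Debian mirror you already have configured):

# /etc/apt/sources.list (additions)
deb http://ftp.debian.org/debian/ testing main
deb http://ftp.debian.org/debian/ unstable main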

(1) Make sure to specify that Lenny/stable is the default distribution:
Edit /etc/apt/apt.conf:
# apt.conf
APT::Default-Release "stable";

*Note: Only "stable", "testing", and "unstable" are acceptable release names here; i.e., "lenny" is not accepted.
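
With the default release pinned to stable, packages are pulled from testing or unstable only when you explicitly ask for them with -t (the package name below is just for illustration):

$ apt-get -t unstable install libboost-dev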

$ apt-get update
$ apt-get -y install g++ cmake libboost-dev liblog4cpp5-dev git-core cronolog libgoogle-perftools-dev libevent-dev zlib1g-dev libexpat1-dev libdb4.6++-dev libncurses-dev libreadline5-dev

(2) Install Hyperic-Sigar

$ wget http://internap.dl.sourceforge.net/sourceforge/sigar/hyperic-sigar-1.6.2.tar.gz
$ tar -xzvf hyperic-sigar-1.6.2.tar.gz
$ cp ~src/hyperic-sigar-1.6.2/sigar-bin/include/*.h /usr/local/include
$ cp ~src/hyperic-sigar-1.6.2/sigar-bin/lib/libsigar-x86-linux.so /usr/local/lib/
$ ldconfig
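
A quick sanity check that the runtime linker now sees the Sigar library (this verification step is my addition):

$ ldconfig -p | grep sigar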

(3) Install Thrift:

$ apt-get -y install sun-java6-jdk ant autoconf automake libtool bison flex pkg-config php5 php5-cli ruby-dev libhttp-access2-ruby libbit-vector-perl liblog4j1.2-java erlang ruby libevent-1.4-2

$ update-java-alternatives --set java-6-sun
$ ln -f -s /bin/bash /bin/sh
$ wget http://www.hypertable.org/pub/thrift-r796538.tgz
$ tar -xzvf thrift-r796538.tgz
$ cd thrift-r796538  # the name of the extracted directory may differ


* Note: Since I am behind a proxy I needed to set this variable:

$ export ANT_OPTS="-Dhttp.proxyHost=192.168.1.130 -Dhttp.proxyPort=3128"
$ ./bootstrap.sh
$ ./configure
$ make
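
If the build succeeds, Thrift is typically installed system-wide so that the Hypertable build can find it. This step (and the version check) is not in the original sequence and is assumed here:

$ make install
$ ldconfig
$ thrift -version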


(4) Build Hadoop:

(I) Get Hadoop:

Download the latest version of Hadoop. I untarred mine in /opt and made a symlink from hadoop-0.20.1 to hadoop:

The latest version can be found at this mirror:
http://ftp.wayne.edu/apache/hadoop/core/

At the time of writing this document, the latest Hadoop was version 0.20.1

$ wget http://ftp.wayne.edu/apache/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
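
Unpack the tarball into /opt (as mentioned above; the symlink is created below):

$ tar -xzvf hadoop-0.20.1.tar.gz -C /opt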

You can also retrieve the latest version via git, but the Hadoop directory tree differs from what is referred to in this document.

$ git clone git://git.apache.org/hadoop-common.git
$ git clone git://git.apache.org/hadoop-hdfs.git
$ git clone git://git.apache.org/hadoop-mapreduce.git


$ ln -s /opt/hadoop-0.20.1 /opt/hadoop

Some of the Hadoop C++ source needs patching in order for Hypertable to cooperate with MapReduce. This is the list of files requiring patching (a sketch of how to apply the patches follows the hunks below):
/opt/hadoop/src/c++/utils/impl/StringUtils.cc
/opt/hadoop/src/c++/utils/impl/SerialUtils.cc
/opt/hadoop/src/c++/pipes/impl/HadoopPipes.cc

Here are the patches:

(i) SerialUtils.cc:
--- SerialUtils.cc (revision 765057)
+++ SerialUtils.cc (working copy)
@@ -18,7 +18,8 @@
#include "hadoop/SerialUtils.hh"
#include "hadoop/StringUtils.hh"

-#include
+#include
+#include
#include
#include
#include

(ii) StringUtils.cc:
--- StringUtils.cc (revision 765057)
+++ StringUtils.cc (working copy)
@@ -18,10 +18,11 @@
#include "hadoop/StringUtils.hh"
#include "hadoop/SerialUtils.hh"

-#include
+#include
#include
-#include
-#include
+#include
+#include
+#include
#include

using std::string;
@@ -31,7 +32,7 @@

string toString(int32_t x) {
char str[100];
- sprintf(str, "%d", x);
+ snprintf(str, 100, "%d", x);
return str;
}

@@ -96,7 +97,7 @@
const char* deliminators) {

string result(str);
- for(int i=result.length() -1; i >= 0; --i) {
+ for(int i = result.length() - 1; i >= 0; --i) {
char ch = result[i];
if (!isprint(ch) ||
ch == '\\' ||
@@ -116,7 +117,7 @@
break;
default:
char buff[4];
- sprintf(buff, "\\%02x", static_cast<unsigned char>(result[i]));
+ snprintf(buff, 4, "\\%02x", static_cast<unsigned char>(result[i]));
result.replace(i, 1, buff);
}
}

(iii) HadoopPipes.cc:
--- HadoopPipes.cc (revision 765057)
+++ HadoopPipes.cc (working copy)
@@ -26,9 +26,9 @@
#include
#include
#include
-#include
-#include
-#include
+#include
+#include
+#include
#include
#include
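
If each hunk above is saved to its own file, it can be applied with patch(1). A minimal sketch, assuming the hunks were saved as separate files under ~/patches (the file names are hypothetical):

$ cd /opt/hadoop/src/c++/utils/impl
$ patch -p0 < ~/patches/SerialUtils.patch
$ patch -p0 < ~/patches/StringUtils.patch
$ cd /opt/hadoop/src/c++/pipes/impl
$ patch -p0 < ~/patches/HadoopPipes.patch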

(II) Compile Hadoop:

$ cd /opt/hadoop/src/c++/pipes
$ sh configure
$ make && make install

$ cd /opt/hadoop/src/c++/utils
$ sh configure
$ make && make install


This will place Hadoop headers in:
/opt/hadoop/src/c++/install/include/hadoop

And Hadoop libraries in:
/opt/hadoop/src/c++/install/lib

Add the location of the Hadoop libraries to ld:
$ echo "/opt/hadoop/src/c++/install/lib" > /etc/ld.so.conf.d/hadoop.conf
$ ldconfig


Next we're ready to compile Hadoop.
$ cd /opt/hadoop

*Note: Again, because I am behind a proxy, I needed to set this environmental variable:
$ export ANT_OPTS="-Dhttp.proxyHost=192.168.1.130 -Dhttp.proxyPort=3128"
$ ant compile && ant jar


(III) Configure Hadoop:

(A) My Hadoop configuration files are kept in /opt/hadoop/conf
Relevant configuration files in this test are:
hadoop-env.sh # Environmental variables
core-site.xml # Default Hadoop filesystem
hdfs-site.xml # HDFS defaults for replication, name, and datanode services
mapred-site.xml # MapReduce defaults for trackers
slaves
master

*Note: All configuration files except for "slaves" and "master" will be the same on both cluster nodes.

(i) hadoop-env.sh:

By default Hadoop (via Java) prefers IPv6. Since I am not using IPv6, this required a change:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

other variables set in hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
export HADOOP_PID_DIR=/var/hadoop/pids

This will require the creation of these directories (make them writable by the user that runs Hadoop, "hadoop" in this setup):
$ mkdir -p /var/log/hadoop
$ mkdir -p /var/hadoop/pids


(ii) core-site.xml (where "systemA" is the master node in the cluster):
<property>
<name>fs.default.name</name>
<value>hdfs://systemA:9000</value>

<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>

</property>
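
*Note: Only the <property> elements are shown here and in the *-site.xml snippets that follow; each file must wrap its properties in a <configuration> root element. A sketch of writing out the complete core-site.xml (the description text is omitted for brevity):

$ cat > /opt/hadoop/conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://systemA:9000</value>
  </property>
</configuration>
EOF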

(iii) hdfs-site.xml:
*Note: dfs.replication sets the number of replicas kept for each HDFS block; it is set to 2 here to match the (2) nodes in the cluster.
*Note: This configuration requires the creation of a few directories, namely "/dfsname" and "/hadoop/data". If your test system has more than (1) disk, it is advisable to keep the DataNode and NameNode directories on separate disks to avoid I/O contention.
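
Creating those directories on both nodes (handing ownership to the "hadoop" user is my assumption, since the daemons run under that account):

$ mkdir -p /dfsname /hadoop/data
$ chown -R hadoop:hadoop /dfsname /hadoop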

<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication. The actual number of replications can
be specified when the file is created. The default is used if replication is not
specified in create time.</description>

</property>
<property>
<name>dfs.name.dir</name>
<value>/dfsname</value>
<description>Path on local filesystem where the NameNode stores the
namespace and transactions logs persistently.</description>

</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/data</value>
<description>Comma separated list of paths on the local filesystem of
a DataNode where it should store its blocks</description>

</property>

(iv) mapred-site.xml:
*Note: This configuration requires the creation of the directory "/hadoop/mapred"
<property>
<name>mapred.job.tracker</name>

<value>systemA:9001</value>
<description>The host and port that the MapReduce job tracker runs at.
If "local", then jobs are run in-process as a single map and reduce task.*lt;/description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/mapred</value>
<description>Path on the HDFS where the MapReduce framework stores
system files e.g., /hadoop/mapred/system
</description>
</property>

(v) slaves:

*Note: both systems are slaves, so both are listed
*Note: this file is only configured on the master, "systemA"
systemA
systemB

(vi) Master (just the name of the master):
systemA

(B) SSH
Enable public-key authentication on all nodes within the Hadoop cluster. This will allow the Hadoop service to log on to other nodes and start/stop services.

I am running Hadoop as the user "hadoop".

On both machines, create the Hadoop user:
$ groupadd hadoop
$ useradd -g hadoop -c "Hadoop User" -d /opt/hadoop -s /bin/bash hadoop


On the master server, systemA:
$ su - hadoop
$ ssh-keygen -t rsa -P ""
$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys


Copy id_rsa.pub over to the slave system and append it to the same file there: ~hadoop/.ssh/authorized_keys
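
One way to do this (assuming password authentication is still enabled for the hadoop account at this point):

$ scp ~/.ssh/id_rsa.pub hadoop@systemB:/tmp/
$ ssh hadoop@systemB "mkdir -p ~/.ssh && cat /tmp/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"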

(C) Local name lookup
In /etc/hosts on both machines:
Enter the IP addresses of all Hadoop nodes and remove the "127.0.0.1 localhost" entry, so that "localhost" resolves to the machine's assigned IP address.
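
An example for this (2) node setup (the addresses are placeholders for whatever your nodes are actually assigned; on systemB, point "localhost" at systemB's address instead):

# /etc/hosts on systemA
192.168.1.10   systemA localhost
192.168.1.11   systemB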

(IV) Initialize Hadoop:

*Note: Initialization is only necessary on the Master node.
*Note: For convenience, I've added Hadoop to my path:

$ export PATH=$PATH:/opt/hadoop/bin


$ hadoop namenode -format
$ start-dfs.sh
$ start-mapred.sh


Check to see what's running on both the master and slave:
systemA $ jps
4388 Jps
28444 NameNode
28795 JobTracker
2181 main
28696 SecondaryNameNode
28575 DataNode
28903 TaskTracker

systemB $ jps
13033 DataNode
13142 TaskTracker
1675 Jps
539 main

Also check which ports the Hadoop daemons are listening on:
systemA $ netstat -ptlen

(V) Create a directory in the Hadoop namespace:

$ hadoop dfs -mkdir /hypertable
$ hadoop dfs -chmod 777 /hypertable
$ hadoop dfs -ls /
$ hadoop dfsadmin -report


(5) Hypertable installation:

(I) Get Hypertable:

$ apt-get -y install git sparsehash libbz2-dev doxygen graphviz
$ git config --global user.name "First Lastname"
$ git config --global user.email "something@something.com"
$ git clone git://scm.hypertable.org/pub/repos/hypertable.git


(II) Pre-build:

(A) Fixes:

The Hypertable source requires some patching so that the build cooperates with Debian's g++. The files that need to be patched are:
~src/hypertable/contrib/cc/MapReduce/TableReader.cc
~src/hypertable/contrib/cc/MapReduce/TableRangeMap.cc

(i) TableReader.cc:
--- a/TableReader.cc
+++ b/TableReader.cc
@@ -24,7 +24,7 @@ TableReader::TableReader(HadoopPipes::MapContext& context)
HadoopUtils::deserializeString(start_row, stream);
HadoopUtils::deserializeString(end_row, stream);

- scan_spec_builder.add_row_interval(start_row, true, end_row, true);
+ scan_spec_builder.add_row_interval(start_row.c_str(), true, end_row.c_str(), true);

if (allColumns == false) {
std::vector<std::string> columns;
@@ -32,7 +32,7 @@ TableReader::TableReader(HadoopPipes::MapContext& context)

split(columns, job->get("hypertable.table.columns"), is_any_of(", "));
BOOST_FOREACH(const std::string &c, columns) {
- scan_spec_builder.add_column(c);
+ scan_spec_builder.add_column(c.c_str());
}
}
m_scanner = m_table->create_scanner(scan_spec_builder.get());

(ii) TableRangeMap.cc:
--- a/TableRangeMap.cc
+++ b/TableRangeMap.cc
@@ -28,7 +28,7 @@ namespace Mapreduce

startrow = tmprow;

- meta_scan_builder.add_row_interval(startrow, true, startrow + "\xff\xff", true);
+ meta_scan_builder.add_row_interval(startrow.c_str(), true, (startrow + "\xff\xff").c_str(), true);

/* select columns */
meta_scan_builder.add_column("StartRow");

(B) Hypertable config

Edit ~src/hypertable/conf/hypertable.cfg and enter the information about the Hadoop Master:
# HDFS Broker
HdfsBroker.Port=38030
HdfsBroker.fs.default.name=hdfs://systemA:9000
HdfsBroker.Workers=20

(III) Build Hypertable:

Assuming that the Hypertable source has been cloned into ~src, create an out-of-source build directory:
$ mkdir -p ~src/build/hypertable
$ cd ~src/build/hypertable


*Note: Some important variables need to be set in order for a successful compile on the Debian platform:
HADOOP_INCLUDE_PATH = /opt/hadoop/src/c++/install/include
HADOOP_LIB_PATH = /opt/hadoop/src/c++/install/lib
JAVA_INCLUDE_PATH = /usr/lib/jvm/java-6-sun/include
JAVA_INCLUDE_PATH2 = /usr/lib/jvm/java-6-openjdk/include

$ cmake -DBUILD_SHARED_LIBS=ON -DHADOOP_INCLUDE_PATH=/opt/hadoop/src/c++/install/include -DHADOOP_LIB_PATH=/opt/hadoop/src/c++/install/lib -DJAVA_INCLUDE_PATH=/usr/lib/jvm/java-6-sun/include -DJAVA_INCLUDE_PATH2=/usr/lib/jvm/java-6-openjdk/include ../../hypertable

$ make -j <number of cores>
$ make install
$ make doc


(IV) Had issues starting up Hypertable:
: error while loading shared libraries: libHyperThriftConfig.so: cannot open shared object file: No such file or directory

Temporary fix:
*Note: This is a stop-gap, since this library is still linked against files in the build tree; don't delete the source tree. This needs a proper fix.
$ cp ~src/build/hypertable/src/cc/ThriftBroker/libHyperThriftConfig.so /opt/hypertable/0.9.2.6/lib/

(V) Resolve issue with Hypertable connecting to Hadoop:
Replace the Hadoop jar bundled with Hypertable with the jar from the Hadoop build (back up the original first):
$ cp /opt/hypertable/0.9.2.6/lib/java/hadoop-0.20.0-core.jar /opt/hypertable/0.9.2.6/lib/java/hadoop-0.20.0-core.jar.hypertable
$ cp /opt/hadoop-0.20.1/hadoop-0.20.1-core.jar /opt/hypertable/0.9.2.6/lib/java/hadoop-0.20.0-core.jar


(VI) Initialize Hypertable:
*Note: I added Hypertable to my path:

$ export PATH=$PATH:/opt/hypertable/0.9.2.6/bin
$ start-all-servers.sh hadoop
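
To sanity-check that Hypertable came up, look for its files in HDFS and try the HQL shell (the shell binary name and prompt shown here are as I recall them for this release; adjust if they differ):

$ hadoop dfs -ls /hypertable
$ hypertable
hypertable> show tables;
hypertable> quit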


(VII) Hypertable Scripts (reference):

start-dfsbroker.sh (local|hadoop|kfs) [<server options>]
start-hyperspace.sh [<server options>]
start-master.sh [<server options>]
start-rangeserver.sh [<server options>]
start-dfsbroker.sh hadoop
clean-database.sh

And a wrapper script to start all services:
start-all-servers.sh (local|hadoop|kfs) [<server options>]

Enjoy!

5 comments:

Ralph said...

Oh, my.

Luke said...

Thanks for trying out Hypertable :)

Note, the Hadoop integration code you're trying out is experimental and unmaintained (it's in contrib for a reason.), because it's not a typical workflow and requires JNI to boot. The current recommended workflow with Hadoop is using Hadoop to munge raw data and injecting results into hypertable via the Thrift interface in reducers or HQL "load data infile", which works with stock hypertable that even has a binary debian package that includes everything :)

We'll soon have better integration (no JNI needed) for map-reduce input from hypertable tables with user specified scan predicates, which will be in the binary debian package as well.

Robert Petkus said...

Luke,
Which packages are you referring to -- Hadoop, Hypertable, or both? Repos?

When I began this exercise I used binary debian packages for Hadoop 0.18.x from Cloudera (I didn't find a Hypertable package). Hadoop worked fine in a cluster but Hypertable, which compiled without error, had serious stability issues - especially the rangeserver - which kept segfaulting. Digging around on the net exposed complaints of compatibility issues between older versions of Hadoop and Hypertable which led me in this direction.

Still digesting the implementation details of Hypertable+Hadoop...

Robert Petkus said...

Ah, I have found the binary packages for Hypertable. They are located at http://package.hypertable.org/

Looks like this is completely new and issues are still being worked out (as of 8/29 there was a comment "Still chasing down a rare deadlock issue").

I'm going to give it a spin and will create a new how-to if all works well.

Robert Petkus said...

I had serious problems with the debian hypertable package and reverted to the configuration described in my blog. I do not recommend the binary packages at this time.