Mr. Petkus: 2009

Hadoop and Hypertable on Debian

Environment:

Systems: (2) SUN Virtualbox 3.0.6 Virtual Machines, both Debian Lenny x86
Hypervisor system: Debian Lenny AMD64
Networking: Bridged adapter (not NAT) to eth0
g++: 4.3.2

*Note: The fact that this test deployment is on virtual machines is irrelevant to the configuration of Hadoop or Hypertable. Since the (2) virtual machines are on a shared disk, high-performance IO is not expected. However, should the testbed prove promising, dedicated systems will be deployed and high-performance tuning of both Hadoop and Hypertable will be explored.

Background:

At NSLS-II we are toying with the idea of back-ending the next generation of the Channel Archiver with a distributed database atop a distributed filesystem. This for a myriad of reasons other than it being a "cool project". For example, combining Hypertable (a high-performance distributed data storage system) with the MapReduce functionality of Hadoop promises to offer performance, redundancy, reliability, and scalability.

Goals:

Build a (2) node Hadoop and Hypertable cluster. The (2) nodes are "systemA" (master) and "systemB" (slave).

(1) Prerequisites:

*Note: Building Hadoop and Hypertable in this document requires adding the unstable and (optionally) testing repositories to /etc/apt/sources.list.

(1) Make sure to specify that Lenny/stable is the default distribution:
Edit /etc/apt/apt.conf:
# apt.conf
APT::Default-Release "stable"; # Only "stable", "testing", "unstable" are acceptable release names; i.e., "lenny" is not accepted.

$ apt-get update
$ apt-get -y install g++ cmake libboost-dev liblog4cpp5-dev git-core cronolog libgoogle-perftools-dev libevent-dev zlib1g-dev libexpat1-dev libdb4.6++-dev libncurses-dev libreadline5-dev

(2) Install Hyperic-Sigar

$ wget http://internap.dl.sourceforge.net/sourceforge/sigar/hyperic-sigar-1.6.2.tar.gz$ tar -xzvf hyperic-sigar-1.6.2.tar.gz$ cp ~src/hyperic-sigar-1.6.2/sigar-bin/include/*.h /usr/local/include$ cp ~src/hyperic-sigar-1.6.2/sigar-bin/lib/libsigar-x86-linux.so /usr/local/lib/$ ldconfig

(3) Install Thrift:

$ apt-get -y install sun-java6-jdk ant autoconf automake libtool bison flex pkg-config php5 php5-cli ruby-dev libhttp-access2-ruby libbit-vector-perl liblog4j1.2-java erlang ruby libevent-1.4-2

$ update-java-alternatives --set java-6-sun
$ ln -f -s /bin/bash /bin/sh
$ wget www.hypertable.org/pub/thrift-r796538.tgz

* Note: Since I am behind a proxy I needed to set this variable:

$ export ANT_OPTS="-Dhttp.proxyHost=192.168.1.130 -Dhttp.proxyPort=3128"
$ ./bootstrap.sh
$ ./configure
$ make

(4) Build Hadoop:

(I) Get Hadoop:

Download the latest version of Hadoop. I untarred mine in /opt and made a symlink from hadoop-0.20.1 to hadoop:

The latest version can be found at this mirror:
http://ftp.wayne.edu/apache/hadoop/core/

At the time of writing this document, the latest Hadoop was version 0.20.1

$ wget http://ftp.wayne.edu/apache/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

You can also retrieve the latest version via git, but the Hadoop directory tree is different than is what is referred to in this document.

$ git clone git://git.apache.org/hadoop-common.git
$ git clone git://git.apache.org/hadoop-hdfs.git
$ git clone git://git.apache.org/hadoop-mapreduce.git

$ ln -s /opt/hadoop-0.20.1 /opt/hadoop

Some source needs patching in order for Hypertable to cooperate with MapReduce. This is the list of files requiring patching:
/opt/hadoop/src/c++/utils/impl/StringUtils.cc
/opt/hadoop/src/c++/utils/impl/SerialUtils.cc
/opt/hadoop/src/c++/pipes/impl/HadoopPipes.cc

Here are the patches:

(i) SerialUtils.cc:
--- SerialUtils.cc (revision 765057)
+++ SerialUtils.cc (working copy)
@@ -18,7 +18,8 @@
#include "hadoop/SerialUtils.hh"
#include "hadoop/StringUtils.hh"

-#include
+#include
+#include
#include
#include
#include

(ii) StringUtils.cc:
--- StringUtils.cc (revision 765057)
+++ StringUtils.cc (working copy)
@@ -18,10 +18,11 @@
#include "hadoop/StringUtils.hh"
#include "hadoop/SerialUtils.hh"

-#include
+#include
#include
-#include
-#include
+#include
+#include
+#include
#include

using std::string;
@@ -31,7 +32,7 @@

string toString(int32_t x) {
char str[100];
- sprintf(str, "%d", x);
+ snprintf(str, 100, "%d", x);
return str;
}

@@ -96,7 +97,7 @@
const char* deliminators) {

string result(str);
- for(int i=result.length() -1; i >= 0; --i) {
+ for(int i = result.length() - 1; i >= 0; --i) {
char ch = result[i];
if (!isprint(ch) ||
ch == '\\' ||
@@ -116,7 +117,7 @@
break;
default:
char buff[4];
- sprintf(buff, "\\%02x", static_cast(result[i]));
+ snprintf(buff, 4, "\\%02x", static_cast(result[i]));
result.replace(i, 1, buff);
}
}

(iii) HadoopPipes.cc:
--- HadoopPipes.cc (revision 765057)
+++ HadoopPipes.cc (working copy)
@@ -26,9 +26,9 @@
#include
#include
#include
-#include
-#include
-#include
+#include
+#include
+#include
#include
#include

(II) Compile Hadoop:

$ cd /opt/hadoop/src/c++/pipes
$ sh configure
$ make && make install

$ cd /opt/hadoop/src/c++/utils
$ sh configure
$ make && make install

This will place Hadoop headers in:
/opt/hadoop/src/c++/install/include/hadoop

And Hadoop libraries in:
/opt/hadoop/src/c++/install/lib

Add the location of the Hadoop libraries to ld:
$ echo "/opt/hadoop/src/c++/install/lib" > /etc/ld.so.conf.d/hadoop.conf
$ ldconfig

Next we're ready to compile Hadoop.
$ cd /opt/hadoop

*Note: Again, because I am behind a proxy, I needed to set this environmental variable:
$ export ANT_OPTS="-Dhttp.proxyHost=192.168.1.130 -Dhttp.proxyPort=3128"
$ ant compile && ant jar

(III) Configure Hadoop:

(A) My Hadoop configuration files are kept in /opt/hadoop/conf
Relevant configuration files in this test are:
hadoop-env.sh # Environmental variables
core-site.xml # Default Hadoop filesystem
hdfs-site.xml # HDFS defaults for replication, name, and datanode services
mapred-site.xml # MapReduce defaults for trackers
slaves
master

*Note: All configuration files except for "slaves" and "master" will be the same on both cluster nodes.

(i) hadoop-env.sh:

By default Hadoop prefers ipv6. Since I am not using ipv6 this required a change:
HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

other variables set in hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
export HADOOP_PID_DIR=/var/hadoop/pids

This will require the creation of directories:
$ mkdir /var/log/hadoop
$ mkdir /var/hadoop/pids

(ii) core-site.xml (where "systemA" is the master node in the cluster):
<property>
<name>fs.default.name</name>
<value>hdfs://systemA:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

(iii) hdfs-site.xml:
*Note: dfs.replication specifies how many nodes are in the Hadoop cluster, in our case, (2).
*Note: This configuration requires the creation of a few directories, namely, "/dfsname", "/hadoop/data". If your test system has more than (1) system disk, it is advisable to separate the DataNode and NameNode directories so as to avoid contentious IO.

<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.The actual number of replications can
be specified when the file is created. The default is used if replication is not
specified in create time.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/dfsname</value>
<description>Path on local filesystem where the NameNode stores the
namespace and transactions logs persistently.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/data</value>
<description>Comma separated list of paths on the local filesystem of
a DataNode where it should store its blocks</description>
</property>

(iv) mapred-site.xml:
*Note: This configuration requires the creation of the directory "/hadoop/mapred"
<property>
<name>mapred.job.tracker</name>
<value>systemA:9001</value>
<description>The host and port that the MapReduce job tracker runs at.
If "local", then jobs are run in-process as a single map and reduce task.*lt;/description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/mapred</name>
<description>Path on the HDFS where the MapReduce framework stores
system files e.g., /hadoop/mapred/system</description>
</property>

(v) slaves:
*Note: both systems are slaves, so both are listed
*Note: this file is only configured on the master, "systemA"
systemA
systemB

(vi) Master (just the name of the master):
systemA

(B) SSH
Enable public-key authentication on all nodes within the Hadoop cluster. This will allow the Hadoop service to log-on to other nodes and start/stop services.

I am running Hadoop as the user "hadoop".

On both machines, create the Hadoop user:
$ groupadd hadoop
$ useradd -g hadoop -c "Hadoop User" -d /opt/hadoop -s /bin/bash hadoop

On the master server, systemA:
$ su - hadoop
$ ssh-keygen -t rsa -P ""
$ cp id_rsa.pub ~hadoop/.ssh/authorized_keys

Copy id_rsa.pub over the slave system and save it in the same place: ~hadoop/.ssh/authorized_keys

(C) Local name lookup
In /etc/hosts on both machines:
Enter all Hadoop node IP addresses and remove "127.0.0.1 localhost". Have "localhost" point to an assigned IP address

(IV) Initialize Hadoop:

*Note: Initialization is only necessary on the Master node.
*Note: For convenience-sake I've added Hadoop to my path:

$ export PATH=$PATH:/opt/hadoop/bin

$ hadoop namenode -format
$ start-dfs.sh
$ start-mapred.sh

Check to see what's running on both the master and slave:
systemA $ jps
4388 Jps
28444 NameNode
28795 JobTracker
2181 main
28696 SecondaryNameNode
28575 DataNode
28903 TaskTracker

systemB $ jps
13033 DataNode
13142 TaskTracker
1675 Jps
539 main

systemA $ netstat -ptlen

(V) Create a directory in the Hadoop namespace:

$ hadoop dfs -mkdir /hypertable
$ hadoop/bin/hadoop dfs -chmod 777 /hypertable
$ hadoop/bin/hadoop dfs -ls /
$ hadoop dfsadmin -report

(5) Hypertable installation:

(I) Get Hypertable:

$ apt-get -y install git sparsehash libbz2-dev doxygen graphviz
$ git config --global user.name "First Lastname"
$ git config --global user.email "something@something.com"
$ git clone git://scm.hypertable.org/pub/repos/hypertable.git

(II) Pre-build:

(A) Fixes:
The Hypertable source requires some patching so that the build cooperates with Debian's g++. The files that need to be patched are:
~src/hypertable/contrib/cc/MapReduce/TableReader.cc
~src/hypertable/contrib/cc/MapReduce/TableRangeMap.cc

(i) TableReader.cc:
--- a/TableReader.cc
+++ b/TableReader.cc
@@ -24,7 +24,7 @@ TableReader::TableReader(HadoopPipes::MapContext& context)
HadoopUtils::deserializeString(start_row, stream);
HadoopUtils::deserializeString(end_row, stream);

- scan_spec_builder.add_row_interval(start_row, true, end_row, true);
+ scan_spec_builder.add_row_interval(start_row.c_str(), true, end_row.c_str(), true);

if (allColumns == false) {
std::vector columns;
@@ -32,7 +32,7 @@ TableReader::TableReader(HadoopPipes::MapContext& context)

split(columns, job->get("hypertable.table.columns"), is_any_of(", "));
BOOST_FOREACH(const std::string &c, columns) {
- scan_spec_builder.add_column(c);
+ scan_spec_builder.add_column(c.c_str());
}
}
m_scanner = m_table->create_scanner(scan_spec_builder.get());

(ii) TableRangeMap.cc:
--- a/TableRangeMap.cc
+++ b/TableRangeMap.cc
@@ -28,7 +28,7 @@ namespace Mapreduce

startrow = tmprow;

- meta_scan_builder.add_row_interval(startrow, true, startrow + "\xff\xff", true);
+ meta_scan_builder.add_row_interval(startrow.c_str(), true, (startrow + "\xff\xff").c_str(), true);

/* select columns */
meta_scan_builder.add_column("StartRow");

(B) Hypertable config

Edit ~src/hypertable/conf/hypertable.cfg and enter the information about the Hadoop Master:
# HDFS Broker
HdfsBroker.Port=38030
HdfsBroker.fs.default.name=hdfs://systemA:9000
HdfsBroker.Workers=20

(III) Build Hypertable:

Assuming that Hypertable has been unpacked in ~src then
$ mkdir ~src/build/hypertable
$ cd ~src/build/hypertable

*Note some important variables that need to be set in order for a successful compile on the Debian platform:
HADOOP_INCLUDE_PATH = /opt/hadoop/src/c++/install/include
HADOOP_LIB_PATH = /opt/hadoop/src/c++/install/lib
JAVA_INCLUDE_PATH = /usr/lib/jvm/java-6-sun/include
JAVA_INCLUDE_PATH2 = /usr/lib/jvm/java-6-openjdk/include

$ cmake -DBUILD_SHARED_LIBS=ON -DHADOOP_INCLUDE_PATH=/opt/hadoop/src/c++/install/include -DHADOOP_LIB_PATH=/opt/hadoop/src/c++/install/lib -DJAVA_INCLUDE_PATH=/usr/lib/jvm/java-6-sun/include -DJAVA_INCLUDE_PATH2=/usr/lib/jvm/java-6-openjdk/include ../../hypertable

$ make -j <number of cores>
$ make install
$ make doc

(IV) Had issues starting up Hypertable:
: error while loading shared libraries: libHyperThriftConfig.so: cannot open shared object file: No such file or directory

Temporary fix:
*Note: this is a stop-gap since this library is linked to files in the source. Don't delete the source. Need to fix this.
$ cp ~src/build/hypertable/src/cc/ThriftBroker/libHyperThriftConfig.so /opt/hypertable/0.9.2.6/lib/

(IV) Resolve issue with Hypertable connecting to Hadoop:
Replace the Hypertable-Hadoop jar with Hadoop's jar:
$ cp /opt/hypertable/0.9.2.6/lib/java/hadoop-0.20.0-core.jar /opt/hypertable/0.9.2.6/lib/java/hadoop-0.20.0-core.jar.hypertable
$ cp /opt/hadoop-0.20.1/hadoop-0.20.1-core.jar /opt/hypertable/0.9.2.6/lib/java/hadoop-0.20.0-core.jar

(V) Initialize Hypertable:
*Note: Added Hypertable to path:

$ export PATH=$PATH:/opt/hypertable/0.9.2.6/bin
$ start-all-servers.sh hadoop

(V) Hypertable Scripts (reference):

start-dfsbroker.sh (local|hadoop|kfs) [<server options>]
start-hyperspace.sh [<server options>]
start-master.sh [<server options>]
start-rangeserver.sh [<server options>]
start-dfsbroker.sh hadoop
clean-database.sh

And a wrapper script to start all services:
start-all-servers.sh (local|hadoop|kfs) [<server options>]

Enjoy!

I've always had a disorganized mind brimming with ideas and while it has served me well over the years I felt it was time to 1) optimize the way I think, 2) delineate and partition work, and 3) preserve state. Why do I need to preserve state? When I get distracted or pulled off on some wild tangent I need to focus quickly again. Likewise, when dragging myself into the office on Monday mornings I want to pick up precisely where I left off on Friday. Finally, and maybe it's because I'm getting older, I spend more time thinking about what I've been thinking about. It's a sort of vicious cycle wherein the moment I'm situated to perform a task, right when I've knifed fresh paints on the palette, I have to go to the restroom or get a coffee or eat lunch and all is lost.

Enter mind mapping. I've spent a considerable amount of time this past week evaluating various mind mapping software on OS X. The candidates were MindJet MindManager 7, Inspiration, FreeMind, and XMind

Here's a summary of my impressions:

Inspiration:

Old MacOS9 look and feel
Costs $$$$ -- but why should I pay that for a something that feels so outdated?
Crashed twice and felt somewhat unwieldy for very large corporate or engineering projects
Very very nice outline mode handy for cut and paste right into a doc or email.
Might consider the Kidspiration for the youngins

MindManager7 for Mac:

$$$ but the newer MM8 is only available on Windows
Frequently crashed (due to evaluation? Doubt it)
Overview mode was annoying with no automatic way of expanding every node in the tree
Nice interface and keyboard shortcuts
Good documentation
Supports floating nodes
Worked quite well -- felt productive from the start.

FreeMind:

OpenSource. This is the product I wanted to like the most being a fan of the opensource community.
Full export
Cloud functionality mimics XMind boundaries.
Text mode is somewhat clumsy -- how does one delete the text icon when there is no longer any text without deleting the node itself?? This is precisely the sort of information I don't want to waste time digging around for.
Documented key-mappings for Mac didn't always match reality
Overview mode?
Annoying options menu -- especially choosing default colors
No floating nodes and difficult to place nodes where you want them (they're always snapping back to the way FreeMind wants it)
Somewhat primitive look and feel
I took some time before I felt truly productive with this product

XMind:

Some components OpenSource
Extremely polished, friendly interface
Good documentation
Floating nodes
Intuitive -- felt productive almost at once.
Easy to both add and delete notes using function+F4
Compact legend of key-bindings instantly accessible via ↑⌘L
Optional tri-pane window featuring outline view and properties
Cool Boundary and Summary functions
Nice auto-styling like multi-branch coloring and line tapering
Flexible node styling such as rounded, rectangle, callout, fishhead
Useful templates
Can't export to PDF in the free version

Ranking: 1) XMind, 2) FreeMind, 3) MindManager, 4) Inspiration.

XMind just wins hands-down in $$$, ease of use, form, and function. I would even consider purchasing the yearly subscription for some advanced features (Gantt charting) @ $49 a bargain.

Freemind is nice, too, but not as polished, full-featured, or easy to use as XMind. As mentioned before, I like opensource community projects, but in the end I need to get work done.

Shortly after XMind was made available at no cost, one of the FreeMind developers initiated a thread on the FreeMind mailing list asking whether or not it was worthwhile to continue that project (see here). Obviously the answer should be yes -- why crumple in the face of competition? Reading through the thread, however, I noticed a lot of commentors stated that while XMind is good, it's also slow whereas Freemind is lean and fast and, as such, Freemind is more suitable for quick off-the-cuff mapping. While it is true that Freemind is a bit spryer I feel this is negligable -- at least on a modern desktop.

Mr. Petkus

Thursday, September 10, 2009

Building Hadoop and Hypertable on Debian Lenny

Friday, August 21, 2009

Debian dbus - ldap error messages on boot

Thursday, August 20, 2009

OpenNMS and Apache2 on Debian Lenny

Tuesday, January 27, 2009

Cray pas haunted house

Friday, January 23, 2009

Cray pas -- first go

Thursday, January 22, 2009

Mind mapping software: XMind vs. FreeMind vs. MindManager vs. Inspiration

Blog Archive

About Me