Sunday, August 24, 2008

Installation and configuration of Hadoop and Hbase

Hadoop

Installation
Hadoop installation instructions: http://hadoop.apache.org/core/docs/current/quickstart.html and http://hadoop.apache.org/core/docs/current/cluster_setup.html.
To set up a Hadoop cluster, two configuration files generally need to be modified: hadoop-site.xml and slaves.
(1) My hadoop-site.xml looks like:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>pg3:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>pg3:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Read the file hadoop-default.xml for a list of all available options.
(2) My slaves file looks like:
localhost
pg1
pg2

I need to install Hadoop on three machines, and I use rsync to keep the configuration synchronized across them.
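For example, a minimal sketch of what can be run from the master to push the configuration out; the ~/hadoop install path is an assumption, the pg1/pg2 host names come from the slaves file above, and passwordless SSH is assumed to be in place:

# Push the master's Hadoop configuration directory to each slave.
# Assumes Hadoop is installed at ~/hadoop on every machine.
for host in pg1 pg2; do
  rsync -av ~/hadoop/conf/ ${host}:~/hadoop/conf/
done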

Commands
(*) Format a new file system: hadoop namenode -format
(*) Start/stop Hadoop:
start-dfs.sh / stop-dfs.sh: start/stop the distributed file system (HDFS)
start-mapred.sh / stop-mapred.sh: start/stop the MapReduce service
start-all.sh / stop-all.sh: start/stop both HDFS and the MapReduce service

Hadoop reads the slaves file to get the list of slave nodes and then starts services on all of them.
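Putting the commands together, a typical first-run sequence on the master node (pg3 in my case) might look like the following; it assumes Hadoop's bin directory is on the PATH:

hadoop namenode -format   # format a brand-new HDFS (run only once)
start-all.sh              # start HDFS and MapReduce on the master and all slaves
hadoop fs -ls /           # quick sanity check that HDFS is answering
stop-all.sh               # shut everything down again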

Check the status of the services
HDFS (NameNode web UI): http://<namenode host>:50070/
MapReduce (JobTracker web UI): http://<jobtracker host>:50030/
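The same information can also be checked from the command line, for example (assuming the hadoop command is on the PATH):

hadoop dfsadmin -report   # capacity and state of each datanode
hadoop job -list          # currently running MapReduce jobs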

HBase

Installation instructions: http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
The configuration file is hbase-site.xml. My hbase-site.xml looks like:
<configuration>
  <property>
    <name>hbase.master</name>
    <value>pg3:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://pg3.ucs.indiana.edu:9000/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
</configuration>

Commands
start-hbase.sh    starts the HBase service
stop-hbase.sh     stops the HBase service
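Because HBase stores its data in HDFS (see the note below), Hadoop has to be running before HBase is started. A sketch of the ordering, assuming both bin directories are on the PATH:

# Bring services up: HDFS first, then HBase.
start-dfs.sh
start-hbase.sh
# Shut down in the reverse order.
stop-hbase.sh
stop-dfs.sh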

Note: HBase builds its functionality on top of Hadoop, so sometimes HBase needs to know Hadoop's configuration. The following statements, excerpted from the HBase documentation, are important:

"Of note, if you have made HDFS client configuration on your hadoop cluster, hbase will not see this configuration unless you do one of the following:
  • Add a pointer to your HADOOP_CONF_DIR to CLASSPATH in hbase-env.sh
  • Add a copy of hadoop-site.xml to ${HBASE_HOME}/conf, or
  • If only a small set of HDFS client configurations, add them to hbase-site.xml
An example of such an HDFS client configuration is dfs.replication. If for example, you want to run with a replication factor of 5, hbase will create files with the default of 3 unless you do the above to make the configuration available to hbase. "
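For example, the second option above (copying hadoop-site.xml into HBase's conf directory) is just a file copy; HADOOP_HOME and HBASE_HOME below stand for wherever the two packages are installed:

# Make Hadoop's client configuration (e.g. dfs.replication) visible to HBase.
cp ${HADOOP_HOME}/conf/hadoop-site.xml ${HBASE_HOME}/conf/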
