Sunday, December 26, 2010

Hadoop Log

When Hadoop is started, it sets hadoop.log.dir using -Dhadoop.log.dir=$HADOOP_LOG_DIR.
If you don't set the environment variable HADOOP_LOG_DIR explicitly, it defaults to $HADOOP_HOME/logs. If you don't specify HADOOP_HOME either, Hadoop guesses it from the path of the script used to start Hadoop. So if you install Hadoop in directory <hadoop_dir> and HADOOP_LOG_DIR is not set, the log directory is <hadoop_dir>/logs.

If you want to change the root log directory, edit 'conf/hadoop-env.sh' and add a line similar to:

export HADOOP_LOG_DIR=/your/local/log/dir

The following list shows the log directories and files, what they contain, and the related configuration parameters. Replace the variables enclosed in angle brackets:
<jobid>: the id of a job
<username>: the username of the user who started Hadoop
<host>: the hostname of the node that runs the process

<hadoop.log.dir> (logs of the various daemons)
  hadoop-<username>-jobtracker-<host>.log: log of the jobtracker daemon
  hadoop-<username>-namenode-<host>.log: log of the namenode daemon
  hadoop-<username>-secondarynamenode-<host>.log: log of the secondarynamenode daemon
  hadoop-<username>-tasktracker-<host>.log: log of the tasktracker daemon
  hadoop-<username>-datanode-<host>.log: log of the datanode daemon
  job_<jobid>_conf.xml: configuration file of a job (only exists while the job is running)

<hadoop.log.dir>/history (related config parameter: mapreduce.jobtracker.jobhistory.location)
  job_<jobid>_conf.xml: only exists while the job is running
  job_<jobid>_<username>: only exists while the job is running

<hadoop.log.dir>/done (logs of completed jobs; related config parameter: mapreduce.jobtracker.jobhistory.completed.location)
  job_<jobid>_<username>: event log; includes all events of the job (e.g. job started, task started)
  job_<jobid>_conf.xml: job conf file; includes all configurations of the job

<hadoop.log.dir>/userlogs (logs of task attempts, stored on each tasktracker)
  job_<jobid>: each directory contains the logs of all attempts of that job

/jobtracker/jobsInfo, in HDFS (job status store; related config parameters: mapreduce.jobtracker.persist.jobstatus.active, mapreduce.jobtracker.persist.jobstatus.hours, mapreduce.jobtracker.persist.jobstatus.dir)
  <jobId>.info: job status of a job

Job logs in the <hadoop.log.dir>/history/done directory are kept for mapreduce.jobtracker.jobhistory.maxage; the default is one week.
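
For example, you can poke around these directories with commands like the following (a sketch, assuming HADOOP_LOG_DIR was left at its default of $HADOOP_HOME/logs and that job status persistence is enabled):

tail -f $HADOOP_HOME/logs/hadoop-$USER-jobtracker-$(hostname).log    # jobtracker daemon log
ls $HADOOP_HOME/logs/history                                         # history files of running jobs
ls $HADOOP_HOME/logs/userlogs                                        # per-attempt task logs on each tasktracker
hadoop fs -ls /jobtracker/jobsInfo                                   # persisted job status files in HDFS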

Wednesday, August 27, 2008

Hadoop and HBase port usage

Hadoop port usage
On the name node
50090: org.apache.hadoop.dfs.SecondaryNameNode
33220: ditto (this port is not fixed and may change when Hadoop is restarted)

50070: org.apache.hadoop.dfs.NameNode
9000: ditto
46684: ditto (this port is not fixed and may change when Hadoop is restarted)

9001: org.apache.hadoop.mapred.JobTracker
50030: ditto
60502: ditto (this port is not fixed and may change when Hadoop is restarted)

On a data node
50075: org.apache.hadoop.dfs.DataNode
50010: ditto
45868: ditto (this port is not fixed and may change when Hadoop is restarted)

50060: org.apache.hadoop.mapred.TaskTracker
55027: ditto (this port is not fixed and may change when Hadoop is restarted)

HBase port usage
On the master
60000: org.apache.hadoop.hbase.master.HMaster start
60010: ditto

On a data node
60030: org.apache.hadoop.hbase.regionserver.HRegionServer
60020: ditto
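
To check which daemon owns a given port on a node, something like this works (a quick sketch; the port is one from the lists above):

sudo netstat -tlnp | grep 50070    # shows the PID of the process listening on the NameNode web UI port
jps                                # maps PIDs to Hadoop/HBase daemon class names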

Monday, August 25, 2008

Insert pubchem data into HBase

HBase shell
HBase provides a shell utility which lets users execute simple commands. The shell can be started using:
${HBASE_HOME}/bin/hbase shell
Then type help to get the help text, which describes the usage of the supported commands. These commands can be used to manipulate the data stored in HBase: for example, list lists all tables in HBase, get fetches row or cell contents, and put stores data into a cell.
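
For illustration, a typical interactive session looks roughly like this (the exact shell syntax differs between HBase versions, so treat it only as a sketch; the table and column names match the compound table used later in this post, and '12345' stands for a compound id):

${HBASE_HOME}/bin/hbase shell
hbase> list                                            # show all tables
hbase> put 'compound', '12345', 'iupac_name:', 'x'     # store a value into a cell (table, row key, column, value)
hbase> get 'compound', '12345'                         # fetch the row back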

Data insertion
Data source is ftp://ftp.ncbi.nlm.nih.gov/pubchem/. I modified python scripts and C source code given by Rajarshi.
Data retrieval and processing steps:

  1. Download all information about compounds from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF. I finally got 123 GB.
  2. Decompress those files
  3. Extract information from these .sdf files and write it to .txt files. The openbabel library is needed to compile the C++ code.
  4. Combine those .txt files generated in step 3 into one big .dat file (see the shell sketch after this list)
  5. Write a Ruby script to insert all data in the .dat file into HBase.
    Command is like this: ${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb
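
A rough shell sketch of steps 1, 2 and 4 (file names are illustrative, and sdf2txt is just a stand-in name for the compiled C++ extractor of step 3):

wget -r ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/        # step 1: mirror the SDF directory
gunzip *.sdf.gz                                                              # step 2: decompress (run inside the download directory)
./sdf2txt Compound_00000001_00025000.sdf > Compound_00000001_00025000.txt    # step 3: extract fields with the Open Babel based tool
cat *.txt > compound.dat                                                     # step 4: combine everything into one big .dat file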

Why did I write a Ruby script instead of a Java program in step 5?
HBase is written in Java and so provides a Java API. However, compiling Java programs is kind of cumbersome (setting up a lengthy CLASSPATH ...).
So I chose to write scripts which can be executed directly by the HBase shell. I found useful information on this page. There is a section called "scripting" in that post, but the information there is far from complete: it does not tell readers how to write the scripts. At first, I wrote a script which included some shell commands, one command per line, and fed it to the HBase shell. Unfortunately, it didn't work. After numerous trials, I found that Ruby scripts could be fed to the shell. Ruby scripts cannot use the existing shell commands directly; the Ruby bindings of the underlying Java API must be used instead.

I had not learned Ruby at all before, so I had to teach myself enough to grasp the basics. Ruby is sort of different in terms of syntactic flexibility: it supports so many shorthands to improve productivity. Anyway, "Ruby is easy to learn, but hard to master". By the way, Ruby documentation seems not to be as abundant as that of Python, Perl...

How to write Ruby scripts for HBase?
This site http://wiki.apache.org/hadoop/Hbase/JRuby contains related information, but I could not run the sample script successfully because of errors in the script!!! Damn it! I wonder whether the author tested the code before releasing it; some of the errors are so obvious.
After the Ruby script is complete, it can be executed using:
${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb 
Java API:
http://hadoop.apache.org/hbase/docs/current/api/index.html

My Ruby script:

#!/usr/bin/ruby -w

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.BatchUpdate
import org.apache.hadoop.io.Text

pubchem_compound_fields = [
    'cid',
    'iupac_openeye_name',
    'iupac_cas_name',
    'iupac_name',
    'iupac_systematic_name',
    'iupac_traditional_name',
    'nist_inchi',
    'cactvs_xlogp',
    'cactvs_exact_mass',
    'openeye_mw',
    'openeye_can_smiles',
    'openeye_iso_smiles',
    'cactvs_tpsa',
    'total_charge',
    'heavy_atom_count']

compound_table_name = 'compound'

numfields = pubchem_compound_fields.length

path = "/home/zhguo/BigTable/BigTable-Pubchem/data/"
filename = "#{path}compound.dat"
file = File.new(filename, 'r')
counter = 0

conf = HBaseConfiguration.new
tablename = compound_table_name
tablename_text = Text.new(tablename)
desc = HTableDescriptor.new(tablename)
coltextarr = Array.new
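# Create one column family per field. The first field (cid) is skipped
# because it is used as the row key rather than stored in a column.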
pubchem_compound_fields.each_with_index do |v, i|
    if (i == 0) then next; end
    desc.addFamily(HColumnDescriptor.new("#{v}:"))
    coltextarr << Text.new("#{v}:")
end

admin = HBaseAdmin.new(conf)
if !admin.tableExists(tablename_text) then
    admin.createTable(desc)
=begin
    puts "deleting table #{tablename_text}"
    admin.disableTable(tablename_text)
    admin.deleteTable(tablename_text)
    puts "deleted table #{tablename_text} successfully"
=end
end

#admin.createTable(desc)
table = HTable.new(conf, tablename_text)

startind = 1641500 # line number from which we start. This is useful
                   # when you don't want to start from the beginning
                   # of the data file.

nlines = `cat #{filename} | wc -l`.strip   # total line count, used only in progress messages

logfilename = 'updatedb.log'
logfile = File.new(logfilename, "a")
while (line = file.gets) #&& (counter < 20)
    counter += 1
    if (counter < startind) then
        next
    end
    msg = "processing line #{counter}/#{nlines}"
    logfile.puts msg
    if counter%100 == 0 then
        puts msg
        STDOUT.flush
        logfile.flush
    end

    arr = line.chomp.split("\t")   # chomp so the last field does not keep the trailing newline
    len = arr.length
    if (numfields != len) then     # skip malformed lines that do not have every field
        next
    end
    rowindex = 0
    rowname = arr[rowindex]        # the first field (cid) becomes the row key
    arr.delete_at(rowindex)
    row = Text.new(rowname)
    b = BatchUpdate.new(row)

    # The remaining fields line up with coltextarr because both exclude cid.
    arr.each_with_index do |v, i|
        str = java.lang.String.new(v)
        b.put(coltextarr[i], str.getBytes("UTF-8"))
    end
    table.commit(b)
end

Sunday, August 24, 2008

Installation and configuration of Hadoop and HBase

Hadoop

Installation
Hadoop installation instructions: http://hadoop.apache.org/core/docs/current/quickstart.html and http://hadoop.apache.org/core/docs/current/cluster_setup.html.
To set up a Hadoop cluster, two configuration files generally need to be modified:
hadoop-site.xml and slaves.
(1) My hadoop-site.xml looks like:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>pg3:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>pg3:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Read file hadoop-default.xml for all available options.
(2) My slaves file looks like:
localhost
pg1
pg2

I need to install Hadoop on three machines, and I use rsync to keep their configurations synchronized with each other.
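
For example, after changing the configuration on the master (pg3), something like the following pushes it to the slaves (a sketch, assuming Hadoop is installed at the same $HADOOP_HOME path on every node):

rsync -av $HADOOP_HOME/conf/ pg1:$HADOOP_HOME/conf/
rsync -av $HADOOP_HOME/conf/ pg2:$HADOOP_HOME/conf/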

Commands
(*) Format a new file system: hadoop namenode -format
(*) Start/stop Hadoop
start-dfs.sh / stop-dfs.sh: start/stop the distributed file system (HDFS)
start-mapred.sh / stop-mapred.sh: start/stop the MapReduce service
start-all.sh / stop-all.sh: start/stop both HDFS and the MapReduce service

Hadoop reads the slaves file to get the list of slave nodes and then starts the daemons on all of them.

Check status of the services
HDFS: http://domain:50070/
MapReduce: http://domain:50030/
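
Besides the web UIs, running jps (it ships with the JDK) on a node is a quick sanity check: its output maps process IDs to daemon class names, so on the master you would expect to see at least NameNode, SecondaryNameNode and JobTracker, e.g. (the PIDs below are made up):

jps
12001 NameNode
12002 SecondaryNameNode
12003 JobTracker
12100 Jps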

HBase

Installation instructions: http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
The configuration file is hbase-site.xml. My hbase-site.xml looks like:
<configuration>
  <property>
    <name>hbase.master</name>
    <value>pg3:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://pg3.ucs.indiana.edu:9000/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
</configuration>

Commands
start-hbase.sh: starts the HBase service
stop-hbase.sh: stops the HBase service

Note: HBase builds its functionality on top of Hadoop, so sometimes HBase needs to know Hadoop's configuration. The following statements, which I think are important, are excerpted from the HBase documentation:

"Of note, if you have made HDFS client configuration on your hadoop cluster, hbase will not see this configuration unless you do one of the following:
  • Add a pointer to your HADOOP_CONF_DIR to CLASSPATH in hbase-env.sh
  • Add a copy of hadoop-site.xml to ${HBASE_HOME}/conf, or
  • If only a small set of HDFS client configurations, add them to hbase-site.xml
An example of such an HDFS client configuration is dfs.replication. If for example, you want to run with a replication factor of 5, hbase will create files with the default of 3 unless you do the above to make the configuration available to hbase. "
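
For example, the first two options boil down to something like this (paths are illustrative; option 1 assumes your hbase-env.sh provides the HBASE_CLASSPATH variable for extra classpath entries):

# option 1: point HBase at the Hadoop configuration directory in ${HBASE_HOME}/conf/hbase-env.sh
export HBASE_CLASSPATH=${HADOOP_HOME}/conf

# option 2: copy the Hadoop client configuration into HBase's conf directory
cp ${HADOOP_HOME}/conf/hadoop-site.xml ${HBASE_HOME}/conf/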