Wednesday, August 27, 2008

Hadoop and HBase port usage

Hadoop port usage
On name node
50090: org.apache.hadoop.dfs.SecondaryNameNode
33220: ditto (this port is not fixed and may change when Hadoop is restarted)

50070: org.apache.hadoop.dfs.NameNode
9000: ditto
46684: ditto (this port is not fixed and may change when Hadoop is restarted)

9001: org.apache.hadoop.mapred.JobTracker
50030: ditto
60502: ditto (this port is not fixed and may change when Hadoop is restarted)

On data node
50075: org.apache.hadoop.dfs.DataNode
50010: ditto
45868: ditto (this port is not fixed and may change when Hadoop is restarted)

50060: org.apache.hadoop.mapred.TaskTracker
55027: ditto (this port is not fixed and may change when Hadoop is restarted)

HBase port usage
On master
60000: org.apache.hadoop.hbase.master.HMaster start
60010: ditto

On data node
60030: org.apache.hadoop.hbase.regionserver.HRegionServer
60020: ditto
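
Since the ephemeral ports above change on every restart, only the fixed ports are worth probing from a script. Below is a minimal Ruby sketch that checks whether those fixed ports are accepting connections; the host name and the port list are assumptions taken from the listing above, so adjust them to your own cluster.

#!/usr/bin/ruby -w
# Probe the fixed Hadoop/HBase ports listed above.
# Host and port numbers are assumptions taken from this post; edit as needed.
require 'socket'
require 'timeout'

host = 'localhost'
ports = {
  'NameNode'            => 9000,
  'NameNode web UI'     => 50070,
  'JobTracker'          => 9001,
  'JobTracker web UI'   => 50030,
  'HBase master'        => 60000,
  'HBase master web UI' => 60010
}

ports.each do |name, port|
  begin
    Timeout.timeout(2) { TCPSocket.new(host, port).close }
    puts "#{name}: port #{port} is listening"
  rescue Timeout::Error, StandardError
    puts "#{name}: port #{port} is not reachable"
  end
end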

Monday, August 25, 2008

Insert pubchem data into HBase

HBase shell
HBase provides a shell utility that lets users execute simple commands. The shell can be started using:
${HBASE_HOME}/bin/hbase shell
Then type the command help to get a help document describing the usage of the supported commands. These commands can be used to manipulate data stored in HBase. For example, list lists all tables in HBase, get retrieves row or cell contents, and put stores data into a cell.
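
For example, a short interactive session might look like the following; the table, row, and values are made up for illustration, and the exact syntax may differ between HBase versions:

hbase> list
hbase> create 'mytable', {NAME => 'info'}
hbase> put 'mytable', 'row1', 'info:name', 'benzene'
hbase> get 'mytable', 'row1'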

Data insertion
Data source is ftp://ftp.ncbi.nlm.nih.gov/pubchem/. I modified python scripts and C source code given by Rajarshi.
Data retrieval and processing steps:

  1. Download all information about compounds from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF. I ended up with 123 GB.
  2. Decompress those files.
  3. Extract information from the .sdf files and write it to .txt files. The C++ code is compiled against the openbabel library.
  4. Combine the .txt files generated in step 3 into one big .dat file (a quick format check of this file is sketched right after this list).
  5. Write a Ruby script to insert all data in the .dat file into HBase.
    The command looks like this: ${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb
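
Before running the insertion, it is worth checking that the .dat file really has the tab-separated layout the insertion script expects (15 fields per line, with cid first). A minimal sketch, reusing the path and field count from the script further down:

#!/usr/bin/ruby -w
# Count lines in compound.dat that do not have the expected number of
# tab-separated fields. Path and field count are taken from the
# insertion script below; adjust them if your layout differs.
expected_fields = 15
total = 0
bad = 0
File.open("/home/zhguo/BigTable/BigTable-Pubchem/data/compound.dat", "r") do |f|
  f.each_line do |line|
    total += 1
    bad += 1 if line.chomp.split("\t").length != expected_fields
  end
end
puts "#{bad} of #{total} lines do not have #{expected_fields} fields"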

Why did I write a Ruby script instead of a Java program in step 5?
HBase is written in Java and so provides a Java API. However, compiling Java programs is kind of cumbersome -- setting a lengthy CLASSPATH ...
So I chose to write scripts that can be executed directly by the HBase shell. I found useful information on this page; there is a section called "scripting" in that post, but the information there is far from complete and does not tell readers how to write the scripts. At first, I wrote a script that contained ordinary shell commands, one per line, and fed it to the HBase shell. Unfortunately, it didn't work. After numerous trials, I found that Ruby scripts can be fed to the shell. Such scripts cannot use the existing shell commands directly; the Ruby bindings of the underlying Java API must be used instead.

I had not learned Ruby before, so I had to teach myself enough to get by. Ruby is somewhat different in terms of syntactic flexibility; it supports many shorthands to improve productivity. Anyway, "Ruby is easy to learn, but hard to master". By the way, Ruby documentation does not seem as abundant as Python's or Perl's...

How to write Ruby scripts for HBase?
This site http://wiki.apache.org/hadoop/Hbase/JRuby contains related information. But I could not run the sample script successfully because of errors in the script!!! Damn it! I wonder whether the author had tested the code before he released it. Some errors are so obvious.
Once the Ruby script is complete, it can be executed using:
${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb 
Java API:
http://hadoop.apache.org/hbase/docs/current/api/index.html

My ruby script:

#!/usr/bin/ruby -w

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.BatchUpdate
import org.apache.hadoop.io.Text

pubchem_compound_fields = [
    'cid',
    'iupac_openeye_name',
    'iupac_cas_name',
    'iupac_name',
    'iupac_systematic_name',
    'iupac_traditional_name',
    'nist_inchi',
    'cactvs_xlogp',
    'cactvs_exact_mass',
    'openeye_mw',
    'openeye_can_smiles',
    'openeye_iso_smiles',
    'cactvs_tpsa',
    'total_charge',
    'heavy_atom_count']

compound_table_name = 'compound'

numfields = pubchem_compound_fields.length

path = "/home/zhguo/BigTable/BigTable-Pubchem/data/"
filename = "#{path}compound.dat"
file = File.new(filename, 'r')
counter = 0

conf = HBaseConfiguration.new
tablename = compound_table_name
tablename_text = Text.new(tablename)
desc = HTableDescriptor.new(tablename)
coltextarr = Array.new
pubchem_compound_fields.each_with_index do |v, i|
    if (i == 0) then next; end # skip 'cid': it is used as the row key, not a column family
    desc.addFamily(HColumnDescriptor.new("#{v}:"))
    coltextarr << Text.new("#{v}:")
end

admin = HBaseAdmin.new(conf)
if !admin.tableExists(tablename_text) then
    admin.createTable(desc)
=begin
    puts "deleting table #{tablename_text}"
    admin.disableTable(tablename_text)
    admin.deleteTable(tablename_text)
    puts "deleted table #{tablename_text} successfully"
=end
end

#admin.createTable(desc)
table = HTable.new(conf, tablename_text)

startind = 1641500 # from which line should we start. This
                   # is useful when you don't want to start
                   # from the beginning of the data file.

nlines = `cat #{filename} | wc -l`.strip # total line count, used only in progress messages

logfilename = 'updatedb.log'
logfile = File.new(logfilename, "a")
while (line = file.gets) #&& (counter < 20)
    counter += 1
    if (counter < startind) then
        next
    end
    msg = "processing line #{counter}/#{nlines}"
    logfile.puts msg
    if counter%100 == 0 then
        puts msg
        STDOUT.flush
        logfile.flush
    end

    arr = line.chomp.split("\t") # chomp so the last field does not carry the trailing newline
    len = arr.length
    if (numfields != len) then # skip malformed lines
        next
    end
    rowindex = 0
    rowname = arr[rowindex]
    arr.delete_at(rowindex)
    row = Text.new(rowname)
    b = BatchUpdate.new(row)

    arr.each_with_index do |v, i|
        str = java.lang.String.new(v)
        b.put(coltextarr[i], str.getBytes("UTF-8"))
    end
    table.commit(b)
end

Sunday, August 24, 2008

Installation and configuration of Hadoop and Hbase

Hadoop

Installation
Hadoop installation instructions: http://hadoop.apache.org/core/docs/current/quickstart.html and http://hadoop.apache.org/core/docs/current/cluster_setup.html.
To set up a Hadoop cluster, two configuration files generally need to be modified:
hadoop-site.xml and slaves.
(1) My hadoop-site.xml looks like:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>pg3:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>pg3:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Read file hadoop-default.xml for all available options.
(2) My slaves file looks like:
localhost
pg1
pg2

I currently need to install Hadoop on three machines, and I use rsync to keep their configurations synchronized with each other.
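
For example, after changing the configuration on the master, something like the following pushes it to the slaves. The host names are the ones from my slaves file; the assumption that Hadoop lives at ${HADOOP_HOME} on every machine is mine.

rsync -av ${HADOOP_HOME}/conf/ pg1:${HADOOP_HOME}/conf/
rsync -av ${HADOOP_HOME}/conf/ pg2:${HADOOP_HOME}/conf/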

Commands
(*) Format a new file system: hadoop namenode -format
(*) Start/stop Hadoop
start-dfs.sh/stop-dfs.sh: start/stop the distributed file system (HDFS)
start-mapred.sh/stop-mapred.sh: start/stop the MapReduce service
start-all.sh/stop-all.sh: start/stop both HDFS and the MapReduce service

Hadoop reads the slaves file to find all slave nodes and then starts the appropriate daemons on each of them.

Check status of the services
HDFS: http://domain:50070/
MapReduce: http://domain:50030/

HBase

Installation instructions: http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
The configuration file is hbase-site.xml. My hbase-site.xml looks like
<configuration>
  <property>
    <name>hbase.master</name>
    <value>pg3:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://pg3.ucs.indiana.edu:9000/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
</configuration>

Commands
start-hbase.sh    starts the HBase service
stop-hbase.sh     stops the HBase service

Note: HBase builds its functionality on top of Hadoop, and sometimes HBase needs to know Hadoop's configuration. The following statements, excerpted from the HBase documentation, are important:

"Of note, if you have made HDFS client configuration on your hadoop cluster, hbase will not see this configuration unless you do one of the following:
  • Add a pointer to your HADOOP_CONF_DIR to CLASSPATH in hbase-env.sh
  • Add a copy of hadoop-site.xml to ${HBASE_HOME}/conf, or
  • If only a small set of HDFS client configurations, add them to hbase-site.xml
An example of such an HDFS client configuration is dfs.replication. If for example, you want to run with a replication factor of 5, hbase will create files with the default of 3 unless you do the above to make the configuration available to hbase. "
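
For example, following the third option for the dfs.replication case mentioned in the quote, the property could simply be added to hbase-site.xml (the value 5 mirrors the quoted example):

  <property>
    <name>dfs.replication</name>
    <value>5</value>
    <description>HDFS client setting made visible to HBase.</description>
  </property>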

Monday, August 18, 2008

Mysql error "InnoDB: Unable to lock ./ibdata1, error: 11"

Recently, there was a power outage in the lab, and I did not shut down my machines beforehand. After I restarted my Ubuntu machine, I could not start MySQL. The error was:

InnoDB: Unable to lock ./ibdata1, error: 11
InnoDB: Check that you do not already have another mysqld process
InnoDB: using the same InnoDB data or log files.
InnoDB: Error in opening ./ibdata1

But I am 100% sure that no other mysqld process was running. After searching online, I found that I was not the only one encountering this error. See this post.
This post has an insight into the problem: it seems to be caused by NFS. MySQL is not installed on a local file system; it is installed on a remote file system that is mounted via NFS.

Solution
Make a copy of the original files (ibdata1, ib_logfile0, ib_logfile1...) and copy them back; presumably the copies get fresh inodes, so the stale NFS locks no longer apply:

mv ibdata1 ibdata1.bak
cp -a ibdata1.bak ibdata1
......