Monday, August 25, 2008

Insert pubchem data into HBase

HBase shell
HBase provides a shell utility which lets users to execute simple commands. The shell can be started up using:
${HBASE_HOME}/bin/hbase shell
Then input command help to get help document which describes usage of various supported commands. These commands can be used to manipulate data stored in HBase. E.g. command list can be used to list all tables in hbase. Command get can be used to get row or cell contents from hbase. Command put can be used to store data into a cell.

Data insertion
Data source is ftp://ftp.ncbi.nlm.nih.gov/pubchem/. I modified python scripts and C source code given by Rajarshi.
Data retrieval and processing steps:

  1. Download all information about compounds from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF. I finally got 123 GB.
  2. Decompress those files
  3. Extract information from these .sdf files and write it to .txt files. Library openbabel is used to compile the C++ code.
  4. Combine those .txt files generated in step 3 into one big .dat file
  5. Write a ruby script to insert all data in the .dat file into HBase.
    Command is like this: ${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb

Why did I write Ruby script instead of Java program in step 5?
HBase is written in Java and so provides Java API. However, to compile Java programs is kind of cumbersome -- set lengthy CLASSPATH ... 
So I chose to write scripts which can be executed directly by HBase shell. I found useful information on this page. There is a section called "scripting" in that post. But the information there is far from complete. It does not tell readers how to write the scripts. At first, I wrote a script which included some shell commands, one command per line, and then fed it to hbase shell. Unfortunately, it didn't work. After enumerous trials, I found that Ruby scripts could be fed to shell. Ruby scripts cannot make use of existing shell commands directly. Ruby binding of original Java APIs must be used.

I have not learnt Ruby at all before. So I must teach myself to grasp basic knowledge about Ruby. Ruby is sort of different in terms of syntactic flexibility. It supports so many shorthands to improve productivity. Anyway, "Ruby is easy to learn, but hard to master". By the way, Ruby documents seem not to be abundant compared with Python, Perl...

How to write Ruby scripts for HBase?
This site http://wiki.apache.org/hadoop/Hbase/JRuby contains related information. But I could not run the sample script successfully because of errors in the script!!! Damn it! I wonder whether the author had tested the code before he released it. Some errors are so obvious.
After the ruby script is completed, it can be executed using:
${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb 
Java API:
http://hadoop.apache.org/hbase/docs/current/api/index.html

My ruby script:

#!/usr/bin/ruby -w

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.BatchUpdate
import org.apache.hadoop.io.Text

pubchem_compound_fields = [
    'cid',
    'iupac_openeye_name',
    'iupac_cas_name',
    'iupac_name',
    'iupac_systematic_name',
    'iupac_traditional_name',
    'nist_inchi',
    'cactvs_xlogp',
    'cactvs_exact_mass',
    'openeye_mw',
    'openeye_can_smiles',
    'openeye_iso_smiles',
    'cactvs_tpsa',
    'total_charge',
    'heavy_atom_count']

compound_table_name = 'compound'

numfields = pubchem_compound_fields.length

path = "/home/zhguo/BigTable/BigTable-Pubchem/data/"
filename = "#{path}compound.dat"
file = File.new(filename, 'r')
counter = 0

conf = HBaseConfiguration.new
tablename = compound_table_name
tablename_text = Text.new(tablename)
desc = HTableDescriptor.new(tablename)
coltextarr = Array.new
pubchem_compound_fields.each_with_index do |v, i|
    if (i == 0) then next; end
    desc.addFamily(HColumnDescriptor.new("#{v}:"))
    coltextarr << Text.new("#{v}:")
end

admin = HBaseAdmin.new(conf)
if !admin.tableExists(tablename_text) then
    admin.createTable(desc)
=begin
    puts "deleting table #{tablename_text}"
    admin.disableTable(tablename_text)
    admin.deleteTable(tablename_text)
    puts "deleted table #{tablename_text} successfully"
=end
end

#admin.createTable(desc)
table = HTable.new(conf, tablename_text)

startind = 1641500 #from which line should we start.This
                   #is useful when you don't want to start
                   #from the beginning of the data file.

nlines = `cat #{filename} | wc -l`

logfilename = 'updatedb.log'
logfile = File.new(logfilename, "a")
while (line = file.gets) #&& (counter < 20)
    counter += 1
    if (counter < startind) then
        next
    end
    msg = "processing line #{counter}/#{nlines}"
    logfile.puts msg
    if counter%100 == 0 then
        print  msg
        STDOUT.flush
        logfile.flush
    end

    arr = line.split("\t")
    len = arr.length
        if (numfields != len) then
        next
    end
    rowindex = 0
    rowname = arr[rowindex]
    arr.delete_at(rowindex)
    row = Text.new(rowname)
    b = BatchUpdate.new(row)

    arr.each_with_index do |v, i|
        str = java.lang.String.new(v)
        b.put(coltextarr[i], str.getBytes("UTF-8"))
    end
    table.commit(b)
end

No comments: