HBase provides a shell utility that lets users execute simple commands. The shell can be started with:
${HBASE_HOME}/bin/hbase shell
Then enter the command help to display the help text, which describes the usage of the supported commands. These commands can be used to manipulate data stored in HBase. For example, list lists all tables in HBase, get retrieves the contents of a row or cell, and put stores data in a cell.
Data insertion
The data source is ftp://ftp.ncbi.nlm.nih.gov/pubchem/. I modified the Python scripts and C source code provided by Rajarshi.
Data retrieval and processing steps:
1. Download all information about compounds from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF. The download came to 123 GB in total.
2. Decompress those files.
3. Extract information from the .sdf files and write it to .txt files. The C++ code is built against the openbabel library.
4. Combine the .txt files generated in step 3 into one big .dat file.
5. Write a Ruby script to insert all data in the .dat file into HBase.
The command looks like this: ${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb
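The combining step (step 4 in the list above) could be sketched roughly as follows. This is only an illustration, not the script I actually used: the helper name combine_txt_files and the directory layout are assumptions, and it simply concatenates the non-blank lines of every .txt file in one directory.

```ruby
# Hypothetical helper (not from the original post): concatenate every .txt
# file found in src_dir into a single .dat file, one record per line.
def combine_txt_files(src_dir, dest_path)
  lines_written = 0
  File.open(dest_path, 'w') do |out|
    # Sort the file names so the output order is the same on every run.
    Dir.glob(File.join(src_dir, '*.txt')).sort.each do |txt_path|
      File.foreach(txt_path) do |line|
        next if line.strip.empty?   # skip blank lines
        out.puts(line.chomp)        # write exactly one newline per record
        lines_written += 1
      end
    end
  end
  lines_written
end

# Example call (paths are placeholders):
#   combine_txt_files('/path/to/txt', '/path/to/compound.dat')
```

The function returns the number of lines written, which is handy for checking the result against wc -l on the generated .dat file.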
Why did I write a Ruby script instead of a Java program in step 5?
HBase is written in Java, so it provides a Java API. However, compiling Java programs is somewhat cumbersome: you have to set a lengthy CLASSPATH, among other things.
So I chose to write scripts that can be executed directly by the HBase shell. I found useful information on this page. There is a section called "scripting" in that post, but the information there is far from complete: it does not tell readers how to write the scripts. At first, I wrote a script containing shell commands, one command per line, and fed it to the HBase shell. Unfortunately, it didn't work. After numerous trials, I found that Ruby scripts can be fed to the shell. Ruby scripts cannot use the existing shell commands directly; the Ruby bindings of the original Java APIs must be used instead.
I had never learned Ruby before, so I had to teach myself the basics. Ruby differs from other languages in its syntactic flexibility: it supports many shorthands to improve productivity. Anyway, "Ruby is easy to learn, but hard to master". By the way, Ruby documentation does not seem as abundant as that for Python or Perl.
How to write Ruby scripts for HBase?
The page http://wiki.apache.org/hadoop/Hbase/JRuby contains related information. But I could not run the sample script successfully because of errors in the script itself! Damn it! I wonder whether the author tested the code before releasing it. Some of the errors are quite obvious.
Once the Ruby script is complete, it can be executed using: ${HBASE_HOME}/bin/hbase org.jruby.Main rubyscript.rb
Java API:
http://hadoop.apache.org/hbase/docs/current/api/index.html
My Ruby script:
#!/usr/bin/ruby -w

include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.BatchUpdate
import org.apache.hadoop.io.Text

pubchem_compound_fields = [
  'cid',
  'iupac_openeye_name',
  'iupac_cas_name',
  'iupac_name',
  'iupac_systematic_name',
  'iupac_traditional_name',
  'nist_inchi',
  'cactvs_xlogp',
  'cactvs_exact_mass',
  'openeye_mw',
  'openeye_can_smiles',
  'openeye_iso_smiles',
  'cactvs_tpsa',
  'total_charge',
  'heavy_atom_count']

compound_table_name = 'compound'
numfields = pubchem_compound_fields.length

path = "/home/zhguo/BigTable/BigTable-Pubchem/data/"
filename = "#{path}compound.dat"
file = File.new(filename, 'r')
counter = 0

conf = HBaseConfiguration.new
tablename = compound_table_name
tablename_text = Text.new(tablename)
desc = HTableDescriptor.new(tablename)
coltextarr = Array.new
pubchem_compound_fields.each_with_index do |v, i|
  if (i == 0) then next; end
  desc.addFamily(HColumnDescriptor.new("#{v}:"))
  coltextarr << Text.new("#{v}:")
end

admin = HBaseAdmin.new(conf)
if !admin.tableExists(tablename_text) then
  admin.createTable(desc)
=begin
  puts "deleting table #{tablename_text}"
  admin.disableTable(tablename_text)
  admin.deleteTable(tablename_text)
  puts "deleted table #{tablename_text} successfully"
=end
end
#admin.createTable(desc)
table = HTable.new(conf, tablename_text)

startind = 1641500 #from which line should we start. This
                   #is useful when you don't want to start
                   #from the beginning of the data file.

nlines = `cat #{filename} | wc -l`
logfilename = 'updatedb.log'
logfile = File.new(logfilename, "a")

while (line = file.gets) #&& (counter < 20)
  counter += 1
  if (counter < startind) then next end
  msg = "processing line #{counter}/#{nlines}"
  logfile.puts msg
  if counter%100 == 0 then
    print msg
    STDOUT.flush
    logfile.flush
  end
  arr = line.split("\t")
  len = arr.length
  if (numfields != len) then next end
  rowindex = 0
  rowname = arr[rowindex]
  arr.delete_at(rowindex)
  row = Text.new(rowname)
  b = BatchUpdate.new(row)
  arr.each_with_index do |v, i|
    str = java.lang.String.new(v)
    b.put(coltextarr[i], str.getBytes("UTF-8"))
  end
  table.commit(b)
end
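The per-line parsing in the loop above can be pulled out into a small pure-Ruby function, which makes it testable without a running HBase. The sketch below is a variation, not the code I ran: the name parse_compound_line is made up for illustration, and unlike the script it calls chomp and passes -1 to split. That is an assumption about the desired behavior: the script's plain split("\t") leaves the trailing newline in the last field and drops trailing empty fields, so a line whose last field is empty fails the field-count check and is skipped.

```ruby
# Field names copied from the script above; the first one (cid) is the row key.
PUBCHEM_COMPOUND_FIELDS = %w[
  cid iupac_openeye_name iupac_cas_name iupac_name iupac_systematic_name
  iupac_traditional_name nist_inchi cactvs_xlogp cactvs_exact_mass
  openeye_mw openeye_can_smiles openeye_iso_smiles cactvs_tpsa
  total_charge heavy_atom_count
].freeze

# Hypothetical helper: turn one tab-separated line into
# [row_key, { "column_family:" => value }], or nil if the field count is wrong.
def parse_compound_line(line, fields = PUBCHEM_COMPOUND_FIELDS)
  # chomp removes the trailing newline; a limit of -1 keeps trailing empty fields.
  values = line.chomp.split("\t", -1)
  return nil unless values.length == fields.length

  row_key = values.first
  columns = {}
  # Pair every non-key field name with its value, using "name:" as the column.
  fields.drop(1).zip(values.drop(1)).each do |name, value|
    columns["#{name}:"] = value
  end
  [row_key, columns]
end
```

In the loop, the returned row key would go to BatchUpdate.new and each column/value pair to b.put, with nil results skipped just as mismatched lines are skipped now.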