
The Ruby UCSC API:
accessing the UCSC Genome
Database using Ruby
Hiroyuki Mishima (1), Jan Aerts (2), Toshiaki Katayama (3),
Raoul J.P. Bonnal (4), Koh-ichiro Yoshiura (1)
1) Nagasaki University, Japan;
2) Leuven University, Belgium;
3) DBCLS, ROIS, Japan;
4) Istituto Nazionale Genetica Molecolare, Italy

20th Annual International Conference on Intelligent Systems for Molecular Biology
July 15-17, 2012, Long Beach, CA, USA

Background:
The University of California, Santa Cruz (UCSC) genome database is among the most used
sources of genomic annotation in human and other organisms. The database offers an excellent
web-based graphical user interface (the UCSC genome browser) and several means for
programmatic queries. However, a simple application programming interface (API) in a
scripting language aimed at the biologist was not yet available. Here, we present the Ruby
UCSC API, a library to access the UCSC genome database using Ruby.
Results:
The API is designed as a BioRuby plug-in (Biogem) and built on the ActiveRecord 3 framework
for the object-relational mapping, making writing SQL statements unnecessary. The current
version of the API supports databases of all organisms in the UCSC genome database including
human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index, if available, when querying for genomic intervals. The API also
supports genomic sequence queries using locally downloaded *.2bit files that are not stored
in the official MySQL database. The API is implemented in pure Ruby and is therefore available
in different environments and with different Ruby interpreters (including JRuby).
Conclusions:
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby
UCSC API will help biologists query the UCSC genome database programmatically. The
API is available through the RubyGems system. Source code and documentation are available
at https://github.com/misshie/bioruby-ucsc-api/

The UCSC genome database
• The UCSC genome database is among the most used
  sources of genomic annotation in human and
  other organisms.
• It offers an excellent web-based graphical user interface
  (the UCSC genome browser) and several means
  for programmatic queries.
• However, a simple application programming interface
  (API) in a scripting language aimed at the
  biologist was not yet available.
• Supporting a large number of tables (>40,000) is
  still challenging.

Ruby UCSC API
• A Ruby library to access the UCSC genome database.
• Designed as a Biogem, a BioRuby plug-in.
• Built on the ActiveRecord 3 framework for
  object-relational mapping.
• Written in pure Ruby, supporting MRI Ruby 1.9/1.8
  and JRuby.
[Figure: Design structure of the Ruby UCSC API]

Dynamic Table Class Definition

• The UCSC database is optimized to serve the genome
  browser, resulting in a very large number of tables.
• More than 41,840 tables, counted as MySQL *.MYD files.
• Database components are updated frequently.
• The Ruby UCSC API adopts dynamic class definition to
  handle many table classes.
• When a table class is referred to for the first time, the API
  prefetches the fields of the table to detect the table type and
  defines the appropriate table class. This lazy evaluation of
  class definitions also makes API initialization much faster.
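The lazy class definition described on this slide can be sketched with Ruby's const_missing hook. This is an illustration under stated assumptions, not the library's actual code: the LazyTables module, the FAKE_SCHEMA hash (a stand-in for prefetching fields from MySQL), the class-to-table naming rule, and the table_name, table_kind, and uses_bin? helpers are all hypothetical.

```ruby
# Hypothetical stand-in for the field prefetch: a real implementation would
# ask the database for each table's columns instead of using a fixed hash.
FAKE_SCHEMA = {
  "snp131"         => %w[bin chrom chromStart chromEnd name],
  "knownGene"      => %w[name chrom strand txStart txEnd],
  "knownToEnsembl" => %w[name value]
}.freeze

module LazyTables
  # Ruby calls this hook only when LazyTables::SomeName is not defined yet.
  def self.const_missing(const_name)
    # Assumed naming rule: class Snp131 maps to table snp131.
    table = const_name.to_s.sub(/\A[A-Z]/) { |c| c.downcase }
    fields = FAKE_SCHEMA[table]
    return super unless fields # unknown table: keep the usual NameError

    # Detect a table type from the fields that are present.
    kind =
      if (%w[chrom chromStart chromEnd] - fields).empty?
        :interval
      elsif (%w[chrom txStart txEnd] - fields).empty?
        :gene
      else
        :plain
      end

    klass = Class.new do
      define_singleton_method(:table_name) { table }
      define_singleton_method(:table_kind) { kind }
      define_singleton_method(:uses_bin?) { fields.include?("bin") }
    end
    const_set(const_name, klass) # later references skip const_missing
  end
end

puts LazyTables.const_defined?(:Snp131, false) # false
puts LazyTables::Snp131.table_kind             # interval
puts LazyTables.const_defined?(:Snp131, false) # true
puts LazyTables::KnownGene.table_kind          # gene
```

Because nothing is defined until a constant is first referenced, start-up cost does not grow with the number of tables, which matches the motivation given above.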

Availability and Installation
Installation via RubyGems

$ gem install bio-ucsc-api

GitHub
https://github.com/misshie/bioruby-ucsc-api
Support Forum
http://rubyucscapi.userecho.com/
RubyGems.org
https://rubygems.org/gems/bio-ucsc-api

Sample Codes and Features
require 'bio-ucsc'
Bio::Ucsc::Hg19.connect
result =
Bio::Ucsc::Hg19::Snp131.
find_by_name("rs56289060")
puts result.chrom # => "chr1"
• Supports all organisms, with at least the newest
  assembly of each
• Supports UCSC's official MySQL server and local
  mirror MySQL servers
• Uses ActiveRecord's object-relational mapping

region = "chr17:7,579,614-7,579,700"
condition =
Bio::Ucsc::Hg19::Snp131.
with_interval(region).select(:name)
puts condition.to_sql

SELECT name FROM `snp131`
WHERE (chrom = 'chr17' AND bin in (642,80,9,1,0)
AND ( (chromStart BETWEEN 7579613 AND 7579700)
OR (chromEnd BETWEEN 7579613 AND 7579700)
OR (chromStart <= 7579613 AND
chromEnd >= 7579700) ));

• Generates complex SQL statements using relations.
• The bin index, if available, is used to accelerate queries.
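The bin list in the SQL above, (642,80,9,1,0), follows from the standard UCSC binning scheme, which the following sketch reproduces in plain Ruby. It is not taken from the library's source: parse_region and overlapping_bins are hypothetical helper names, and the sketch assumes the standard scheme (finest bins of 128 kb, five levels, coordinates below 512 Mb). The region string is 1-based and inclusive, so the start is shifted to the 0-based chromStart convention.

```ruby
# Standard UCSC binning constants: five levels, finest bin = 2**17 bases,
# each coarser level is 8 times (2**3) larger.
BIN_OFFSETS     = [585, 73, 9, 1, 0].freeze
BIN_FIRST_SHIFT = 17
BIN_NEXT_SHIFT  = 3

# Hypothetical helper: "chr17:7,579,614-7,579,700" -> ["chr17", 7579613, 7579700]
# (0-based start, end-exclusive, as in chromStart/chromEnd).
def parse_region(region)
  m = /\A(?<chrom>[^:]+):(?<from>[\d,]+)-(?<to>[\d,]+)\z/.match(region)
  raise ArgumentError, "unrecognized region: #{region}" unless m

  from = m[:from].delete(",").to_i
  to   = m[:to].delete(",").to_i
  [m[:chrom], from - 1, to]
end

# Hypothetical helper: every bin, finest level first, that can hold a
# feature overlapping the half-open interval [chrom_start, chrom_end).
def overlapping_bins(chrom_start, chrom_end)
  start_bin = chrom_start >> BIN_FIRST_SHIFT
  end_bin   = (chrom_end - 1) >> BIN_FIRST_SHIFT
  BIN_OFFSETS.flat_map do |offset|
    bins = ((offset + start_bin)..(offset + end_bin)).to_a
    start_bin >>= BIN_NEXT_SHIFT
    end_bin   >>= BIN_NEXT_SHIFT
    bins
  end
end

chrom, chrom_start, chrom_end = parse_region("chr17:7,579,614-7,579,700")
p [chrom, chrom_start, chrom_end]             # ["chr17", 7579613, 7579700]
p overlapping_bins(chrom_start, chrom_end)    # [642, 80, 9, 1, 0]
```

Restricting the query to these few bin values lets MySQL use the index on the bin column before it evaluates the coordinate comparisons.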

# declaration of the table association
Bio::Ucsc::Hg19::KnownGene.class_eval do
  has_one :knownToEnsembl, {:primary_key => :name,
                            :foreign_key => :name}
end
# reference to an associated field
puts Bio::Ucsc::Hg19::KnownGene.first.name
# => "uc001aaa.3"
puts Bio::Ucsc::Hg19::KnownGene.first.knownToEnsembl.value
# => "ENST00000456328"

• The user can define table associations.
• Associated tables can be accessed like fields of the
  table.

# load a locally stored sequence file
# and extract a partial sequence
seq = Bio::Ucsc::File::Twobit.open("hg19.2bit")
puts seq.subseq("chr1:9990-10009")
# => "NNNNNNNNNNNTAACCCTAA"

• In the UCSC genome database, genomic sequences are
  not stored in the MySQL databases but in *.2bit files.
• Reference sequence objects are generated by the
  File::Twobit.open class method, and sequences
  can be retrieved with the File::Twobit#subseq
  method.
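For readers unfamiliar with *.2bit files, the core of the format is that each base takes two bits (T=00, C=01, A=10, G=11), packed four per byte with the first base in the highest bits. The sketch below decodes such packed bytes. It is a simplified illustration, not the File::Twobit implementation: decode_2bit is a hypothetical helper, and the file header, sequence index, N blocks, and soft-mask blocks of a real 2bit file are deliberately left out.

```ruby
# Two-bit codes in the 2bit format: 00 = T, 01 = C, 10 = A, 11 = G.
TWOBIT_BASES = %w[T C A G].freeze

# Hypothetical helper: decode the first n_bases bases from a packed string.
# Each byte holds four bases, the first one in the two highest bits.
def decode_2bit(packed, n_bases)
  bases = packed.bytes.flat_map do |byte|
    [6, 4, 2, 0].map { |shift| TWOBIT_BASES[(byte >> shift) & 0b11] }
  end
  bases.first(n_bases).join
end

packed = [0b00011011, 0b10100000].pack("C*")
puts decode_2bit(packed, 6) # TCAGAA
```

Runs of N, such as the leading Ns in the chr1 example above, cannot be expressed in two bits per base. The format records them separately as blocks, which is why a full reader needs more than this decoding step.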

Supported Databases
clade/organism: databases
human: Hg19, Hg18
mammals: chimp (PanTro3), orangutan (PonAbe2), rhesus (RheMac2), marmoset (CalJac3),
  mouse (Mm9), rat (Rn4), guinea pig (CavPor3), rabbit (OryCun2), cat (FelCat4),
  panda (AilMel1), dog (CanFam2), horse (EquCab2), pig (SusScr2), sheep (OviAri1),
  cow (BosTau4), elephant (LoxAfr3), opossum (MonDom5), platypus (OrnAna1)
vertebrates: chicken (GalGal3), zebra finch (TaeGut1), lizard (AnoCar2),
  X. tropicalis (XenTro2), zebrafish (DanRer7), tetraodon (TetNig2), fugu (Fr2),
  stickleback (GasAcu1), medaka (OryLat2), lamprey (PetMar1)
deuterostomes: lancelet (BraFlo1), sea squirt (Ci2), sea urchin (StrPur2)
insects: D.melanogaster (Dm3), D.simulans (DroSim1), D.sechellia (DroSec1),
  D.yakuba (DroYak2), D.erecta (DroEre1), D.ananassae (DroAna2),
  D.pseudoobscura (Dp3), D.persimilis (DroPer1), D.virilis (DroVir2),
  D.mojavensis (DroMoj2), D.grimshawi (DroGri1), Anopheles mosquito (AnoGam1),
  honey bee (ApiMel2)
nematodes: C.elegans (Ce6), C.brenneri (CaePb3), C.briggsae (Cb3), C.remanei (CaeRem3),
  C.japonica (CaeJap1), P.pacificus (PriPac1)
others: sea hare (AplCal1), yeast (SacCer2)
common databases: Go, HgFixed, Proteome, UniProt, VisiGene

Current Limitations

• Table associations are not defined automatically.
• For some tables, including subsets of the
  ENCODE tables, the actual data are not stored in
  the MySQL database itself but as references to
  BigWig, BigBed, and BAM files. The Ruby UCSC API
  does not support these files yet. For BAM files,
  a separate Biogem, "bio-samtools", provides
  file handling.

Conclusions

• UCSC's official executables and C libraries are
  the most comprehensive and fastest API for the
  UCSC genome database.
• However, APIs for scripting languages still offer
  significant advantages, because users care not only
  about runtime speed but also about the total time
  from programming to results.
• The Ruby UCSC API can therefore have a
  significant impact in the field.

