The program ChemEngine recognizes textual patterns in supplementary scientific research article data to generate standard molecular structure data. It has been demonstrated to selectively harvest atomic coordinates from different formats of coordinates data stored in supplementary PDF files with high accuracy, as shown by close agreement of computed single point energies to the original values. The program and source code are available online at the given URL.
chemengine karthi acs sandiego rev1.0
Here we present an approach to harvest 3D molecular data from the supporting
information of scientific research articles, which is normally available from
publishers' resources. The Java-based ChemEngine program recognizes textual
patterns in the supplementary data and generates standard molecular structure
data.
The methodology has been demonstrated through a few case studies on different
formats of coordinate data stored in supplementary information files, wherein
ChemEngine selectively harvested the atomic coordinates and interpreted them
as molecules with high accuracy. The reusability of the extracted molecular
coordinate data was demonstrated by computing single point energies (SPEs)
that were in close agreement with the original computed data provided with the
articles. The software, source code, and instructions are available at
https://sourceforge.net/projects/chemengine/files/?source=navbar
ABSTRACT
Harvesting chemical data from the web is a challenging task requiring several
convoluted steps. Transforming PDF files of molecular data into a reusable
format is extremely difficult, as is generating 3D structures from molecular
images in raster format.
The supporting information for computational research articles describing the
transition states of organic reactions is now available from journal
publishers' websites. It typically contains a description of the computations
performed, tables of results, and images of molecules in 3D conformations,
together with the 3D molecular coordinates, all in a single PDF file. Combining
these data in one file complicates harvesting, because pattern recognition
techniques must selectively exclude the non-coordinate information from the
large pool of textual data presented as supporting material. Because authors
are free to choose their data formats, several pattern recognition templates
in the form of regular expressions are needed to handle diverse formats
(coordinates separated by space, comma, tab, etc.) and to preserve the order in
which the authors present the XYZ coordinates and atom information. This study
therefore highlights the need for standards for submitting supporting materials
with molecular data in a consistent, truly computable, and reusable format to
journals publishing computational research.
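
The idea of keeping several regular expression templates, one per layout, can be illustrated with a minimal Python sketch. It is not ChemEngine's code (the program itself is written in Java); the names `NUMBER`, `ATOM`, `DELIM`, `ATOM_FIRST`, `ATOM_LAST`, and `parse_line` are hypothetical, and the two templates cover only the atom-first and atom-last column orders with space, comma, or tab delimiters.

```python
import re

# Hypothetical building blocks; these are not ChemEngine's own patterns.
NUMBER = r"[-+]?\d+\.\d+"          # signed decimal coordinate
ATOM = r"[A-Za-z]{1,2}|\d{1,3}"    # atomic symbol or atomic number
DELIM = r"(?:\s*,\s*|\s+)"         # comma, spaces, or tabs

# Template 1: atom label first, then x, y, z.
ATOM_FIRST = re.compile(
    rf"^\s*(?P<atom>{ATOM}){DELIM}(?P<x>{NUMBER}){DELIM}"
    rf"(?P<y>{NUMBER}){DELIM}(?P<z>{NUMBER})\s*$"
)
# Template 2: x, y, z first, atom label last.
ATOM_LAST = re.compile(
    rf"^\s*(?P<x>{NUMBER}){DELIM}(?P<y>{NUMBER}){DELIM}"
    rf"(?P<z>{NUMBER}){DELIM}(?P<atom>{ATOM})\s*$"
)


def parse_line(line):
    """Return (atom, x, y, z) if the line matches a template, else None."""
    for template in (ATOM_FIRST, ATOM_LAST):
        match = template.match(line)
        if match:
            return (
                match["atom"],
                float(match["x"]),
                float(match["y"]),
                float(match["z"]),
            )
    return None
```

Whatever the input layout, the parsed tuple always has the same field order, so later steps do not need to know how the authors arranged their columns. Lines that match no template, such as captions or page numbers, return `None` and can be set aside.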
INTRODUCTION
Several standard molecular representations in ASCII format are available that are easily
read by molecular modeling and chemoinformatics software packages. A specific set of
guidelines from publishers for submitting molecular data, even in PDF format, would
accelerate the automatic processing and recognition of chemical data for further
computational studies related to reaction modeling, drug discovery, and molecular inventory
management. Exchange of chemical data between multiple software packages without loss of
information is a critical requirement in computational chemistry and chemoinformatics
applications. A recent article emphasized the importance of curating large chemogenomics
data sets for building better predictive models for the life sciences [17]. During the
preparation of this manuscript, a timely research article by Rzepa's group on a granularity
model for extracting molecular information appeared [18]; it stresses the need for
periodic and automatic curation of data from supplementary information in research articles.
The present work is geared towards partial fulfillment of this need for "futuristic research
data management".
Although spectral, molecular, and analytical data have been harvested in the past,
extracting molecules directly from author-supplied atomic coordinates provided as PDF
supplementary materials has not been reported. Accordingly, in the present work, we
have developed an application, ChemEngine, that reads all the files stored in PDF format
to extract molecular coordinates and generate computable molecular structures. To
demonstrate the efficiency of the program, supporting material data files with three different
molecular representations, differing in the delimiters of the coordinate data, were selected
and successfully parsed with ChemEngine to extract molecular data. Supporting
materials containing molecular data also include brief descriptions of molecules, computed
data, plots, page numbers, document information, manuscript bibliographic details, etc. in a
single PDF document. This makes harvesting the molecular data extremely difficult, as
these items have to be selectively excluded while parsing the file. Given an input file in PDF
format, the program yields three different files: one in GJF format, a text file containing the
computed bond matrix, and one with all molecules in SDF format. The contents of the
non-molecular data file can also be utilized by subjecting it to standard text mining
methodologies for retrieving molecule names or other information, such as the list of basis
sets employed in the specific computational work.
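
As a rough illustration of this separation step, the Python sketch below splits extracted text into runs of coordinate lines (one run per molecule) and a list of leftover non-molecular lines, then formats one molecule as a Gaussian input (GJF) block. It is a sketch under stated assumptions, not ChemEngine's implementation: the names `COORD`, `split_blocks`, and `to_gjf` are hypothetical, only the "symbol x y z" space-delimited layout is handled, and the route line, charge, and multiplicity are placeholder defaults.

```python
import re

# Assumed layout: atomic symbol followed by three decimal numbers.
COORD = re.compile(
    r"^\s*([A-Za-z]{1,2})\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def split_blocks(text):
    """Split text into molecules (runs of coordinate lines) and other lines."""
    molecules, other, current = [], [], []
    for line in text.splitlines():
        match = COORD.match(line)
        if match:
            symbol, x, y, z = match.groups()
            current.append((symbol, float(x), float(y), float(z)))
            continue
        if current:                      # a non-coordinate line ends the run
            molecules.append(current)
            current = []
        if line.strip():
            other.append(line.strip())   # kept for later text mining
    if current:
        molecules.append(current)
    return molecules, other


def to_gjf(atoms, title, route="# HF/STO-3G SP", charge=0, multiplicity=1):
    """Format one molecule as a Gaussian input block (placeholder route)."""
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    lines += [f"{s:<2} {x:12.6f} {y:12.6f} {z:12.6f}" for s, x, y, z in atoms]
    return "\n".join(lines) + "\n\n"     # Gaussian input ends with a blank line
```

In this simplified form, a page number that falls in the middle of a coordinate block would split one molecule into two, so a real harvester would need extra handling for page breaks.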
MATERIALS AND METHODS
In general, the major problem in processing molecular data stored in PDF files arises from
the non-standard representation of coordinates, such as inconsistency in the number of digits
after the decimal point, interchange of atom type with atomic number in the first
column, and improper alignment of x, y, z coordinate values. ChemEngine identifies the text
patterns and processes this information to yield a common generic format of coordinate
matrix. The bond matrix algorithm implemented in the program then generates a bond
matrix, from which a connection table is created to produce the ready-to-compute 3D
molecular structure.
(AN = atomic number, AS = atomic symbol, CT = connection table)
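
The source does not describe the bond matrix algorithm itself, so the Python sketch below substitutes a common distance-based heuristic: two atoms are treated as bonded when their separation is at most the sum of their covalent radii plus a tolerance. The names `normalize_label`, `bond_matrix`, and `connection_table`, the 0.4 Å tolerance, and the small element tables are all assumptions made for illustration.

```python
import math

# Small illustrative tables; a real tool would cover the full periodic table.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O", 16: "S"}
RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}  # angstrom


def normalize_label(label):
    """Map an atomic number (AN) or atomic symbol (AS) to a symbol."""
    label = label.strip()
    if label.isdigit():
        return SYMBOLS[int(label)]
    return label.capitalize()


def bond_matrix(atoms, tolerance=0.4):
    """Return an n x n 0/1 matrix from a distance criterion (assumed rule)."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            symbol_i, *pos_i = atoms[i]
            symbol_j, *pos_j = atoms[j]
            limit = RADII[symbol_i] + RADII[symbol_j] + tolerance
            if math.dist(pos_i, pos_j) <= limit:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


def connection_table(matrix):
    """List bonded pairs as 1-based (i, j) tuples with i < j."""
    n = len(matrix)
    return [(i + 1, j + 1)
            for i in range(n) for j in range(i + 1, n) if matrix[i][j]]
```

This criterion only decides whether two atoms are connected; it does not assign bond orders, and stretched bonds in transition states may fall outside the tolerance.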
In the present work we process the molecules and transform them into SDF format, which is
compatible with most commercial packages, thus saving time and computational effort. The
"compute once, use many times" approach will help readers access the original input
files even after the passage of time. We therefore suggest that the chemical community
maintain a standard and consistent representation of chemical structure data in
electronic supplementary files, in native format or a standard data format, to facilitate
reusability among the scientific community.
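
For readers unfamiliar with the target format, the following Python sketch writes one SDF record using the fixed-width V2000 molfile layout. It is an illustration only, not ChemEngine's writer: the function name `to_sdf_record` is hypothetical, all bonds are written as single bonds, and the optional header lines and data fields are left empty.

```python
def to_sdf_record(name, atoms, bonds):
    """Format one molecule as a V2000 molfile followed by the SDF delimiter.

    atoms: list of (symbol, x, y, z); bonds: list of 1-based (i, j) pairs.
    """
    lines = [name, "", ""]  # title line, program line, comment line
    # Counts line: atom count, bond count, eight unused fields, 999, version.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y, z in atoms:
        # Three 10.4 coordinate fields, the symbol, then zeroed property fields.
        lines.append(
            f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}{0:2d}" + "  0" * 11
        )
    for i, j in bonds:
        # Atom indices, bond order 1 (assumed single), four unused fields.
        lines.append(f"{i:3d}{j:3d}{1:3d}" + "  0" * 4)
    lines.append("M  END")
    lines.append("$$$$")    # record separator in a multi-molecule SDF
    return "\n".join(lines) + "\n"
```

Concatenating the records returned for each harvested molecule gives a single multi-molecule SDF file.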
RESULTS
CONCLUSIONS
The ChemEngine application presented here selectively extracts 3D structures
from coordinate information that sits alongside inadvertently introduced noisy
data in PDF files. This approach can obviate, to some extent, the loss of
chemical data while at the same time conserving the memory and storage space
required at the journal site. The methodology exemplified here will enable
molecule mining in a semantic context and ensure maximum reuse of valuable
data by interested readers, thereby enhancing the citations of the authors.
REFERENCES
Karthikeyan M, Vyas R. Role of Open Source Tools and Resources in Virtual Screening
for Drug Discovery. Combinatorial Chemistry & High Throughput Screening.
2015;18(6):528-543.
Postma G, van Bakel B, Kateman G. Automatic Extraction of Analytical Chemical
Information. System Description, Inventory of Tasks and Problems, and Preliminary
Results. Journal of Chemical Information and Modeling. 1996;36(4):770-785.
Karthikeyan M, Bender A. Encoding and Decoding Graphical Chemical Structures as
Two-Dimensional (PDF417) Barcodes. ChemInform. 2005;36(32).
Murray-Rust P, Mitchell J, Rzepa H. BMC Bioinformatics. 2005;6(1):141. 2005;6(1):180.
Guha R, Howard M, Hutchison G, Murray-Rust P, Rzepa H, Steinbeck C et al. The Blue
Obelisk Interoperability in Chemical Informatics. Journal of Chemical Information and
Modeling. 2006;46(3):991-998.
Fourches D, Muratov E, Tropsha A. Curation of Chemogenomics data. Nature Chemical
Biology. 2015;11(8):535-535.
Harvey M, Mason N, McLean A, Murray-Rust P, Rzepa H, Stewart J. Standards-based
curation of a decade-old digital repository dataset of molecular information. J
Cheminform. 2015;7(1).
Karthikeyan M, Krishnan S, Pandey A, Bender A. Harvesting Chemical Information from
the Internet Using a Distributed Approach: ChemXtreme. Journal of Chemical
Information and Modeling. 2006;46(2):452-461.
Karthikeyan M. Automatic harvesting of molecular information raster graphics. US
Patent Appl 14/241285, 2011.
Karthikeyan M, Pandit Y, Pandit D, Vyas R. MegaMiner: A Tool for Lead Identification
Through Text Mining Using Chemoinformatics Tools and Cloud Computing Environment.
Combinatorial Chemistry & High Throughput Screening. 2015;18(6):591-603.
Karthikeyan M, Vyas R. Practical Chemoinformatics. Springer; 2014.
ACKNOWLEDGEMENT
MK thanks the Department of Science and Technology for the award of an International Travel
grant to attend the meeting. MK thanks the Director, CSIR-NCL, for providing infrastructure
and support. The financial support received from GENESIS (BSC0121) and INSPIRE
(CSC0107) under 12FYP projects is duly acknowledged. RV thanks DST, New Delhi, for the
award of a fellowship (WOS-A/LS-201/2011).
†CSIR-National Chemical Laboratory (Government of India), Pune 411008, India
ChemEngine: Harvesting 3D Chemical Structures of Supplementary Data from PDF Files
Muthukumarasamy Karthikeyan†* and Renu Vyas‡
Entry | Case study | N (molecules) | Regular expression pattern | Format | Delimiter
1 | Epoxide formation from sulfur ylides and aldehydes | 29 | ^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | PDF | Space
2 | Thiol ene click chemistry | 115 | ^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | Text | Space
3 | Design of tetra(arenediyl)bis(allyl) derivatives for Cope rearrangement transition states | 55 | ^[A-Za-z0-9]{1,2},[0]{0,1}[,]{0,1}-{0,1}.{1,2}[0-9]{1,10},-{0,1}.{1,2}[0-9]{1,10}.{1,} | PDF | Comma
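
The two distinct patterns in the table can be tried against sample lines with a short Python sketch. It reads the whitespace token in the space-delimited pattern as `\s+`, and it assumes the patterns behave the same way in Python's `re` module as in the Java program. The function name `classify` and the sample lines in the usage note are hypothetical and do not come from the case-study files.

```python
import re

# Space-delimited pattern (entries 1 and 2), whitespace token read as \s+.
SPACE_PATTERN = re.compile(
    r"^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,}"
)
# Comma-delimited pattern (entry 3).
COMMA_PATTERN = re.compile(
    r"^[A-Za-z0-9]{1,2},[0]{0,1}[,]{0,1}-{0,1}.{1,2}[0-9]{1,10},"
    r"-{0,1}.{1,2}[0-9]{1,10}.{1,}"
)


def classify(line):
    """Return 'space' or 'comma' for a coordinate line, None otherwise."""
    if SPACE_PATTERN.match(line):
        return "space"
    if COMMA_PATTERN.match(line):
        return "comma"
    return None
```

For example, `classify("C 0.000000 1.402720 0.000000")` returns `"space"`, `classify("C,0,-1.234567,0.456789,2.345678")` returns `"comma"`, and `classify("Page 12 of 40")` returns `None`.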
[Figure: ChemEngine workflow. Input data format (PDF, text), pattern recognition with regular expressions, coordinate generation, 3D molecule data.]
https://sourceforge.net/projects/chemengine/