Here we present an approach to harvest 3D molecular data from the supporting
information of scientific research articles that are normally available from
publishers' resources. The Java-based ChemEngine program recognizes textual
patterns in the supplementary data and generates standard molecular structure
data.
The methodology has been demonstrated via a few case studies on different
formats of coordinate data stored in supplementary information files, wherein
ChemEngine selectively harvested the atomic coordinates and interpreted them
as molecules with high accuracy. The reusability of the extracted molecular
coordinate data was demonstrated by computing Single Point Energies (SPEs)
that were in close agreement with the original computed data provided with the
articles. The software, along with source code and instructions, is available at
https://sourceforge.net/projects/chemengine/files/?source=navbar
ABSTRACT
Harvesting chemical data from the web is a challenging task requiring several
convoluted steps. Transforming PDF files of molecular data into a re-usable
format is extremely difficult, as is generating 3D structures from molecular
images in raster format.
The supporting information for computational research articles describing the
transition states of organic reactions is now available from journal
publishers' websites. It contains a description of the computations performed,
tables of results and molecular images in 3D conformations, along with 3D
molecular co-ordinates, all in PDF format. Combining these data in a single
file complicates the harvesting process and the development of pattern
recognition techniques, which must selectively exclude the non-coordinate
information from the large pool of textual data presented as supporting
material. Because authors are free to choose their data formats, several
pattern recognition templates in the form of regular expressions are needed to
handle the diverse formats (co-ordinates separated by space, comma, tab etc.)
and to maintain the order in which the XYZ co-ordinates and atom information
are presented by the authors. This study therefore highlights the need for
standards for submitting supporting materials with molecular data in a
consistent, truly computable and re-usable format to journals publishing
computational research.
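As an illustration of handling several delimiters with one template, the following Python sketch parses coordinate lines separated by spaces, tabs or commas. The pattern, the function name `parse_coordinates` and the sample text are assumptions made for this example; they are not taken from ChemEngine, which is a Java program with its own patterns.

```python
import re

# Hypothetical pattern (not ChemEngine's own): an atom label (element symbol or
# atomic number) followed by three signed decimals, separated by whitespace
# (spaces or tabs) or by commas.
SEP = r"(?:\s*,\s*|\s+)"
NUM = r"([-+]?\d+\.\d+)"
COORD_LINE = re.compile(
    rf"^\s*([A-Za-z]{{1,2}}|\d{{1,3}}){SEP}{NUM}{SEP}{NUM}{SEP}{NUM}\s*$"
)

def parse_coordinates(text):
    """Return (label, x, y, z) tuples for lines that look like atomic coordinates."""
    atoms = []
    for line in text.splitlines():
        match = COORD_LINE.match(line)
        if match:
            label, x, y, z = match.groups()
            atoms.append((label, float(x), float(y), float(z)))
    return atoms

# Made-up supporting-information fragment mixing noise with three delimiter styles.
sample = """Table S1. Optimized geometry of TS1
C 0.000000 0.000000 0.000000
8,0.000000,0.000000,1.208000
H\t0.937000\t0.000000\t-0.587000
Page 12
"""

atoms = parse_coordinates(sample)
```

The label is kept as written (symbol or atomic number) and the order of the matched lines is preserved, so later steps can decide how to normalise it. A purely numeric table row could also match this pattern, which is one reason format-specific templates may still be needed.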
INTRODUCTION
Several standard molecular representations in ASCII format that are easily readable by
molecular modeling and chemoinformatics software packages are available. A specific set of
guidelines defined by publishers for submitting molecular data, even in PDF format, would
accelerate the automatic processing and recognition of chemical data for further
computational studies related to reaction modeling, drug discovery and molecular inventory
management. Exchange of chemical data between multiple software packages without loss of
information is a critical requirement in computational chemistry and chemoinformatics
applications. A recent article emphasized the importance of curating large chemogenomics
data sets for building better predictive models for the life sciences [17]. During the
preparation of this manuscript, a timely research article by Rzepa's group on a granularity
model for extracting molecular information appeared [18]; it stresses the need for
periodic and automatic curation of data from supplementary information in research articles.
The present work is geared towards partial fulfillment of this need for "futuristic research
data management".
Although spectral, molecular and analytical data have been harvested in the past,
extraction of molecules directly from author-supplied atomic coordinates provided in
supplementary materials in PDF format has not been reported. Accordingly, in the present
work we have developed an application, ChemEngine, that reads files stored in PDF format
to extract molecular coordinates and generate computable molecular structures. To
demonstrate the efficiency of the program, supporting material data files of three different
molecular representations, in terms of the delimiters in the co-ordinate data, were selected
and successfully parsed using ChemEngine to extract molecular data. Supporting
materials also include brief descriptions of molecules, computed data, plots, page numbers,
document information, manuscript bibliographic details etc. in a single PDF document, which
makes harvesting the molecular data extremely difficult because these items have to be
selectively excluded while parsing the file. Given an input file in PDF format, the program
yields three different files: a file in GJF format, a text file containing the computed
bond matrix, and a file with all molecules in SDF format. The contents of the non-molecular
data file can also be utilized by subjecting it to standard text mining methodologies to
retrieve molecule names or other information, such as the list of basis sets employed in the
specific computational work.
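To make the GJF output concrete, here is a minimal Python sketch that writes harvested coordinates as a Gaussian input file for a single point energy job. The function name `to_gjf`, the route line `# HF/STO-3G SP`, and the default charge and multiplicity are placeholders chosen for this example; the poster does not state which level of theory or settings ChemEngine writes.

```python
def to_gjf(title, atoms, route="# HF/STO-3G SP", charge=0, multiplicity=1):
    """Format atoms [(symbol, x, y, z), ...] as Gaussian input (GJF) text.

    Layout: route section, blank line, title, blank line,
    charge and multiplicity, one line per atom, closing blank line.
    The route, charge and multiplicity defaults are illustrative assumptions.
    """
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2} {x:14.8f} {y:14.8f} {z:14.8f}")
    lines.append("")  # the molecule specification ends with a blank line
    return "\n".join(lines) + "\n"

# Approximate water geometry, used only as sample input.
water = [
    ("O", 0.000, 0.000, 0.000),
    ("H", 0.757, 0.586, 0.000),
    ("H", -0.757, 0.586, 0.000),
]
gjf_text = to_gjf("water, harvested coordinates", water)
```

Charge and multiplicity cannot be read from the coordinates alone, so a real workflow would have to take them from the surrounding text or from the user.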
MATERIALS AND METHODS
In general, the major problem in processing molecular data stored in PDF files arises from
the non-standard representation of coordinates, such as inconsistency in the number of digits
after the decimal point, interchange of atom type with atomic number in the first
column and improper alignment of the x, y, z coordinate values. ChemEngine identifies the text
patterns and processes this information to yield a common generic coordinate
matrix. The bond matrix algorithm implemented in the program then generates a bond
matrix, from which a connection table is created to produce the ready-to-compute 3D molecular
structure. (AN = atomic number, AS = atomic symbol, CT = connection table)
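The poster does not describe the bond matrix algorithm itself. A common way to derive connectivity from 3D coordinates is a distance criterion based on covalent radii, sketched below in Python as an assumption rather than as ChemEngine's method. The radii are approximate literature values for a few elements, and the `tolerance` parameter is an arbitrary choice.

```python
import math

# Approximate covalent radii in angstroms for a few elements (illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def bond_matrix(atoms, tolerance=0.4):
    """Return a symmetric 0/1 matrix; 1 marks atom pairs within bonding distance.

    atoms: list of (symbol, x, y, z). Two atoms count as bonded when their
    distance is at most the sum of their covalent radii plus `tolerance`.
    """
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        symbol_i, *pos_i = atoms[i]
        for j in range(i + 1, n):
            symbol_j, *pos_j = atoms[j]
            cutoff = COVALENT_RADII[symbol_i] + COVALENT_RADII[symbol_j] + tolerance
            if math.dist(pos_i, pos_j) <= cutoff:
                matrix[i][j] = matrix[j][i] = 1
    return matrix

# Approximate water geometry, used only as sample input.
water = [
    ("O", 0.000, 0.000, 0.000),
    ("H", 0.757, 0.586, 0.000),
    ("H", -0.757, 0.586, 0.000),
]
water_bonds = bond_matrix(water)
```

This criterion finds connectivity only, not bond orders, and the partially formed bonds of a transition state may fall on either side of the cutoff, so the tolerance would need tuning for such structures.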
In the present work we process the molecules and transform them into SDF format, which is
largely compatible with commercial packages, thus saving time and computational effort. The
compute-once, use-many-times approach will help readers access the original input
files even after the passage of time. It is therefore suggested that the chemical community
maintain a standard and consistent representation of chemical structure data in
electronic supplementary files, in native format or a standard data format, to facilitate
re-usability across the scientific community.
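For readers unfamiliar with SDF, the sketch below shows in Python how a coordinate list and a bond list might be written as one SDF record using the fixed-width V2000 molfile layout. The function name `to_sdf_record` is hypothetical, and treating every bond as a single bond is a simplifying assumption; it is not a description of ChemEngine's writer.

```python
def to_sdf_record(name, atoms, bonds):
    """Build one SDF record: a V2000 molfile followed by the '$$$$' terminator.

    atoms: list of (symbol, x, y, z); bonds: list of (i, j, order), 1-based indices.
    """
    lines = [name, "", ""]  # header block: name, program/timestamp line, comment line
    # Counts line: atom count, bond count, eight unused fields, then the version tag.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y, z in atoms:
        # Coordinates are fixed width with four decimals, so values are rounded.
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11)
    for i, j, order in bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    lines.append("$$$$")
    return "\n".join(lines) + "\n"

# Approximate water geometry with both O-H bonds written as single bonds.
water = [
    ("O", 0.000, 0.000, 0.000),
    ("H", 0.757, 0.586, 0.000),
    ("H", -0.757, 0.586, 0.000),
]
record = to_sdf_record("water", water, [(1, 2, 1), (1, 3, 1)])
```

Because the V2000 atom block keeps only four decimal places, energies recomputed from an SDF file may differ slightly from those computed on the full-precision coordinates.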
RESULTS
CONCLUSIONS
The ChemEngine application presented here selectively extracts 3D structures
from coordinate information that is mixed with inadvertently introduced noisy
data in PDF files. This approach can, to some extent, prevent the loss of
chemical data while conserving the memory and storage space
required at the journal site. The methodology exemplified here will enable
molecule mining in a semantic context and encourage reuse of the valuable
data by interested readers, thereby enhancing the citations of the authors.
REFERENCES
Karthikeyan M, Vyas R. Role of Open Source Tools and Resources in Virtual Screening
for Drug Discovery. Combinatorial Chemistry & High Throughput Screening.
2015;18(6):528-543.
Postma G, van Bakel B, Kateman G. Automatic Extraction of Analytical Chemical
Information. System Description, Inventory of Tasks and Problems, and Preliminary
Results. Journal of Chemical Information and Modeling. 1996;36(4):770-785.
Karthikeyan M, Bender A. Encoding and Decoding Graphical Chemical Structures as
Two-Dimensional (PDF417) Barcodes. ChemInform. 2005;36(32).
Murray-Rust P, Mitchell J, Rzepa H. BMC Bioinformatics. 2005;6(1):141. 2005;6(1):180.
Guha R, Howard M, Hutchison G, Murray-Rust P, Rzepa H, Steinbeck C et al. The Blue
Obelisk Interoperability in Chemical Informatics. Journal of Chemical Information and
Modeling. 2006;46(3):991-998.
Fourches D, Muratov E, Tropsha A. Curation of Chemogenomics data. Nature Chemical
Biology. 2015;11(8):535.
Harvey M, Mason N, McLean A, Murray-Rust P, Rzepa H, Stewart J. Standards-based
curation of a decade-old digital repository dataset of molecular information. J
Cheminform. 2015;7(1).
Karthikeyan M, Krishnan S, Pandey A, Bender A. Harvesting Chemical Information from
the Internet Using a Distributed Approach: ChemXtreme. Journal of Chemical
Information and Modeling. 2006;46(2):452-461.
Karthikeyan, M. Automatic harvesting of molecular information raster graphics. US
Patent Appl 14/241285, 2011.
Karthikeyan M, Pandit Y, Pandit D, Vyas R. MegaMiner: A Tool for Lead Identification
Through Text Mining Using Chemoinformatics Tools and Cloud Computing Environment.
Combinatorial Chemistry & High Throughput Screening. 2015;18(6):591-603.
Karthikeyan M, Vyas R. Practical Chemoinformatics. Springer; 2014.
ACKNOWLEDGEMENT
MK thanks the Department of Science and Technology for the award of an International Travel
Grant to attend the meeting. MK thanks the Director, CSIR-NCL, for providing infrastructure
and support. The financial support received from GENESIS (BSC0121) and INSPIRE
(CSC0107) under the 12FYP projects is duly acknowledged. RV thanks DST, New Delhi, for the
award of a fellowship (WOS-A/LS-201/2011).
ChemEngine: Harvesting 3D Chemical Structures of Supplementary Data from PDF Files
Muthukumarasamy Karthikeyan†* and Renu Vyas‡
†CSIR-National Chemical Laboratory (Government of India), Pune 411008, India
Entry | Case study | N (molecules) | Regular expression pattern | Format | Delimiter
1 | Epoxide formation from sulfur ylides and aldehydes | 29 | ^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | PDF | Space
2 | Thiol ene click chemistry | 115 | ^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | Text | Space
3 | Design of tetra(arenediyl)bis(allyl) derivatives for Cope rearrangement transition states | 55 | ^[A-Za-z0-9]{1,2},[0]{0,1}[,]{0,1}-{0,1}.{1,2}[0-9]{1,10},-{0,1}.{1,2}[0-9]{1,10}.{1,} | PDF | Comma
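The patterns in the table can be tried as line filters. The Python sketch below uses them with `\s+` where the transcript shows a bare `s+`; that restoration is an assumption based on the "Space" delimiter column, since the backslash appears to have been lost in extraction. The helper name `coordinate_lines` and the sample text are made up for this example.

```python
import re

# Patterns from the table. The "\s" is restored by assumption (the transcript
# shows "s+"); everything else is kept as printed, including the unescaped dots.
SPACE_PATTERN = re.compile(
    r"^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,}"
)
COMMA_PATTERN = re.compile(
    r"^[A-Za-z0-9]{1,2},[0]{0,1}[,]{0,1}-{0,1}.{1,2}[0-9]{1,10},-{0,1}.{1,2}[0-9]{1,10}.{1,}"
)

def coordinate_lines(text, pattern):
    """Keep only the lines that the given pattern accepts as coordinate rows."""
    return [line for line in text.splitlines() if pattern.match(line)]

# Invented fragments in the two delimiter styles, with surrounding noise.
space_sample = """Cartesian coordinates of TS1
C 0.000000 1.234567 -0.500000
H -1.026719 0.000000 0.363000
E = -1234.567890 hartree
Page 12"""

comma_sample = """Optimized geometry
C,0,-1.234567,0.456789,2.000000
H,0.123456,0.456789,2.000000
Page 3"""

space_rows = coordinate_lines(space_sample, SPACE_PATTERN)
comma_rows = coordinate_lines(comma_sample, COMMA_PATTERN)
```

Both patterns check only the label and the first two numeric columns and accept anything after them, so the z value still has to be validated when the line is split into fields.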
[Figure: ChemEngine workflow. Panel labels: PDF, Text, ChemEngine, 3D molecule data. Steps: Input Data Format, Pattern Recognition, Regular Expression, Generate Coordinates.]
https://sourceforge.net/projects/chemengine/

chemengine karthi acs sandiego rev1.0
