Intervet Chemicals Directory (ICD) - A Framework Combining Accelrys Pipeline Pilot and Symyx Isentris
1. Intervet Chemicals Directory (ICD)
A Framework Combining Accelrys Pipeline Pilot and Symyx Isentris
Frank Oellien
10/26/2010
Accelrys European User Group Meeting, Barcelona
2. Outline
• Motivation for the ICD project (historical review)
• Technical implementation (2003)
• ICD today (enhancements in recent years)
• Technical limitations of the Isentris approach
• Solution: combining Symyx Isentris & Accelrys PP
  - Structure registration, synchronization
  - Database cleaning
  - Property calculations
SP Intervet Chemicals Directory (ICD)
10/26/2010
2
3. Motivation
• Start of the ICD project: 2003
• Company was still young
• BioChemInformatics group (more precisely, the cheminformatics branch) started its work on a regular basis
  - Ligand- and structure-based virtual screening (LO and Hit2Lead projects)
  - Property and descriptor calculations
  - QSAR
  - Substructure and similarity searches
→ Access to many in-house data sources, especially structures, required
→ Many exchange formats used (including Excel and SD files)
→ Many diverse tools and applications used
5. The Idea - A Central Data Source
[Diagram: other data sources, in-house databases and supplier data are delivered as SD files and merged into the ICD; the ICD in turn serves CompLog, BCI applications and medicinal chemists.]
6. Requirements
• Standard data source for all BCI tasks
• Merged data source including in-house structures, supplier structures and other data sources
• Dynamically updated
• Structure database with unique structure identifier
• Standardized and normalized data (including chemical normalization)
• Extendable system that can store other BCI-relevant information (e.g. virtual screening data)
Ask other scientists in the Drug Discovery department:
• Storing supplier catalogues and other supplier information
• Data source for compound ordering
• Accessible by other scientists (especially medicinal chemists)
• Storage of physico-chemical properties for research projects
7. Implementation: Reasons for Isentris (2003)
• Not many systems available in 2003 (Auspyx, Accord, Isentris)
• Isentris used many technologies that were already available in-house (MDL Direct, Oracle)
• Chemical normalization available: Cheshire
• Advanced J2EE architecture and API that allow good customization and extension
• CoRe: an existing project already based on Isentris
  - Intervet was an early adopter of Isentris
  - No additional software costs
  - Synergy effects (e.g. chemical business rules)
8. Implementation Overview
[Workflow diagram]
• Input: SD files (supplier catalogs; TORE updates, in-house)
• CACTVS (Linux): file syntax normalisation of SD files; generation of salt information and ParentHash codes → prepared SD files
• Java applications (Windows) with MDL Isentris (client-server): chemical normalisation using the chemical rules (CheckAndFix_Main.cct), then registration → ICD
• ADME data: Java application and Oracle SQLLoader (phys-chem properties) → ICD
9. Implementation: SD File Syntax Standardisation
• Based on the CACTVS application (by Xemistry)
• SD files can come from different inputs
• 2 generic scripts (supplier-specific, in-house-specific) standardize the format of the input SD files, driven by supplier-specific configuration files
• SDF fields for supplier-related files:
  SupplierName, OrderNo, CatalogName, CatalogType, CatalogRelease, Confidential, CompoundName, IsSalt, Salt, Quantity, Purity
• SDF fields for in-house data:
  AHNO, CompoundName, IsSalt, Salt
• Calculation of structural hash codes (parent structure hash code)
  - Hash codes insensitive to: isotope, salt, tautomer, stereochemistry
• Automatic knowledge-based identification of salts
→ 174 different salts can be determined
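
The field standardisation step described above can be illustrated with a small sketch. The original scripts were CACTVS-based and are not shown on the slide, so the Python below is only an assumption-laden stand-in: the function names, the supplier tag names (`CAT_NO`, `NAME`) and the `tag_map`/`defaults` configuration shape are hypothetical, while the target field names come from the slide. It relies only on the general SD file layout (molfile ending in `M  END`, data items introduced by `> <Tag>`, records closed by `$$$$`).

```python
from typing import Dict, Iterator, List, Tuple

# Standard supplier fields as listed on the slide.
STANDARD_FIELDS = [
    "SupplierName", "OrderNo", "CatalogName", "CatalogType", "CatalogRelease",
    "Confidential", "CompoundName", "IsSalt", "Salt", "Quantity", "Purity",
]


def split_records(sd_text: str) -> Iterator[str]:
    """Yield the text of each SD record (everything before a '$$$$' line)."""
    block: List[str] = []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(block)
            block = []
        else:
            block.append(line)
    if any(line.strip() for line in block):  # last record without delimiter
        yield "\n".join(block)


def parse_record(record: str) -> Tuple[str, Dict[str, str]]:
    """Split one record into its molfile part and a dict of data items."""
    lines = record.split("\n")
    molfile: List[str] = []
    fields: Dict[str, str] = {}
    i = 0
    while i < len(lines):  # the molfile ends with the 'M  END' line
        molfile.append(lines[i])
        i += 1
        if molfile[-1].startswith("M  END"):
            break
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1:line.rindex(">")]
            value: List[str] = []
            while i < len(lines) and lines[i].strip():
                value.append(lines[i])
                i += 1
            fields[name] = "\n".join(value)
    return "\n".join(molfile), fields


def standardise(record: str, tag_map: Dict[str, str],
                defaults: Dict[str, str]) -> str:
    """Rename supplier tags to standard names and emit a clean SD record.

    tag_map and defaults stand in for a supplier-specific configuration
    file (hypothetical format). Tags that map to no standard field are dropped.
    """
    molfile, fields = parse_record(record)
    renamed = {tag_map.get(name, name): value for name, value in fields.items()}
    out = [molfile]
    for name in STANDARD_FIELDS:
        value = renamed.get(name, defaults.get(name, ""))
        if value:
            out += [f">  <{name}>", value, ""]
    out.append("$$$$")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "mol1", "  demo", "", "  0  0  0  0  0  0  0  0  0  0999 V2000",
        "M  END", ">  <CAT_NO>", "AB-123", "", ">  <NAME>", "aspirin", "", "$$$$",
    ])
    for rec in split_records(sample):
        print(standardise(rec, {"CAT_NO": "OrderNo", "NAME": "CompoundName"},
                          {"SupplierName": "ExampleSupplier"}))
```

Keeping the tag mapping in data rather than in code mirrors the slide's split between generic scripts and supplier-specific configuration files: a new supplier would only need a new mapping.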
10. Implementation: Chemical Normalisation
• Based on Cheshire (part of the Isentris framework)
• JavaScript clone
• Valence checks, Ion2kov, nitro group, transition metals, queries, geometries, stereochemistry, ...
• 99 rules
  - 45 correction functions
  - 29 warning functions
  - 25 error functions
• Used by CoRe and ICD applications
• Input: molfile string
• Output: molfile string and message string
→ Category; No. of changes; list of descriptions
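
The rule structure above (correction, warning and error functions that take a molfile string and return a molfile string plus a message) can be sketched as follows. This is not Cheshire code: the three rules, the function names and the exact message layout (`category; number of changes; descriptions`) are illustrative assumptions, and the rules operate only on the fixed-column V2000 atom block rather than on real chemistry.

```python
from typing import Callable, List, Optional, Tuple

# A rule maps a molfile string to (molfile, description or None).
Rule = Callable[[str], Tuple[str, Optional[str]]]


def make_molfile(symbol: str) -> str:
    """Build a minimal one-atom V2000 molfile (helper for the example only)."""
    atom = f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0" + "  0" * 11
    counts = "  1  0  0  0  0  0  0  0  0  0999 V2000"
    return "\n".join(["example", "", "", counts, atom, "M  END"])


def atom_block(molfile: str) -> Tuple[List[str], range]:
    """Return all lines plus the index range of the atom lines."""
    lines = molfile.split("\n")
    n_atoms = int(lines[3][0:3])  # atom count: first 3 columns of counts line
    return lines, range(4, 4 + n_atoms)


def fix_symbol_case(molfile: str) -> Tuple[str, Optional[str]]:
    """Hypothetical correction: re-case atom symbols such as 'CL' to 'Cl'."""
    lines, atoms = atom_block(molfile)
    changed = 0
    for i in atoms:
        symbol = lines[i][31:34].strip()  # symbol sits in columns 32-34
        fixed = symbol.capitalize()
        if fixed != symbol:
            lines[i] = lines[i][:31] + fixed.ljust(3) + lines[i][34:]
            changed += 1
    note = f"{changed} atom symbol(s) re-cased" if changed else None
    return "\n".join(lines), note


def warn_query_atoms(molfile: str) -> Tuple[str, Optional[str]]:
    """Hypothetical warning: the structure contains query atom symbols."""
    lines, atoms = atom_block(molfile)
    hits = [i for i in atoms if lines[i][31:34].strip() in {"A", "Q", "*", "L"}]
    return molfile, (f"{len(hits)} query atom(s) present" if hits else None)


def error_no_atoms(molfile: str) -> Tuple[str, Optional[str]]:
    """Hypothetical error: the structure has no atoms at all."""
    _, atoms = atom_block(molfile)
    return molfile, ("structure has no atoms" if len(atoms) == 0 else None)


RULES: List[Tuple[str, Rule]] = [
    ("correction", fix_symbol_case),
    ("warning", warn_query_atoms),
    ("error", error_no_atoms),
]


def normalise(molfile: str) -> Tuple[str, str]:
    """Apply all rules in order; return the new molfile and a message string."""
    findings: List[Tuple[str, str]] = []
    for kind, rule in RULES:
        molfile, note = rule(molfile)
        if note:
            findings.append((kind, note))
    kinds = {kind for kind, _ in findings}
    if "error" in kinds:
        category = "error"
    elif "warning" in kinds:
        category = "warning"
    elif "correction" in kinds:
        category = "fixed"
    else:
        category = "unchanged"
    n_changes = sum(1 for kind, _ in findings if kind == "correction")
    message = f"{category}; {n_changes}; " + " | ".join(n for _, n in findings)
    return molfile, message


if __name__ == "__main__":
    print(normalise(make_molfile("CL"))[1])
```

Reporting the most severe category while still returning the corrected molfile lets a caller decide separately whether to register, flag or reject a record.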
11. Implementation: Registration
• Based on the Symyx Isentris Java client (now Accelrys Isentris)
• Uses Isentris data sources (Data Source Factory)
• 3 Java applications (in-house structures, supplier, virtual screening)
→ 31 Java classes, ~9,500 lines of code
• Run types: command line, GUI, batch mode
• Chemical normalisation, duplicate check, registration logic
*
* ICD Supplier Registration
* Version null
* Frank Oellien, Intervet Innovation GmbH
*
1:10:21 PM INFO: Chemical Normalization status:
304334 records without changes
4685 records fixed
11 records fixed but still have warnings
200 records with warnings
2 records with errors
1:10:21 PM INFO: Chemical Registration status:
1:10:21 PM INFO: New supplier has been registered.
309230 records to register
309222 records passed registration
8 records failed registration
180284 new structues registered
128938 structues already found in the DB
1:10:21 PM INFO: Closing Cheshire environment...
1:10:22 PM INFO: Releasing the ICD datasource resources ...
1:10:22 PM INFO: Closing the ICD DataSourceFactory...
1:10:22 PM INFO: Logout...
1:10:22 PM INFO: All resources released.
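
The bookkeeping behind the registration counts in the log can be sketched roughly as below. The real applications used Isentris data sources for the duplicate check; here a plain dictionary stands in for the database and a SHA-1 of the normalised molfile text stands in for the structure comparison, both of which are assumptions made purely for illustration (as are all names in the code).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class RegistrationStats:
    to_register: int = 0
    new_structures: int = 0
    already_in_db: int = 0
    failed: int = 0

    @property
    def passed(self) -> int:
        # Every record that passes is either a new or a known structure.
        return self.new_structures + self.already_in_db


def structure_key(molfile: str) -> str:
    """Placeholder duplicate key: hash of the normalised molfile text.

    A production system would use a chemistry-aware exact-match search instead.
    """
    return hashlib.sha1(molfile.encode("utf-8")).hexdigest()


def register(records: Iterable[Tuple[str, str]],
             db: Dict[str, dict]) -> RegistrationStats:
    """Register (order_no, molfile) pairs, attaching duplicates to known entries."""
    stats = RegistrationStats()
    for order_no, molfile in records:
        stats.to_register += 1
        if not molfile.strip():  # nothing to register
            stats.failed += 1
            continue
        key = structure_key(molfile)
        entry = db.get(key)
        if entry is None:
            db[key] = {"molfile": molfile, "order_nos": [order_no]}
            stats.new_structures += 1
        else:
            entry["order_nos"].append(order_no)
            stats.already_in_db += 1
    return stats


if __name__ == "__main__":
    database: Dict[str, dict] = {}
    result = register([("A1", "mol-a"), ("A2", "mol-b"), ("A3", "mol-a"), ("A4", "")],
                      database)
    print(result, result.passed)
```

The same invariants hold for the numbers in the log above: 180284 new plus 128938 already known structures give the 309222 records that passed, and adding the 8 failures gives the 309230 records to register.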
12. ICD Today - Datasheet
• ~11,500,000 structures
• 237 different catalogues (including screening libraries, focused data sets)
• 60 suppliers
• A broad range of standard physicochemical properties
• Intervet's in-house database
• Specific Intervet data sets
• References to external sources (PubChem)
13. ICD Today - Change of Relevance
• Still the main data source for the BCI group, although almost all other BCI technologies have changed in the meantime
• Moreover, it has become a key technology platform for the whole Drug Discovery process
  - Almost all compound logistics activities are based on the ICD (applications for compound ordering)
  - Stores specific essential information for CompLog
  - Important database for Hit2Lead and LO projects (contains decision-critical properties)
  - Has become the most important structure database for medicinal chemists
• Isentris upgrade to 3.1 → re-design of the Isentris part of the ICD necessary
• New demands by BCI and others had to be implemented
→ could not be realized with the former setup because of its limitations
• Solution: combination with Pipeline Pilot
14. Limitations of the original Isentris Setup
From the Beginning
• Started with Isentris 1.1 as early adopters
• Hard to implement: large, over-designed J2EE API, no developer guides, only a few small code snippets
• Limited and complicated functions
  - e.g. no support for very large structure files
• Re-design of the applications was necessary because of Isentris updates
• No automation; everything is done in user context!
Regarding recent Demands
• Missing automation was still the most critical issue:
  - Synchronisation
  - Adding non-structural data
• Elaborate database cleaning mechanisms
15. Registration of Supplier Catalogues
[Workflow diagram]
• Input: SD files
• CACTVS (Linux): structural normalisation of SD files; generation of salt information and ParentHash codes → prepared SD files
• Java applications (Windows) with MDL Isentris (client-server): chemical normalisation using the chemical rules (CheckAndFix_Main.cct), then registration → ICD
16. Registration of in-house Structures by PP I
[Workflow diagram]
• Input: in-house database
• Synchronisation by Pipeline Pilot (Linux)
• CACTVS called by PP: structural normalisation of SD files; generation of salt information and ParentHash codes
• Cheshire PP component: chemical normalisation using the chemical rules (CheckAndFix_Main.cct), then registration → ICD
17. Registration of in-house Structures by PP II
Retrieve structures from database
Call CACTVS application
Chemical Normalisation & Registration
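
The three protocol steps above (retrieve, call CACTVS, normalise and register) are Pipeline Pilot components and are not reproduced here. As a rough illustration of the synchronisation idea only, the sketch below assumes that in-house records are keyed by their AHNO and that only records missing from the ICD are pushed through the processing stages; the function name and the stage callables are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def synchronise(in_house: Dict[str, str],
                registered_ids: Iterable[str],
                stages: List[Callable[[str], str]]) -> Dict[str, str]:
    """Run every in-house structure not yet in the ICD through all stages.

    in_house maps AHNO -> molfile; registered_ids are the AHNOs already
    present in the ICD; each stage maps a molfile string to a molfile string.
    """
    known = set(registered_ids)
    processed: Dict[str, str] = {}
    for ahno, molfile in in_house.items():
        if ahno in known:
            continue  # already synchronised, nothing to do
        for stage in stages:
            molfile = stage(molfile)
        processed[ahno] = molfile
    return processed


if __name__ == "__main__":
    # str.strip and str.upper are stand-ins for the real processing stages.
    todo = synchronise({"AH1": " a ", "AH2": " b "}, ["AH1"], [str.strip, str.upper])
    print(todo)
```

Because the function only looks at the difference between the two identifier sets, it can be run unattended on a schedule, which is the kind of automation the original Isentris setup lacked.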
18. Cheshire PP Component (Java)
• Implemented as a PP Java component
• Based on the Cheshire Java API
• Calls the Cheshire core library (shared object files called via JNI)
25. Isentris PP Components @ SP Intervet
• Isentris Cheshire PP
• Converter:
  - Chime string to Molecule
  - Chime string to CTAB
  - Molecule to Chime string
  - CTAB to Chime string
26. Acknowledgement
Information Management
• Werner Schlüter
• Thomas Fischer
BioChemInformatics
• Richard Marhöfer
• Andreas Krasky
• (Jörg Cramer)
• Jörg Schröder
• Paul M. Selzer
Thank you