Intervet Chemicals Directory (ICD)
A Framework Combining Accelrys Pipeline Pilot and Symyx Isentris
Frank Oellien

10/26/2010
Accelrys European User Group Meeting, Barcelona
Outline
• Motivation for the ICD project (historical review)
• Technical implementation (2003)
• ICD today (enhancements in recent years)
• Technical limitations of the Isentris approach
• Solution: combining Symyx Isentris and Accelrys Pipeline Pilot (PP)
  - Structure registration, synchronization
  - Database cleaning
  - Property calculations

SP Intervet Chemicals Directory (ICD)

10/26/2010

Motivation
• Start of the ICD project: 2003
• Company was still young
• BioChemInformatics (BCI) group (more precisely, its cheminformatics branch) started working on a regular basis
  - Ligand- and structure-based virtual screening (LO and Hit2Lead projects)
  - Property and descriptor calculations
  - QSAR
  - Substructure and similarity searches

→ Access to many in-house data sources required, especially structures
→ Many exchange formats used (including Excel and SD files)
→ Many diverse tools and applications used

Pre-ICD Time (before Q2 2003)

The Idea - A Central Data Source
[Diagram: in-house databases, supplier data (delivered as SD files) and other data sources feed into the central ICD; CompLog, BCI applications and medicinal chemists work from the ICD.]

Requirements
• Standard data source for all BCI tasks
• Merged data source including in-house structures, supplier structures and other data sources
• Dynamically updated
• Structure database with a unique structure identifier
• Standardized and normalized data (including chemical normalization)
• Extendable system that can store other BCI-relevant information (e.g. virtual screening data)
Requirements gathered from other scientists in the Drug Discovery department:
• Storing supplier catalogues and other supplier information
• Data source for compound ordering
• Accessible by other scientists (especially medicinal chemists)
• Storage of physico-chemical properties for research projects

Implementation: Reasons for Isentris (2003)
• Not many systems available in 2003 (Auspyx, Accord, Isentris)
• Isentris used many technologies that were already available in-house (MDL Direct, Oracle)
• Chemical normalization available: Cheshire
• Advanced J2EE architecture and API that allow good customization and extension
• CoRe: an existing project already based on Isentris
  - Intervet was an early adopter of Isentris
  - No additional software costs
  - Synergy effects (e.g. chemical business rules)

Implementation Overview
[Diagram: data flow of the implementation]
• Input: SD files (supplier catalogs, TORE updates (in-house))
• CACTVS (Linux): file syntax normalisation of SD files; generation of salt information and ParentHash codes → prepared SD files
• Java applications (Windows): chemical normalisation with the chemical rules (CheckAndFix_Main.cct), then registration via MDL Isentris (client-server) → ICD
• ADME data (phys-chem properties): loaded into the ICD by a Java application and Oracle SQLLoader

Implementation: SD File Syntax Standardisation
• Based on a CACTVS application (by Xemistry)
• Input SD files can come in different formats
• 2 generic scripts (one for supplier data, one for in-house data) standardize the format of the input SD files, together with supplier-specific configuration files
• SDF fields for supplier-related files:
  SupplierName, OrderNo, CatalogName, CatalogType, CatalogRelease, Confidential, CompoundName, IsSalt, Salt, Quantity, Purity
• SDF fields for in-house data:
  AHNO, CompoundName, IsSalt, Salt
• Calculation of structural hash codes (parent structure hash code)
  Hash codes are insensitive to: isotope, salt, tautomer, stereochemistry
• Automatic, knowledge-based identification of salts
  → 174 different salts can be identified
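A standardized supplier file can be checked for the SDF fields listed above before it is passed on. The following is a minimal Python sketch, not the CACTVS scripts used in the project: the function names and the example record are invented for illustration, and the parser assumes the usual SD layout in which each data item starts with a `>  <FieldName>` header line and ends at a blank line.

```python
# Required data fields for supplier-related SD files, as listed on the slide.
REQUIRED_SUPPLIER_FIELDS = [
    "SupplierName", "OrderNo", "CatalogName", "CatalogType", "CatalogRelease",
    "Confidential", "CompoundName", "IsSalt", "Salt", "Quantity", "Purity",
]

# Invented example record; the molblock is replaced by a placeholder line.
EXAMPLE_RECORD = """(molblock omitted)
M  END
>  <SupplierName>
ExampleSupplier

>  <OrderNo>
ABC-123

$$$$
"""


def parse_sd_fields(record: str) -> dict:
    """Collect the data items of one SD record as {field name: value}."""
    fields = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        # A data header looks like ">  <FieldName>".
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            value_lines = []
            # The value runs until the next blank line or record delimiter.
            while i < len(lines) and lines[i].strip() and lines[i] != "$$$$":
                value_lines.append(lines[i])
                i += 1
            fields[name] = "\n".join(value_lines)
        else:
            i += 1
    return fields


def missing_fields(record: str, required=REQUIRED_SUPPLIER_FIELDS) -> list:
    """Return the required field names that are absent from the record."""
    present = parse_sd_fields(record)
    return [name for name in required if name not in present]
```

Calling `missing_fields(EXAMPLE_RECORD)` lists the nine supplier fields that the invented record does not carry; for in-house data the `required` argument would be the shorter list AHNO, CompoundName, IsSalt, Salt.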

Implementation: Chemical Normalisation
• Based on Cheshire (part of the Isentris framework)
• JavaScript clone
• Valence checks, Ion2kov, nitro groups, transition metals, queries, geometries, stereochemistry, ...
• 99 rules
  - 45 correction functions
  - 29 warning functions
  - 25 error functions

• Used by CoRe and ICD applications
• Input: molfile string
• Output: molfile string and message string
  → Category; number of changes; list of descriptions
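The split into correction, warning and error functions, with a molfile string going in and a molfile string plus a message string coming out, can be sketched in Python. This is only an illustration of the pattern, not Cheshire code: the two example rules work on plain text rather than on chemistry, all names are hypothetical, and the message layout (category; count; descriptions) is an assumption based on the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A rule receives a molfile string and returns the (possibly changed) molfile
# plus a description of what it found, or None if it found nothing.
Rule = Callable[[str], Tuple[str, Optional[str]]]


@dataclass
class NormalizationResult:
    molfile: str
    category: str
    descriptions: List[str] = field(default_factory=list)

    def message(self) -> str:
        # Assumed layout: category; number of findings; list of descriptions
        return f"{self.category}; {len(self.descriptions)}; {', '.join(self.descriptions)}"


def normalize(molfile: str, corrections: List[Rule], warnings: List[Rule],
              errors: List[Rule]) -> NormalizationResult:
    """Apply correction rules first, then collect warnings and errors."""
    descriptions: List[str] = []
    fixed = warned = failed = False
    for rule in corrections:
        molfile, note = rule(molfile)  # corrections may rewrite the molfile
        if note:
            descriptions.append(note)
            fixed = True
    for rule in warnings:
        _, note = rule(molfile)        # warnings only report
        if note:
            descriptions.append(note)
            warned = True
    for rule in errors:
        _, note = rule(molfile)        # errors only report
        if note:
            descriptions.append(note)
            failed = True
    if failed:
        category = "error"
    elif fixed and warned:
        category = "fixed with warnings"
    elif warned:
        category = "warning"
    elif fixed:
        category = "fixed"
    else:
        category = "unchanged"
    return NormalizationResult(molfile, category, descriptions)


# Text-level stand-ins for real chemical rules (hypothetical).
def fix_line_endings(molfile: str) -> Tuple[str, Optional[str]]:
    fixed = molfile.replace("\r\n", "\n")
    return fixed, ("converted CRLF line endings" if fixed != molfile else None)


def error_if_missing_end(molfile: str) -> Tuple[str, Optional[str]]:
    return molfile, (None if "M  END" in molfile else "missing 'M  END' line")
```

Running corrections before the checks means a record is only reported with a warning or error if the problem survives the fixes, which matches the "fixed but still have warnings" category in the registration log.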

Implementation: Registration
• Based on the Symyx Isentris Java client (now Accelrys Isentris)
• Uses Isentris data sources (Data Source Factory)
• 3 Java applications (in-house structures, supplier, virtual screening)
  → 31 Java classes, ~9,500 lines of code
• Run types: command line, GUI, batch mode
• Chemical normalisation, duplicate check, registration logic
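The duplicate check during registration can be pictured as a lookup against the structures already stored. The Python sketch below assumes that the ParentHash code computed earlier serves as the lookup key; the slides do not say which key the Java applications actually use, and the function name and identifier format are invented.

```python
from typing import Dict, Iterable, Tuple


def register_structures(records: Iterable[Tuple[str, str]],
                        registry: Dict[str, str]) -> Tuple[int, int]:
    """Register (parent_hash, molfile) records into `registry`.

    `registry` maps a parent hash to a structure identifier and is updated
    in place. Returns (newly registered, already present) counts.
    """
    new = existing = 0
    for parent_hash, _molfile in records:
        if parent_hash in registry:
            # Structure is known: only the supplier link would be added.
            existing += 1
        else:
            # Hypothetical identifier scheme, for illustration only.
            registry[parent_hash] = f"ICD-{len(registry) + 1:08d}"
            new += 1
    return new, existing
```

The two counters correspond to the "new structures registered" and "structures already found in the DB" lines of a registration run.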


*
* ICD Supplier Registration
* Version null
* Frank Oellien, Intervet Innovation GmbH
*

1:10:21 PM INFO: Chemical Normalization status:
304334 records without changes
4685 records fixed
11 records fixed but still have warnings
200 records with warnings
2 records with errors
1:10:21 PM INFO: Chemical Registration status:
1:10:21 PM INFO: New supplier has been registered.
309230 records to register
309222 records passed registration
8 records failed registration
180284 new structues registered
128938 structues already found in the DB

1:10:21 PM INFO: Closing Cheshire environment...
1:10:22 PM INFO: Releasing the ICD datasource resources ...
1:10:22 PM INFO: Closing the ICD DataSourceFactory...
1:10:22 PM INFO: Logout...
1:10:22 PM INFO: All resources released.
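The counts in this log are internally consistent, which a few lines of Python arithmetic can confirm. The reading that the two records with errors were dropped before registration is an assumption; the log itself does not state it.

```python
# Counts copied from the registration log above.
normalization = {
    "without changes": 304334,
    "fixed": 4685,
    "fixed but still have warnings": 11,
    "with warnings": 200,
    "with errors": 2,
}
registration = {
    "to register": 309230,
    "passed": 309222,
    "failed": 8,
    "new": 180284,
    "already in DB": 128938,
}

total_normalized = sum(normalization.values())
# Assumption: records with errors are not forwarded to registration.
forwarded = total_normalized - normalization["with errors"]

forwarded_matches = forwarded == registration["to register"]
passed_plus_failed_matches = (
    registration["passed"] + registration["failed"] == registration["to register"]
)
new_plus_known_matches = (
    registration["new"] + registration["already in DB"] == registration["passed"]
)
duplicate_share = registration["already in DB"] / registration["passed"]
```

All three checks hold, and `duplicate_share` shows that roughly 42% of the records that passed registration were already present in the database.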

ICD Today - Datasheet
• ~11,500,000 structures
• 237 different catalogues (including screening libraries, focused data sets)
• 60 suppliers
• A broad range of standard physicochemical properties
• Intervet's in-house database
• Specific Intervet data sets
• References to external sources (PubChem)

ICD Today - Change of Relevance
• Still the main data source for the BCI group, although almost all other BCI technologies have changed in the meantime
• Moreover, it has become a key technology platform for the whole Drug Discovery process
  - Almost all compound logistics activities are based on the ICD (applications for compound ordering)
  - Stores specific essential information for CompLog
  - Important database for Hit2Lead and LO projects (contains decision-critical properties)
  - Has become the most important structure database for medicinal chemists

• Isentris upgrade to 3.1 → re-design of the Isentris part of the ICD necessary
• New demands by BCI and others had to be implemented
  → could not be realized with the former setup because of its limitations
• Solution: combination with Pipeline Pilot

Limitations of the original Isentris Setup
From the beginning
• Started with Isentris 1.1 as early adopters
• Hard to implement: large, over-designed J2EE API, no developer guides, only a few small code snippets
• Limited and complicated functions
  - e.g. no support for very large structure files

• Applications had to be re-designed because of Isentris updates
• No automation; everything is done in the user context!
Regarding recent demands
• Missing automation was still the most critical issue:
  - Synchronisation
  - Adding non-structural data

• Elaborate database cleaning mechanisms
Registration of Supplier Catalogues
[Diagram: SD files → CACTVS (Linux): structural normalisation of SD files, generation of salt information and ParentHash codes → prepared SD files → Java applications (Windows): chemical normalisation with the chemical rules (CheckAndFix_Main.cct), registration via MDL Isentris (client-server) → ICD]

Registration of in-house Structures by PP I
[Diagram: in-house database → CACTVS called by PP: structural normalisation of SD files, generation of salt information and ParentHash codes → Cheshire PP component: chemical normalisation with the chemical rules (CheckAndFix_Main.cct), registration → ICD. The synchronisation is run by Pipeline Pilot (Linux).]

Registration of in-house Structures by PP II
Retrieve structures from database
Call CACTVS application

Chemical Normalisation & Registration
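Before structures are retrieved and pushed through CACTVS and Cheshire, a synchronisation run has to decide which in-house records need work. The Python sketch below shows one way to plan that, assuming both sides can be keyed by the in-house compound number (the AHNO field); the function name and the comparison by stored molfile are illustrative assumptions, not the logic of the Pipeline Pilot protocol.

```python
from typing import Dict, List, Tuple


def plan_synchronisation(inhouse: Dict[str, str],
                         icd: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare two {AHNO: molfile} mappings.

    Returns (to_register, to_update): compound numbers missing from the ICD,
    and compound numbers whose structure differs between the two sources.
    """
    to_register = sorted(ahno for ahno in inhouse if ahno not in icd)
    to_update = sorted(
        ahno for ahno in inhouse if ahno in icd and inhouse[ahno] != icd[ahno]
    )
    return to_register, to_update
```

Only the records in these two lists would then be sent on to structural normalisation, chemical normalisation and registration, which keeps an automated nightly run short.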

Cheshire PP Component (Java)

• Implemented as a PP Java component
• Based on the Cheshire Java API
• Calls the Cheshire core library (shared object files called via JNI)

Importing physico-chemical Properties I

[Diagram: ADME data (phys-chem properties) are loaded into the ICD by a Java application and Oracle SQLLoader]

Importing physico-chemical Properties I
[Diagram: workflow managed by Pipeline Pilot (Linux)]
• Retrieval of structures without properties from the ICD
• External application 1 (standardize)
• External applications 2, 3 and 4 (descriptors) and internal PP components (descriptors)
• Import of the properties into the ICD
• ADME data (phys-chem properties): Java application and Oracle SQLLoader → ICD
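The loop of retrieving structures without properties, calculating descriptors and importing the results can be sketched in Python. The sketch uses an in-memory SQLite database as a stand-in for the Oracle-based ICD; the table layout, function names and the descriptor callable are all hypothetical and only illustrate the control flow.

```python
import sqlite3
from typing import Callable, Dict


def create_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in database: three structures, one with a property."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE structure (id INTEGER PRIMARY KEY, molfile TEXT NOT NULL);
        CREATE TABLE property (
            structure_id INTEGER NOT NULL REFERENCES structure(id),
            name TEXT NOT NULL,
            value REAL NOT NULL
        );
        INSERT INTO structure (id, molfile) VALUES (1, 'mol-1'), (2, 'mol-2'), (3, 'mol-3');
        INSERT INTO property (structure_id, name, value) VALUES (1, 'demo_descriptor', 5.0);
        """
    )
    return conn


def structures_without_properties(conn: sqlite3.Connection) -> list:
    """Return (id, molfile) rows that have no property row yet."""
    return conn.execute(
        "SELECT s.id, s.molfile FROM structure s "
        "LEFT JOIN property p ON p.structure_id = s.id "
        "WHERE p.structure_id IS NULL ORDER BY s.id"
    ).fetchall()


def import_properties(conn: sqlite3.Connection,
                      calculators: Dict[str, Callable[[str], float]]) -> int:
    """Calculate and store every property for each pending structure.

    Returns the number of structures that were processed.
    """
    pending = structures_without_properties(conn)
    for structure_id, molfile in pending:
        for name, calculate in calculators.items():
            conn.execute(
                "INSERT INTO property (structure_id, name, value) VALUES (?, ?, ?)",
                (structure_id, name, calculate(molfile)),
            )
    conn.commit()
    return len(pending)
```

Because the selection is driven by what is missing in the database, the job can be run repeatedly without user interaction: a second run directly after the first finds nothing left to do.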

Importing physico-chemical Properties II

Database Maintenance

Isentris PP Components @ SP Intervet
• Isentris Cheshire PP
• Converter:
  - Chime string to Molecule
  - Chime string to CTAB
  - Molecule to Chime string
  - CTAB to Chime string

Acknowledgement
Information Management
• Werner Schlüter
• Thomas Fischer
BioChemInformatics
• Richard Marhöfer
• Andreas Krasky
• (Jörg Cramer)
• Jörg Schröder
• Paul M. Selzer

Thank you

