際際滷

際際滷Share a Scribd company logo
Quasi-Monte Carlo integration for
the fast and effective generation of
   molecular shape 鍖ngerprints
               John D. MacCuish
              Norah E. MacCuish,
     Michael Hawrylycz, and Mitch Chapman

            ACS San Francisco 2010
             COMP Drug Discovery


           john.maccuish@mesaac.com
Outline
 Why Shape? Why Shape Fingerprints?
 Prior Work




                                  1.00
                                  0.90
                                  0.80
                                  0.70
                                  0.60
 quasi-Monte Carlo Integration




                                  0.50
                                  0.40
                                  0.30
                                  0.20
                                  0.10
                                  0.00
                                         0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00




 Shape Fingerprint Generation
 Performance, Examples
 Future Work
Why Shape?
 Shape is a component in binding...
     3D QSAR
     similarity searching
     compound acquisition (shape diversity)
     virtual screening...

 Can be confounding -- multiple
  binding sites, surface binding...
 One more tool in the toolbox...
 Boost to 2D
   Pluses and minuses: Molecular Shape and Medicinal Chemistry:
      A Perspective, JMedChem, Feb. 2010, Nicholls, et al
Why Shape Fingerprints?

 Compute shape comparisons
  ef鍖ciently on a large scale

 Ef鍖cient storage
 Simple.
Prior Work
 Shape as a mixture of spherical
  gaussians (atoms) (Grant and
  Pickup...96)
 Fingerprints composed of key
  shapes, given the above method
  (Haigh, et al...05).
 Ultrafast Shape Recognition:
  Ballester, Richards 2007
 ShaEP, Vainio, et al 2009
Monte Carlo Integration
 Approximate an integral (e.g., a 3D
  volume) by random sampling
 Approximate volume of odd shapes or
  manifolds that are analytically dif鍖cult
  or have no closed form solution.
 Better error convergence than grid
  sampling (Metropolis).
quasi-Monte Carlo
   Integration (QMC)
 In practice, quasi-randomly generated
  points have best error convergence in
  low dimensions.
 Beats uniform random sampling (e.g.,
  pseudo, dart throwing, etc.)
 QMC became popular in early-90s in
  computer graphics (Shirley, 91) mid-90s
  in 鍖nance - options/futures pricing
  (Morokoff, Ca鍖isch, 95)
Approximate Volume
                                                                                                                                                                                 quasi-random
                                   grid                                                           pseudo-random                                                                (low discrepancy)                                                            P
1.00




                                                                                    1.00




                                                                                                                                                                        1.00
0.90




                                                                                    0.90




                                                                                                                                                                        0.90
0.80




                                                                                    0.80




                                                                                                                                                                        0.80
0.70




                                                                                    0.70




                                                                                                                                                                        0.70
0.60




                                                                                    0.60




                                                                                                                                                                        0.60
0.50




                                                                                    0.50




                                                                                                                                                                        0.50
0.40




                                                                                    0.40




                                                                                                                                                                        0.40
0.30




                                                                                    0.30




                                                                                                                                                                        0.30
0.20




                                                                                    0.20




                                                                                                                                                                        0.20
0.10




                                                                                    0.10




                                                                                                                                                                        0.10
0.00




                                                                                    0.00




                                                                                                                                                                        0.00
                                                                                                                                                                               0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
       0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00          0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00



        Error related to the                                                                 Error related to just                                                             Error related to just
         number of points                                                                     number of points                                                                  number of points
        and the dimension
Approximation Error
 grid -- for error 竜, need 1/竜d grid points
  -- error convergence exponential in
  the dimension
 pseudo -- 1n convergence
 quasi -- 1/n convergence in practice
  in low dimensions
 quasi-random fewest points needed
  for equivalent approximation error.
Approximating the
Volume of a molecule
 E.g., CPK with van der Waals radii --
  Set of intersecting spheres
 Find a suitable bounding region for
  molecules
 Point sample the bounding region
  and tally up the points either inside
  or outside of the atom spheres.
Approximating the
Volume of a molecule
 Bounding Volume times (Points
  Inside)/(All Points) ~= molecular
  volume.

 Sphere or scalene ellipsoid as a
  bounding region reduces total overall
  points.
Volume Percent Error
   95% below    95% below
                             4961 low energy
   4.5% error   3.6% error
                             conformations
                             1357 Molecules
                             100-550 MW
                             鍖exible, in鍖exible

                            balloon
                            conformations
  95% below      95% below 1-15 conformations
   2% error      1.3% error
Fingerprint Generation
     Preliminaries
 Need a bounding region that covers the
  conformational space of the database of
  small molecule conformations.

 Cube? Sphere? Scalene Ellipsoid.
 Need orientation and alignment.
Fewer points
                          1.00
                          0.90
                          0.80
                          0.70
                          0.60




Dont need these points                                                                                       Dont need these points
                          0.50
                          0.40
                          0.30
                          0.20
                          0.10
                          0.00




                                 0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
Fingerprint Generation
1. Generate a sample quasi-point set bounding
   region centered and 鍖xed at (0,0,0); Sphere of
   ~11  radius. This is 鍖xed, done once.
2. Mean center atom centers point set of a
   conformation to be 鍖ngerprinted.
3. Find sample points in the atom spheres (or
   function or your choice, e.g., solvent accessible
   surface ...)
4. Find the principal axes for a shape using the
   sampled points and the atom centers with SVD.
   Adjust for SVD sign ambiguity (Bro, et al, 2008).
Fingerprint Generation
5. Rotate atom centers point set to Principal
   Axes (PA)
6. Find points in PA con鍖guration and
   create 鍖ngerprint with 4 orientations --
   points in molecule 1, points out 0
7. Sub鍖ngerprints, each the length of the
   number of points.
Molecule in a Volume
of quasi-Random points
Four con鍖gurations
Alignment Overlay
Alignment Overlay
Shape Fingerprints
 4 鍖ngerprints per shape. e.g., 10,240
  X 4 bits = 5.12 Kbytes.

 Number and density of points
  determines resolution, speed of
  comparison, and storage size.

 Number of bits on is small, typically
  3-10% with CPK model. Fingerprints
  compress signi鍖cantly.
Fingerprint Similarity

 Choose 1 subFP of Shape FP A and compare it
  with all 4 subFPs of Shape FP B.

 The largest similarity comparison is chosen.
 Bit difference per subFP is small in a given FP
Fingerprint Similarity
 Inherent slight asymmetry of alignment of
  point sets of different sizes.

 Aligning B to As principal axes, vs aligning A
  to Bs principal axes.

 Try this on your favorite method...
 Use bit string similarity measure of your
  choice... Tanimoto, Ochai (Cosine), ...,
  Tversky,... Baroni-Urbani/Bush...
Performance
 With ~10K points and including IO:
 Fingerprint generation: ~500K
   conformations per hour

 Fingerprint Tanimoto comparisons:
   130KX130K matrix in 24 hours

 Large scale similarity searching trivial,
   e.g., 10X1M < 10 minutes
 Numbers estimated with 2009 iMac, with an Intel Core i7 2.8GHz processor
   and 4GB RAM, running Mac OS X 10.6.2; small to large compounds;
Space!
 5.12 Kbytes per 鍖ngerprint -- ~90%
  compression

 2 million conformations = 1+ GBs
 Space and CPU improving all of the
  time...
FP Experiments
         Typical all-pairs
       Tanimoto similarity
           distribution
      1357, 2D 768 MACCS
        Keys 鍖ngerprints

        Where the action is:
        Similarity searching,
          Clustering, etc.
             above 0.7
FP Experiments
All-pairs ShapeFingerprints
        4961 confs.


      Where the action is:
      Similarity searching,
                              Still values above 0.7
           Clustering,
        Alignment, etc.
         above ~0.65
Examples
        Multi-conformation set of actives in a
           secondary screen in ROCK2 from
           PubChem

        Xray ROCK2 ligand
        Fingerprint shape cluster
        Analyze shape cluster that contains the
           xray structure with ChemTattoo 3D
Fragment database analysis using molecular shape fingerprints,
        ACS San Francisco, Wednesday 8:40 , CINF 106
Shape Clustering
Example - ROCK2
Future Work
 Other shape functions
 Other low discrepancy point sets in
  different distributions -- density
  where it is needed.

 Speed (time) and compression
  (space)

 More practical applications
Acknowledgments
 JMol

 Balloon

 OpenBabel

 R
 PubChem

 PDB

            john.maccuish@mesaac.com

More Related Content

QMC-based Shape Fingerprints

  • 1. Quasi-Monte Carlo integration for the fast and effective generation of molecular shape 鍖ngerprints John D. MacCuish Norah E. MacCuish, Michael Hawrylycz, and Mitch Chapman ACS San Francisco 2010 COMP Drug Discovery john.maccuish@mesaac.com
  • 2. Outline Why Shape? Why Shape Fingerprints? Prior Work 1.00 0.90 0.80 0.70 0.60 quasi-Monte Carlo Integration 0.50 0.40 0.30 0.20 0.10 0.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 Shape Fingerprint Generation Performance, Examples Future Work
  • 3. Why Shape? Shape is a component in binding... 3D QSAR similarity searching compound acquisition (shape diversity) virtual screening... Can be confounding -- multiple binding sites, surface binding... One more tool in the toolbox... Boost to 2D Pluses and minuses: Molecular Shape and Medicinal Chemistry: A Perspective, JMedChem, Feb. 2010, Nicholls, et al
  • 4. Why Shape Fingerprints? Compute shape comparisons ef鍖ciently on a large scale Ef鍖cient storage Simple.
  • 5. Prior Work Shape as a mixture of spherical gaussians (atoms) (Grant and Pickup...96) Fingerprints composed of key shapes, given the above method (Haigh, et al...05). Ultrafast Shape Recognition: Ballester, Richards 2007 ShaEP, Vainio, et al 2009
  • 6. Monte Carlo Integration Approximate an integral (e.g., a 3D volume) by random sampling Approximate volume of odd shapes or manifolds that are analytically dif鍖cult or have no closed form solution. Better error convergence than grid sampling (Metropolis).
  • 7. quasi-Monte Carlo Integration (QMC) In practice, quasi-randomly generated points have best error convergence in low dimensions. Beats uniform random sampling (e.g., pseudo, dart throwing, etc.) QMC became popular in early-90s in computer graphics (Shirley, 91) mid-90s in 鍖nance - options/futures pricing (Morokoff, Ca鍖isch, 95)
  • 8. Approximate Volume quasi-random grid pseudo-random (low discrepancy) P 1.00 1.00 1.00 0.90 0.90 0.90 0.80 0.80 0.80 0.70 0.70 0.70 0.60 0.60 0.60 0.50 0.50 0.50 0.40 0.40 0.40 0.30 0.30 0.30 0.20 0.20 0.20 0.10 0.10 0.10 0.00 0.00 0.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 Error related to the Error related to just Error related to just number of points number of points number of points and the dimension
  • 9. Approximation Error grid -- for error 竜, need 1/竜d grid points -- error convergence exponential in the dimension pseudo -- 1n convergence quasi -- 1/n convergence in practice in low dimensions quasi-random fewest points needed for equivalent approximation error.
  • 10. Approximating the Volume of a molecule E.g., CPK with van der Waals radii -- Set of intersecting spheres Find a suitable bounding region for molecules Point sample the bounding region and tally up the points either inside or outside of the atom spheres.
  • 11. Approximating the Volume of a molecule Bounding Volume times (Points Inside)/(All Points) ~= molecular volume. Sphere or scalene ellipsoid as a bounding region reduces total overall points.
  • 12. Volume Percent Error 95% below 95% below 4961 low energy 4.5% error 3.6% error conformations 1357 Molecules 100-550 MW 鍖exible, in鍖exible balloon conformations 95% below 95% below 1-15 conformations 2% error 1.3% error
  • 13. Fingerprint Generation Preliminaries Need a bounding region that covers the conformational space of the database of small molecule conformations. Cube? Sphere? Scalene Ellipsoid. Need orientation and alignment.
  • 14. Fewer points 1.00 0.90 0.80 0.70 0.60 Dont need these points Dont need these points 0.50 0.40 0.30 0.20 0.10 0.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00
  • 15. Fingerprint Generation 1. Generate a sample quasi-point set bounding region centered and 鍖xed at (0,0,0); Sphere of ~11 radius. This is 鍖xed, done once. 2. Mean center atom centers point set of a conformation to be 鍖ngerprinted. 3. Find sample points in the atom spheres (or function or your choice, e.g., solvent accessible surface ...) 4. Find the principal axes for a shape using the sampled points and the atom centers with SVD. Adjust for SVD sign ambiguity (Bro, et al, 2008).
  • 16. Fingerprint Generation 5. Rotate atom centers point set to Principal Axes (PA) 6. Find points in PA con鍖guration and create 鍖ngerprint with 4 orientations -- points in molecule 1, points out 0 7. Sub鍖ngerprints, each the length of the number of points.
  • 17. Molecule in a Volume of quasi-Random points
  • 21. Shape Fingerprints 4 鍖ngerprints per shape. e.g., 10,240 X 4 bits = 5.12 Kbytes. Number and density of points determines resolution, speed of comparison, and storage size. Number of bits on is small, typically 3-10% with CPK model. Fingerprints compress signi鍖cantly.
  • 22. Fingerprint Similarity Choose 1 subFP of Shape FP A and compare it with all 4 subFPs of Shape FP B. The largest similarity comparison is chosen. Bit difference per subFP is small in a given FP
  • 23. Fingerprint Similarity Inherent slight asymmetry of alignment of point sets of different sizes. Aligning B to As principal axes, vs aligning A to Bs principal axes. Try this on your favorite method... Use bit string similarity measure of your choice... Tanimoto, Ochai (Cosine), ..., Tversky,... Baroni-Urbani/Bush...
  • 24. Performance With ~10K points and including IO: Fingerprint generation: ~500K conformations per hour Fingerprint Tanimoto comparisons: 130KX130K matrix in 24 hours Large scale similarity searching trivial, e.g., 10X1M < 10 minutes Numbers estimated with 2009 iMac, with an Intel Core i7 2.8GHz processor and 4GB RAM, running Mac OS X 10.6.2; small to large compounds;
  • 25. Space! 5.12 Kbytes per 鍖ngerprint -- ~90% compression 2 million conformations = 1+ GBs Space and CPU improving all of the time...
  • 26. FP Experiments Typical all-pairs Tanimoto similarity distribution 1357, 2D 768 MACCS Keys 鍖ngerprints Where the action is: Similarity searching, Clustering, etc. above 0.7
  • 27. FP Experiments All-pairs ShapeFingerprints 4961 confs. Where the action is: Similarity searching, Still values above 0.7 Clustering, Alignment, etc. above ~0.65
  • 28. Examples Multi-conformation set of actives in a secondary screen in ROCK2 from PubChem Xray ROCK2 ligand Fingerprint shape cluster Analyze shape cluster that contains the xray structure with ChemTattoo 3D Fragment database analysis using molecular shape fingerprints, ACS San Francisco, Wednesday 8:40 , CINF 106
  • 30. Future Work Other shape functions Other low discrepancy point sets in different distributions -- density where it is needed. Speed (time) and compression (space) More practical applications
  • 31. Acknowledgments JMol Balloon OpenBabel R PubChem PDB john.maccuish@mesaac.com

Editor's Notes