Quasi-Monte Carlo integration for
the fast and effective generation of
molecular shape fingerprints
John D. MacCuish, Norah E. MacCuish,
Michael Hawrylycz, and Mitch Chapman

ACS San Francisco 2010
COMP Drug Discovery

john.maccuish@mesaac.com

Outline
• Why Shape? Why Shape Fingerprints?
• Prior Work
• quasi-Monte Carlo Integration
• Shape Fingerprint Generation
• Performance, Examples
• Future Work

Why Shape?
• Shape is a component in binding...
• 3D QSAR
• similarity searching
• compound acquisition (shape diversity)
• virtual screening...

• Can be confounding -- multiple
binding sites, surface binding...
• One more tool in the toolbox...
• Boost to 2D
• Pluses and minuses: Nicholls et al., "Molecular Shape and Medicinal
Chemistry: A Perspective", J. Med. Chem., Feb. 2010

Why Shape Fingerprints?

• Compute shape comparisons
efficiently on a large scale

• Efficient storage
• Simple.

Prior Work
• Shape as a mixture of spherical
Gaussians (atoms) (Grant and
Pickup...’96)
• Fingerprints composed of key
shapes, given the above method
(Haigh, et al...’05)
• Ultrafast Shape Recognition
(Ballester, Richards 2007)
• ShaEP (Vainio, et al 2009)

Monte Carlo Integration
• Approximate an integral (e.g., a 3D
volume) by random sampling
• Approximate the volume of odd shapes or
manifolds that are analytically difficult
or have no closed-form solution.
• Better error convergence than grid
sampling (Metropolis).
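
As a minimal sketch of the idea above, the following estimates the volume of the unit ball by uniform pseudo-random sampling of its bounding cube. The function name `mc_ball_volume` and the fixed seed are illustrative choices, not taken from the talk.

```python
import math
import random


def mc_ball_volume(n_points, seed=0):
    """Estimate the volume of the unit ball by uniform pseudo-random sampling."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # Draw a point uniformly from the bounding cube [-1, 1]^3.
        x, y, z = (rng.uniform(-1.0, 1.0) for _ in range(3))
        if x * x + y * y + z * z <= 1.0:
            inside += 1
    cube_volume = 8.0
    # Volume ~= bounding volume times the fraction of points that landed inside.
    return cube_volume * inside / n_points


if __name__ == "__main__":
    exact = 4.0 / 3.0 * math.pi
    print(f"estimate={mc_ball_volume(100_000):.4f} exact={exact:.4f}")
```

The same tally works for any shape with a cheap inside/outside test, which is what makes the approach attractive when no closed-form volume exists.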

quasi-Monte Carlo
Integration (QMC)
• In practice, quasi-randomly generated
points have the best error convergence in
low dimensions.
• Beats uniform random sampling (e.g.,
pseudo-random, dart throwing, etc.)
• QMC became popular in the early ’90s in
computer graphics (Shirley, ’91) and in the
mid-’90s in finance, for options/futures
pricing (Morokoff, Caflisch, ’95)
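
The talk does not say which low-discrepancy sequence was used. As an assumption for illustration, the sketch below generates a Halton sequence, one common choice, using the van der Corput radical inverse with bases 2, 3, and 5 for the three coordinates.

```python
def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` about the radix point."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton(n_points, bases=(2, 3, 5)):
    """First n_points of the Halton sequence (indices 1..n) in the unit cube."""
    return [
        tuple(radical_inverse(i, b) for b in bases)
        for i in range(1, n_points + 1)
    ]


if __name__ == "__main__":
    for point in halton(5):
        print(point)
```

Unlike pseudo-random draws, each new point falls into the largest remaining gap along every axis, which is the informal meaning of "low discrepancy".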

Approximate Volume
[Figure: three point sets on the unit square, both axes 0.00 to 1.00]
• grid: error related to the number of points and the dimension
• pseudo-random: error related to just the number of points
• quasi-random (low discrepancy): error related to just the number of points

Approximation Error
• grid -- for error ε, need 1/ε^d grid points
-- the number of points grows exponentially
in the dimension d
• pseudo -- 1/√n convergence
• quasi -- 1/n convergence in practice
in low dimensions
• quasi-random needs the fewest points
for an equivalent approximation error.
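
To make the rates above concrete, this sketch turns them into order-of-magnitude point budgets for a target error. It ignores all constant factors, so the numbers are only indicative. The helper name `points_needed` is hypothetical.

```python
import math


def points_needed(eps, dim):
    """Order-of-magnitude point counts for target error eps (constants ignored)."""
    return {
        "grid": math.ceil((1.0 / eps) ** dim),   # n ~ 1/eps^d
        "pseudo": math.ceil((1.0 / eps) ** 2),   # error ~ 1/sqrt(n)
        "quasi": math.ceil(1.0 / eps),           # error ~ 1/n, in practice, low d
    }


if __name__ == "__main__":
    print(points_needed(0.125, 3))
```

For ε = 0.01 in three dimensions the same rates give roughly 10^6 grid points, 10^4 pseudo-random points, and 10^2 quasi-random points.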

Approximating the
Volume of a molecule
• E.g., CPK with van der Waals radii --
Set of intersecting spheres
• Find a suitable bounding region for
molecules
• Point sample the bounding region
and tally up the points either inside
or outside of the atom spheres.

Approximating the
Volume of a molecule
• Bounding Volume times (Points
Inside)/(All Points) ≈ molecular
volume.

• A sphere or scalene ellipsoid as the
bounding region reduces the total number
of points.
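
The estimate above can be sketched as follows, assuming a spherical bounding region and Halton points as the quasi-random set (the talk does not name the sequence). Atoms are given as hypothetical `((x, y, z), radius)` pairs and must lie inside the bounding sphere.

```python
import math


def radical_inverse(index, base):
    """Van der Corput radical inverse of `index` in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def sphere_points(n_cube_points, radius):
    """Halton points in [-radius, radius]^3, keeping those inside the sphere."""
    points = []
    for i in range(1, n_cube_points + 1):
        p = tuple((2.0 * radical_inverse(i, b) - 1.0) * radius for b in (2, 3, 5))
        if p[0] ** 2 + p[1] ** 2 + p[2] ** 2 <= radius ** 2:
            points.append(p)
    return points


def molecular_volume(atoms, points, bounding_radius):
    """Volume of the union of atom spheres: bounding volume * inside / all."""
    inside = 0
    for px, py, pz in points:
        if any((px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2 <= r * r
               for (ax, ay, az), r in atoms):
            inside += 1
    bounding_volume = 4.0 / 3.0 * math.pi * bounding_radius ** 3
    return bounding_volume * inside / len(points)


if __name__ == "__main__":
    pts = sphere_points(20_000, 3.0)
    single_atom = [((0.0, 0.0, 0.0), 1.7)]
    print(molecular_volume(single_atom, pts, 3.0), 4.0 / 3.0 * math.pi * 1.7 ** 3)
```

Discarding the cube corners up front is the point of the second bullet: only samples inside the bounding sphere are ever tested against the atoms.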

Volume Percent Error
• Data set: 4961 low energy conformations of 1357 molecules,
100-550 MW, flexible and inflexible
• Conformations generated with Balloon, 1-15 conformations per molecule
[Figure: four error histograms; 95% of values fall below 4.5%, 3.6%,
2%, and 1.3% error in the four panels]

Fingerprint Generation
Preliminaries
• Need a bounding region that covers the
conformational space of the database of
small molecule conformations.

• Cube? Sphere? Scalene Ellipsoid.
• Need orientation and alignment.

Fewer points
[Figure: point set on the unit square with two regions labeled
"Don’t need these points"]

1. Generate a quasi-random sample point set filling a
bounding region centered and fixed at (0,0,0): a sphere of
~11 Å radius. This is fixed and done once.
2. Mean-center the atom-center point set of the
conformation to be fingerprinted.
3. Find the sample points inside the atom spheres (or a
function of your choice, e.g., solvent accessible
surface ...)
4. Find the principal axes of the shape using the
sampled points and the atom centers with SVD.
Adjust for SVD sign ambiguity (Bro, et al, 2008).
5. Rotate the atom-center point set to the Principal
Axes (PA)
6. Find the sample points inside the molecule in the PA
configuration and create the fingerprint with 4
orientations -- points in the molecule ‘1’, points out ‘0’
7. The result is 4 subfingerprints, each as long as the
number of sample points.
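
The steps above can be sketched with NumPy as shown below. This is a sketch under stated assumptions, not the authors' implementation. The talk does not spell out the 4 orientations, so the sketch assumes they are the identity plus the 180-degree rotations about each principal axis. The sign correction of Bro et al. is not implemented, and a determinant check that keeps a proper rotation stands in for it. All names are hypothetical.

```python
import numpy as np


def principal_axes_fingerprint(atom_centers, atom_radii, sample_points):
    """Return a (4, n_points) 0/1 array, one subfingerprint per orientation.

    atom_centers: (n_atoms, 3); atom_radii: (n_atoms,);
    sample_points: (n_points, 3), the fixed origin-centered point set (step 1).
    """
    centers = atom_centers - atom_centers.mean(axis=0)            # step 2

    def occupancy(c):
        # Squared distance from every sample point to every atom center.
        d2 = ((sample_points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return (d2 <= atom_radii[None, :] ** 2).any(axis=1)

    inside = occupancy(centers)                                    # step 3
    cloud = np.vstack([sample_points[inside], centers])           # step 4
    cloud = cloud - cloud.mean(axis=0)
    _, _, vt = np.linalg.svd(cloud, full_matrices=False)
    if np.linalg.det(vt) < 0:
        vt[2] *= -1.0          # keep a proper rotation (no reflection)
    aligned = centers @ vt.T                                       # step 5

    # Step 6 (assumption): identity plus 180-degree turns about each axis.
    flips = np.array(
        [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float
    )
    return np.array([occupancy(aligned * f) for f in flips], dtype=np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # pseudo-random stand-in for the QMC set
    points = rng.uniform(-6.0, 6.0, size=(2000, 3))
    centers = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.5, 0.0]])
    radii = np.array([1.7, 1.7, 1.5])
    fp = principal_axes_fingerprint(centers, radii, points)
    print(fp.shape, fp.sum(axis=1))
```

Because the sample point set is fixed, bit i means the same location in space for every fingerprint, which is what makes bitwise comparison meaningful.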

Molecule in a Volume
of quasi-Random points

Shape Fingerprints
• 4 fingerprints per shape, e.g., 10,240
x 4 bits = 5.12 Kbytes.

• The number and density of points
determine resolution, speed of
comparison, and storage size.

• The number of bits set is small, typically
3-10% with the CPK model. Fingerprints
compress significantly.

Fingerprint Similarity

• Choose 1 subFP of Shape FP A and compare it
with all 4 subFPs of Shape FP B.

• The largest similarity value is chosen.
• The bit difference between subFPs within a given FP is small
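
A minimal sketch of this comparison, assuming each subfingerprint is stored as a Python integer used as a bit string and Tanimoto is the chosen measure. The function names are illustrative.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two bit strings stored as non-negative ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention assumed here for two empty fingerprints
    return bin(a & b).count("1") / union


def shape_similarity(sub_fp_a, sub_fps_b):
    """Compare one subFP of A with every subFP of B and keep the largest value."""
    return max(tanimoto(sub_fp_a, fp_b) for fp_b in sub_fps_b)


if __name__ == "__main__":
    a = 0b1100
    b_orientations = [0b0011, 0b0110, 0b1100, 0b1000]
    print(shape_similarity(a, b_orientations))
```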

Fingerprint Similarity
• Inherent slight asymmetry in the alignment of
point sets of different sizes.

• Aligning B to A’s principal axes vs. aligning A
to B’s principal axes.

• Try this on your favorite method...
• Use the bit string similarity measure of your
choice... Tanimoto, Ochiai (Cosine), ...,
Tversky,... Baroni-Urbani/Buser...
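
Two of the measures named above, written out for bit strings stored as Python integers. This is a sketch using the standard definitions. The `alpha` and `beta` weights of Tversky make the measure asymmetric, which is one way to express an A-versus-B preference.

```python
import math


def popcount(x):
    """Number of bits set in a non-negative int."""
    return bin(x).count("1")


def ochiai(a, b):
    """Ochiai (cosine): common bits / sqrt(bits in A * bits in B)."""
    na, nb = popcount(a), popcount(b)
    if na == 0 or nb == 0:
        return 0.0
    return popcount(a & b) / math.sqrt(na * nb)


def tversky(a, b, alpha, beta):
    """Tversky: common / (common + alpha * A-only bits + beta * B-only bits)."""
    common = popcount(a & b)
    denom = common + alpha * popcount(a & ~b) + beta * popcount(b & ~a)
    return common / denom if denom else 0.0


if __name__ == "__main__":
    print(ochiai(0b1100, 0b0110))
    print(tversky(0b1110, 0b0110, 1.0, 0.0), tversky(0b1110, 0b0110, 0.0, 1.0))
```

With `alpha = beta = 1`, Tversky reduces to Tanimoto.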

Performance
• With ~10K points and including IO:
• Fingerprint generation: ~500K
conformations per hour

• Fingerprint Tanimoto comparisons:
130K x 130K matrix in 24 hours

• Large scale similarity searching is trivial,
e.g., 10 x 1M in < 10 minutes
• Numbers estimated on a 2009 iMac with an Intel Core i7 2.8GHz processor
and 4GB RAM, running Mac OS X 10.6.2; small to large compounds.
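
As a rough consistency check on the figures above, the sketch below derives a comparison rate from the matrix timing and applies it to the 10 x 1M search. It assumes the full 130K x 130K matrix was computed. If only the symmetric half was, the rate halves and the search time doubles, which is still well under 10 minutes.

```python
def comparisons_per_second(n_rows, n_cols, hours):
    """Average comparison rate implied by an all-pairs run."""
    return n_rows * n_cols / (hours * 3600.0)


# Full-matrix assumption: 130K x 130K comparisons in 24 hours.
rate = comparisons_per_second(130_000, 130_000, 24)

# Time for 10 queries against 1M fingerprints at that rate.
search_seconds = 10 * 1_000_000 / rate

if __name__ == "__main__":
    print(f"{rate:,.0f} comparisons/s, search takes {search_seconds:.0f} s")
```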

Space!
• 5.12 Kbytes per fingerprint -- ~90%
compression

• 2 million conformations = 1+ GB
• Space and CPU improving all of the
time...
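
The storage figures can be reproduced with a few lines of arithmetic, using the 10,240-point, 4-orientation fingerprint from the earlier slide and treating "~90% compression" as keeping about 10% of the raw size.

```python
def fingerprint_bytes(n_points, n_orientations=4):
    """Raw size of one shape fingerprint: one bit per point per orientation."""
    return n_points * n_orientations // 8


raw = fingerprint_bytes(10_240)              # bytes per fingerprint
total_raw_gb = 2_000_000 * raw / 1e9         # 2 million conformations, uncompressed
total_compressed_gb = total_raw_gb * 0.10    # keeping ~10% after compression

if __name__ == "__main__":
    print(raw, total_raw_gb, total_compressed_gb)
```

This gives 5,120 bytes per fingerprint, about 10.24 GB uncompressed for 2 million conformations, and about 1 GB after compression, matching the "1+ GB" figure.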

FP Experiments
• Typical all-pairs Tanimoto similarity distribution:
1357 2D fingerprints, 768 MACCS Keys
• Where the action is: similarity searching,
clustering, etc., above 0.7

FP Experiments
• All-pairs Shape Fingerprints, 4961 confs.
• Where the action is: similarity searching,
clustering, alignment, etc., above ~0.65
• Still values above 0.7

Examples
• Multi-conformation set of actives in a
secondary screen in ROCK2 from
PubChem

• X-ray ROCK2 ligand
• Fingerprint shape cluster
• Analyze the shape cluster that contains the
X-ray structure with ChemTattoo 3D
"Fragment database analysis using molecular shape fingerprints",
ACS San Francisco, Wednesday 8:40, CINF 106

Shape Clustering
Example - ROCK2

Future Work
• Other shape functions
• Other low discrepancy point sets in
different distributions -- density
where it is needed.

• Speed (time) and compression
(space)

• More practical applications

Acknowledgments
• Jmol
• Balloon
• Open Babel
• R
• PubChem
• PDB

john.maccuish@mesaac.com
