One thing I’ve needed to do a couple of times recently is give an idea of how many similar compounds are available to the set of compounds I’m currently viewing. For example in designing a fragment library it is very useful to know for a particular fragment how many similar fragments are commercially available. Or when looking at the results of a high-throughput screen how many similar analogues to a particular hit were also screened.
To do this we need a way of doing a rapid similarity search of the reference database. I use OpenBabel in particular using the fast search capability with molecular fingerprints.
Molecular fingerprints encode molecular structure in a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints will allow you rapidly to determine the similarity between two molecule
Using OpenBabel
We first need to create the fast search index, by default the fingerprint is FP2 (Indexes linear fragments up to 7 atoms) and does not need to be defined but I’ve left it in so you can see the format if you want to use other fingerprints
1 2 |
/usr/local/bin/obabel '/Users/swain/Projects/FragFiles/sdf/allFragments.sdf' -O '/Users/swain/Projects/FragFiles/sdf/allFragments.fs -xfFP2 |
The available fingerprints are
1 2 3 4 5 |
FP2 Indexes linear fragments up to 7 atoms. FP3 SMARTS patterns specified in the file patterns.txt FP4 SMARTS patterns specified in the file SMARTS_InteLigand.txt MACCS SMARTS patterns specified in the file MACCS.txt |
This process may take a little while, on my MacBook Pro a 50K sdf took less than a minute to generate the fast search index. You can test using a simple search, the following Terminal command returns all structures with a Tanimoto similarity of > 0.85.
1 2 |
/usr/local/bin/obabel '/Users/swain/Projects/FragFiles/sdf/allFragments.fs' -osmi -s'C1CN(CCN1)C1=CC=CC=C1' -at0.85 |
c1(N2CCNCC2)ccccc1
N1(CCN(CC1)c1ccccc1)c1ccccc1
N1(CCN(CC1)C)c1ccccc1
N(c1ccccc1)CCN(CC)CC
N1(c2ccc(N)cc2)CCN(CC1)C
N1(c2ccc(N)cc2)CCN(CC1)CC
N1(c2ccc(N)cc2)CCN(CC1)C
N1(c2ccc(N)cc2)CCN(CC1)CC
N1(c2ccc(N)cc2)CCN(CC1)CC
N1(c2ccc(N)cc2)CCNCC1
N1(c2c(N)cccc2)CCNCC1
N1(c2c(N)cccc2)CCN(CC1)C
N1(c2cc(N)ccc2)CCN(CC1)C
N1(c2c(N)cccc2)CCN(CC1)C
N1(c2c(N)cccc2)CCN(CC1)CC
c1cccc(c1)N(CCN)CC
16 molecules converted
Whilst the following command would return the 5 most similar structures
1 2 |
/usr/local/bin/obabel '/Users/swain/Projects/FragFiles/sdf/allFragments.fs' -osmi -s'C1CN(CCN1)C1=CC=CC=C1' -at5 |
c1(N2CCNCC2)ccccc1
N1(CCN(CC1)c1ccccc1)c1ccccc1
N1(CCN(CC1)C)c1ccccc1
N(c1ccccc1)CCN(CC)CC
N1(c2ccc(N)cc2)CCN(CC1)CC
5 molecules converted
There is a slight catch, whilst all the output is displayed in the Terminal window not all is derived from stdout, this is because by default stderr is also displayed in the Terminal window and stdlog is mapped to stderr. We can check what comes from where by selectively suppressing either stdout or stderr. Many thanks to Chris Morley for explaining this.
This command stops the stderr output
1 |
/usr/local/bin/obabel '/Users/swain/Projects/FragFiles/sdf/allFragments.fs' -osmi -s'C1CN(CCN1)C1=CC=CC=C1' -at5 2> /dev/null |
This results in
c1(N2CCNCC2)ccccc1 N1(CCN(CC1)c1ccccc1)c1ccccc1
N1(CCN(CC1)C)c1ccccc1
N(c1ccccc1)CCN(CC)CC
N1(c2ccc(N)cc2)CCN(CC1)CC
If we now stop the stdout output
1 |
/usr/local/bin/obabel '/Users/swain/Projects/FragFiles/sdf/allFragments.fs' -osmi -s'C1CN(CCN1)C1=CC=CC=C1' -at5 > /dev/null |
This results in
5 molecules converted
Since for this script we only need to know the number of similar molecules we need to capture the stderr output.
Script Explanation
The first part imports various libraries and I have to confess I now use a simple boilerplate so not all are actually needed. We then open a dialog to allow the user to select the appropriate fast search file and get the absolute path. We then create a column to store the results in, and then loop through the table extracting the SMILES string for each structure. The next step is key, we create the subprocess but ensure we capture both stdout and stderr, (we actually only need stderr but by capturing both it does give options for modifying the script). You can obviously change the similarity coefficient from 0.85 if you want.
1 2 |
p = subprocess.Popen(['/usr/local/bin/obabel', fsFile, '-o', 'smi', '-s', smiles, '-at0.85'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
The output is in the following format
1 2 |
'1 molecule converted\n’ |
We now parse the output, add it to the table and update the display.
Since we are doing a separate search for each row in the table this might be slow for large datasets.
The Vortex Script
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 |
# This uses the OpenBabel fastsearch (http://openbabel.org/wiki/Tutorial:Fingerprints) import sys import com.dotmatics.vortex.util.Util as Util if Util.getPlatform() == Util.PlatformIsWindows: sys.path.append(vortex.getVortexFolder() + '\\modules\\jythonconsole') sys.path.append(vortex.getVortexFolder() + '\\modules\\jythonlib') else: sys.path.append(vortex.getVortexFolder() + '/modules/jythonconsole') sys.path.append(vortex.getVortexFolder() + '/modules/jythonlib') import urllib2 import urllib import subprocess # Open a dialog to choose a file # getFile(title, extensions, 0 = Open, 1 = Save) # vortex will keep track of the last folder you looked in etc file = vortex.getFile("Choose a fastsearch file", [".fs"], 0) #display file path if file: vortex.alert(str(file)) # Need to coerce file path fsFile=file.getAbsolutePath() # /usr/local/bin/babel '/Users/swain/Projects/FragFiles/sdf/allFragments.fs' -osmi -xt -s'NC(=N)c1ccccc1' column = vtable.findColumnWithName('NumSim', 1, 3) mfm = vtable.getMolFileManager() rows = vtable.getRealRowCount() for r in range(0, rows): smiles = vortex.getMolProperty(mfm.getMolFileAtRow(r), 'SMILES') p = subprocess.Popen(['/usr/local/bin/obabel', fsFile, '-o', 'smi', '-s', smiles, '-at0.85'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) # for stdout output = p.communicate()[0] # stderr output = p.communicate()[1] # output format '1 molecule converted\n’ # Parse the output vals= output.split(' ') column = vtable.findColumnWithName('NumSim', 0) column.setValueFromString(r, vals[0]) vtable.fireTableStructureChanged() |
The vortex script can be downloaded from here
Last Updated 7 May 2013