Vortex script using Using OpenBabel fastsearch

One thing I’ve needed to do a couple of times recently is give an idea of how many similar compounds are available to the set of compounds I’m currently viewing. For example in designing a fragment library it is very useful to know for a particular fragment how many similar fragments are commercially available. Or when looking at the results of a high-throughput screen how many similar analogues to a particular hit were also screened.

To do this we need a way of doing a rapid similarity search of the reference database. I use OpenBabel in particular using the fast search capability with molecular fingerprints.

Molecular fingerprints encode molecular structure in a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints will allow you rapidly to determine the similarity between two molecule

Using OpenBabel

We first need to create the fast search index, by default the fingerprint is FP2 (Indexes linear fragments up to 7 atoms) and does not need to be defined but I’ve left it in so you can see the format if you want to use other fingerprints

The available fingerprints are

This process may take a little while, on my MacBook Pro a 50K sdf took less than a minute to generate the fast search index. You can test using a simple search, the following Terminal command returns all structures with a Tanimoto similarity of > 0.85.

c1(N2CCNCC2)ccccc1 
N1(CCN(CC1)c1ccccc1)c1ccccc1 
N1(CCN(CC1)C)c1ccccc1 
N(c1ccccc1)CCN(CC)CC 
N1(c2ccc(N)cc2)CCN(CC1)C 
N1(c2ccc(N)cc2)CCN(CC1)CC 
N1(c2ccc(N)cc2)CCN(CC1)C 
N1(c2ccc(N)cc2)CCN(CC1)CC 
N1(c2ccc(N)cc2)CCN(CC1)CC 
N1(c2ccc(N)cc2)CCNCC1 
N1(c2c(N)cccc2)CCNCC1 
N1(c2c(N)cccc2)CCN(CC1)C 
N1(c2cc(N)ccc2)CCN(CC1)C 
N1(c2c(N)cccc2)CCN(CC1)C 
N1(c2c(N)cccc2)CCN(CC1)CC 
c1cccc(c1)N(CCN)CC 
16 molecules converted

Whilst the following command would return the 5 most similar structures

c1(N2CCNCC2)ccccc1 
N1(CCN(CC1)c1ccccc1)c1ccccc1 
N1(CCN(CC1)C)c1ccccc1 
N(c1ccccc1)CCN(CC)CC 
N1(c2ccc(N)cc2)CCN(CC1)CC 
5 molecules converted

There is a slight catch, whilst all the output is displayed in the Terminal window not all is derived from stdout, this is because by default stderr is also displayed in the Terminal window and stdlog is mapped to stderr. We can check what comes from where by selectively suppressing either stdout or stderr. Many thanks to Chris Morley for explaining this.

This command stops the stderr output

This results in

c1(N2CCNCC2)ccccc1 N1(CCN(CC1)c1ccccc1)c1ccccc1 
N1(CCN(CC1)C)c1ccccc1 
N(c1ccccc1)CCN(CC)CC 
N1(c2ccc(N)cc2)CCN(CC1)CC

If we now stop the stdout output

This results in 

5 molecules converted

Since for this script we only need to know the number of similar molecules we need to capture the stderr output.

Script Explanation

The first part imports various libraries and I have to confess I now use a simple boilerplate so not all are actually needed. We then open a dialog to allow the user to select the appropriate fast search file and get the absolute path. We then create a column to store the results in, and then loop through the table extracting the SMILES string for each structure. The next step is key, we create the subprocess but ensure we capture both stdout and stderr, (we actually only need stderr but by capturing both it does give options for modifying the script). You can obviously change the similarity coefficient from 0.85 if you want.

The output is in the following format

We now parse the output, add it to the table and update the display.

Since we are doing a separate search for each row in the table this might be slow for large datasets.

The Vortex Script

The vortex script can be downloaded from here

Last Updated 7 May 2013

Related Posts