Using the Python 3 library FPSim2 for similarity searches

FPSim2 is a new tool for fast similarity searching of large compound datasets (>100 million compounds) being developed at ChEMBL. It is written as a Python 3 library and supports both in-memory and out-of-core fast similarity searches at these dataset sizes.

Installation

Source code is available on GitHub https://github.com/chembl/FPSim2 and Conda packages are also available for both macOS and Linux; it requires Python 3.6 and a recent version of RDKit. I've written a page of instructions for installing cheminformatics tools on a Mac, but if you only want to install RDKit then the best way is to use conda.

To install RDKit
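The command will be something along these lines (the recommended conda channel has changed over the years, so check the current RDKit installation instructions):

    conda install -c conda-forge rdkit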

Then to install FPSim2
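Again, roughly as follows (the channel is an assumption on my part; FPSim2 is also published on PyPI, so pip install FPSim2 is an alternative):

    conda install -c conda-forge fpsim2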

There is a demo file available for download, but I decided to create my own; the full set of ~1.8M ChEMBL structures in SDF format is available for download from https://www.ebi.ac.uk/chembl/downloads.

I converted a copy of ChEMBL in SDF format to SMILES using OpenBabel.
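Roughly like this, assuming the downloaded SDF has been unzipped to chembl.sdf (the actual file name will include the release number):

    obabel chembl.sdf -O ChEMBL_raw.smi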

I then edited the ChEMBL IDs to remove the text prefix, because FPSim2 only supports integer IDs for molecules, converting CHEMBL1234567 to 1234567.
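A one-liner along these lines does the job (the file names are my own; the output is the ChEMBL.smi file used below):

    sed 's/CHEMBL//' ChEMBL_raw.smi > ChEMBL.smi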

The file now looks like this

C(=O)(c1ccc(cc1)I)NO 155287
C1C2CC3(CC1CC(C2)C3)C(=O)Oc1c(cc(cc1)CC=C)OC 265174
N1(C(=O)c2c(C1=O)cccc2)OCCCN1CCN(c2cc(ccc2)C)CC1 264472
c1(=N)n(c2c(s1)cccc2)CCN1CCC(c2ccc(cc2)F)CC1 405225
c1c(ccc(c1)C(=O)N[C@H](C(=O)O)[C@H](CC)C)CNC(=O)CCCCCCCCCCCCC 409812
c1c(ccc(c1)[C@H](C(=O)Nc1sc(cn1)F)CC1CCOCC1)S(=O)(=O)C1CC1 499520
c1cc(ccc1Cl)C/C(=N/c1ccc(cc1)O)/c1c(c(c(cc1)O)O)O 1082532

The result is stored in a file named ChEMBL.smi containing the 1.8M structures.

I then created the Morgan fingerprint file ChEMBL.h5 using the instructions on the GitHub site.

You can do this from the command line, but I created a Jupyter notebook to save me looking up the commands in the future. Creating the fingerprint file took a little while, but you only have to do it once. I've also included the commands for creating the file from either an SDF file or a Python list.
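For reference, with a recent FPSim2 release the fingerprint file can be created from the SMILES file roughly like this (a sketch only: create_db_file lives in FPSim2.io, but the exact signature and fingerprint parameter names have varied between versions, so check the README for the version you have installed):

    from FPSim2.io import create_db_file

    # build Morgan (ECFP4-like, radius 2, 2048 bits) fingerprints for every
    # molecule in ChEMBL.smi and write them to ChEMBL.h5
    create_db_file('ChEMBL.smi', 'ChEMBL.h5', 'Morgan', {'radius': 2, 'nBits': 2048})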

You can download the Jupyter notebook here

Several different fingerprint types can be calculated using RDKit (https://www.rdkit.org/UGM/2012/LandrumRDKitUGM.Fingerprints.Final.pptx.pdf):

  • MACCSKeys
  • Avalon
  • Morgan Extended-connectivity fingerprints (ECFPs)
  • TopologicalTorsion
  • AtomPair
  • RDKit
  • RDKPatternFingerprint

Extended-connectivity fingerprints (ECFPs) are circular topological fingerprints designed for molecular characterisation, similarity searching, and structure-activity modelling (DOI), and are a pretty good default option. The fingerprints are stored in HDF5 (.h5) format, a compressed file format with optimised read speed. The popular Python data manipulation package pandas can import from and export to HDF5 via PyTables.

Running the search

Whilst the search can be completed in a few lines of Python, it returns only the mol_id and similarity coefficient, and I'd prefer to be able to actually see the similar structures. So this notebook does the search, then searches the original file of ChEMBL SMILES to get the SMILES strings for the similar structures, and finally uses RDKit to render the structures.

We use the fingerprint file created above and input the query as a SMILES string.

The search itself is shown below
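A minimal sketch of the in-memory search with a recent FPSim2 release (the function names have changed since early versions, so this may not match the notebook exactly; the aspirin query and the df_hits name are just examples):

    from FPSim2 import FPSim2Engine
    import pandas as pd

    # load the fingerprint file created above into memory
    fpe = FPSim2Engine('ChEMBL.h5')

    query = 'CC(=O)Oc1ccccc1C(=O)O'   # example query (aspirin) as a SMILES string

    # Tanimoto similarity search, threshold 0.7, using 5 cores
    results = fpe.similarity(query, 0.7, n_workers=5)

    # results is a structured array of (mol_id, coeff); load it into a DataFrame
    df_hits = pd.DataFrame(results)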

The similarity threshold is set to 0.7 and we use 5 cores (you can modify this); on my MacPro the search completed almost instantaneously. The results are then imported into a pandas DataFrame. It is possible to use either SMILES, InChI or molfiles as the query.

The next step is to get the SMILES strings corresponding to the hit mol_ids.

We first select the mol_ids from the data frame and then use these mol_ids to search the original SMILES file, as shown below. For each matching mol_id the corresponding SMILES and mol_id are copied to theResults. These are then imported into a data frame.
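A sketch of that lookup, assuming the hits are in the df_hits DataFrame from above (theResults matches the list name mentioned in the text; the other names are mine):

    # collect the hit ids, then scan the SMILES file for matching molecules
    hit_ids = set(df_hits['mol_id'])

    theResults = []
    with open('ChEMBL.smi') as f:
        for line in f:
            smiles, mol_id = line.split()
            if int(mol_id) in hit_ids:
                theResults.append((smiles, int(mol_id)))

    df_smiles = pd.DataFrame(theResults, columns=['SMILES', 'mol_id'])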

Next, the two data frames are merged using the mol_id as the key; finally a new column is added using the RDKit PandasTools.AddMoleculeColumnToFrame function to render the structures in the data frame.
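Roughly as follows (frame and column names are the ones used in the sketches above, not necessarily those in the notebook):

    from rdkit.Chem import PandasTools

    # merge the similarity scores with the SMILES on mol_id, then add a
    # rendered molecule column so the structures display in the DataFrame
    df_final = df_hits.merge(df_smiles, on='mol_id')
    PandasTools.AddMoleculeColumnToFrame(df_final, smilesCol='SMILES', molCol='Structure')
    df_final.head()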

This all works fine; the slow part is looking up the mol_ids to get the SMILES, which takes a few seconds for a 1.8M structure file. For larger files that might be an issue, but I'm sure it could be speeded up. I'd be interested to hear of any suggestions.

You can download the Jupyter notebook here InMemorySearch.ipynb.zip

Last Updated 1 February 2019
