A Jupyter Notebook to compare similarity between molecules

I’m sometimes asked for a tool to compare the similarity of a list of molecules with every other molecule in the list. I suspect there may be commercial tools to do this but for small numbers of compounds it is easy to visualise in a Jupyter notebook using RDKit. This is an old notebook that I’ve just updated to reflect the latest versions of RDKit and Pandas.

The RDKit has a range of built-in functions for generating molecular fingerprints and using them to calculate molecular similarity. Morgan fingerprints, better known as circular fingerprints, are built by applying the Morgan algorithm to a set of user-supplied atom invariants. The generated fingerprints are then compared using the Dice similarity metric.
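To make the Dice metric concrete, here is a minimal pure-Python sketch that treats a fingerprint as the set of its "on" bit positions. The function name `dice_similarity` and the toy bit sets are made up for illustration, and returning 0.0 for two empty fingerprints is simply a convention chosen here; in the notebook itself the calculation is done by RDKit.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|) for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical on-bit positions for two toy fingerprints
fp_a = {1, 5, 9, 12}
fp_b = {5, 9, 20}

# Two shared bits, seven bits in total: 2 * 2 / (4 + 3)
score = dice_similarity(fp_a, fp_b)
print(round(score, 3))
```

A score of 1 means the two bit sets are identical and 0 means they share no bits.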

The input file is tab-separated text; in this example I’ve taken 100 random molecules from ChEMBL.

SMILES ID

CC(NC(=O)N(C)O)c1cc2ccccc2s1 CHEMBL61706

Cc1nc(c(c(n1)-c1ccc(cc1)F)CCC1C[C@@H](O)CC(=O)O1)C CHEMBL158323
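If you are preparing your own input file, a quick check of the format can save trouble later. The sketch below reads the two example rows above from an in-memory string rather than a file, then checks that both columns are present and that no ID is repeated. The duplicate check is my own suggestion: the notebook later uses each ID as a column name, so a repeated ID would silently overwrite a column.

```python
import io
import pandas as pd

# Two rows copied from the example above; a real file would be read by path
sample = (
    "SMILES\tID\n"
    "CC(NC(=O)N(C)O)c1cc2ccccc2s1\tCHEMBL61706\n"
    "Cc1nc(c(c(n1)-c1ccc(cc1)F)CCC1C[C@@H](O)CC(=O)O1)C\tCHEMBL158323\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t")

# Basic sanity checks before doing any chemistry
missing = [c for c in ("SMILES", "ID") if c not in df.columns]
duplicate_ids = df["ID"][df["ID"].duplicated()].tolist()
print(df.shape, missing, duplicate_ids)
```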

The first few cells of the notebook simply import the required libraries and the input file into a pandas dataframe.

from rdkit.Chem import AllChem as Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from rdkit import DataStructs

import numpy
import seaborn as sns
import matplotlib

import pandas as pd
#Allow inline images
%matplotlib inline

Next we read in the file and display the first five rows.

#If you want to read a local file then simply edit this filepath
#datafile = pd.read_csv('myfile.tsv', sep = '\t')

#The file format is tab separated text
#SMILES	ID	Name

#This example uses 100 random structures from ChEMBL

datafile = pd.read_csv('fortest.tsv', sep = '\t')

#View first five rows
datafile.head(5)

Convert the SMILES string to an RDKit molecular object

We can see the different datatypes in the dataframe

datafile.dtypes

SMILES    object
ID        object
dtype: object

At the moment the molecule structures are represented by SMILES strings. We can convert each SMILES string to an RDKit molecule object and then display the result.

PandasTools.AddMoleculeColumnToFrame(datafile,'SMILES','Mol',includeFingerprints=True)
print([str(x) for x in datafile.columns])

datafile.head(3)

If we want to view all the structures we can display them as a grid.

PandasTools.FrameToGridImage(datafile,column= 'Mol', molsPerRow=5,subImgSize=(150,150),legendsCol="ID")

Calculation of molecular similarities

Now calculate fingerprints using RDKit, adding them to the end of the dataframe.

fplist = [] #fplist
for mol in datafile['Mol']:
    fp = Chem.GetMorganFingerprintAsBitVect( mol,2 )
    fplist.append(fp)
    
datafile['mfp2']=fplist
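Recent RDKit releases also provide a fingerprint generator interface, and the older function used above may print deprecation warnings. The sketch below is an alternative under that assumption, not a replacement for the notebook cell. It uses the same radius of 2 and states the 2048-bit length explicitly. The second, deliberately invalid SMILES is made up to show that `MolFromSmiles` returns `None` on a parse failure, which is worth guarding against with real data.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# radius=2 matches the notebook; fpSize=2048 is written out for clarity
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

smiles = ["CC(NC(=O)N(C)O)c1cc2ccccc2s1", "not_a_smiles"]
fingerprints = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)  # None when the SMILES cannot be parsed
    fingerprints.append(generator.GetFingerprint(mol) if mol is not None else None)

print([fp.GetNumOnBits() if fp is not None else None for fp in fingerprints])
```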

Comparing the similarity of two molecules

fp1=datafile.at[0,'mfp2']
fp2=datafile.at[1,'mfp2']

from rdkit import DataStructs
DataStructs.DiceSimilarity(fp1,fp2)

0.1686746987951807

We can now do this for every pair of molecules in the data set, adding one column of similarity scores per molecule.

for r in datafile.index:
    fp1 = datafile.at[r,'mfp2']
    colname = datafile.at[r,'ID'] #each molecule's ID becomes a column name
    simlist = [] #similarity of every molecule to molecule r
    for fp in datafile['mfp2']: #reuse the fingerprints calculated earlier
        sim = DataStructs.DiceSimilarity(fp1,fp)
        sim = round(sim,2) #only need value to 2 decimal places
        simlist.append(sim)
    datafile[colname]=simlist
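For larger lists it may be tidier to build the similarity matrix as a separate dataframe. The sketch below is one way to do that, assuming RDKit's bulk similarity function and the fingerprint generator interface are available in your install. It uses the two example ChEMBL molecules plus benzene, which I added purely for illustration. The names `data`, `gen` and `sim_matrix` are my own.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

data = pd.DataFrame({
    "SMILES": ["CC(NC(=O)N(C)O)c1cc2ccccc2s1",
               "Cc1nc(c(c(n1)-c1ccc(cc1)F)CCC1C[C@@H](O)CC(=O)O1)C",
               "c1ccccc1"],  # benzene, illustrative only
    "ID": ["CHEMBL61706", "CHEMBL158323", "benzene"],
})

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in data["SMILES"]]

# One bulk call per molecule returns a full row of Dice scores
rows = [DataStructs.BulkDiceSimilarity(fp, fps) for fp in fps]
sim_matrix = pd.DataFrame(rows, index=data["ID"], columns=data["ID"]).round(2)
print(sim_matrix)
```

The result is a square table with IDs on both axes and 1.0 on the diagonal, which keeps the scores separate from the structure and SMILES columns.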

We now have all the similarity measures in the dataframe so we can remove any columns not needed.

#difficult to view dataframe so remove fingerprint column and others
newdatafile = datafile.drop(['mfp2','SMILES'], axis=1)

newdatafile

Contextual colouring of dataframe

We can also use contextual colouring on the dataframe. In this instance we are going to highlight similarity scores, but it could be used to highlight affinity, IC50 or a calculated property like LogP.

You can create “heatmaps” with the background_gradient method. These require matplotlib, and here we use Seaborn to get a nice colormap.

import seaborn as sns

cm = sns.light_palette("red", as_cmap=True)
s = newdatafile.style.background_gradient(cmap=cm).format(precision=2)
s

You can download the notebook and the data file here.

MolsimNotebook Download
