A Jupyter Notebook to compare similarity between molecules

I’m sometimes asked for a tool to compare the similarity of a list of molecules with every other molecule in the list. I suspect there may be commercial tools to do this but for small numbers of compounds it is easy to visualise in a Jupyter notebook using RDKit. This is an old notebook that I’ve just updated to reflect the latest versions of RDKit and Pandas.

The RDKit has a variety of built-in functionality for generating molecular fingerprints and using them to calculate molecular similarity. Morgan fingerprints, better known as circular fingerprints, are built by applying the Morgan algorithm to a set of user-supplied atom invariants. The generated fingerprints are then compared using the Dice similarity metric.
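To make the Dice metric concrete, here is a minimal pure-Python sketch that treats each fingerprint as the set of its "on" bit positions. The function name `dice_similarity` and the example bit sets are illustrative only; RDKit has its own implementation, and returning 0.0 for two empty fingerprints is simply the convention chosen here.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Illustrative bit positions, not taken from real molecules
fp_a = {1, 5, 9, 12}
fp_b = {5, 9, 20}

# Two shared bits, seven bits in total: 2 * 2 / (4 + 3)
score = dice_similarity(fp_a, fp_b)
print(round(score, 3))
```

A score of 1.0 means the two bit sets are identical and 0.0 means they share no bits.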

The input file is tab-separated text; in this example I’ve taken 100 random molecules from ChEMBL.

The first few cells of the notebook simply import the required libraries and load the input file into a pandas dataframe.

Next we read in the file and display the first five rows.
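As a rough sketch of this step, the snippet below reads tab-separated text with pandas. To keep it self-contained it reads from an in-memory string rather than the real data file, and the column names `ID` and `SMILES` are assumptions about the file layout, not taken from the notebook.

```python
import io

import pandas as pd

# Stand-in for the tab-separated input file (three well-known molecules);
# with a real file you would pass its path to read_csv instead.
tsv_text = (
    "ID\tSMILES\n"
    "mol1\tCCO\n"
    "mol2\tc1ccccc1\n"
    "mol3\tCC(=O)Oc1ccccc1C(=O)O\n"
)

df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# head() returns the first five rows by default
print(df.head())
```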

Convert the SMILES string to an RDKit molecular object

We can see the different datatypes in the dataframe.

At the moment the molecule structures are represented by SMILES strings. We can convert each SMILES string to an RDKit molecule object and then display the structures.
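One way to do the conversion is with `Chem.MolFromSmiles` applied to the SMILES column, as in the sketch below. The column names `SMILES` and `ROMol` and the example molecules are assumptions for illustration; the notebook may well use the helpers in `rdkit.Chem.PandasTools` instead.

```python
import pandas as pd
from rdkit import Chem

# Hypothetical dataframe standing in for the one read from the input file
df = pd.DataFrame(
    {
        "ID": ["mol1", "mol2", "mol3"],
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"],
    }
)

# MolFromSmiles returns None for SMILES it cannot parse, so it is
# worth checking for failures before going any further.
df["ROMol"] = df["SMILES"].apply(Chem.MolFromSmiles)
n_failed = int(df["ROMol"].isna().sum())
print(f"{n_failed} SMILES could not be parsed")
```

In a notebook, importing `rdkit.Chem.Draw.IPythonConsole` makes molecule objects render as structure drawings.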

If we want to view all the structures we can display them as a grid.
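A grid can be drawn with `Draw.MolsToGridImage`, as in this minimal sketch. The molecules and the grid settings are illustrative, and `useSVG=True` is chosen here only so that the result is plain SVG text when run outside a notebook; the notebook itself may use a different drawing helper.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Illustrative molecules; in the notebook these would come from the dataframe
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# One legend per molecule; here the SMILES string doubles as the label
grid = Draw.MolsToGridImage(
    mols,
    molsPerRow=3,
    subImgSize=(200, 200),
    legends=smiles,
    useSVG=True,
)
```

In a Jupyter notebook, leaving `grid` as the last expression in a cell displays the image.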

Calculation of molecular similarities

Now calculate fingerprints using RDKit, adding them to the end of the dataframe.
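A minimal sketch of the fingerprint step, assuming the molecule column is called `ROMol` and the new column `FP` (both names are illustrative). It uses the `rdFingerprintGenerator` API found in recent RDKit releases; the radius of 2 and the 2048-bit length are common choices rather than values taken from the notebook.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Hypothetical dataframe with a column of RDKit molecule objects
df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]})
df["ROMol"] = df["SMILES"].apply(Chem.MolFromSmiles)

# Morgan (circular) fingerprint generator; radius and size are assumptions
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

# Store one bit-vector fingerprint per molecule in a new column
df["FP"] = df["ROMol"].apply(gen.GetFingerprint)
print(df["FP"].iloc[0].GetNumOnBits())
```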

Comparing the similarity of two molecules

0.1686746987951807
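The value above is the Dice similarity for a pair of molecules from the dataset. A comparable calculation can be sketched as follows; the two molecules (ethanol and phenol) and the fingerprint settings are illustrative assumptions, so the printed score will differ from the one shown.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Fingerprint settings are assumptions, not taken from the notebook
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

fp_a = gen.GetFingerprint(Chem.MolFromSmiles("CCO"))        # ethanol
fp_b = gen.GetFingerprint(Chem.MolFromSmiles("c1ccccc1O"))  # phenol

# Dice similarity lies between 0 (no shared bits) and 1 (identical)
score = DataStructs.DiceSimilarity(fp_a, fp_b)
self_score = DataStructs.DiceSimilarity(fp_a, fp_a)
print(score, self_score)
```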

We can now do this for all the molecules in the data set.
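One way to build the all-against-all comparison is `DataStructs.BulkDiceSimilarity`, which scores one fingerprint against a list of fingerprints. The sketch below collects the results into a square dataframe; the molecule IDs, SMILES and fingerprint settings are hypothetical stand-ins for the real data.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Hypothetical molecules keyed by ID
smiles = {"mol1": "CCO", "mol2": "CCN", "mol3": "c1ccccc1O"}
names = list(smiles)

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles.values()]

# Each row holds the similarity of one molecule to every molecule in the list
rows = [DataStructs.BulkDiceSimilarity(fp, fps) for fp in fps]
sim = pd.DataFrame(rows, index=names, columns=names)
print(sim.round(2))
```

The matrix is symmetric with 1.0 on the diagonal, so for larger sets only the upper triangle needs to be calculated.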

We now have all the similarity measures in the dataframe so we can remove any columns not needed.
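Dropping the working columns is a one-liner with `DataFrame.drop`. In this sketch the column names and the 0.5 scores are placeholders, not values from the notebook.

```python
import pandas as pd

# Placeholder dataframe: an ID, two working columns and two score columns
df = pd.DataFrame(
    {
        "ID": ["mol1", "mol2"],
        "SMILES": ["CCO", "CCN"],
        "FP": [None, None],
        "mol1": [1.0, 0.5],
        "mol2": [0.5, 1.0],
    }
)

# Keep only the identifier and the similarity scores
trimmed = df.drop(columns=["SMILES", "FP"])
print(trimmed)
```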

Contextual colouring of dataframe

We can also use contextual colouring on the dataframe. In this instance we are going to highlight similarity scores, but it could equally be used to highlight affinity, IC50 or a calculated property like LogP.

You can create “heatmaps” with the background_gradient method. These require matplotlib, and here we use Seaborn to get a nice colormap.

You can download the notebook and the data file here.
