There was an interesting publication from the Todd group at UCL on ChemRxiv, “Idler Compounds: A Simple Protocol for Openly Sharing Fridge Contents for Cross-Screening” https://chemrxiv.org/doi/10.26434/chemrxiv-2025-nqjb4. It has now been published: https://pubs.acs.org/doi/10.1021/acs.jmedchem.5c02354.

Matt Todd is heavily involved in a number of open-source drug discovery projects, and this paper highlights the opportunity this brings for sharing molecules made for one project so that they can be screened against other, unrelated biological targets.

Since the structures are in the public domain, anyone can access them; details are on GitHub https://todd-lers.github.io/about/idler.html. However, whilst a Google sheet does provide easy access, it is not chemically intelligent. This Jupyter notebook shows how to download the data, import it into a Pandas data frame, and use RDKit to convert the SMILES strings to molecular objects, which can then be used to calculate physicochemical properties.

GetIdlerCompounds

A Jupyter Notebook to access structures and data from the Idler master worksheet¶

This notebook demonstrates how to get the structures and data from the Google worksheet, then convert the SMILES to molecule objects that allow some simple manipulations and visualisations. SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing molecules and reactions. https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

Getting the data¶

In [ ]:

#https://docs.google.com/spreadsheets/d/1heWWU_xi_NSQRvNA5_wRuw_vl9IhMzXtihmAKnpZMWw/export?format=tsv&gid=2078630269
#First get data from Google doc

!wget -O example.tsv "https://docs.google.com/spreadsheets/d/1heWWU_xi_NSQRvNA5_wRuw_vl9IhMzXtihmAKnpZMWw/export?format=tsv"
#The data is downloaded to a file called example.tsv in the same folder as the notebook, in tab separated format
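
If wget is not available (for example on Windows), the same download can be done from Python. The following is a minimal sketch using only the standard library. The helper names `sheet_export_url` and `download_sheet` are hypothetical, and the optional `gid` argument assumes the export endpoint accepts a worksheet id in the way the commented URL above suggests.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Spreadsheet id taken from the wget command above
SHEET_ID = "1heWWU_xi_NSQRvNA5_wRuw_vl9IhMzXtihmAKnpZMWw"

def sheet_export_url(sheet_id, fmt="tsv", gid=None):
    """Build the export URL for a Google sheet (hypothetical helper)."""
    params = {"format": fmt}
    if gid is not None:
        params["gid"] = str(gid)  # assumption: selects a single worksheet
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{urlencode(params)}"

def download_sheet(sheet_id, filename="example.tsv", gid=None):
    """Download the sheet to a local file and return the file name."""
    urlretrieve(sheet_export_url(sheet_id, gid=gid), filename)
    return filename

# Usage (needs network access):
# download_sheet(SHEET_ID)
```

Calling `download_sheet(SHEET_ID)` should leave an example.tsv file next to the notebook, just as the wget cell does.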

Import the required python modules and then import the example.tsv file into a Pandas dataframe called datafile

In [ ]:

from rdkit.Chem import AllChem as Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw



import pandas as pd
#datafile = pd.read_table('./export?format=tsv')
datafile = pd.read_csv('example.tsv', sep = '\t', skiprows=1, skipfooter =1, engine = "python")

In [ ]:

#Allow inline images
%matplotlib inline

In [ ]:

#View first five rows
datafile.head(5)

In [ ]:

#Find how many rows
len(datafile.index)

Convert the SMILES string to an RDKit molecular object¶

At the moment the molecule structures are represented by SMILES strings. We can convert a SMILES string to an RDKit molecular object and then display it.

In [ ]:

#Remember the first row is row zero.
smiles = datafile['SMILES'].loc[3]

In [ ]:

#convert SMILES string to a RDKit molecular object
mol = Chem.MolFromSmiles(smiles)

In [ ]:

mol
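
`Chem.MolFromSmiles` returns `None` rather than raising an error when a SMILES string cannot be parsed, so it is worth checking the result before using it. A minimal sketch follows; the helper name `parse_smiles` and the example strings are illustrative and are not taken from the worksheet.

```python
from rdkit import Chem

def parse_smiles(smiles):
    """Return an RDKit molecule, or None if the SMILES cannot be parsed."""
    # Empty spreadsheet cells are read by pandas as NaN (a float), not a string
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    return Chem.MolFromSmiles(smiles)

good = parse_smiles("c1ccccc1O")  # phenol
bad = parse_smiles("C1CC")        # ring opened but never closed

print(good.GetNumHeavyAtoms())
print(bad is None)
```

RDKit also writes a parse error message to the log for the invalid string, which helps identify the problem row.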

We can see the different datatypes in the dataframe

In [ ]:

datafile.dtypes

Adding structures to pandas dataframe¶

We can now convert the SMILES string to a RDKit molecular object for every row in the dataframe

In [ ]:

PandasTools.AddMoleculeColumnToFrame(datafile,smilesCol='SMILES')

In [ ]:

datafile.dtypes

If we view the dataframe the molecule object has been added to the last column. It would be better if the structure was more readily visible. So we change the column order.

In [ ]:

datafile.tail(3)

In [ ]:

#display the current order
cols = list(datafile.columns.values)
cols
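
The cell above only displays the column order. One way to actually move the structure column to the front is sketched below on a small stand-in frame. The helper name `move_column_first` is hypothetical, and the placeholder values are not real molecules.

```python
import pandas as pd

def move_column_first(df, name):
    """Return a new frame with column `name` moved to the front."""
    cols = [name] + [c for c in df.columns if c != name]
    return df[cols]

# Stand-in data; the real worksheet has more columns
demo = pd.DataFrame({
    "Compound Code": ["A1", "A2"],
    "SMILES": ["CCO", "c1ccccc1"],
    "ROMol": ["<mol>", "<mol>"],  # placeholder strings, not molecule objects
})
demo = move_column_first(demo, "ROMol")
print(list(demo.columns))
```

In the notebook the equivalent call would be `datafile = move_column_first(datafile, 'ROMol')`.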

In [ ]:

datafile.head(3)

If we want to view structures we can display them as a grid

In [ ]:

PandasTools.FrameToGridImage(datafile,column= 'ROMol', molsPerRow=4,subImgSize=(150,150),legendsCol="Compound Code")

Calculation of molecule properties¶

Now calculate a variety of properties using RDKit, adding them to the end of the dataframe. You can choose which properties to add here.

In [ ]:

# Some of the available descriptors are described here http://rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html
from rdkit.Chem import rdMolDescriptors

In [ ]:

hbdlist = [] #hydrogen bond donors
hbalist = [] #hydrogen bond acceptors
tpsalist = [] #Topological polar surface area
mwtlist = [] #Exact (monoisotopic) molecular weight
logPlist = [] #Crippen LogP
mrlist = [] #Crippen MR
for mol in datafile['ROMol']:
    hbd = rdMolDescriptors.CalcNumHBD(mol)
    hbdlist.append(hbd)
    hba = rdMolDescriptors.CalcNumHBA(mol)
    hbalist.append(hba)
    TPSA = rdMolDescriptors.CalcTPSA(mol)
    tpsalist.append(TPSA)
    mwt = rdMolDescriptors.CalcExactMolWt(mol)
    mwtlist.append(mwt)
    crippen = rdMolDescriptors.CalcCrippenDescriptors(mol) #returns a 2-tuple with the Wildman-Crippen logp,mr values
    logPlist.append(crippen[0])#first is logP
    mrlist.append(crippen[1])#second is mr

In [ ]:

mrlist

We now add each of the properties to the dataframe

In [ ]:

datafile['HBD']=hbdlist
datafile['HBA']=hbalist
datafile['TPSA']=tpsalist
datafile['MWt']=mwtlist
datafile['LogP']=logPlist
datafile['MR']=mrlist
datafile.head(3)
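
The descriptor loop assumes every SMILES was parsed successfully. If a row's `ROMol` is `None`, the descriptor calls will fail. Below is a minimal sketch of one way to guard against that, collecting the same descriptors into a dictionary per molecule. The helper name `describe` is hypothetical.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def describe(mol):
    """Return a dict of descriptors, or an empty dict for a missing molecule."""
    if mol is None:
        return {}  # becomes a row of NaN in the resulting frame
    logp, mr = rdMolDescriptors.CalcCrippenDescriptors(mol)
    return {
        "HBD": rdMolDescriptors.CalcNumHBD(mol),
        "HBA": rdMolDescriptors.CalcNumHBA(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "MWt": rdMolDescriptors.CalcExactMolWt(mol),
        "LogP": logp,
        "MR": mr,
    }

# Ethanol plus one unparsed entry, standing in for datafile['ROMol']
mols = [Chem.MolFromSmiles("CCO"), None]
props = pd.DataFrame([describe(m) for m in mols])
print(props)
```

Used in place of the loop and the column assignments, the resulting frame could be attached with `datafile.join(...)`, provided it is built with the same index as `datafile`.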

We can also add further molecular properties, such as the number of heavy atoms

In [ ]:

datafile['NumHeavyAtoms']=datafile.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)

In [ ]:

datafile.head(3)

Plotting properties¶

We can use seaborn (http://seaborn.pydata.org/index.html), a Python visualization library based on matplotlib, to generate a variety of plots.

In [ ]:

import seaborn as sns

In [ ]:

myTPSA = datafile['TPSA']
myMWt = datafile['MWt']

In [ ]:

#Scatter plot
sns.scatterplot(x = myMWt, y = myTPSA,)

In [ ]:

#Histogram of molecular weight
sns.histplot(myMWt, kde=False, color='red', bins=10)

A couple of points

In the first cell we use wget, a tool for downloading files using HTTP, HTTPS, FTP and FTPS. Note that it is preceded by an exclamation mark.

!wget -O example.tsv "https://docs.google.com/spreadsheets/d/1heWWU_xi_NSQRvNA5_wRuw_vl9IhMzXtihmAKnpZMWw/export?format=tsv"

The exclamation mark allows Jupyter to run shell commands within cells. The file is saved as example.tsv (tab separated format) and can then be imported into a pandas data frame. However, the first row of the file contains a description and the final row is a comment. We don’t want to import these rows, so we skip the header and footer, using Python as the parser engine (the Python parser engine is more feature complete, and skipfooter is not supported by the C engine).

datafile = pd.read_csv('example.tsv', sep = '\t', skiprows=1, skipfooter =1, engine = "python")
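
The effect of `skiprows` and `skipfooter` can be checked without downloading anything. The sketch below reads a small in-memory table laid out like the export described above (description row, header, data, trailing comment). The rows themselves are made up for illustration and are not from the worksheet.

```python
import io
import pandas as pd

# Made-up stand-in for example.tsv: description, header, two data rows, comment
raw = (
    "Description of the sheet (not data)\n"
    "Compound Code\tSMILES\n"
    "X-001\tCCO\n"
    "X-002\tc1ccccc1\n"
    "Trailing comment row\n"
)

demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    skiprows=1,       # drop the description row so the header is read correctly
    skipfooter=1,     # drop the trailing comment row
    engine="python",  # skipfooter requires the Python parser engine
)
print(demo)
```

Only the two data rows remain, with `Compound Code` and `SMILES` as the column names.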

The SMILES strings are then converted to RDKit molecular objects.

PandasTools.AddMoleculeColumnToFrame(datafile,smilesCol='SMILES')

These can then be rendered using a couple of options, and a variety of physicochemical properties are calculated and plotted using seaborn.
