t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map.

Whilst there are a number of options for clustering molecules, some of them well suited to very large datasets, the results are sometimes not ideal for interactive exploration. PCA is one alternative, but I like t-SNE because the local structure of the dataset is maintained, so similar molecules end up close together on the map.

This Jupyter notebook uses t-SNE from the scikit-learn package and Morgan fingerprints generated using RDKit. The NK1 IC50 antagonist data is taken from ChEMBL, target CHEMBL249 (https://www.ebi.ac.uk/chembl/explore/target/CHEMBL249). Data flagged by the ChEMBL team as unreliable was removed. The hNK1 IC50 is in the “Standard Value” field.
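As a rough illustration of the clean-up step, here is a minimal pandas sketch. It assumes the ChEMBL export carries a "Data Validity Comment" column for flagged records and a "Smiles" column for structures; those column names, the placeholder IDs and the values are illustrative assumptions and are not taken from the notebook. Only the "Standard Value" field is named in the text above.

```python
import pandas as pd

# Tiny stand-in for a ChEMBL activity export. Column names other than
# "Standard Value" are assumptions about the export format, and the rows
# are made-up placeholders, not real NK1 data.
activities = pd.DataFrame(
    {
        "Molecule ChEMBL ID": ["MOL_A", "MOL_B", "MOL_C", "MOL_D"],
        "Smiles": ["CCO", "c1ccccc1", "CCN", None],
        "Standard Type": ["IC50", "IC50", "IC50", "IC50"],
        "Standard Value": [0.5, 12.0, 300.0, 4.0],  # nM
        "Data Validity Comment": [None, "Outside typical range", None, None],
    }
)

# Keep only rows that carry no validity flag ...
clean = activities[activities["Data Validity Comment"].isna()]
# ... and that have both a structure and a measured value.
clean = clean.dropna(subset=["Smiles", "Standard Value"]).reset_index(drop=True)

print(clean[["Molecule ChEMBL ID", "Standard Value"]])
```

With a real export you would read the downloaded file with `pd.read_csv` instead of building the frame by hand, and check the delimiter and column names against the file you actually downloaded.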

The generation of the different fingerprints has now been updated to use a consistent API, described at https://greglandrum.github.io/rdkit-blog/posts/2023-01-18-fingerprint-generator-tutorial.html

The idea of the new code is that all supported fingerprinting algorithms can be used the same way: you create a generator for that fingerprint algorithm with the appropriate parameters set and then ask the generator to give you the fingerprint type you want for each molecule.

The old way to generate the Morgan fingerprints was:

ECFP_fps = [AllChem.GetMorganFingerprintAsBitVect(x,radius=radius, nBits=nBits) for x in AllStruturesDF['ROMol']]

Now we create a generator and then use it to yield the fingerprint.

mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius,fpSize=nBits)
ECFP_fps = [mfpgen.GetFingerprint(x) for x in AllStruturesDF['ROMol']]
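To make the "same way for every algorithm" point concrete, here is a small sketch for a single molecule. The values of radius and nBits, the example SMILES, and the choice of the RDKit path-based fingerprint as the second algorithm are my own assumptions for illustration; the notebook sets its own parameters.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

radius, nBits = 2, 2048  # assumed values, not taken from the notebook
mol = Chem.MolFromSmiles("NCCc1ccccc1")  # arbitrary example molecule

# One generator per algorithm, configured once.
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=nBits)

# Ask the generator for whichever representation you need.
bit_fp = mfpgen.GetFingerprint(mol)                 # bit vector
count_fp = mfpgen.GetCountFingerprint(mol)          # folded count vector
bit_arr = mfpgen.GetFingerprintAsNumPy(mol)         # numpy array of 0/1
count_arr = mfpgen.GetCountFingerprintAsNumPy(mol)  # numpy array of counts

# A different algorithm follows exactly the same call pattern.
rdkgen = rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=7, fpSize=nBits)
rdk_fp = rdkgen.GetFingerprint(mol)

print(bit_fp.GetNumBits(), bit_arr.shape, int(count_arr.sum()))
```

The numpy variants are handy here because scikit-learn expects an array, so there is no need for a separate conversion step.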

The full notebook is shown below.

UsingTSNE

The whole process is very quick, taking less than a minute on my desktop machine. The resulting dataframe is then exported as a compressed SDF file (Alldata.sdf.gz). This can be read into Vortex for display as shown below. You could also use DataWarrior.
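For readers who want the overall shape of the workflow without opening the notebook, here is a minimal sketch under stated assumptions. The twelve SMILES are arbitrary stand-ins for the ChEMBL set, the t-SNE settings (default Euclidean metric, perplexity of 3 because the toy set is tiny) and the TSNE_x / TSNE_y column names are my own choices, and the file is written to the temp directory. It is not a copy of the notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from rdkit.Chem import PandasTools, rdFingerprintGenerator
from sklearn.manifold import TSNE

# Arbitrary toy molecules standing in for the NK1 dataset.
smiles = ["CCO", "CCCO", "CCCCO", "CC(C)O", "c1ccccc1", "Cc1ccccc1",
          "CCc1ccccc1", "Oc1ccccc1", "CCN", "CCCN", "CC(=O)O", "CCC(=O)O"]
AllStruturesDF = pd.DataFrame({"Smiles": smiles})
PandasTools.AddMoleculeColumnToFrame(AllStruturesDF, smilesCol="Smiles", molCol="ROMol")

# Morgan fingerprints as one numpy matrix, one row per molecule.
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
X = np.array([mfpgen.GetFingerprintAsNumPy(m) for m in AllStruturesDF["ROMol"]])

# Perplexity must be smaller than the number of samples; 3 suits this toy set,
# a real dataset would normally use a larger value.
tsne = TSNE(n_components=2, perplexity=3, random_state=0)
coords = tsne.fit_transform(X)

# Store the map coordinates next to the structures (column names are assumed).
AllStruturesDF["TSNE_x"] = coords[:, 0]
AllStruturesDF["TSNE_y"] = coords[:, 1]

# WriteSDF gzips the output when the file name ends in .gz.
out_path = os.path.join(tempfile.gettempdir(), "Alldata.sdf.gz")
PandasTools.WriteSDF(AllStruturesDF, out_path, molColName="ROMol",
                     properties=["Smiles", "TSNE_x", "TSNE_y"])
print(coords.shape, out_path)
```

The two coordinate columns travel with each structure as SD properties, so they can be used as the x and y axes of a scatter plot in whichever viewer reads the file.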

I’ve used a custom colour coding for points based on the hNK1 IC50 data for display.

If we now select one of the clusters, the corresponding molecules are highlighted in the table.

We can use the slider on the right (highlighted in red box below) to display only molecules with hNK1 IC50 < 1 nM and then select the different classes of potent NK1 antagonists.

You can also add a variety of calculated physicochemical properties to the table and use them to filter the selection.
