Options for Clustering Large Datasets of Molecules

Clustering is an invaluable cheminformatics technique for subdividing a typically large compound collection into smaller groups of similar compounds. One of the advantages is that, once clustered, you can store the cluster identifiers and refer to them later; this is particularly valuable when dealing with very large datasets. Clustering is often used in the analysis of high-throughput screening results, or of virtual screening and docking studies.

In theory any molecular descriptor can be used in clustering; in practice the use of a large number of descriptors, in particular some of the more esoteric ones, can make it difficult for a medicinal chemist to understand why compounds have ended up in the same cluster. It is perhaps better to stick to molecular fingerprints based on structural fragments, which usually results in clusters that chemists can see are similar. For this work I’ve tried to stick with Morgan/circular type descriptors; however, since each package has its own implementation, they are probably not exactly the same. A number of clustering techniques are available, but one of the most common is nearest-neighbour clustering: a single compound is assigned to each of the desired number of clusters, ideally the most diverse selection, then each remaining compound is assigned to the nearest cluster, with the centroid of each cluster recalculated as each new compound is added.

I’ve looked at a number of options for clustering molecules, from toolkits like RDKit to commercial applications such as Vortex, and I’ve tried them with datasets of a variety of sizes: one containing 789 molecules, one 150,000 molecules and a large set containing 4,400,000 molecules. The clustering tools often have many different options, but I’ve tried to stick to the default values. All work was run on a 12-core MacPro desktop with 64GB of RAM and 1TB of internal SSD storage. I’m sure there are other options, but I’ve limited the work to tools with built-in support for chemistry. I’ve also limited the investigation to applications that run locally, since I suspect most chemists won’t want to run calculations on an external server.


RDKit is an open-source toolkit for cheminformatics; its core data structures and algorithms are written in C++ with a user-friendly Python wrapper, and it integrates nicely with Jupyter Notebooks. The RDKit Cookbook contains tips for using the Butina clustering algorithm (D. Butina, ‘Unsupervised Database Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way to Cluster Small and Large Data Sets’, JCICS, 39, 747–750 (1999)) and I’ve implemented it in a Jupyter Notebook.

Now read in the molecules
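A minimal sketch of the reading step (the filename is an assumption; substitute your own SD file):

```python
from rdkit import Chem

# Filename is illustrative -- point this at your own SD file
sdf_path = "ApprovedDrugs.sdf"

try:
    # SDMolSupplier parses the file record by record; None entries are unparseable
    mols = [m for m in Chem.SDMolSupplier(sdf_path) if m is not None]
except OSError:
    # Fall back to a few inline SMILES so the snippet runs stand-alone
    mols = [Chem.MolFromSmiles(s) for s in ("CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O")]

print(f"Read {len(mols)} molecules")
```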

This dataset contains <1000 molecules; now generate fingerprints
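The fingerprinting step might look like this, using Morgan/circular fingerprints of radius 2 (the 2048-bit length is an assumption):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# A few inline molecules stand in for the dataset read in above
mols = [Chem.MolFromSmiles(s) for s in ("CCO", "CCN", "c1ccccc1O")]

# Morgan/circular fingerprints, radius 2 (ECFP4-like), folded to 2048 bits
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
```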

Then cluster
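The clustering step follows the pattern in the RDKit Cookbook: build a condensed distance matrix of (1 − Tanimoto similarity) values and hand it to Butina.ClusterData. The 0.6 cutoff here is just an example value:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def cluster_fps(fps, cutoff=0.4):
    # Condensed lower-triangle distance matrix: 1 - Tanimoto similarity
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    # Each returned cluster is a tuple of molecule indices; the first is the centroid
    return Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

mols = [Chem.MolFromSmiles(s) for s in ("CCO", "CCCO", "c1ccccc1", "Cc1ccccc1")]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
clusters = cluster_fps(fps, cutoff=0.6)
```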

For a modest dataset of <1000 molecules the clustering is almost instantaneous, and we can then look at a specific cluster


Using the larger dataset of 150,000 molecules, reading in the structures took 90 seconds and 3.8GB of RAM; adding the fingerprints took another 20 seconds and took RAM usage up to 3.83GB. Clustering used only one core, and after 23 mins and consuming 80GB of RAM the kernel died.

After chatting to Greg Landrum, he suggested that with a large set of molecules you wouldn’t ever want to have all the molecules in memory, so it might be better to read the molecules and generate the fingerprints in a single step like this
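A sketch of that pattern — stream the SD file and keep only the fingerprints, never the molecules (the helper name and parameters are my own):

```python
import io
from rdkit import Chem
from rdkit.Chem import AllChem

def fingerprints_from_sdf(fileobj, radius=2, n_bits=2048):
    """Stream an SD file, retaining only the fingerprints."""
    fps = []
    for mol in Chem.ForwardSDMolSupplier(fileobj):
        if mol is None:
            continue  # skip unparseable records
        fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))
        # mol is discarded here, so memory use stays roughly constant
    return fps

# Tiny in-memory SD file to show the call; normally you would write:
#   with open("big_file.sdf", "rb") as fh:
#       fps = fingerprints_from_sdf(fh)
demo = (Chem.MolToMolBlock(Chem.MolFromSmiles("CCO")) + "$$$$\n").encode()
fps = fingerprints_from_sdf(io.BytesIO(demo))
```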

Also for the clustering use
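That is, pass reordering=True to Butina.ClusterData; a toy distance matrix illustrates the call:

```python
from rdkit.ML.Cluster import Butina

# Toy condensed distance data for 4 points, in lower-triangle order:
# d(1,0), d(2,0), d(2,1), d(3,0), d(3,1), d(3,2)
dists = [0.1, 0.9, 0.8, 0.95, 0.85, 0.2]

# reordering=True recounts unassigned neighbours after each cluster is formed
clusters = Butina.ClusterData(dists, 4, 0.4, isDistData=True, reordering=True)
```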

The reordering flag is documented as follows “if this toggle is set, the number of neighbors is updated for the unassigned molecules after a new cluster is created such that always the molecule with the largest number of unassigned neighbors is selected as the next cluster center”.

Whilst this lowered the initial memory requirements, unfortunately once the clustering started the memory usage steadily increased and the kernel died.


Chemfp is a set of command-line tools and a Python library for fingerprint generation and high-performance similarity searches. Whilst the current version is commercial, older versions are available for free and can be installed using pip. However, it seems you have to specify the version to install explicitly
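For example (1.3 was a free release around the time of writing; the exact version pin is an assumption):

```shell
# Pin the free chemfp release explicitly
pip install chemfp==1.3
```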

This installs a variety of command line tools.

The command-line tools are written in Python, with a C extension for better performance. The Python library has a large and well-documented public API which you can use from your own programs. The core functionality works with fingerprints as byte or hex strings, reads and writes fingerprint files, and performs similarity searches. Chemfp does not understand chemistry. Instead, it knows how to use RDKit, OEChem/OEGraphSim, and Open Babel to handle molecule I/O and to compute fingerprints for a molecule, and makes all of that work through a portable cross-toolkit API.

The help is incredibly well documented, for example to generate the fingerprint file we need to use rdkit2fps, typing
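```shell
rdkit2fps --help
```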

Generating the fingerprint file
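A plausible invocation (flags follow the chemfp documentation; filenames are illustrative):

```shell
# Morgan/circular fingerprints, folded to 2048 bits
rdkit2fps --morgan --fpSize 2048 ApprovedDrugs.sdf -o ApprovedDrugs.fps
```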

Andrew Dalke has provided a Python script to help with the clustering and give some information on timings etc.; it is available from here

Running on the ApprovedDrugs 789 dataset with a similarity threshold of 0.6 was effectively instantaneous.

The resulting cluster file can be opened in a text editor and has this format

With some Python scripting it should be straightforward to edit the script to output the cluster numbers and combine them with the original sdf file that was used to generate the fingerprint file.
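For example, a sketch of such a merge using RDKit (the tag_clusters helper and the “Cluster” tag name are my own; it assumes you have already parsed the cluster output into an index → cluster-number mapping):

```python
from rdkit import Chem

def tag_clusters(sdf_in, sdf_out, cluster_ids):
    """Copy an SD file, adding a 'Cluster' tag to each record.

    cluster_ids maps record index -> cluster number.
    """
    writer = Chem.SDWriter(sdf_out)
    for i, mol in enumerate(Chem.SDMolSupplier(sdf_in)):
        if mol is None:
            continue  # keep indices aligned with the fingerprint file conventions you used
        mol.SetProp("Cluster", str(cluster_ids.get(i, -1)))
        writer.write(mol)
    writer.close()
```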

Running the same process with a larger dataset took an unexpectedly long time, and I noticed that it was only using one core. Chatting with Andrew Dalke, it became apparent that Clang, the “LLVM native” C/C++/Objective-C compiler used on Mac OS X, doesn’t include OpenMP support, and Chemfp uses OpenMP for parallelism.

Compiling Chemfp for OpenMP

To compile Chemfp for OpenMP you’ll need to install gcc, then use the CC environment variable to configure the build step to use that compiler. I’ve written up instructions for installing gcc (plus other things) using Homebrew here.

I had to first remove Chemfp
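Something along these lines (the macOS pip cache path is an assumption about your setup):

```shell
pip uninstall chemfp
# pip caches built wheels; remove them so the next install recompiles from source
rm -rf ~/Library/Caches/pip/wheels
```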

However, you’ll also have to remove your local pip wheel file. A “wheel” file is the new(ish) Python distribution format; it can include pre-compiled code, including byte code. “pip install” internally makes a wheel file, stores it in a cache, then installs from the wheel file. “pip uninstall” will remove the installation but not the wheel file.

Then reinstall Chemfp
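For instance (the gcc-7 name and version pin are assumptions; use whichever gcc Homebrew installed):

```shell
# Point the build at Homebrew's gcc and skip any cached wheel
CC=gcc-7 pip install --no-cache-dir chemfp==1.3
```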

I then ran the same exercise with the 150,000 structure file, this time running at similarity thresholds of 0.8 and 0.6 to see how they might affect performance. Generating the fingerprint file took very little time, and the timings for the clustering are shown below.

This took just under a minute for the 0.8 threshold and made use of all available cores.

This took just over a minute at 0.6, so a little slower. As before, the cluster output was a text file with the same format.

I had some problems generating the fingerprint file for the 4.4 million structures; this is often a problem when a file is derived from multiple sources and quality control is not consistent.

I ended up tidying up the file using this script.
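The original script isn’t shown, but a clean-up along these lines — keep only the SMILES RDKit can parse — would do the job (the function name and file handling are my own):

```python
from rdkit import Chem

def tidy_smiles(path_in, path_out):
    """Filter a SMILES file, keeping only the records RDKit can parse."""
    kept = dropped = 0
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            mol = Chem.MolFromSmiles(parts[0])
            if mol is None:
                dropped += 1  # unparseable record
                continue
            name = parts[1] if len(parts) > 1 else ""
            fout.write((Chem.MolToSmiles(mol) + " " + name).strip() + "\n")
            kept += 1
    return kept, dropped
```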

This gave me a dataset in SMILES format of just over 4.3 million structures. I then regenerated the sdf file and used that to generate the fingerprints without incident. (Apparently you can add “--errors ignore” to the rdkit2fps command; I’ve not tested it.)

Generating the fingerprint file for the 4.3 million structures took 35 mins; generating the clusters using a 0.8 similarity threshold took 10 hours. The Python script also outputs a useful summary.


DataWarrior is an open-source data analysis package I’ve reviewed previously. The cluster algorithm implemented in DataWarrior is simple and reproducible, but computationally demanding. By default DataWarrior uses FragFp fingerprints

FragFp is a binary fingerprint based on a substructure fragment dictionary, similar to the MDL keys. It relies on a dictionary of 512 predefined structure fragments.

To get a better comparison with the other clustering algorithms I generated the SphereFp fingerprint.

SphereFp descriptor encodes circular spheres of atoms and bonds into a hashed binary fingerprint of 512 bits. From every atom in the molecule, DataWarrior constructs fragments of increasing size by including n layers of atom neighbours (n=1 to 5). These circular fragments are canonicalized considering aromaticity, but neglecting stereo configurations. From the canonical representation a hash code is generated, which is used to set the respective bit of the fingerprint.

On the dataset of 789 molecules the clustering was instantaneous, and it was very easy to browse through the results.

Reading in the 150,000 structure file took 20 seconds and consumed 550 MB of RAM, adding the descriptors took the RAM usage up to 1.14GB. Unfortunately when I tried to do the clustering I got this message.


MOE is a molecular modelling package from Chemical Computing Group. Whilst MOE does not include Morgan/circular type fingerprints, there is an SVL script based on the work by Rogers et al. DOI that can be used to generate them. On the small dataset of 789 structures the clustering was complete within 15 seconds. On the larger set of 150,000 molecules, generation of the fingerprints took 2 mins and the subsequent clustering took >40 hours (and counting).

I did not attempt to cluster the 4.4 million structure file using MOE.


Vortex is a high-performance, interactive data visualization and analysis application with native cheminformatics and bioinformatics intelligence built in. Reading in the 789 structure file and then clustering was instantaneous. Vortex provides k-means clustering as standard, using atom-based similarity fingerprints (called DotFPCA), hex-packed to 1024 bits; this is the Dotmatics implementation of Morgan/circular type fingerprints.

Working with the 150,000 structure file, again using the default options, took less than a minute to generate the fingerprints and complete the clustering.

Using the 4.4M structure file took a little longer: reading in the file took less than 3 mins and consumed 13GB of RAM; generating the fingerprints took 18 mins, used 4 cores and took the RAM usage up to 16GB; the clustering then took a further 40 mins using all available cores, with the RAM usage rising to 17.5GB. Two columns are added: a CLUSTER number and a DISTANCE; the centroid has a distance of 0 and all other compounds vary in distance up to 1.

With over 4 million structures it is not really practical to simply scroll through the table, but here are a couple of tools that might help with further analysis. A question that might arise is “How many molecules belong to each cluster?”. The GenericClusterAnalysis script creates a new workspace containing two columns, the first containing the cluster number and the second the count of occurrences for each cluster; for these 4.4 million structures it took a couple of seconds to run.

Alternatively you can use a variation on the duplicate check script to select the first example from each cluster, you can download it here http://macinchem.org/reviews/vortexscripts/ChoseOneFromClusters.vpy.zip. An alternative is to select the centroid of each cluster by modifying the script to perform an additional selection based on the DISTANCE value, the script can be downloaded here http://macinchem.org/reviews/vortexscripts/ChoseCentreFromClusters.vpy.zip.

Dan Ormsby at Dotmatics provided me with a couple of script snippets that expose the fingerprint generation code; it turns out that under the hood there is a wide variety of fingerprint options, with the defaults all pre-set for the typical medicinal chemist user. I used the 150,000 molecule set to explore the impact of varying a couple of the options. Changing the atom path length in the fingerprint from 2 to 10 had minimal impact on the time taken, all runs completing in less than 1 minute. To improve efficiency when working with these fingerprints they are usually delivered into Vortex in a hex-packed format, the default being 1024 bits. I also did the clustering using the unpacked fingerprints, since this would represent the slowest method; using the unpacked fingerprints the run took two minutes to complete.


There is also a RSC CICAG workshop describing the use of KNIME for clustering.


For small datasets of a few thousand structures there are multiple options for clustering, from open-source toolkits to sophisticated desktop applications. However, as the dataset grows the computational demands increase, and support for the use of all available cores becomes essential. For very large datasets I’ve only found two options, Chemfp and Vortex. I’ve only used the free version of Chemfp (the commercial version may have better performance) and, whilst it copes with several million structures, it is considerably slower than Vortex. Vortex also has the advantage of a nice graphical user interface and scripting tools to allow further exploration of the dataset.

Last Updated 22 July 2017
