Clustering is an invaluable cheminformatics technique for subdividing a typically large compound collection into small groups of similar compounds. One advantage is that, once a collection is clustered, you can store the cluster identifiers and refer to them later, which is particularly valuable when dealing with very large datasets. Clustering is often used in the analysis of high-throughput screening results, virtual screening or docking studies.
I wrote an article comparing the options a while back: https://macinchem.org/2023/03/05/options-for-clustering-large-datasets-of-molecules/
A recent publication described BitBIRCH, "Efficient clustering of large molecular libraries" (DOI), and I thought I’d update the article to include BitBIRCH and compare it with the latest version of RDKit Butina clustering on updated hardware. Clustering 150K molecules using BitBIRCH took 22 seconds with very low memory demands. I’ll have a look at larger datasets in the future.
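For anyone unfamiliar with what a Butina-style clustering actually does, here is a minimal pure-Python sketch of the idea. It is not RDKit's implementation and not BitBIRCH. It assumes fingerprints are represented as sets of on-bit indices, and the function names (`tanimoto`, `butina_like_clusters`), the cutoff and the toy data are all hypothetical, chosen only for illustration. It also returns one cluster identifier per molecule, which is the thing you would store and refer back to later.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_like_clusters(fingerprints, cutoff=0.6):
    """Return one cluster identifier per fingerprint (illustrative sketch only)."""
    n = len(fingerprints)

    # Build neighbour lists: every pair at or above the similarity cutoff.
    # This is the O(n^2) step that makes this approach costly on large sets.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Visit candidate centroids from most to fewest neighbours
    # (ties broken by index so the result is deterministic).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))

    cluster_ids = [None] * n
    next_id = 0
    for centroid in order:
        if cluster_ids[centroid] is not None:
            continue  # already claimed by an earlier cluster
        cluster_ids[centroid] = next_id
        for member in neighbours[centroid]:
            if cluster_ids[member] is None:
                cluster_ids[member] = next_id
        next_id += 1
    return cluster_ids


# Toy fingerprints (hypothetical data, not real molecules).
toy_fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 13},
    {20, 21},
]
cluster_ids = butina_like_clusters(toy_fps, cutoff=0.5)
print(cluster_ids)
```

The all-pairs comparison in the first loop is what limits this kind of approach as datasets grow, since both time and memory scale with the square of the number of molecules.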
The updated article is here: https://macinchem.org/2023/03/05/options-for-clustering-large-datasets-of-molecules/