Vortex is a high performance data visualisation and analysis application. Vortex is also scientifically aware providing native cheminformatics and bioinformatics analysis and visualisations. I’m using the latest build but this has not been yet compiled for the M1 chip.
For testing a selection of structures from ChEMBL, 2D structures in sdf file format were used. MWt 250 to 500, calc LogP 0 to 5. This is a 2.6 GB file containing 1,144,624 molecules.
After import a selection of Vortex scripts were run.
One of the really neat features of Vortex is the ability to script multiple sub-structure searches using SMARTS. There are many occasions when this sort of feature is useful, if you want to flag molecules that contain reactive functional groups, toxicophores, or PAINS functional groups that have been shown to interfere with a variety of screens. Alternatively if you have a drug discovery project with multiple chemotypes you might want to tag particular groups of compounds as belonging to a named series to aid analysis.
PAINS filter script
Jonathan B. Baell and Georgina A. Holloway published a very interesting paper on their analysis of frequent hitters from screening assays. DOI, in the supplementary information they provided the corresponding filters in Sybyl Line Notation (SLN) format. These were converted to SMARTS format and used to create a Vortex script to provide a PAINS filter. The filter contains 540 different SMARTS queries.
The first part of the script converts all 1,144,624 structures to SMILES and then runs multiple SMARTS matches. There are 540 different SMARTS in the file so with 1,144,624 molecules we are running 618,096,960 SMARTS matches.
The Intel MacBook Pro completed the task in 8 minutes and 3 seconds
The MacBook Pro M1 max completed the task in 3 minutes and 27 seconds
The M2 MacBook Air completed the task in 4 mins and 23 seconds
Dotmatics have now released a build of Vortex for Apple Silicon
Using the latest daily build
The MacBook Pro M1 max completed the task in 3 minutes and 2 seconds
The M2 Mac Studio Ultra completed the task in 1 min and 53 seconds
It is also worth noting that whilst the fans started up on the Intel MacBook Pro after 1 min 40 secs, the fans on the M1 MacBook Pro were silent throughout the test.
Duplicate check script
This script first generates the InChikey for each molecule and then compares them all to identify potential duplicates, using a 1.4M structure file.
The Intel MacBook Pro completed the task in 21 mins
The MacBook Pro M1 max completed the task in 11.9 mins
K-means clustering is a popular technique for clustering data sets, I’ve found particularly useful when dealing with larger datasets in this case I’m using the 1,144,624 molecule ChEMBL dataset.
The Intel MacBook Pro completed the task in 5 min 30 secs
The MacBook Pro M1 max completed the task in 4 mins
The M2 MacBook Air completed the task in 4 mins and 30 seconds
The M2 Mac Studio Ultra completed the task in 2 mins and 45 seconds
Last update 4 July 2023