Vortex script for Analysis of Categories

I often need to tag individual molecules within a dataset with a specific property, such as the results of clustering algorithms, PAINS filtering, or liver toxicity filters. Alternatively, if you have a drug discovery project with multiple chemotypes, you might want to tag particular groups of compounds as belonging to a named series to aid analysis.

A question that might then arise is “How many molecules belong to each category?” Whilst you can see the numbers in the sidebar, there is no easy way to export the results.

This script evolved after discussions with Dan and Matt. It allows you to create a new workspace containing the category information.

The first part of the script lets the user select the categorical column; the script then looks up that column and its name.

We then use a defaultdict to hold the counts. This works exactly like a normal dictionary, but it is initialized with a function (“default factory”) that takes no arguments and provides the default value for a nonexistent key. Without defaultdict, we would need to check whether a category had already been seen before we could add 1 to its count.
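To illustrate the point about defaultdict, here is a minimal sketch in plain Python. It does not use the Vortex API: the list `labels` and the function names are hypothetical stand-ins for the values read from the selected column, and the second function shows the explicit key check that defaultdict makes unnecessary.

```python
from collections import defaultdict


def count_categories(values):
    """Count how often each category label occurs, using defaultdict."""
    counts = defaultdict(int)  # int() returns 0, so an unseen key starts at 0
    for value in values:
        counts[value] += 1
    return counts


def count_categories_plain(values):
    """Same result with a normal dict, which needs an explicit key check."""
    counts = {}
    for value in values:
        if value not in counts:
            counts[value] = 0
        counts[value] += 1
    return counts


# Hypothetical category labels standing in for the selected column
labels = ["Series A", "PAINS", "Series A", "Series B", "Series A", "PAINS"]
counts = count_categories(labels)
```

Both functions return the same counts; the defaultdict version simply drops the membership test inside the loop.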

Finally, we sort by decreasing count and create a new workspace with two columns, the first containing the categories and the second the number of occurrences of each category, as shown in the examples below.
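The sorting step might look something like the sketch below, again in plain Python rather than the Vortex API. The dictionary `example_counts` is made-up data, and the two lists stand in for the two columns of the new workspace; creating the workspace itself is left to the downloadable script.

```python
def sort_by_count(counts):
    """Return (category, count) pairs ordered from most to least frequent."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up counts for illustration only
example_counts = {"Series B": 1, "Series A": 3, "PAINS": 2}

rows = sort_by_count(example_counts)

# Split the sorted pairs into the two columns of the output table
categories = [category for category, count in rows]
totals = [count for category, count in rows]
```

Here `categories` holds the labels in order of decreasing frequency and `totals` holds the matching counts.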

I’ve tested the script on a data set of 161,000 molecules and it took less than 1 second to complete for a variety of types of categories.

The Vortex Script

The script can be downloaded here

Page Updated 28 September 2016
