In a previous post I illustrated how to download PubChem and create a local searchable database using a Jupyter notebook. I also included a Vortex/Python script to search for a PubChem ID from an InChIKey. A couple of readers have asked if I could give more examples, but using a smaller database, since the PubChem SQLite database with fingerprints is over 450 GB!

So instead I’ve created a Jupyter notebook that uses the latest version of the ChEMBL database, which can be downloaded from their FTP site. The file you need is highlighted below. https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_35/

The Jupyter notebook is shown below and can be downloaded here

The Jupyter notebook can be used to generate the chembl.sqlite database and also includes various cells that show options for querying the database. These can easily be turned into Python scripts that can be called from the command line or accessed via a Vortex script.

Search using InChIKey

The Python script below takes an InChIKey as input and returns the chembl_id.
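A minimal sketch of this kind of lookup, assuming the database keeps the standard ChEMBL tables molecule_dictionary and compound_structures. The table layout in your own chembl.sqlite may differ, so treat the table and column names, and the function name, as assumptions to adjust.

```python
import sqlite3

# Assumed schema: the standard ChEMBL tables. Adjust the names if your
# chembl.sqlite was built with a different layout.
INCHIKEY_QUERY = """
SELECT md.chembl_id
FROM compound_structures AS cs
JOIN molecule_dictionary AS md ON md.molregno = cs.molregno
WHERE cs.standard_inchi_key = ?
"""


def inchikey_to_chembl_id(conn, inchikey):
    """Return the chembl_id for an InChIKey, or None if it is not found."""
    # A bound parameter (?) avoids building SQL by string concatenation.
    row = conn.execute(INCHIKEY_QUERY, (inchikey.strip(),)).fetchone()
    return row[0] if row else None
```

To use it, open the database with sqlite3.connect() and pass the connection together with the InChIKey read from the command line.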

We can test this using the command line.

We can also access this using a Vortex script. The first part of the script asks the user to select the column containing the InChIKeys; subprocess is then used to run the query and collect the response. The response is then entered into the Vortex workspace.
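As a rough sketch of the subprocess step only, the helper below runs a command and returns whatever it prints. The Vortex-specific parts (choosing the column and writing the results back to the workspace) are left out, and the function name run_lookup is hypothetical. It uses subprocess.Popen, which is available in both Python 2.7 and Python 3.

```python
import subprocess


def run_lookup(command):
    """Run a command given as a list of arguments and return its stdout."""
    # Passing a list (and no shell) means the arguments need no extra quoting.
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out, err = process.communicate()
    if process.returncode != 0:
        raise RuntimeError("Lookup failed: %r" % (err,))
    # Output arrives as bytes; decode it and drop the trailing newline.
    if isinstance(out, bytes):
        out = out.decode("utf-8")
    return out.strip()
```

The command would be a list holding the path to the Python interpreter, the path to the lookup script and the InChIKey for one row. All of these paths are specific to your machine.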

The result is shown below.

ChEMBL ID to SMILES

If you have a list of ChEMBL IDs and you want to add the structures, this script should help.
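A sketch of the reverse lookup, again assuming the standard ChEMBL tables molecule_dictionary and compound_structures and a canonical_smiles column. The function name is hypothetical, and the names should be adjusted if your database was built differently.

```python
import sqlite3

# Assumed schema: standard ChEMBL table and column names.
SMILES_QUERY = """
SELECT cs.canonical_smiles
FROM molecule_dictionary AS md
JOIN compound_structures AS cs ON cs.molregno = md.molregno
WHERE md.chembl_id = ?
"""


def chembl_id_to_smiles(conn, chembl_id):
    """Return the SMILES for a ChEMBL ID, or None if the ID is unknown."""
    # Normalise the input so " chembl25 " and "CHEMBL25" are treated alike.
    key = chembl_id.strip().upper()
    row = conn.execute(SMILES_QUERY, (key,)).fetchone()
    return row[0] if row else None
```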

Again this can be tested from the command line, and accessed from a Vortex script in a similar manner to before.

Count of molecules containing a substructure

Suppose you have a list of molecules and you want to know how many times each appears as a substructure in ChEMBL. In this case we are using the chemicalite extension, so chemicalite is loaded after connecting to the database. The SQL query is then created and run.
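The connect, load and query sequence might look something like the sketch below. The table name compounds and column name molecule are assumptions, as are the function names, so check them against your own database. For simplicity the query does not use a substructure index, which means it scans every row and will be slow on the full database. The chemicalite documentation describes an index-backed form of the query that is much faster.

```python
import sqlite3

# Assumed names: a table "compounds" with a chemicalite mol column "molecule".
SUBSTRUCTURE_COUNT_QUERY = """
SELECT COUNT(*)
FROM compounds
WHERE mol_is_substruct(molecule, mol_from_smiles(?))
"""


def connect_with_chemicalite(db_path):
    """Open the database and load the chemicalite extension."""
    conn = sqlite3.connect(db_path)
    # Extension loading is off by default and must be enabled first.
    conn.enable_load_extension(True)
    conn.load_extension("chemicalite")
    return conn


def count_substructure_matches(conn, smiles):
    """Count the molecules that contain the query SMILES as a substructure."""
    return conn.execute(SUBSTRUCTURE_COUNT_QUERY, (smiles,)).fetchone()[0]
```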

We can test this from the command line, but there are a couple of things to watch. The Python environment used to create the database needs to be used. In addition, SMILES can contain characters that the shell might interpret incorrectly, so they should be enclosed in quotes.
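If the command line is being assembled in Python, one way to handle the quoting is shlex.quote from the standard library, which wraps a string in quotes when it contains characters a POSIX shell would treat specially (such as the brackets in a SMILES). This is a sketch, and the helper name build_command is hypothetical.

```python
import shlex


def build_command(python_exe, script, smiles):
    """Return a command line string that is safe to paste into a POSIX shell."""
    # shlex.quote leaves plain words alone and quotes anything containing
    # characters such as ( ) = # that the shell would otherwise interpret.
    return " ".join(shlex.quote(part) for part in (python_exe, script, smiles))
```

Note that shlex.quote targets POSIX shells; quoting rules on Windows are different.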

The Vortex script is shown below. The first part gets the SMILES from the structure and then we generate the query, making sure to put the SMILES in quotes.

Count of similar molecules

An alternative search might be to find how many similar molecules there are. The Python script is shown below, and there are a couple of points to note. Again the chemicalite extension needs to be loaded. Command-line arguments are strings, so the threshold value (Tanimoto similarity) needs to be converted to a float.
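The argument handling can be sketched as below. This covers only the string-to-float conversion, not the similarity query itself. The function name and the argument order (SMILES first, then threshold) are assumptions, and the range check is an addition for illustration.

```python
def parse_similarity_args(argv):
    """Return (smiles, threshold) from a sys.argv-style list of strings."""
    smiles, threshold_text = argv[1], argv[2]
    # Everything in argv is a string, so "0.6" must be converted before
    # it can be used as a numeric Tanimoto threshold.
    threshold = float(threshold_text)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("Tanimoto threshold must be between 0 and 1")
    return smiles, threshold
```

In a script this would be called as parse_similarity_args(sys.argv), and the resulting float would then be passed to the query as a bound parameter.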

The results of these three scripts are shown below. Starting with a list of ChEMBL IDs, first the SMILES string is added, then the result of the substructure search, and then the number of similar molecules (Tanimoto 0.6).

The Python and Vortex scripts can be downloaded here

I put the Python scripts in the same folder as the chembl.sqlite database, and the Vortex scripts into a “ChEMBL” folder I created inside the Vortex scripts folder. You will need to edit the paths within the scripts for your machine.

Some of the more eagle-eyed will have noticed the commented-out line in the Python code shown below. Given that the SQLite databases can be quite large, I also tested having them on an external drive. I tried out various options and ended up buying an ACASIS 40Gbps M.2 NVMe SSD Enclosure with Crucial P3 Plus 4TB SSD (Thunderbolt).

Using the external drive was slightly slower, but I’ve not done extensive testing.
