The previous couple of pages described creating a SQLite database using the Terminal and then running exact searches from a Jupyter notebook, python script or ChemDraw AppleScript using ChEMBLID or InChiKeys.

Whilst this provides a very fast look up for specific structures there are of course occasions when a substructure search is required. Fortunately, due to fantastic work from Riccardo Vianello there is Chemicalite a SQLite database extension for cheminformatics applications. This supports storing molecules and fingerprints, compute descriptors, run chemical queries on a SQLite database. It is available on GitHub and can be installed using Conda. I first created a conda environment called my_SQL 

You need to know where the Chemicalite extension is downloaded, for me it is

Creating a Database

Whilst this can be done from the Terminal, in this instance we will use a Jupyter notebook. First we need to download the ChEMBL data, download chembl31chemreps.txt.gz from the ChEMBL ftp site The format of the data in the downloaded file is shown below. There are 4 tab separated columns of data, chemblid, Canonicalsmiles, standardinchi and standardinchi_key.

I created a folder called “ChEMBLSQLite” and put the above file in the folder, I then used the Terminal to cd into the folder and created a new notebook. First a couple of imports.

Then create a database and load the extension, because this operation loads external code it is initially disabled to avoid potential security risks, so we first enable, load the extension and then disable the load functionality.

Next we create a table with the appropriate fields including a molecule field.

The next step is to parse the ChEMBL data

And then load the data into the database, at the same time using the mol_from_smiles to create a molecule object.

This will populate the database and create a 9GB chembldb.db file.

We can check everything is working by running a simple search for CHEMBL4741695

The output is shown in the image below, chemblid, Canonicalsmiles, standardinchi, standardinchi_key and the MOL as a binary object.

Structure-Based Searching

We can perform a substructure search using the Mol_is_substruc command. If the query is a SMILES string we need to convert to a mol using mol_from_smiles

We can use the type of search to return the different record elements in the database, for example to return ChemId or SMILES

If we try to return the molecule then it is not as straightforward , I tried

Then I tried converting to an rdkit mol object but got an error.

After some experimentation it seems you need to use the mol_to_binary_mol. I’m not sure why perhaps the SQLite mol object is not the same as the RDKit mol object?

Then convert to rdkit mol

I contacted Riccardo and got this explanation

Regarding the stored mol and how it relates to an RDKit molecule. The general idea is that the code should handle the binary blob as some kind of opaque data, and never access it directly, but rather wrap it in a function call that would transform into something suitable for processing. Under the hood, the mol binary blob is still a pickled / binary serialized RDKit molecule, but it’s prepended by a 32 bits header that the chemicalite code uses to verify that the input blob matches the expected data type. Binary fingerprints (and in the future reactions) are similarly stored as binary blobs, but using a different header. These headers are just meant to prevent some low level unrecoverable errors (or worse) that would occur if arbitrary binary data is passed as input to one function or the other, or if a mol is passed where a bfp is expected. The call to mol_to_binary_mol in practice just removes this header and returns the bare rdkit binary mol.

Speeding up the searching

Whilst the searching described above works it is slow because it has to search every record in the database.

ChemicaLite uses a virtual table mechanism to support indexing binary fingerprints in an RD-tree data structure, and this way improve the performances of substructure and similarity queries.

This will take a little time to process.

We can then search using

The results are returned very quickly, we can also measure the time taken to execute a cel using the %%time command.

On my Apple Silicon MacBook Pro M1 max 

The original search 

Using the indexed fingerprints

A huge increase in speed.

Similarity searching

The RDKit can be used to generate a variety of fingerprints one of the most popular is the Morgan fingerprint which is a a reimplementation of the extended connectivity fingerprint (ECFP). This fingerprint data is stored and indexed into RD-tree virtual tables as before. We use the Tanimoto coefficient as a measure of similarity.

The similarity search requires to input variables, the query molecule as a SMILES string, and the similarity threshold

To find out how many similar molecules the search is

We can get all the records using

Displaying the results using mols2grid

mols2grid is an interactive molecule viewer for 2D structures, based on RDKit details are on GitHub and installation instructions are on GitHub . First we import the data into a pandas data frame and set the numbers to 2 decimal places.

We then import the data frame into mols2grid (there are many options for display but I decided to keep it simple.

Links to notebooks

Constructing the database

Searching the database

Last Updated 12 April 2023

Related Posts