Whilst on most occasions chemical searches involve substructure or similarity searches there are occasions when a simple exact match of a compound identifier (or list of identifiers) is all that is needed. For these types of queries a Sqlite database offers very rapid searching and it is included on Macs by default.

SQLite is a software library that provides a relational database management system, it is self-contained, serverless, and requires little or no admin.

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.

The source code is freely available https://sqlite.org/src/doc/trunk/README.md but since it is already installed on the Mac you don’t need to install.

Creating the database

It is fairly straight-forward to build a database with a couple of Terminal commands. This example uses 2 million structures from ChEMBL, the file can be downloaded from here http://macinchem.org/reviews/exactsearch/exactsearch_files/ChEMBLInchiKeySMILES.csv.zip (72MB). For each record there is the ChEMBLID, SMILES and InChKey. The first couple of rows are shown below.

Of course the database could be an internal compound collection, a reagent collection, or a compiled vendor catalogue. 

I first created a folder called ChEMBLInChiKey and put the downloaded .csv file in the folder. Then used the cd command in the Terminal to change to that directory

The next step is to create the sqlite database, this is achieved using the sqlite command in the Terminal followed by the name of the database you want to create.

The next step is to create the table that will contain the data, and the column list for the table, each column has a name, data type, and the column constraint. In this case the table is called “ID_DATA” and the three columns are ChEMBLID, SMILES and INCHIKEY. They are all of datatype TEXT, and the CHEMBLID is set as the primary key that uniquely identifies each row of the table. Make sure you type a semicolon at the end of each SQL command! The sqlite3 program looks for a semicolon to know when your SQL command is complete.

We can check all have been created correctly using the following commands. The .tables command displays all the tables in the current database

Most of the time, sqlite3 just reads lines of input and passes them on to the SQLite library for execution. But input lines that begin with a dot (“.”) are intercepted and interpreted by the sqlite3 program itself. These “dot commands” are typically used to change the output format of queries, or to execute certain prepackaged query statements

To display the structure of a table, you use the .schema command

The next step is to load the data from the .csv file into the table. The .mode command controls the format in this case csv, then use .import to import the file.

We can now check all is working by searching for a specific record using a SELECT command. You should find the result is returned instantly.

The SQLite website https://www.sqlite.org/index.html is a great place to look for all commands/keywords.

Accessing the database from Jupyter

Whilst you can search the database using the command line in the Terminal, one really useful features is that you can use python. SQLite3 can be integrated with Python using sqlite3 module. You do not need to install this module separately because it is shipped by default along with Python version 2.5.x onwards. 

I created a jupyter notebook to test a few different queries. This can be downloaded here

The first step is to create a connection to the database.

The returned Connection object “connection” represents the connection to the on-disk database. In order to execute SQL statements and fetch results from SQL queries, we will need to use a database cursor. Call connection.cursor() to create the Cursor:

In order to search the database we use the SELECT command, in this case the search is for a single InChiKey. The targetinchikey is first defined and the cursor.execute to assign the returned results to “rows”

Notice that ? placeholders are used to bind data to the query. Apparently this avoids SQL injection attacks https://en.wikipedia.org/wiki/SQL_injection. We can look at the results using

The result is a list of tuples (CHEMBLID , SMILES , INCHIKEY). Of course SMILES strings are not that user friendly when it comes to viewing chemical structures so we can use RDKit to display the structure.

First we import the list into a pandas dataframe and add column headers

The RDKIt molecule object

Of course sometimes you will want to search for a list of identifiers, we use a similar approach. In this the search is for a list of ChEMBL IDs. The placeholder “?” approach above is used for single search terms so in this case we use the “IN” command to search for a list and generate one placeholder value for each item in the list.

Use RDKit pandas dataframe to view the results

Of course you don’t have to return every column from the table, in this case we only fetch the SMILES string

Again it returned in a list format

This can be rendered using RDKit

On the next page How to a access this from other applications.

Related Posts

2 thoughts on “Using sqlite for exact search

Comments are closed.