Whilst on most occasions chemical searches involve substructure or similarity searches there are occasions when a simple exact match of a compound identifier (or list of identifiers) is all that is needed. For these types of queries a Sqlite database offers very rapid searching and it is included on Macs by default.
SQLite is a software library that provides a relational database management system, it is self-contained, serverless, and requires little or no admin.
SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.
The source code is freely available https://sqlite.org/src/doc/trunk/README.md but since it is already installed on the Mac you don’t need to install.
Creating the database
It is fairly straight-forward to build a database with a couple of Terminal commands. This example uses 2 million structures from ChEMBL, the file can be downloaded from here http://macinchem.org/reviews/exactsearch/exactsearch_files/ChEMBLInchiKeySMILES.csv.zip (72MB). For each record there is the ChEMBLID, SMILES and InChKey. The first couple of rows are shown below.
1 2 3 4 5 |
chembl_id,SMILES,InChIKey CHEMBL153534,Cc1cc(cn1C)-c1csc(n1)N=C(N)N,MFRNFCWYPYSFQQ-UHFFFAOYSA-N CHEMBL405398,Brc1cccc(c1NCCN1CCOCC1)Nc1ncnc2ccncc12,VDSXZXJEWIWBCG-UHFFFAOYSA-N CHEMBL503634,COc1c(cc(c(c1O)C(=N)Cc1ccc(cc1)O)O)O,OPELSESCRGGKAM-UHFFFAOYSA-N |
Of course the database could be an internal compound collection, a reagent collection, or a compiled vendor catalogue.
I first created a folder called ChEMBLInChiKey and put the downloaded .csv file in the folder. Then used the cd command in the Terminal to change to that directory
1 2 3 |
(base) chrisswain@ChrisM1MBP ~ % cd /Users/chrisswain/Projects/ChEMBLInChiKey (base) chrisswain@ChrisM1MBP ChEMBLInChiKey % |
The next step is to create the sqlite database, this is achieved using the sqlite command in the Terminal followed by the name of the database you want to create.
1 2 3 4 |
(base) chrisswain@ChrisM1MBP ChEMBLInChiKey % sqlite3 chembldata.db; SQLite version 3.39.4 2022-09-29 15:55:41 Enter ".help" for usage hints. |
The next step is to create the table that will contain the data, and the column list for the table, each column has a name, data type, and the column constraint. In this case the table is called “ID_DATA” and the three columns are ChEMBLID, SMILES and INCHIKEY. They are all of datatype TEXT, and the CHEMBLID is set as the primary key that uniquely identifies each row of the table. Make sure you type a semicolon at the end of each SQL command! The sqlite3 program looks for a semicolon to know when your SQL command is complete.
1 2 3 4 5 6 |
sqlite> CREATE TABLE ID_DATA ( ...> CHEMBLID TEXT PRIMARY KEY, ...> SMILES TEXT NOT NULL, ...> INCHIKEY TEXT NOT NULL ...> ); |
We can check all have been created correctly using the following commands. The .tables command displays all the tables in the current database
Most of the time, sqlite3 just reads lines of input and passes them on to the SQLite library for execution. But input lines that begin with a dot (“.”) are intercepted and interpreted by the sqlite3 program itself. These “dot commands” are typically used to change the output format of queries, or to execute certain prepackaged query statements
1 2 3 |
sqlite> .tables ID_DATA |
To display the structure of a table, you use the .schema command
1 2 3 4 5 6 7 |
sqlite> .schema ID_DATA CREATE TABLE ID_DATA ( CHEMBLID TEXT PRIMARY KEY, SMILES TEXT NOT NULL, INCHIKEY TEXT NOT NULL ); |
The next step is to load the data from the .csv file into the table. The .mode command controls the format in this case csv, then use .import to import the file.
1 2 3 |
sqlite> .mode csv sqlite> .import /Users/chrisswain/Projects/ChEMBLInChiKey/ChEMBLInchiKeySMILES.csv ID_DATA |
We can now check all is working by searching for a specific record using a SELECT command. You should find the result is returned instantly.
1 2 3 4 5 6 |
sqlite> SELECT * FROM ID_DATA WHERE INCHIKEY == "MFRNFCWYPYSFQQ-UHFFFAOYSA-N"; CHEMBLID SMILES INCHIKEY ------------ ------------------------------------------------------------ --------------------------- CHEMBL153534,Cc1cc(cn1C)-c1csc(n1)N=C(N)N,MFRNFCWYPYSFQQ-UHFFFAOYSA-N sqlite> |
The SQLite website https://www.sqlite.org/index.html is a great place to look for all commands/keywords.
Accessing the database from Jupyter
Whilst you can search the database using the command line in the Terminal, one really useful features is that you can use python. SQLite3 can be integrated with Python using sqlite3 module. You do not need to install this module separately because it is shipped by default along with Python version 2.5.x onwards.
I created a jupyter notebook to test a few different queries. This can be downloaded here
The first step is to create a connection to the database.
1 2 3 |
import sqlite3 connection = sqlite3.connect("chembldata.db") |
The returned Connection object “connection” represents the connection to the on-disk database. In order to execute SQL statements and fetch results from SQL queries, we will need to use a database cursor. Call connection.cursor() to create the Cursor:
1 2 |
cursor = connection.cursor() |
In order to search the database we use the SELECT command, in this case the search is for a single InChiKey. The targetinchikey is first defined and the cursor.execute to assign the returned results to “rows”
1 2 3 |
targetinchikey = "GOLPEFAQLVNWBD-OEDNWNMQSA-N" rows = cursor.execute("SELECT * FROM ID_DATA WHERE INCHIKEY == ?", (targetinchikey,),).fetchall() |
Notice that ? placeholders are used to bind data to the query. Apparently this avoids SQL injection attacks https://en.wikipedia.org/wiki/SQL_injection. We can look at the results using
1 2 3 |
print(rows) [('CHEMBL414339', 'CC(=O)O[C@H]1C(=O)[C@]2(C)[C@@H](O)C[C@H]3OC[C@]33OC(=O)CCCCCOc4cccc(c4)[C@H](NC(=O)c4ccccc4)[C@@H](O)C(=O)O[C@H]4C[C@@](O)([C@@H](OC(=O)c5ccccc5)[C@@H]23)C(C)(C)C1=C4C', 'GOLPEFAQLVNWBD-OEDNWNMQSA-N')] |
The result is a list of tuples (CHEMBLID , SMILES , INCHIKEY). Of course SMILES strings are not that user friendly when it comes to viewing chemical structures so we can use RDKit to display the structure.
1 2 3 4 |
import pandas as pd from rdkit import Chem from rdkit.Chem import PandasTools |
First we import the list into a pandas dataframe and add column headers
1 2 3 4 5 6 7 8 |
df = pd.DataFrame(rows) df.columns = ['ChEMBLID','SMILES','InChiKey'] df ChEMBLID SMILES InChiKey CHEMBL414339 CC(=O)O[C@H]1C(=O)[C@]2(C)[C@@H](O)C[C@H]3OC[C... GOLPEFAQLVNWBD-OEDNWNMQSA-N |
The RDKIt molecule object
1 2 3 |
PandasTools.AddMoleculeColumnToFrame(df,'SMILES','Molecule',includeFingerprints=True) df |
Of course sometimes you will want to search for a list of identifiers, we use a similar approach. In this the search is for a list of ChEMBL IDs. The placeholder “?” approach above is used for single search terms so in this case we use the “IN” command to search for a list and generate one placeholder value for each item in the list.
1 2 3 4 5 6 7 |
targetchembl = ["CHEMBL439095", "CHEMBL107768", "CHEMBL4741695"] rows = cursor.execute(f'SELECT * FROM ID_DATA WHERE CHEMBLID IN ({", ".join(["?"]*len(targetchembl))})', targetchembl).fetchall() print(rows) [('CHEMBL107768', 'Cc1onc(c1CCC(N)C(=O)O)O', 'NZDIZJGEDFARSV-UHFFFAOYSA-N'), ('CHEMBL439095', 'CC(C)C[C@@H]1NC(=O)CNC(=O)[C@@H](NC(=O)[C@H](NC(=O)[C@@H](NC(=O)[C@@H](CCCN)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H](NC(=O)[C@H](NC(=O)[C@@H](NC(=O)[C@H](NC(=O)[C@@H](CCCN)NC(=O)[C@H](NC(=O)[C@@H](CNC(=O)[C@H](CC(=O)N)NC(=O)C/C=C/C=C/C(C)C)[C@H](OC(=O)[C@@H](NC(=O)[C@@H](C)NC1=O)c1ccc(c(c1)Cl)O)C(=O)N)c1ccc(cc1)O)C(C)C)c1ccc(cc1)O)c1ccc(cc1)O)[C@@H](C)O)c1ccc(cc1)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)[C@@H](C)O)c1ccc(cc1)O', 'JZUDLBXYQZSERJ-BKSFDNMBSA-N'), ('CHEMBL4741695', 'NC1=Nc2[nH]c(c(c2C(=O)N1)CN1CCN(CC1)Cc1cccc(c1)F)CN1CCN(CC1)Cc1cccc(c1)F', 'XLXWABBNDCNDAB-UHFFFAOYSA-N')] |
Use RDKit pandas dataframe to view the results
1 2 3 4 5 |
df = pd.DataFrame(rows) df.columns = ['ChEMBLID','SMILES','InChiKey'] PandasTools.AddMoleculeColumnToFrame(df,'SMILES','Molecule',includeFingerprints=True) df |
Of course you don’t have to return every column from the table, in this case we only fetch the SMILES string
1 2 3 |
targetchembl = "CHEMBL4741695" targetsmiles = cursor.execute("SELECT SMILES FROM ID_DATA WHERE CHEMBLID == ?", (targetchembl,),).fetchall() |
Again it returned in a list format
1 2 3 |
targetsmiles [('NC1=Nc2[nH]c(c(c2C(=O)N1)CN1CCN(CC1)Cc1cccc(c1)F)CN1CCN(CC1)Cc1cccc(c1)F',)] |
This can be rendered using RDKit
1 2 3 4 |
from rdkit import Chem m = Chem.MolFromSmiles(targetsmiles[0][0]) m |
On the next page How to a access this from other applications.
2 thoughts on “Using sqlite for exact search”
Comments are closed.