Too often I come across datasets that contain chemical names or identifiers but no actual molecular structure. Recently Dan at Dotmatics suggested I look at OPSIN. There are several tools for converting names to structures; I’ve highlighted a couple of options here and described scripts that allow them to be used from within Vortex.
OPSIN
OPSIN is a Java (1.6+) library for IUPAC name-to-structure conversion offering high recall and precision on organic chemical nomenclature. Supported outputs are SMILES, CML (Chemical Markup Language) and InChI (IUPAC International Chemical Identifier). The latest version can be downloaded here: OPSIN-2.2.0-jar-with-dependencies.jar. To access it from within Vortex you need to put the jar file in the folder
    /Users/USERNAME/vortex/libs/

where USERNAME is your username.
OPSIN can be called from the command line using
    java -jar OPSIN-2.2.0-jar-with-dependencies.jar -osmi input.txt output.txt
where input.txt contains a series of chemical names, one per line. Or, from Java, for an individual chemical name:
    NameToStructure nts = NameToStructure.getInstance();
    String smiles = nts.parseToSmiles("acetonitrile");
We can use the latter in the Vortex script as shown below. The file containing the chemical names looks like this; each chemical name is on its own line in plain text. OPSIN was designed to support IUPAC names, but an increasing number of trivial (but widely used) chemical names and synonyms are also supported.
Name
iodobenzene
2,15-dimethyl-14-(1,5-dimethylhexyl)tetracyclo[8.7.0.02,7.011,15]heptadec-7-en-5-ol
acetone
quinuclidine
1-Azabicyclo[2.2.2]octane
2-Methyl-1,3,5-trinitrobenzene
5-{2-Ethoxy-5-[(4-methylpiperazin-1-yl)sulfonyl]phenyl}-1-methyl-3-propyl-1H,6H,7H-pyrazolo[4,3-d]pyrimidin-7-one
Ethyl Magnesium Bromide
Lithium Bromide
Anisole
Phenylalanine
When importing the file into Vortex it is important NOT to use comma as the delimiter or it will break up the chemical names that contain a comma (I used tab).
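To see why the delimiter matters, here is a minimal standalone Python sketch (it runs outside Vortex, and the sample text and the helper name `read_rows` are made up for illustration). It parses the same two names with a comma and then a tab as the delimiter, using the standard `csv` module.

```python
import csv
import io

# Illustrative sample: a header plus two names from the list above,
# one name per line, as in the file imported into Vortex.
SAMPLE = "Name\niodobenzene\n2-Methyl-1,3,5-trinitrobenzene\n"

def read_rows(text, delimiter):
    """Parse delimited text and return the rows as lists of fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

comma_rows = read_rows(SAMPLE, ",")
tab_rows = read_rows(SAMPLE, "\t")

# With a comma delimiter the locants end up in separate fields.
print(comma_rows[2])  # ['2-Methyl-1', '3', '5-trinitrobenzene']
# With a tab delimiter the whole name stays in one field.
print(tab_rows[2])    # ['2-Methyl-1,3,5-trinitrobenzene']
```

Any delimiter that never occurs inside a chemical name would behave the same way as the tab here.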
Once the file containing the chemical names has been imported it should look like this.
After running the script it should look like this, where Vortex has automatically rendered the SMILES strings as structures.
The first part of the script imports OPSIN and then creates the dialog box that lets the user identify the column containing the chemical names. It then loops through the rows in the workspace, generates the SMILES string from the chemical name in each row, and puts it into a new column called SMILES.
The Vortex Script
    # Vortex imports
    import os
    import sys
    sys.path.append("/Users/USERNAME/vortex/libs/OPSIN-2.2.0-jar-with-dependencies.jar")
    #Need to edit USERNAME to include your username

    # OPSIN's Java package name is lowercase
    from uk.ac.cam.ch.wwmm.opsin import NameToStructure, NameToStructureConfig

    nts = NameToStructure.getInstance()
    ntsconfig = NameToStructureConfig

    colsmi = vtable.findColumnWithName('SMILES', 0)

    # Dialog asking the user which column holds the chemical names
    input_label = swing.JLabel("Name column (for input)")
    input_cb = workspace.getColumnComboBox()

    panel = swing.JPanel()
    layout.fill(panel, input_label, 0, 0)
    layout.fill(panel, input_cb, 1, 0)

    ret = vortex.showInDialog(panel, "Choose Chemical Name Column")

    if ret == vortex.OK:
        input_idx = input_cb.getSelectedIndex()
        if input_idx == 0:
            vortex.alert("you must choose a column")
        else:
            col = vtable.getColumn(input_idx - 1)

            # Convert each name to SMILES and store it in the SMILES column
            rows = vtable.getRealRowCount()
            for r in range(0, int(rows)):
                drugName = col.getValueAsString(r)
                mySMILES = nts.parseToSmiles(drugName)
                colsmi = vtable.findColumnWithName('SMILES', 1)
                colsmi.setValueFromString(r, mySMILES)

            vtable.fireTableStructureChanged()
The Vortex script can be downloaded here
Chemical Identifier Resolver
The Chemical Identifier Resolver (CIR) by the CADD Group at the NCI/NIH is a web service that performs various chemical name-to-structure conversions. The service works as a resolver for different chemical structure identifiers and allows one to convert a given structure identifier into another representation or structure identifier. It can help you identify and find the chemical structure if you have an identifier such as an InChIKey or CAS Number. You can either use the resolver web form at the web link above or use the following simple URL as a web service. Full documentation is here
    http://cactus.nci.nih.gov/chemical/structure/"structure identifier"/"representation"
Example: Chemical name to SMILES:
    http://cactus.nci.nih.gov/chemical/structure/aspirin/smiles
The input identifier can be a chemical name, SMILES, CAS Number, InChI, etc., and the returned representation can be SMILES, SDF, PNG, etc.
Chemical names are resolved by a database lookup into a full structure representation. The service currently has approximately 68 million chemical names available, linked to approximately 16 million unique structure records. The set of available names includes trivial names, synonyms, systematic names, registry numbers, etc.
Much of the script is similar to the one using OPSIN; the difference is that this time we construct the URL for the web service:
    mystr = "http://cactus.nci.nih.gov/chemical/structure/" + encoded_name + "/smiles"
We encode the drugName to ensure that special characters will not break the URL.
    encoded_name = urllib.quote(drugName)
The SMILES returned is then added to the workspace.
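As a quick illustration of what the encoding step does, here is a minimal standalone sketch in Python 3, where `urllib.parse.quote` plays the role of the `urllib.quote` call used in Vortex's Jython. The helper name `build_cir_url` is my own, not part of CIR or Vortex, and nothing here contacts the server; it only builds the URL strings.

```python
from urllib.parse import quote  # Python 3 counterpart of urllib.quote

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure/"

def build_cir_url(identifier, representation="smiles"):
    """Return the CIR URL for an identifier and an output representation."""
    # quote() percent-encodes spaces, brackets and other reserved
    # characters so they cannot break the URL.
    return CIR_BASE + quote(identifier) + "/" + representation

print(build_cir_url("aspirin"))
# http://cactus.nci.nih.gov/chemical/structure/aspirin/smiles
print(build_cir_url("1-Azabicyclo[2.2.2]octane"))
# http://cactus.nci.nih.gov/chemical/structure/1-Azabicyclo%5B2.2.2%5Doctane/smiles
print(build_cir_url("Ethyl Magnesium Bromide", "sdf"))
# http://cactus.nci.nih.gov/chemical/structure/Ethyl%20Magnesium%20Bromide/sdf
```

Note that `quote` leaves "/" unescaped by default, so a name that contains a slash would still be read as an extra path segment.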
The Vortex Script
    # Python imports
    import urllib2
    import urllib

    # Vortex imports
    import com.dotmatics.vortex.util.Util as Util
    import com.dotmatics.vortex.mol2img.jni.genImage as genImage
    import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
    import jarray
    import binascii
    import string
    import os

    # Dialog asking the user which column holds the chemical names
    input_label = swing.JLabel("Name column (for input)")
    input_cb = workspace.getColumnComboBox()

    panel = swing.JPanel()
    layout.fill(panel, input_label, 0, 0)
    layout.fill(panel, input_cb, 1, 0)

    ret = vortex.showInDialog(panel, "Choose Drug Name Column")

    if ret == vortex.OK:
        input_idx = input_cb.getSelectedIndex()
        if input_idx == 0:
            vortex.alert("you must choose a column")
        else:
            col = vtable.getColumn(input_idx - 1)

            # "http://cactus.nci.nih.gov/chemical/structure/" & the_encode_text & "/smiles"
            colsmi = vtable.findColumnWithName('SMILES', 0)

            rows = vtable.getRealRowCount()
            for r in range(0, int(rows)):
                drugName = col.getValueAsString(r)
                # Percent-encode the name so special characters do not break the URL
                encoded_name = urllib.quote(drugName)
                mystr = "http://cactus.nci.nih.gov/chemical/structure/" + encoded_name + "/smiles"
                try:
                    myreturn = urllib2.urlopen(mystr).read()
                except urllib2.HTTPError:
                    continue  # some not found
                colsmi = vtable.findColumnWithName('SMILES', 1)
                colsmi.setValueFromString(r, myreturn)

            vtable.fireTableStructureChanged()
Using CIR is significantly slower, but it is better able to assign structures to trade names etc.
The Vortex script can be downloaded here
ChemSpider
ChemSpider is a free chemical structure database providing fast access to over 58 million structures, properties, and associated information. There are also a series of web services that provide access to the data.
The ChemSpider webservices are a powerful suite of tools that provide access to many of the commonly used features of ChemSpider through Application Programming Interfaces (APIs). The webservices make it possible to enrich your Apps, your website, your in-house data systems and data workflow tools.
To access this web service you will need to register and obtain a security token. Registration also gives you access to a wide range of web services covering structure and spectra searching, and generic conversion between chemical file formats. If you are a Python user you should also look at ChemSpiPy. If you use pip then installation is straightforward.
    pip install chemspipy
For many tasks that you might want to perform on ChemSpider (searches etc), there is no need to have a ChemSpider User account. However, if you want to save Results sets, Curate records, add Data or use certain Web services, then you will need to have a ChemSpider account linked to an RSC ID.
You will need to edit the downloaded script to enter your security token.
The Vortex Script
    # Python imports
    import httplib
    import urllib2
    import urllib
    from xml.etree import ElementTree as etree

    # Vortex imports
    import com.dotmatics.vortex.util.Util as Util
    import com.dotmatics.vortex.mol2img.jni.genImage as genImage
    import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
    import jarray
    import binascii
    import string
    import os
    import sys

    # Dialog asking the user which column holds the chemical names
    input_label = swing.JLabel("Name column (for input)")
    input_cb = workspace.getColumnComboBox()

    panel = swing.JPanel()
    layout.fill(panel, input_label, 0, 0)
    layout.fill(panel, input_cb, 1, 0)

    ret = vortex.showInDialog(panel, "Choose Drug Name Column")

    # You need to replace with your security token
    token = 'Your Security Token'

    if ret == vortex.OK:
        input_idx = input_cb.getSelectedIndex()
        if input_idx == 0:
            vortex.alert("you must choose a column")
        else:
            col = vtable.getColumn(input_idx - 1)
            colsmi = vtable.findColumnWithName('SMILES', 1)
            colcsid = vtable.findColumnWithName('CSID', 1)

            rows = vtable.getRealRowCount()
            fails = []
            for r in range(0, int(rows)):
                drugName = col.getValueAsString(r)
                encoded_name = urllib.quote(drugName)
                mystr = "http://www.chemspider.com/Search.asmx/SimpleSearch?query=%s&token=%s" % (encoded_name, token)
                try:
                    # First request: search by name to get the ChemSpider ID (CSID)
                    myreturn = urllib2.urlopen(mystr).read()
                    tree = etree.fromstring(myreturn)
                    csid_el = tree.find('{http://www.chemspider.com/}int')
                    if csid_el is None:
                        continue
                    colcsid.setValueFromString(r, csid_el.text)

                    # Second request: fetch the compound record and read its SMILES
                    info_url = "http://www.chemspider.com/Search.asmx/GetCompoundInfo?CSID=%s&token=%s" % (csid_el.text, token)
                    info_response = urllib2.urlopen(info_url)
                    tree = etree.parse(info_response)
                    smiles = tree.getroot().find('{http://www.chemspider.com/}SMILES').text
                    #colsmi = vtable.findColumnWithName('SMILES', 1)
                    colsmi.setValueFromString(r, smiles)
                except urllib2.HTTPError:
                    #fails.append(drugName)
                    continue
                except urllib2.URLError:
                    #fails.append(drugName)
                    continue
                except httplib.HTTPException:
                    #fails.append(drugName)
                    continue

            vtable.fireTableStructureChanged()
The Vortex script can be downloaded here
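The part of the script that is easiest to get wrong is the namespace-qualified `find()` call. The sketch below is standalone Python and shows how that lookup behaves. The XML strings are hand-written samples, not captured responses: the root element name and the ID value are assumptions for illustration, and only the namespace and the `int` element come from the script. The helper name `first_csid` is also my own.

```python
from xml.etree import ElementTree as etree

# Namespace prefix used in the script's find() calls.
NS = "{http://www.chemspider.com/}"

# Hand-written samples for illustration only. The root element name
# and the ID value are assumptions; the script relies only on a
# namespaced <int> child holding the ChemSpider ID.
SAMPLE_HIT = (
    '<ArrayOfInt xmlns="http://www.chemspider.com/">'
    "<int>12345</int>"
    "</ArrayOfInt>"
)
SAMPLE_MISS = '<ArrayOfInt xmlns="http://www.chemspider.com/" />'

def first_csid(xml_text):
    """Return the text of the first namespaced <int> element, or None."""
    root = etree.fromstring(xml_text)
    element = root.find(NS + "int")
    # Compare with None explicitly: an element with no children is falsy.
    return None if element is None else element.text

print(first_csid(SAMPLE_HIT))   # 12345
print(first_csid(SAMPLE_MISS))  # None
```

Searching for plain `int` without the namespace prefix would find nothing in either sample, which is why the script spells out the full `{http://www.chemspider.com/}int` tag.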
Comparison of scripts
As you can see from the table below, OPSIN was by far the fastest. It ran through the workspace of nearly 14,000 structures in 5 seconds; however, there were just over 1,500 names for which structures could not be assigned. ChemSpider identified the majority of structures from the names but was considerably slower. The Chemical Identifier Resolver left just under 1,000 structures unresolved and was the slowest of the three methods. It should be noted, however, that the performance of the web services will depend on network traffic and the load on the web server.
| | OPSIN | CIR | ChemSpider |
|---|---|---|---|
| Time | 5 sec | 4 h 30 min | 1 h 50 min |
| Unresolved | 1571 | 939 | 161 |
I found that, depending on the time of day, the web servers sometimes stopped responding on long runs. If you are going to be looking up more than a few thousand names I’d recommend that you split the job into chunks to avoid overloading the web server.
Change
    for r in range(0, int(rows)):

to

    for r in range(0, 4000):

then

    for r in range(4000, 8000):
etc.
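If you would rather not work out the boundaries by hand, a small helper can generate them. This is a minimal sketch; the function name `chunk_ranges` is my own and the row count of 14,000 is just a round figure for illustration.

```python
def chunk_ranges(total_rows, chunk_size):
    """Return (start, stop) pairs that cover range(total_rows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]

print(chunk_ranges(14000, 4000))
# [(0, 4000), (4000, 8000), (8000, 12000), (12000, 14000)]
```

Each pair can then be used as `for r in range(start, stop):` in place of the hard-coded ranges, one run per chunk.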
It is important to note that OPSIN converts the chemical name to a structure, whilst CIR and ChemSpider are lookup services. So whilst OPSIN will be able to convert the chemical names of any novel molecules to structures, the lookup services will only be able to provide structures that exist in their databases. However, the databases also contain synonyms, trivial names and trade names that could be used to identify a molecule; OPSIN was not designed to use these as input.
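Because the two approaches fail on different kinds of names, one option is to try the name parser first and fall back to a lookup service only for the names it cannot handle. The sketch below shows that control flow under stated assumptions: `resolve_name` is a hypothetical helper, each resolver is assumed to return None when it fails, and the two resolvers here are dictionary stand-ins rather than real calls to OPSIN or a web service.

```python
def resolve_name(name, resolvers):
    """Try each (label, function) pair in order.

    Each function takes a name and returns a SMILES string, or None
    if it cannot resolve the name. Returns (label, smiles) for the
    first success, or (None, None) if every resolver fails.
    """
    for label, resolve in resolvers:
        smiles = resolve(name)
        if smiles:
            return label, smiles
    return None, None

# Stand-in resolvers backed by tiny dictionaries, for illustration only.
# In a real script these would wrap the OPSIN call and a CIR or
# ChemSpider request.
parser_stub = {"iodobenzene": "Ic1ccccc1"}.get
lookup_stub = {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}.get

resolvers = [("parser", parser_stub), ("lookup", lookup_stub)]

print(resolve_name("iodobenzene", resolvers))      # ('parser', 'Ic1ccccc1')
print(resolve_name("aspirin", resolvers))          # ('lookup', 'CC(=O)Oc1ccccc1C(=O)O')
print(resolve_name("not a real name", resolvers))  # (None, None)
```

Putting the fast local parser first also keeps the number of web requests down, which matters given the run times in the table above.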
Page Updated 12 January 2016