Vortex script to determine the Amino Acids in a collection of peptides

I’ve recently become interested in comparing the amino-acid composition of peptides, for example cyclic versus linear peptides, or brain-penetrant versus non-penetrant ones. I had a look around but could not find any tools that did this; in particular, I wanted to include non-proteinogenic amino acids. This would include natural amino acids that are not normally incorporated into peptides, as well as the many synthetic amino acids that have been published in the literature.

Compiling a list of Amino-Acids

Whilst sites like SwissSidechain have a database of several hundred amino-acid structures for download, a quick inspection suggests it lacks most of the synthetic amino acids that have been published. Fortunately, with the advent of HELM notation, ChEMBL have compiled a list of monomers generated by fragmenting all ChEMBL peptides that contain at least three amino acids.

For the most common unnatural amino acids, we’ve used peptide vendor catalogs to derive an ID and name. Additionally, in most cases where those amino acids are capped and/or substituted at the side-chain, the monomer ID has been prefixed/suffixed with the cap name and/or extended with the information about the side-chain substitution in parentheses. As an example, the monomer ‘methyl 4-Chloro-L-phenylalanine’ can be identified by the monomer ID ‘Me_Phe(4-Cl)’.

The file, chembl23monomer_library.xml, can be downloaded from the FTP site at ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/ and contains nearly 3000 amino acids.

This file is in XML format as shown below

I needed the SMILES string and the ID. I tried opening the XML file in a couple of applications, to no avail, so instead I created a very basic Jupyter Notebook. Once we have defined the “root”, we can use it to navigate to an element in the tree, for example to get the element containing the SMILES.

We can isolate just the SMILES string by taking the text content of that element.
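A minimal sketch of this navigation, using Python's standard `xml.etree.ElementTree`. The tag names (`PolymerList`, `Polymer`, `Monomer`, `MonomerID`, `MonomerSmiles`) and the tiny inline document are assumptions made for illustration; the real monomer library may use different names or an XML namespace, so check the file before relying on these paths.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the monomer library. The tag names and layout
# are assumptions, not taken from the real ChEMBL file.
SAMPLE_XML = """<MonomerDB>
  <PolymerList>
    <Polymer polymerType="PEPTIDE">
      <Monomer>
        <MonomerID>A</MonomerID>
        <MonomerSmiles>C[C@H](N)C(=O)O</MonomerSmiles>
      </Monomer>
      <Monomer>
        <MonomerID>L</MonomerID>
        <MonomerSmiles>CC(C)C[C@H](N)C(=O)O</MonomerSmiles>
      </Monomer>
    </Polymer>
  </PolymerList>
</MonomerDB>"""

# For a file on disk, ET.parse(path).getroot() gives the same kind of object.
root = ET.fromstring(SAMPLE_XML)

# Navigate from the root to the first monomer, then to its SMILES element.
first_monomer = root.find("./PolymerList/Polymer/Monomer")
smiles_element = first_monomer.find("MonomerSmiles")

# .text isolates just the string held inside the element.
first_smiles = smiles_element.text
first_id = first_monomer.find("MonomerID").text

print(first_id, first_smiles)
```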

One issue is that a substructure search using the SMILES string for Alanine would also flag other amino acids that contain the alanine substructure such as Leucine or Lysine as highlighted below.

The simplest way to avoid this is to add explicit hydrogens using RDKit. First we convert the SMILES string to an RDKit molecular object, then add hydrogens, then convert back to SMILES.
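The three steps can be sketched with RDKit as follows. The two SMILES strings are free alanine and leucine, chosen here purely for illustration, and the substructure checks are run in RDKit rather than Vortex, so this shows the idea and not necessarily how Vortex treats explicit hydrogens. It also does not address how a residue inside a peptide chain differs from the free amino acid.

```python
from rdkit import Chem

# Free amino acids, used only as an illustration.
alanine = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
leucine = Chem.MolFromSmiles("CC(C)C[C@H](N)C(=O)O")

# With implicit hydrogens, alanine is found inside leucine, because the
# alanine methyl carbon can map onto the leucine beta carbon.
implicit_hit = leucine.HasSubstructMatch(alanine)

# Step 1 was SMILES -> molecule (above); step 2 adds explicit hydrogens.
alanine_h = Chem.AddHs(alanine)
leucine_h = Chem.AddHs(leucine)

# Step 3 converts back to a SMILES string that now carries [H] atoms.
explicit_smiles = Chem.MolToSmiles(alanine_h)

# With explicit hydrogens, the methyl group needs three hydrogen neighbours,
# which the leucine beta carbon does not have.
explicit_hit = leucine_h.HasSubstructMatch(alanine_h)
self_hit = alanine_h.HasSubstructMatch(alanine_h)

print(explicit_smiles)
print(implicit_hit, explicit_hit, self_hit)
```

The target molecule also needs explicit hydrogens for the comparison to work in RDKit, which is why `AddHs` is applied to leucine as well.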

The result is shown below

We can now loop through the XML file, extracting the SMILES and the ID, adding explicit hydrogens to the SMILES string and then creating a list of all Explicit SMILES and associated ID. This list can then be exported to a file.
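One way this loop might look is sketched below. The tag names, the inline sample document, the tab-separated layout and the output file name `explicit_smiles.tsv` are all assumptions for illustration; the notebook linked from this post is the actual implementation.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

from rdkit import Chem

# Illustrative stand-in for the monomer library (tag names are assumptions).
SAMPLE_XML = """<MonomerDB>
  <Monomer><MonomerID>A</MonomerID><MonomerSmiles>C[C@H](N)C(=O)O</MonomerSmiles></Monomer>
  <Monomer><MonomerID>L</MonomerID><MonomerSmiles>CC(C)C[C@H](N)C(=O)O</MonomerSmiles></Monomer>
  <Monomer><MonomerID>Empty</MonomerID><MonomerSmiles></MonomerSmiles></Monomer>
</MonomerDB>"""


def explicit_h_rows(root):
    """Return (explicit-hydrogen SMILES, monomer ID) pairs for each monomer."""
    rows = []
    for monomer in root.iter("Monomer"):
        monomer_id = monomer.findtext("MonomerID")
        smiles = monomer.findtext("MonomerSmiles")
        if not monomer_id or not smiles:
            continue  # skip entries with a missing ID or SMILES
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip SMILES that RDKit cannot parse
        rows.append((Chem.MolToSmiles(Chem.AddHs(mol)), monomer_id))
    return rows


def write_rows(rows, path):
    """Export the list as tab-separated text, one monomer per line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)


root = ET.fromstring(SAMPLE_XML)
rows = explicit_h_rows(root)

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "explicit_smiles.tsv")
write_rows(rows, out_path)
```

Skipping unparseable entries rather than stopping is a design choice for this sketch; with nearly 3000 monomers it may be worth logging the skipped IDs.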

The file was then converted to the format needed by Vortex by editing it in BBEdit, and then saved as SMARTS.txt.

Substructure searching in Vortex

In the tutorial Scripting Vortex 37 the script flags the presence (or absence) of a variety of functional groups by matching SMARTS strings, to provide categorisation of potential reagents/starting materials for reaction workflows. In this script we will use the same strategy, using the amino-acid SMILES as queries and writing a flag for the presence (or absence) of each of the amino acids for all the peptides in the workspace.

The first part of the script sets up the search to use multiple processors; we then read in the SMARTS patterns from the SMARTS.txt file. The script then generates the SMILES strings for the peptides in the workspace if none are present. It then runs the SMARTS matching in parallel, creating a new column in the workspace for each amino acid.

Once the substructure searching is complete the next part of the script generates a new workspace with a count of the number of peptides that contain each amino acid.
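The counting step can be illustrated in plain Python. This is not the Vortex API: the `flags` dictionary below is a hypothetical stand-in for the workspace columns created by the search, with made-up peptide names and a 1/0 flag per amino acid.

```python
from collections import Counter

# Hypothetical flag table: peptide -> {amino-acid ID: 1 if present, else 0}.
flags = {
    "peptide_1": {"A": 1, "L": 0, "Me_Phe(4-Cl)": 1},
    "peptide_2": {"A": 1, "L": 1, "Me_Phe(4-Cl)": 0},
    "peptide_3": {"A": 0, "L": 1, "Me_Phe(4-Cl)": 0},
}


def count_peptides_per_amino_acid(flag_table):
    """Count how many peptides carry a positive flag for each amino acid."""
    counts = Counter()
    for row in flag_table.values():
        for amino_acid, present in row.items():
            # Adding 0 keeps amino acids that never occur in the result.
            counts[amino_acid] += 1 if present else 0
    return counts


counts = count_peptides_per_amino_acid(flags)
for amino_acid, n_peptides in counts.most_common():
    print(amino_acid, n_peptides)
```

Because each flag records presence rather than the number of occurrences, the totals are peptide counts, not residue counts.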

The script runs remarkably quickly: for a dataset of nearly 9000 peptides searched for around 3000 amino acids, the whole process took around 4 minutes.

The Vortex Script

The Jupyter Notebook, SMARTS.txt and Vortex script can be downloaded here

Last updated 29 August 2019
