Getting Ligand ID for multiple PDB files

The Protein Data Bank (https://www.rcsb.org) is an invaluable repository of 3D biomolecular structures. At the time of writing the database contains 214,791 structures (X-ray, Cryo-EM and NMR) and over 1 million computed structure models. Many of these structures have ligands bound to active sites; these can be co-factors, inhibitors or native ligands. In addition, there are of course inorganic salts, metal ions, solvents, buffers and detergents used in the experimental process that generated the structure.

Sometimes it is useful to get an idea of the type of ligands that bind to a particular group of proteins. Whilst you can download all the PDB files and inspect them manually, it is also possible to do this programmatically.

PDBe is a founding member of the Worldwide Protein Data Bank (wwPDB), which collects, organises and disseminates data on biological macromolecular structures. The other wwPDB partners are RCSB PDB, PDBj, BMRB and EMDB. The PDBe API (https://www.ebi.ac.uk/pdbe/pdbe-services) is just one of the services provided to access PDB information.

In this case we will be using the REST calls based on PDB entry data to access information about a particular PDB entry. In particular, we use the ligands call, which provides a list of modelled instances of ligands, i.e. ‘bound’ molecules that are not waters.

The call is very simple, as shown below, where 4CTJ is the PDB ID.

https://www.ebi.ac.uk/pdbe/api/pdb/entry/ligand_monomers/4CTJ

The data is returned in JSON format.

{"4ctj":[
{"chain_id":"A","author_residue_number":1263,"author_insertion_code":"","chem_comp_id":"SAM","alternate_conformers":0,"entity_id":2,"struct_asym_id":"C","residue_number":1,"chem_comp_name":"S-ADENOSYLMETHIONINE","weight":398.437,"carbohydrate_polymer":false,"branch_name":""},
{"chain_id":"C","author_residue_number":1263,"author_insertion_code":"","chem_comp_id":"SAM","alternate_conformers":0,"entity_id":2,"struct_asym_id":"F","residue_number":1,"chem_comp_name":"S-ADENOSYLMETHIONINE","weight":398.437,"carbohydrate_polymer":false,"branch_name":""},
{"chain_id":"A","author_residue_number":1264,"author_insertion_code":"","chem_comp_id":"3A9","alternate_conformers":0,"entity_id":3,"struct_asym_id":"D","residue_number":1,"chem_comp_name":"2,3-dihydro-1-benzofuran-5-carboxylic acid","weight":164.158,"carbohydrate_polymer":false,"branch_name":""},
{"chain_id":"A","author_residue_number":1265,"author_insertion_code":"","chem_comp_id":"NA","alternate_conformers":0,"entity_id":4,"struct_asym_id":"E","residue_number":1,"chem_comp_name":"SODIUM ION","weight":22.99,"carbohydrate_polymer":false,"branch_name":""}
]}

Here the ligand ID is the value of chem_comp_id, for example "chem_comp_id":"SAM". In this example there are actually three distinct ligands: SAM, 3A9 and NA. Not all are particularly interesting, and in most cases we would only want the "drug-like" ligands.

This Jupyter notebook takes a list of PDB IDs and uses the PDBe API to get the ligand IDs for each PDB entry, then removes the less interesting ligands and finally exports the list to a CSV file.
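As a minimal sketch of how the ligand IDs can be pulled out of such a response, the snippet below parses a trimmed copy of the 4CTJ output and collects the distinct chem_comp_id values. Only three fields per record are kept for brevity, and the variable names (response_text, ligand_ids) are illustrative rather than taken from the notebook.

```python
import json

# Trimmed copy of the 4CTJ response shown above (most fields omitted for brevity)
response_text = '''
{"4ctj": [
  {"chain_id": "A", "chem_comp_id": "SAM", "chem_comp_name": "S-ADENOSYLMETHIONINE"},
  {"chain_id": "C", "chem_comp_id": "SAM", "chem_comp_name": "S-ADENOSYLMETHIONINE"},
  {"chain_id": "A", "chem_comp_id": "3A9", "chem_comp_name": "2,3-dihydro-1-benzofuran-5-carboxylic acid"},
  {"chain_id": "A", "chem_comp_id": "NA", "chem_comp_name": "SODIUM ION"}
]}
'''

data = json.loads(response_text)

# The top-level key is the lower-case PDB ID; each record is one modelled ligand instance.
# A set comprehension collapses the two SAM instances into a single entry.
ligand_ids = sorted({record["chem_comp_id"] for record in data["4ctj"]})
print(ligand_ids)  # ['3A9', 'NA', 'SAM']
```

Note that SAM appears twice in the response (once per chain) but only once in the result.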

The first cell simply imports the necessary libraries.

# Use PDB ID to identify Ligands
# Authored by Chris Swain (http://www.macinchem.org)
# Copyright CC-BY

import csv
import os
import sys
import json
import requests

from urllib.request import urlretrieve

The next step is to identify the file containing the PDB IDs and import them (you will need to edit the path to the file). Depending on the format of the file, you will need to split the entries using newline "\n" or comma "," as needed. Alternatively, if there are only a few PDB IDs, you might want to simply hard code them by editing this line: pdb_codes = ['4ctj','3EVB']. Note the input is not case sensitive.

# You may want to edit these parameters

# File containing the desired PDB IDs, one per line
pdb_codes_file = '/Users/username/Desktop/PDBids.txt'

# Read the PDB IDs from the input file
with open(pdb_codes_file) as f:
    # Change to .split(',') if PDB IDs are in csv
    pdb_codes = f.read().split('\n')

# Alternatively, hard code the PDB IDs:
#pdb_codes = ['4ctj','3EVB']

# Drop surrounding whitespace and empty entries (e.g. from a trailing newline)
pdb_codes = [code.strip() for code in pdb_codes if code.strip()]

#For testing
#print(pdb_codes)

If you want to check the input you can uncomment the print(pdb_codes).
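If the format of the input file is not known in advance, one option is to split on commas and whitespace in a single step. The helper below is only a sketch of that idea; parse_pdb_ids is a hypothetical name and is not part of the notebook.

```python
import re

def parse_pdb_ids(text):
    """Split text on commas and/or whitespace (including newlines), dropping empty entries."""
    return [token for token in re.split(r"[,\s]+", text) if token]

# Works for one ID per line, including a trailing newline
print(parse_pdb_ids("4ctj\n3EVB\n"))      # ['4ctj', '3EVB']

# Works for comma-separated IDs, with or without spaces
print(parse_pdb_ids("4ctj, 3EVB,8APX"))   # ['4ctj', '3EVB', '8APX']
```

Filtering out empty strings matters because a trailing newline would otherwise produce an empty PDB ID, which cannot be looked up.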

The next cell actually uses the PDBe API. The boring_list is simply a set of the less interesting ligands; you can modify this to suit your own needs. For each PDB ID we generate the URL to access the PDBe API, then send a GET request. The returned JSON is then parsed to extract the required ligand IDs. For each record the ligand IDs are stored in a Python set. One advantage is that sets cannot hold two items with the same value, which takes care of instances where a PDB entry may contain multiple copies of the same ligand, for example inorganic ions or buffers. We can then simply remove the boring ligands using mylist - boring_list.

# Example call: https://www.ebi.ac.uk/pdbe/api/pdb/entry/ligand_monomers/4CTJ
boring_list = {'SO4','CL','UNX','ZN','PO4','EDO','U','NA','MG','IOD'}
final_list = []
for pdb_code in pdb_codes:
    pdbURL = 'https://www.ebi.ac.uk/pdbe/api/pdb/entry/ligand_monomers/%s' % pdb_code
    ligandData = requests.get(pdbURL)

    j = json.loads(ligandData.text)
    forJSON = pdb_code.lower() #output has lower case PDBid
    mylist = set()
    # .get() gives an empty list if the entry is missing from the response
    for b in j.get(forJSON, []):
        mylist.add(b["chem_comp_id"])
    # Remove the less interesting ligands once all IDs are collected
    mylist = mylist - boring_list

    mylist = sorted(mylist)
    # Keep the PDB ID out of the sort so it is always the first column
    final_list.append([pdb_code] + mylist)

Finally we convert the set to a sorted list using mylist = sorted(mylist) and append it to the final list.
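The set operations described above can be illustrated without calling the API. The sketch below uses a hand-written list of ligand IDs (the found list is made up for illustration, based on the 4CTJ example) and keeps the PDB ID outside the set so that it always ends up first in the row; that ordering is a choice made for this sketch.

```python
boring_list = {'SO4', 'CL', 'UNX', 'ZN', 'PO4', 'EDO', 'U', 'NA', 'MG', 'IOD'}

# Ligand IDs as they might be read from one entry, including a repeat
found = ['SAM', 'SAM', '3A9', 'NA']

unique = set(found)                   # duplicates collapse to one item each
interesting = unique - boring_list    # set difference drops 'NA'
row = ['4ctj'] + sorted(interesting)  # PDB ID first, then the ligand IDs in sorted order
print(row)  # ['4ctj', '3A9', 'SAM']
```

Sorting only the ligand IDs avoids a subtle ordering issue: both PDB IDs and ligand IDs can start with a digit, so sorting them together could place a ligand such as 3A9 before the PDB ID 4ctj.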

The results are then exported to a csv file.

with open('PDBis_Ligands.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(final_list)

The results look like this: the original PDB ID followed by a list of ligands.

8APX,SAH
8B52,GOL,SAH,SAM
8BCR,AT9,SAH,SAM
8GYB,SAH
8GZR,CDP,MN,SAH
8JCE,SAH,YG4
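Because each row can have a different number of ligands, the file is easiest to read back row by row rather than as a fixed-width table. As a rough sketch, the snippet below reads a few of the rows shown above into a dictionary keyed by PDB ID. It uses io.StringIO as a stand-in for the exported file, and the name ligands_by_entry is illustrative.

```python
import csv
import io

# Stand-in for open('PDBis_Ligands.csv', newline=''); rows copied from the example output
csv_text = "8APX,SAH\n8B52,GOL,SAH,SAM\n8GZR,CDP,MN,SAH\n"

ligands_by_entry = {}
for row in csv.reader(io.StringIO(csv_text)):
    if row:  # skip any blank lines
        # First column is the PDB ID, the remaining columns are ligand IDs
        ligands_by_entry[row[0]] = row[1:]

print(ligands_by_entry["8B52"])  # ['GOL', 'SAH', 'SAM']
```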

You can download a copy of the Jupyter Notebook here: Get-LigandID-of-PDB.ipynb
