The RCSB Protein Data Bank is an invaluable resource that archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies, helping scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. The PDB currently contains over 134,000 data files holding structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. The RCSB also provides a series of tools to search, view and analyse the data.

The RCSB PDB RESTful Web Service interface

These web services provide programmatic access to the data. The RESTful interface offers two types of service:

  • Search services: to return a list of IDs (e.g., PDB IDs, chain IDs, ligand IDs)
  • Fetch services: to return data given an ID (e.g., reports, descriptions, data items)

Sometimes I have a list of Uniprot accession IDs and want to find out whether there is any structural information for them in the PDB. I could search for each Uniprot ID individually using the PDB search tools, but with more than a couple to look up it is better to use a script. I use Vortex as a flexible desktop tool to search and store information from a variety of sources, and its scripting interface is very powerful. The PDB search web service exposes the RCSB PDB advanced search interface as an XML web service. To use this service, we need to POST an XML representation of an advanced search to the service URL.
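Outside Vortex, the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint below is a placeholder (the real URL is not shown above), the query type name and the content type are what I recall from the legacy RCSB search service and may not match a current service, and the function names are mine.

```python
import urllib.request
from xml.sax.saxutils import escape

# Placeholder only: substitute the real search endpoint before use.
SEARCH_URL = "https://example.org/pdb/rest/search"

# Assumed query type for a Uniprot accession search (legacy service naming).
QUERY_TYPE = "org.pdb.query.simple.UpAccessionIdQuery"


def build_query(accession: str) -> str:
    """Return the XML body for a search on one Uniprot accession ID."""
    acc = escape(accession.strip())  # guard against stray XML characters
    return (
        "<orgPdbQuery>"
        f"<queryType>{QUERY_TYPE}</queryType>"
        f"<accessionIdList>{acc}</accessionIdList>"
        "</orgPdbQuery>"
    )


def post_query(xml_body: str, url: str = SEARCH_URL, timeout: float = 30.0) -> str:
    """POST the XML body and return the raw response text."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        # Assumed content type; adjust to whatever the service expects.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping the query builder separate from the network call makes it easy to inspect the XML before anything is sent.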

We need a list of Uniprot codes.


First read the Uniprot codes into Vortex.

The script first opens a dialog box asking the user to select the column that contains the Uniprot ID. It then creates two new columns: one for the number of PDB entity IDs associated with the Uniprot ID, and one for a list of all the PDB entity IDs returned. Note that there is no 1:1 correspondence between Uniprot and PDB IDs. A single Uniprot ID may be associated with multiple crystal structures; these might be the same structure at different resolutions, structures containing different ligands, or the protein without a ligand. In addition, a single PDB file can be associated with multiple Uniprot IDs if it contains several different protein chains.

The next part of the script works through the table row by row, taking the Uniprot ID, creating the XML search query and POSTing it to the web service. The number of entries in the returned string is counted and the two columns are filled in.
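As a rough Python equivalent of this loop, the sketch below fills a count column and a list column for each row. It assumes the service returns one entity ID per line (for example "4HHB:1"), which is my reading of the legacy response format rather than something confirmed here; the column names and the `fetch` callable are hypothetical.

```python
def parse_entity_ids(response_text: str) -> list[str]:
    """Split the returned string into entity IDs, skipping blank lines."""
    return [line.strip() for line in response_text.splitlines() if line.strip()]


def annotate_rows(rows, fetch, id_column="Uniprot"):
    """Add a count column and a list column to each row (a dict).

    `fetch` is any callable that takes a Uniprot ID and returns the raw
    response text, e.g. a function that builds and POSTs the XML query.
    """
    for row in rows:
        ids = parse_entity_ids(fetch(row[id_column]))
        row["PDB count"] = len(ids)
        row["PDB entities"] = ",".join(ids)
    return rows


# Usage with a canned response instead of a live web service call.
canned = {"P69905": "4HHB:1\n1A3N:1\n"}
table = annotate_rows([{"Uniprot": "P69905"}], lambda acc: canned.get(acc, ""))
```

Passing the fetch step in as a function keeps the table logic testable without network access.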

The result is the table shown below. This is a nice summary but not that useful if you want to do further analysis.

The next part of the script pivots the results to a single PDB entity ID per row, as shown. The advantage of this format is we can now search and store information related to an individual PDB entity.
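The pivot itself is simple to express in Python. The sketch below assumes each row is a dict with a comma-separated list column, as in the summary table described above; the column names are hypothetical.

```python
def pivot_rows(rows, id_column="Uniprot", list_column="PDB entities", sep=","):
    """Expand each row into one row per PDB entity ID."""
    pivoted = []
    for row in rows:
        for entity in row[list_column].split(sep):
            entity = entity.strip()
            if entity:  # rows with no hits produce no output rows
                pivoted.append({id_column: row[id_column], "PDB entity": entity})
    return pivoted


# Usage: one summary row becomes two rows, one per entity.
summary = [{"Uniprot": "P69905", "PDB entities": "4HHB:1,1A3N:1"}]
long_table = pivot_rows(summary)
```

Rows with an empty list are dropped here; keeping them with a blank entity would be an equally reasonable choice if you want to see which Uniprot IDs had no structures.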

The UniprotPDBmapping Vortex Script

Getting More Information from PDB

With a table containing PDB entity IDs we can now mine the PDB for more information. The web service takes a simple string as input

and returns a description of the entry in XML format, detailing each entity in the PDB file.

  <molDescription>
    <structureId id="4HHB">
      <polymer entityNr="1" length="141" type="protein" weight="15150.4">
        <chain id="A"/>
        <chain id="C"/>
        <Taxonomy name="Homo sapiens" id="9606"/>
        <macroMolecule name="Hemoglobin subunit alpha">
          <accession id="P69905"/>
        </macroMolecule>
        <polymerDescription description="HEMOGLOBIN (DEOXY) (ALPHA CHAIN)"/>
      </polymer>
      <polymer entityNr="2" length="146" type="protein" weight="15890.2">
        <chain id="B"/>
        <chain id="D"/>
        <Taxonomy name="Homo sapiens" id="9606"/>
        <macroMolecule name="Hemoglobin subunit beta">
          <accession id="P68871"/>
        </macroMolecule>
        <polymerDescription description="HEMOGLOBIN (DEOXY) (BETA CHAIN)"/>
      </polymer>
    </structureId>
  </molDescription>

The Vortex script first asks the user to select the PDB column, then for each row in the table generates the query string and runs the query. The returned XML is then parsed to extract some of the data, and the appropriate columns are generated for each entity in the returned XML. As can be seen in the image below, some entries contain a single protein, others contain multiple proteins. The script pulls out the name of the entity, its type (e.g. protein), the number of amino acids and the Uniprot ID, which could differ from the original query if the PDB entry contains multiple proteins.
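For readers without Vortex, the parsing step can be sketched with Python's standard `xml.etree.ElementTree`. The element and attribute names follow the 4HHB response shown earlier; the function name and the output field names are my own, and the sample is trimmed to a single entity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<molDescription>
  <structureId id="4HHB">
    <polymer entityNr="1" length="141" type="protein" weight="15150.4">
      <chain id="A"/>
      <chain id="C"/>
      <macroMolecule name="Hemoglobin subunit alpha">
        <accession id="P69905"/>
      </macroMolecule>
    </polymer>
  </structureId>
</molDescription>
"""


def parse_mol_description(xml_text: str) -> list[dict]:
    """Return one record per polymer entity in the description XML."""
    root = ET.fromstring(xml_text)
    records = []
    for structure in root.iter("structureId"):
        for polymer in structure.iter("polymer"):
            macro = polymer.find("macroMolecule")
            length = polymer.get("length")
            records.append({
                "pdb_id": structure.get("id"),
                "entity": polymer.get("entityNr"),
                "type": polymer.get("type"),
                "length": int(length) if length else None,
                "name": macro.get("name") if macro is not None else None,
                # An entity may carry zero, one or several accessions.
                "accessions": [a.get("id") for a in polymer.iter("accession")],
                "chains": [c.get("id") for c in polymer.findall("chain")],
            })
    return records


records = parse_mol_description(SAMPLE)
```

Each record maps onto one row of columns in the table, so an entry with many entities simply yields many records.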

1Z7Q is the crystal structure of the 20S proteasome from yeast in complex with the proteasome activator PA26 from Trypanosoma brucei at 3.2 angstroms resolution; it contains 15 different protein chains. It can be viewed here

The PDBinfo Vortex script

The scripts can be downloaded here


If you also want to download the PDB files, there are a few scripting options here: Downloading PDB

Last Updated 9 November 2017
