The RCSB Protein Data Bank (PDB) is an invaluable resource that archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies, helping scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. The RCSB also provides a series of tools to search, view and analyse the data.
The RCSB PDB RESTful Web Service interface
These web services provide programmatic access to the data. The RESTful interface offers two types of service:
- Search services: to return a list of IDs (e.g., PDB IDs, chain IDs, ligand IDs)
- Fetch services: to return data given an ID (e.g., reports, descriptions, data items)
Sometimes I have a list of Uniprot accession IDs and want to find out whether there is any structural information in the PDB. I could search for each Uniprot ID individually using the PDB search tools, but with more than a couple to look up it is better to use a script. I use Vortex as a flexible desktop tool to search and store information from a variety of sources, and its scripting interface is very powerful. The PDB search web service exposes the RCSB PDB advanced search interface as an XML web service. To use this service, we need to POST an XML representation of an advanced search to:
    http://www.rcsb.org/pdb/rest/search
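As a rough illustration of the POST step, the sketch below builds the XML query for one accession ID and wraps it in a request object without sending it. It is written for plain Python 3 (`urllib.request`), not the Jython environment that Vortex provides, and the helper names `build_uniprot_query` and `build_search_request` are hypothetical; only the URL and the `UpAccessionIdQuery` query type come from the text.

```python
import urllib.request
from xml.sax.saxutils import escape

# Endpoint quoted in the text above
SEARCH_URL = "http://www.rcsb.org/pdb/rest/search"


def build_uniprot_query(accession):
    """Return the XML body for a single Uniprot accession ID (hypothetical helper)."""
    safe = escape(accession)  # guard against stray &, < or > in the input
    return (
        "<orgPdbQuery>"
        "<queryType>org.pdb.query.simple.UpAccessionIdQuery</queryType>"
        "<description>Simple query for UniprotKB accession ID %s</description>"
        "<accessionIdList>%s</accessionIdList>"
        "</orgPdbQuery>"
    ) % (safe, safe)


def build_search_request(accession):
    """Wrap the XML in a request object; supplying data makes urllib use POST."""
    body = build_uniprot_query(accession).encode("utf-8")
    return urllib.request.Request(SEARCH_URL, data=body)


request = build_search_request("P00533")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it; that call is left out here so the sketch runs without network access.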
We need a list of Uniprot codes:
P50225
Q70CQ3
A0A024QYR8
P00533
A0A023T6R1
P00519
First read the Uniprot codes into Vortex.
The script first opens a dialog box asking the user to select the column containing the Uniprot IDs. It then creates two new columns: one to hold the number of PDB entity IDs associated with the Uniprot ID, and a second to hold a list of all the PDB entity IDs that are returned. It is important to note that there is not a 1:1 correspondence between Uniprot and PDB IDs. A single Uniprot ID may be associated with multiple crystal structures; these might be the same structure at different resolutions, structures containing different ligands, or even the protein without a ligand. In addition, a single PDB file can be associated with multiple Uniprot IDs if it contains several different protein chains.
The next part of the script works through the table row by row, selecting the Uniprot ID, creating the XML search query and POSTing it to the web service. The number of entries in the returned string is determined and the two columns are completed.
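The counting step can be sketched on its own, assuming the service returns one entity ID per line in the form `1LS6:1` (this format is inferred from the way the scripts count `:` characters and later strip a `:1` suffix). The function name `parse_search_result` and the sample response are illustrative; the IDs are taken from the comment in the mapping script. The sketch is plain Python 3 and does not touch the Vortex table.

```python
def parse_search_result(text):
    """Split the raw response into entity IDs, ignoring lines without a ':'."""
    return [line.strip() for line in text.splitlines() if ":" in line]


# Illustrative response body, not captured from the live service
sample_response = "1LS6:1\n1Z28:1\n2D06:1\n"

entity_ids = parse_search_result(sample_response)
num_pdb = len(entity_ids)            # value for the 'Num PDB' column
pdb_list = ",".join(entity_ids)      # value for the 'PDB' column

print(num_pdb, pdb_list)
```

Filtering on `:` means that an empty response, or any line that is not an entity ID, contributes nothing to the count.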
The result is the table shown below. This is a nice summary but not that useful if you want to do further analysis.
The next part of the script pivots the results to a single PDB entity ID per row, as shown. The advantage of this format is we can now search and store information related to an individual PDB entity.
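A minimal sketch of the pivot, in plain Python 3 and independent of Vortex: each (Uniprot ID, list of entity IDs) pair is expanded into one row per entity ID. The name `pivot_mapping` and the sample data are assumptions made for illustration; `EXAMPLE_NO_HITS` is a placeholder rather than a real accession.

```python
def pivot_mapping(results):
    """Expand (uniprot_id, [entity_ids]) pairs into [uniprot_id, entity_id] rows."""
    rows = []
    for uniprot_id, entity_ids in results:
        for entity_id in entity_ids:
            rows.append([uniprot_id, entity_id])
    return rows


# Illustrative input: one accession with three hits, one placeholder with none
sample_results = [
    ("P50225", ["1LS6:1", "1Z28:1", "2D06:1"]),
    ("EXAMPLE_NO_HITS", []),
]

pivoted = pivot_mapping(sample_results)
for row in pivoted:
    print(row)
```

In this sketch an accession with no hits produces no rows, so the pivoted table only contains IDs that can be looked up further.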
The UniprotPDBmapping Vortex Script
    # Use Uniprot accession IDs to find PDB structures

    # Python imports
    import urllib2
    import urllib

    # Vortex imports
    import com.dotmatics.vortex.util.Util as Util
    import com.dotmatics.vortex.mol2img.jni.genImage as genImage
    import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
    import jarray
    import binascii
    import string
    import os

    # Ask the user which column holds the Uniprot IDs
    input_label = swing.JLabel("Uniprot column (for input)")
    input_cb = workspace.getColumnComboBox()
    panel = swing.JPanel()
    layout.fill(panel, input_label, 0, 0)
    layout.fill(panel, input_cb, 1, 0)

    ret = vortex.showInDialog(panel, "Choose Uniprot column")

    if ret == vortex.OK:
        input_idx = input_cb.getSelectedIndex()
        if input_idx == 0:
            vortex.alert("you must choose a column")
        else:
            # Only continue once a column has been chosen, so col is always defined
            col = vtable.getColumn(input_idx - 1)
            url = 'http://www.rcsb.org/pdb/rest/search'
            colpdbNo = vtable.findColumnWithName('Num PDB', 1)  # number of PDB structures
            colpdbid = vtable.findColumnWithName('PDB', 1)      # list of PDB entity IDs
            rowdata = []
            rows = vtable.getRealRowCount()

            for r in range(0, int(rows)):
                uniprotid = col.getValueAsString(r)
                queryText = """
    <orgPdbQuery>
    <queryType>org.pdb.query.simple.UpAccessionIdQuery</queryType>
    <description>Simple query for a list of UniprotKB Accession IDs: %s</description>
    <accessionIdList>%s</accessionIdList>
    </orgPdbQuery>
    """ % (uniprotid, uniprotid)

                # POST the query and read the response
                req = urllib2.Request(url, data=queryText)
                f = urllib2.urlopen(req)
                result = f.read()
                f.close()

                # One ':' per returned entity ID, so this counts the hits
                nos = str(result.count(':'))
                colpdbNo.setValueFromString(r, nos)
                colpdbid.setValueFromString(r, result)

                # Need to convert
                #   P50225 1LS6,1Z28,2D06,3QVU,3QVV,3U3J,3U3K,3U3M,3U3O,3U3R,4GRA
                # to
                #   P50225, 1LS6
                #   P50225, 1Z28
                #   P50225, 2D06
                my_list = result.strip().split("\n")
                for x in range(0, len(my_list)):
                    rowdata.append([uniprotid, my_list[x]])

            vtable.fireTableStructureChanged()

            # Create new workspace in Vortex
            column_names = ['Uniprotid', 'PDB']
            TableName = "PDB Mapping"
            arrayToWorkspace(rowdata, column_names, TableName)
Getting More Information from PDB
With a table containing PDB entity IDs, we can now mine the PDB for more information. The web service takes a simple string as input
    http://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb
and returns a description of the entry in XML format, detailing each entity in the PDB file.

    <molDescription>
      <structureId id="4HHB">
        <polymer entityNr="1" length="141" type="protein" weight="15150.4">
          <chain id="A"/>
          <chain id="C"/>
          <Taxonomy name="Homo sapiens" id="9606"/>
          <macroMolecule name="Hemoglobin subunit alpha">
            <accession id="P69905"/>
          </macroMolecule>
          <polymerDescription description="HEMOGLOBIN (DEOXY) (ALPHA CHAIN)"/>
        </polymer>
        <polymer entityNr="2" length="146" type="protein" weight="15890.2">
          <chain id="B"/>
          <chain id="D"/>
          <Taxonomy name="Homo sapiens" id="9606"/>
          <macroMolecule name="Hemoglobin subunit beta">
            <accession id="P68871"/>
          </macroMolecule>
          <polymerDescription description="HEMOGLOBIN (DEOXY) (BETA CHAIN)"/>
        </polymer>
      </structureId>
    </molDescription>
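To show how this XML can be picked apart outside Vortex, here is a small sketch in plain Python 3 using `xml.etree.ElementTree`. The function name `parse_mol_description` and the dictionary keys are assumptions for illustration; the sample document is the 4HHB response quoted above, abbreviated to the elements the sketch reads.

```python
import xml.etree.ElementTree as ET

# 4HHB description as quoted in the text (Taxonomy and polymerDescription omitted)
SAMPLE_XML = """<molDescription>
  <structureId id="4HHB">
    <polymer entityNr="1" length="141" type="protein" weight="15150.4">
      <chain id="A"/>
      <chain id="C"/>
      <macroMolecule name="Hemoglobin subunit alpha">
        <accession id="P69905"/>
      </macroMolecule>
    </polymer>
    <polymer entityNr="2" length="146" type="protein" weight="15890.2">
      <chain id="B"/>
      <chain id="D"/>
      <macroMolecule name="Hemoglobin subunit beta">
        <accession id="P68871"/>
      </macroMolecule>
    </polymer>
  </structureId>
</molDescription>"""


def parse_mol_description(xml_text):
    """Return one dict per <polymer> element; missing parts become None."""
    root = ET.fromstring(xml_text)
    entities = []
    for polymer in root.iter("polymer"):
        macro = polymer.find("macroMolecule")
        accession = macro.find("accession") if macro is not None else None
        entities.append({
            "entity": polymer.get("entityNr"),
            "type": polymer.get("type"),
            "length": polymer.get("length"),
            "chains": [chain.get("id") for chain in polymer.findall("chain")],
            "name": macro.get("name") if macro is not None else None,
            "uniprot": accession.get("id") if accession is not None else None,
        })
    return entities


entities = parse_mol_description(SAMPLE_XML)
for entity in entities:
    print(entity)
```

Checking for `None` explicitly is one alternative to wrapping each lookup in `try`/`except`; it makes it clearer which element was missing when an entity has no Uniprot accession.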
The Vortex script first asks the user to select the PDB column, then for each row in the table generates the query string and runs the query. The returned XML is then parsed to extract some of the data, and the appropriate columns are generated for each entity in the returned XML. As can be seen in the image below, some entries contain a single protein, others contain multiple proteins. The script pulls out the name of the entity, its type (e.g. protein), the number of amino acids, and the Uniprot ID, which could differ from the original query if the PDB entry contains multiple proteins.
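The two bookkeeping steps described here, turning an entity ID into a query URL and spreading the entities across numbered columns, can be sketched in plain Python 3 as follows. The helper names `describe_url` and `entity_columns` are hypothetical, and the input dictionaries are hand-written from the 4HHB example rather than fetched. The column names mirror those used in the script ('Name 1', 'Type 1', 'Num AA 1', 'UniprotID 1', and so on).

```python
DESCRIBE_URL = "http://www.rcsb.org/pdb/rest/describeMol?structureId="


def describe_url(entity_id):
    """Drop any ':n' entity suffix (e.g. '4HHB:1' -> '4HHB') and build the URL."""
    return DESCRIBE_URL + entity_id.split(":")[0]


def entity_columns(entities):
    """Map a list of entity dicts to numbered column names for one table row."""
    row = {}
    for i, entity in enumerate(entities, start=1):
        row["Name %d" % i] = entity.get("name")
        row["Type %d" % i] = entity.get("type")
        row["Num AA %d" % i] = entity.get("length")
        row["UniprotID %d" % i] = entity.get("uniprot")
    return row


# Hand-written from the 4HHB description quoted earlier
sample_entities = [
    {"name": "Hemoglobin subunit alpha", "type": "protein",
     "length": "141", "uniprot": "P69905"},
    {"name": "Hemoglobin subunit beta", "type": "protein",
     "length": "146", "uniprot": "P68871"},
]

print(describe_url("4HHB:1"))
row = entity_columns(sample_entities)
for column in sorted(row):
    print(column, "=", row[column])
```

The script itself trims the ID with `str(pdbid)[:4]`; splitting on `:` is used here only because it does not depend on the ID being exactly four characters long.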
1Z7Q is the crystal structure of the 20S proteasome from yeast in complex with the proteasome activator PA26 from Trypanosoma brucei at 3.2 angstroms resolution; it contains 15 different protein chains. It can be viewed here: http://www.rcsb.org/pdb/ngl/ngl.do?pdbid=1Z7Q&bionumber=1
The PDBinfo Vortex script
    # Use PDB ID to find more PDB info
    # http://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb

    # Python imports
    import urllib2
    import urllib
    import xml.etree.ElementTree as etree

    # Vortex imports
    import com.dotmatics.vortex.util.Util as Util
    import com.dotmatics.vortex.mol2img.jni.genImage as genImage
    import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
    import jarray
    import binascii
    import string
    import os

    input_label = swing.JLabel("PDB column (for input)")
    input_cb = workspace.getColumnComboBox()
    panel = swing.JPanel()
    layout.fill(panel, input_label, 0, 0)
    layout.fill(panel, input_cb, 1, 0)

    # Get column containing PDB ID
    ret = vortex.showInDialog(panel, "Choose PDB column")

    if ret == vortex.OK:
        input_idx = input_cb.getSelectedIndex()
        if input_idx == 0:
            vortex.alert("you must choose a column")
        else:
            # Only continue once a column has been chosen, so col is always defined
            col = vtable.getColumn(input_idx - 1)

            # Format of query URL:
            # http://www.rcsb.org/pdb/rest/describeMol?structureId=4hhb
            rows = vtable.getRealRowCount()
            for r in range(0, int(rows)):
                pdbid = col.getValueAsString(r)
                if ":" in pdbid:  # only search if a PDB entity ID is present
                    pdbid = str(pdbid)[:4]  # convert to string and remove :1
                    mystr = "http://www.rcsb.org/pdb/rest/describeMol?structureId=" + pdbid
                    f = urllib2.urlopen(mystr)
                    myreturn = f.read()
                    f.close()

                    # You may want to do more detailed error checking
                    tree = etree.fromstring(myreturn)

                    # One set of numbered columns per polymer entity;
                    # a missing element or attribute is skipped silently
                    for i, polymer in enumerate(tree.findall('.//polymer')):
                        try:
                            col_name = vtable.findColumnWithName('Name %s' % (i + 1), 1)
                            node = polymer.find('macroMolecule')
                            col_name.setValueFromString(r, node.get('name'))
                        except:
                            pass
                        try:
                            col_id = vtable.findColumnWithName('Type %s' % (i + 1), 1)
                            col_id.setValueFromString(r, polymer.get('type'))
                        except:
                            pass
                        try:
                            col_length = vtable.findColumnWithName('Num AA %s' % (i + 1), 1)
                            col_length.setValueFromString(r, polymer.get('length'))
                        except:
                            pass
                        try:
                            col_name = vtable.findColumnWithName('UniprotID %s' % (i + 1), 1)
                            node = polymer.find('macroMolecule')
                            subnode = node.find('accession')
                            col_name.setValueFromString(r, subnode.get('id'))
                        except:
                            pass
The scripts can be downloaded here
Update
If you also want to download the PDB files, there are a few scripting options described in Downloading PDB.
Last Updated 9 November 2017