The RCSB Protein Data Bank is an invaluable resource that provides archived information about the 3D shapes of proteins, nucleic acids, and complex assemblies, helping scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. RCSB also provides a series of tools to search, view and analyse the data.
Downloading an individual PDB file is straightforward and can be done from the web page, as shown in the image below.

There is also a Download Tool, launched as a stand-alone application using the Java Web Start protocol. The tool is downloaded locally and must then be opened. I’ve found this a little temperamental and have had issues with Java versions and security settings.
Since I’ve been making extensive use of the web services to interact with RCSB, I decided to explore the use of Python to download multiple files. This turned out to be very successful, and I’ve used it to download a batch of 30,000 files.
Jupyter Notebook
I’ve become a great fan of Jupyter notebooks. I use them extensively, not only to record the work I’m doing but also as a workflow tool. They are also a great way to share code.
You will need to edit the path to the file containing the PDB codes and the folder you want to download the PDB files to. The notebook then reads in the PDB codes (either as comma-separated values or as one code per line) and downloads the files (gzipped if requested). The try/except block is needed for cases where files are unavailable; the names of any unavailable files are printed out.
    # Use PDB ID to download PDB files from https://www.rcsb.org
    # Authored by Chris Swain (http://www.macinchem.org)
    # Copyright CC-BY

    import csv
    import os
    import sys

    # Python 2 and 3 compatibility
    if sys.version_info[0] == 2:
        from urllib import urlretrieve
    else:
        from urllib.request import urlretrieve

    # You may want to edit these parameters

    # File containing comma-separated list of the desired PDB IDs
    pdb_codes_file = 'ForPDBdownload.csv'

    # Folder to download files to
    download_folder = 'PDB2/'

    # Whether to download gzip compressed files
    compressed = True

    # Read the PDB IDs from the input csv file
    with open(pdb_codes_file) as f:
        # Change to .split('\n') if PDB IDs are 1 per line
        # Strip whitespace and skip empty entries (e.g. from a trailing newline)
        pdb_codes = [code.strip() for code in f.read().split(',') if code.strip()]

    # Alternatively, hard code the PDB IDs:
    # pdb_codes = ['1LS6', '1Z28', '2D06', '3QVU', '3QVV', '3U3J', '3U3K']

    # For testing
    # print(pdb_codes)

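The cell above expects comma-separated codes and needs a manual change for files with one code per line. A minimal sketch of a helper that accepts either layout is shown here. The function name `parse_pdb_codes` is hypothetical and not part of the original notebook, and it assumes the codes are separated only by commas or whitespace.

```python
import re


def parse_pdb_codes(text):
    """Split text on commas and/or whitespace and return the non-empty codes.

    Hypothetical helper, not part of the original notebook.
    """
    # Splitting on runs of commas/whitespace covers 'A,B', 'A, B' and 'A\nB'
    return [code for code in re.split(r'[,\s]+', text) if code]


# Example with an in-memory string; for a file, pass f.read() instead
print(parse_pdb_codes('1LS6, 1Z28\n2D06\n'))
```

The result could be assigned to `pdb_codes` in place of the `split(',')` call, leaving the rest of the notebook unchanged.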
    # Ensure download folder exists
    try:
        os.makedirs(download_folder)
    except OSError as e:
        # Ignore OSError raised if it already exists
        pass

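If the notebook only ever needs to run on Python 3, the try/except above could be replaced by the `exist_ok` argument of `os.makedirs`. This is a small sketch under that assumption; it works inside a temporary directory so it is safe to run anywhere, and the folder name is only an example.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    download_folder = os.path.join(tmp, 'PDB2')
    # exist_ok=True (Python 3.2+) makes the call a no-op if the folder exists
    os.makedirs(download_folder, exist_ok=True)
    os.makedirs(download_folder, exist_ok=True)  # second call does not raise
    folder_created = os.path.isdir(download_folder)

print(folder_created)
```

Unlike a blanket `except OSError`, this form still raises on other problems such as a permissions error.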
    for pdb_code in pdb_codes:
        # Add .pdb extension and remove ':1' suffix if entities
        filename = '%s.pdb' % pdb_code[:4]
        # Add .gz extension if compressed
        if compressed:
            filename = '%s.gz' % filename
        url = 'https://files.rcsb.org/download/%s' % filename
        destination_file = os.path.join(download_folder, filename)
        # Download the file
        try:
            urlretrieve(url, destination_file)
        except Exception as ex:
            # Report the file that could not be downloaded and move on
            print(filename)
            continue

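The loop above prints each failed filename as it goes. One possible variation, sketched here assuming Python 3, collects the failures into a list so they can be reported or retried afterwards. The names `pdb_filename`, `download_all` and `fake_fetch`, and the `fetch` parameter, are hypothetical and not part of the notebook above. `fetch` defaults to `urlretrieve` but can be swapped for a stand-in, so the logic can be tried without touching the network.

```python
import os
from urllib.request import urlretrieve

BASE_URL = 'https://files.rcsb.org/download/'  # same URL prefix as the loop above


def pdb_filename(pdb_code, compressed=True):
    """Return e.g. '1LS6.pdb.gz' for '1LS6' (or '1LS6:1')."""
    filename = '%s.pdb' % pdb_code.strip()[:4]
    return filename + '.gz' if compressed else filename


def download_all(pdb_codes, download_folder, compressed=True, fetch=urlretrieve):
    """Try to download every code; return the filenames that failed."""
    failed = []
    for pdb_code in pdb_codes:
        filename = pdb_filename(pdb_code, compressed)
        destination = os.path.join(download_folder, filename)
        try:
            fetch(BASE_URL + filename, destination)
        except Exception:
            # Treat any error (missing entry, network problem) as unavailable
            failed.append(filename)
    return failed


# Offline demonstration with a stand-in fetch that rejects one made-up code
def fake_fetch(url, destination):
    if 'XXXX' in url:
        raise IOError('not found')


print(download_all(['1LS6', 'XXXX:1'], 'PDB2', fetch=fake_fetch))
```

With the real `urlretrieve`, the call would be `download_all(pdb_codes, download_folder, compressed)`, made after the download folder has been created.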