Download multiple PDB files using a Jupyter notebook

The RCSB Protein Data Bank is an invaluable resource that archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies, helping scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. The PDB currently contains over 134,000 data files with structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. RCSB also provides a series of tools to search, view and analyse the data.

Downloading an individual PDB file is trivial and can be done from the web page as shown in the image below.

RCSB also provides a Download Tool that is launched as a stand-alone application using the Java Web Start protocol. The tool is downloaded locally and must then be opened. I’ve found this a little temperamental and have had issues with Java versions and security settings.

Since I’ve been making extensive use of the web services to interact with RCSB, I decided to explore using Python to download multiple files. This turned out to be very successful, and I’ve used it to download a batch of 30,000 files.

Jupyter Notebook

I’ve become a great fan of Jupyter notebooks. I use them extensively, not only to record the work I’m doing but also as a workflow tool. They are also a great way to share code.

You will need to edit the path to the file containing the PDB codes and the folder where you want the PDB files downloaded. The notebook then reads in the PDB codes (either as comma-separated values or as one code per line) and downloads the files (gzipped if requested). The try/except block is needed for cases where files are unavailable, and a list of unavailable files is printed out.
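The steps described above can be sketched roughly as follows. This is not the notebook itself, just a minimal illustration using only the Python standard library. The download URL pattern (files.rcsb.org/download), the function names parse_codes and download_pdb_files, and the file and folder names in the usage comment are all assumptions made for this example, so check the URL against the current RCSB documentation before relying on it.

```python
import re
from pathlib import Path
from urllib.request import urlretrieve

# Assumed download endpoint; verify against the current RCSB documentation.
BASE_URL = "https://files.rcsb.org/download"


def parse_codes(text):
    """Split text on commas and/or whitespace into upper-case PDB codes.

    Handles both comma-separated values and one code per line.
    """
    return [code.upper() for code in re.split(r"[,\s]+", text) if code]


def download_pdb_files(codes, folder, gzipped=False, fetch=urlretrieve):
    """Download each code into folder and return the codes that failed.

    fetch is any callable taking (url, destination_path); it defaults to
    urllib's urlretrieve and can be swapped out for testing.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    suffix = ".pdb.gz" if gzipped else ".pdb"
    unavailable = []
    for code in codes:
        name = code + suffix
        try:
            fetch(f"{BASE_URL}/{name}", str(folder / name))
        except OSError:
            # urllib's HTTPError and URLError are both OSError subclasses,
            # so missing entries and network failures end up here.
            unavailable.append(code)
    return unavailable


# Example usage (hypothetical paths; edit to suit):
# codes = parse_codes(Path("pdb_codes.txt").read_text())
# missing = download_pdb_files(codes, "pdb_files", gzipped=True)
# print("Unavailable:", missing)
```

Collecting the failed codes in a list, rather than stopping at the first error, means a large batch can run to completion and the missing entries can be checked afterwards.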

You can download the Jupyter notebook here
