PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures

PROteolysis TArgeting Chimeras (PROTACs) technology provides an alternative way to modulate biological function by specifically using the ubiquitin-proteasome system to induce degradation of the target protein (DOI).

A PROTAC is composed of three components:

  1. A head-group that targets the protein of interest
  2. A crosslinker
  3. A second ligand at the opposite end that binds an E3 ligase

The PROTAC binds to both the target protein and an E3 ubiquitin ligase to form a ternary complex, followed by transfer of ubiquitin from the E2 enzyme to the protein substrate. The first ubiquitin is attached to a lysine on the target protein; subsequent ubiquitins are then added to a lysine residue of the previously added ubiquitin. The ternary complex dissociates and the PROTAC is recycled. The polyubiquitinated protein is then degraded by the 26S proteasome.

Many E3 ligases have now been identified, with multiple ligands for each ligase; a wide variety of crosslinkers have been employed; and of course the head-group will vary depending on the protein of interest (POI). Whilst PROTACs have a similar modular structure, annotating the different substructures is a challenge. In the past I've used a series of SMARTS queries, but this becomes difficult to maintain, especially when one SMARTS string can be a substring of a larger SMARTS query.

Recently AstraZeneca published a machine learning framework to split PROTACs into their different components. The code is freely available on GitHub.

https://github.com/ribesstefano/PROTAC-Splitter

I created a conda environment and then installed the package; note that it has only been tested using Python 3.10.8. I also installed Jupyter.

I then created a Jupyter notebook as shown below.

Protac_splitter

The cell below splits a single PROTAC and the output shows the format of the result. The individual components of the PROTAC are output as SMILES strings separated by a full stop (period).

The rest of the notebook imports a file containing 9000 PROTACs into a pandas data frame and then splits them into the individual components. The results are then exported to a text file, "output.csv", without the index or header.

The result is a file containing the three components separated by a full stop, with linking positions annotated.

O=C1CCC(N2C(=O)c3cccc(N[*:2])c3C2=O)C(=O)N1.O=C(COCCOCC[*:1])[*:2].O=c1c2nc3ccccc3[nH]c-2nn1[*:1]
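If you want to work with the individual components outside Vortex, a result line can be pulled apart with a few lines of Python. This is only a minimal sketch: the function name `parse_split` is my own, and the role assignment assumes that label 2 marks the E3 ligase side and label 1 the protein of interest side, which is what the example line suggests. The pattern accepts attachment labels written as `[:1]` or as `[*:1]`.

```python
import re

# Matches attachment-point labels such as [*:1] or [:1] and captures the number.
ATTACHMENT = re.compile(r"\[\*?:(\d+)\]")


def parse_split(result):
    """Split one result line into its three components, keyed by role.

    Assumption (inferred from the example line, not from the tool's docs):
    the fragment carrying only label 2 is the E3 ligase ligand, the fragment
    carrying only label 1 is the POI ligand, and the fragment carrying both
    labels is the linker.
    """
    fragments = result.strip().split(".")
    if len(fragments) != 3:
        raise ValueError(f"expected 3 fragments, got {len(fragments)}")

    roles = {}
    for fragment in fragments:
        labels = set(ATTACHMENT.findall(fragment))
        if labels == {"1", "2"}:
            roles["linker"] = fragment
        elif labels == {"2"}:
            roles["e3_ligand"] = fragment
        elif labels == {"1"}:
            roles["poi_ligand"] = fragment
        else:
            raise ValueError(f"unexpected attachment labels in {fragment!r}")

    if len(roles) != 3:
        raise ValueError("could not assign each fragment a distinct role")
    return roles


example = (
    "O=C1CCC(N2C(=O)c3cccc(N[:2])c3C2=O)C(=O)N1"
    ".O=C(COCCOCC[:1])[:2]"
    ".O=c1c2nc3ccccc3[nH]c-2nn1[:1]"
)
for role, smiles in parse_split(example).items():
    print(role, smiles)
```

Assigning roles from the labels rather than from the position in the line means the sketch does not depend on the order in which the components are written.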

We can now import this file into Vortex. As you can see in the image below, the default is to import the file as comma-separated values. However, in this case we need to use a full stop as the separator.

We can do this by modifying the import options as shown below. First unselect "First line contains column names", then choose "other" as the column separator character and type a full stop (period) into the box. The file preview should now update to show column names (the defaults are C0, C1, C2); you now need to type in the names as shown below.

Click "OK" and the file will be imported.
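If you would rather not change the import options each time, the file could instead be converted to a conventional comma-separated file with a header row before import. The sketch below uses only the Python standard library; the function name `dots_to_csv` and the column names are my own placeholders, so substitute whatever names you prefer.

```python
import csv
import io

# Placeholder column names, in the order the components appear in the file.
COLUMNS = ["E3_ligand", "Linker", "POI_ligand"]


def dots_to_csv(src, dst, columns=COLUMNS):
    """Copy full-stop-separated rows from src to dst as CSV with a header.

    src is any iterable of text lines and dst is a writable text file.
    Blank lines are skipped. Returns the number of data rows written.
    """
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(columns)
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue
        fields = line.split(".")
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(fields)}: {line!r}"
            )
        writer.writerow(fields)
        count += 1
    return count


if __name__ == "__main__":
    # Small in-memory demonstration using the example line from this post.
    demo_in = io.StringIO(
        "O=C1CCC(N2C(=O)c3cccc(N[:2])c3C2=O)C(=O)N1"
        ".O=C(COCCOCC[:1])[:2]"
        ".O=c1c2nc3ccccc3[nH]c-2nn1[:1]\n"
    )
    demo_out = io.StringIO()
    dots_to_csv(demo_in, demo_out)
    print(demo_out.getvalue())
```

For a real file you would pass open file handles, for example `open("output.csv")` as `src` and `open("output_columns.csv", "w", newline="")` as `dst`; the second file name is just an illustration.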

You can then use the "Generic Category script" from the cluster analysis collection to see how many of each E3 ligase ligand are employed: https://macinchem.org/2023/03/11/a-collection-of-vortex-scripts-to-aid-cluster-analysis/
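For a quick check outside Vortex, the same kind of tally can be sketched with `collections.Counter`. This assumes the E3 ligase ligand is the first of the three components on each line, as in the example above; the function name `count_e3_ligands` is my own and this is not the Vortex script.

```python
from collections import Counter


def count_e3_ligands(lines, column=0):
    """Count how often each fragment occurs in the given column.

    lines is any iterable of full-stop-separated result lines; column 0 is
    assumed to hold the E3 ligase ligand. Blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[line.split(".")[column]] += 1
    return counts


if __name__ == "__main__":
    # "output.csv" is the file exported from the notebook.
    with open("output.csv") as handle:
        for smiles, n in count_e3_ligands(handle).most_common(10):
            print(n, smiles)
```

Because this counts exact strings, the same ligand written with a different SMILES or a different attachment position is counted separately.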

Many thanks to AZ for making this tool available to the community.
