If you have been trying to access patent data via SureChEMBL recently, you will be very aware that the team have been struggling to provide timely updates. Whilst the concept of extracting chemical information from patents seems attractive, the devil is in the details: it requires both name-to-structure and image-to-structure conversion, together with the intelligence to know when an object actually is a chemical structure. The legacy technology underpinning all this has been creaking for a while, and it is fantastic to read about a complete overhaul.
You can now download all the updated data from the FTP site: https://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/bulk_data/
If you’re familiar with relational databases, the structure should be straightforward. One file contains the compounds, another the patent documents, and a third the compound–patent relationships. A fourth, smaller file holds metadata about patent sections (e.g., title, abstract, description, claims, images, MOL attachments).
The documentation describes the files in detail.
https://chembl.gitbook.io/surechembl/downloads/bulk-data
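To make the relational layout concrete, here is a minimal sketch of how the four files could be joined once loaded into a database. It uses Python's built-in sqlite3 module with a handful of toy rows. Every table name, column name, and value in it is a hypothetical placeholder chosen for illustration, not the real SureChEMBL schema, so check the documentation for the actual names before adapting it.

```python
import sqlite3

# NOTE: table names, column names, and rows are all hypothetical placeholders,
# not the real SureChEMBL schema. They only mirror the four-file layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compounds (compound_id INTEGER PRIMARY KEY, smiles TEXT);
CREATE TABLE patents   (patent_id INTEGER PRIMARY KEY, patent_number TEXT);
CREATE TABLE fields    (field_id INTEGER PRIMARY KEY, field_name TEXT);
-- Mapping table: which compound appears in which patent, and in which section
CREATE TABLE compound_patent (
    compound_id INTEGER REFERENCES compounds(compound_id),
    patent_id   INTEGER REFERENCES patents(patent_id),
    field_id    INTEGER REFERENCES fields(field_id)
);
""")

# Toy data standing in for the contents of the four bulk files
conn.executemany("INSERT INTO compounds VALUES (?, ?)",
                 [(1, "CCO"), (2, "c1ccccc1")])
conn.executemany("INSERT INTO patents VALUES (?, ?)",
                 [(10, "TOY-001"), (11, "TOY-002")])
conn.executemany("INSERT INTO fields VALUES (?, ?)",
                 [(1, "title"), (2, "claims")])
conn.executemany("INSERT INTO compound_patent VALUES (?, ?, ?)",
                 [(1, 10, 2), (1, 11, 1), (2, 10, 1)])


def patents_for_compound(conn, smiles):
    """Return (patent_number, field_name) pairs for one compound."""
    query = """
        SELECT p.patent_number, f.field_name
        FROM compounds AS c
        JOIN compound_patent AS cp ON cp.compound_id = c.compound_id
        JOIN patents AS p ON p.patent_id = cp.patent_id
        JOIN fields AS f ON f.field_id = cp.field_id
        WHERE c.smiles = ?
        ORDER BY p.patent_number, f.field_name
    """
    return conn.execute(query, (smiles,)).fetchall()


hits = patents_for_compound(conn, "CCO")
print(hits)
```

The mapping table carries all the relationships, so a question such as "which patents mention this compound, and in which section?" becomes a three-way join. The same pattern should carry over to whichever tool you load the real files into.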