OpenBind cannot come soon enough

From a speech by the Secretary of State for Science, Innovation and Technology, 10 June 2025:

Boston might be the birthplace of biotech.

But – with Google DeepMind on one side and the Crick on the other – King’s Cross is emerging as a global powerhouse for AI-driven drug discovery.

Today, we’re launching a new project, OpenBind, to create the world’s largest database explaining how drugs interact with the proteins they target.

20 times bigger than all the data collected worldwide over the last half a century, OpenBind will provide an exceptionally detailed picture of how diseases work.

And it could cut the cost of developing new treatments by up to £100 billion.

The results for the health of our people, our nation and our economy could be revolutionary.

Machine learning loves lots of quality data

Where there’s a foundational set of quality data, you can do things like AlphaFold and make serious breakthroughs. Sure, there are limitations in being able to generalise outside the data you learned from, but if that data is big enough and diverse enough to begin with, you’ve got a chance of doing well.

We don’t have good quality data for small molecules and protein targets

New AI models appear almost weekly, trained on small molecules and proteins to predict something about their interaction. Basically, “will this drug do what I want it to do?” However, they are trained on a limited amount of data of mixed quality.

Organisations like Polaris are addressing part of this, by curating quality data sets and organising competitions.

And the Structural Genomics Consortium is generating data. They aim to screen 1 billion molecules against 2,000 proteins by 2035. Sounds good, but the data doesn’t tell you what the molecules are! Reverse engineering this information from their data is a violation of their terms.

What OpenBind is doing is important

It’s creating that quality and quantity of foundational data. Or at least that appears to be the promise, as there’s very little information to go on, other than who’s involved and:

there’s £8 million of investment (is that enough?);
the scale is “500,000 protein - ligand complex structures and affinity measurements over 5 years”.

Aside: I’d like to know what the “20x” claim in the speech is based on. Against a target of 500,000, it implies an existing collection of about 25,000 data points. PDBbind+ has 27k complex structures, so maybe that’s about right.
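
The arithmetic behind that aside, as a quick sketch. It assumes the “20x” is measured against OpenBind’s 500,000 target and that PDBbind+ is the comparison, neither of which the speech actually confirms:

```python
# Back-of-the-envelope check of the "20 times bigger" claim.
# Assumption: the claim compares OpenBind's five-year target with
# existing protein-ligand complex data such as PDBbind+.
openbind_target = 500_000  # structures and affinity measurements over 5 years
claimed_multiple = 20      # "20 times bigger", from the speech
pdbbind_plus = 27_000      # roughly 27k complex structures

implied_baseline = openbind_target / claimed_multiple
actual_multiple = openbind_target / pdbbind_plus

print(f"Implied existing data: {implied_baseline:,.0f} data points")
print(f"OpenBind vs PDBbind+: about {actual_multiple:.1f}x")
```

So if PDBbind+ is the yardstick, the multiple is a little under 20, which is close enough for a speech.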

Although there are larger datasets, OpenBind will have the actual structure (shape) of a protein with the molecule interacting. Drugs and proteins are often described as a key and a lock, but as others have said, it’s more dynamic, like a hand and a glove.

So: Quality and quantity. And if it lives up to the “open” in the name, I can’t wait.