Not committing confidential data to git

The Guardian report on Biobank data being leaked onto GitHub is either no big deal or a massive screw-up, depending on your point of view.

Either way, I keep a global gitignore file that names folders for confidential information, and I put anything sensitive in those folders. To get any of it into git, I'd have to try really, really hard.

The file would be something like this:

$ cat ~/.gitignore
confidential/
logs
target
.DS_Store
*mise.toml

You do have to tell git about the file, through its core.excludesFile setting, but then you're set for life.
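Telling git is a one-off command. This is a minimal sketch, assuming the file lives at ~/.gitignore as in the listing above; adjust the path if yours is elsewhere.

```shell
# Register ~/.gitignore as the ignore file for every repository on this machine.
# The shell expands ~ before git sees it, so the absolute path is what gets stored.
git config --global core.excludesFile ~/.gitignore

# Read the setting back to confirm it took effect.
git config --global --get core.excludesFile
```

If core.excludesFile is never set, git falls back to ~/.config/git/ignore (or the equivalent under $XDG_CONFIG_HOME), so putting the file there instead avoids the config step altogether.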

I store data in confidential/. Because Python notebooks save their cell outputs alongside the code, they should go in there too.
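To check that a file really is ignored, git check-ignore reports which pattern caught it. The sketch below runs in a throwaway repository with a throwaway ignore file, so it doesn't depend on or touch your real setup. The notebook name analysis.ipynb and the scratch paths are made up for illustration.

```shell
# Scratch repository and ignore file, purely for demonstration.
scratch="$(mktemp -d)"
printf 'confidential/\n' > "$scratch/ignore"
git init --quiet "$scratch/repo"
cd "$scratch/repo"
mkdir confidential
touch confidential/analysis.ipynb

# -c supplies the ignore file for this one command only.
# -v prints the ignore file, line number and pattern that matched the path.
git -c core.excludesFile="$scratch/ignore" check-ignore -v confidential/analysis.ipynb
```

With the global file already configured, the -c part is unnecessary: running git check-ignore -v on a path inside any repository is enough.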

* * *

Biobank contains the data for half a million people, so you'd think the risks of sharing the data are high.

"Fortunately, there is a clear path forwards. In many other sectors – such as census work at ONS for 2 decades – data is not disseminated out to users. Instead the analysts go to the data, and work inside a secure platform called a Trusted Research Environment (TRE). This working style must be adopted in the NHS."

That's from the Goldacre review in 2022, and OpenSAFELY is one such platform.

But a TRE sounds like a burden for smaller projects. Until working that way is the common pattern, not committing the data to a public repository is a solid start.