Not committing confidential data to git
The Guardian report on Biobank data being leaked onto GitHub is either no big deal or a massive screw-up, depending on your point of view.
Either way, I keep a global gitignore file that names folders for confidential information, and I put anything sensitive in those folders. To get any of it into a repository, I would have to try really, really hard.
The file would be something like this:
```shell
$ cat ~/.gitignore
confidential/
logs
target
.DS_Store
*mise.toml
```
You do have to tell git about the file, but then you're set for life.
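Telling git about the file is a one-off setting. A minimal sketch, assuming the file lives at `~/.gitignore` as in the listing above (`core.excludesFile` is git's name for the setting):

```shell
# Register ~/.gitignore as the ignore file applied to every repository.
git config --global core.excludesFile ~/.gitignore

# Print the stored value to confirm the setting took effect.
git config --global --get core.excludesFile
```

The shell expands `~` before git sees it, so the stored value is the full path to the file.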
I store data in confidential/. Python notebooks embed the outputs of their cells, which can include the data itself, so they should go in there too.
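To see what "trying really hard" looks like in practice, here is a rough sketch. It runs against a throwaway HOME and a scratch repository, so nothing in it touches a real setup; the file names (`data.csv`, `README.txt`) and their contents are made up for illustration.

```shell
# Throwaway HOME so this sketch never touches your real git config.
export HOME="$(mktemp -d)"
printf 'confidential/\n' > "$HOME/.gitignore"
git config --global core.excludesFile "$HOME/.gitignore"

# Scratch repository with one sensitive file and one ordinary file.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
mkdir confidential
echo "patient_id,diagnosis" > confidential/data.csv
echo "notes" > README.txt

# Only README.txt is listed as untracked; the confidential folder is invisible.
git status --short

# Ask git which ignore rule is responsible for skipping the file.
git check-ignore -v confidential/data.csv

# A plain add is refused; staging the file would take an explicit `git add -f`.
git add confidential/data.csv || echo "refused without -f"
```

The last command is the point: a routine `git add` cannot stage anything under `confidential/`, so leaking it requires a deliberate `-f`.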
* * *
Biobank contains the data for half a million people, so you'd think the risks of sharing the data are high.
> Fortunately, there is a clear path forwards. In many other sectors – such as census work at ONS for 2 decades – data is not disseminated out to users. Instead the analysts go to the data, and work inside a secure platform called a Trusted Research Environment (TRE). This working style must be adopted in the NHS.
That's from the Goldacre review in 2022, and OpenSAFELY is one such platform.
But working inside a TRE sounds like a burden for smaller projects. Until that's the common pattern, not committing the data to a public repository is a solid start.