What we learned when we took OpenSAFELY from GP data to schools data


OpenSAFELY was built for healthcare. For more than five years, it has given researchers secure, reproducible access to NHS England's GP data, one of the richest health datasets in the world. When the National Institute of Teaching (NIoT) approached us about applying the same approach to school data, we were enthusiastic. Here is what we found when we got into the detail.

The data model is recognisable, but meaningfully different

In the NHS GP setting, everything revolves around a single entity type: the patient. Prescriptions, diagnoses, referrals all hang off that one concept.

Schools data has two: pupils and teachers. That sounds like a small change, but it ripples through the entire data model and query language. Our query language, ehrQL, was originally written with "patient" baked into its vocabulary. Adapting it for education meant rethinking that from the ground up. We chose not to simply swap "patient" for "student". Instead, we removed the patient-centric framing altogether, which produced a cleaner, more neutral design. We have demonstrated that this adapted version of ehrQL can reproduce real education research using dummy data. It is a proof of concept rather than a finished production system.

The NHS GP schema is mature and stable. It has been agreed, governed, and largely unchanged for years. By contrast, the TED dataset (NIoT's Teacher Education Data platform) is newer, and its schema is still evolving as more schools join and new data types are incorporated. This is normal for a young system, but it means the investment required to build a fully stable query layer on top of it is not yet justified. For the current pilot we are therefore using SQL Runner, a more direct SQL-based extraction tool originally designed for data development work in OpenSAFELY, rather than the full ehrQL implementation.

There is another significant difference. NHS GP data comes with codelists. Diagnoses, medications, and procedures all map to standardised coding systems such as SNOMED CT and the BNF. OpenSAFELY has built substantial infrastructure around those. Schools data has nothing equivalent. Attainment scores, SEND categories, and intervention types vary between schools, between multi-academy trusts (MATs), and over time.

Re-identification risk is a different problem in schools

Both healthcare and education data are sensitive, but the nature of the re-identification risk is quite different.

In the NHS GP setting, the main concern is statistical. Can someone reconstruct an individual's record from a combination of output values? Standard controls like rounding and suppression of small counts are well understood and largely effective.
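As a rough illustration of what rounding and small-count suppression mean in practice, here is a minimal Python sketch. The function name, the threshold of 7, and the rounding base of 5 are assumptions chosen for the example, not a statement of the rules applied to any particular project.

```python
from typing import Optional


def redact_and_round(count: int, threshold: int = 7, base: int = 5) -> Optional[int]:
    """Suppress small counts, then round what remains.

    Returns None (meaning "redacted") when count is at or below the
    threshold; otherwise returns count rounded to the nearest multiple
    of base. Both parameter defaults are illustrative assumptions.
    """
    if count <= threshold:
        return None
    # Integer arithmetic avoids the round-half-to-even behaviour of round().
    return ((count + base // 2) // base) * base


# Hypothetical counts for the cells of an output table.
raw_counts = {"group_a": 3, "group_b": 12, "group_c": 13}
safe_counts = {name: redact_and_round(n) for name, n in raw_counts.items()}
```

Rounding after suppression means that small differences between two published tables cannot be subtracted to recover an exact small count.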

In schools, a subtler risk emerges. Imagine a researcher publishing this result: "Intervention Group A: 0% of pupils achieved a pass." Although the output looks safe, in a real school a teacher, parent, or community member may know exactly which pupils were in Intervention Group A. The disclosure does not come from a small number being visible, but from the group itself being recognisable.
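The example above can be turned into a simple check. A result of 0% or 100% reveals the outcome for every member of the group to anyone who knows who was in it, however large the group is. The sketch below is one hypothetical way to flag such results for a human output checker; the function name and inputs are assumptions made for illustration.

```python
def is_group_disclosive(n_with_outcome: int, group_size: int) -> bool:
    """Flag results where every member of a group shares the same outcome.

    If nobody (0%) or everybody (100%) in a recognisable group has the
    outcome, knowing the group's membership reveals each individual's
    result, regardless of how large the group is.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= n_with_outcome <= group_size:
        raise ValueError("n_with_outcome must be between 0 and group_size")
    return n_with_outcome == 0 or n_with_outcome == group_size
```

A small-count threshold alone would not catch this case: a group of 30 pupils with zero passes clears a minimum group size, yet still tells a reader how each pupil did.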

This is amplified by the community-local nature of the school environment. Staff lists are often public. Parents attend events. Local WhatsApp groups share information. People in school communities know each other in ways that hospital patients simply do not. Some categorical fields in the TED data also have very high cardinality (lots and lots of possible values), meaning a single unusual field value could, in principle, single out an individual.

The mitigations we are helping NIoT develop focus on upstream data design rather than just the output-checking stage. That means standardising and aggregating local category codes before analysis, applying minimum group size thresholds, and being careful about which cross-tabulations are permitted.
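To make the idea of upstream standardisation concrete, here is a minimal Python sketch that maps locally defined intervention labels onto a small set of standard categories and then applies a minimum group size. The labels, the mapping, the "other" fallback, and the threshold of 10 are all invented for illustration and are not taken from the TED data.

```python
from collections import Counter

# Hypothetical mapping from school-specific labels to standard categories.
LOCAL_TO_STANDARD = {
    "Y7 phonics catch-up": "literacy",
    "Reading Recovery (AM)": "literacy",
    "Maths booster": "numeracy",
}


def aggregate_interventions(local_codes, mapping, min_group_size=10):
    """Aggregate local codes into standard categories before analysis.

    Unmapped codes fall into "other". Categories with fewer members than
    min_group_size are withheld and reported by name only.
    """
    counts = Counter(mapping.get(code, "other") for code in local_codes)
    kept = {cat: n for cat, n in counts.items() if n >= min_group_size}
    suppressed = sorted(cat for cat, n in counts.items() if n < min_group_size)
    return kept, suppressed


# Invented pupil-level records: one local intervention label per pupil.
records = (
    ["Y7 phonics catch-up"] * 6
    + ["Reading Recovery (AM)"] * 5
    + ["Maths booster"] * 4
    + ["Chess club"] * 2
)
kept, suppressed = aggregate_interventions(records, LOCAL_TO_STANDARD)
```

Neither literacy label reaches the threshold on its own here, but the combined category does, which is why the aggregation step is applied before the group size check rather than after it.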

Transparency requires a phased approach

OpenSAFELY's default model is radical transparency. Code, codelists, and outputs are published openly. That remains the goal for the schools work too. But getting there responsibly takes time.

In the NHS GP setting, the coding systems are standardised enough that publishing them publicly carries little re-identification risk. In schools, publishing highly granular, locally-defined category codes too early could expose sensitive structure before the right aggregation and standardisation work has been done. We are treating the current pilot phase as a controlled, internal environment where outputs are shared only with approved users, while we work out what appropriate aggregation looks like in practice.

The bigger picture

The schools work has been the most concrete test of what expanding OpenSAFELY beyond the NHS actually requires. The core concepts translate well. Secure analysis, reproducible code, and human output checking all carry over. However, there are real domain differences: a different data model, no codelists, an evolving schema, and a social re-identification landscape that demands different upstream controls. Each of these is solvable and will be the subject of future work.