Maximizing Data Utility In Clinical Disclosure: Reference Populations

Choosing the right reference population maximizes data utility. Combined with the right approaches and methods, it further facilitates data sharing and secondary analyses. Without an appropriate reference population, the risk of re-identifying patients is real.

When anonymizing patient data, researchers focus on removing uniqueness or on blending any remaining unique patients into the dataset. When other people with the same characteristics exist in the dataset, an adversary may not be able to definitively re-identify a patient. A widely accepted numerical threshold is 11 similar individuals, where ‘similar’ refers to their indirect identifiers. The different ways of selecting a reference population are worth further discussion.

The risk of re-identification

Reference population means “the group of individuals used to determine the risk of re-identification”. The reference population of a dataset is the set of persons who are considered to have ‘similar’ characteristics to those being modeled for risk. Here ‘similar’ means having the same disease(s) as the trial in question, comparable demographics, time period, and location, and having participated in a clinical trial for the indication/treatment in question.

Sponsors and Clinical Research Organizations (CROs) need to meet the regulatory risk threshold of 0.09 (approximately 1/11), which means demonstrating that at least 11 individuals share each set of identifying attributes. Expressed as a cell size, the equivalent is 10 individuals. There are multiple ways to demonstrate that at least 11 individuals share a set of characteristics, and the k-anonymity model (with a threshold of k=11) is one of the most common.
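
The k=11 check described above can be illustrated with a minimal Python sketch. It is not a regulatory tool or any vendor's implementation: the field names (age_band, sex), the sample records, and the helper functions are all hypothetical, and the risk is measured simply as 1 divided by the size of the smallest group of records sharing the same indirect identifiers.

```python
from collections import Counter

K_THRESHOLD = 11  # at least 11 similar individuals, i.e. risk <= 1/11 (about 0.09)


def equivalence_class_sizes(records, quasi_identifiers):
    """Count the records that share each combination of indirect identifiers."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def max_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest equivalence class."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return 1 / min(sizes.values())


def is_k_anonymous(records, quasi_identifiers, k=K_THRESHOLD):
    """True when every equivalence class holds at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())


# Hypothetical data: 11 patients in one group, 3 in another.
QUASI_IDENTIFIERS = ["age_band", "sex"]
records = (
    [{"age_band": "40-49", "sex": "F"}] * 11
    + [{"age_band": "50-59", "sex": "M"}] * 3
)

print(max_risk(records, QUASI_IDENTIFIERS))        # smallest class has 3 records
print(is_k_anonymous(records, QUASI_IDENTIFIERS))  # the class of 3 fails k=11

# One illustrative transformation: drop sex and widen the age band
# so that all 14 records fall into a single class.
generalized = [{"age_band": "40-59"} for _ in records]
print(is_k_anonymous(generalized, ["age_band"]))
```

In this sketch the group of three fails the threshold until the identifiers are generalized, which shows the trade-off in miniature: the threshold is met, but the dataset carries less detail.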

In 2019, Health Canada issued further guidance on reference population selection. Its guidance document states:

“The selection of the appropriate reference population determines the total patient group size and the amount of anonymization (i.e. data transformation) that is necessary to reduce the risk of patient re-identification. The reference population can be informed from patients in the single trial in question (smallest population), all patients in similar trials by a specific study sponsor, all patients in similar trials (e.g., by disease or therapeutic intervention category), or all patients in a geographic area (largest population).

When the appropriate reference population is one other than the single trial in question, an extrapolation of the trial population can be applied to achieve an estimate of the population size. In keeping with the first and second guiding principle, risk of re-identification should be informed not only by the number of individuals in a single study, but also by the number that reflects real-world risk.”

Determining a reference population

Since each dataset is unique, a blanket method cannot be applied to selecting the reference population. The selection is done on a case-by-case basis. CROs typically conduct trials with small sample sizes and focus on indications with limited established and pre-existing research. These factors drive the decision-making when selecting reference populations and must be evaluated prior to making the selection.

There are four possibilities for selecting a reference population. The most conservative is to let each individual study serve as its own population. The second is to combine all the participants, so that all the studies in the submission together serve as a “pooled” population. The third goes beyond the submission and includes all the studies in the submission plus other recent similar studies for the same indication/treatment. The fourth and least conservative is to use broader geographic or prevalence estimates.
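
To see how these four choices change the measured risk, here is a rough Python sketch of a simple proportional extrapolation. It assumes the trial's mix of indirect identifiers is representative of the wider reference population, which is a strong assumption that may not hold in practice. The function names and all population sizes are hypothetical, chosen only for illustration.

```python
K_THRESHOLD = 11  # at least 11 similar individuals


def extrapolated_class_size(trial_class_size, trial_size, reference_size):
    """Scale a trial equivalence class up to the reference population.

    Assumes (hypothetically) that the class makes up the same share of
    the reference population as it does of the trial.
    """
    if reference_size < trial_size:
        raise ValueError("reference population cannot be smaller than the trial")
    return trial_class_size * reference_size / trial_size


def meets_threshold(trial_class_size, trial_size, reference_size, k=K_THRESHOLD):
    """True when the extrapolated class holds at least k individuals."""
    return extrapolated_class_size(trial_class_size, trial_size, reference_size) >= k


# Hypothetical sizes for the four options, most to least conservative.
TRIAL_SIZE = 200
REFERENCE_POPULATIONS = {
    "single study": 200,
    "pooled submission": 600,
    "submission plus similar studies": 1500,
    "geographic / prevalence estimate": 40000,
}

# A group of 3 trial participants sharing the same indirect identifiers.
for name, size in REFERENCE_POPULATIONS.items():
    estimate = extrapolated_class_size(3, TRIAL_SIZE, size)
    print(f"{name}: estimated class size {estimate:g}, "
          f"meets k={K_THRESHOLD}: {meets_threshold(3, TRIAL_SIZE, size)}")
```

With these made-up numbers, the same group of three falls short of the threshold under the two more conservative options and meets it under the two broader ones, so less data transformation would be needed as the reference population grows.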

For rare and ultra-rare diseases, a more conservative approach is usually adopted when selecting the reference population. When each individual study serves as its own population, researchers do not assume a wider disease population, so the risk measurement focuses only on the population of the study. Conversely, if additional studies with sufficient patient counts exist outside the study population, geographic estimates are used. For ultra-rare diseases that have yet to be extensively researched, however, such populations often cannot be found. Please see our case study on working with rare disease populations.

As you can see, selecting a reference population requires a nuanced understanding of the dataset at hand. Overly conservative selection criteria keep the risk of re-identification low but require more data transformation, which reduces data utility; less conservative criteria preserve more utility but may increase the risk of re-identification. At Real Life Sciences, data science experts can provide insights to make the decision-making process easier. Based on input from sponsors, we provide recommendations for selecting the reference population. These experts can then anonymize sensitive variables in the dataset per regulatory requirements. We welcome further discussions about clinical trial datasets for rare or common diseases.

