Clinical Trial Anonymization: Maintaining Data Utility
Prior to Health Canada’s anonymization guidelines issued in 2019, subjective input from researchers helped in maintaining data utility, and the process was primarily conducted at the clinical level. At present, preserving data utility during the anonymization process must involve quantitative measurements at the document/data level. Similarly, it must include a well-defined and precise implementation of the selected rules to prevent over-redaction or over-anonymization. Health Canada provided these guidelines to maximize the release of analytically-valuable information.
Sponsors and Clinical Research Organizations (CROs) often consult experts at data transformation or anonymization such as Real Life Sciences (RLS) to conduct this process. At RLS, we follow a quantitative and metrics-driven process for preserving data utility for mandatory disclosures. This process is facilitated by the RLS PROTECT solution that is designed to balance the protection of patient identity and data utility. The solution supports and expedites regulatory submissions and further facilitates voluntary data-sharing projects.
Using this solution, RLS data anonymization specialists preserve data utility by taking a methodical approach that has six distinct steps.
First, they determine the set of possible transformations across all identifiers that meet the 0.09 risk threshold, in line with the EMA/HC guidance on anonymization techniques. This enables the data to preserve its clinical or research value after transformation, and it aligns with the 2018 guidelines issued by the European Medicines Agency (EMA). Regulatory authorities such as Health Canada prefer anonymization of data over redaction. RLS has demonstrated its thought leadership by adopting this process early on. Overall, the preference is to employ transformations that allow for a higher level of granularity over outright redaction of data. This process yields multiple possible transformation scenarios across indirect identifiers.
The next step is to measure the ‘Risk of Re-identification’ (ROR) across all possible transformation scenarios. This ROR metric is used to retain only the transformation scenarios that meet the risk threshold. In this step, several options for a number of variables or identifiers are considered. (Health Canada requires a 0.09 threshold, but the threshold can be customized to meet those recommended by other regulatory authorities as necessary.)
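The filtering in this step can be sketched in a few lines of Python. This is a minimal illustration and not the RLS PROTECT implementation: it assumes risk is measured as one divided by the size of the smallest group of patients who share the same indirect-identifier values, and it reads the 0.09 threshold as 1/11 rounded to two decimals. The scenario names, records, and function names are hypothetical.

```python
from collections import Counter

# Assumption: the 0.09 threshold is read as "at least 11 patients per group"
# (1/11 is about 0.0909, which rounds to 0.09).
MIN_GROUP_SIZE = 11
RISK_THRESHOLD = 1 / MIN_GROUP_SIZE


def risk_of_reidentification(records, quasi_identifiers):
    """Maximum risk: 1 / size of the smallest group sharing the same values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1 / min(groups.values())


def filter_scenarios(scenarios, quasi_identifiers, threshold=RISK_THRESHOLD):
    """Keep only the scenarios whose maximum risk is within the threshold."""
    kept = {}
    for name, records in scenarios.items():
        risk = risk_of_reidentification(records, quasi_identifiers)
        if risk <= threshold:
            kept[name] = risk
    return kept


# Hypothetical data: the same 22 patients under two candidate transformations.
fine = [{"age": str(40 + i), "sex": "F"} for i in range(22)]  # exact ages
coarse = [{"age": "40-50" if i < 11 else "51-61", "sex": "F"} for i in range(22)]

passing = filter_scenarios({"exact_age": fine, "age_band": coarse}, ["age", "sex"])
print(passing)
```

With exact ages every patient is unique, so that scenario is dropped; the banded scenario forms two groups of 11 and is retained.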
In the third step, we prioritize transformation scenarios by the ‘Information Loss’ (IL) metric. This metric is used to identify the optimal anonymization solution with minimal loss of data quality, and it guides the next steps in the process, such as optimization and transformation. The IL metric is also used to rank the different transformation scenarios in terms of data utility/data loss.
For the fourth step, input from the client team plays a critical role. In this step, we optimize transformation options in consultation with clinical scientists, who independently prioritize quasi-identifiers and evaluate the Clinical Utility (CU) based on the context of the drug or condition under study. At this point, clinical scientists from the client team also further review the risk associated with medical events, including minor adverse events, found in the documents.
Then we focus on selecting the optimal transformation scenario using optimal ROR, IL, and CU trade-offs. This selection requires a thoughtful approach as too stringent criteria will suppress data from too many patients. If new information is found that was previously missed, the process may be updated to incorporate new measurements if necessary.
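As a rough illustration of this selection step, the sketch below assumes each scenario already carries an ROR value, an IL value, and a CU score assigned by clinical reviewers. The 0-to-1 scales, the weights, and the scoring rule (CU minus IL) are hypothetical choices made for the example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    ror: float  # risk of re-identification, 0..1
    il: float   # information loss, 0 (none) .. 1 (all information lost)
    cu: float   # clinical utility score from reviewers, 0..1 (hypothetical scale)


def select_scenario(scenarios, risk_threshold=1 / 11, il_weight=1.0, cu_weight=1.0):
    """Pick the eligible scenario with the best utility-versus-loss score."""
    eligible = [s for s in scenarios if s.ror <= risk_threshold]
    if not eligible:
        return None  # nothing meets the threshold; revisit the transformations
    return max(eligible, key=lambda s: cu_weight * s.cu - il_weight * s.il)


candidates = [
    Scenario("exact_age_and_site", ror=0.50, il=0.05, cu=0.95),  # too risky
    Scenario("age_band_and_country", ror=0.08, il=0.30, cu=0.80),
    Scenario("age_suppressed", ror=0.02, il=0.70, cu=0.40),
]
best = select_scenario(candidates)
print(best.name)
```

Returning None when no scenario is eligible mirrors the iterative nature of the process: the transformations are revisited rather than the threshold relaxed.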
The last step is to measure implementation risks post-transformation, i.e. the precision of implementation, to ensure over-anonymization has not occurred. Anonymization is an iterative process, which means fine-tuning may be necessary upon manual review. Our data scientists and client team work together to ensure regulatory requirements are adequately met. This step-by-step approach ensures that the anonymization process does not have a detrimental effect on data utility and any undesirable impacts on data can be caught in a timely manner.
Each clinical trial or program tries to solve a specific clinical problem, and as such, no two clinical trials are alike. A cookie-cutter approach to data transformation is oftentimes not the best way to go. RLS data scientists work closely with the client team and customize data solutions for queries at hand to meet quantitative regulatory requirements. This approach ensures quick approvals from the regulatory authorities and helps complete clinical trials within a given timeframe. Our specialists are available to further discuss data-related challenges and propose solutions.
A Cultural Shift is Happening Before our Eyes - The Impact of the European Medicines Agency's (EMA) CTR & CTIS on Clinical Trial Transparency
The rollout of the Clinical Trial Information System (CTIS) has resulted in a cultural shift in the ways in which study sponsor leaders and operational teams think about clinical trial transparency and how this impacts their planning and operations. A primary aim of the Clinical Trial Regulation (CTR) EU No 536/2014, improving clinical trial transparency, will help to enhance trust and confidence with the public and provide material benefit in aiding and improving research. However, the operational impacts of CTIS, and of how clinical trials are managed in conformance with the Regulation in Europe, are igniting changes that are far-reaching within the organization. These changes will be adopted through well-defined SOPs and a clear understanding of roles and responsibilities for clinical teams and their counterparts. Study sponsors that choose to embrace the change holistically by building a transparency-minded culture throughout the organization will come out on top.
What Does History Tell Us?
Transparency requirements have evolved and progressed for years. In 2017, the European Medicines Agency (EMA) published external guidance for Policy 0070, and in 2019, Health Canada published a Public Release of Clinical Information (HC PRCI) guidance. The pharma industry has gradually become accustomed to disclosing trial results. Some have progressed to sharing beyond what is mandated by regional health authorities. The documents for sharing results may include trial synopsis, layperson summaries, and Clinical Study Reports (CSRs), for example.
In 2022, the EMA launched the CTIS, and in 2023, its use will become mandatory for new clinical trials. The transparency requirements implemented as a result of the CTR have brought several questions and issues to the forefront.
What are the Aims of the Clinical Trial Regulation?
Few regulations have impacted pharmaceutical manufacturers more than Regulation (EU) No 536/2014, otherwise known as the Clinical Trials Regulation or CTR. The CTR aims to centralize the regulatory submission and review process for all trials conducted in the twenty-seven (27) European Union (EU) member states. The intent is to position Europe as a favorable region to conduct clinical trials while increasing the transparency of clinical trials information to the public at large.
The CTIS is the secure online portal that is used for the implementation and operation of the CTR. This portal facilitates interactions between study sponsors, researchers, regulatory bodies and ethics committees throughout the lifecycle of a clinical trial. The public can also access a subset of trial information from the portal. The effort to centralize and streamline the clinical trial process in Europe using the CTIS has had a widespread impact on the operational teams that coordinate and manage clinical trials. To provide further clarity, leading up to the launch of CTIS, the EMA published guidance on using the CTIS, which has completely changed current conversations around clinical trial transparency.
Why is the cultural shift happening?
Apart from the use of CTIS, the transparency requirements alone have caused manufacturers to rethink their approach to clinical transparency and the ways their cross-functional teams think about and manage the regulatory process. For example, CTR’s transparency requirements have brought changes to what trial information and documentation is disclosed to the public and when. Five important needs draw our attention now:
The need to share detailed information about the trial as it is happening rather than after the trial results are analyzed.
The need to adjust processes, roles and responsibilities, standard operating procedures, and critical operational decision-making to ensure the trial adheres to the regulation’s detailed expectations.
The need to rethink regulatory document authoring practices to minimize document redaction workload and drive consistency within and across trials.
The need to protect and anonymize personal data and Commercially Confidential Information published via the CTIS while remaining in compliance with GDPR.
The need to organize cross-functional teams to respond quickly to Member State questions in accordance with the regulation which may include updates to core regulatory documents that require redaction and local language translation.
This combination of requirements has prompted study sponsors to rethink how their teams work together to meet this new set of expectations. As a result, a common understanding of what clinical transparency means, how it’s implemented and supported, and ultimately how it can be embraced must be built not only through roles, procedures, and processes but also through how it is woven into the cultural fabric of the organization.
How do we embrace this cultural shift?
Pharma leaders and regulatory affairs teams are looking for solutions that will help them adapt to the current evolving landscape and seamlessly integrate CTIS into their internal day-to-day workflow. These solutions include digital tools, streamlined processes, and expert teams that can help manage the transition. Real Life Sciences (RLS) has been at the forefront of developing and implementing solutions for current clinical transparency challenges.
This 6-part blog series will focus on the high-impact areas being realized by pharmaceutical manufacturers today. Our experts will share their perspectives and observations on how study sponsors can implement changes to improve operational efficiency and employee satisfaction while embracing the new world of clinical transparency we find ourselves in. Together, these focus areas make up the challenges and opportunities that are upon us in clinical transparency.
Live Webinar: Planning for Publication of Trial Documents and Plain Language Summaries under the CTR
Today's Disclosure and Transparency teams are faced with new pressure points resulting from recent changes in the regulatory space and beyond. The CTR has had an immediate impact on authoring and redaction processes and how clinical teams work together. This session will highlight common pressure points and offer practical tools and solutions in support of preparing ‘for publication’ documents and Plain Language Summaries.
Are we ready for the Clinical Trial Information System (CTIS)?
Industry expectations from a regulatory perspective are evolving as the new Clinical Trial Information System (CTIS) for clinical trials in the European Union (EU) goes live. The CTIS launched on 31 January 2022 to support the requirements of the Clinical Trials Regulation (CTR). Beginning 31 January 2023, study sponsors can apply only under the CTR instead of the prior regulation (the Clinical Trials Directive).
But what’s the difference between CTR and CTIS?
The CTR (EU) No 536/2014 is the new regulation that replaces and expands the EU Clinical Trials Directive 2001/20/EC. The new regulation focuses on three main aims - fostering a favorable environment for conducting clinical trials in the EU, ensuring the highest standards of safety for the study participants, and increasing transparency of clinical trial information i.e. data sharing. The next step for CTR is the Accelerating Clinical Trials in the EU (ACT EU) initiative that focuses on increasing transparency in data sharing among various study sponsors.
The CTIS is a digital portal and database built especially to facilitate the implementation of the CTR. Study sponsors can submit the applications using the portal and also send documents to regulatory authorities throughout the life cycle of their clinical trial. The member states of the EU will use this portal for conducting their daily business processes. The CTIS digital portal will streamline communication between study sponsors and member states of the EU.
Are we ready for the CTIS?
Most study sponsors are dealing with a time-sensitive question - are we ready for the CTIS?
As of July 2022, 195 clinical trial applications had been submitted through the CTIS. By comparison, approximately 4000 clinical trials were approved each year in earlier years. Clearly, study sponsors are still seeking more information and reorganizing their internal processes to make submissions via the CTIS.
The CTR has introduced new standards of transparency and disclosure of in-process and completed trials. Study sponsors now have four major disclosure considerations before their clinical trial application (CTA) submissions. First, personal data and commercially confidential information (CCI) are exempt from disclosure. Study sponsors need to determine what constitutes the CCI for their studies. Secondly, public and non-public versions of documents need to be submitted simultaneously.
Next, after the approval of the study, the documents will become available to the general public. If a deferral was requested, the documents will not be published at the same time as the public version; instead, they will be published per the approved deferral timelines. Note also that the timelines for pediatric studies and adult studies are different. Lastly, a plain language summary must now be provided for Phase 2-4 trials within 12 months of the close of the trial.
Although there are four major disclosure considerations, each of them has several decision points (e.g. what constitutes the CCI?) and these decisions can have an impact on the entire lifecycle of the study. Study sponsors may benefit from having regulatory affairs teams in-house or consulting external agencies that can provide expert opinions about regulatory affairs. Since this is a time-sensitive endeavor, several study sponsors are currently looking for more information.
Where can I find more information?
Some study sponsors struggle with finding reliable and clear information about the CTIS. The European Medicines Agency website is a good place to start. It includes various guideline documents and videos to inform about the new regulations. However, some of these rules can be complicated, and seeking advice from experienced professionals with several years of regulatory experience usually helps.
Recently, at Clinical Data Disclosure Day 2022, a virtual webinar series, the Real Life Sciences (RLS) team presented the CTIS readiness webinar. It offered practical tools and reference materials to meet the disclosure-related requirements during the time crunch. RLS experts also shared insights about how to plan for the publication of trial documents under the CTR. For a copy of the recording, contact RLS.
Clinical Trial Transparency: The Relationship Between Data Utility & The Risk of Re-Identification
Anonymizing patient information is crucial for privacy protection. Due to data sharing initiatives across research institutes and research projects, anonymization has become an important step in clinical research. When research studies require expensive diagnostic resources such as fMRI or genetic testing, research organizations come together to share data for secondary analysis. Regulatory bodies such as the US Food and Drug Administration, Health Canada and EMA encourage data sharing initiatives.
Although there are several factors that contribute to decisions around what and how to anonymize patient information, regulatory requirements continue to be one of the most important factors. Depending on the regulatory requirements, disclosures and submissions have their own unique nuances. As such, the transformation approaches used in one submission may not be applicable to others, and a different approach may be needed.
Since 2019, Health Canada has strongly encouraged employing quantitative risk modeling methodologies. The European Medicines Agency has suggested similar guidelines. These guidelines are reinforced not only by Policy 0070 but also by the Clinical Trials Regulation (Regulation (EU) No 536/2014). In particular, they favor pseudonymization of identifiers over outright suppression or redaction, in order to preserve data utility. When sample sizes are small, i.e. fewer than twenty patients, quantitative risk assessments will often yield the decision to outright suppress many of the identifiers. But the quantitative modeling process can help by generating transformation options for each identifier and providing the supporting evidence and rationale for the anonymization approach taken.
A fundamental problem in privacy-preserving data disclosures is how to make the right tradeoff between protecting patient privacy and data utility. Broadly speaking, patient privacy and data utility may have an inverse relationship. Researchers have observed that even for modest privacy gains complete destruction of the data utility may be needed. Another possible unintended consequence is that excessive protection of privacy, using techniques such as suppression (or removing data), can give misleading results and that could pose a public health risk.
These challenges can be tackled with the help of machine learning technology. Let’s elaborate using Figure 1.
Figure 1. Finding an Acceptable trade-off
In Figure 1, data utility is represented on the x-axis and privacy protection is represented on the y-axis. For researchers, preserving data utility while maintaining patient privacy is the ideal situation. But often the ‘ideal situation’ of maximum patient privacy protection and maximum data utility may be impossible to achieve as it is ill-defined. Researchers try to focus on finding an acceptable ‘trade-off’ or the sweet spot where adequate data is retained without compromising patient privacy.
With the help of statistical modeling, instead of an ideal case scenario, an acceptable trade-off may be computed. This process can sometimes become iterative depending on how much data is lost in each round of quantitative modeling. Therefore, anonymization approaches must counterbalance the level of anonymization with the level of information loss. Typically, this will entail anonymizing certain identifiers with greater levels of anonymization than others, i.e. a tradeoff is made between levels of anonymization across identifiers to maximize the data utility and minimize the loss of information.
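One way to picture this trade-off across identifiers is a small search over generalization levels. The sketch below is illustrative only: the two generalization ladders, the use of the summed level as a stand-in for information loss, and the tiny group-size requirement (3, chosen so the example stays readable) are all assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical generalization ladders: level 0 keeps the most detail.
AGE_LEVELS = [
    lambda age: str(age),                                  # exact age
    lambda age: f"{age // 10 * 10}-{age // 10 * 10 + 9}",  # 10-year band
    lambda age: "*",                                       # suppressed
]
REGION_LEVELS = [
    lambda region: region,                # e.g. "CA-ON" (country-province)
    lambda region: region.split("-")[0],  # country only
    lambda region: "*",                   # suppressed
]


def smallest_group(records, age_level, region_level):
    """Size of the smallest group after applying the chosen levels."""
    groups = Counter(
        (AGE_LEVELS[age_level](r["age"]), REGION_LEVELS[region_level](r["region"]))
        for r in records
    )
    return min(groups.values())


def least_lossy_levels(records, k):
    """Return the (age_level, region_level) pair with the lowest summed level
    whose smallest group has at least k patients, or None if none qualifies."""
    options = [
        levels
        for levels in product(range(len(AGE_LEVELS)), range(len(REGION_LEVELS)))
        if smallest_group(records, *levels) >= k
    ]
    return min(options, key=sum, default=None)


records = [
    {"age": 41, "region": "CA-ON"}, {"age": 43, "region": "CA-ON"},
    {"age": 47, "region": "CA-ON"}, {"age": 42, "region": "CA-QC"},
    {"age": 45, "region": "CA-QC"}, {"age": 48, "region": "CA-QC"},
]
print(least_lossy_levels(records, k=3))
```

Here banding age while leaving region untouched is enough, so region keeps its full detail; a stricter requirement forces coarser levels on both identifiers.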
Data sharing and secondary analyses of research data are becoming common practices. Disclosure of individual-level participant data has become the new normal. While sharing data from clinical trials, sponsors and clinical research organizations (CROs) use various techniques to protect the privacy and confidentiality of patients. These techniques are commonly referred to as anonymization techniques. These techniques have their pros and cons. Some techniques protect patient data but make secondary analyses difficult, while others make secondary analyses possible but carry a risk of patient re-identification.
Here we discuss further details of four popular anonymization techniques.
Anonymization techniques in clinical research
Earlier, the most common method for data anonymization was suppression. Suppression involves redacting or removing the data, so it is no longer readable. This is also known as “Masking” or “Redaction”. However, a major disadvantage of this procedure is that it decreases the data utility for secondary analyses. The research community needs alternative statistical approaches that help preserve data utility along with low re-identification risk.
One of the popular methods is “generalization”. Generalization involves making patient identifiers less granular or more general. For instance, instead of specific patient ages, age ranges are shared. Or instead of specific geographical locations, broader areas such as state, country, or even continent are disclosed. This method is useful when data is relatively homogenous and there are no extreme outliers. Outliers may pose a risk of re-identification of patients.
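A minimal sketch of generalization for the two examples above (ages and locations) might look like this. The band width, the location hierarchy, and the function names are assumptions made for illustration.

```python
def generalize_age(age, width=5):
    """Replace an exact age with a fixed-width range, e.g. 42 -> '40-44'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical location hierarchy, from most to least specific.
LOCATION_LEVELS = ("city", "state", "country", "continent")


def generalize_location(location, level):
    """Keep only the requested level of a location record and anything broader."""
    start = LOCATION_LEVELS.index(level)
    return {key: location[key] for key in LOCATION_LEVELS[start:]}


patient = {
    "age": 42,
    "location": {"city": "Boston", "state": "Massachusetts",
                 "country": "United States", "continent": "North America"},
}
shared = {
    "age": generalize_age(patient["age"]),
    "location": generalize_location(patient["location"], "country"),
}
print(shared)
```

In practice the band width and the disclosed location level would come out of the risk assessment rather than being fixed in advance.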
Another commonly practiced method is pseudonymization. In pseudonymization, new data elements are introduced in place of existing data. For clinical trials, it involves re-codifying a value such as a Patient ID into a different number. This method is typically used for directly-identifying variables such as patient ID, medical record numbers, or even phone numbers (e.g., patient ID 5280 will be shared as 2805). One major advantage of this method is that it keeps the link to the entire study participant data intact. However, it still has a significant re-identification risk.
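The re-coding can be sketched as a consistent lookup from original IDs to random replacements. This is one possible approach, assumed for illustration: it draws random replacement numbers rather than rearranging digits as in the 5280 example, and the class name and six-digit format are hypothetical.

```python
import secrets


class Pseudonymizer:
    """Maps each original ID to a random replacement, consistently."""

    def __init__(self, digits=6):
        self._digits = digits
        self._mapping = {}  # original -> pseudonym; must be kept secret
        self._used = set()

    def pseudonym(self, original_id):
        if original_id not in self._mapping:
            while True:
                number = secrets.randbelow(10 ** self._digits)
                candidate = f"{number:0{self._digits}d}"
                if candidate not in self._used:
                    break
            self._used.add(candidate)
            self._mapping[original_id] = candidate
        return self._mapping[original_id]


pseudonymizer = Pseudonymizer()
visits = [("5280", "screening"), ("5280", "week 4"), ("7731", "screening")]
shared = [(pseudonymizer.pseudonym(pid), visit) for pid, visit in visits]
print(shared)
```

Because the same patient always receives the same pseudonym, records stay linked across visits. That is also why the residual risk remains: anyone who obtains the mapping can reverse it.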
A less popular technique is random noise. This approach involves adding or subtracting random values/amounts from numeric or data-oriented identifiers to make it difficult to determine the original values. Data scientists may sometimes use specific methods such as additive or multiplicative noise. This procedure allows them to reduce the re-identification risk even when outliers or influential observations may be present in the dataset. Since this method requires data science expertise, it may not be easy to find such experts in a niche domain.
Apart from the variety of approaches, sponsors and CROs need to be mindful of regulatory requirements and industry standards. There are suggested data standards such as the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM), and anonymization guidance for data transformation from industry consortia/working groups (e.g. PhUSE). After thoughtful deliberation, various authorities selected these standards because they provide a reasonable tradeoff between removing risk and preserving data utility, and are reasonable to implement with current technologies and tools. The choice of transformations for a variable depends on:
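The difference between additive and multiplicative noise can be shown with a short sketch. The noise ranges, the uniform distribution, and the sample values are assumptions; a fixed seed is used only so the illustration is repeatable, and it would not be appropriate for a real disclosure.

```python
import random


def add_additive_noise(value, max_shift, rng):
    """Shift a value by a random amount drawn uniformly from [-max_shift, max_shift]."""
    return value + rng.uniform(-max_shift, max_shift)


def add_multiplicative_noise(value, max_fraction, rng):
    """Scale a value by a random factor in [1 - max_fraction, 1 + max_fraction]."""
    return value * (1 + rng.uniform(-max_fraction, max_fraction))


rng = random.Random(2024)  # fixed seed only so the illustration is repeatable
weights_kg = [61.5, 70.2, 148.0]  # hypothetical values; the last is an outlier
additive = [add_additive_noise(w, 2.0, rng) for w in weights_kg]
multiplicative = [add_multiplicative_noise(w, 0.05, rng) for w in weights_kg]
print(additive, multiplicative)
```

Additive noise perturbs every value by the same absolute range, while multiplicative noise scales with the value, so a large outlier receives a proportionally larger perturbation.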
Type of variable: Whether the identifier is a Direct Identifier (DI), Quasi-Identifier (QI), or Sensitive Information
Sensitive information: Whether the identifier is Relational or Transactional Data
Type of data: The data type, such as Categorical, Ordinal, or Numeric/Interval
Re-identification risk: Data utility considerations such as the possibility of patient linkage
Sponsors/CROs make anonymization decisions after careful consideration to ensure that they meet regulatory requirements. For example, since 2020, Health Canada has discouraged redactions to promote data sharing. Along with clinical scientists, data scientists weigh in to pick the right anonymization technique. If data expertise is not available in-house, CROs often choose to collaborate with an external data agency such as Real Life Sciences. At RLS, our team of experts treats each clinical trial dataset as a unique challenge and provides customized anonymization solutions. We keep up with the current regulatory requirements and suggest solutions accordingly. These solutions have helped our clients to get quick approval from the regulatory authorities. Connect with us to further discuss clinical trials and the solutions we offer.
Maximizing Data Utility In Clinical Transparency: Outlier Patients
Regulatory authorities such as Health Canada now encourage data sharing for secondary analyses and derivative research. Data sharing with other institutions needs to be done in a safe manner while protecting patient privacy and confidentiality. One method commonly adopted in the recent past was redacting the personally identifiable information. But this approach reduces the data utility. Health Canada now encourages quantitative risk assessment while anonymizing data for safe data sharing.
A common practice in data anonymization is to transform personal information, such as an actual age, into a range in which the actual value falls. For instance, prior to data anonymization, the age variable in a dataset contains a patient's actual age, such as 42 years. During the data anonymization process, 42 years is replaced by a range of 40-45 years. The age range is defined as part of the risk assessment process. Since there will be more such individuals in the dataset, the risk of re-identification is quite low.
This common practice of data anonymization is not well-suited for data points that are extreme. These extreme observations are termed ‘outliers’ and need to be dealt with before the dataset is analyzed. The central tendency of a dataset is indicated by the average or the mean. For instance, the average age of patients in a dataset may be 49 years, while the dataset includes individuals who are 98 years of age or 18 years of age. These extreme values are outliers because they lie far from the mean.
Outliers pose several problems for data analysis. Outliers skew the averages for the group, which may not be a true representation of the patient group or the control group. Also, these extreme values may lead to invalid results when various statistical tests are conducted to analyze the dataset. Sometimes outliers could simply be due to errors in data entry. Hence, sponsors/CROs check data periodically to ensure correct data is entered or if possible, a measure may be repeated to get the normal value.
The bigger problem regarding outliers from a patient privacy perspective is the risk of re-identification. These outliers are a single patient or a small set of patients with extreme attributes, such as a single elderly patient aged 98 years. In these cases, data utility will often suffer, because the patient will not fit into any equivalence class with others. Providing an age range for data anonymization would not be a useful solution, because an adversary may still attempt to re-identify an outlier when a range is provided. The HIPAA ‘safe harbor’ de-identification method addresses such cases for eighteen types of identifiers, such as phone numbers, dates, and ages over 89. Safe harbor means the data for these eighteen identifiers is removed from the dataset.
This strategy often leads to suppression of that attribute for all patients, greatly degrading the data utility. One solution to this issue is to allow the system to treat some patients as outliers and suppress their entire record (all attributes). For instance, a sponsor would suppress all attributes of the single elderly patient. While one might lose a little data utility for that single patient, it might lead to a greater increase in data utility by allowing us to retain age for the rest of the patients.
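The record-level alternative described above can be sketched as follows. This is an illustrative outline under stated assumptions: patients are treated as outliers when fewer than k patients share their quasi-identifier values, and the cohort, field names, and k=11 are hypothetical.

```python
from collections import Counter


def suppress_outlier_records(records, quasi_identifiers, k):
    """Drop whole records whose quasi-identifier combination is shared by
    fewer than k patients; return (kept_records, suppressed_records)."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in records)
    kept = [r for r in records if sizes[key(r)] >= k]
    suppressed = [r for r in records if sizes[key(r)] < k]
    return kept, suppressed


# Hypothetical cohort: ages already generalized to bands, one 98-year-old outlier.
cohort = [{"id": i, "age_band": "40-49"} for i in range(11)]
cohort += [{"id": 11 + i, "age_band": "50-59"} for i in range(11)]
cohort.append({"id": 22, "age_band": "90-99"})

kept, suppressed = suppress_outlier_records(cohort, ["age_band"], k=11)
print(len(kept), [r["id"] for r in suppressed])
```

Only the single outlier record is removed, and the age band is retained for the other 22 patients instead of being suppressed for everyone.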
This approach is especially beneficial for data sharing across smaller populations, which are often more sensitive to single outliers.
Maximizing Data Utility In Clinical Disclosure: Reference Populations
Choosing the right reference population maximizes data utility. While choosing the right reference population, the right approaches and methods further facilitate data sharing and secondary analyses. Without a reference population, the risk of re-identification of patients is real.
When anonymization of patients is done, researchers focus on removing uniqueness, that is, blending in any unique patients remaining in a given dataset. When other similar people with the same characteristics exist in the dataset, an adversary might not be able to definitively re-identify a patient. A widely accepted numerical threshold for similar individuals is 11. (Here, similar individuals are similar with regard to their indirect identifiers.) Different ways to select reference populations are worth further discussion.
The risk of re-identification
Reference population means “the group of individuals used to determine the risk of re-identification”. The reference population of a dataset is the set of persons who are considered to have ‘similar’ characteristics to those being modeled for risk. Here ‘similar’ refers to suffering from the same disease(s) of the trial in question, demographics, a period in time, locations, as well as having participated in a clinical trial for the indication/treatment in question.
Sponsors and Clinical Research Organizations (CROs) need to meet the regulatory risk threshold of 0.09, which demonstrates that there are at least 11 individuals with each set of identifying attributes. Alternatively, this is expressed as a cell size equivalent of 10 individuals. There are multiple ways to demonstrate that there are at least 11 individuals with a set of characteristics, but using the k-anonymity model (k=11 threshold) is one of the most common.
In 2019, Health Canada issued further guidelines about reference population selection. Its guidance document states that:
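The link between the 0.09 threshold and the group size of 11 can be written out explicitly. The notation below is an assumption introduced for illustration: f_j denotes the number of individuals in the reference population who share the j-th combination of identifying attributes, and risk is taken as the maximum over all groups.

```latex
\[
  \text{risk}_{\max} \;=\; \frac{1}{\min_j f_j},
  \qquad
  \min_j f_j \ge 11
  \;\Longrightarrow\;
  \text{risk}_{\max} \le \frac{1}{11} \approx 0.0909 \approx 0.09 .
\]
```

Under this reading, requiring k = 11 in a k-anonymity check and requiring a maximum risk of about 0.09 are two statements of the same condition.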
“The selection of the appropriate reference population determines the total patient group size and the amount of anonymization (i.e. data transformation) that is necessary to reduce the risk of patient re-identification. The reference population can be informed from patients in the single trial in question (smallest population), all patients in similar trials by a specific study sponsor, all patients in similar trials (e.g., by disease or therapeutic intervention category), or all patients in a geographic area (largest population).
When the appropriate reference population is one other than the single trial in question, an extrapolation of the trial population can be applied to achieve an estimate of the population size. In keeping with the first and second guiding principle, risk of re-identification should be informed not only by the number of individuals in a single study, but also by the number that reflects real-world risk.”
Determining a reference population
Since each dataset is unique, a blanket method cannot be applied to selecting the reference population. The selection is done on a case-by-case basis. CROs typically conduct trials with small sample sizes and focus on indications with limited established and pre-existing research. These factors drive the decision-making when selecting reference populations and must be evaluated prior to making the selection.
There are four possibilities for selecting a reference population. The most conservative one is where each individual study serves as its own population. Another method is to combine all the participants. In this method, all the studies in the submission together serve as a “pooled” population. The third method goes beyond the submission and it includes all the studies in the submission plus other recent similar studies for the same indication/treatment. Lastly, the least conservative selection would be having larger geographic / prevalence estimates.
For rare and ultra-rare diseases, a more conservative approach is usually adopted while selecting the reference population. When each individual study serves as its own population, researchers do not assume a wider disease population. This allows them to focus the risk measurement only on the population of the study. Conversely, if there are additional studies with sufficient patient counts, outside of the study population, then geographical estimates are used. But often in the case of ultra-rare diseases that are yet to be extensively researched, such populations are not found. Please see our case study on working with rare disease populations.
As the options above show, selecting a reference population requires a nuanced understanding of the dataset at hand. Overly conservative selection criteria lower the risk of re-identification, whereas less conservative criteria may increase it. At Real Life Sciences, data science experts can provide insights to make the decision-making process easier. Based on input from sponsors, we provide recommendations for selecting the reference population. These experts can then anonymize sensitive variables in the dataset per regulatory requirements. We welcome further discussions about clinical trial datasets for rare or common diseases.
Anonymization Primer: Risk Thresholds for Patient Re-identification
As data sharing becomes the new norm, protecting patient privacy and confidentiality presents new challenges. Previously, redaction (blocking out sensitive information) was common practice in clinical trial datasets. However, it made secondary analysis almost impossible. Regulatory authorities such as Health Canada have issued new guidelines for de-identifying data. These guidelines ensure that patient privacy is protected while sufficient data is still shared for derivative research projects.
While de-identifying data, sponsors and clinical research organizations are expected to comply with re-identification risk thresholds. Here we discuss how sources of the risk of re-identification influence quantitative thresholds of de-identification.
Sources of the risk of re-identification
Patient re-identification risk can arise from various sources. One common source is cell size, which refers to “the number of patients with the same indirectly-identifying variable values”. Small cells (frequencies less than 5 or 6) are often considered a high risk for re-identification of patients and hence unacceptable. In smaller datasets, these risks are evident, and some research centers may even require suppression of these variables. In a large dataset where frequencies are higher, there may still be a potential risk of re-identification. The risk of re-identification increases with the uniqueness of the record.
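Cell sizes can be counted directly from the data. The sketch below, in Python, groups a small set of hypothetical records by their indirectly-identifying values and lists the cells that fall below a minimum size. The records, the choice of variables, and the cutoff of 5 are assumptions made for illustration; the appropriate cutoff depends on the applicable guidance.

```python
from collections import Counter

# Hypothetical records, each a tuple of (age band, sex, region).
records = (
    [("40-49", "F", "ON")] * 6
    + [("50-59", "M", "QC")] * 5
    + [("60-69", "F", "BC")] * 2
    + [("70-79", "M", "NS")] * 1
)


def small_cells(rows, min_size=5):
    """Return each combination of values whose cell size is below min_size."""
    counts = Counter(rows)
    return {combo: n for combo, n in counts.items() if n < min_size}


flagged = small_cells(records, min_size=5)
for combo, n in flagged.items():
    print(f"cell {combo} has only {n} patient(s)")
```

Here the two smallest cells are flagged, while the cells with 5 and 6 patients pass the assumed cutoff.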
Another factor that creates risk is the visibility of the dataset. If a research dataset is public, an adversary may launch a ‘demonstration attack’ to gain publicity, so public release requires the maximum possible de-identification. For non-public datasets, the de-identification process is influenced by how partial health information can be linked with public datasets such as credit data or consumer records. An adversary may link two such unrelated datasets and re-identify patients. Sponsors and CROs are often aware of the sensitive variables in a dataset and may have taken the necessary steps to de-identify them. But an adversary may still link a sufficient number of less sensitive variables to re-identify patients.
Lastly, personal knowledge or press accounts of an event (e.g., suicide attempt by a celebrity) may enable an adversary to search for that high-profile subject in a research dataset. Although press accounts or personal knowledge cannot be controlled, sponsors and CROs can make patient re-identification difficult by de-identifying data at appropriate statistical risk thresholds.
Determining Risk Threshold
Statistical computation of risk thresholds can help determine low, medium, or high levels of risk. In 2016, the Information and Privacy Commissioner of Ontario (IPCO) issued de-identification guidelines for structured data that help determine an acceptable re-identification risk threshold. If the re-identification risk of the data is higher, or the threshold for re-identification risk is stricter, then a greater amount of de-identification becomes necessary to protect patient identity. The risk threshold is defined as “the minimum amount of de-identification that must be applied to a data set in order for it to be considered de-identified”. Risk thresholds may vary with the circumstances of data sharing. For instance, if data is shared in a controlled environment under a signed data sharing agreement, the risk threshold required of the data may be less stringent, as the investigator using it legally agrees not to re-identify individuals. Here we consider risk thresholds as suggested by regulatory authorities.
Based on cell sizes, the IPCO has also provided guidelines for re-identification risk thresholds.
Table 1. The IPCO guideline for re-identification risk threshold. Columns: Invasion of Privacy; Re-identification Risk Threshold; Cell Size Equivalent.
Various researchers have proposed threshold values ranging from 0.3 to 0.01. However, Health Canada currently recommends a threshold of 0.09, or a 9% risk of patient re-identification. A 9% risk corresponds to a target cell size of 11 patients, since 1/11 ≈ 0.09. The EMA Policy 0070 External Guidance also recommends 0.09 as a threshold, but its implementation is not yet widespread. These quantitative thresholds help ensure that data utility is preserved. They also make the de-identification process less subjective and more precise, which allows artificial intelligence tools to perform the task.
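The relationship between the 0.09 threshold and a cell size of 11 can be checked with a few lines of Python. This is a minimal sketch that assumes the risk for a record is the reciprocal of its cell size and that a dataset is judged by its smallest cell; the function names are hypothetical, and the sketch does not cover other ways of aggregating risk across records.

```python
def risk_for_cell_size(cell_size: int) -> float:
    """Risk for a record in a cell of the given size, taken as 1 / cell size."""
    if cell_size < 1:
        raise ValueError("cell size must be at least 1")
    return 1.0 / cell_size


def meets_target(cell_sizes, target_cell_size=11):
    """True when the smallest cell reaches the target cell size."""
    if not cell_sizes:
        raise ValueError("at least one cell size is required")
    return min(cell_sizes) >= target_cell_size


# A cell of 11 patients gives a risk of 1/11, which rounds to 0.09.
print(round(risk_for_cell_size(11), 2))  # 0.09

print(meets_target([11, 15, 30]))  # True
print(meets_target([11, 4, 30]))   # False: one cell has only 4 patients
```

Expressing the check in terms of cell size rather than the rounded decimal avoids ambiguity, because 1/11 is slightly above 0.09 before rounding.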
Ensuring risk thresholds are met can be a complicated task. Fortunately, since risk thresholds are expressed quantitatively, data science experts can help with the task. As part of the data utility considerations, we actively collaborate with sponsors to ensure that key variables for the disclosure are best maintained. Additionally, a clinical trial conducted within a small and sensitive population requires special consideration, and our experts can help with the risk thresholds. Our team of specialists can help sponsors de-identify data at the risk thresholds determined by the applicable regulatory authority. Real Life Sciences has developed tools for de-identifying data using Natural Language Processing technology. The RLS Protect platform makes the process of de-identification at a specific threshold easier and more accurate. Real Life Sciences offers tailor-made solutions for data de-identification. Contact Real Life Sciences.
Risk Thresholds and Equivalence Class Sizes Based on Types of Disclosure
Anonymization Primer - Health Canada PRCI: Personally Identifiable Information
Health Canada has published guidance documents for new drug submissions and device applications. In particular, these guidelines provide direction for safely sharing information gathered during clinical trials. Using these guidelines, data sharing for secondary analyses is possible while protecting patient privacy and confidentiality.
To protect patient identity and reduce the risk of re-identification, Health Canada now requires anonymization or de-identification of data. In its Anonymization Report Template, Health Canada asks sponsors to classify the variables considered personal information into two main categories: directly identifying and indirectly identifying. Furthermore, sponsors are required to state and justify the reasons for describing information as personal information.
Health Canada defines directly-identifying variables as identifying information that is replicable, distinguishable and knowable. The most common examples of directly-identifying variables are name, medical record number, telephone number, other numbers such as social security number, and address.
Although Health Canada uses this three-pronged test for directly-identifying variables, other regulatory publications such as the Health & Human Services (HHS) De-identification Guidelines have used it for both directly and indirectly-identifying variables.
Replicability of data can be an important factor. For instance, blood sugar levels fluctuate throughout the day and over months and years. A patient is therefore unlikely to be identified based solely on their blood sugar levels. However, anatomical anomalies identified in a CT scan or MRI are relatively stable over time unless the patient receives treatment. These findings can be replicated with certainty and increase the risk of patient re-identification.
Another aspect is distinguishability of the data. In clinical trials, the indication is often the same for all patients. Hence, indication alone cannot distinguish between patients. But if indirectly-identifying variables such as Date of Birth, Gender, and 5-Digit ZIP Code are known, then an individual can be re-identified.
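The point about distinguishability can be illustrated with a short Python sketch that measures what fraction of records is unique for a given set of variables. The patient rows, field names, and function name are hypothetical and exist only to show the contrast between indication alone and a combination of indirect identifiers.

```python
from collections import Counter

# Hypothetical patients in a single-indication trial.
patients = [
    {"indication": "T2D", "dob": "1970-03-14", "gender": "F", "zip5": "10001"},
    {"indication": "T2D", "dob": "1968-11-02", "gender": "F", "zip5": "10001"},
    {"indication": "T2D", "dob": "1975-07-21", "gender": "M", "zip5": "10027"},
    {"indication": "T2D", "dob": "1981-01-09", "gender": "M", "zip5": "10027"},
]


def unique_fraction(rows, variables):
    """Fraction of rows whose combination of the given variables is unique."""
    keys = [tuple(row[v] for v in variables) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


print(unique_fraction(patients, ["indication"]))              # 0.0
print(unique_fraction(patients, ["indication", "gender"]))    # 0.0
print(unique_fraction(patients, ["dob", "gender", "zip5"]))   # 1.0
```

In this toy data, no patient is distinguishable by indication or gender, but every patient becomes unique once date of birth, gender, and ZIP code are combined.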
Lastly, knowability of the data depends on who the adversary is. For example, an adversary may know all of a subject's demographic information and have access to data in public sources such as census data, claims, or self-disclosures via social media. This increases the risk of re-identification. On the other hand, most lab reports are not disclosed in the public domain and carry a lower risk of re-identification.
Indirectly-identifying variables are defined as information that can identify an individual when several indirect identifiers are combined. When information such as medical history, height, weight, body mass index (BMI), race, or gender is available, someone who knows a patient in a clinical trial may be able to identify them.
Research sponsors and Clinical Research Organizations (CROs) may collect sensitive information such as sexual orientation, addictions, and illegal behaviors. These variables may also be considered as indirectly-identifying variables. Disclosure of sensitive information can be permitted only under special circumstances.
When indirectly-identifying variables are present in a structured database, de-identifying them is easier than in unstructured text data. Electronic Health Records (EHR) are an example of unstructured text commonly used for research purposes. Unstructured texts also include focus groups, interviews, and any other type of clinical narrative from stakeholders such as family members or allied health professionals. Some sponsors may have a dedicated data science department to conduct the de-identification process. But start-up companies, or companies that do not have expert data teams, may often need to collaborate with external agencies.
At Real Life Sciences, we help sponsors comply with these regulatory requirements using cutting-edge technology such as natural language processing (NLP). Our RLS Protect platform, which was built using NLP technology, facilitates the process of de-identifying variables. The RLS Protect platform has successfully de-identified thousands of clinical trial datasets. For complicated clinical trial scenarios, our team of data experts collaborates with the sponsor team to expedite the de-identification process. In particular, RLS commonly uses the PHUSE CDISC SDTM data standards, which provide a detailed and low-level view of what the PHUSE working group classifies as directly-identifying and indirectly-identifying variables. The RLS data team welcomes your queries about de-identification and can provide solutions tailor-made for your datasets.