Clinical Trial Transparency: The Relationship Between Data Utility & The Risk of Re-Identification

Anonymizing patient information is crucial for privacy protection. With the growth of data sharing initiatives across research institutes and research projects, anonymization has become an important step in clinical research. When research studies require expensive diagnostic resources such as fMRI or genetic testing, research organizations often come together to share data for secondary analysis. Regulatory bodies such as the US Food and Drug Administration (FDA), Health Canada, and the European Medicines Agency (EMA) encourage data sharing initiatives.

Although several factors contribute to decisions around what and how to anonymize patient information, regulatory requirements continue to be among the most important. Depending on those requirements, disclosures and submissions have their own unique nuances. As such, the transformation approaches used in one submission may not be applicable to another, and a different approach may be needed.

Since 2019, Health Canada has strongly encouraged quantitative risk modeling methodologies, and the European Medicines Agency has suggested similar guidelines, reinforced not only by Policy 0070 but also by the Clinical Trials Regulation (Regulation (EU) No 536/2014). In particular, both favor pseudonymization of identifiers, as opposed to outright suppression or redaction, to preserve data utility. When sample sizes are small (i.e., fewer than twenty patients), quantitative risk assessments will often yield the decision to suppress many of the identifiers outright. Even then, the quantitative modeling process can help by generating transformation options for each identifier and providing the supporting evidence and rationale for the anonymization approach taken.

A fundamental problem in privacy-preserving data disclosures is how to make the right tradeoff between protecting patient privacy and preserving data utility. Broadly speaking, patient privacy and data utility have an inverse relationship. Researchers have observed that even modest privacy gains can require the near-complete destruction of data utility. Another possible unintended consequence is that excessive privacy protection, using techniques such as suppression (removing data), can produce misleading results, which could pose a public health risk.

These challenges can be tackled with the help of machine learning technology. Let’s elaborate using Figure 1.      

Figure 1. Finding an acceptable trade-off

In Figure 1, data utility is represented on the x-axis and privacy protection on the y-axis. For researchers, preserving data utility while maintaining patient privacy is the ideal situation. But the ‘ideal situation’ of maximum patient privacy protection and maximum data utility is often impossible to achieve because it is ill-defined. Researchers instead focus on finding an acceptable ‘trade-off’: the sweet spot where adequate data is retained without compromising patient privacy.

With the help of statistical modeling, an acceptable trade-off, rather than an ideal-case scenario, can be computed. This process can become iterative depending on how much data is lost in each round of quantitative modeling. Anonymization approaches must therefore counterbalance the level of anonymization with the level of information loss. Typically, this entails anonymizing certain identifiers more heavily than others; that is, a tradeoff is made between levels of anonymization across identifiers to maximize data utility and minimize information loss.
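As a minimal sketch of what one round of such quantitative modeling might look like, the hypothetical Python snippet below measures the maximum re-identification risk as 1/(size of the smallest equivalence class) over a set of quasi-identifiers, then widens an age band until a target threshold is met. The column names, band widths, and threshold are illustrative assumptions, not a prescribed method.

```python
import pandas as pd

def max_risk(df, quasi_identifiers):
    """Maximum re-identification risk = 1 / size of the smallest
    equivalence class formed by the quasi-identifiers."""
    return 1.0 / df.groupby(quasi_identifiers).size().min()

def generalize_until_safe(df, threshold=0.09, band_widths=(5, 10, 20)):
    """Iteratively widen the AGE band until the risk threshold is met.
    AGE, SEX, and COUNTRY are assumed column names."""
    for width in band_widths:
        trial = df.copy()
        trial["AGE"] = (trial["AGE"] // width) * width  # e.g. 42 -> 40 for width 5
        if max_risk(trial, ["AGE", "SEX", "COUNTRY"]) <= threshold:
            return trial, width
    # No band width met the threshold: fall back to suppressing AGE entirely
    trial = df.copy()
    trial["AGE"] = None
    return trial, None
```

Each round either yields a dataset that meets the threshold or escalates to a stronger transformation, mirroring the iterative counterbalancing described above.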

At Real Life Sciences (RLS), we have developed the industry-leading, purpose-built anonymization platform RLS Protect. The RLS team of data anonymization experts collaborates with clients to determine the balance between the risk of re-identification and data utility. We welcome any and all queries related to your clinical trial transparency efforts.


Anonymization Techniques

Data sharing and secondary analyses of research data are becoming common practices. Disclosure of individual-level participant data has become the new normal. While sharing data from clinical trials, sponsors and clinical research organizations (CROs) use various techniques to protect the privacy and confidentiality of patients. These techniques are commonly referred to as anonymization techniques. These techniques have their pros and cons. Some techniques protect patient data but make secondary analyses difficult, while others make secondary analyses possible but carry a risk of patient re-identification. 

Here we discuss further details of four popular anonymization techniques.  

Anonymization techniques in clinical research

Historically, the most common method for data anonymization was suppression. Suppression involves redacting or removing the data so it is no longer readable; it is also known as “masking” or “redaction”. However, a major disadvantage of this procedure is that it decreases the data utility for secondary analyses. The research community needs alternative statistical approaches that help preserve data utility while keeping re-identification risk low.

One popular method is “generalization”. Generalization involves making patient identifiers less granular or more general. For instance, instead of specific patient ages, age ranges are shared; instead of specific geographical locations, broader areas such as state, country, or even continent are disclosed. This method is useful when data is relatively homogeneous and there are no extreme outliers, since outliers may pose a risk of patient re-identification.
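As a minimal illustration of generalization, assuming a pandas DataFrame with AGE and COUNTRY columns (both names, and the region mapping, are invented for this sketch):

```python
import pandas as pd

df = pd.DataFrame({"AGE": [42, 47, 61], "COUNTRY": ["Canada", "France", "Japan"]})

# Replace exact ages with five-year ranges, e.g. 42 -> "40-44"
df["AGE_RANGE"] = pd.cut(df["AGE"], bins=range(0, 111, 5), right=False,
                         labels=[f"{b}-{b + 4}" for b in range(0, 110, 5)])

# Replace specific countries with a broader region
region = {"Canada": "North America", "France": "Europe", "Japan": "Asia"}
df["REGION"] = df["COUNTRY"].map(region)

df = df.drop(columns=["AGE", "COUNTRY"])  # keep only the generalized values
```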

Another commonly practiced method is pseudonymization, in which new data elements are introduced in place of existing ones. For clinical trials, it involves re-coding a value such as a patient ID into a different number. This method is typically used for directly-identifying variables such as patient IDs, medical record numbers, or even phone numbers (e.g., patient ID 5280 might be shared as 2805). One major advantage of this method is that it keeps the link across a participant’s entire study data intact. However, it still carries a significant re-identification risk.
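A minimal sketch of pseudonymization, assuming randomly generated replacement codes (the code format is an illustrative choice; in practice the mapping must be collision-free and stored securely):

```python
import secrets

def build_pseudonym_map(patient_ids):
    """Assign each original ID a new random code; the mapping is reused
    so that all records for a patient stay linked."""
    return {pid: f"P{secrets.randbelow(10**6):06d}" for pid in patient_ids}

pseudonyms = build_pseudonym_map(["5280", "5281", "5282"])
new_id = pseudonyms["5280"]  # the same code is used everywhere this ID appears
```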

A less popular technique is random noise, which involves adding or subtracting random amounts from numeric or date-oriented identifiers to make it difficult to determine the original values. Data scientists may use specific methods such as additive or multiplicative noise. This procedure reduces re-identification risk even when outliers or influential observations are present in the dataset. Since this method requires data science expertise, and such experts may be hard to find in a niche domain, it is less widely adopted.
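A minimal sketch of both variants, assuming numeric ages and illustrative noise scales:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
ages = np.array([42.0, 57.0, 98.0])

# Additive noise: shift each value by a small random amount
additive = ages + rng.normal(loc=0.0, scale=2.0, size=ages.shape)

# Multiplicative noise: scale each value by a factor near 1
multiplicative = ages * rng.normal(loc=1.0, scale=0.05, size=ages.shape)
```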

Regulatory requirements

Apart from choosing among a variety of approaches, sponsors and CROs need to be mindful of regulatory requirements and industry standards. There are suggested data standards such as the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM), along with anonymization guidance for data transformation from industry consortia/working groups (e.g., PhUSE). After thoughtful deliberation, various authorities selected these standards because they provide a reasonable tradeoff: they remove risk, preserve data utility, and are reasonable to implement with current technologies and tools. The choice of transformations for a given variable depends on these standards and the regulatory requirements discussed below.

Sponsors/CROs make anonymization decisions after careful consideration to ensure they meet regulatory requirements. For example, since 2020 Health Canada has discouraged redactions in order to promote data sharing. Along with clinical scientists, data scientists weigh in to pick the right anonymization technique. If data expertise is not available in-house, CROs often choose to collaborate with an external data agency such as Real Life Sciences. At RLS, our team of experts treats each clinical trial dataset as a unique challenge and provides customized anonymization solutions. We keep up with current regulatory requirements and suggest solutions accordingly. These solutions have helped our clients obtain quick approval from regulatory authorities. Connect with us to further discuss clinical trials and the solutions we offer.

Maximizing Data Utility In Clinical Transparency: Outlier Patients

Regulatory authorities such as Health Canada now encourage data sharing for secondary analyses and derivative research. Data sharing with other institutions needs to be done in a safe manner that protects patient privacy and confidentiality. One method commonly adopted in the recent past was redacting personally identifiable information, but this approach reduces data utility. Health Canada now encourages quantitative risk assessment when anonymizing data for safe data sharing.

A common practice in data anonymization is to transform personal information, such as an actual age, into a range in which the actual value falls. For instance, prior to anonymization, a dataset’s age variable contains a patient’s actual age, such as 42 years. During the anonymization process, 42 years is replaced by a range of 40-45 years, with the range defined as part of the risk assessment process. Since more individuals in the dataset will share that range, the risk of re-identification is quite low.

This common practice is not well suited to extreme data points. These extreme observations are termed ‘outliers’ and need to be dealt with before the dataset is analyzed. The central tendency of a dataset is indicated by the average, or mean; for instance, the average age of patients in a dataset might be 49 years. However, the dataset may also include individuals who are 98 or 18 years of age. These extreme values are outliers because they lie far from the mean.

Outliers pose several problems for data analysis. They skew group averages, which may then not be a true representation of the patient group or the control group. These extreme values may also lead to invalid results when statistical tests are run on the dataset. Sometimes outliers are simply due to data entry errors; hence, sponsors/CROs check data periodically to ensure correct data is entered or, where possible, repeat a measurement to obtain the correct value.

The bigger problem with outliers, from a patient privacy perspective, is the risk of re-identification. Outliers are a single patient or a small set of patients with extreme attributes, such as a lone elderly patient aged 98 years. In these cases data utility will often suffer, because the patient does not fit into any equivalence class with others. Providing an age range is not a useful solution here, since an adversary may still attempt to re-identify an outlier within a range. The HIPAA regulation encourages addressing outliers across eighteen direct identifiers (such as ages over 89, phone numbers, and specific dates) at a ‘safe harbor’ anonymization level, meaning the data for these eighteen identifiers is removed from the dataset.

This strategy often leads to suppression of an attribute for all patients, greatly degrading data utility. One solution is to allow the system to treat some patients as outliers and suppress their entire record (all attributes). For instance, a sponsor would suppress all attributes of the single elderly patient. While one loses a little data utility for that single patient, doing so can lead to a greater overall increase in data utility by allowing age to be retained for the rest of the patients.
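A hypothetical sketch of this record-level suppression: instead of suppressing the AGE column for everyone, flag records whose age lies far from the mean and suppress those records entirely. The z-score cutoff and column name are illustrative assumptions.

```python
import pandas as pd

def suppress_outlier_records(df, column="AGE", z_cutoff=3.0):
    """Suppress the full record for extreme outliers so the attribute
    can be retained for everyone else."""
    z = (df[column] - df[column].mean()) / df[column].std()
    is_outlier = z.abs() > z_cutoff
    kept = df.loc[~is_outlier]            # attribute retained for these patients
    suppressed = df.loc[is_outlier].copy()
    suppressed.loc[:, :] = None           # all attributes removed for outliers
    return pd.concat([kept, suppressed])
```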

Using this approach to increase data utility is especially beneficial when sharing data from smaller populations, which are often more sensitive to single outliers.

Data-related decisions can be difficult, as they may have unintended consequences down the road. Sponsors/CROs often engage expert services for consulting and for addressing queries from regulatory bodies. Real Life Sciences (RLS) provides expert advice on challenges such as outliers to meet the requirements of regulatory authorities. RLS data scientists help identify outliers and suppress their data as needed.

Maximizing Data Utility In Clinical Disclosure: Reference Populations

Choosing the right reference population maximizes data utility, and the right approaches and methods for choosing it further facilitate data sharing and secondary analyses. Without an appropriately chosen reference population, the risk of re-identifying patients is real.

When anonymizing patients, researchers focus on blending in any unique patients remaining in a given dataset, that is, on removing uniqueness. When other people with the same characteristics exist in the dataset, an adversary may not be able to definitively re-identify a patient. A widely accepted numerical threshold for similar individuals is 11 (where ‘similar’ refers to their indirect identifiers). The different ways to select reference populations are worth further discussion.

The risk of re-identification

A reference population is “the group of individuals used to determine the risk of re-identification”: the set of persons considered to have ‘similar’ characteristics to those being modeled for risk. Here ‘similar’ refers to suffering from the same disease(s) as in the trial in question, sharing demographics, a period in time, and locations, as well as having participated in a clinical trial for the indication/treatment in question.

Sponsors and Clinical Research Organizations (CROs) need to meet the regulatory risk threshold of 0.09 by demonstrating that there are at least 11 individuals with each set of identifying attributes; equivalently, each patient shares those attributes with at least 10 other individuals. There are multiple ways to demonstrate this, but the k-anonymity model (with a k=11 threshold) is one of the most common.
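A minimal k-anonymity check under these assumptions (pandas DataFrame, invented quasi-identifier column names) might look like:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k=11):
    """True if every combination of quasi-identifier values is shared by
    at least k records (k=11 corresponds to a max risk of 1/11, about 0.09)."""
    return df.groupby(quasi_identifiers).size().min() >= k

# Illustrative usage with assumed column names:
# ok = is_k_anonymous(trial_df, ["AGE_RANGE", "SEX", "COUNTRY"], k=11)
```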

In 2019, Health Canada issued further guidelines about reference population selection. Its guidance document states that:

“The selection of the appropriate reference population determines the total patient group size and the amount of anonymization (i.e. data transformation) that is necessary to reduce the risk of patient re-identification. The reference population can be informed from patients in the single trial in question (smallest population), all patients in similar trials by a specific study sponsor, all patients in similar trials (e.g., by disease or therapeutic intervention category), or all patients in a geographic area (largest population).

When the appropriate reference population is one other than the single trial in question, an extrapolation of the trial population can be applied to achieve an estimate of the population size. In keeping with the first and second guiding principle, risk of re-identification should be informed not only by the number of individuals in a single study, but also by the number that reflects real-world risk.”

Determining a reference population

Since each dataset is unique, a blanket method cannot be applied to selecting the reference population. The selection is done on a case-by-case basis. CROs typically conduct trials with small sample sizes and focus on indications with limited established and pre-existing research. These factors drive the decision-making when selecting reference populations and must be evaluated prior to making the selection.

There are four possibilities for selecting a reference population. The most conservative is for each individual study to serve as its own population. Another method is to combine all the participants, so that all the studies in the submission together serve as a “pooled” population. The third method goes beyond the submission, including all the studies in the submission plus other recent similar studies for the same indication/treatment. Lastly, the least conservative selection uses larger geographic/prevalence estimates.

For rare and ultra-rare diseases, a more conservative approach is usually adopted while selecting the reference population. When each individual study serves as its own population, researchers do not assume a wider disease population. This allows them to focus the risk measurement only on the population of the study. Conversely, if there are additional studies with sufficient patient counts, outside of the study population, then geographical estimates are used. But often in the case of ultra-rare diseases that are yet to be extensively researched, such populations are not found. Please see our case study on working with rare disease populations.

As you can see, selecting a reference population requires a nuanced understanding of the dataset at hand. Overly conservative selection criteria drive the risk of re-identification lower (at a cost to data utility), while less conservative criteria may increase that risk. At Real Life Sciences, data science experts can provide insights to make the decision-making process easier. Based on input from sponsors, we provide recommendations for selecting the reference population, and our experts can then anonymize sensitive variables in the dataset per regulatory requirements. We welcome further discussions about clinical trial datasets for rare or common diseases.

Anonymization Primer: Risk Thresholds for Patient Re-identification

As data sharing becomes the new norm, protecting patient privacy and confidentiality presents new challenges. Redacting, or blocking out, sensitive information in clinical trial datasets was once common practice; however, it made secondary analysis almost impossible. Regulatory authorities such as Health Canada have issued new guidelines for de-identifying data. These guidelines ensure patient privacy is protected while sufficient data is shared for derivative research projects.

While de-identifying data, sponsors and clinical research organizations are expected to comply with re-identification risk thresholds. Here we discuss how sources of the risk of re-identification influence quantitative thresholds of de-identification.  

Sources of the risk of re-identification

Patient re-identification risk can arise from various sources. One common factor is cell size, which refers to “the number of patients with the same indirectly-identifying variable values”. Small cells (frequencies less than 5 or 6) are often considered high risk for re-identification and hence unacceptable. In smaller datasets these risks are evident, and some research centers may even require suppression of the affected variables. Even in a large dataset where frequencies are higher, a potential risk of re-identification remains: the risk increases with the uniqueness of a record.

Another factor that creates risk is the visibility of the dataset. If a research dataset is public, an adversary may launch a ‘demonstration attack’ to gain publicity; this requires the maximum possible de-identification. For non-public datasets, the de-identification process must consider how partial health information could be linked with public datasets such as credit data or consumer records: an adversary may join two seemingly unrelated datasets and re-identify patients. Sponsors and CROs are often aware of the sensitive variables in a dataset and take the necessary steps to de-identify them, but an adversary may link a sufficient number of less sensitive variables to re-identify patients.
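As a toy illustration of such a linkage attack (all data and column names are invented for this sketch), joining a ‘de-identified’ clinical extract to a public record on shared quasi-identifiers can re-attach a name:

```python
import pandas as pd

# "De-identified" clinical extract: direct identifiers removed,
# but quasi-identifiers retained
clinical = pd.DataFrame({
    "birth_year": [1956], "sex": ["F"], "zip3": ["902"],
    "diagnosis": ["condition X"],
})

# Public record (e.g., a voter roll or consumer database) with names
public = pd.DataFrame({
    "name": ["Jane Doe"], "birth_year": [1956], "sex": ["F"], "zip3": ["902"],
})

# Joining on the shared quasi-identifiers resolves the identity
linked = clinical.merge(public, on=["birth_year", "sex", "zip3"])
print(linked[["name", "diagnosis"]])
```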

Lastly, personal knowledge or press accounts of an event (e.g., suicide attempt by a celebrity) may enable an adversary to search for that high-profile subject in a research dataset. Although press accounts or personal knowledge cannot be controlled, sponsors and CROs can make patient re-identification difficult by de-identifying data at appropriate statistical risk thresholds.   

Determining Risk Threshold

Statistical computation of risk thresholds can help determine low, medium, or high levels of risk. In 2016, the Information and Privacy Commissioner of Ontario (IPCO) issued de-identification guidelines for structured data that help determine an acceptable re-identification risk threshold. The higher the re-identification risk of the data, the greater the amount of de-identification necessary to protect patient identity. The risk threshold is defined as “the minimum amount of de-identification that must be applied to a data set in order for it to be considered de-identified”. Risk thresholds may vary by the circumstances of data sharing. For instance, if data is shared in a controlled environment under a signed data sharing agreement, the risk threshold required on the data may be lower, as the investigator using it legally agrees not to re-identify individuals. Here we consider risk thresholds as suggested by regulatory authorities.

Based on cell sizes, the IPCO has also provided guidelines for re-identification risk thresholds.

Invasion of Privacy | Re-identification Risk Threshold | Cell Size Equivalent
Low                 | 0.1                              | 10
Medium              | 0.075                            | 15
High                | 0.05                             | 20

Table 1. The IPCO guideline for re-identification risk threshold. 

Various researchers have proposed threshold values ranging from 0.3 to 0.01. Health Canada currently recommends a threshold of 0.09, i.e., a 9% risk of patient re-identification, which corresponds to a target cell size of 11 patients. The EMA Policy 0070 External Guidance also recommends 0.09 as a threshold, though its implementation is not yet widespread. These quantitative thresholds help ensure that data utility is preserved. They also make the de-identification process less subjective and more precise, which allows artificial intelligence tools to assist with the task.

Ensuring risk thresholds are met can be a complicated task. Fortunately, since risk thresholds are expressed quantitatively, data science experts can help. As part of the data utility considerations, we actively collaborate to ensure the key variables for the disclosure are best maintained. Additionally, if a clinical trial is conducted within a small and sensitive population, it requires special consideration, and our experts can help with the risk thresholds. Our team of specialists can help sponsors de-identify data at the risk thresholds determined by the applicable regulatory authority. Real Life Sciences has developed tools for de-identifying data using Natural Language Processing technology; the RLS Protect platform makes de-identification at a specific threshold easier and more accurate. Real Life Sciences offers tailor-made solutions for data de-identification. Contact Real Life Sciences.

Risk Thresholds and Equivalence Class Sizes Based on Types of Disclosure

Anonymization Primer - Health Canada PRCI: Personally Identifiable Information

Health Canada has published guidance documents for new drug submissions and device applications. In particular, these guidelines provide direction for sharing information gathered during clinical trials in a safe manner. Using these guidelines, data sharing for secondary analyses is possible while protecting patient privacy and confidentiality.

To protect patient identity and reduce the risk of re-identification, Health Canada now requires anonymization or de-identification of data. In its Anonymization Report Template, Health Canada asks sponsors to classify the variables considered personal information into two main categories: directly-identifying and indirectly-identifying. Furthermore, sponsors are required to state and justify the reasons for describing information as personal information.

Directly-identifying variables

Health Canada defines directly-identifying variables as identifying information that is replicable, distinguishable, and knowable. The most common examples of directly-identifying variables are name, medical record number, telephone number, address, and other numbers such as social security number.

Although Health Canada uses this three-pronged test for directly-identifying variables, other regulatory publications, such as the Health & Human Services (HHS) De-identification Guidelines, have applied it to both directly- and indirectly-identifying variables.

Replicability of data can be an important factor. For instance, blood sugar levels fluctuate throughout the day and over months and years, so it is unlikely that a patient can be identified based solely on their blood sugar levels. However, anatomical anomalies identified in a CT scan or MRI are relatively stable over time unless the patient receives treatment. Such findings can be replicated with certainty and increase the risk of patient re-identification.

Another aspect is the distinguishability of the data. In clinical trials, the indication is often the same for all patients, so indication alone cannot distinguish between patients. But if indirectly-identifying variables such as date of birth, gender, and 5-digit ZIP code were known, then an individual could be re-identified.

Lastly, the knowability of the data depends on who the adversary is. For example, an adversary may know all of a subject's demographic information and have access to public sources such as census data, claims data, or self-disclosures via social media; this increases the risk of re-identification. On the other hand, most lab reports are not disclosed in the public domain and carry a lower risk of re-identification.

Indirectly-identifying variables

Indirectly-identifying variables are defined as identifying information that can identify an individual through a combination of indirect identifiers. When information such as medical history, height, weight, body mass index (BMI), race, or gender is available, it may be possible for someone who knows a patient to identify them in a clinical trial.

Research sponsors and Clinical Research Organizations (CROs) may collect sensitive information such as sexual orientation, addictions, and illegal behaviors. These variables may also be considered as indirectly-identifying variables. Disclosure of sensitive information can be permitted only under special circumstances.  

When indirectly-identifying variables are present in a structured database, de-identification is relatively easy compared to unstructured text data. Electronic Health Records (EHRs) are an example of unstructured text commonly used for research purposes; unstructured text also includes focus groups, interviews, and other clinical narratives from stakeholders such as family members and allied health professionals. Some sponsors may have a dedicated data science department to conduct the de-identification process, but start-ups and companies without expert data teams may need to collaborate with external agencies.

At Real Life Sciences, we help sponsors comply with these regulatory requirements using cutting-edge technology such as natural language processing (NLP). Our RLS Protect platform, built using NLP technology, facilitates the process of de-identifying variables and has successfully de-identified thousands of clinical trial datasets. For complicated clinical trial scenarios, our team of data experts collaborates with the sponsor team to expedite the de-identification process. In particular, RLS commonly uses the PHUSE CDISC SDTM data standards, which provide a detailed, low-level view of what the PHUSE working group classifies as directly-identifying and indirectly-identifying variables. The RLS data team welcomes your queries about de-identification and can provide solutions tailor-made for your datasets.

Coming Soon On Demand - Disclosure Best Practices: Commercial Confidential/Business Information (CCI/CBI)


A significant portion of initial submission packages are rejected by regulators because of their CBI redaction content. CCI/CBI continues to be a critical component of disclosure submissions, yet the tendency to over-redact remains. This webinar will review internal business considerations, the need for strong collaboration between Disclosure, Legal, and Clinical teams, and balancing the needs of regulators.

On-Demand Webinar - Trial Disclosure: A Focus On Rare Diseases


Working with rare disease populations requires that compliance strategies be thought through carefully in advance. Small populations, like those found in rare and ultra-rare diseases, can increase the likelihood of patient re-identification if advanced methods are not applied. This webinar session will focus on learnings and best practices to apply when working with rare disease trials and populations.

View On-Demand Here: Trial Disclosure: A Focus On Rare Diseases

Anonymization Primer: Adversaries - The Risk of Re-Identification

Health Canada has updated several policies to encourage transparency and data sharing in clinical trials, which, in turn, encourages secondary analyses of data. Health Canada PRCI highly recommends quantitative de-identification instead of qualitative, rule-based redaction of patient identifiers. While complying with quantitative de-identification requirements, sponsors also have to ensure that the risk of re-identification of patients is minimized and meets a statistical threshold.

Adversaries pose a threat to patient privacy because they may attempt to learn more about patients; this is referred to as “identity disclosure.” There are other types of disclosure risk, such as “attribute disclosure”: a qualified research investigator may come to know an attribute of a study participant with a high degree of certainty. For instance, if all individuals born in a specific year were given a screening test, then a research investigator may be able to infer that study participants born in that year took the test. However, since regulatory authorities primarily focus on identity disclosure and the risk of re-identification, we will take a deep dive into that topic.

Adversary Definition

The term “adversary” is used to describe the person or persons trying to re-identify patients using their Personally Identifiable Information (PII) or Personal Health Information (PHI). ‘Intruders’ and ‘attackers’ are other terms used to describe an adversary. These adversaries are not always external agents; a qualified investigator may also act as an adversary.

Types of Adversaries

Adversaries can be of various types based on their interest in the data or their role in interacting with it. Adversaries can come in the form of competing organizations, researchers at other universities, marketers, or even hackers who want to attack secure databases as an intellectual challenge. Some of these adversaries have malicious intent, while others may accidentally expose data on the internet.

Broadly speaking, most adversaries do not know who is in the dataset. However, some adversaries may be directly connected to the dataset and may re-identify patients even inadvertently. For instance, a researcher interested in a high-profile celebrity in the dataset may not intend harm, yet a deliberate attempt at patient re-identification may still occur. Adversaries may also include family members or employers who are overly curious about a patient's disease status.

Potential for re-identification

Previously, sharing individual-level participant data (IPD) was considered a voluntary disclosure; Health Canada now requires IPD as part of mandatory disclosures. Due to re-identification, a patient may experience discrimination or stigmatization, and adversaries such as insurance companies or lawyers may use the information for financial purposes. The IPD may include direct or indirect identifiers that need to be de-identified. Depending on the variables present, the risk of re-identification may be high; for instance, with access to direct identifiers such as name or phone number, an adversary can easily identify individuals in the dataset.

Indirect or quasi-identifiers may also help an adversary piece together information and re-identify an individual. For instance, when multiple quasi-identifiers such as date of birth, sex, languages spoken, marital status, and occupation are available, a participant can be re-identified. Hence, the decision about whether a variable is a direct identifier (or quasi-identifier) is crucial. The most common method is to have two or more experts independently evaluate whether a variable is an identifier, and then use Cohen's kappa as a statistical measure of agreement. If the kappa value is greater than 0.8, the experts are considered to agree that the variable is an identifier; for lower values, additional experts are consulted to reach a classification agreement. These directly- or indirectly-identifying variables need to be protected from adversaries. Quasi-identifiers can also be more complex to anonymize in an unstructured/verbatim text context, so sponsor organizations need both a quantitative approach and a tool that can manage these complexities.
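A minimal sketch of the two-rater agreement check described above, using Cohen's kappa (the ratings are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Two experts classify each candidate variable as identifier (1) or not (0)
expert_a = [1, 1, 0, 1, 0, 1, 1, 0]
expert_b = [1, 1, 0, 1, 0, 1, 0, 0]

kappa = cohen_kappa_score(expert_a, expert_b)
if kappa > 0.8:
    print(f"kappa={kappa:.2f}: experts agree on the classification")
else:
    print(f"kappa={kappa:.2f}: consult additional experts")
```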

RLS Protect performs quantitative de-identification that meets statistical thresholds, significantly reducing the risk of patient re-identification. RLS Protect takes various patient-identifying variables into account and replaces them with transformed values that achieve regulatory risk thresholds while optimizing data utility, with collaboration between organizations to ensure the appropriateness of the data. This process is facilitated by AI/machine learning algorithms, resulting in efficiency and accuracy. Each clinical trial dataset has its unique anonymization challenges, and Real Life Sciences develops the advanced algorithms needed to solve complicated problems. Our expert team can show you how to meet regulatory risk thresholds reliably and efficiently.

Anonymization Primer: Participant Privacy Risks

Medical researchers from the pharmaceutical industry and academia are increasingly engaging in secondary analysis of unstructured data and unstructured clinical documents obtained from clinical trials. However, there are practical challenges in data sharing for re-analysis or secondary analysis. One such challenge is the lack of transparency while sharing data at the patient level.

Historically, the lack of transparency around sharing patient data existed for good reasons. Since health information is often associated with our most personal aspects (e.g., ability to work, dietary habits, sexual orientation, stigmatizing medical history), redacting certain variables allows patients to maintain their privacy. It aligns with principles such as respect for persons, justice, and non-maleficence, as well as with applicable legal requirements such as the General Data Protection Regulation (GDPR).

As the regulatory authorities move away from qualitative redactions and require de-identification, it is worthwhile to discuss privacy, consequences of privacy breach, and the quantitative de-identification process.

Defining Privacy

Although there is no consensus on a specific definition of ‘privacy’, some researchers view privacy as ‘the ability to control the collection, use, and disclosure of one’s personal information.’ Another definition states that privacy concerns ‘whether others can access one’s information, regardless of whether it is the individual who is in control of her information’. In this article, we use ‘privacy’ to mean whether others know information about a person and can draw inferences from it.

A survey study revealed that when participants were asked whether they were willing to have their records used for research without their knowledge or permission, a majority clearly said ‘no’. But when researchers mentioned that the database would be anonymized for research, or that access to the data would be under the participants’ control, a majority of patients thought data sharing was a good idea that helps advance science. Typically, sponsor organizations engage in practices that respect these sentiments; for instance, the use of controlled platforms, data sharing agreements, and third-party ethical oversight can reduce risks to individuals.

Consequences of privacy breach

Privacy breaches can have drastic negative effects and may harm patients if the information is used by individuals with malicious intent. Misused health information may affect a person’s ability to get a specific job or keep their current one, or their ability to obtain insurance. Worst of all, they may experience social stigma if information related to their gender, race, ethnicity, or disability status is revealed to the general public. In extreme scenarios, patients’ ability to maintain autonomy over their lives may be affected. Researchers need to de-identify variables to prevent such harm to patients; for sponsor organizations, privacy breaches may also mean heavy legal penalties.

At present, Health Canada strongly recommends quantitative de-identification instead of qualitative, rule-based redaction. The regulations focus our attention on what is called the “Risk of Re-identification”, or “ROR”: namely, that there can be negative repercussions if an adversary or intruder can determine, with absolute certainty or high confidence, the identity of a patient in a clinical trial. ‘Adversaries’ is the term used to describe people or entities that might try to identify research participants.

Patient Identifiers

The Health Insurance Portability and Accountability Act (HIPAA) privacy rules list eighteen identifiers that require de-identification. Most other regulatory authorities have similar policies. These direct identifiers can be summarized as:    

  1. Name
  2. Location. Usually, the first three digits of a Zip Code can be shared in certain cases or the names of states can be shared.
  3. Dates. Age brackets or year of an event can be shared but sharing specific dates such as birth date, date of death creates a risk of re-identification.
  4. Contact information such as Telephone number, Fax number, Emails
  5. Identifying numbers such as Social Security numbers, Medical record numbers, Health plan beneficiary numbers, Account numbers, Certificate/license numbers, Vehicle identifiers, and serial numbers, including license plate numbers, Device identifiers, and serial numbers. 
  6. Web identifiers such as Universal Resource Locators (URLs), Internet Protocol (IP) addresses.
  7. Biometric identifiers e.g., finger and voice prints, full-face photographs, and any comparable images

Any other unique identifying number or patient characteristic is also covered under privacy acts by various regulatory authorities, including Health Canada. To de-identify personal information, data science experts can help in finding personal identifiers in structured or unstructured datasets. 

While de-identifying these variables, sponsors need to meet statistical thresholds to mitigate the risk of re-identification. Researchers in the pharma industry need technology solutions to meet these regulatory requirements within a short timeframe.   

Real Life Sciences has launched the RLS Protect platform that helps sponsors in de-identifying data and meeting statistical thresholds defined by Health Canada. The platform enables sponsors to share high-quality data that upholds principles of openness, transparency, and adds data utility, yet protects patients from the risk of re-identification. Since each clinical trial has its unique sets of variables and study designs, our team of expert data scientists works closely with sponsor teams to ensure optimal results. To learn more about RLS Protect and our data science services, contact us here.