Anonymization Primer: Adversaries - The Risk of Re-Identification

Health Canada has updated several policies to encourage transparency and data sharing in clinical trials, which in turn will encourage secondary analyses of data. Health Canada's Public Release of Clinical Information (PRCI) guidance strongly recommends quantitative de-identification over qualitative, rule-based redaction of patient identifiers. When applying quantitative de-identification, sponsors must also ensure that the risk of patient re-identification is minimized and falls below a statistical threshold.

Adversaries pose a threat to patient privacy because they may attempt to work out which individual a record belongs to. This is referred to as “identity disclosure.” There are other types of disclosure risk, such as “attribute disclosure,” in which a qualified research investigator may come to know the attributes of a study participant with a high degree of certainty. For instance, if all individuals born in a specific year were given a screening test, then a research investigator may be able to infer from birth year alone which study participants took the test. However, since regulatory authorities primarily focus on identity disclosure and the risk of re-identification, this primer takes a deep dive into that topic.

Adversary Definition

The term “adversary” describes the person or persons trying to re-identify patients using their Personally Identifiable Information (PII) or Personal Health Information (PHI). “Intruder” and “attacker” are other terms used for an adversary. Adversaries are not always external agents; a qualified investigator may also act as an adversary.

Types of Adversaries

Adversaries can be of various types, depending on their interest in the data or their role in interacting with it. They can be competing organizations, researchers at other universities, marketers, or even hackers who attack secure databases as an intellectual challenge. Some of these adversaries have malicious intent, and their actions may result in data being exposed on the internet.

Broadly speaking, most adversaries do not know who is in the dataset. However, some adversaries could be directly related to the dataset and may re-identify patients inadvertently. Others act deliberately even though they may not intend harm: for instance, a researcher may look for a high-profile celebrity in the dataset. Adversaries may also include family members or employers who are curious about a patient's disease status.

Potential for Re-Identification

Previously, sharing individual-level participant data (IPD) was considered a voluntary disclosure. Health Canada now requires IPD as a part of mandatory disclosures. A patient who is re-identified may experience discrimination or stigmatization, and adversaries such as insurance companies or lawyers may use the information for financial purposes. The IPD may include direct or indirect identifiers that need to be de-identified. Depending on the variables an adversary can see, the risk of re-identification may be high. For instance, with access to direct identifiers such as name or phone number, an adversary can easily identify individuals in the dataset.

Indirect identifiers, or quasi-identifiers, may also help an adversary piece together information and re-identify an individual. For instance, when multiple quasi-identifiers such as date of birth, sex, languages spoken, marital status, and occupation are available, a participant can be re-identified. Hence, the decision about whether a variable is a direct identifier, a quasi-identifier, or neither is crucial. The most common method is to have two or more experts independently evaluate each variable. Cohen’s kappa, a statistic that measures agreement between raters beyond what chance alone would produce, is then computed on their classifications. A value above 0.8 indicates that the experts agree and the classification can be accepted. For values of 0.8 or lower, additional experts are consulted to reach agreement on the classification. These directly or indirectly identifying variables need to be protected from adversaries. Quasi-identifiers can also be more complex to anonymize when they appear in unstructured or verbatim text. Sponsor organizations therefore need both a quantitative approach and a tool that can manage these complexities.
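The expert-agreement check described above can be illustrated with a short calculation. The following is a minimal sketch, not a prescribed procedure: the function name, the three labels ("direct", "quasi", "none"), and the two experts' ratings are hypothetical and chosen only for illustration. It computes Cohen's kappa for two raters as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance, and compares the result with the 0.8 cut-off mentioned in the text.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical classifications of ten dataset variables by two experts.
expert_1 = ["direct", "direct", "quasi", "quasi", "quasi",
            "quasi", "none", "none", "none", "none"]
expert_2 = ["direct", "direct", "quasi", "quasi", "quasi",
            "none", "none", "none", "none", "none"]

kappa = cohens_kappa(expert_1, expert_2)
needs_more_experts = kappa <= 0.8  # 0.8 cut-off taken from the text above

print(f"kappa = {kappa:.2f}")  # kappa = 0.84
```

In this made-up example the experts disagree on one variable out of ten, which still gives a kappa above 0.8, so no further experts would be consulted. With more than two experts a different agreement statistic would be needed, which this sketch does not cover.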

RLS Protect performs quantitative de-identification that meets statistical thresholds, which significantly reduces the risk of patient re-identification. RLS Protect takes into account the various patient-identifying variables and replaces them with transformed variables that achieve regulatory risk thresholds while preserving data utility and supporting collaboration between organizations to ensure the data remain fit for purpose. This process is facilitated by AI/machine learning algorithms, which improves efficiency and accuracy. Each clinical trial dataset has its own anonymization challenges. Real Life Sciences develops the advanced algorithms needed to solve these complicated problems. Our expert team can show you how to meet regulatory risk thresholds reliably and efficiently.

