Health Canada has published guidance documents for new drug submissions and device applications. In particular, these guidelines describe how to share information gathered during clinical trials safely. Following them makes data sharing for secondary analyses possible while protecting patient privacy and confidentiality.
To protect patient identity and reduce the risk of re-identification, Health Canada now requires anonymization or de-identification of data. In its Anonymization Report Template, Health Canada asks sponsors to classify the variables considered personal information into two main categories: directly identifying and indirectly identifying. Sponsors are also required to state and justify their reasons for describing information as personal information.
Health Canada defines directly-identifying variables as identifying information that is replicable, distinguishable, and knowable. The most common examples are name, address, telephone number, medical record number, and other numbers such as a social security number.
Although Health Canada uses this three-pronged test for directly-identifying variables, other regulatory publications, such as the U.S. Department of Health & Human Services (HHS) De-identification Guidelines, apply it to both directly and indirectly-identifying variables.
Replicability of data can be an important factor. For instance, blood sugar levels fluctuate throughout the day and over months and years, so a patient is unlikely to be identified from blood sugar levels alone. However, anatomical anomalies identified in a CT scan or MRI are relatively stable over time unless the patient receives treatment. Such findings can be reproduced reliably and therefore increase the risk of patient re-identification.
Another aspect is the distinguishability of the data. In clinical trials, the indication is often the same for all patients, so indication alone cannot distinguish between them. But if indirectly-identifying variables such as date of birth, gender, and 5-digit ZIP code are known in combination, an individual can be re-identified.
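One way to make distinguishability concrete is to count how many records share the same combination of values. The sketch below is illustrative only: the records, field names, and helper functions are hypothetical and are not taken from Health Canada's guidance. It shows that indication alone leaves every patient in one large group, while date of birth, gender, and ZIP code together can single out individuals.

```python
from collections import Counter

# Hypothetical toy records; field names and values are made up for illustration.
records = [
    {"indication": "T2D", "dob": "1980-03-14", "gender": "F", "zip": "02139"},
    {"indication": "T2D", "dob": "1980-03-14", "gender": "F", "zip": "02139"},
    {"indication": "T2D", "dob": "1975-11-02", "gender": "M", "zip": "10001"},
    {"indication": "T2D", "dob": "1990-07-21", "gender": "F", "zip": "94105"},
]


def equivalence_class_sizes(rows, keys):
    """Count how many rows share each combination of values for `keys`."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    sizes = equivalence_class_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


by_indication = equivalence_class_sizes(records, ["indication"])
by_quasi = equivalence_class_sizes(records, ["dob", "gender", "zip"])

# Smallest group size: the smaller it is, the more distinguishable a record.
print(min(by_indication.values()))  # 4: indication alone separates no one
print(min(by_quasi.values()))       # 1: at least one patient is unique
print(len(unique_rows(records, ["dob", "gender", "zip"])))  # 2
```

A group of size one means the combination of values points to a single record, which is the situation a de-identification process tries to avoid.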
Lastly, the knowability of the data depends on who the adversary is. For example, an adversary may know a subject's demographic information and have access to public sources such as census data, claims data, or self-disclosures on social media, all of which increases the risk of re-identification. On the other hand, most lab reports are not disclosed in the public domain, so they carry a lower risk of re-identification.
Indirectly-identifying variables
Indirectly-identifying variables are defined as information that can identify an individual when combined with other indirect identifiers. When information such as medical history, height, weight, body mass index (BMI), race, or gender is available, someone who knows a clinical trial patient may be able to identify them.
Research sponsors and Clinical Research Organizations (CROs) may collect sensitive information such as sexual orientation, addictions, and illegal behaviors. These variables may also be considered indirectly identifying. Disclosure of sensitive information is permitted only under special circumstances.
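The classification exercise described in this post (assigning each personal-information variable to a category and recording a justification) can be tracked with a simple inventory. The following is a minimal sketch under stated assumptions: the class names, variable names, and justification wording are hypothetical and do not reproduce the layout of Health Canada's Anonymization Report Template.

```python
from dataclasses import dataclass
from enum import Enum


class IdentifierType(Enum):
    DIRECT = "directly identifying"
    INDIRECT = "indirectly identifying"


@dataclass(frozen=True)
class VariableClassification:
    name: str
    identifier_type: IdentifierType
    justification: str  # why the variable is treated as personal information


# Hypothetical inventory; variable names and justifications are illustrative.
inventory = [
    VariableClassification("patient_name", IdentifierType.DIRECT,
                           "Names the individual on its own."),
    VariableClassification("medical_record_number", IdentifierType.DIRECT,
                           "Unique number assigned to one patient."),
    VariableClassification("height_cm", IdentifierType.INDIRECT,
                           "Identifying only in combination with other variables."),
    VariableClassification("race", IdentifierType.INDIRECT,
                           "Identifying only in combination with other variables."),
]


def group_by_type(items):
    """Group variable names under their identifier category."""
    grouped = {t: [] for t in IdentifierType}
    for item in items:
        grouped[item.identifier_type].append(item.name)
    return grouped


def missing_justifications(items):
    """List variables whose justification has been left blank."""
    return [item.name for item in items if not item.justification.strip()]


grouped = group_by_type(inventory)
for id_type, names in grouped.items():
    print(f"{id_type.value}: {', '.join(names)}")
print("missing justifications:", missing_justifications(inventory))
```

Keeping the justification next to each variable makes it easy to spot entries that still need a stated reason before a report is drafted.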
When indirectly-identifying variables are held in a structured database, de-identifying them is easier than in unstructured text. Electronic Health Records (EHRs) are a commonly used source of unstructured text for research purposes. Unstructured text also includes focus groups, interviews, and other clinical narratives from stakeholders such as family members and allied health professionals. Some sponsors have a dedicated data science department to carry out the de-identification process, but start-ups and companies without expert data teams often need to collaborate with external agencies.
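To illustrate why structured data is the easier case, the sketch below applies two widely used transformations to a single record: dropping directly-identifying fields and coarsening indirectly-identifying ones. It is an assumption-laden example, not a method prescribed by Health Canada or used by any particular vendor. The field names, the ISO date format, and the choice to keep only birth year and the first three ZIP digits are all hypothetical.

```python
# Hypothetical field names for directly-identifying variables.
DIRECT_IDENTIFIERS = {"name", "medical_record_number", "telephone"}


def generalize_record(record):
    """Return a copy with direct identifiers dropped and two
    indirect identifiers coarsened (assumes ISO dates, 5-digit ZIPs)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "date_of_birth" in out:
        # Keep only the year, e.g. "1980-03-14" -> "1980".
        out["birth_year"] = out.pop("date_of_birth")[:4]
    if "zip" in out:
        # Keep only the first three digits, e.g. "02139" -> "021".
        out["zip3"] = out.pop("zip")[:3]
    return out


patient = {
    "name": "Jane Doe",
    "medical_record_number": "MRN-0001",
    "date_of_birth": "1980-03-14",
    "gender": "F",
    "zip": "02139",
}

print(generalize_record(patient))
# {'gender': 'F', 'birth_year': '1980', 'zip3': '021'}
```

Because each value sits in a named column, the transformation is a few lines of code. In free text the same identifiers first have to be found, which is where techniques such as NLP come in.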
At Real Life Sciences (RLS), we help sponsors comply with these regulatory requirements using cutting-edge technology such as natural language processing (NLP). Our RLS Protect platform, built on NLP technology, facilitates the de-identification of variables and has successfully de-identified thousands of clinical trial datasets. For complicated clinical trial scenarios, our team of data experts collaborates with the sponsor team to expedite the de-identification process. In particular, RLS commonly uses the PHUSE CDISC SDTM data standards, which provide a detailed, low-level view of what the PHUSE working group classifies as directly-identifying and indirectly-identifying variables. The RLS data team welcomes your queries about de-identification and can provide solutions tailor-made for your datasets.