Anonymization Techniques

Data sharing and secondary analysis of research data are becoming common practice, and disclosure of individual-level participant data is increasingly the norm. When sharing data from clinical trials, sponsors and clinical research organizations (CROs) use various techniques, commonly referred to as anonymization techniques, to protect the privacy and confidentiality of patients. Each technique has its pros and cons. Some protect patient data but make secondary analyses difficult, while others make secondary analyses possible but carry a risk of patient re-identification.

Here we discuss four anonymization techniques in more detail.

Anonymization techniques in clinical research

Historically, the most common method for data anonymization was suppression, also known as "masking" or "redaction". Suppression involves redacting or removing data so that it is no longer readable. Its major disadvantage is that it decreases data utility for secondary analyses. The research community therefore needs alternative statistical approaches that preserve data utility while keeping re-identification risk low.
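To make the trade-off concrete, here is a minimal sketch of suppression in Python. The field names, the sample record, and the `[REDACTED]` marker are illustrative assumptions, not part of any standard or regulatory format.

```python
from typing import Any

# Placeholder marker for suppressed values (an assumption, not a standard).
REDACTED = "[REDACTED]"


def suppress(record: dict[str, Any], fields_to_suppress: set[str]) -> dict[str, Any]:
    """Return a copy of the record with the listed fields replaced by a marker."""
    return {
        key: (REDACTED if key in fields_to_suppress else value)
        for key, value in record.items()
    }


# Hypothetical participant record, for illustration only.
patient = {"patient_id": "5280", "phone": "555-0100", "age": 47, "arm": "placebo"}
masked = suppress(patient, {"phone", "age"})
```

Once `age` is suppressed, no secondary analysis can stratify by age, which is the loss of utility described above.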

One popular method is generalization, which makes patient identifiers less granular. For instance, age ranges are shared instead of specific patient ages, or broader areas such as state, country, or even continent are disclosed instead of specific geographical locations. This method works well when the data is relatively homogeneous and there are no extreme outliers, because outliers may still pose a risk of patient re-identification.
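Generalization of ages and locations can be sketched as follows. The 10-year band width, the top-code threshold of 90, and the location hierarchy are assumptions chosen for illustration; a real study would set them from its own risk assessment.

```python
def generalize_age(age: int, band_width: int = 10, top_code: int = 90) -> str:
    """Map an exact age to a band such as '40-49'.

    Ages at or above top_code collapse into a single open-ended band,
    so that very old participants (outliers) do not stand out.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // band_width) * band_width
    upper = min(lower + band_width - 1, top_code - 1)
    return f"{lower}-{upper}"


# Hypothetical hierarchy, ordered from most to least granular.
LOCATION_LEVELS = ("city", "state", "country", "continent")


def generalize_location(location: dict[str, str], level: str) -> str:
    """Disclose only the chosen level of a location, e.g. the state."""
    if level not in LOCATION_LEVELS:
        raise ValueError(f"unknown level: {level}")
    return location[level]


site = {"city": "Austin", "state": "Texas", "country": "USA", "continent": "North America"}
age_band = generalize_age(47)                   # "40-49"
oldest_band = generalize_age(93)                # "90+"
region = generalize_location(site, "state")     # "Texas"
```

The open-ended top band is one simple way to handle the outlier problem mentioned above: a single 93-year-old participant is reported in the same band as everyone aged 90 or over.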

Another commonly practiced method is pseudonymization, in which existing data elements are replaced with new ones. For clinical trials, it involves recoding a value such as a patient ID into a different number (e.g., patient ID 5280 is shared as 2805). This method is typically used for directly identifying variables such as patient IDs, medical record numbers, or even phone numbers. One major advantage is that it keeps the links across a study participant's data intact. However, it still carries a significant re-identification risk.
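One way to pseudonymize IDs is a random lookup table, sketched below. The `Pseudonymizer` class, the six-digit code length, and the sample visits are hypothetical; this is not a description of any particular sponsor's or vendor's method.

```python
import secrets


class Pseudonymizer:
    """Assigns each original ID a random replacement that stays stable across records."""

    def __init__(self, digits: int = 6) -> None:
        self._digits = digits
        self._mapping: dict[str, str] = {}  # original ID -> pseudonym
        self._used: set[str] = set()        # pseudonyms already handed out

    def _new_code(self) -> str:
        # Zero-padded random number with a fixed number of digits.
        return f"{secrets.randbelow(10 ** self._digits):0{self._digits}d}"

    def pseudonym_for(self, original_id: str) -> str:
        if original_id not in self._mapping:
            candidate = self._new_code()
            # Retry on collisions and avoid accidentally reproducing the original ID.
            while candidate in self._used or candidate == original_id:
                candidate = self._new_code()
            self._mapping[original_id] = candidate
            self._used.add(candidate)
        return self._mapping[original_id]


pseudo = Pseudonymizer()
visits = [("5280", "week 1"), ("5280", "week 4"), ("7731", "week 1")]
shared = [(pseudo.pseudonym_for(pid), visit) for pid, visit in visits]
```

Both visits of patient 5280 receive the same pseudonym, so longitudinal analyses still work. The lookup table itself is the weak point: anyone who obtains it can reverse the mapping, which is one reason the re-identification risk remains.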

A less popular technique is random noise. It involves adding or subtracting random amounts from numeric or data-oriented identifiers to make the original values difficult to determine. Data scientists may use specific methods such as additive or multiplicative noise. This procedure allows them to reduce re-identification risk even when outliers or influential observations are present in the dataset. Because this method requires data science expertise, suitable experts may be hard to find in a niche domain.
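The difference between additive and multiplicative noise can be sketched as follows. Gaussian noise, the scale values, and the sample weights are assumptions for illustration; choosing a noise distribution and scale for a real dataset is exactly where the expertise mentioned above comes in.

```python
import random


def add_additive_noise(values: list[float], scale: float, rng: random.Random) -> list[float]:
    """Shift each value by an independent Gaussian draw with standard deviation `scale`."""
    return [v + rng.gauss(0.0, scale) for v in values]


def add_multiplicative_noise(
    values: list[float], relative_scale: float, rng: random.Random
) -> list[float]:
    """Scale each value by a random factor centred on 1, so the noise grows with the value."""
    return [v * (1.0 + rng.gauss(0.0, relative_scale)) for v in values]


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
weights_kg = [61.5, 72.0, 140.2]  # hypothetical values; 140.2 plays the outlier
noisy_additive = add_additive_noise(weights_kg, scale=1.5, rng=rng)
noisy_multiplicative = add_multiplicative_noise(weights_kg, relative_scale=0.02, rng=rng)
```

Additive noise perturbs every value by a similar absolute amount, while multiplicative noise perturbs large values more in absolute terms, which can help obscure outliers such as the 140.2 kg entry.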

Regulatory requirements

Beyond choosing among these approaches, sponsors and CROs need to be mindful of regulatory requirements and industry standards. There are suggested data standards, such as the Clinical Data Interchange Standards Consortium (CDISC) standards Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM), and anonymization guidance for data transformation from industry consortia and working groups (e.g., PhUSE). After thoughtful deliberation, various authorities selected these standards because they offer a reasonable tradeoff: they reduce risk, preserve data utility, and are practical to implement with current technologies and tools. The choice of transformations for a variable depends on:

  • Type of variable: Whether the identifier is a Direct Identifier (DI), Quasi-Identifier (QI), or Sensitive Information
  • Sensitive information: Whether the identifier is Relational or Transactional Data
  • Type of data: Whether the data is categorical, ordinal, or numeric/interval
  • Re-identification risk: Data utility considerations such as the possibility of patient linkage
  • Risk assumptions 
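As a rough illustration of how two of these factors might feed a decision, the sketch below maps a variable's classification and data type to a candidate technique. The rules are simplifying assumptions based on the technique descriptions in this article; they are not a regulatory rule set, and they ignore linkage risk and risk assumptions entirely.

```python
from dataclasses import dataclass
from enum import Enum


class VariableType(Enum):
    DIRECT_IDENTIFIER = "DI"
    QUASI_IDENTIFIER = "QI"
    SENSITIVE = "sensitive"


class DataType(Enum):
    CATEGORICAL = "categorical"
    ORDINAL = "ordinal"
    NUMERIC = "numeric/interval"


@dataclass(frozen=True)
class Variable:
    name: str
    variable_type: VariableType
    data_type: DataType


def suggest_technique(variable: Variable) -> str:
    """Return a candidate technique; a hypothetical starting point, not a final decision."""
    if variable.variable_type is VariableType.DIRECT_IDENTIFIER:
        return "pseudonymization"
    if variable.variable_type is VariableType.QUASI_IDENTIFIER:
        if variable.data_type is DataType.NUMERIC:
            return "generalization or random noise"
        return "generalization"
    return "review case by case"


# Hypothetical variables, for illustration only.
patient_id = Variable("patient_id", VariableType.DIRECT_IDENTIFIER, DataType.CATEGORICAL)
age = Variable("age", VariableType.QUASI_IDENTIFIER, DataType.NUMERIC)
country = Variable("country", VariableType.QUASI_IDENTIFIER, DataType.CATEGORICAL)
suggestions = {v.name: suggest_technique(v) for v in (patient_id, age, country)}
```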

Sponsors and CROs make anonymization decisions after careful consideration to ensure that they meet regulatory requirements. For example, since 2020, Health Canada has discouraged redactions to promote data sharing. Along with clinical scientists, data scientists weigh in to pick the right anonymization technique. If data expertise is not available in-house, CROs often choose to collaborate with an external data agency such as Real Life Sciences. At RLS, our team of experts treats each clinical trial dataset as a unique challenge and provides customized anonymization solutions. We keep up with current regulatory requirements and suggest solutions accordingly. These solutions have helped our clients get quick approval from regulatory authorities. Connect with us to further discuss clinical trials and the solutions we offer.

