Regulatory authorities such as Health Canada now encourage data sharing for secondary analyses and derivative research. Data must be shared with other institutions safely, in a way that protects patient privacy and confidentiality. One method commonly adopted in the recent past was to redact personally identifiable information, but this approach reduces data utility. Health Canada now encourages quantitative risk assessment when anonymizing data for safe sharing.
A common practice in data anonymization is to generalize personal information, for example by replacing an actual age with an age range that contains it. For instance, before anonymization the age variable in a dataset holds a patient's actual age, such as 42 years. During anonymization, 42 years is replaced by a range of 40-45 years. The age range is defined as part of the risk assessment process. Since several individuals in the dataset will typically fall into the same range, the risk of re-identification is quite low.
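The replacement of an age by a range can be sketched in a few lines of Python. This is only an illustration: the ranges in AGE_RANGES and the sample ages are hypothetical, since in practice the ranges come out of the risk assessment.

```python
from collections import Counter

# Hypothetical ranges for illustration only; real ranges are defined
# during the risk assessment.
AGE_RANGES = [(18, 39), (40, 45), (46, 60), (61, 120)]


def generalize_age(age):
    """Return the label of the configured range that contains the age."""
    for low, high in AGE_RANGES:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside every configured range")


# Made-up sample ages, not real patient data.
ages = [42, 41, 44, 45, 40, 52, 58, 33]
generalized = [generalize_age(age) for age in ages]

# How many patients share each range after generalization.
range_counts = Counter(generalized)
print(range_counts)
```

In this made-up sample the patient aged 42 shares the 40-45 range with four others, which is what keeps the risk of re-identification low.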
This common practice of data anonymization is not well suited to data points that are extreme. These extreme observations are termed ‘outliers’ and need to be dealt with before the dataset is analyzed. The central tendency of a dataset is indicated by the average, or mean. For instance, the average age of patients in a dataset may be 49 years, yet the same dataset may include individuals who are 98 or 18 years of age. These extreme values are outliers because they lie far from the mean.
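One simple way to flag values that lie far from the mean is to measure the distance in standard deviations. The sketch below assumes a cutoff of two standard deviations (Z_THRESHOLD) and uses made-up ages. Both are illustrative choices, not a rule taken from any guidance.

```python
import statistics

# Made-up ages: most patients are close to 49, two are extreme.
ages = [46, 47, 48, 49, 50, 51, 52] * 4 + [18, 98]

mean_age = statistics.mean(ages)
spread = statistics.stdev(ages)  # sample standard deviation

# Hypothetical cutoff; the right threshold depends on the dataset.
Z_THRESHOLD = 2.0


def is_outlier(age):
    """True when the age is more than Z_THRESHOLD deviations from the mean."""
    return abs(age - mean_age) / spread > Z_THRESHOLD


outliers = sorted(age for age in ages if is_outlier(age))
print(outliers)
```

With this sample, only the patients aged 18 and 98 are flagged. Other detection rules, such as ones based on quartiles, would be equally reasonable.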
Outliers pose several problems for data analysis. They skew the averages for a group, which may then no longer be a true representation of the patient group or the control group. These extreme values may also lead to invalid results when statistical tests are run on the dataset. Sometimes outliers are simply due to errors in data entry. Hence, sponsors/CROs check data periodically to ensure that it has been entered correctly and, where possible, repeat a measurement to obtain a correct value.
The bigger problem with outliers from a patient privacy perspective is the risk of re-identification. These outliers are a single patient or a small set of patients with extreme attributes, such as a single elderly patient aged 98 years. In these cases data utility often suffers, because the patient does not fit into any equivalence class with others. Providing an age range would not be a useful solution, because an adversary may still attempt to re-identify an outlier who is alone in that range. The HIPAA ‘safe harbor’ de-identification method addresses this for a list of eighteen types of identifiers, such as names and phone numbers; ages over 89, for example, must be aggregated into a single category of 90 or older. Safe harbor means the data for these eighteen identifiers is removed from the dataset.
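The equivalence class problem can be made concrete with a small sketch. It assumes ten-year age ranges and treats the risk for a patient as one divided by the size of the patient's equivalence class, which is a common simplification. The ages are made up, and a real risk assessment would consider more attributes than age.

```python
from collections import Counter


def age_range(age, width=10):
    """Label the fixed-width range containing the age, e.g. 98 -> '90-99'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Made-up ages, including a single elderly patient.
ages = [41, 43, 47, 52, 55, 58, 44, 51, 98]

# Size of each equivalence class when age is the only attribute considered.
classes = Counter(age_range(age) for age in ages)

# Simplified per-class risk: one chance in (class size) of picking the right person.
risk = {label: 1 / size for label, size in classes.items()}
print(risk)
```

The 98-year-old ends up alone in the 90-99 class, so the range does nothing to hide that patient, while the other classes each contain four patients.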
This strategy often leads to suppression of that attribute for all patients, greatly degrading the data utility. One solution to this issue is to allow the system to treat some patients as outliers and suppress their entire record (all attributes). For instance, a sponsor would suppress all attributes of the single elderly patient. While a little data utility is lost for that single patient, overall data utility may increase considerably, because age can be retained for the rest of the patients.
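The trade-off between the two strategies can be sketched as follows. The records, the cutoff of 90 years used to mark the outlier, and the count of retained values as a crude stand-in for data utility are all hypothetical. In a real release the retained ages would still be generalized into ranges.

```python
# Made-up records; the last patient is the outlier.
records = [
    {"age": 42, "sex": "F", "arm": "treatment"},
    {"age": 44, "sex": "M", "arm": "control"},
    {"age": 41, "sex": "F", "arm": "control"},
    {"age": 45, "sex": "M", "arm": "treatment"},
    {"age": 98, "sex": "F", "arm": "treatment"},
]


def suppress_attribute(rows, attribute):
    """Blank out one attribute for every patient."""
    return [{**row, attribute: None} for row in rows]


def suppress_records(rows, is_outlier):
    """Blank out every attribute, but only for rows flagged as outliers."""
    return [
        {key: None for key in row} if is_outlier(row) else dict(row)
        for row in rows
    ]


def retained_cells(rows):
    """Crude utility measure: the number of values that were not suppressed."""
    return sum(value is not None for row in rows for value in row.values())


attribute_wise = suppress_attribute(records, "age")
# Hypothetical rule for marking the outlier: age of 90 or more.
record_wise = suppress_records(records, lambda row: row["age"] >= 90)

print(retained_cells(attribute_wise), retained_cells(record_wise))
```

In this toy dataset, suppressing age for everyone keeps 10 of the 15 values, while suppressing only the outlier's record keeps 12 and preserves age for the remaining four patients.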
This approach is especially beneficial for data sharing in smaller populations, which are often more sensitive to single outliers.
Data-related decisions can be difficult, as they may have unintended consequences down the road. Sponsors/CROs often engage expert services for consulting and for addressing queries from regulatory bodies. Real Life Sciences (RLS) provides expert advice on challenges such as outliers to meet the requirements of regulatory authorities. RLS data scientists help identify outliers and then suppress their data as needed.