Natural Language Processing (NLP) algorithms enable us to use novel data sources such as electronic health records (EHR), media interview transcriptions, and social media data. Social media data in particular is spontaneous, heterogeneous, and less structured, which makes data curation a necessary step. Data curation is one of the main factors that determine the quality of the data.
Data scientists have proposed a variety of strategies for data curation. Some consider it unnecessary, while others think it is unavoidable. Data curation may be needed to address privacy concerns, such as excluding personally identifiable information, or to ensure the data used in the model cannot be ‘hacked’. Most importantly, data curation helps identify high-quality, robust, and inclusive data that can be converted into actionable insights.
Social media data enables us to understand narrators’ perspectives on specific diseases, treatments, and available healthcare options. Narrators may include stakeholders such as doctors, care team members, primary caregivers, and patients themselves. At Real Life Sciences (RLS), we use sophisticated NLP algorithms to convert unstructured social media data into structured data that can be analyzed to understand the patient journey further.
To get high-quality data, we employ NLP algorithms that weed out irrelevant data and surface usable data. First, we aggregate several social media sources rather than relying on one popular networking site. Popular platforms such as Twitter or Facebook often represent only a small portion of the population living with a specific disease, so comprehensive data collection within that population requires a different approach. To enrich our data, we include many patient forums and online communities: our NLP pipeline casts a wide net across disease-specific forums, blogs, and verified patient communities alongside the major social media sites.
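As a simplified sketch of this aggregation step (the source names and field names below are hypothetical, not RLS’s actual schema), posts from heterogeneous feeds can be normalized into one common structure before any filtering happens:

```python
from dataclasses import dataclass

@dataclass
class Post:
    source: str     # e.g. "twitter", "patient_forum", "disease_blog"
    author_id: str  # normalized to a string across sources
    text: str

def aggregate(feeds: dict) -> list:
    """Normalize posts from heterogeneous feeds into one shared schema."""
    posts = []
    for source, items in feeds.items():
        for item in items:
            posts.append(Post(source=source,
                              author_id=str(item.get("author", "unknown")),
                              text=item.get("text", "")))
    return posts

# Illustrative feeds with differing author-id conventions.
feeds = {
    "twitter": [{"author": 1, "text": "Caring for mom with AD is hard."}],
    "patient_forum": [{"author": "u42", "text": "Dementia diagnosis last year."}],
}
corpus = aggregate(feeds)
```

Once every post shares one schema, the downstream curation models can run uniformly regardless of where a post originated.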
After collecting this large volume of data, we establish processes to reduce noise and retain only the data we care about. We start with a simple and popular ‘keyword-based approach’ (or bag of words) to identify relevant posts. The keywords chosen are definitively associated with the target disease and unlikely to be confused with other conditions. For instance, in a recent study of Alzheimer’s disease, some of the disease-specific keywords were ‘dementia’, ‘AD’, and ‘APOE’. In that study, this step alone narrowed the initial sample from ten million posts down to one million.
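A minimal illustration of such a keyword filter, using the disease-specific terms mentioned above (the keyword list in an actual study would be far larger and clinically vetted):

```python
import re

# Illustrative disease-specific keyword set (lowercased for matching).
KEYWORDS = {"dementia", "ad", "apoe", "alzheimer", "alzheimer's"}

def keyword_filter(posts):
    """Keep only posts containing at least one disease-specific keyword."""
    kept = []
    for text in posts:
        # Tokenize into lowercase word-like chunks, keeping apostrophes.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens & KEYWORDS:
            kept.append(text)
    return kept

posts = [
    "My grandfather's dementia is getting worse.",
    "Great weather for a picnic today!",
    "Tested positive for the APOE e4 variant.",
]
relevant = keyword_filter(posts)  # drops the picnic post
```

Set intersection against a tokenized post keeps this first pass cheap enough to run over millions of posts.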
Next, we apply more sophisticated algorithms to ensure the keyword matches reflect genuine experiences rather than incidental mentions or “joke” references. We build a “concept model” that combines keywords such as patient and disorder. This is also referred to as ‘relational modeling’ because the sentence structure and syntactic relations between keywords are of prime importance. For instance, a narrator may have mentioned ‘grandfather got forgetful’, which could describe a one-off event or a condition irrelevant to the research study. The algorithm ensures that narrators have referenced a diagnosed concept: ‘grandfather got forgetful’ does not include a diagnosis and hence would be disqualified from the study, whereas ‘grandfather got forgetful due to his Alzheimer’s’ would be included in the qualified data.
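A production concept model would rely on syntactic parsing of each sentence; the toy sketch below only approximates the idea with a rule requiring an experience mention and an explicit diagnosis term in the same sentence (the term lists are illustrative, not RLS’s actual lexicon):

```python
import re

# Illustrative patterns standing in for the 'experience' and 'diagnosis'
# concepts; a real relational model would use dependency parsing instead.
EXPERIENCE = re.compile(r"\bforgetful\b|\bmemory loss\b|\bconfus\w+\b",
                        re.IGNORECASE)
DIAGNOSIS = re.compile(r"\balzheimer'?s?\b|\bdementia\b|\bAD\b",
                       re.IGNORECASE)

def qualifies(post: str) -> bool:
    """A post qualifies only if an experience and a diagnosis co-occur
    within one sentence."""
    for sentence in re.split(r"[.!?]", post):
        if EXPERIENCE.search(sentence) and DIAGNOSIS.search(sentence):
            return True
    return False

a = qualifies("Grandfather got forgetful.")                         # excluded
b = qualifies("Grandfather got forgetful due to his Alzheimer's.")  # included
```

The sentence-level co-occurrence check is a crude stand-in for true syntactic relations, but it reproduces the example from the text: the undiagnosed mention is disqualified while the diagnosed one passes.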
Since the main objective of the study is not disease prevalence but understanding patient journeys, additional models are implemented to extract and analyze the data. For instance, in the Alzheimer’s study, a “suffer relation” model that captures symptoms was applied. If a narrator living with Alzheimer’s posts “last week, I had severe anxiety”, the suffer relation model would capture the post.
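A rough approximation of such a suffer relation, pairing a first-person subject with a symptom term (the symptom lexicon and verb list here are hypothetical):

```python
import re

# Illustrative symptom lexicon; a real study would use a clinical vocabulary.
SYMPTOMS = ["anxiety", "agitation", "insomnia", "depression"]

# First-person subject + experiencing verb + optional modifier + symptom.
SUFFER = re.compile(
    r"\bI\s+(?:had|have|get|suffer(?:ed)?\s+from)\s+(?:\w+\s+)?("
    + "|".join(SYMPTOMS) + r")\b",
    re.IGNORECASE,
)

def extract_symptom(post: str):
    """Return the symptom a narrator reports suffering, or None."""
    m = SUFFER.search(post)
    return m.group(1).lower() if m else None

hit = extract_symptom("Last week, I had severe anxiety.")
miss = extract_symptom("Anxiety is a common topic online.")
```

Tying the symptom to a first-person subject is what separates a suffer relation from a bare keyword hit: the second post mentions anxiety but reports no one suffering it.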
A functional “impairment” model was then applied. This model allowed us to understand not just symptomatology but also impaired activities (e.g., getting dressed, parenting, driving a car), revealing the functional problems caused by, or occurring alongside, the impairments. For example, a narrator may post ‘Dad’s Alzheimer’s is worse, and he’s had trouble dressing. We need to hire help’. This narrative allows us to examine both the impairment and its impact.
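A toy sketch of an impairment model that flags impaired daily activities together with a downstream impact (the phrase lists are illustrative, not the actual model):

```python
# Illustrative activity-of-daily-living and impact phrase lists.
ACTIVITIES = {"dressing", "driving", "cooking", "parenting", "bathing"}
IMPACTS = {"hire help", "quit my job", "moved in", "full-time care"}

def analyze_impairment(post: str) -> dict:
    """Return the impaired activities and life impacts mentioned in a post."""
    text = post.lower()
    return {
        "activities": sorted(a for a in ACTIVITIES if a in text),
        "impacts": sorted(i for i in IMPACTS if i in text),
    }

post = ("Dad's Alzheimer's is worse, and he's had trouble dressing. "
        "We need to hire help.")
result = analyze_impairment(post)
```

Capturing the activity and the impact as separate fields is what lets the downstream analysis connect an impairment (trouble dressing) to its real-world consequence (hiring help).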
Each research study poses unique research questions, and the NLP algorithms need to be tailored accordingly. The process of data curation is iterative and may need to be revisited depending on various factors, and manual review of the data may be necessary at a later stage. At RLS, our data team works closely with clinical scientists to ensure high-quality data analytics.