Patient Voice: The Intersection of Real World Data & Natural Language Processing

Dr. Nardin Farid, Strategy Lead, Real World Patient Analytics, Real Life Sciences
Dr. Margaret Bray, Lead Data Scientist, Real World Patient Analytics, Real Life Sciences
Stephen Doogan, Founder, Real Life Sciences

Full Webinar Transcript

John Lyons:
Our topic today, Patient Voice: The Intersection of Real World Data and Natural Language Processing. We have several great speakers lined up for you. First, we have Dr. Nardin Farid, who has a doctorate of pharmacy. She has worked in multiple therapeutic areas, including organ transplant, oncology and neurological disorders. Dr. Farid has practiced her pharmaceutical background academically at Yale, clinically at Penn Medicine and within pharma at J&J. Dr. Farid is currently focused on RLS REVEAL within Real World Data. Also, joining us is Dr. Margaret Bray. Dr. Bray has a PhD in biostatistics. She has spent six years in pharma handling data science analytics for HEOR, market value and access, commercial diagnostics and research. Dr. Bray has experience in rare disease and neurological disorder therapeutic areas. Handling our Q&A session today is Stephen Doogan. Stephen is the founder of Real Life Sciences, and he brings over a decade of experience in natural language processing and machine learning. Including product development and analytics roles in pharmacovigilance, outcomes research, and clinical data management. With that, I'd like to kick off our presentation.

Dr. Nardin Farid:
Great, thank you so much, John. Here at Real Life Sciences, we are revolutionizing outcomes and having advanced insight generation by using our NLP while protecting patient privacy. NLP is the core of what powers both of our solutions. RLS Protect is a data anonymization, redaction and risk assessment tool that helps you with new drug applications for Health Canada and EMA. Then RLS REVEAL is going to be the focus of today's conversation. REVEAL is used to systematically process real world data and apply structure by using NLP so that we could better understand and model the patient journey. In today's presentation, we're going to discuss patient centricity, Real World Data, and then NLP. Then we'll talk about two case studies, one for migraines and one in diabetes. With that, I'm going to pass it off to Margaret who is going to start off the presentation.

Dr. Margaret Bray:
 All right. I'm going to start off talking a little bit about patient centricity. Over the past number of years, we've been seeing this really big shift in a lot of healthcare settings going and putting patients at the center. What that really means is for physicians, we're looking at patient centered or value based healthcare. Where we allow the patient goals and what's important to them to be considered as opposed to simply driving for the best clinical outcome. Within the pharmaceutical industry, this is increasing the analyzing of the patient journey and the patient experience with their treatments. What's working? What's not? Why were we not diagnosed? Even on a bigger scale, at kind of the FDA level and the regulatory level, we saw in the US the passing of the Cures Act in 2016. This modified the drug approval process in the US by emphasizing the importance of incorporating the patient and just everything about them when considering the approval of a new therapeutic. Behind all of these movements to bring patients to front and center, we're seeing kind of a common refrain. That refrain is, it comes down to the patient voice.
In the past, and now, even currently now, the main ways of collecting the patient voice, we're looking at surveys, questionnaires, interviews, focus groups. These really kind of direct conversational methods almost. Some of the downsides to that is it's very, very time intensive and it's labor intensive. What this means is, it's hard to scale. It's hard to get enough patients, enough voices to really do something with the data in a lot of cases. We also run into the issue of inaccurate memories, which makes answering questions about the past difficult. Or when you're recollecting something, is it the same as how you felt when it was happening? It just can be hard to really trust those. We also run into the trade off between survey length and the usefulness of the data. Essentially, if a survey is shorter, we're more likely to get people to respond. The more multiple choice questions, the easier it is, more responses so we get more data. However, it's going to be limited data. If we want to ask more questions, we're going to get more comprehensive data.
If we're asking open-ended questions, we can get more variety of answers. But all of that means you're less likely to get people to answer. One other issue we have with collecting patient voice in this way is recognizing the bias. All of these methods that I was mentioning are question based. We're actively going out and looking for answers. We're seeking responses. When we're making these questions, we're relying on preexisting knowledge. That preexisting knowledge may kind of cause us to uncover an incomplete picture, for lack of better words. This past knowledge, it might not always be patient centric. We may be coming at questions or formulating questions from the view of a physician, from the view of a payer, and what matters to them as opposed to what matters to the patient. We also run into, I always love the phrase, the act of observing disturbs the observed. By the act of questioning, we are provoking responses. While this may not always be a bad thing, if we are asking a bunch of questions on a certain topic, people might be digging in their brain. Did this happen?
Does this matter? We'll get answers and they're truthful answers, but do they matter? Is that really what was important to the patient? It's hard to tell because we've perturbed the system, for lack of a better word. Then the other thing is, open-ended questions, they can't fully stop the bias. It's still questioning and we're still seeking those answers. One example that really speaks to me in this area, it's in the rare disease field. I remember us talking to some physicians and what they said is, a lot of times the first person or the first few physicians that diagnose a rare disease, whatever symptoms they kind of put out there are the ones that are now associated with the disease. It's very hard to change. That means that if their patient ended up having multiple diagnoses or an abnormal presentation of the disease, symptoms are going to be missed. Maybe the disease always has a cough in it. Their patient didn't have a cough, and now it's not associated with it. We're never going to find that because our preexisting knowledge is, cough is not associated with this.
It's just something to think about when we're collecting the patient voice. Recognizing that bias there. The other thing that I like to think that the really big kind of wrap up of the limitation of these traditional patient engagement sources is, we cannot collect what we do not ask, and we cannot ask of what we are not aware. If we just don't know what's there, we're not going to ever find it because we're not looking for it. Our ideal sources for the patient voice allow for unprompted statements. Now we're going to go into Real World Data. First in our real world data, we have a quick poll. For this poll, what sources of real world data are you currently using? So you personally, within your organization that you're aware of. You can select all that apply. We have electronic health records, administrative claims, disease registries, patient-reported outcomes, social media, and other.

John Lyons: 
Okay. Dr. Bray, we have results coming in. There's a fairly even spread across the board. The top three answers however are electronic health records, disease registries, and administrative claims.

Dr. Margaret Bray: 
Awesome. Conveniently, those are the ones on this slide. Real world data sources, it sounds like everyone's largely familiar with them. We have all those ones that were listed. If you notice, kind of the first three are the ones that are used the most. One of the reasons that we aren't seeing social media or even PROs used quite as much, is they're a little bit more challenging. One of the reasons they're more challenging is that the first three, the ones that people are talking about, are largely structured data. What do we mean when we say structured data? Structured data, one, has a standardized language. So a single term to represent a concept. We're not going to run into issues of synonyms and multiple ways to say the same thing. One term can represent a concept. Also, the data is going to comply to a data model.
I think EHR is a great example of that for the structured portion of the EHR and not the doctor's notes part. Where you're filling in a form, there's a certain data type for each field. We know what to expect. When we're asking for date of birth, we're getting date of birth. That's our data model. Then we also have defined relationships between concepts. What that means, I'll give a few examples in a second, but pretty much if we are using a hierarchical relationship, that hierarchy is built in. We know what it is. All of the information that we need is there. A few examples of this standardized language, I'm going to refer to it as a vocabulary, some of the ontologies, we have taxonomy, and we have dictionaries, we have classification schema. They always mean slightly different things.
I'm going to use them kind of interchangeably and focus on vocabularies here. Some of the common vocabularies that you've probably seen if you're regularly working with real world data, ICD-10, CPT, LOINC, SNOMED-CT, HPO, MedDRA. One interesting thing is, each of these systems, each of these vocabularies was originally created with a purpose in mind. That purpose dictates what words or phrases appear in their vocabulary. What relationships exist in their vocabulary. For instance, ICD-10, it was originally created for tracking diseases within a population. Which has turned into very billing centric focused, at least in the US. But it's good for this broad scale billing treatment statistics collection. CPT codes on the other hand are for reporting medical procedures and services. SNOMED-CT, it's kind of similar to ICD-10 in that we are able to encode things from EHRs, but it's much more broad. There's ways to kind of combine different terms together, so it's a little too broad for that same big statistics collection that you're going to get from ICD-10.
HPO, it's another one, kind of similar, used largely by researchers. It's a vocabulary of phenotypic abnormalities. Then MedDRA is used by regulatory authorities. We might see overlaps between terms in each of these vocabularies, but like I said, the relationships, the structures are all going to be different. Here's an example of the structure of HPO and its hierarchy. It's called a directed acyclic graph, but essentially at the top, that phenotypic abnormality, it's the most general term. It's the least specific. As we work our way down each of the branches, we are getting more and more specific. If you work your way down the blue branch, for example, we're seeing abnormality of body height, and underneath that, more specifically, either short stature or tall stature. Another example, with MedDRA, we can see kind of a similar type of flow going from respiratory, thoracic and mediastinal disorders down to the common cold. But within MedDRA, each of the different levels, it's a little bit set and it has a name. We go System Organ Class, then High-Level Group Terms, High-Level Terms, Preferred Terms, Lowest Level Terms.
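As a rough illustration of the hierarchy described here, the following Python sketch stores a toy child-to-parent map and walks upward from a specific term to the more general ones. The term labels are the ones named in the talk; the map itself is a simplified assumption and omits any intermediate levels the real HPO may contain, and the helper names `ancestors` and `rolls_up_to` are hypothetical.

```python
# Toy child -> parents map. Labels come from the talk; the structure is a
# simplified assumption, not a copy of the real HPO graph.
PARENTS = {
    "Short stature": ["Abnormality of body height"],
    "Tall stature": ["Abnormality of body height"],
    "Abnormality of body height": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],  # most general term, no parents
}


def ancestors(term, parents=PARENTS):
    """Return every more general term reachable from `term` (excluding itself)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            # A term may have several parents in a directed acyclic graph.
            stack.extend(parents.get(node, []))
    return seen


def rolls_up_to(term, target, parents=PARENTS):
    """True if `term` is `target` itself or one of its more specific descendants."""
    return term == target or target in ancestors(term, parents)
```

Because a lookup like `rolls_up_to("Short stature", "Abnormality of body height")` only follows stored parent links, any analysis can group specific terms under a broader one without re-reading the original text.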
But still we have that flow, we have the relationships and we have defined language. That's kind of the overview of what structured data is and the vocabularies, but I haven't super gone into why it matters. When we have our data structured, it means it's ready to be analyzed. We don't have to worry if a doctor wrote down a name of a disorder in a slightly different way. So multiple ways of saying things are not allowed. For instance, if we are looking at some structured data, we aren't going to find ADHD, all caps, adhd, no caps, ADD, attention deficit disorder, attention deficit hyperactivity disorder. We're not going to find all of those variations. If we're using ICD-10, we're going to see F90, and maybe some of its slightly more specific versions of ADHD. But we know everything classified there, and we can roll up if we need to, is under that F90 term. Same with SNOMED-CT, that has a code, same with HPO, that has a code. The other advantage of structured data is, it's ready for analytics. Everything is clean and known and relationships are defined.
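To make the ADHD example concrete, here is a minimal Python sketch of what a standardized language buys you: several free-text spellings collapse to one code, so counting becomes trivial. The synonym table is a hypothetical illustration built from the variations listed in the talk, not a real terminology service, and `normalize` and `count_codes` are assumed names.

```python
import re
from collections import Counter

# Hypothetical synonym table: every variation mentioned in the talk maps to
# the single ICD-10 category the speaker cites (F90).
SYNONYMS = {
    "adhd": "F90",
    "add": "F90",
    "attention deficit disorder": "F90",
    "attention deficit hyperactivity disorder": "F90",
}


def normalize(mention):
    """Map a free-text mention to a code, or None if it is not in the table."""
    # Lowercase and collapse whitespace so "ADHD" and "adhd" look the same.
    key = re.sub(r"\s+", " ", mention.strip().lower())
    return SYNONYMS.get(key)


def count_codes(mentions):
    """Count mentions per code, ignoring anything that could not be mapped."""
    codes = (normalize(m) for m in mentions)
    return Counter(code for code in codes if code is not None)
```

With the raw strings, five differently written mentions would look like five different things; after normalization they are a single row in a count table.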
I add my little caveat here that, there's always going to be small mistakes and maybe it's not 100% perfect, but that is fine. It is clean. It is more known. The relationships are more defined than unstructured data. We can rely on it to answer questions. All right. Now we move into natural language processing. I begin with a very broad statement, NLP is many things. It's an interdisciplinary field. It combines stuff from linguistics, computer science, artificial intelligence, data science, statistics. It's a little bit of everything. In the same way, it's composed of a little bit of everything. Its uses can be applied very broadly. In my head, I always go to the example of robots talking to each other because it seems like a cool version, frankly, of NLP. While that is part of it, it's also the more simple and maybe less flashy uses. So identifying, excuse me, parts of speech, or finding all named people in a Wikipedia article. It's the chat bots that you see when you go into a site, or automatic translations from one language to another. It's just a little bit of everything.
When you combine NLP with real world data, with some of those big unstructured sources, that's when we can get to the patient voice. NLP is going to translate the unstructured data into structured data. Our verbatim written language gets put into a defined vocabulary. So what was once unusable and unanalyzable becomes analyzable, becomes usable. Then we also have different NLP models, and I go into this bit a little more deeply on the next slide. But different models can apply different vocabularies, which give us different lenses for viewing the same data. I think kind of one of the things that I think is really useful and helpful is, this allows us to process data at scale. This scale is virtually impossible for humans. Think millions of social media posts in minutes, seconds. It depends on your computation power. These are tasks that would take tens of person years, hundreds, depending on your level. We would be so limited in the different vocabularies and what we could answer.
Drilling down a little bit into NLP models, on the left, we can see it's the same post every time. Let's say this was taken from one of the social media forums we might look at. Maybe from a diabetes specific forum. I was recently diagnosed with gestational diabetes after failing my one hour glucose test. This really sucks. I can't eat anything good. On the right, we can see each different model which is utilizing a different vocabulary, is going to pull out something a little bit different. In our first model, the green one, where we're applying the CPT codes, it's going to pull out that one hour glucose tolerance test and give it a code, 28084. On the next one, we use SNOMED. Now we're going to get gestational diabetes mellitus that has a code as well. When we use ICD-10, we're getting that same bit as we did in SNOMED, but now we're also able to add initial encounter. So we know when it was occurring, and this is the first one.
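The idea of several models reading the same post through different vocabularies can be sketched in Python as below. This is only a toy under stated assumptions: each "model" is a plain phrase lookup rather than a trained NLP model, the `LENSES` table and `annotate` function are hypothetical, and concept codes are left out on purpose because a real system would take them from the licensed vocabularies.

```python
POST = (
    "I was recently diagnosed with gestational diabetes after failing my "
    "one hour glucose test. This really sucks. I can't eat anything good."
)

# Hypothetical phrase -> concept tables, one per vocabulary "lens".
# Codes are deliberately omitted; only the concept labels from the talk appear.
LENSES = {
    "CPT": {"one hour glucose test": "one hour glucose tolerance test"},
    "SNOMED-CT": {"gestational diabetes": "gestational diabetes mellitus"},
    "ICD-10": {"gestational diabetes": "gestational diabetes mellitus"},
}


def annotate(text, lenses=LENSES):
    """Return, per vocabulary, the (verbatim span, concept) pairs found in text."""
    lowered = text.lower()
    results = {}
    for vocab, phrases in lenses.items():
        hits = []
        for phrase, concept in phrases.items():
            start = lowered.find(phrase)
            if start != -1:
                # Keep the original wording alongside the standardized concept.
                hits.append((text[start:start + len(phrase)], concept))
        results[vocab] = hits
    return results
```

Calling `annotate(POST)` gives a procedure-type hit under the CPT lens and a diagnosis-type hit under the other two, while the quality of life sentences at the end of the post match nothing in any of these tables.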
One of the interesting things is in that last one where we have, this really sucks. I can't eat anything good. A lot of the vocabularies, like I said, they're all created with a specific purpose and there's not too many vocabularies that are dedicated to, I would say, the patient voice or dedicated to quality of life. Which it's interesting. If we go back to what we were saying at the beginning, that patient centricity, awesome that we can learn what tests happened and what the diagnosis was in the end. But we also want to include how the person was feeling. How is this affecting their quality of life? Quality of life, like I was saying, it's one of those pieces that it benefits so much from the patient voice. While I would say we've definitely been doing it before, it's been harder to get at scale.
A lot of the quality of life has been coming through kind of the lens of what the doctor thinks is important for their quality of life. What someone else is saying is important. Since vocabularies, they're all created with a purpose, at RLS, right now we have SPEC-F. This is a framework for classifying quality of life factors and commentary. SPEC-F stands for social, physical, emotional, cognitive, functional. Think of these as kind of buckets or classes that these factors, this commentary can go into. I call SPEC-F a classifying framework instead of maybe a standardizing one, because what it can do is keep a lot of the patient's language choices intact. By keeping them intact, there might be instances where specific language is very, very important. Now, before we go into the next slide, we have another poll. This poll is, are you currently using social media as a real world data source to assess quality of life? Yes, regularly, infrequently, but I do, or never?

John Lyons: 
We have some responses coming in and it looks like one or two more here. The responses are somewhat evenly split between never and infrequently, but I do. Most report never.

Dr. Margaret Bray:
Awesome. Then some of the examples that Nardin is going to go through on the later slides should be really relevant and hopefully show you what can be done. A little bit of the art of the possible, if you will. A little bit further look at SPEC-F before we go to those examples, we have our different buckets. We have a different post. Looking into the first one, I wouldn't socialize or do much of anything for fear I would make a mistake. That, wouldn't socialize, is going to be able to be pulled out. We know that this is social quality of life commentary, and then we can add in other factors of, what's the cause of it? What can we do over here? Is there a why? But having that, it is social, it's affecting their quality of life. Same with functional. He gets lost driving familiar places, forgets dates, his address and is very disorganized. That, getting lost when driving, is going to trigger kind of our functional bucket. That's kind of a broad overview of how we are using SPEC-F at RLS. Now, I will pass this on to Nardin.
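The bucket idea described above can be sketched in a few lines of Python. This is a hypothetical keyword illustration, not how RLS implements SPEC-F: the cue phrases are only the two examples given in the talk, the other buckets are left empty because no examples were given for them, and `SPEC_F_CUES` and `classify` are assumed names.

```python
# Hypothetical cue phrases per SPEC-F bucket. Only the two examples from the
# talk are filled in; the remaining buckets are intentionally empty.
SPEC_F_CUES = {
    "social": ["wouldn't socialize"],
    "physical": [],
    "emotional": [],
    "cognitive": [],
    "functional": ["gets lost driving"],
}


def classify(post, cues=SPEC_F_CUES):
    """Return (bucket, verbatim phrase) pairs for every cue found in the post."""
    lowered = post.lower()
    found = []
    for category, phrases in cues.items():
        for phrase in phrases:
            idx = lowered.find(phrase)
            if idx != -1:
                # Keep the patient's own wording rather than a standardized term.
                found.append((category, post[idx:idx + len(phrase)]))
    return found
```

Returning the verbatim phrase next to the bucket mirrors the point that a classifying framework can keep the patient's language intact instead of replacing it with a standard term.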

Dr. Nardin Farid: 
Thank you so much, Margaret. I'm going to go into a couple case studies just to show how we do what Margaret was saying and what we were able to do previously. But before I do that, I wanted to mention that because we're using social media, it's really an agnostic tool. There is no limitation based on disease state. If people are posting about that disease on social media, then we're able to aggregate it and gain insights in that space. These are just a few examples, but we're not limited to anything. The quantity of the data is going to depend on the disease state. So if we're looking at something super common, then that's going to have a lot more data compared to a rare disease. Which still will have a good amount, but because it's rare, obviously it's only going to impact a certain amount of people.
I just want to say that. Today, for the case studies, we're going to do one on migraines and then the next one will be on diabetes. To start off on the migraine one, this client was a patient access group, and they wanted to gain patient insights into high frequency episodic and chronic migraines. They wanted to learn more about what's happening with patients. What are their symptoms. What are their comorbidities. What's the treatment pathway. Really, they wanted to evaluate the utility of existing PRO or patient-reported outcome instruments that are being used in the space. They wanted to just learn as much as they can about what's happening with patients in the space. RLS looked at post-level data from 91 external sources, including social media and safety databases, in order to build around 7,800 unique patient profiles. Important to mention, when I say 91 sources that included social media, when you hear social media, you might think of Twitter, Reddit, the general social media sites that you and I may use day to day.
But there's a lot of specific disease ones that actually have a lot of information on them that are really useful and have communities where people go to ask for support and to lean on one another and to discuss their experiences. There's a great wealth of information within those types of sources. Our sources aren't limited to just the general social media ones, but we built upon them so that we could get a more wide view about what's happening for patients in that space. We use our NLP in order to structure the really unstructured social media data that's already out there. Like Margaret was mentioning earlier, we use our SPEC-F classification in order to learn more about treatment, symptoms, comorbid conditions, and what's happening within those conversations. Then the structured data is able to be analyzed and we're able to pull quantitative and qualitative insights. Having both types of insights are really useful. We're able to look at disease presentations and treatment switching. What makes somebody stop taking a medication versus start taking a new medication and other quality of life factors that are present amongst those SPEC-F categories.
By using technology and using NLP, you save a lot of time. You gain more about what's happening in the patient experience and you don't have to do the traditional research methodologies that Margaret went over earlier in the webinar. We didn't have to worry about what questions we were asking and the questions that we're not asking and what kind of concepts are being covered or not covered, because we were able to tap into discussions and experiences that patients were having in the real world. The analysis of those self-reported unguided commentary is really important and gives us a really strong patient relevant approach in order to evaluate those existing PRO instruments and the utility of them. That's what happened with the migraine case study. Then the next case study that I wanted to discuss is a diabetes one.
This was actually a Medical Affairs group who wanted to learn more about what's happening in the space around their diabetes medication. They had a medication that was already accepted and approved, and so they wanted to learn more about what's happening with the patient experience in that space. In Medical Affairs, they had a lot of information from healthcare providers and key opinion leaders. But the need for hearing patients directly and amplifying that patient voice is really strong throughout all the functional areas in pharma, but specifically in Medical Affairs. That way, the language that's being communicated to healthcare providers matches the language that patients are communicating to healthcare providers in their interactions. We took a similar approach here where we used our NLP in order to aggregate and analyze social media data. Here, 56 sources were analyzed with over 94,000 reports to create around 29,000 patient profiles.
Here, Diabetes Daily is an example of a disease specific forum that adds a lot of value to our data set alongside the conventional social media sites that you may be thinking of, like Twitter and Reddit. NLP was used in order for us to be able to learn more about what's happening with treatment annotations and the common topics of conversation in the diabetes space. Some clear, actionable insights were derived from the engagement. Two that I'll focus on, one is prescriber acceptance. Hypoglycemia obviously is a major concern in diabetes with certain medications. They were able to show that with this medication or with other medications, it contributed to dose splitting and this medication had reduced hypoglycemic events. That was great to be able to discuss with prescribers.
Then formulary placement, because of the hypoglycemic events, by proving that your selected populations that had higher incidence of hypoglycemia would do better on a medication like this, that prevented hypoglycemic events, this medication was able to be higher in the formulary. Those are the two case studies around migraines, and then this one around diabetes in different functional areas and in different diseases because social media can really be used in any disease and in any functional area because it's really the outputs and the outcomes and the questions that are being asked that may change. Those are the case studies. With that, I'm going to pass it to Stephen for the Q&A. Feel free to leave any questions in the question box below.

Stephen Doogan: 
Thanks Nardin and Margaret for a great presentation there. Hopefully you can hear me and see me okay. We've got a few questions sort of in here and coming in. I think I'll start off with the one about diseases. There's a question around, are there any limitations on what diseases you can support in your system? I think in line with what Nardin was saying, sort of agnostic, if people are talking about it online, we can access it. The key caveat, though, that we have to make clear is that we are looking at publicly available sources of data. If someone's talking privately about something and they're in social circles, we don't pick up things of that nature. There's that limitation, which will always be a limitation which we would never want to break anyway. But in terms of the types of diseases or where commentary really can sort of expand and grow very rapidly on social media, it's often with those sort of chronic diseases versus acute.
Chronic diseases often have great coverage. Acute diseases, we're starting to see a lot more post pandemic as well of just having just a lot more high frequency reporting even around acute events. Probably in line with what was happening with COVID. Just people becoming more sort of reporting socially, maybe even if they haven't in the past. The limitations are really limited to what people talk about as well. We really see a future where people really are going to put more and more of this commentary out there. It's really, most diseases we have coverage for, and getting started with those diseases is probably a good start so that you can sort of then have a framework for other data that may be generated in the future.
Actually the second question ... Well, not the second question. It was actually the fourth or fifth question. How to access information on private disease specific communities, e.g. on Facebook. Well, we don't, is the answer. We have worked on projects though where we work with the owners of the Facebook Pages. Often that can be the sponsor themselves. That's actually the scenarios that we've worked in when there's been Facebook campaigns. We've been collecting the unstructured texts that comes through from those private communities and doing everything from just sort of identifying if there are any sort of concerns using automated approaches to see if there's anything like adverse events in there to also doing things like insight extraction. Just that we don't go to private communities to access data unless we're given access.

Stephen Doogan: 
I'm just going on to the next one. There's a lot of questions here, so I'll have to pick. Has the migraine case example, has it been presented or published? What are the key sources of data for the study? The migraine work has actually been internal work. We do lots of publications of our work just for reference. Our migraine work was actually quite strategic long-term work. It didn't actually make it into publications. With that being said, across a number of diseases we've done that and we could do so in migraine. Key sources of data for the study, obviously social media. So public access social media, where people talk about migraine. This is actually, as someone else asked this question as well, what sources of data? I'll give a more detailed answer here.
The types of social media data that we have, you think Twitter, Reddit, those are the sort of more common, YouTube as well, sort of video platforms that have comments on them, sort of very much in the public zeitgeist, commonly used. We call these mainstream social networks. We also have disease specific social networks as well and disease specific forums. This is when, if we've got a disease around migraine or multiple sclerosis or diabetes, these are sources that are specific to that disease. In the case of migraine, we have some of those sources as well where the sources are focused only around migraine. Then sort of in addition to that, you've got sort of just text based forum boards as well. Which could be general health and wellness, they're not specific to a disease.
You also have other sources, which are actually kind of interesting, the sort of public doctor Q&A. The depth of the conversation may not be too granular, but there's at least some directional insights that can be drawn from these sort of public, not medical advice, but doctors and patients asking for educational advice. Not medical advice, but educational advice. We've drawn insights from there. Last but not least, another source of social media data might be something like drug reviews. People give reviews of side effects and drug effects and things like that. We also do look into things like that as well as a social media source. In terms of the migraine study specifically, we also used, I believe it was drug safety databases. Specifically, it would've been the FDA FAERS that we were using, I think, to draw a number of different insights actually from that source of data. Then we brought the two data sources together. So structured them in the same lens, analyzed them together.
Just going through the list. How long did these exemplary projects last? It's actually a really good question. There's a number of different answers to this question. Generally, what happens is that they can be sort of an initial three months that often leads to a cascade of different sort of questions and projects. We're not trying to sort of sell you too much too quickly kind of thing. I think there's a lot to understand about this. Typically speaking, we can generate insights in three months. If you need something that is a little bit more, could you look into a disease in a specific way for us? But we also have off the shelf analytics and data. What we haven't sort of shown today is what's emerging is a data platform which will have some sort of version of a sort of structured metadata that people can access as well.
That would be more like, it's not about a project's length of time, it's about the length of your agreement to that data access. Generally speaking, we've worked on projects that have ranged from really anything from one month to 24 months basically. Which is our sort of longest contract. Which is, I think, we also have contracts that are still ongoing that are longer as well actually. That should answer that question. Then let's just see if we've got others here. How do you deal with the noise factor around the cleanliness of the data? There's a number of different ways that we have to go about this. Actually, typically when we first started we heavily relied on manual annotation, so we could identify what you can manually in text. In principle, we've been really putting a lot of work into how we set up annotation guidelines. It's been an evolution for us as to how we set up the questions that we're asking and how we sort of frame the answers in terms of how we can actually answer specific questions as well.
In addition to that, we have three part tests, is one particular sort of practice that we've used as well. Multiple people reviewing specific outputs. Then what we're going to see sort of going forward and over time is sort of this emergence of, I think, a new set of modeling and a maturing of sort of deep learning models that will really have us doing a lot more automated sort of cleaning of the data as well. Where right now, parts of it are automated, parts of it are manual, but we see that sort of continuum changing over time. There was a sort of final question here around costs, which I'm not going to answer on here publicly, but we'll have someone follow up with those of you who've asked that question. What I will say is that, working with us is not cost prohibitive. It is the type of engagement with us that could be seen as a project done in three months, and you could get started in a scenario like that.
Ways to sort of start small and we can further iterate with you about that.
