In this webinar, a panel of experts from Partnership for Health Analytic Research (PHAR), Genentech, the University of Pennsylvania School of Medicine, and the Hospital for Sick Children (SickKids) at the University of Toronto embarks on a deep dive into code algorithm validation for real-world evidence.
- Case studies involving algorithm validation
- Reference group selection for these case studies
- Algorithm validation and machine learning
- Why validation is done so infrequently
Michael Broder, MD, MSHS, begins this webinar by sharing how he became interested in validation studies. After recognizing an opportunity for combining clinical and methodological training to conduct health economics and outcomes research (HEOR) in an applied setting, Michael left academia to work in industry. Despite being a small team of 17 researchers, PHAR has published more than 800 scientific articles involving both primary and secondary data.
“My involvement in both of those things made me realize … that validation was not a step that was done, it seemed to me, nearly enough.”
The Food and Drug Administration (FDA) recently released guidance on how to assess electronic health record (EHR) and claims data to support regulatory decision making. If real world data (RWD) is used to support a claim for a particular product, the FDA requires complete verification of the study variable. For example, if RWD suggests that a drug for advanced cancer treatment could be used at earlier stages, the outcomes must be confirmed in each patient.
“That’s a very high bar, and they recognize that if that’s not feasible, then assessing the performance of the operational definition might suffice.”
In this case, if it is not possible to verify that each patient had early-stage cancer, an alternative is to present evidence that the algorithm performed well. Most HEOR researchers are not required to meet that standard, but Michael notes that it is something worth thinking about. Although there has been an explosion in RWD publications over the past 25 years, only approximately 1 in 20 studies has a corresponding validation.
The rest of the panel joins Michael to discuss their personal experiences with validation studies, beginning with their first studies that involved validating an International Classification of Diseases (ICD) algorithm. Eric Benchimol, MD, PhD, from SickKids shares that he wanted to investigate the epidemiology of pediatric inflammatory bowel disease (IBD) onset in Ontario, Canada, for his doctoral research, and was advised to look at the ICD codes to ensure that true IBD patients were being assessed. Interestingly, the likelihood of truly having IBD if you had a single code for Crohn’s or colitis was less than 10%. Thus, Eric developed an algorithm to accurately identify children with IBD, and brought awareness to the fact that algorithms were not being developed and validated. Since then, Eric and colleagues have also developed reporting guidelines for validation studies.
“Our goal over the past 10 years was to improve the transparency of this type of research and now we’re … starting to work on emphasizing the methods, both for validation and for conduct of this research.”
Anil Vachani, MD, MS, from the University of Pennsylvania School of Medicine next shares that his work is largely focused on cancer screening and treatment. Early on, Anil and colleagues recognized that molecular testing can be performed on cancer patients, particularly those with lung cancer, and that these results can lead to the selection of targeted therapies. Subsequently, Anil began identifying cases using the Pennsylvania Cancer Registry, which contained claims from the state through Medicare and insurance, and created a population-based cohort of several thousand lung cancer patients. However, there were no specific Current Procedural Terminology (CPT) codes for genetic testing at the time.
“We weren’t sure if these codes would actually find our exposure of interest, and so we embarked then on a several month detour where we actually had to validate these CPT codes to see what their sensitivity and specificity was for our exposure.”
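The validation Anil describes reduces to a 2x2 comparison of the code-based flag against chart review as the reference standard. The sketch below shows the arithmetic; the counts are invented for illustration and are not from Anil's study.

```python
# Hypothetical illustration: sensitivity and specificity of a CPT-code-based
# flag for molecular testing, measured against chart review as the reference
# standard. All counts are made up for the example.

def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # flagged among patients truly tested
    specificity = tn / (tn + fp)   # unflagged among patients truly untested
    return sensitivity, specificity

# Suppose chart review of 400 patients yielded these (invented) counts:
tp, fp, fn, tn = 150, 30, 50, 170
sens, spec = sensitivity_specificity(tp, fp, fn, tn)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Here a sensitivity of 0.75 would mean the codes miss a quarter of patients who actually received testing, which is exactly the kind of gap chart review surfaces.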
Although Karina Raimundo, MS, first conducted algorithm validation for sarcopenia in graduate school, she highlights an example from later in her career in which she had the opportunity to help support study planning and decision making. While launching a product at Genentech, Karina and colleagues recognized that a specific ICD code did not exist to categorize chronic spontaneous urticaria, and they wanted to describe the disease and better understand its epidemiology. The best way to go about this task was to develop an algorithm with assistance from clinical experts in the field, and then validate it.
“It was helpful to have a code or … an algorithm to help us truly have confidence that we had the right patients.”
In the next portion of this webinar, Michael wanted to learn more about the reference groups that each panelist used in these cases. Anil notes that since they used the state cancer registry to identify lung cancer cases, they were confident that their population was correct, but had to use standard groups or chart reviews to identify molecular testing. For any person thought to have had molecular testing based on an existing CPT code, they scoured medical records and pulled out individual test results to see if the mutations of interest were present.
“Ultimately, … our inferences that we wanted to make were whether those test results were informing therapeutic decision making and whether that varied by race, and so we felt like there was really no path forward without having to actually go to the charts and confirm that finding, given the uncertainty around the CPT codes.”
Karina shares that what was interesting about the early sarcopenia example was that they were diagnosing patients during the process rather than relying on a physician’s assessment or a pre-existing diagnosis, which she notes was a bold approach. She further clarifies that the appropriateness of this method depends on the type of disease: for cancer, there is likely no need to diagnose patients again, since they are probably already accurately diagnosed, but other poorly diagnosed conditions would benefit from this approach.
Since all of these methods are labor intensive, Michael inquires if any of the panelists has experience with machine learning or other computational methods for validation. Eric has developed algorithms to identify disease severity in IBD patients using machine learning, but always with some sort of clinical reference standard. In 2010, Eric also performed a systematic literature review of all validation studies to date, and found that a few studies at that time used Bayesian analysis to define a probability of having the disease based on administrative data. There were also studies that validated one source of administrative data against another.
“I think … that if you know that one of those sources is the truth and is the true reference standard, that can work, but … in the end you really do need a good reference standard in order to know that your validation is effective.”
Eric further shares that early on in the pediatric Crohn’s and colitis algorithm work, they used a database of all patients seen with these diseases at the only pediatric hospital in the city. Therefore, they assumed that everybody living in the city who was not seen with Crohn’s and colitis likely did not have IBD at that time (i.e., a negative reference standard), which was advantageous in that they could get a validation cohort that mimicked disease prevalence in the actual population. However, they did not anticipate that the algorithm functioned differently in children of different ages. Eric explains that there were older teenagers who were being seen by adult gastroenterologists, meaning their database was incomplete, which impacted their estimates for the algorithm.
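The advantage Eric describes, a validation cohort that mimics population prevalence, matters because positive predictive value depends heavily on prevalence, even when sensitivity and specificity are fixed. A minimal sketch (with assumed accuracy figures, not Eric's actual results):

```python
# Sketch: why the prevalence of the validation cohort matters. With fixed
# sensitivity and specificity, PPV falls sharply as prevalence drops, so an
# algorithm that looks excellent in a hospital cohort can perform poorly in
# the general population. Accuracy figures here are assumed for illustration.

def ppv(sens, spec, prev):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

for prev in (0.50, 0.05, 0.005):   # specialty clinic vs. general population
    print(f"prevalence={prev:.3f}  PPV={ppv(0.90, 0.95, prev):.2f}")
```

At 50% prevalence the PPV is above 0.9, but at 0.5% prevalence it drops below 0.1, which echoes the earlier observation that a single Crohn's or colitis code carried less than a 10% likelihood of true IBD.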
“Luckily, we had plans to validate it in other centers … so we had a backup plan of what we would do, but that was sort of eye-opening, that you can’t make assumptions. You really need to make sure your reference standard is good.”
Karina has experience in using machine learning to help define an algorithm and find a good predictive model, and explains that although this can be good at face value, the result can be complicated, and other researchers may not adopt that algorithm. She has also seen a lot of pushback from clinicians in this regard.
“I have tried to understand why people want to [use] machine learning … purely to develop algorithms, but I do think we need the clinical expertise to help us guide how we come up with the actual algorithm so that it will have … more external validity.”
Michael next asks the panel why they think algorithm validation is performed so infrequently. Anil believes it is true that journals do not view validation manuscripts as high enough impact to publish on their own, but they do want the published work to be valid, which he remarks is a catch-22. As an associate editor for a pulmonary journal, if Anil finds that a code was used in a study but was not validated, he questions if it is worth publishing.
“I think that there is this fine line that … authors and researchers are going to have to walk going forward. … I think we will care more about this and we have to figure out how to get it into our workflow in a more robust fashion.”
Eric also notes that it is worthwhile for editors to know that validation studies end up getting cited many times because they are critical for research. However, these studies are difficult to get funded on their own since they do not directly impact clinical care, health systems, or health policy. Michael and Karina have also had similar experiences in which the incentives for these studies were not present; they were not interesting, they were costly, and they could delay the primary study by several months. Michael has therefore conducted a “poor man’s validation” several times, in which he has asked clinicians to simply take a closer look at their own billing codes.
“It’s not much … and I wouldn’t want to argue that that’s the best way to do it but at least it has given me a little more confidence that we’re on the right track.”
To conclude this webinar, Michael shares that he tries to do his part when reviewing papers by ensuring that authors mention whether or not they conducted a validation. Estimating the amount of error in an RWD study is very complicated if a validation was not done. Michael notes that giving at least a range of how far off population identification or outcome assessment could be is better than simply stating that misclassification or miscoding could have occurred during the study.
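One standard way to express the "range of possibilities" Michael mentions is a simple quantitative bias analysis. The sketch below uses the Rogan-Gladen correction, a common technique for this purpose (chosen here as an example; the webinar does not name a specific method), with invented numbers:

```python
# Sketch of a simple quantitative bias analysis using the Rogan-Gladen
# correction: given the apparent (observed) prevalence from a code-based
# definition and assumed bounds on the algorithm's sensitivity and
# specificity, compute the range of plausible true prevalences.
# All numbers are invented for illustration.

def rogan_gladen(apparent_prev, sens, spec):
    """True prevalence implied by apparent prevalence and algorithm accuracy."""
    return (apparent_prev + spec - 1) / (sens + spec - 1)

apparent = 0.10  # 10% of the cohort meets the code-based definition
for sens, spec in [(0.95, 0.99), (0.80, 0.95)]:  # plausible accuracy bounds
    true_prev = rogan_gladen(apparent, sens, spec)
    print(f"sens={sens}, spec={spec} -> true prevalence ~ {true_prev:.3f}")
```

Even a crude range like this tells a reader more than a boilerplate sentence that misclassification "could have occurred."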
- What do you do if your validation study reveals poor performance?
- What is “good enough” for sensitivity and specificity?
- Is there a simpler way to get a validation sample that doesn’t require all that work?
- How do algorithms vary by source of RWD?
- Is there a way to validate ICD algorithms when we have only claims data?