Author: Guy Amster, Senior Principal Machine Learning Engineer, Flatiron Health
Real-world evidence (RWE) has long held promise to transform the way we understand cancer, turning the realities of everyday patient care into insights that will inform research, clinical development, regulatory decisions, and treatment strategies. In the age of AI, that promise feels closer than ever.
But as someone who has spent seven years building AI systems for oncology data, I have seen the gap between what AI can do and what it actually does widen. In a survey of 90 real-world data (RWD) purchasing decision makers, 89% rated data quality and completeness as very or extremely important, and 84% said the same of cohort size, a close second.
These findings point to a clear shift in the market. As we rely more on data for decision-making, accuracy, depth, and representativeness matter more, and as that shift accelerates, data quality matters more still. AI does not fix weak data foundations; it amplifies whatever foundations it is built on.
First bottleneck: data you don’t have
In oncology, RWE is most valuable when the clinical picture is complex. However, complex cases are also where weak data foundations break down first.
The complete patient journey exists across biomarker reports, pathology results, and evolving treatment decisions, much of which cannot be captured in a structured field. EHRs are not research databases. They are tools built to help doctors treat patients. To turn that record into usable evidence, we need to know what data should exist, where it might appear, and how to interpret it in context.
In the pre-AI era, missing data often looked like missing data. Now, AI can fill that gap with something that merely sounds plausible: a hallucination. If left undetected, these errors can travel downstream, skewing analyses and masking new or unexpected signals in the data.
That’s why building datasets that reflect the real-world complexity of cancer treatment is the first step before curation or modeling. At Flatiron, clinical and data experts collaborate to build high-fidelity maps of the patient experience, so the data foundation is strong enough to support the models and decisions built on top of it.
Second bottleneck: you can’t outperform the label
Even with good data, there is a second limitation that is equally fundamental. A model cannot produce output of higher quality than the signals used to train and evaluate it. A base model cannot match a human expert “out of the box”, but it can get there through iteration if appropriate labels are available.
In oncology, these labels are best derived from expert human abstraction: clinicians and trained experts who interpret nuance, context, and ambiguity in ways that models cannot reproduce on their own. There is a persistent theory that AI will reduce the need for human expertise. From where I sit, the opposite is true. Every high-performance system I’ve built has required more, not less, investment in human labeling, because without high-quality labels, you can’t reliably measure whether each iteration actually improves performance. Gram-level accuracy cannot be achieved using a scale that only measures in kilograms. And no matter how smart your architecture is, that won’t change.
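The scale analogy can be made concrete with a small calculation. The following is a minimal sketch, not a description of any Flatiron method: it assumes a single binary variable and assumes that model errors and reference-label errors are independent. The function names are hypothetical. Under those assumptions, the measured improvement between two model versions is the true improvement multiplied by (2q − 1), where q is the accuracy of the reference labels, so noisier labels shrink the very signal that iteration depends on.

```python
def observed_agreement(model_accuracy: float, label_accuracy: float) -> float:
    """Expected agreement between a model and its reference labels.

    Assumes a binary variable and independent errors: the two agree
    when both are right or when both are wrong.
    """
    both_right = model_accuracy * label_accuracy
    both_wrong = (1 - model_accuracy) * (1 - label_accuracy)
    return both_right + both_wrong


def observed_gain(old_accuracy: float, new_accuracy: float,
                  label_accuracy: float) -> float:
    """Improvement that the evaluation would report between two versions."""
    return (observed_agreement(new_accuracy, label_accuracy)
            - observed_agreement(old_accuracy, label_accuracy))


if __name__ == "__main__":
    # Illustrative numbers only: a true gain of 2 points, measured
    # against reference labels of decreasing quality.
    for label_accuracy in (1.0, 0.9, 0.7):
        gain = observed_gain(0.95, 0.97, label_accuracy)
        print(f"label accuracy {label_accuracy:.2f}: measured gain {gain:.3f}")
```

With perfect labels the full gain is visible. As label accuracy falls toward 50%, the measured gain approaches zero and becomes hard to tell apart from sampling noise.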
At Flatiron, scaling AI meant doubling down on the interaction between human expertise and machine extraction and continuously iterating until performance was measurable and meaningful. The goal is not to remove humans from the loop, but to expand human expertise in a way that maintains and extends clinical fidelity.
What proves quality, not what merely indicates it
The next problem: modern AI systems can produce datasets that look internally consistent, statistically valid, and analytically useful. These characteristics, once used to indicate quality, are no longer proof of quality. AI-generated data can now pass superficial checks but fail in the areas that matter most: rare cohorts, complex eligibility criteria, and multivariable analyses where small errors combine to create significant distortions.
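A short calculation shows how small errors combine. This is an illustrative sketch that assumes each criterion is extracted independently with the same accuracy, which real data will not satisfy exactly. The function name and the numbers are hypothetical.

```python
def all_criteria_correct(per_criterion_accuracy: float, n_criteria: int) -> float:
    """Probability that every criterion for one patient is extracted correctly,
    assuming independent errors and equal accuracy per criterion."""
    return per_criterion_accuracy ** n_criteria


if __name__ == "__main__":
    # Illustrative numbers only: each criterion is 95% accurate.
    for n in (1, 3, 5, 10):
        p = all_criteria_correct(0.95, n)
        print(f"{n:>2} criteria: {p:.1%} of patients fully correct")
```

Under these assumptions, a cohort defined by five criteria that are each 95% accurate has roughly 77% of patients with every criterion correct, even though each variable looks strong when checked alone.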
For RWE users, this changes the burden of proof. It is no longer enough to ask whether a dataset “looks right”. The question is whether it has been rigorously evaluated against the realities of clinical care and whether its limitations are understood.
At Flatiron, we built the VALID framework, a methodology for evaluating AI-curated oncology data across three pillars. These include automated validation checks that identify internal inconsistencies and impossible values, and replication analyses that compare LLM-derived findings with established clinical outcomes.
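As a rough illustration of what automated checks for internal inconsistencies and impossible values can look like, here is a minimal sketch. It does not show the VALID framework's actual checks. The record schema, field names, and thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PatientRecord:
    # Hypothetical schema, for illustration only.
    patient_id: str
    birth_date: date
    diagnosis_date: date
    treatment_start: Optional[date] = None
    death_date: Optional[date] = None


def find_issues(record: PatientRecord) -> List[str]:
    """Return human-readable descriptions of any rule violations."""
    issues: List[str] = []

    # Impossible value: a diagnosis cannot precede birth.
    if record.diagnosis_date < record.birth_date:
        issues.append("diagnosis date is before birth date")

    # Implausible value: age at diagnosis above an assumed upper bound.
    age_years = (record.diagnosis_date - record.birth_date).days / 365.25
    if age_years > 120:
        issues.append("age at diagnosis exceeds 120 years")

    # Internal inconsistency: treatment cannot begin after death.
    if (record.treatment_start is not None
            and record.death_date is not None
            and record.treatment_start > record.death_date):
        issues.append("treatment starts after death date")

    return issues


if __name__ == "__main__":
    suspicious = PatientRecord(
        patient_id="example-1",
        birth_date=date(1950, 6, 1),
        diagnosis_date=date(2020, 2, 10),
        treatment_start=date(2021, 3, 1),
        death_date=date(2021, 1, 15),
    )
    print(find_issues(suspicious))
```

Checks like these catch records that cannot be true. They say nothing about records that are plausible but wrong, which is why they are only one part of an evaluation.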
The real inflection point: From data to decision-making
RWE is increasingly integrated into complex clinical decision-making. Clinical development teams can use a digital twin approach to pressure test trial designs before enrolling a single patient. Researchers can study outcomes in rare or underrepresented populations with confidence in the underlying data. However, these systems can only produce clinically relevant results if the underlying data are reliable.
In a world where sophisticated models are widely available, the differentiator is not who has the best algorithm. It is who has built a system that consistently produces answers that reflect clinical reality, across edge cases, complex patient populations, and every decision a biopharmaceutical company needs to make.
For life science teams, this changes their mandate. The question is no longer “What can AI do?” The question is, “What can we trust to run repeatedly in the most complex scenarios?” The organizations that win in this next phase will not be those with the most tools, but those with access to all data and the tools to reproduce expert quality in even the most complex clinical settings.
Because the goal in oncology is not just to generate insights. It is to generate insights that you can act on with confidence.
Interested in how Flatiron is shaping the future of oncology? Attend the ASCO Annual Meeting to learn more.

