A research team led by Columbia University has developed an open-source framework designed to streamline and accelerate artificial intelligence research using medical data, addressing long-standing challenges in data standardization, reproducibility, and interinstitutional collaboration.
This framework, called MEDS, introduces both a standardized data format and a growing ecosystem of interoperable tools aimed at supporting the development and evaluation of machine learning models using clinical data.
A study describing this framework was published in NEJM AI.
Researchers say the framework could help alleviate technical barriers that currently slow health AI research and make it difficult for scientists to reproduce research findings or compare models across studies and institutions.
“MEDS is an easy way to make all different sources of electronic health record (EHR) data look the same to your code, regardless of which hospital, clinic, or EHR software system the data comes from. MEDS allows the sharing of code that can be used to train models in different clinical settings, without having to share sensitive patient data, and often without even having to take the more difficult step of fully ‘harmonizing’ the data into a consistent clinical vocabulary. This infrastructure allows researchers to spend less time rebuilding pipelines and more time answering clinically meaningful questions.”
Dr. Matthew McDermott, Assistant Professor of Biomedical Informatics and Research Leader, Columbia University
Standardizing health data for clinical AI research
Electronic health record data is often stored in facility-specific formats that require extensive pre-processing before being used for AI development. According to the study authors, these inconsistencies can result in significant duplication of effort, limit collaboration, and impede reproducibility.
MEDS addresses these issues by providing a lightweight, extensible standard for representing longitudinal clinical data in machine learning workflows. The framework also includes open-source tools that support data transformation, preprocessing, benchmarking, and model development.
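To make the idea of a shared event layout concrete, here is a minimal Python sketch. It is not the MEDS specification or its tooling: the `Event` fields, the two site export formats, and the helper names are all assumptions chosen for illustration, and the MEDS documentation should be consulted for the real schema. The sketch only shows the general pattern the article describes, in which each site writes one small converter and the analysis code is shared.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One clinical observation in a shared layout (field names are illustrative)."""
    subject_id: int
    time: datetime
    code: str
    numeric_value: Optional[float] = None


def from_site_a(rows):
    """Hypothetical site A: dicts with ISO timestamps and string values."""
    return [
        Event(
            subject_id=int(r["patient"]),
            time=datetime.fromisoformat(r["charted_at"]),
            code=f"LAB//{r['test']}",
            numeric_value=float(r["result"]),
        )
        for r in rows
    ]


def from_site_b(rows):
    """Hypothetical site B: tuples of (mrn, 'MM/DD/YYYY HH:MM', item, value)."""
    return [
        Event(int(mrn), datetime.strptime(ts, "%m/%d/%Y %H:%M"), f"LAB//{item}", float(value))
        for mrn, ts, item, value in rows
    ]


def latest_value(events, code):
    """Shared analysis code: most recent numeric value per subject for one code."""
    latest = {}
    for ev in sorted(events, key=lambda e: e.time):
        if ev.code == code and ev.numeric_value is not None:
            latest[ev.subject_id] = ev.numeric_value  # later events overwrite earlier ones
    return latest


# Made-up example records; no real patient data.
site_a_rows = [
    {"patient": "1", "charted_at": "2024-03-01T08:30:00", "test": "creatinine", "result": "1.1"},
    {"patient": "1", "charted_at": "2024-03-02T09:00:00", "test": "creatinine", "result": "1.4"},
]
site_b_rows = [(7, "03/01/2024 10:15", "creatinine", 0.9)]

print(latest_value(from_site_a(site_a_rows), "LAB//creatinine"))  # {1: 1.4}
print(latest_value(from_site_b(site_b_rows), "LAB//creatinine"))  # {7: 0.9}
```

In this sketch, `latest_value` never sees either site's raw format, so the same function could in principle be shared between institutions while the data stays where it is.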
The authors emphasize that MEDS is specifically designed for AI and machine learning applications and complements, rather than replaces, existing clinical data standards.
This framework aims to support a wide range of use cases in biomedical AI research, including predictive modeling, representation learning, multimodal modeling, and large-scale benchmark studies. The ecosystem is open source, allowing researchers in academia, healthcare, and industry to contribute tools and extensions.
“Great success in AI has always been driven by the ability of communities to come together and collaborate in a decentralized, open-source fashion around tools, model parts, and ultimately an ecosystem that allows us to build bigger models that scale to large datasets,” McDermott said. “These impressive results in MEDS simply reflect the benefits that the community can gain by sharing tools and abstracting common parts of pipelines into shared libraries that can be used with everyone’s data.”
The study also highlights the importance of reproducibility and transparency in medical AI development as machine learning models move increasingly toward clinical deployment.
Researchers say they hope MEDS will foster broader collaboration across institutions and accelerate innovation in clinical AI, while promoting more transparent and reproducible science. MEDS has already been adopted by 21 institutions in 12 countries.
Source:
Columbia University Irving Medical Center
Journal reference:
McDermott, M. B. A., et al. (2026). MEDS: Emerging data standards and ecosystem for health AI research. NEJM AI. DOI: 10.1056/AIra2501253. https://ai.nejm.org/doi/10.1056/AIra2501253

