How MSK and CAIA use federated learning to uncover EHR data patterns
When a patient walks into a doctor’s office for the first time, they recount their medical history, summarizing years of health into a few lines on their health record. Cancer patients are often seen at centers that specialize in their condition. Their medical history may not always follow them to their specialists. Current computer models take a "snapshot" of a patient’s most recent lab value or their most recent scan and provide recommendations to predict what might happen next.
However, these brief interactions and limited data points don’t capture the complete sequence of events in the patient’s journey. A research team at Memorial Sloan Kettering Cancer Center (MSK) is building an AI model to help tell the full story, or the timeline of events that led the patient to that particular diagnosis.
Feiyang Huang, a graduate researcher at the Tri-Institutional Program for Computational Biology and Medicine at MSK, explains that “[doctors] look over the charts, over the [patient’s] cumulative history. We’re trying to simulate that process, looking at events as they arrive over time to tell a story of what happened to this patient.” Whether a patient’s cancer has been progressing slowly over ten years or they arrived at the hospital with a stage-four diagnosis, the sequence of events leading to the diagnosis can help inform a clinician’s recommendations and lead to more personalized treatments.
Working with the Cancer AI Alliance (CAIA), Feiyang is building a “time-evolving” AI model alongside his colleagues, Dr. Wesley Tansey (Assistant Professor of Computational Oncology at MSK) and Dr. Francisco Sanchez-Vega (Assistant Attending in Computational Oncology at MSK).
Their goal is to bring together the entire history of a patient to predict the next phase of their medical experience.
This project is one of CAIA’s AI innovation projects. Using CAIA’s federated infrastructure, projects like this one at MSK are building the AI infrastructure for tomorrow: massive foundation models trained on secure, de-identified cross-institutional data.
A well-timed collaboration
This project is the outcome of a well-timed collaboration between institutional strategy and on-the-ground research. Francisco was first introduced to CAIA in January 2025. However, the project truly took shape during an MSK departmental retreat later that year.
"I saw the work that Feiyang and Wes were doing, and I thought that it would be a great fit for federated learning," Francisco recalls.
For Wesley and Feiyang, CAIA provided the missing piece of their puzzle: massive scale.
“We’re already building these really data-hungry models. CAIA came along at exactly the right time.”
Making sense of Electronic Health Records (EHR)
The “raw material” for this project is EHR data. Working with EHR data at scale can present challenges. Unlike a controlled clinical trial where data is collected at standard intervals, real-world medical data can be messy. A patient might visit the clinic one day and then not return for six months. They might have missing data in their record because they missed an appointment.
"The fact that a patient missed a measurement can be informative," Wesley notes. "Maybe they weren't strong enough to get out of bed that day. These irregularly sampled time courses are difficult from a mathematical perspective, but they are essential to capturing the true state of a patient."
EHR data often holds the clues to capturing that “true state,” but extracting information and building stories out of these records presents a challenge for busy clinicians. This is where a sophisticated, time-evolving AI model can help. As Wesley explains: “Every patient almost always gets lab values measured from routine blood draws. There are very intricate patterns an AI model can pick up — signals between lab values, scans, and treatments — that a human simply couldn't explore all at once.”
Giving a clinician the full narrative of a patient’s health trajectory in a digestible format can lead to a more individualized course of treatment and care.
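To make the idea of irregular, partly missing records concrete, here is a minimal Python sketch of one common way to prepare such data for a time-aware model: each visit becomes a row holding the last known value, a flag for whether the lab was actually measured, and the gap since the previous visit. This is an illustration only, not the MSK team's method. The `Visit` class, the `hemoglobin` field, and the `encode` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Visit:
    day: int                     # days since the first visit (assumed unit)
    hemoglobin: Optional[float]  # None when the lab was not measured


def encode(visits: List[Visit]) -> List[Tuple[float, int, int]]:
    """Return (value, observed_flag, days_since_previous_visit) per visit.

    A missing value is carried forward from the last measurement, and the
    observed flag keeps the fact that it was missing visible to the model.
    """
    rows = []
    last_value = 0.0  # placeholder until a first measurement is seen
    prev_day = None
    for visit in sorted(visits, key=lambda v: v.day):
        gap = 0 if prev_day is None else visit.day - prev_day
        observed = visit.hemoglobin is not None
        if observed:
            last_value = visit.hemoglobin
        rows.append((last_value, int(observed), gap))
        prev_day = visit.day
    return rows


# A patient seen on day 0, seen again on day 1 without a lab draw,
# and then not seen again for roughly six months.
visits = [Visit(0, 13.2), Visit(1, None), Visit(181, 10.4)]
rows = encode(visits)
```

Keeping the observed flag and the time gap as explicit inputs is one simple way to let a model treat a skipped measurement or a long absence as a signal rather than discarding it.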
Giving clinicians better predictive tools
Feiyang points out that the volume of data is becoming a burden for even the most talented specialists. In the future, he envisions a model that performs "retrospective analyses" on hundreds of thousands of similar cases in seconds. If a doctor is faced with a rare "edge case" they have never seen before, the AI model can search the database to find how thousands of similar patients responded to different treatments.
Unlike popular AI models, “we’re not going to have 'hallucinations' here,” Wesley says. “We are giving the clinician better tools to help a patient understand where their actual journey is likely to go.”
Improving data diversity in cancer research
Medical AI models are often trained on data from one specific hospital. Wesley points out that MSK’s data is reflective of the hospital’s location in New York City. By leveraging CAIA’s federated network and including data from other regions across the country, the team aims to make the model robust, so that it works for patients regardless of where they live. This diversity is also crucial for building models that can handle different clinical practices and regional health trends.
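For readers curious how training across hospitals can work without pooling records, here is a toy Python sketch of federated averaging, a widely used federated learning scheme. The article does not describe CAIA's actual training protocol, so treat this as an assumption-laden illustration: the site names, the one-parameter model, and the `local_update` and `federated_average` functions are all hypothetical.

```python
from typing import Dict, List, Tuple


def local_update(weights: List[float], data: List[Tuple[float, float]],
                 lr: float = 0.1) -> List[float]:
    """One gradient step of least squares (y ~ w * x) on one site's own data."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return [w - lr * grad]


def federated_average(site_weights: List[List[float]],
                      site_sizes: List[int]) -> List[float]:
    """Average the sites' weights, weighting each site by its sample count."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical sites: each keeps its (x, y) records locally.
sites: Dict[str, List[Tuple[float, float]]] = {
    "site_a": [(1.0, 2.0), (2.0, 4.1)],
    "site_b": [(1.0, 1.9), (3.0, 6.2), (2.0, 3.8)],
}

global_weights = [0.0]
for _ in range(50):
    # Each site trains on its own data; only weights are shared.
    updates = [local_update(global_weights, data) for data in sites.values()]
    sizes = [len(data) for data in sites.values()]
    global_weights = federated_average(updates, sizes)
```

In this sketch only model weights leave each site, never the underlying records. Real deployments typically add further safeguards, which are out of scope here.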
Francisco, who has focused on multimodal data integration for years, sees this as just the beginning. While the project currently focuses on structured clinical data, the model is designed to be versatile.
“The hope is that we can expand this to include genomic data, digital pathology slides, and even radiology images like MRIs and CT scans.”
Creating actionable insights from noisy data
The power of this time-evolving model depends on the structure and quality of the data feeding it. Before CAIA, the team at MSK had already laid a unique foundation for this work. "We saw the amazing work that the Cancer Data Science Initiative team has done here on organizing all of MSK’s data in a uniquely vertically integrated fashion," Wesley notes. This internal effort to ensure all data is organized prepared the team to take advantage of CAIA’s scale.
However, even the best-organized EHR data can be “noisy.” By identifying “intricate patterns” across this massive, noisy dataset, the AI model that this team is building can help clinical teams turn these patterns into actionable insights. As Wesley puts it, “We hope to give the clinician better tools to help them... with these sorts of predictive analytics that actually say, ‘This patient has a much worse prognosis. This patient is at high risk for an adverse event.’”
When the model flags a risk, it can change the conversation in the doctor’s office. Instead of reacting to a crisis, the doctor can get ahead of it. By bridging the gap between "noisy" historical data and future predictions, the model has the potential to move cancer care from a reactive stance to a more proactive one.