Federated Learning in Cancer Research

A complex overview of the federated learning process

In October 2025, the Cancer AI Alliance (CAIA) announced the launch of the first scalable platform using federated learning for cancer research.

CAIA comprises the National Cancer Institute-designated cancer centers Dana-Farber Cancer Institute, Fred Hutch Cancer Center, Memorial Sloan Kettering Cancer Center, and The Sidney Kimmel Comprehensive Cancer Center and Whiting School of Engineering at Johns Hopkins. It receives financial and technological support from technology industry leaders Amazon Web Services (AWS), Deloitte, Ai2 (Allen Institute for AI), Google, Microsoft, NVIDIA, and Slalom.

This federated learning platform is the technological foundation that will enable researchers and clinicians to train AI models that learn from our participating cancer centers’ millions of clinical data points while maintaining data security, privacy and adherence to regulatory and ethical standards.

What is federated learning?

Federated learning is a machine learning method for training AI models across multiple sites while preserving the anonymity of individual data. It allows researchers and clinicians to train powerful AI models that learn from participating cancer centers' millions of clinical data points while maintaining data security, privacy, and adherence to regulatory and ethical standards.

Why is data federation useful in cancer research?

Data federation, as implemented by CAIA, is a framework aimed at aggregating AI insights across a distributed network of cancer centers without compromising patient data, privacy, or security. CAIA’s data federation framework is underpinned by an AI training approach called federated learning.

Data federation fosters greater diversity in cancer research. Individual cancer centers only have information about patients who come to their clinics for treatment. Thus, models developed on data from a single institution can learn only about that subset of the population, which limits how well they generalize.

Data federation in cancer research can overcome this limitation by enabling groups of researchers and institutions to work together with a more diverse data set. The key idea that led to the creation of the Cancer AI Alliance is that if cancer centers work together, then we can make demonstrable progress toward accelerating treatments and cures.

Healthcare institutions have a number of safeguards and regulations in place to secure patient data. As a result, healthcare data is siloed, which has made cross-institutional collaboration challenging. Data federation allows for collaboration, helping organizations overcome institutional barriers and accelerate progress.

Lastly, data federation enables scalability and ease of expansion. Unlike centralized data projects that face massive logistical hurdles when adding new partners, the federated network is designed for expansion. Each Alliance member’s data source can be added to the federated framework as an additional edge node.

How does federated learning work?

Federated learning lets researchers improve AI models using data from multiple institutions without sharing sensitive information. First, a researcher sends their AI model to a central system. This system then sends the model to different cancer centers to analyze their local de-identified data. The results of that analysis (but not the underlying data itself) are sent back to the central system and combined. This process repeats multiple times to refine the model. Finally, the improved model is returned to the original researcher, all without accessing identifiable patient data and without any raw patient data ever leaving its home institution.
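The round-trip described above can be sketched in a few lines of Python. This is a minimal illustration, not CAIA's implementation: it assumes a weighted-averaging scheme commonly called federated averaging, a toy linear model, and three made-up sites. The function names (local_update, federated_average, run_federated_training) and all data values are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[float, float]  # (feature, outcome); stands in for de-identified local data


def local_update(weights: List[float], data: List[Record],
                 lr: float = 0.05, epochs: int = 5) -> List[float]:
    """Train a copy of the model y = w*x + b on one site's data and return the new weights."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[List[float]], sizes: List[int]) -> List[float]:
    """Combine site updates, weighting each site by its number of records."""
    total = sum(sizes)
    return [sum(u[i] * n for u, n in zip(updates, sizes)) / total
            for i in range(len(updates[0]))]


def run_federated_training(sites: List[List[Record]], rounds: int = 200) -> List[float]:
    """Central loop: send the model out, collect updated weights, average, repeat."""
    global_weights = [0.0, 0.0]
    for _ in range(rounds):
        # Only weights come back from each site; the records themselves are never collected.
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates, [len(d) for d in sites])
    return global_weights


# Three hypothetical sites whose records all follow y = 2x + 1
sites = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(1.5, 4.0), (2.5, 6.0)],
]
model = run_federated_training(sites)
print(model)  # weights close to w = 2, b = 1
```

In this toy setup the central loop only ever handles model weights and record counts. A real deployment would add secure transport, authentication, and governance controls that are outside the scope of this sketch.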

Does the central orchestration layer see the patient data?

No. The central orchestration layer acts like a conductor for an orchestra; it sends the AI model and instructions to each of the edge nodes (at the cancer centers) but never sees the patient data.

Instead of bringing sensitive patient data to a centralized location, federated learning brings the AI model to the data. The AI models travel to each participating cancer center’s secure data to learn from it locally. Patient data remains safely behind institutional firewalls, and individual clinical data never leaves the institution.

Standardizing data for federated learning in cancer research

Hospitals gather a substantial amount of data on every patient’s health. Imagine combining a single patient’s data with millions of other patient journeys across the country to create powerful AI models that help find new cures.

The key to unlocking this potential is data standardization. This work ensures that the vast amount of information collected during a patient's care can be interpreted and reliably analyzed across different institutions. In other words, the goal of data standardization is to create a common format that is accessible and useful to researchers for collaborative, multi-institutional cancer discovery.
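To make the idea of a common format concrete, here is a small Python sketch in which two sites record the same diagnosis differently and each is mapped into one shared record type. Everything in it is an assumption for illustration: the field names, the StandardRecord schema, the vocabulary in DIAGNOSIS_MAP, and the two site formats are hypothetical and do not describe CAIA's actual data model.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class StandardRecord:
    """Hypothetical common format shared by every site."""
    diagnosis_code: str
    diagnosis_date: date
    tumor_size_mm: float


# Hypothetical local terms mapped onto one shared vocabulary
DIAGNOSIS_MAP = {
    "breast ca": "BREAST",
    "breast carcinoma": "BREAST",
    "nsclc": "LUNG_NSCLC",
}


def from_site_a(row: dict) -> StandardRecord:
    """Site A (hypothetical) stores US-style dates and tumor size in centimeters."""
    return StandardRecord(
        diagnosis_code=DIAGNOSIS_MAP[row["dx"].strip().lower()],
        diagnosis_date=datetime.strptime(row["dx_date"], "%m/%d/%Y").date(),
        tumor_size_mm=float(row["size_cm"]) * 10.0,
    )


def from_site_b(row: dict) -> StandardRecord:
    """Site B (hypothetical) stores ISO dates and tumor size in millimeters."""
    return StandardRecord(
        diagnosis_code=DIAGNOSIS_MAP[row["diagnosis"].strip().lower()],
        diagnosis_date=date.fromisoformat(row["date"]),
        tumor_size_mm=float(row["tumor_mm"]),
    )


a = from_site_a({"dx": "Breast CA", "dx_date": "03/14/2021", "size_cm": "2.5"})
b = from_site_b({"diagnosis": "Breast carcinoma", "date": "2021-03-14", "tumor_mm": "25"})
print(a == b)  # the two differently formatted rows become identical standard records
```

Once every site emits the same record type, a model sent to any site can read the local data without site-specific handling.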

What is an edge node?

Each participating cancer center hosts an edge node, a device that serves as a secure gateway between the cancer center and the rest of the alliance. Patient data remains safely behind a firewall and never leaves the cancer center. Each edge node connects to a central orchestration layer, which sends the AI model and instructions to each edge node. Each edge node trains the AI model within its own secure environment using its local, secure data. The edge node then sends a summary of its learnings (the updated model), which contains no private patient information, back to the central orchestration layer, where it is aggregated with the other nodes' updates to strengthen the model.
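The boundary an edge node enforces can be illustrated with a short Python sketch. The EdgeNode class, the ModelUpdate payload, and the toy linear model are hypothetical names and simplifications chosen for this example; they are not taken from CAIA's platform. The point is only the shape of the interface: records go in once and stay inside, and the only thing that comes out is updated weights plus a record count.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass(frozen=True)
class ModelUpdate:
    """Hypothetical payload an edge node returns to the orchestration layer."""
    weights: List[float]  # updated model parameters
    n_samples: int        # number of local records that contributed


class EdgeNode:
    """Hypothetical edge node: holds local records and exposes only train()."""

    def __init__(self, name: str, records: List[Tuple[float, float]]):
        self.name = name
        self._records = list(records)  # kept inside the node, never returned

    def train(self, weights: List[float], lr: float = 0.1) -> ModelUpdate:
        """Run one gradient step of y = w*x + b on local data and return the update."""
        w, b = weights
        n = len(self._records)
        grad_w = sum((w * x + b - y) * x for x, y in self._records) / n
        grad_b = sum((w * x + b - y) for x, y in self._records) / n
        return ModelUpdate(weights=[w - lr * grad_w, b - lr * grad_b], n_samples=n)


node = EdgeNode("center-a", [(1.0, 2.0), (2.0, 4.0)])
update = node.train([0.0, 0.0])
print(sorted(asdict(update)))  # ['n_samples', 'weights']
```

The orchestration layer in this sketch would receive one such payload per edge node and combine them, for example by weighting each set of weights by n_samples.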

What are the key benefits of using federated learning?

Federated learning enables a strategic shift toward leveraging collective strength rather than working in isolation, accelerating the pace of breakthrough discoveries by up to tenfold.

The resulting AI models are more powerful and equitable because they learn from a diverse and representative sample of patients across the country. Federated learning also accelerates research for rare cancers by combining insights from small patient populations across multiple centers to uncover new patterns and potential therapies.

Adapting federated learning for multi-institutional cancer research

While federated learning has been gaining steam for nearly 10 years, adapting the technology for multi-institution use in cancer research has proved elusive due to significant technological, regulatory, patient privacy, and data harmonization challenges, as well as the coordination effort necessary to bring together organizations of this scale and complexity.

CAIA has overcome these barriers by developing federated access models, governance structures, and streamlined regulatory pathways that accelerate multi-institutional AI research.