AIMI Dataset Index

A community-driven resource of health AI datasets for machine learning in healthcare

Fields listed for each dataset: Description, Data Source, No. Sources, Longitudinal, Accessibility, Data Types, Funding Source, Documentation
1000 Genomes

The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The 1000 Genomes Project created a catalogue of common human genetic variation, using openly consented samples from people who declared themselves to be healthy.

Data Source: Collaborative (consortium)
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: Genomic
Funding Source: NIH/NHGRI
Documentation Exists: Yes

View Documentation »
All of Us Research Program (AoURP)

The All of Us Research Program seeks to engage at least one million diverse participants to advance precision medicine and improve human health. Participant data are publicly available, and data types include surveys, physical measurements, and electronic health record data, with validation studies to support researcher use of this novel platform.

Data Source: Collaborative (data linkage and harmonization)
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Apply for access
Data Types: EHR, Genomic, Medical Imaging, sensors, labs
Funding Source: NIH
Documentation Exists: Yes

View Documentation »
American Heart Association (AHA) - Precision Medicine Platform

The American Heart Association®/American Stroke Association® (AHA/ASA) collects millions of patient records through its Quality Programs, creating vast national-level databases for advancing scientific research. Data are collected at the patient level in hospitals participating in AHA/ASA Quality Programs. Patients entered in the database are from U.S. hospitals only. Data are de-identified at the patient and hospital level and reported in aggregate.

Data Source: Varies per dataset
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Apply for access
Data Types: EHR
Funding Source: AHA
Documentation Exists: Yes

View Documentation »
Breast Cancer MRI

The Breast Cancer MRI dataset is a single-institution, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients.

Data Source: Duke Health
No. Sources: Single
Longitudinal: Yes
Accessibility: Public/open
Data Types: EHR, Genomic, Medical Imaging, radiology report, pathology report
Funding Source: NIH
Documentation Exists: Yes

View Documentation »
ClinGen/ClinVar

ClinVar and ClinGen, two NIH-based efforts, have formed a critical partnership to improve our knowledge of clinically relevant genomic variation. This partnership includes significant efforts in data sharing, data archiving, and collaborative curation to characterize and disseminate the clinical relevance of genomic variation.

Data Source: Collaborative (consortium)
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: Genomic
Funding Source: NIH/NLM
Documentation Exists: Yes

View Documentation »
EMBED

The EMory BrEast imaging Dataset (EMBED) is a racially diverse mammography dataset containing 3.4M screening and diagnostic images from 110,000 patients collected from 2013 to 2020, with equal representation of Black and White women. The dataset comprises 2D, synthetic 2D (C-view), and 3D (digital breast tomosynthesis, DBT) images. It contains 60,000 annotated lesions linked to structured imaging descriptors and ground-truth pathologic outcomes grouped into six severity classes. This release represents 20% of the total 2D and C-view dataset and is available for research use.

Data Source: Emory University
No. Sources: Single
Longitudinal: No
Accessibility: Apply for access
Data Types: Medical Imaging
Funding Source: NIH/NCATS
Documentation Exists: Yes

View Documentation »
ENCODE

The Encyclopedia of DNA Elements (ENCODE) project is a publicly accessible database that aims to delineate all functional elements encoded in the human genome. A functional element is defined as a discrete genome segment that encodes a defined product (e.g., protein or non-coding RNA) or displays a reproducible biochemical signature (e.g., protein binding, or a specific chromatin structure).

Data Source: Collaborative (consortium)
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: Genomic
Funding Source: NIH/NHGRI
Documentation Exists: Yes

View Documentation »
fastMRI; fastMRI+

The deidentified imaging dataset provided by NYU Langone comprises raw k-space data in several sub-dataset groups. NYU Langone is partnering with Facebook AI Research (FAIR) on fastMRI, a collaborative research project investigating the use of AI to make MRI scans up to 10x faster.

fastMRI+ is the labeled and annotated version of the fastMRI dataset: https://www.microsoft.com/en-us/research/publication/fastmri-clinical-p

Data Source: NYU Langone Health
No. Sources: Single
Longitudinal: No
Accessibility: Clickthrough license
Data Types: Medical Imaging
Funding Source: NIH
Documentation Exists: Yes

View Documentation »
gnomAD

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

Data Source: Collaborative (consortium)
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: Genomic
Funding Source: NIH/NHGRI
Documentation Exists: Yes

View Documentation »
Google Research

In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines. Search for "health" datasets to easily explore options.

Data Source: Varies per dataset
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: EHR, Genomic, Medical Imaging, insurance, community, survey
Funding Source: Google
Documentation Exists: Yes

View Documentation »
Health Data Research UK (HDR UK)

Health Data Research UK is the only national institute for health data that covers England, Wales, Scotland, and Northern Ireland. It was established to work with a wide range of health data from the NHS, universities, research institutes, charities, and private companies, as well as from wearable and mobile technologies. It unites health data assets across the UK to enable health data research and innovation at scale.

Data Source: Varies per dataset
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Apply for access
Data Types: EHR, Genomic, Medical Imaging
Funding Source: National Health Service (NHS)
Documentation Exists: Yes

View Documentation »
International Skin Imaging Collaboration

The International Skin Imaging Collaboration (ISIC) is an international effort to improve melanoma diagnosis. It contains 76,108 dermoscopic images collected from various sources.

Data Source: Collaborative (data linkage and harmonization)
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: Medical Imaging
Funding Source: IDS, Dermoscopedia, Centaur Labs, SIIM
Documentation Exists: Yes

View Documentation »
Medical Imaging and Data Resource Center (MIDRC)

MIDRC is a multi-institutional collaborative initiative driven by the medical imaging community that was initiated in late summer 2020 to help combat the global COVID-19 health emergency. MIDRC is an AI-ready research dataset (standardized, aggregated, and curated for machine learning research). It is an expanding data commons with more than 150,000 imaging studies for 67,728 patients. MIDRC is funded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), hosted at the University of Chicago, and co-led by the American College of Radiology® (ACR®), the Radiological Society of North America (RSNA), and the American Association of Physicists in Medicine (AAPM).

Data Source: Collaborative (data linkage and harmonization)
No. Sources: Multiple
Longitudinal: No
Accessibility: Clickthrough license
Data Types: Medical Imaging
Funding Source: NIH/NIBIB
Documentation Exists: Yes

View Documentation »
MIMIC-III

MIMIC-III is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.

Data Source: Beth Israel Deaconess Medical Center
No. Sources: Single
Longitudinal: No
Accessibility: Apply for access
Data Types: EHR
Funding Source: NIH
Documentation Exists: Yes

View Documentation »
National COVID Cohort Collaborative (N3C)

The National COVID Cohort Collaborative (N3C) is an open science community focused on analyzing patient-level data from many clinical centers to reveal patterns in COVID-19 patients. N3C compiles and harmonizes longitudinal electronic health record data from 65 sites in the USA, covering over 8 million patients.

Data Source: Collaborative (data linkage and harmonization)
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Apply for access
Data Types: EHR
Funding Source: NIH
Documentation Exists: Yes

View Documentation »
Nightingale Open Science

Nightingale hosts massive new medical imaging datasets, curated around unsolved medical problems for which modern computational methods could be transformative. To do so, Nightingale works with health systems around the world to build datasets with two ingredients: large samples of medical images, linked to ground-truth patient outcomes. Deidentified versions of those datasets are then made available on a secure cloud platform to a diverse, global community of researchers.

Data Source: Varies per dataset
No. Sources: Multiple
Longitudinal: No
Accessibility: Apply for access
Data Types: Medical Imaging
Funding Source: NIH, foundations
Documentation Exists: Yes

View Documentation »
PEDSnet

PEDSnet is a national network that integrates hospitals, healthcare institutions, researchers, clinicians, and the broader pediatric community. As a multi-disciplinary platform, its emphasis is on observational research and clinical trials within various children's hospital systems. Established in 2009, the network has a longitudinal data set that spans a wide range of pediatric diseases and specialties. More than 10,000 children are recorded under each of 675 distinct diagnoses, and a further 2,278 diagnoses each affect at least 1,000 children. The PEDSnet database is rich in electronic health record (EHR) data, including demographic information, diagnoses, lab results, procedures, medications, outpatient ED visits, inpatient visits, and payer plans.

Data Source: Collaborative (data linkage and harmonization)
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Apply for access
Data Types: EHR
Funding Source: PCORI
Documentation Exists: Yes

View Documentation »
RSNA AI Challenge Collection

The Radiological Society of North America (RSNA) has curated an extensive collection of medical imaging datasets. These are designed to address a wide range of diagnostic challenges and cover diverse imaging modalities, including X-ray, CT, MRI, and mammography. Each dataset is released annually as part of RSNA's commitment to fostering AI research and development in the medical imaging domain. Datasets are hosted on Kaggle, and dataset descriptions can be found on the RSNA AI Challenge webpage.

Data Source: Varies per dataset
No. Sources: Multiple
Longitudinal: No
Accessibility: Public/open
Data Types: Medical Imaging
Funding Source: RSNA
Documentation Exists: Yes

View Documentation »
Stanford AIMI

The Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI) manages a public imaging data repository. This collection features curated and annotated clinical imaging data spanning various modalities, such as echocardiograms, brain CT scans, MRI, radiographs, and ultrasounds. The data originates from several institutions, including Stanford Health Care, Stanford Children’s Hospital, the University Healthcare Alliance, and Packard Children's Health Alliance clinics. All datasets in this repository are designated for research use and are made available through the Stanford Medicine Research Data Repository (STARR).

Data Source: Stanford Medical Center
No. Sources: Single
Longitudinal: No
Accessibility: Clickthrough license
Data Types: Medical Imaging
Funding Source: Stanford AIMI Center
Documentation Exists: Yes

View Documentation »
The Cancer Imaging Archive (TCIA)

TCIA provides NCI and the cancer research community with a researcher-focused supply of de-identified and highly curated radiology and histopathology imaging, targeting prioritized research needs and supporting major NIH research programs. Where available, imaging collections also provide or link to data related to the images, such as patient outcomes, treatment details, genomics, pathology, and expert analyses.

Data Source: Varies per dataset
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Public/open, Apply for access
Data Types: EHR, Genomic, Medical Imaging, Pathology
Funding Source: NIH/NCI
Documentation Exists: Yes

View Documentation »
UK Biobank

UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The database is regularly augmented with additional data and is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases.

Data Source: Collaborative (data linkage and harmonization)
No. Sources: Multiple
Longitudinal: Yes
Accessibility: Apply for access
Data Types: EHR, Genomic, Medical Imaging, sensors, labs
Funding Source: National Health Service (NHS)
Documentation Exists: Yes

View Documentation »