


We envision an AI landscape where motivated data users have unfettered access to data and the computational power to advance AI development for health. Driven by evidence-based research 1,2 and support from the Gordon and Betty Moore Foundation, we developed the AIMI Dataset Index, a web-based inventory of AI-ready datasets for machine learning use. We hope that this will lower barriers to accessing high-quality health data for the development of AI algorithms that can advance diagnostic excellence in healthcare.


The growth of digital healthcare data worldwide offers a chance to significantly cut healthcare expenses and improve health outcomes. A 2013 McKinsey report estimated that optimizing health data could reduce US healthcare spending by as much as $450 billion a year.3 However, trends toward proprietary data and exclusivity inhibit the equitable and transparent development of AI algorithms in healthcare.


There is increasing awareness that health AI models are often not useful, reliable, or fair, leading to significant research on algorithmic fairness.4 Available datasets often lack the diversity5,6 and detailed labeling necessary for effective machine learning. As a result, only a fraction of available data meets the standard of being "AI-ready" and contains the clinically relevant annotations needed to support generalizable machine learning research.


We believe responsible reuse of anonymized clinical data for the “public good” is an ethical obligation.7 This vision has galvanized initiatives to share anonymized medical data for the benefit of open science and educational communities. As a center, we recognize the value and impact of publicly sharing well-curated, de-identified clinical datasets to advance AI development in medicine. We hope the AIMI Dataset Index further advances this vision.



  1. Youssef A, Ng MY, Long J, Hernandez-Boussard T, Shah N, Miner AS, Larson DB, Langlotz CP. Organizational factors influencing health data sharing for AI: A cross-sector qualitative study of organizational leaders. [Under review]
  2. Ng MY, Youssef A, Miner AS, Sarellano D, Long J, Larson DB, Hernandez-Boussard T, Langlotz CP. Framework for measuring the AI-readiness of health datasets based on a qualitative study of creators and researchers. [Under review]
  3. Kayyali B, Knott D, Kuiken SV. The Big-Data Revolution in US health care: Accelerating value and innovation [Internet]. McKinsey & Company; 2013 [cited 2023 Nov 7]. Available from: 
  4. Editors, Rubin E. Striving for diversity in research studies. N Engl J Med. 2021;385(15):1429–1430. doi:10.1056/NEJMe2114651
  5. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical machine learning in healthcare. Annu Rev Biomed Data Sci. 2021;4:123–144.
  6. Kaushal A, Altman R, Langlotz C. Geographic distribution of US cohorts used to train deep learning algorithms. JAMA. 2020;324(12):1212–1213. doi:10.1001/jama.2020.12067
  7. Larson DB, Magnus DC, Lungren MP, Shah NH, Langlotz CP. Ethics of using and sharing clinical imaging data for artificial intelligence: a proposed framework. Radiology. 2020;295:675–682.