Crowdsourcing of artificial intelligence algorithms for diagnosis and Gleason grading of prostate cancer in biopsies

  • Kartasalo K.,
  • Bulten W.,
  • Chen P-H.C.,
  • Ström P.,
  • Pinckaers H.,
  • Nagpal K.,
  • Ruusuvuori P.,
  • Litjens G.,
  • Eklund M.,
  • PANDA Challenge Consortium

Introduction & Objectives

Gleason grading of biopsies is crucial for prostate cancer treatment decisions, but considerable inter- and intraobserver variability can lead to under- and overtreatment. Moreover, there is a global shortage of pathologists. Artificial intelligence (AI) could mitigate these challenges through partial automation and decision support. While AI has shown promise for diagnosing and grading prostate cancer, the results have typically not been validated across international cohorts, and robust performance on data from different sites remains a challenge. Competitions have proven to be an efficient way of crowdsourcing medical innovation. To accelerate the development of AI algorithms for Gleason grading, we organized the PANDA Challenge, the largest competition in histopathology to date.

Materials & Methods

We collected the largest public dataset of digitally scanned prostate biopsies to date, consisting of 10,616 specimens from 2,113 patients from Karolinska Institutet and Radboud University Medical Center. The data were provided to algorithm developers through the competition on the Kaggle data science platform (Apr 21 – Jul 23, 2020). Participants could submit algorithms online and receive performance estimates on a set of 393 biopsies to which they did not have direct access. Performance was evaluated in terms of concordance (quadratically weighted kappa, QWK) with grading by panels of experienced uropathologists. Finally, the contributed solutions were evaluated on an internal test set of 545 biopsies. Top-ranking algorithms were further assessed on European (330 biopsies) and US (741 biopsies) external validation sets.
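For illustration, the quadratically weighted kappa used as the evaluation metric can be computed with scikit-learn's cohen_kappa_score. The sketch below is not the consortium's actual evaluation code; the grade values and variable names are placeholders, assuming predictions and reference grading are expressed as ISUP grade groups (0 = benign, 1-5 = cancer).

  # Minimal sketch of the evaluation metric (illustrative, not the official PANDA code).
  # Quadratically weighted kappa penalises disagreements by the squared distance
  # between the predicted and reference grade groups.
  from sklearn.metrics import cohen_kappa_score

  # Hypothetical reference (uropathologist panel) and algorithm-predicted grade groups.
  reference = [0, 1, 2, 5, 3, 4, 0, 2]
  predicted = [0, 1, 3, 5, 3, 4, 1, 2]

  qwk = cohen_kappa_score(reference, predicted, weights="quadratic")
  print(f"Quadratically weighted kappa: {qwk:.3f}")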

Results

In total, 1,290 developers from 65 countries contributed 1,010 algorithms. The top-performing algorithms selected for detailed analysis achieved a mean QWK of 0.931 (95% CI: 0.918-0.944) on the internal test set. On the US and European external validation sets, the algorithms achieved QWKs of 0.862 (95% CI: 0.840-0.884) and 0.868 (95% CI: 0.835-0.900), respectively.

Conclusions

We showed that AI algorithms developed as a community effort in a global competition reached pathologist-level performance in Gleason grading of prostate biopsies. The algorithms generalized to intercontinental external cohorts representing different patient populations, laboratories, and reference standards, warranting evaluation of AI-based Gleason grading in prospective clinical trials. Taken together, this study serves as an example of how important medical problems can be solved through the combination of AI, innovative study designs, and rigorous validation across diverse cohorts.

PANDA Challenge Consortium: Y Cai, DF Steiner, H van Boven, R Vink, C Hulsbergen-van de Kaa, J van der Laak, MB Amin, AJ Evans, T van der Kwast, R Allan, PA Humphrey, H Grönberg, H Samaratunga, B Delahunt, T Tsuzuki, T Häkkinen, L Egevad, M Demkin, S Dane, F Tan, M Valkonen, GS Corrado, L Peng, CH Mermel, PANDA Challenge participant teams