Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study

  • Anindo Saha,
  • Joeran S Bosma,
  • Jasper J Twilt,
  • Bram van Ginneken,
  • Anders Bjartell,
  • Anwar R Padhani,
  • David Bonekamp,
  • Geert Villeirs,
  • Georg Salomon,
  • Gianluca Giannarini,
  • Jayashree Kalpathy-Cramer,
  • Jelle Barentsz,
  • Klaus H Maier-Hein,
  • Mirabela Rusu,
  • Olivier Rouvière,
  • Roderick van den Bergh,
  • Valeria Panebianco,
  • Jurgen J Fütterer,
  • Maarten de Rooij,
  • Henkjan Huisman,
  • on behalf of the PI-CAI consortium

Publication: The Lancet Oncology, July 2024

Background

Artificial intelligence (AI) systems can potentially aid the diagnostic pathway of prostate cancer by alleviating the increasing workload, preventing overdiagnosis, and reducing the dependence on experienced radiologists. We aimed to investigate the performance of AI systems at detecting clinically significant prostate cancer on MRI in comparison with radiologists using the Prostate Imaging—Reporting and Data System version 2.1 (PI-RADS 2.1) and the standard of care in multidisciplinary routine practice at scale.

Methods

In this international, paired, non-inferiority, confirmatory study, we trained and externally validated an AI system (developed within an international consortium) for detecting Gleason grade group 2 or greater cancers using a retrospective cohort of 10 207 MRI examinations from 9129 patients. Of these examinations, 9207 cases from three centres (11 sites) based in the Netherlands were used for training and tuning, and 1000 cases from four centres (12 sites) based in the Netherlands and Norway were used for testing. In parallel, we facilitated a multireader, multicase observer study with 62 radiologists (45 centres in 20 countries; median 7 [IQR 5–10] years of experience in reading prostate MRI) using PI-RADS (2.1) on 400 paired MRI examinations from the testing cohort. Primary endpoints were the sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC) of the AI system in comparison with that of all readers using PI-RADS (2.1) and in comparison with that of the historical radiology readings made during multidisciplinary routine practice (ie, the standard of care with the aid of patient history and peer consultation). Histopathology and at least 3 years (median 5 [IQR 4–6] years) of follow-up were used to establish the reference standard. The statistical analysis plan was prespecified with a primary hypothesis of non-inferiority (considering a margin of 0·05) and a secondary hypothesis of superiority towards the AI system, if non-inferiority was confirmed. This study was registered at ClinicalTrials.gov, NCT05489341.
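The prespecified decision rule can be summarised numerically: non-inferiority is concluded when the lower bound of the two-sided 95% Wald CI for the performance difference (AI minus comparator) lies above the −0·05 margin. A minimal sketch of that rule (an illustration only, with function names of our own choosing; this is not the study's actual analysis code):

```python
def wald_ci_95(diff: float, se: float) -> tuple[float, float]:
    """Two-sided 95% Wald confidence interval for a difference in a
    performance metric (e.g. AUROC or specificity), given its standard error."""
    z = 1.96  # standard normal quantile for 95% coverage
    return diff - z * se, diff + z * se

def non_inferior(ci_lower: float, margin: float = -0.05) -> bool:
    """Non-inferiority holds when the CI's lower bound exceeds the
    prespecified margin (0.05, written as -0.05 for AI minus comparator)."""
    return ci_lower > margin
```

For example, a lower CI bound of −0·04 for a specificity difference clears a −0·05 margin, whereas a bound of −0·06 would not.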

Findings

Of the 10 207 examinations included from Jan 1, 2012, through Dec 31, 2021, 2440 cases had histologically confirmed Gleason grade group 2 or greater prostate cancer. In the subset of 400 testing cases in which the AI system was compared with the radiologists participating in the reader study, the AI system showed a statistically superior and non-inferior AUROC of 0·91 (95% CI 0·87–0·94; p<0·0001), in comparison with the pool of 62 radiologists with an AUROC of 0·86 (0·83–0·89), with a lower boundary of the two-sided 95% Wald CI for the difference in AUROC of 0·02. At the mean PI-RADS 3 or greater operating point of all readers, the AI system detected 6·8% more cases with Gleason grade group 2 or greater cancers at the same specificity (57·7%, 95% CI 51·6–63·3), or 50·4% fewer false-positive results and 20·0% fewer cases with Gleason grade group 1 cancers at the same sensitivity (89·4%, 95% CI 85·3–92·9). In all 1000 testing cases in which the AI system was compared with the radiology readings made during multidisciplinary practice, non-inferiority was confirmed, even though the AI system showed marginally lower specificity (68·9% [95% CI 65·3–72·4] vs 69·0% [65·5–72·5]) at the same sensitivity (96·1%, 94·0–98·2) as the PI-RADS 3 or greater operating point: the lower boundary of the two-sided 95% Wald CI for the difference in specificity (−0·04) was greater than the non-inferiority margin (−0·05), and a p value below the significance threshold was reached (p<0·001).

Interpretation

An AI system was superior to radiologists using PI-RADS (2.1), on average, at detecting clinically significant prostate cancer, and comparable to the standard of care. Such a system shows the potential to be a supportive tool within a primary diagnostic setting, with several associated benefits for patients and radiologists. Prospective validation is needed to test the clinical applicability of this system.

Funding

Health~Holland and EU Horizon 2020.

Commentary by Assoc. Prof. Pawel Rajwa

Artificial intelligence (AI) is rapidly expanding into many industries and fields, including medical research. In prostate cancer diagnostics, significant advances have been made, particularly with the adoption of prostate magnetic resonance imaging (MRI). However, while standardised prostate MRI represents progress, it still faces challenges: variability in interpretation, suboptimal image quality, the expertise required for consistent reporting, and radiologists' growing workload, which limits broad MRI accessibility in many parts of the world.


The PI-CAI study, titled "Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI)," published in The Lancet Oncology, explores the use of AI for detecting clinically significant prostate cancer (Gleason grade group ≥2) on MRI scans. PI-CAI is the first international diagnostic accuracy study to compare an AI system's performance with that of radiologists using PI-RADS 2.1 and with radiology readings made during multidisciplinary practice. This paired, non-inferiority study involved over 10,000 MRI examinations (9,207 for training and tuning, 1,000 for testing) performed between 2012 and 2021. The AI system's performance was compared, on 400 of the testing cases, with that of 62 radiologists from 20 countries with a median of 7 years of experience in interpreting prostate MRI. Histopathology and long-term patient follow-up served as the reference standard. The primary endpoints were sensitivity, specificity, and AUROC, and the statistical analysis plan was prespecified with a primary hypothesis of non-inferiority.


The findings demonstrated that the AI system was not only non-inferior to the radiologists in detecting significant prostate cancers, but also statistically superior in some key areas. The AI system achieved an area under the receiver operating characteristic curve (AUROC) of 0.91, compared with 0.86 for the radiologists (p<0.0001). At the readers' mean PI-RADS ≥3 operating point, the AI system detected 6.8% more cases of significant cancer at the same specificity as the radiologists; alternatively, at matched sensitivity, it produced 50.4% fewer false positives and detected 20% fewer Gleason grade group 1 cancers. When compared with radiologists' readings made during routine multidisciplinary practice in the 1,000 testing cases, the AI system's specificity was marginally lower (68.9% vs. 69.0%) at matched sensitivity (96.1%); in this comparison, the AI system had an AUROC of 0.93.


The PI-CAI study provides level 2b evidence that an AI system may reduce the number of unnecessary biopsies and overdiagnosis. Furthermore, at a Gleason grade group ≥2 prevalence of 33%, the AI system's negative predictive value of 93.8% supports further exploration of AI in prostate cancer screening. Given that the AI system was tested against experts from well-developed healthcare systems, the question remains what its impact would be in "real-world" settings or in regions with less experienced radiologists. These findings open up exciting possibilities for easing some of the current challenges in radiology and healthcare, ultimately improving prostate cancer diagnostics. Further prospective studies are needed to confirm whether AI can be safely and effectively integrated into routine medical practice. Nevertheless, the PI-CAI study marks a promising advance in the application of AI to prostate imaging.
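The negative predictive value quoted above depends on disease prevalence as well as on sensitivity and specificity, which is why an NPV should always be read alongside the cohort's prevalence. A generic sketch of that relationship (our own illustration, not the study's calculation; the reported 93.8% reflects the study's specific operating point and cohort):

```python
def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value: the probability that a negative result
    is truly disease-free, given test characteristics and prevalence."""
    true_negatives = specificity * (1 - prevalence)   # healthy, correctly negative
    false_negatives = (1 - sensitivity) * prevalence  # diseased, missed
    return true_negatives / (true_negatives + false_negatives)
```

For a fixed operating point, NPV rises as prevalence falls, so the same AI threshold would yield a higher NPV in a lower-prevalence screening population than in this biopsy-referred cohort.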