A recent study evaluating a commercial artificial intelligence (AI) algorithm for breast cancer screening has found that AI significantly outperforms human experts in overall detection accuracy, while also identifying areas where further optimisation could enhance its clinical utility, particularly in pinpointing the exact location of cancerous lesions.
The research, which analysed 1,200 mammograms from the UK’s NHS Breast Screening Programme, compared the AI’s diagnostic performance against that of 1,258 trained radiologists participating in the Personal Performance in Mammographic Screening (PERFORMS) programme. Using both breast-level and lesion-level analyses, the researchers aimed to determine whether AI’s strong performance at detecting malignancies translates into accurately pinpointing the lesions themselves.
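To make the distinction concrete, the sketch below illustrates one plausible way the two scoring levels can diverge. It is illustrative only: the article does not specify the study’s actual matching rules, so the scored bounding boxes, the intersection-over-union (IoU) criterion, and the 0.3 threshold are all assumptions for the example.

```python
# Illustrative sketch only. Assumes the AI outputs scored boxes and that a
# lesion counts as "found" when a box overlaps it (IoU criterion assumed here,
# not taken from the study).

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def breast_level_score(predictions):
    """Breast level: the single most suspicious finding drives the call,
    regardless of where in the breast it sits."""
    return max((p["score"] for p in predictions), default=0.0)

def lesion_level_score(predictions, lesion_box, iou_threshold=0.3):
    """Lesion level: only findings that actually overlap the annotated
    lesion count towards detection."""
    hits = [p["score"] for p in predictions
            if iou(p["box"], lesion_box) >= iou_threshold]
    return max(hits, default=0.0)

# A confident flag on the wrong region scores at the breast level
# but is penalised at the lesion level.
preds = [{"box": (10, 10, 40, 40), "score": 0.91},      # far from the lesion
         {"box": (200, 150, 240, 190), "score": 0.35}]  # overlaps the lesion
lesion = (195, 148, 245, 195)
print(breast_level_score(preds))          # 0.91 -> counted as a detection
print(lesion_level_score(preds, lesion))  # 0.35 -> localisation penalised
```

The same case can therefore look like a success at the breast level and a near miss at the lesion level, which is exactly the gap the study set out to measure.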
Key Findings
The study revealed that AI’s area under the curve (AUC), a standard measure of diagnostic accuracy, was 0.942 at the breast level and slightly lower, at 0.929, at the lesion level. Although this drop was statistically significant (p < 0.01), AI remained ahead of human readers, whose AUCs averaged 0.878 at the breast level and 0.851 at the lesion level.
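For readers unfamiliar with the metric, AUC summarises how well a set of suspicion scores ranks cancers above normals across all possible thresholds. The toy example below, which does not use the study’s data, shows how mislocalised findings can drag the lesion-level AUC below the breast-level one even when the same cases are flagged; the labels and scores are invented for illustration.

```python
# Hypothetical toy data, not the study's: 1 = cancer present, 0 = normal.
from sklearn.metrics import roc_auc_score

labels = [1, 1, 1, 0, 0, 0, 0, 0]

# Breast-level scores: maximum suspicion anywhere in the breast.
breast_scores = [0.91, 0.80, 0.60, 0.55, 0.30, 0.20, 0.15, 0.10]

# Lesion-level scores: credit only where the flagged region overlaps the
# true lesion, so the mislocalised first cancer falls in the ranking.
lesion_scores = [0.35, 0.80, 0.60, 0.55, 0.30, 0.20, 0.15, 0.10]

print(roc_auc_score(labels, breast_scores))  # 1.0  -> perfect ranking
print(roc_auc_score(labels, lesion_scores))  # ~0.93 -> same cases, worse ranking
```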
“While AI clearly shows promise in screening accuracy, its ability to correctly localise suspicious areas is not as strong,” researchers reported. “This reduced lesion-level performance could have important implications for clinical workflows and patient outcomes.”
Clinical Implications
This difference in performance is more than academic. If AI systems frequently highlight incorrect regions of interest, they risk steering radiologists toward unnecessary biopsies or reinforcing wrong decisions through automation bias, the tendency to over-trust automated prompts.
“Understanding where AI thinks the cancer is matters just as much as knowing that it thinks there is cancer,” the authors emphasised. “Lesion-level assessment is critical if we are to safely integrate AI into clinical practice and support, rather than undermine, human decision-making.”
Moving Forward
The study’s authors call for greater scrutiny of AI tools at the lesion level and suggest that external quality assurance programmes consider this more localised analysis when evaluating AI systems. Better alignment between AI and radiologists in lesion detection could enhance trust, reduce false positives, and ultimately improve screening outcomes.
As AI continues to gain traction in medical diagnostics, this study underscores the need for nuanced evaluations that go beyond top-line performance metrics, reminding us that accuracy isn’t only a matter of whether a case is flagged, but also of where and why.