Methodological Challenges in Deep Learning-Based Detection of Intracranial Aneurysms: A Scoping Review
Abstract
Artificial intelligence (AI), particularly deep learning, has demonstrated high diagnostic performance in detecting intracranial aneurysms on computed tomography angiography (CTA) and magnetic resonance angiography (MRA). However, the clinical translation of these technologies remains limited due to methodological limitations and concerns about generalizability. This scoping review comprehensively evaluates 36 studies that applied deep learning to intracranial aneurysm detection on CTA or MRA, focusing on study design, validation strategies, reporting practices, and reference standards. Key findings include inconsistent handling of ruptured and previously treated aneurysms, underreporting of coexisting brain or vascular abnormalities, limited use of external validation, and an almost complete absence of prospective study designs. Only a minority of studies employed diagnostic cohorts that reflect real-world aneurysm prevalence, and few reported all essential performance metrics, such as patient-wise and lesion-wise sensitivity, specificity, and false positives per case. These limitations suggest that current studies remain at the stage of technical validation, with high risks of bias and limited clinical applicability. To facilitate real-world implementation, future research must adopt more rigorous designs, representative and diverse validation cohorts, standardized reporting practices, and greater attention to human-AI interaction.
INTRODUCTION
Intracranial aneurysms are focal dilations of cerebral arteries that affect approximately 3 to 7 percent of the general population [1,2]. Although the annual risk of rupture for an unruptured intracranial aneurysm is relatively low, averaging around 1 percent, rupture results in subarachnoid hemorrhage (SAH), a life-threatening hemorrhagic stroke with high morbidity and mortality rates [3]. Given these serious outcomes, early detection of intracranial aneurysms is crucial for guiding timely and appropriate intervention.
With the substantial advancement of artificial intelligence (AI) in recent years, particularly in deep learning, a growing body of research has demonstrated its potential applicability in the medical field. To date, the U.S. Food and Drug Administration has approved over 1,000 AI- and machine learning-enabled medical devices [4]. However, compelling evidence of the clinical benefits or widespread adoption of AI in routine clinical practice, beyond controlled research settings, remains limited [4]. This phenomenon has been described as the “AI chasm,” a term introduced by Keane and Topol [5] to highlight the disconnect between the strong performance of AI algorithms in research environments and their limited impact in real-world clinical applications. Several factors contribute to this gap, including limited generalizability, challenges associated with clinical integration, the inherent lack of explainability of AI algorithms, insufficient user knowledge, ethical/legal considerations, and cultural resistance to adopting new technologies [6-8].
In the application of AI for the detection of intracranial aneurysms, concerns regarding bias and limited generalizability have been prominently addressed. A meta-analysis by Din et al. [9], which assessed the diagnostic performance of AI in detecting intracranial aneurysms on magnetic resonance angiography (MRA), computed tomography angiography (CTA), or digital subtraction angiography (DSA), reported a pooled sensitivity of 91.2% and an area under the receiver operating characteristic curve of 0.936. However, the authors noted that most of the included studies exhibited a high risk of bias and poor generalizability, as evaluated by the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [10]. They further emphasized that this lack of generalizability remains a major obstacle to the widespread clinical adoption of AI algorithms for intracranial aneurysm detection. Similarly, another systematic review and meta-analysis published in 2023, which evaluated the diagnostic accuracy of deep learning-based algorithms for detecting intracranial aneurysms on CTA, reported a high pooled sensitivity of 0.87 for aneurysms larger than 3 mm. Nonetheless, this review also identified substantial concerns regarding risk of bias and methodological limitations across the included studies [11]. Consequently, users of these AI-based diagnostic tools—particularly clinicians—are increasingly interested not only in their diagnostic accuracy, but also in identifying the patient populations most likely to benefit from their use, the specific clinical settings in which the algorithms have been validated, the inherent limitations of each model, and, ultimately, their clinical utility. Moreover, there is growing interest in understanding how far these AI algorithms have progressed toward integration into routine clinical practice.
Although previous review articles have offered valuable insights into the methodology of existing studies through systematic evaluation using evaluation tools such as QUADAS-2, the relatively broad criteria of these instruments—particularly in the patient selection domain—may hinder readers from fully appreciating the nuances of study design limitations and from identifying the sources of bias and limited generalizability in research on AI-based detection of intracranial aneurysms. For example, in the systematic review by Din et al. [9], 95% (41 out of 43) of studies were rated as having high or unclear concerns regarding applicability in the patient selection domain. However, the specific aspects of study design that contributed to these ratings remain difficult to discern. Moreover, general evaluation tools may not adequately capture the distinct clinical considerations specifically relevant to intracranial aneurysm detection.
In this context, the aim of this scoping review is to provide a comprehensive overview of study methodologies in published research on deep learning-based detection of intracranial aneurysms on CTA or MRA, with a particular focus on study populations, validation strategies, reporting practices, and reference standards, rather than on their diagnostic performance. By identifying and synthesizing the methodological limitations of existing studies, this review may help shape future approaches to algorithm development and contribute to ongoing efforts to establish their clinical utility.
METHODS
A systematic literature search was conducted in January 2025 using the Embase, MEDLINE, and Web of Science databases. The search used the following query: (‘artificial intelligence’ OR ‘deep learning’ OR ‘computer assisted’ OR ‘computer aided’ OR ‘AI’ OR ‘automated detection’ OR ‘automatic detection’) AND (‘intracranial aneurysm’ OR ‘intracranial aneurysms’ OR ‘cerebral aneurysm’ OR ‘cerebral aneurysms’). Search results were screened by evaluating their titles and abstracts. The exclusion criteria were as follows: articles that were not original research; studies focused on rupture risk prediction, treatment outcome prediction, aneurysm segmentation, technological development or image quality, hemodynamics, or unrelated subjects (e.g., genomics, treatment simulation); studies comparing diagnostic performance with and without AI assistance; studies primarily based on DSA; studies on computer-aided diagnosis using conventional machine learning published before 2018; studies with inadequate reporting that compromised credibility; and publications not in English (Fig. 1).

Fig. 1. Flowchart of articles screened and included in this review. AI, artificial intelligence; CAD, computer-aided diagnosis.
For the 36 studies ultimately selected, 14 key questions pertaining to study population selection, validation of diagnostic performance, reporting methodology, and the reference standard used were systematically investigated (Table 1).
RESULTS
The results of the methodological evaluation of the included studies, based on the 14 key questions, are summarized in Table 2 [12-47]. The scoring system was designed such that a higher total score indicates a lower risk of bias and concerns regarding generalizability. A detailed description of the scoring criteria is provided in Supplementary Material 1. Fig. 2 illustrates an upward trend in the mean scores of the included studies over their publication years. Items that could not be quantitatively assessed were excluded from the total score calculation.
Study Population Selection
Q1. Exclusion criteria based on aneurysm size
Among the 36 studies reviewed, 5 studies excluded intracranial aneurysms based on size criteria. Nakao et al. [12] excluded aneurysms smaller than 2 mm in diameter, while Stember et al. [13] and Heit et al. [28] excluded those measuring less than 3 mm. In contrast, 2 studies excluded large aneurysms: Joo et al. [21] excluded those exceeding 25 mm in diameter, and Terasaki et al. [24] excluded aneurysms larger than 15 mm. The remaining studies did not specify any exclusion criteria related to aneurysm size.
Q2. Exclusion criteria based on aneurysm location
No study explicitly listed specific aneurysm location as an exclusion criterion. However, in 2 studies, aneurysms located in the posterior circulation were not included in the test dataset [18,33].
Q3. Rupture status of aneurysms in study populations
The inclusion or exclusion of ruptured aneurysms in the testing set was evaluated; the training set was not considered in this assessment. Among the 36 studies, 12 included ruptured aneurysms in their study populations, and of these, 3 specifically focused on detecting aneurysms in patients with aneurysmal SAH [20,22,24]. Of the remaining 24 studies, 3 clearly stated in the title or study aim that only unruptured aneurysms were investigated [21,28,33], 11 indicated in the inclusion/exclusion criteria or discussion that ruptured aneurysms were not included, and 10 did not clearly state whether ruptured aneurysms were included in the study population.
Q4. Exclusion of previously treated aneurysms
This assessment examined whether aneurysms previously treated with coil embolization or surgical clipping were excluded from the testing of the AI algorithms. In studies that included external validation, the evaluation was based on the population used in the external validation set. Twenty-one studies excluded previously treated aneurysms from their study populations. In 2 studies, patients with treated aneurysms were included, but the treated aneurysms themselves were not used in the testing process [14,19]. Only 2 studies explicitly stated that treated aneurysms were included in the study population for evaluating the diagnostic performance of their algorithms [17,39]. The remaining 11 studies did not specify whether treated aneurysms were included in the study population.
Q5. Consideration of coexisting brain parenchymal or vascular abnormalities
This assessment evaluated whether the studies addressed the presence of coexisting brain parenchymal or intracranial vascular abnormalities in their study populations. Sixteen studies explicitly reported the exclusion of specific abnormalities. The most frequently cited were vascular malformations such as arteriovenous malformations or fistulas (n=9), moyamoya disease (n=7), arterial occlusion or stenosis (n=5), and tumors (n=4). These exclusion criteria were not mutually exclusive; individual studies often excluded more than 1 type of abnormality. The remaining 20 studies did not clearly address the presence of such coexisting abnormalities.
Validation of Diagnostic Performance
Q6. Prospective or retrospective approach
Only 1 study utilized a prospective design: although part of its external validation was conducted retrospectively, the model was prospectively applied and evaluated in 1 external dataset comprising 1,562 real-world clinical CTA cases. All other studies were conducted retrospectively.
Q7. Implementation of external validation
In this assessment, an external validation dataset was defined as data obtained from institutions geographically distinct from those used for model training. Consequently, temporally independent data from the same institution—referred to as temporal validation—were not considered external validation in this process [48]. Studies that evaluated the diagnostic performance of a preexisting AI model without further model development were also regarded as having conducted external validation. However, even in such cases, if the test dataset was clearly derived from the same institution as the training dataset, it was not classified as external validation. Furthermore, even in studies described as ‘multicenter studies,’ if the data used for validation did not originate from institutions entirely separate from those providing the training data, the validation was not deemed external.
According to the defined criteria, 20 studies performed external validation to evaluate the diagnostic performance of the AI model. Among these, 11 assessed the performance of preexisting AI models or commercially available software [25,26,28,30,33,36,38,40,43,46,47]. In contrast, 16 studies did not conduct external validation. Of these, 2 studies utilized a preexisting AI model but appeared to have tested it on data acquired from the same institution as the training set [22,44]. One study employed 5-fold cross-validation without incorporating an independent test set [15].
Q8. Size of the test or external validation set
The size of the independent test set, or the external validation set if applicable, was evaluated and categorized based on whether it included more than 100 cases. Twenty-six studies included more than 100 cases in the test set or, when external validation was performed, in both the test set and the external validation set. Among the studies that conducted external validation, 2 included more than 100 cases in only 1 of the 2 datasets. The study by Ham et al. [29] included 15 cases in the internal test set but used 113 cases for external validation, obtained from the open-source Aneurysm Detection and segMentation (ADAM) challenge database. In contrast, the study by Ueda et al. [14] included 521 cases in the internal test set but only 67 cases in the external validation set. In 7 studies, the number of cases in the test set or in both the test set and the external validation set was fewer than 100, with 1 study including only 10 cases in the test set [31]. As noted above, 1 study did not employ an independent test set or external validation set; therefore, the current assessment was not applicable [15].
Q9. Use of multiple vendors or scanners in external validation sets
This analysis was restricted to studies that conducted external validation. Although external validation enhances the assessment of a model’s generalizability, its value may be diminished if the scanners used are the same as those in the training set. Therefore, this assessment evaluated whether the scanners used in the external validation set were different from those used during training, or whether at least 3 different scanners were employed. This criterion was used to assess whether the AI model could reasonably claim generalizability.
Among the 20 studies that conducted external validation, 14 employed sufficiently different scanners in their external validation process, while this information was unclear in 6 studies. Of these 6, 5 utilized a preexisting AI model for which the scanners used during model development were not specified. In the remaining study, the external validation dataset was derived from the ADAM challenge database, and details regarding the scanners used were incomplete [29].
Q10. Representation of real-world aneurysm prevalence in external validation cohorts
Evaluating diagnostic performance using a diagnostic cohort that reflects the true prevalence of intracranial aneurysms, rather than relying on datasets with a predetermined number of positive and negative cases, is generally considered a more reliable approach and may provide a closer approximation of real-world performance [49,50].
This assessment also focused exclusively on studies that conducted external validation. A dataset was considered a diagnostic cohort only when the study population was recruited consecutively or randomly, without prior knowledge of aneurysm prevalence. Based on this definition, 7 studies were identified as using a diagnostic cohort in their external validation. Twelve studies did not meet this criterion. In 1 study, the recruitment method for the dataset was not clearly described; however, it was likely non-consecutive or non-random, as 179 out of 212 patients were diagnosed with aneurysms [25].
Reporting Methodology
Questions 11 to 13 were applied only to the performance of the AI model when used in a standalone setting. If a study employed a paired design to compare user performance between AI-assisted and unassisted interpretations, the metrics from that comparison were not included in this assessment.
Q11. Sensitivity metrics: patient-wise and lesion-wise
Of the 36 included studies, 11 reported both patient-wise and lesion-wise sensitivity, while the remaining 25 studies reported only 1 of the 2.
Q12. Patient-wise specificity
Fifteen studies reported patient-wise specificity. In contrast, 21 studies did not report this metric, primarily because they included only aneurysm-positive examinations, making the calculation of specificity impossible. Of these, 3 studies reported specificity based on vessel-level analysis rather than patient-wise analysis [13,28,33].
Q13. Reporting of false positives per case
This assessment evaluated whether the authors reported the number of false positives per case or presented a free-response receiver operating characteristic curve. Twenty-four studies explicitly reported this information. In 7 studies, the metric was not directly stated but could be inferred from the reported number of false positive detections and the total number of cases. In the remaining 5 studies, the information was not available.
Reference Standards
Q14. The reference standard used
The reference standard used across the included studies was categorized as either DSA or radiologist consensus. If a study primarily relied on radiologist consensus but incorporated DSA findings when available, it was classified as using consensus.
Six studies used DSA as the reference standard, including only cases with available DSA results. In 5 studies, both DSA and radiologist consensus were used based on the dataset. The remaining 25 studies primarily relied on radiologist consensus as their reference standard.
DISCUSSION
This scoping review provides a comprehensive overview of study methodologies in published research on deep learning–based detection of intracranial aneurysms using CTA or MRA. Rather than focusing on their diagnostic performance, particular attention was given to study population selection, validation strategies, reporting practices, and reference standards. This approach aims to clarify the current state of the field and highlight methodological factors that may impede the demonstration of clinical utility.
Easing exclusion criteria to reduce selection bias and better reflect real-world clinical variability may pose substantial practical challenges, including the need for larger and more heterogeneous datasets, increased complexity in model development, and potential declines in model performance. These difficulties may have led many studies to favor more restricted datasets, even at the expense of clinical generalizability. Although explicit exclusion of aneurysms based on size or location was relatively uncommon (Q1 and Q2) among the included studies, more than half of the studies either excluded ruptured or previously treated aneurysms, or failed to clearly specify how these cases were handled (Q3 and Q4). Notably, only 2 studies explicitly reported the inclusion of treated aneurysms. Given that real-world clinical scenarios frequently involve ruptured or previously treated aneurysms, such exclusions or ambiguities may limit the generalizability and clinical applicability of these AI models. Additionally, some studies focused specifically on aneurysm detection in the context of aneurysmal SAH, where at least 1 ruptured aneurysm per case can be presumed. Taken together, the inclusion/exclusion criteria across studies vary considerably—with some excluding ruptured aneurysms, others including only ruptured cases, and many not specifying rupture status—highlighting the need for caution when interpreting and comparing findings across the literature.
Another important concern was the limited consideration of concurrent findings (Q5). Sixteen studies explicitly excluded specific coexisting abnormalities, while 20 did not clearly address this issue. It is possible that studies which did not list coexisting findings as exclusion criteria did so intentionally to include all such cases. For example, 1 study did not mention coexisting abnormalities in the exclusion criteria but later reported in the results that some false-positive findings of the model were due to misclassification of arteriovenous malformations or fistulas as aneurysms [25]. Even so, the lack of explicit reporting makes it difficult to assess the robustness of AI models in the presence of other brain or cerebrovascular abnormalities. Providing clearer information about the inclusion of such cases would improve clinical relevance of these studies.
Robust clinical verification of the performance of a diagnostic AI model requires external validation using a clinical cohort that accurately reflects the characteristics of the target patient population [50]. However, only 20 out of 36 studies reported conducting external validation with geographically distinct data (Q7). This number further decreases to 9 when excluding the 11 studies that evaluated preexisting algorithms or commercially available software, indicating that only a minority of studies validated their own AI models using independently recruited external datasets. Even among studies that performed external validation, the generalizability of the models remains a concern due to limitations in dataset size and scanner diversity (Q8 and Q9). Of the 20 studies, only 10 used external datasets comprising more than 100 cases with sufficient variability in scanner types. Taken together, there remains a limited body of well-designed research that can robustly support the generalizability of AI models for intracranial aneurysm detection.
The lack of prospective studies has long been recognized as a major limitation in the development and clinical implementation of AI models for intracranial aneurysm detection [9]. Among the 36 articles included in this review, only 1 study validated its model using a prospective cohort (Q6). Given the methodological challenges of conducting prospective studies, the use of retrospective diagnostic cohorts that approximate real-world clinical settings may serve as a practical alternative for model evaluation, especially when cases are not selected through convenience sampling. Convenience sampling, which selects disease-positive and disease-negative cases separately rather than consecutively, can distort the spectrum of the diseased and non-diseased states in the dataset and introduce spectrum bias, limiting the model’s applicability to real-world patient populations [50]. However, only 7 studies employed such cohorts for model testing, suggesting that this approach remains underutilized (Q10).
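To illustrate one practical consequence of non-representative sampling, the brief Python sketch below shows how the same fixed sensitivity and specificity translate into very different positive predictive values and false-positive burdens when a model is evaluated on an enriched 50% aneurysm-positive sample versus a cohort whose prevalence is closer to that of the general population. All numbers are hypothetical assumptions for illustration and are not drawn from the reviewed studies.

```python
def ppv_and_fp_burden(sensitivity: float, specificity: float,
                      prevalence: float, n_cases: int = 1000) -> tuple:
    """Expected positive predictive value (PPV) and false-positive count for a
    hypothetical detector applied to n_cases examinations at a given prevalence."""
    positives = prevalence * n_cases
    negatives = (1 - prevalence) * n_cases
    true_pos = sensitivity * positives
    false_pos = (1 - specificity) * negatives
    ppv = true_pos / (true_pos + false_pos)
    return ppv, false_pos

# Same assumed operating point (sensitivity 0.90, specificity 0.90) evaluated on:
# (a) an enriched case-control mix with 50% aneurysm-positive examinations, and
# (b) a cohort with ~5% prevalence, closer to the population figure cited in the Introduction.
for label, prev in [("enriched 50% sample", 0.50), ("5% prevalence cohort", 0.05)]:
    ppv, fp = ppv_and_fp_burden(0.90, 0.90, prev)
    print(f"{label}: PPV = {ppv:.2f}, expected false positives per 1,000 cases = {fp:.0f}")
# enriched 50% sample: PPV = 0.90, expected false positives per 1,000 cases = 50
# 5% prevalence cohort: PPV = 0.32, expected false positives per 1,000 cases = 95
```

This prevalence dependence is one reason why performance reported on convenience samples may overstate the clinical value of a model when it is deployed in routine diagnostic or screening settings.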
For AI models designed to detect intracranial aneurysms, it is essential to evaluate diagnostic performance from multiple perspectives to ensure both technical accuracy and clinical relevance. To this end, the following 4 evaluation metrics should be reported in combination: patient-wise sensitivity, lesion-wise sensitivity, patient-wise specificity, and the number of false positives per case. Each metric captures a distinct yet complementary aspect of model performance. Lesion-wise sensitivity measures how accurately the model detects individual aneurysms, which is particularly important in patients with multiple lesions. However, this metric alone does not indicate whether the model detects any aneurysm in a given patient—an aspect more directly reflected by patient-wise sensitivity, which aligns closely with clinical decision-making. Conversely, patient-wise sensitivity may overestimate overall performance by ignoring missed lesions when at least 1 is identified, underscoring the need for lesion-wise evaluation to ensure thorough detection. Patient-wise specificity indicates how reliably the model identifies aneurysm-negative patients, helping to reduce unnecessary follow-ups and false-positive burdens—issues not fully captured by vessel- or segment-level specificity. The number of false positives per case is also critical, as excessive false alarms can hinder workflow and reduce diagnostic confidence. Just as lesion-wise sensitivity and patient-wise sensitivity offer complementary perspectives on sensitivity, false positives per case and patient-wise specificity together provide a more complete understanding of specificity [11,51]. Despite their importance, only 4 of the 36 included studies explicitly reported all 4 metrics (Q11 to 13). Notably, 21 studies did not report patient-wise specificity; among these, 16 included only aneurysm-positive cases, making it impossible to calculate this metric. The definitions of the metrics also varied across studies. For example, some reported vessel-wise specificity, instead of patient-wise specificity, by assessing aneurysms at the level of individual vascular segments. The definition of false positives per case also varied. Some studies counted false positives on a per-patient basis, while others used lesion-level counts as the numerator. The denominator likewise differed, with some studies including all examinations regardless of aneurysm status, and others including only aneurysm-negative cases. These inconsistencies underscore the need for standardized reporting practices to ensure accurate interpretation and meaningful comparison across studies.
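For illustration, the 4 metrics can be computed from per-examination detection counts as in the following Python sketch; the data structure, field names, and toy numbers are hypothetical and do not correspond to any of the reviewed models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CaseResult:
    """Hypothetical per-examination output of an aneurysm-detection model."""
    true_lesions: int       # number of reference-standard aneurysms in the case
    detected_true: int      # how many of those aneurysms the model found
    false_positives: int    # model detections with no matching aneurysm

def summarize(cases: List[CaseResult]) -> dict:
    pos = [c for c in cases if c.true_lesions > 0]    # aneurysm-positive examinations
    neg = [c for c in cases if c.true_lesions == 0]   # aneurysm-negative examinations

    # Patient-wise sensitivity: a positive case counts as detected if >=1 lesion is found
    patient_sens = sum(c.detected_true > 0 for c in pos) / len(pos)
    # Lesion-wise sensitivity: fraction of all individual aneurysms detected
    lesion_sens = sum(c.detected_true for c in pos) / sum(c.true_lesions for c in pos)
    # Patient-wise specificity: a negative case counts as correct if no false positive is raised
    patient_spec = sum(c.false_positives == 0 for c in neg) / len(neg)
    # False positives per case, here averaged over all examinations
    # (some studies instead divide by aneurysm-negative examinations only)
    fp_per_case = sum(c.false_positives for c in cases) / len(cases)

    return {
        "patient_wise_sensitivity": patient_sens,
        "lesion_wise_sensitivity": lesion_sens,
        "patient_wise_specificity": patient_spec,
        "false_positives_per_case": fp_per_case,
    }

# Toy example: 3 aneurysm-positive and 2 aneurysm-negative examinations
example = [
    CaseResult(true_lesions=2, detected_true=1, false_positives=0),
    CaseResult(true_lesions=1, detected_true=1, false_positives=1),
    CaseResult(true_lesions=1, detected_true=0, false_positives=0),
    CaseResult(true_lesions=0, detected_true=0, false_positives=2),
    CaseResult(true_lesions=0, detected_true=0, false_positives=0),
]
print(summarize(example))
# patient-wise sensitivity 2/3, lesion-wise sensitivity 2/4,
# patient-wise specificity 1/2, false positives per case 3/5
```

As the comment in the sketch indicates, the denominator chosen for false positives per case (all examinations versus aneurysm-negative examinations only) should be stated explicitly, given the inconsistencies described above.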
DSA is widely regarded as the gold standard imaging modality for diagnosing intracranial aneurysms. Previous reviews have noted that many studies used radiologist consensus rather than DSA as the reference standard, consistent with our findings that only 6 studies employed DSA as the reference standard (Q14). However, due to the risk of procedure-related morbidity, DSA is typically reserved for cases where the potential benefit justifies the risk, particularly for treatment planning or when noninvasive imaging results are inconclusive or inconsistent [52]. As a result, restricting the study population to patients who underwent DSA can be an additional source of spectrum bias, as it may not reflect the broader clinical population where the AI model was intended to be applied. Therefore, it is important to recognize that the choice between DSA and radiologist consensus as the reference standard entails a trade-off between diagnostic certainty and the representativeness of the target population. This decision should be guided by the specific clinical scenario and the target population in which the AI model is intended to be applied. Selecting different reference standards for different purposes can be a practical approach, as demonstrated in a previous study where DSA was used for evaluating diagnostic accuracy, while radiologist consensus was used for assessing generalizability in external validation [42].
One important yet often overlooked issue is the diagnostic gray zone in identifying intracranial aneurysms. Even with the gold standard DSA, uncertainty may remain as to whether a bulging contour represents a true aneurysm, an infundibulum, or a tortuous vessel, as the determination ultimately relies on visual interpretation of a complex 3-dimensional structure. This challenge is further exacerbated in studies relying on convenience sampling, which often yield a study population where aneurysm cases are distinctly abnormal and control cases are distinctly normal, thereby limiting the representativeness of the target population encountered in real-world clinical practice. Addressing these gray areas warrants careful consideration, and the application of uncertainty quantification may offer a practical approach that could enhance both research and clinical use of AI in this field [53,54]. In addition, reporting inter- and intrarater variability of features annotated for the reference standard may be considered to address this issue, as recommended in the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) guideline [55].
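As a minimal sketch of how uncertainty quantification might be operationalized, assuming an ensemble or Monte Carlo dropout setup that yields several probability estimates per candidate lesion, the following illustrative Python code flags candidates whose estimates disagree, so that gray-zone findings such as possible infundibula are referred for human review rather than silently thresholded. The function name, thresholds, and numbers are hypothetical and not taken from any reviewed model.

```python
import numpy as np

def flag_uncertain_candidates(prob_samples: np.ndarray,
                              decision_threshold: float = 0.5,
                              uncertainty_threshold: float = 0.15):
    """prob_samples has shape (n_candidates, n_samples): each row holds repeated
    probability estimates for one candidate lesion (e.g., from an ensemble or
    Monte Carlo dropout). Returns the mean score, a simple uncertainty estimate,
    the binary prediction, and a flag marking candidates for careful human review."""
    mean_prob = prob_samples.mean(axis=1)
    std_prob = prob_samples.std(axis=1)          # spread across samples as an uncertainty proxy
    predicted_positive = mean_prob >= decision_threshold
    needs_review = std_prob >= uncertainty_threshold
    return mean_prob, std_prob, predicted_positive, needs_review

# Toy example: three candidates, five stochastic forward passes each
samples = np.array([
    [0.95, 0.93, 0.97, 0.96, 0.94],   # confidently aneurysm-like
    [0.10, 0.05, 0.08, 0.07, 0.06],   # confidently negative
    [0.30, 0.70, 0.45, 0.65, 0.40],   # disagreement: possible infundibulum / gray zone
])
mean_p, std_p, pred, review = flag_uncertain_candidates(samples)
print(np.round(mean_p, 2), np.round(std_p, 2), pred, review)
# Only the third candidate exceeds the uncertainty threshold and is flagged for review.
```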
In general, the integration of AI tools into clinical practice necessitates a stepwise evaluation process, beginning with the validation of technical performance, followed by the assessment of clinical performance [50,56]. Typically, technical performance is assessed in case-control studies, while clinical performance uses diagnostic cohort designs [54]. Reflecting this progression, the studies reviewed showed a trend toward improved methodology over time (Fig. 2). Earlier studies focused on technical performance using only aneurysm-positive cases or convenience samples, whereas recent studies more often adopted cohort designs and external validation. Notably, only 1 study employed a prospective design, highlighting that clinical validation of AI models for intracranial aneurysm detection remains in its early stages.
Beyond technical and clinical performance, human-AI interaction is an important factor to consider for the successful integration of AI into clinical practice. This interaction involves several dimensions, including interface design, explainability, trust, fairness, adaptability, and accountability [4]. Inadequate attention to these aspects has been associated with increased cognitive burden and digital fatigue among users [4,57]. However, formal conceptual frameworks for evaluating the quality of human-AI interaction are still lacking [58]. In the specific context of AI models for intracranial aneurysm detection, comparing diagnostic performance with and without AI assistance could serve as a proxy for assessing human-AI interaction; however, this was not within the scope of the present review. Moreover, the included studies provided insufficient information to enable a meaningful evaluation of human-AI interaction. Nevertheless, medical algorithmic auditing may offer a useful approach to improving human-AI interaction by systematically identifying and characterizing algorithmic errors, as demonstrated in a recent study [46]. Such efforts can enhance our understanding of the limitations and failure modes of AI models and offer insights to guide the development and validation of future AI systems in this context [59].
This review has several limitations. First, it excluded AI models designed for rupture risk prediction, aneurysm segmentation, or treatment outcome prediction. Second, it focused specifically on studies utilizing CTA and MRA, while excluding those based on DSA. As DSA is both invasive and considered the gold standard for diagnosing intracranial aneurysms, its clinical application differs substantially from that of CTA or MRA. Therefore, in the context of evaluating the potential utility of AI models for identifying patients who may be appropriate candidates for DSA, this review was limited to studies using CTA and MRA. In addition, some studies did not provide sufficient information regarding the imaging scanners used in their datasets. As a result, the assessment of scanner or vendor diversity may be subject to some limitations. Another limitation of this review is that all included studies were evaluated using a single methodological standard, irrespective of their position within the AI development and implementation framework [50,56]. This uniform approach may not fully reflect the contextual and methodological differences among studies and may overlook nuances in study design and intended application stage. Nevertheless, the objective was not to critique individual studies, but to provide an overview of current trends in the field.
In conclusion, this scoping review provides a comprehensive methodological overview of deep learning–based AI studies for intracranial aneurysm detection using CTA and MRA. Rather than emphasizing diagnostic performance, the review focused on methodological factors—such as population selection, validation strategies, reporting practices, and reference standards—within the context of bias and generalizability concerns. Common limitations included inconsistent handling of ruptured or treated aneurysms, incomplete reporting of coexisting pathologies, limited use of external validation, non-random or non-consecutive sampling, underreporting of key performance metrics, insufficient scanner diversity, and a near-absence of prospective validation. These findings indicate that most studies remain at the stage of technical performance evaluation, with a high risk of bias and poor generalizability, reflecting limited progress toward clinical performance assessment. To support real-world implementation, future research will require more rigorous study designs, representative validation cohorts, standardized reporting, and greater attention to human-AI interaction.
SUPPLEMENTARY MATERIALS
Supplementary material related to this article can be found online at https://doi.org/10.5469/neuroint.2025.00283.
Detailed scoring criteria for methodological assessment
Notes
Fund
None.
Ethics Statement
This article was exempted from the review by the institutional ethics committee. This article does not include any information that may identify the person.
Conflicts of Interest
The author has no conflicts to disclose.
Author Contributions
Concept and design: BJ. Analysis and interpretation: BJ. Data collection: BJ. Writing the article: BJ. Critical revision of the article: BJ. Final approval of the article: BJ. Overall responsibility: BJ.