Insufficient sample size and events-per-variable (EPV)
Although the initial model considered 91 candidate predictors, only ≈15 features appear to have been retained after Boruta-based selection. With 50 POAF events among 100 patients, the resulting EPV is approximately 3.3, well below the level required for model stability.
A model with 15 predictors and an anticipated outcome incidence of ~22% would require at least 550-700 participants to ensure reliable parameter estimation, prevent overfitting, and achieve a shrinkage factor ≥0.9.[2,3]
The current sample therefore falls far short of the minimum sample size required for internally valid model development under modern reporting standards.
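To make the arithmetic explicit, the sketch below (Python) checks the EPV and applies the widely used shrinkage-based minimum-sample-size criterion of Riley and colleagues for logistic models; the anticipated Cox-Snell R² of 0.20 is our illustrative assumption, not a value reported in the original study.

    import math

    events, predictors = 50, 15
    print(events / predictors)  # EPV ≈ 3.3

    def riley_min_n(p, s_target, r2_cs):
        # Minimum n such that the expected uniform shrinkage factor of a
        # logistic model with p predictor parameters is at least s_target,
        # given an anticipated Cox-Snell R-squared of r2_cs.
        return math.ceil(p / ((s_target - 1) * math.log(1 - r2_cs / s_target)))

    print(riley_min_n(p=15, s_target=0.90, r2_cs=0.20))  # 597 participants

Under smaller (less optimistic) anticipated R² values, the required sample size rises further, consistent with the 550-700 range quoted above.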
Artificially balanced outcome prevalence
The cohort was sub-sampled to a 50/50 POAF/non-POAF split, creating an artificial prevalence of 0.50. All reported performance metrics in Table 3, including sensitivity, specificity, precision, F1 score, accuracy, and Cohen's κ, reflect this engineered 50% balance rather than the true clinical incidence of POAF, which is ≈22%. Recent simulation work shows that such balancing inflates apparent accuracy and produces severe miscalibration when the model is applied to real-world data.[4]
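A simple Bayes' rule calculation illustrates the distortion; the operating point below (sensitivity and specificity of 0.90) is hypothetical, chosen only to show how positive predictive value depends on prevalence.

    def ppv(sens, spec, prev):
        # Positive predictive value at a given outcome prevalence (Bayes' rule).
        tp = sens * prev
        fp = (1 - spec) * (1 - prev)
        return tp / (tp + fp)

    print(ppv(0.90, 0.90, prev=0.50))  # 0.90 in the artificially balanced cohort
    print(ppv(0.90, 0.90, prev=0.22))  # ≈0.72 at the true ≈22% incidence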
Limited test set and absence of external validation
Only 20 patients comprised the hold-out test set, so misclassifying a single case shifts accuracy by five percentage points. No geographic or temporal validation was reported, contrary to TRIPOD+AI reporting guidance.[2]
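The statistical fragility of a 20-patient test set is easy to quantify; the sketch below computes a 95% Wilson score interval for a hypothetical accuracy of 18/20 (90%), chosen for illustration.

    import math

    def wilson_ci(k, n, z=1.96):
        # 95% Wilson score interval for a binomial proportion k/n.
        p = k / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return centre - half, centre + half

    print(wilson_ci(18, 20))  # ≈(0.70, 0.97): compatible with anything from
                              # mediocre to near-perfect discrimination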
Choice of algorithms and interpretability
The core model uses a Probabilistic Data Association (PDA) classifier, a method originally developed for radar and sonar target tracking rather than for clinical binary classification. In addition, no model explainability method (e.g., SHAP) was provided, although transparent interpretation is essential for clinical applicability and trust.
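For illustration, generating SHAP explanations requires only a few lines; the sketch below trains a random forest on synthetic stand-in data, since the study's model and feature matrix are not available to us.

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in: 100 patients x 15 selected features, binary POAF label.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 15))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100) > 0).astype(int)

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)   # exact SHAP values for tree ensembles
    shap_values = explainer.shap_values(X)  # per-patient feature attributions
    # Global importance/direction plot (the return shape of shap_values
    # varies across shap versions):
    shap.summary_plot(shap_values, X)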
Moreover, the confusion matrix in Figure 4 is inconsistent with the sensitivity, specificity, and precision values reported in the text: TP=10 and FP=1 yield a sensitivity of 1.00 and a specificity of ≈0.90, not the reverse. This indicates a reporting error and raises doubt about the reliability of the performance estimates.
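The inconsistency can be verified in two lines, assuming the balanced 10/10 test split implies FN=0 and TN=9:

    def sens_spec(tp, fp, fn, tn):
        # Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP).
        return tp / (tp + fn), tn / (tn + fp)

    print(sens_spec(tp=10, fp=1, fn=0, tn=9))  # (1.0, 0.9), not the reverse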
In summary, while Akbulut et al.'s work represents an important step toward incorporating artificial intelligence into cardiac surgical care, several methodological limitations, particularly regarding sample size, outcome balancing, validation, and reporting standards, warrant cautious interpretation. Recognizing and addressing these limitations in future research will be essential to building robust, generalizable, and clinically trustworthy ML tools for peri-operative risk stratification.
Data Sharing Statement: The data that support the findings of this study are available from the corresponding author upon reasonable request.
Author Contributions: All authors contributed equally to this article.
Conflict of Interest: The authors declared no conflicts of interest with respect to the authorship and/or publication of this article.
Funding: The authors received no financial support for the research and/or authorship of this article.
1) Akbulut B, Çakır M, Sarıkaya MG, Oral O, Yılmaz M, Aykal G. Artificial intelligence to predict biomarkers for new-onset atrial fibrillation after coronary artery bypass grafting. Turk Gogus Kalp Dama 2025;33:144-53. doi: 10.5606/tgkdc.dergisi.2025.27304.
2) Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024;385:e078378. doi: 10.1136/bmj-2023-078378.