e-ISSN : 2149-8156
Turkish Journal of Thoracic and Cardiovascular Surgery     
Comment on the article: Machine-learning model for postoperative atrial fibrillation
Fatih Yiğit1, Taylan Adademir1, Kaan Kırali1
1Department of Cardiovascular Surgery, Koşuyolu High-Specialization Training and Research Hospital, İstanbul, Türkiye
DOI : 10.5606/tgkdc.dergisi.2025.28291

We read with great interest the study by Akbulut et al. on machine-learning (ML) prediction of postoperative atrial fibrillation (POAF) after isolated CABG.[1] Although the integration of artificial intelligence into peri-operative care is welcome, several methodological issues may overstate the model's reliability and clinical value.

Insufficient sample size and events-per-variable (EPV)
Although the initial model considered 91 candidate predictors, only ≈15 features appear to have been retained after Boruta-based selection. With 50 POAF events among 100 patients, the resulting EPV is approximately 3.3, well below the level required for model stability.

A model with 15 predictors and an anticipated outcome incidence of ≈22% would require at least 550 to 700 participants to ensure reliable parameter estimation, prevent overfitting, and achieve a shrinkage factor of ≥0.9.[2,3]
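For transparency, this requirement can be reproduced from the shrinkage criterion of Riley et al.[2,3] Below is a minimal sketch in Python; the anticipated Cox-Snell R² of 0.2 is an illustrative assumption, not a value reported by Akbulut et al.:

```python
import math

def riley_min_n(n_predictors: int, r2_cs: float, shrinkage: float = 0.9) -> float:
    """Riley et al. criterion (i): minimum n for a target shrinkage factor S,
    n >= p / ((S - 1) * ln(1 - R2_CS / S))."""
    return n_predictors / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))

p = 15             # predictors retained after Boruta selection
prevalence = 0.22  # anticipated clinical incidence of POAF
r2_cs = 0.20       # ASSUMED anticipated Cox-Snell R^2 (illustrative)

n_min = riley_min_n(p, r2_cs)   # ~597 participants
events = n_min * prevalence     # ~131 POAF events required
epv_study = 50 / p              # ~3.3 events per variable in the study
print(f"minimum n ~ {n_min:.0f}; events ~ {events:.0f}; study EPV ~ {epv_study:.1f}")
```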

The current sample therefore falls far short of the minimum sample size required for internally valid model development under modern reporting standards.

Artificially balanced outcome prevalence
The cohort was sub-sampled to a 50/50 POAF/non-POAF split, creating an artificial prevalence of 0.50. All performance metrics reported in Table 3, including sensitivity, specificity, precision, F1 score, accuracy, and Cohen's κ, therefore reflect this engineered 50% balance rather than the true clinical incidence of POAF, which is ≈22%. Recent simulation work shows that such balancing inflates apparent accuracy and produces severe miscalibration when the model is applied to real-world data.[4]
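If a balanced development sample is retained, the predicted probabilities must at minimum be corrected back to the true prevalence before any clinical use. A minimal sketch of this standard prior-odds (logit-offset) correction, with illustrative probabilities rather than the study's actual outputs:

```python
import numpy as np

def correct_prevalence(p_balanced: np.ndarray,
                       train_prev: float = 0.50,
                       true_prev: float = 0.22) -> np.ndarray:
    """Rescale probabilities estimated on an artificially balanced sample
    back to the true outcome prevalence via a prior-odds adjustment."""
    odds = p_balanced / (1.0 - p_balanced)
    # ratio of true prior odds to training (balanced) prior odds
    shift = (true_prev / (1.0 - true_prev)) / (train_prev / (1.0 - train_prev))
    corrected = odds * shift
    return corrected / (1.0 + corrected)

# A "balanced" prediction of 0.50 corresponds to only ~0.22 true risk
print(correct_prevalence(np.array([0.50, 0.80, 0.95])))  # -> [0.22, 0.53, 0.84]
```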

Limited test set and absence of external validation
The hold-out test set comprised only 20 patients, so misclassifying a single case shifts accuracy by five percentage points. No geographic or temporal validation was reported, contrary to TRIPOD-AI reporting guidance.[2]
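The width of a simple binomial confidence interval makes this fragility explicit. A minimal sketch using the Wilson score interval; the figure of 17/20 correct is an arbitrary illustration, not the study's result:

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (e.g., accuracy)."""
    p_hat = correct / n
    denom = 1.0 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# 17/20 correct: a point estimate of 0.85 is compatible with ~0.64 to ~0.95
lo, hi = wilson_ci(17, 20)
print(f"accuracy 0.85, 95% CI ({lo:.2f}, {hi:.2f})")
```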

Choice of algorithms and interpretability
The core model uses a Probabilistic Data Association (PDA) classifier, an algorithm originally developed for radar and sonar target tracking rather than clinical binary classification. Moreover, no model explainability analysis (e.g., SHAP) was provided, even though transparent interpretation is essential for clinical applicability and trust.
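For reference, attaching an explainability layer to a fitted model requires little code. A minimal sketch of the kind of SHAP summary that could accompany such a model, using a generic gradient-boosted classifier and synthetic data as stand-ins for the authors' model and cohort:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in: 15 features mirroring the retained predictor count
X, y = make_classification(n_samples=200, n_features=15, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer yields per-patient, per-feature contributions (log-odds scale)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Beeswarm summary: global feature importance with direction of effect
shap.summary_plot(shap_values, X)
```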

Inconsistency between Figure 4 and reported metrics
The confusion matrix in Figure 4 is inconsistent with the sensitivity, specificity, and precision values reported in the text (e.g., TP=10 and FP=1 yield sensitivity = 1.00 and specificity = 0.90, not the reverse), indicating a reporting error and casting doubt on the reliability of the performance estimates.
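The arithmetic is easy to verify directly from the confusion-matrix cells as we read them in Figure 4; a minimal sketch:

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recompute headline metrics from raw confusion-matrix cells."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# Cells as read from Figure 4 (20-patient balanced test set)
print(binary_metrics(tp=10, fp=1, tn=9, fn=0))
# -> sensitivity 1.00, specificity 0.90, precision ~0.91, accuracy 0.95
```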

In summary, while the work of Akbulut et al. represents an important step toward incorporating artificial intelligence into cardiac surgical care, several methodological limitations, particularly regarding sample size, outcome balancing, validation, and reporting standards, warrant cautious interpretation. Recognizing and addressing these limitations in future research will be essential to building robust, generalizable, and clinically trustworthy ML tools for peri-operative risk stratification.

Data Sharing Statement: The data that support the findings of this study are available from the corresponding author upon reasonable request.

Author Contributions: All authors contributed equally to this article.

Conflict of Interest: The authors declared no conflicts of interest with respect to the authorship and/or publication of this article.

Funding: The authors received no financial support for the research and/or authorship of this article.