Objective To evaluate the accuracy of three large language models (LLMs), ChatGPT, Grok, and DeepSeek, in predicting the natural outcome of pediatric ventricular septal defect (VSD), to quantify their discrepancies with actual clinical outcomes, and to provide insight into whether LLMs can assist clinicians in formulating personalized management recommendations. Methods A retrospective analysis was conducted of clinical data from pediatric patients with VSD admitted to the Children's Hospital of Nanjing Medical University between October and December 2020. VSD severity, the probability of spontaneous closure, and the necessity of surgery were each evaluated by ChatGPT, Grok, DeepSeek, and an expert panel. Intergroup differences were analyzed and compared with actual clinical outcomes. The stability of model performance was assessed on the basis of three repeated evaluations by each LLM. Results A total of 146 children were enrolled, including 87 (59.6%) males and 59 (40.4%) females, with a median age at first diagnosis of 2.0 months (IQR: 1.1-3.4). Significant differences were observed between the Grok group and the expert panel in assessing the probability of spontaneous closure (P=0.01) and the necessity of surgery (P=0.02). The ChatGPT group also differed from the expert panel in evaluating the necessity of surgery (P=0.05). Compared with actual clinical outcomes, only the Grok group showed a significant difference (P<0.05), whereas ChatGPT achieved the highest consistency between predicted and actual outcomes. Intra-group analysis of the three repeated assessments for each LLM showed no statistically significant differences (all P>0.05). Conclusion LLMs demonstrate potential and high stability in predicting the natural outcome of VSD; ChatGPT in particular shows the highest consistency between its assessments and actual outcomes. LLMs may serve as an auxiliary tool to support the formulation of personalized management strategies.