Construction and clinical application exploration of an artificial intelligence-based high-quality lung cancer surgery dataset

Chinese Journal of Clinical Thoracic and Cardiovascular Surgery

Authors:

HUANG Xuhua¹, NIE Yunfeng², SHEN Liang³, KONG Pengxu¹, TAN Xin², LI Zihao¹, LV Wang¹, ZHOU Min³, LV Xudong², HU Jian¹,⁴

1. Department of Thoracic Surgery, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310003, P. R. China;
2. College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou, 310027, P. R. China;
3. Department of Information Technology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310003, P. R. China;
4. Key Laboratory of Clinical Evaluation Technology for Medical Device of Zhejiang Province, Hangzhou, 310003, P. R. China;

Corresponding authors:

LV Xudong, Email: lvxd@zju.edu.cn; HU Jian, Email: dr_hujian@zju.edu.cn

Keywords:

Lung cancer; artificial intelligence; surgery-oriented disease-specific index; openEHR standard; high-quality dataset; large language model; quality control

DOI:

10.7507/1007-4848.202512108


Objective: To construct a surgery-oriented, disease-specific lung cancer database covering the entire perioperative care pathway, thereby improving the quality and usability of key surgical data elements.

Methods: Real-world clinical data were extracted from a single-center thoracic surgery department. A standardized data model was established based on the open electronic health record (openEHR) standard. Large language models (LLMs), optical character recognition (OCR), and other artificial intelligence (AI)-driven techniques were used to extract, structure, and quality-control unstructured clinical narratives, imaging reports, and radiological data, with a focus on capturing surgically relevant perioperative indicators.

Results: A multimodal database comprising 19 917 patients was established, including 7 930 males and 11 987 females, aged 15 to 97 (61.7±9.7) years. The database includes 582 structured data variables, textual report data corresponding to 69 clinical indicators, 13 000 pulmonary function test PDF reports, and chest CT imaging data from 16 884 patients. It comprehensively covers the major information relevant to the surgical diagnosis and treatment of lung cancer and significantly improves the completeness and granularity of surgical detail data. LLMs and OCR improved the efficiency of converting unstructured data into structured formats, while a multi-level manual verification process ensured data accuracy and traceability. The database supports real-world research, including comparisons of surgical procedures, prediction of postoperative complications, prognosis assessment, and multimodal data association analyses.
