A feature fusion framework for lung cancer computer-aided diagnosis model: development and application based on heterogeneous data from health examination populations_West China Medical Journal

Authors：

LI Yulin ¹, SONG Lijun ¹, TANG Xiumei ²,³, ZHONG Jiandan ¹, JI Guiyi ⁴, LI Weimin ³, GU Tao ²,³

1. School of Communication Engineering, Chengdu University of Information Technology, Chengdu, Sichuan 610225, P. R. China;
2. Institute of Hospital Management, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, P. R. China;
3. Institute of Respiratory Health and Multimorbidity, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, P. R. China;
4. Health Management Center, General Practice Medical Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, P. R. China;

Corresponding author：

GU Tao, Email: gutao@swufe.edu.cn

Keywords：

Lung cancer; clinical decision support model; feature fusion; health examination population; deep learning

DOI：

10.7507/1002-0179.202509119

Objective To develop a computer-aided diagnosis model for lung cancer based on routine health examination data for identifying individuals with a current high risk of lung cancer in health screening settings, thereby providing decision support for subsequent clinical confirmation.

Methods Individuals who underwent health examinations at the Health Management Center of West China Hospital, Sichuan University, between 2010 and 2022 were enrolled. After screening, a retrospective cohort of 5257 subjects was retained, comprising 1307 patients with lung cancer and 3950 non-lung cancer controls. A three-tier feature fusion model was designed: (1) Heterogeneous feature encoding module: a multi-layer perceptron and bidirectional encoder representations from transformers (BERT) were employed to extract feature vectors from structured data and unstructured data (medical records and imaging report texts), respectively. (2) Heterogeneous feature fusion architecture: dimensional expansion concatenation coupled with a gated recurrent unit-based gating network was implemented to achieve multi-scale feature alignment and deep interaction, thereby addressing dimensional discrepancies and information redundancy. (3) Attention-based decision mechanism: word-level attention with weighted pooling was applied to dynamically capture key features and generate risk probability distributions. Model performance was evaluated using precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC).

Results The proposed model significantly outperformed both single-data-type models and simple concatenation approaches. On the test set, the proposed model achieved a recall of 0.861, an F1-score of 0.882, and an AUC-ROC of 0.972, substantially surpassing the best-performing model trained on structured data alone (extreme gradient boosting: recall=0.630, F1-score=0.725, AUC-ROC=0.916) and the model trained on unstructured data alone (BERT coupled with a bidirectional long short-term memory network: recall=0.833, F1-score=0.846, AUC-ROC=0.944). Feature elimination experiments demonstrated minimal performance variation across different feature subsets, confirming the model’s capability to effectively identify and mitigate the impact of irrelevant features. Subgroup analyses revealed that the model performed optimally in female subjects (recall=0.835, F1-score=0.838, AUC-ROC=0.950) and individuals aged >69 years (recall=0.913, F1-score=0.875, AUC-ROC=0.911).

Conclusion The proposed model based on heterogeneous health examination data can identify high-risk individuals for lung cancer among health examination populations using only routine screening data, thereby facilitating the early diagnosis of lung cancer in this population.
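For readers who want to see how the four evaluation metrics named in the Methods relate to one another, the following is a minimal sketch in plain Python. It is not the authors' evaluation code. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions made for illustration only. The last function rearranges the F1 definition to show the precision implied by the reported test-set recall (0.861) and F1-score (0.882). The abstract does not report this precision, and rounding in the published figures makes the result approximate.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1 = lung cancer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outranks a random
    negative (ties count as half). Quadratic in sample size, fine for a sketch."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("recall and F1 are inconsistent")
    return f1 * recall / denom


# Toy data (hypothetical, not from the study): 4 cases, 4 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

toy_metrics = precision_recall_f1(y_true, y_pred)  # (0.75, 0.75, 0.75)
toy_auc = auc_roc(y_true, scores)                  # 0.875
reported_p = implied_precision(0.861, 0.882)       # about 0.904
```

Because recall and F1 depend on the chosen decision threshold while AUC-ROC does not, a comparison between models on recall or F1 alone is only meaningful when the thresholds were chosen in a comparable way.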

Citation： LI Yulin, SONG Lijun, TANG Xiumei, ZHONG Jiandan, JI Guiyi, LI Weimin, GU Tao. A feature fusion framework for lung cancer computer-aided diagnosis model: development and application based on heterogeneous data from health examination populations. West China Medical Journal, 2026, 41(4): 554-561. doi: 10.7507/1002-0179.202509119
