Unsupervised deep learning for identifying the O⁶-carboxymethyl guanine by nanopore sequencing

Journal of Biomedical Engineering

Authors:

GUAN Xiaoyu¹, WANG Yu²,³, ZHANG Jinyue²,³, SHAO Wei¹, HUANG Shuo²,³, ZHANG Daoqiang¹

1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, P. R. China;
2. State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China;
3. Chemistry and Biomedicine Innovation Center, Nanjing University, Nanjing 210023, P. R. China.

Corresponding authors:

HUANG Shuo, Email: shuo.huang@nju.edu.cn; ZHANG Daoqiang, Email: dqzhang@nuaa.edu.cn

Keywords:

Carboxymethyl guanine; Nanopore sequencing; DNA lesion; Gastrointestinal cancer; Deep learning; Unsupervised learning

DOI:

10.7507/1001-5515.202104068

O⁶-carboxymethyl guanine (O⁶-CMG) is a highly mutagenic DNA alkylation product that can cause gastrointestinal cancer. Previous studies localized it using a mutant Mycobacterium smegmatis porin A (MspA) nanopore assisted by Phi29 DNA polymerase. Machine learning has recently been widely applied to the analysis of nanopore sequencing data. However, machine learning methods usually require large amounts of labeled data, which places an extra burden on researchers and greatly limits their practicality. This paper therefore proposes nano-UDL (nano-Unsupervised-Deep-Learning), a method based on an unsupervised clustering algorithm that automatically identifies methylation events in nanopore data. Specifically, nano-UDL first uses a deep AutoEncoder to extract features from the nanopore dataset and then applies the MeanShift clustering algorithm to group the data. In addition, nano-UDL learns features suited to clustering by jointly optimizing the clustering loss and the reconstruction loss. Experimental results show that nano-UDL achieves relatively high recognition accuracy on the O⁶-CMG dataset and accurately identifies all sequence segments containing O⁶-CMG. To further verify the robustness of nano-UDL, hyperparameter sensitivity analysis and ablation experiments were carried out. Using machine learning to analyze nanopore data can effectively reduce the cost of manual data analysis, which is significant for many biological studies, including genome sequencing.
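The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the nano-UDL implementation: the AutoEncoder is omitted, the two-dimensional "latent features" are made-up numbers standing in for its output, and the function name `mean_shift`, the flat kernel, and the bandwidth value are assumptions chosen for illustration. The sketch only shows how mean shift groups feature vectors without labels and without fixing the number of clusters in advance.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels).

    Each point is moved repeatedly to the mean of the original points
    lying within `bandwidth` of its current position, until no point
    moves by more than `tol`.
    """
    shifted = [list(p) for p in points]
    for _ in range(max_iter):
        max_move = 0.0
        updated = []
        for s in shifted:
            neighbours = [p for p in points if math.dist(s, p) <= bandwidth]
            if not neighbours:  # defensive guard: keep the estimate as is
                updated.append(s)
                continue
            mean = [sum(coord) / len(neighbours) for coord in zip(*neighbours)]
            max_move = max(max_move, math.dist(s, mean))
            updated.append(mean)
        shifted = updated
        if max_move < tol:
            break

    # Converged positions closer than bandwidth / 2 are treated as one mode.
    modes, labels = [], []
    for s in shifted:
        for i, m in enumerate(modes):
            if math.dist(s, m) < bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(s)
            labels.append(len(modes) - 1)
    return modes, labels


# Hypothetical latent features: two well-separated groups of events.
features = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
modes, labels = mean_shift(features, bandwidth=1.0)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters falls out of the bandwidth rather than being specified, which is one reason mean shift suits unlabeled event data. In the method the abstract describes, the features would come from a trained encoder whose loss also includes a clustering term, which this sketch does not model.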

Citation: GUAN Xiaoyu, WANG Yu, ZHANG Jinyue, SHAO Wei, HUANG Shuo, ZHANG Daoqiang. Unsupervised deep learning for identifying the O⁶-carboxymethyl guanine by nanopore sequencing. Journal of Biomedical Engineering, 2022, 39(1): 139-148. doi: 10.7507/1001-5515.202104068
