VSOLassoBag: a variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research

Jiaqi Liang; Chaoye Wang; Di Zhang; Yubin Xie; Yanru Zeng; Tianqin Li; Zhixiang Zuo; Jian Ren; Qi Zhao

doi:10.1016/j.jgg.2022.12.005

Volume 50 Issue 3

Mar. 2023


Journal of Genetics and Genomics, 2023, 50(3): 151-162


a. State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, China;
b. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, Guangdong 510275, China;
c. Department of Coloproctology Surgery, Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510655, China;
d. Precision Medicine Institute, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510060, China;
e. Computer Science Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States

Funds:

This study was supported by the National Key R&D Program of China (2021YFA1302100 to Q.Z.), the Guangdong Basic and Applied Basic Research Foundation (2021A1515011743 to Q.Z.), the National Natural Science Foundation of China (82172861 to Q.Z.), and the National Key Clinical Discipline (to D.Z.).

Received Date: 2022-12-25
Accepted Date: 2022-12-26
Publish Date: 2023-01-03

Abstract

Screening biomolecular markers from high-dimensional biological data is a long-standing task in biomedical translational research. Because it combines feature shrinkage with biological interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm is one of the most popular methods for clinical biomarker development. In practice, however, applying LASSO to omics data with high dimensionality and low sample size often yields an excessive number of predictive variables, which leads to model overfitting. Here, we present VSOLassoBag, a wrapped LASSO approach that integrates an ensemble learning strategy to help select efficient and stable variables with high confidence from omics data. Using a bagging strategy in combination with a parametric method or an inflection point search method, VSOLassoBag integrates and votes on the variables generated from multiple LASSO models to determine the optimal candidates. Applying VSOLassoBag to both simulated and real-world datasets shows that the algorithm can effectively identify markers for either case-control binary classification or prognosis prediction. In addition, compared with multiple existing algorithms, VSOLassoBag shows comparable performance under different scenarios while selecting fewer features. For users' convenience, we implement VSOLassoBag as an R package that provides multithreading computing configurations. In summary, VSOLassoBag, which is available at https://seqworld.com/VSOLassoBag/ under the GPL v3 license, provides an alternative strategy for selecting reliable biomarkers from high-dimensional omics data.
Keywords: Feature selection, LASSO bagging algorithm, Biomarker discovery, Omics data
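
The bagging-and-voting idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the VSOLassoBag R package or its API. Every function name below is hypothetical, the response is Gaussian rather than binary or survival, the penalty `lam` is fixed rather than tuned, and the "largest gap" cutoff is only a crude stand-in for the inflection point search mentioned above. It shows, under those assumptions, how fitting LASSO on bootstrap resamples and counting how often each variable receives a nonzero coefficient yields a selection frequency that can be thresholded.

```python
import random


def soft_threshold(z, g):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bagged_selection_frequency(X, y, lam, n_bags=30, seed=0):
    """Fit LASSO on bootstrap resamples; return how often each feature is nonzero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_bags for c in counts]


def select_by_largest_gap(freq):
    """Keep the features ranked above the largest drop in sorted frequency.

    Assumption: a simple stand-in for an inflection point search.
    """
    order = sorted(range(len(freq)), key=lambda j: -freq[j])
    gaps = [freq[order[k]] - freq[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[: cut + 1])


def make_toy_data(n=80, p=10, seed=1):
    """Toy data: only features 0 and 1 carry signal; the rest are noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


X, y = make_toy_data()
freq = bagged_selection_frequency(X, y, lam=0.5)
selected = select_by_largest_gap(freq)
print("selection frequency:", [round(f, 2) for f in freq])
print("selected features:", selected)
```

In this toy setup the two signal-carrying features should be selected in essentially every bootstrap fit, while noise features appear only occasionally, so a frequency cutoff separates them. A real analysis would tune the penalty, for example by cross-validation, and use a response model that matches the outcome type.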

