Volume 50, Issue 3, Mar. 2023

VSOLassoBag: a variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research

doi: 10.1016/j.jgg.2022.12.005
Funds:

This study is supported by the National Key R&D Program of China (2021YFA1302100 to Q.Z.), the Guangdong Basic and Applied Basic Research Foundation (2021A1515011743 to Q.Z.), the National Natural Science Foundation of China (82172861 to Q.Z.), and the National Key Clinical Discipline (to D.Z.).

  • Received Date: 2022-12-25
  • Accepted Date: 2022-12-26
  • Publish Date: 2023-01-03
  • Screening biomolecular markers from high-dimensional biological data is a long-standing task in biomedical translational research. Because it combines feature shrinkage with biological interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) is one of the most popular methods for clinical biomarker development. In practice, however, applying LASSO to high-dimensional, low-sample-size omics data often yields an excessive number of predictive variables and thus an overfitted model. Here, we present VSOLassoBag, a wrapped LASSO approach that integrates an ensemble learning strategy to select efficient and stable variables with high confidence from omics-based data. Using a bagging strategy combined with either a parametric method or an inflection point search method, VSOLassoBag integrates and votes on the variables generated by multiple LASSO models to determine the optimal candidates. Applied to both simulated and real-world datasets, the algorithm effectively identifies markers for either case-control binary classification or prognosis prediction. In addition, compared with multiple existing algorithms, VSOLassoBag shows comparable performance under different scenarios while selecting fewer features. In summary, VSOLassoBag, which is available at https://seqworld.com/VSOLassoBag/ under the GPL v3 license, provides an alternative strategy for selecting reliable biomarkers from high-dimensional omics data. For users' convenience, VSOLassoBag is implemented as an R package that supports multithreaded computation.
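
The bagging-and-voting idea described in the abstract can be illustrated with a minimal sketch. This is not the VSOLassoBag implementation, and it is written in Python rather than R. All function names (`lasso_cd`, `lasso_bag_counts`, `select_by_largest_gap`), the penalty value, and the simulated data are assumptions made for illustration. The "largest gap" cutoff is a simple stand-in for the inflection point search mentioned above, not the package's actual rule.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.
    Assumes roughly centred data, so no intercept is fitted."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the partial residual (feature j added back)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def lasso_bag_counts(X, y, lam, n_bags=30, seed=0):
    """Fit LASSO on bootstrap resamples and count how often each feature is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return counts

def select_by_largest_gap(counts):
    """Keep the features ranked above the largest drop in selection frequency
    (an illustrative cutoff, assumed here in place of an inflection point search)."""
    order = sorted(range(len(counts)), key=lambda j: -counts[j])
    freqs = [counts[j] for j in order]
    gaps = [freqs[k] - freqs[k + 1] for k in range(len(freqs) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[:cut + 1])

# Simulated data (hypothetical): 100 samples, 20 features, only features 0-2 informative
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(100)]
y = [3 * x[0] - 2 * x[1] + 2 * x[2] + rng.gauss(0, 0.5) for x in X]

counts = lasso_bag_counts(X, y, lam=0.4)
selected = select_by_largest_gap(counts)
print("selection counts:", counts)
print("selected features:", selected)
```

Features that carry real signal are selected in nearly every bootstrap model, whereas noise features enter only occasionally, so the ranked selection frequencies show a sharp drop that separates the two groups.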