2023, 50(3): 151-162.
doi: 10.1016/j.jgg.2022.12.005
Abstract:
Screening biomolecular markers from high-dimensional biological data is a long-standing task in translational biomedical research. Because it offers both feature shrinkage and biological interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) is one of the most popular methods for clinical biomarker development. In practice, however, applying LASSO to high-dimensional, low-sample-size omics data often yields an excessive number of predictive variables, which leads to overfitting. Here, we present VSOLassoBag, a LASSO wrapper that integrates an ensemble learning strategy to help select efficient and stable variables with high confidence from omics data. Using a bagging strategy in combination with either a parametric method or an inflection point search method, VSOLassoBag aggregates the variables selected by multiple LASSO models and votes on them to determine the optimal candidates. Applying VSOLassoBag to both simulated and real-world datasets shows that the algorithm can effectively identify markers for either case-control binary classification or prognosis prediction. In addition, compared with multiple existing algorithms, VSOLassoBag achieves comparable performance under different scenarios while selecting fewer features. In summary, VSOLassoBag, which is available at https://seqworld.com/VSOLassoBag/ under the GPL v3 license, provides an alternative strategy for selecting reliable biomarkers from high-dimensional omics data. For users' convenience, VSOLassoBag is implemented as an R package that supports multithreaded computation.
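The bagging-and-voting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the VSOLassoBag R package or its API. All function names here (`lasso_cd`, `lasso_bag_counts`, `select_by_largest_gap`) are hypothetical, the LASSO solver is a plain coordinate-descent routine on simulated data, and the "largest gap" cutoff is only a crude stand-in, assumed for illustration, for the inflection point search mentioned above.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in coordinate-descent LASSO."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def center(X, y):
    """Subtract column means from X and the mean from y (no intercept is fitted)."""
    n, p = len(X), len(X[0])
    mx = [sum(row[j] for row in X) / n for j in range(p)]
    my = sum(y) / n
    return [[row[j] - mx[j] for j in range(p)] for row in X], [v - my for v in y]


def lasso_cd(X, y, lam, n_iter=50):
    """Minimize (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the partial residual (feature j excluded).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_b
    return beta


def lasso_bag_counts(X, y, lam, n_bags=30, seed=0):
    """Fit LASSO on bootstrap resamples; count how often each feature is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb, yb = center([X[i] for i in idx], [y[i] for i in idx])
        beta = lasso_cd(Xb, yb, lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1  # one vote per bag in which feature j is nonzero
    return counts


def select_by_largest_gap(counts):
    """Keep features ranked above the largest drop in sorted vote counts.

    Assumption: a simplified stand-in for an inflection point search.
    """
    order = sorted(range(len(counts)), key=lambda j: counts[j], reverse=True)
    ranked = [counts[j] for j in order]
    gaps = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[: cut + 1])


# Simulated data (hypothetical): 60 samples, 20 features, only features 0-2 informative.
data_rng = random.Random(42)
n_samples, n_features = 60, 20
true_beta = [3.0, -2.0, 2.0] + [0.0] * (n_features - 3)
X = [[data_rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [sum(b * x for b, x in zip(true_beta, row)) + data_rng.gauss(0, 0.5) for row in X]

counts = lasso_bag_counts(X, y, lam=0.5, n_bags=30, seed=1)
selected = select_by_largest_gap(counts)
print("vote counts:", counts)
print("selected features:", selected)
```

In this sketch, a feature that survives the LASSO penalty in nearly every bootstrap resample accumulates many votes, while features selected only by chance accumulate few, so a cutoff on the vote counts separates stable candidates from unstable ones.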