Volume 46, Issue 9, Sep. 2019

Examining the practical limits of batch effect-correction algorithms: When should you care about batch effects?

doi: 10.1016/j.jgg.2019.08.002
  • Corresponding author: Wilson Wen Bin Goh (goh.informatics@gmail.com)
  • Received Date: 2019-05-11
  • Accepted Date: 2019-08-04
  • Revision Received Date: 2019-08-02
  • Available Online: 2019-09-20
  • Publish Date: 2019-09-20
  • Batch effects are technical sources of variation that can confound analysis. While many performance ranking exercises have been conducted to establish the best batch effect-correction algorithm (BECA), we hold that the notion of "best" is context-dependent. Moreover, questions beyond the simplistic notion of "best" are also interesting: are BECAs robust against various degrees of confounding and, if so, what is the limit? Using two different methods for simulating class (phenotype) and batch effects, and taking representative datasets across both genomics (RNA-Seq) and proteomics platforms, we demonstrate that when sample classes and batch factors are moderately confounded, most BECAs are remarkably robust and only weakly affected by upstream normalization procedures. This observation is consistently supported across the test datasets. BECAs do have limits: when sample classes and batch factors are strongly confounded, BECA performance declines, with variable performance in precision, recall, and batch correction. We also report that while conventional normalization methods have minimal impact on batch effect correction, they do not affect downstream statistical feature selection, and in strongly confounded scenarios, may even outperform BECAs. In other words, removing batch effects is no guarantee of optimal functional analysis. Overall, this study suggests that simplistic performance ranking exercises are quite trivial, and that all BECAs are compromises in some context or another.
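The abstract's point about class-batch confounding can be illustrated with a toy simulation. The sketch below is not the simulation method or any BECA evaluated in the article; it assumes a single truly differential feature, two classes, two batches, and uses per-batch mean-centering as a deliberately simple stand-in for batch correction. All function names and parameter values are hypothetical.

```python
import random
import statistics


def simulate(n_per_class=200, confounding=0.5, class_effect=2.0,
             batch_effect=3.0, seed=0):
    """Simulate one feature with additive class and batch effects.

    confounding (assumed range 0.5 to 1.0) is the fraction of each class
    placed in its "own" batch: 0.5 is a balanced design, 1.0 means class
    and batch are indistinguishable.
    """
    rng = random.Random(seed)
    values, classes, batches = [], [], []
    n_major = round(n_per_class * confounding)
    for cls in (0, 1):
        for i in range(n_per_class):
            # Class 0 sits mostly in batch 0, class 1 mostly in batch 1.
            batch = cls if i < n_major else 1 - cls
            noise = rng.gauss(0.0, 1.0)
            values.append(noise + class_effect * cls + batch_effect * batch)
            classes.append(cls)
            batches.append(batch)
    return values, classes, batches


def center_by_batch(values, batches):
    """Toy batch correction: subtract each batch's mean from its samples."""
    means = {
        b: statistics.fmean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - means[b] for v, b in zip(values, batches)]


def class_difference(values, classes):
    """Mean of class 1 minus mean of class 0."""
    class0 = [v for v, c in zip(values, classes) if c == 0]
    class1 = [v for v, c in zip(values, classes) if c == 1]
    return statistics.fmean(class1) - statistics.fmean(class0)


if __name__ == "__main__":
    for level in (0.5, 0.7, 0.9, 1.0):
        vals, cls, bat = simulate(confounding=level)
        raw = class_difference(vals, cls)
        corrected = class_difference(center_by_batch(vals, bat), cls)
        print(f"confounding={level:.1f}  raw diff={raw:.2f}  "
              f"corrected diff={corrected:.2f}")
```

Under these assumptions, the uncorrected class difference is inflated by the batch effect as confounding grows, while the batch-centered difference shrinks toward zero, because centering removes class signal along with batch signal once the two factors coincide. Real BECAs are more sophisticated than mean-centering, so this only conveys the intuition behind the reported limit, not its magnitude.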
