Review Articles

Variable screening in multivariate linear regression with high-dimensional covariates

Shiferaw B. Bizuayehu,

School of Statistics, East China Normal University, Shanghai, People’s Republic of China

Lu Li,

School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, People’s Republic of China

Jin Xu

School of Statistics, East China Normal University, Shanghai, People’s Republic of China; Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, East China Normal University, Shanghai, People’s Republic of China

Received 01 Jan. 2021; Accepted 09 Sep. 2021; Published online: 06 Oct. 2021

We propose two variable selection methods in multivariate linear regression with high-dimensional covariates. The first method uses a multiple correlation coefficient to rapidly reduce the dimension of the relevant predictors to a moderate or low level. The second method extends the univariate forward regression of Wang [(2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516] in a unified way, so that variable selection and model estimation are obtained simultaneously. We establish the sure screening property for both methods. Simulation and real data applications are presented to show the finite sample performance of the proposed methods in comparison with a naive method.
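The first screening step in the abstract ranks each predictor by a multiple correlation coefficient with the multivariate response and retains the top-ranked ones. The following is a minimal, hypothetical sketch of such correlation-based screening; the paper's exact statistic, normalization, and threshold may differ, and the function name `mcc_screen` and the cutoff `d` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def mcc_screen(X, Y, d):
    """Rank the columns of X by the squared multiple correlation of each
    predictor with the multivariate response Y and keep the top d.

    X : (n, p) predictor matrix; Y : (n, q) response matrix; d : int.
    Returns the indices of the d top-ranked predictors.
    """
    Yc = Y - Y.mean(axis=0)                      # center the responses
    scores = []
    for j in range(X.shape[1]):
        xj = X[:, j] - X[:, j].mean()            # center predictor j
        # R^2 from regressing x_j on the centered responses: this equals the
        # squared multiple correlation between x_j and (Y_1, ..., Y_q).
        beta, *_ = np.linalg.lstsq(Yc, xj, rcond=None)
        fitted = Yc @ beta
        scores.append((fitted @ fitted) / (xj @ xj))
    order = np.argsort(scores)[::-1]             # largest R^2 first
    return order[:d]
```

In a sure-screening spirit, `d` would be taken large enough (e.g. of order n / log n) that the retained set contains all truly relevant predictors with high probability, after which a refined method such as the paper's extended forward regression can be run on the reduced set.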

References

  • Anderson, T. W. (2003). An introduction to multivariate statistical analysis (3rd ed.). Wiley.
  • Bickel, P. J., & Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227. https://doi.org/10.1214/009053607000000758
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37(4), 373–384. https://doi.org/10.1080/00401706.1995.10484371
  • Breiman, L., & Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1), 3–54. https://doi.org/10.1111/rssb.1997.59.issue-1
  • Cai, T., Li, H., Liu, W., & Xie, J. (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100(1), 139–156. https://doi.org/10.1093/biomet/ass058
  • Cai, T., Liu, W., & Luo, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607. https://doi.org/10.1198/jasa.2011.tm10155
  • Cai, T., & Lv, J. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2365–2369. https://doi.org/10.1214/009053607000000442
  • Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351. https://doi.org/10.1214/009053606000001523
  • Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545. https://doi.org/10.1080/01621459.2012.734178
  • Cooper, C. (1997). The crippling consequences of fractures and their impact on quality of life. American Journal of Medicine, 103(2), 12–19. https://doi.org/10.1016/S0002-9343(97)90022-X
  • Deshpande, S., Rockova, V., & George, E. (2019). Simultaneous variable and covariance selection with the multivariate spike-and-slab lasso. Journal of Computational and Graphical Statistics, 28(4), 921–931. https://doi.org/10.1080/10618600.2019.1593179
  • Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499. https://doi.org/10.1214/009053604000000067
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B, 70(5), 849–911. https://doi.org/10.1111/rssb.2008.70.issue-5
  • Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32(3), 928–961. https://doi.org/10.1214/009053604000000256
  • Fang, K.-T., Kotz, S., & Ng, K. W. (2018). Symmetric multivariate and related distributions. Chapman and Hall/CRC.
  • Ferte, C., Trister, A. D., Huang, E., & Bot, B. (2013). Impact of bioinformatic procedures in the development and translation of high-throughput molecular classifiers in oncology. Clinical Cancer Research, 19(16), 4315–4325. https://doi.org/10.1158/1078-0432.CCR-12-3937
  • Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135. https://doi.org/10.1080/00401706.1993.10485033
  • Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3), 397–416. https://doi.org/10.1080/10618600.1998.10474784
  • He, K., Lian, H., Ma, S., & Huang, J. Z. (2018). Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. Journal of the American Statistical Association, 113(522), 746–754. https://doi.org/10.1080/01621459.2017.1285774
  • Jia, B., Xu, S., Xiao, G., & Lamba, V. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics, 73(4), 1221–1230. https://doi.org/10.1111/biom.v73.4
  • Kim, S., Sohn, K., & Xing, E. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics (Oxford, England), 25(12), i204–i212. https://doi.org/10.1093/bioinformatics/btp218
  • Kong, X., Liu, Z., Yao, Y., & Zhou, W. (2017). Sure screening by ranking the canonical correlations. Test, 26(1), 46–70. https://doi.org/10.1007/s11749-016-0497-z
  • Kong, Y., Zheng, Z., & Lv, J. (2016). The constrained Dantzig selector with enhanced consistency. Journal of Machine Learning Research, 17(123), 1–22.
  • Lee, W., & Liu, Y. (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111, 241–255. https://doi.org/10.1016/j.jmva.2012.03.013
  • Li, B., Chun, H., & Zhao, H. (2012). Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association, 107(497), 152–167. https://doi.org/10.1080/01621459.2011.644498
  • Li, C., & Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics (Oxford, England), 24(9), 1175–1182. https://doi.org/10.1093/bioinformatics/btn081
  • Li, G., Peng, H., Zhang, J., & Zhu, L. (2012). Robust rank correlation based screening. Annals of Statistics, 40(3), 1846–1877. https://doi.org/10.1214/12-AOS1024
  • Li, Y., Li, G., Lian, H., & Tong, T. (2017). Profile forward regression screening for ultra-high dimensional semiparametric varying coefficient partially linear models. Journal of Multivariate Analysis, 155, 133–150. https://doi.org/10.1016/j.jmva.2016.12.006
  • Li, Y., Nan, B., & Zhu, J. (2015). Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2), 354–363. https://doi.org/10.1111/biom.v71.2
  • Liang, H., Wang, H., & Tsai, C.-L. (2012). Profiled forward regression for ultrahigh dimensional variable screening in semiparametric partially linear models. Statistica Sinica, 22(2), 531–554. https://doi.org/10.5705/ss.2010.134
  • Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39(1), 1–47. https://doi.org/10.1214/09-AOS776
  • Pecanka, J., van der Vaart, A. W., & Jonker, M. A. (2019). Modeling association between multivariate correlated outcomes and high-dimensional sparse covariates: The adaptive SVS method. Journal of Applied Statistics, 46(5), 893–913. https://doi.org/10.1080/02664763.2018.1523377
  • Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., & Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Annals of Applied Statistics, 4(1), 53–77. https://doi.org/10.1214/09-AOAS271
  • Ravikumar, P., Wainwright, M., & Lafferty, J. (2010). High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691
  • Ren, J., Du, Y., Li, S., Ma, S., Jiang, Y., & Wu, C. (2019). Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genetic Epidemiology, 43(3), 276–291. https://doi.org/10.1002/gepi.2018.43.issue-3
  • Ren, J., He, T., Li, Y., Liu, S., Du, Y., Jiang, Y., & Wu, C. (2017). Network-based regularization for high dimensional SNP data in the case-control study of type 2 diabetes. BMC Genetics, 18(1), 44. https://doi.org/10.1186/s12863-017-0495-5
  • Reppe, S., Refvem, H., Gautvik, V. T., Olstad, O. K., Høvring, P. I., Reinholt, F. P., Holden, M., Frigessi, A., Jemtland, R., & Gautvik, K. M. (2010). Eight genes are highly associated with BMD variation in postmenopausal Caucasian women. Bone, 46(3), 604–612. https://doi.org/10.1016/j.bone.2009.11.007
  • Rothman, A. J., Levina, E., & Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4), 947–962. https://doi.org/10.1198/jcgs.2010.09188
  • Saulis, L., & Statulevicius, V. (1991). Limit theorems for large deviations (Vol. 73). Springer Science & Business Media.
  • Setodji, C. M., & Cook, R. D. (2004). K-means inverse regression. Technometrics, 46(4), 421–429. https://doi.org/10.1198/004017004000000437
  • Smith, M., & Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. Journal of the American Statistical Association, 102(478), 417–431. https://doi.org/10.1198/016214506000001031
  • Sofer, T., Dicker, L., & Lin, X. (2014). Variable selection for high dimensional multivariate outcomes. Statistica Sinica, 24(4), 1633–1654. https://doi.org/10.5705/ss.2013.019
  • Song, Y., Schreier, P. J., Ramirez, D., & Hasija, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing, 128, 449–458. https://doi.org/10.1016/j.sigpro.2016.05.020
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
  • Turlach, B., Venables, W., & Wright, S. (2005). Simultaneous variable selection. Technometrics, 47(3), 349–363. https://doi.org/10.1198/004017005000000139
  • Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516
  • Wang, J., Zhang, Z., & Ye, J. (2019). Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. Journal of Machine Learning Research, 20(163), 1–42.
  • Yang, Y., & Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129–1141. https://doi.org/10.1007/s11222-014-9498-5
  • Yi, N. (2010). Statistical analysis of genetic interactions. Genetics Research, 92(5–6), 443–459. https://doi.org/10.1017/S0016672310000595
  • Yin, J., & Li, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics, 5(4), 2630–2650. https://doi.org/10.1214/11-AOAS494
  • Zhang, C., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4), 1567–1594. https://doi.org/10.1214/07-AOS520
  • Zhang, H. H., & Lu, W. (2007). Adaptive lasso for Cox's proportional hazards model. Biometrika, 94(3), 691–703. https://doi.org/10.1093/biomet/asm037
  • Zhang, N., Yu, Z., & Wu, Q. (2019). Overlapping sliced inverse regression for dimension reduction. Analysis and Applications, 17(5), 715–736. https://doi.org/10.1142/S0219530519400013
  • Zhao, W., Lian, H., & Ma, S. (2017). Robust reduced-rank modeling via rank regression. Journal of Statistical Planning and Inference, 180, 1–12. https://doi.org/10.1016/j.jspi.2016.08.009
  • Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320. https://doi.org/10.1111/rssb.2005.67.issue-2

To cite this article: Shiferaw B. Bizuayehu, Lu Li & Jin Xu (2021): Variable screening in multivariate linear regression with high-dimensional covariates, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2021.1982607
To link to this article: https://doi.org/10.1080/24754269.2021.1982607