Review Articles

Group screening for ultra-high-dimensional feature under linear model

Yong Niu

a School of Statistics, East China Normal University, Shanghai, People's Republic of China; b Department of Mathematics and Physics, Hefei University, Hefei, People's Republic of China

Riquan Zhang

a School of Statistics, East China Normal University, Shanghai, People's Republic of China

Zhangriquan@163.com, rqzhang@stat.ecnu.edu.cn

Pages 43-54 | Received 18 Jul. 2018, Accepted 17 Jun. 2019, Published online: 04 Jul. 2019

Abstract

Ultra-high-dimensional data with grouping structures arise naturally in many contemporary statistical problems, such as genome-wide association studies and multi-factor analysis of variance (ANOVA). To handle such data, we propose a group screening method that performs variable selection on groups of variables in linear models. The method is based on a working independence assumption, and we establish its sure screening property. To enhance the finite-sample performance, we develop a data-driven thresholding rule and a two-stage iterative procedure. To the best of our knowledge, screening for grouped variables has rarely appeared in the literature, and this method can be regarded as an important and non-trivial extension of screening for individual variables. An extensive simulation study and a real data analysis demonstrate its finite-sample performance.
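
To make the idea of screening groups under working independence concrete, the following Python sketch ranks each group by a marginal utility computed from that group alone and keeps the top-ranked groups. The abstract does not state the screening statistic, so the utility used here (the share of response variance explained by a least-squares fit on one group's columns) is an assumption made for illustration, as are the names `group_screen`, `demo`, and `n_keep`. The data-driven threshold and the two-stage iterative procedure are not reproduced; a fixed number of retained groups stands in for the threshold.

```python
import numpy as np


def group_screen(X, y, groups, n_keep):
    """Rank groups by a marginal utility and keep the n_keep best.

    Assumed utility (not taken from the paper): the fraction of the
    variance of y explained by regressing y on one group's columns only,
    ignoring all other groups (working independence).
    Each group should have fewer columns than there are observations.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    yc = y - y.mean()                      # centered response
    total = float(yc @ yc)                 # total sum of squares
    labels = np.unique(groups)
    utilities = np.empty(len(labels))

    for k, g in enumerate(labels):
        Xg = X[:, groups == g]
        Xg = Xg - Xg.mean(axis=0)          # center the group's columns
        coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        fitted = Xg @ coef
        utilities[k] = float(fitted @ fitted) / total  # group-wise R^2

    order = np.argsort(utilities)[::-1]    # largest utility first
    kept = labels[order[:n_keep]]
    return kept, dict(zip(labels.tolist(), utilities.tolist()))


def demo(seed=0):
    """Simulated example: 200 groups of 3 variables, 100 observations,
    with only groups 0 and 1 truly active."""
    rng = np.random.default_rng(seed)
    n, n_groups, size = 100, 200, 3
    X = rng.standard_normal((n, n_groups * size))
    groups = np.repeat(np.arange(n_groups), size)
    beta = np.zeros(n_groups * size)
    beta[groups == 0] = 3.0
    beta[groups == 1] = -3.0
    y = X @ beta + rng.standard_normal(n)
    kept, _ = group_screen(X, y, groups, n_keep=10)
    return kept
```

In this simulated setting the number of variables (600) far exceeds the sample size (100), yet each group-wise fit involves only three columns, which is what keeps marginal screening cheap. A penalized group method could then be applied to the retained groups only.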
