Review Articles

Factor analysis of correlation matrices when the number of random variables exceeds the sample size

Miguel Marino,

Department of Family Medicine, Oregon Health & Science University, Portland, OR, USA

marinom@ohsu.edu

Yi Li

Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Pages 246-256 | Received 30 May 2017, Accepted 30 Oct 2017, Published online: 30 Nov 2017

ABSTRACT

Factor analysis of correlation matrices is an effective means of data reduction whose classical inference requires the number of random variables, p, to be relatively small and the sample size, n, to approach infinity. In contemporary data collection for biomedical studies, disease surveillance and genetics, the setting p > n limits the use of existing factor analysis methods to study the correlation matrix. The motivation for this research comes from studying the correlation matrix of log annual cancer mortality rate changes for p = 59 cancer types in the U.S.A. from 1969 to 2008 (n = 39). We formalise a test statistic for inference on the structure of the correlation matrix when p > n. We develop an approach based on group sequential theory to estimate the number of relevant factors to be extracted. To facilitate interpretation of the extracted factors, we propose a BIC (Bayesian Information Criterion)-type criterion that produces a sparse factor loading representation. The proposed methodology outperforms competing ad hoc methods in simulation analyses, and identifies three significant underlying factors responsible for the observed correlation between cancer mortality rate changes.
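To illustrate the p > n setting the abstract describes, the sketch below simulates data with the paper's dimensions (p = 59 variables, n = 39 observations) from a three-factor model and estimates the number of factors from the eigenvalues of the rank-deficient sample correlation matrix. The loading matrix, noise scale, and the permutation-based (parallel-analysis style) retention rule are all illustrative assumptions — this is a classical baseline, not the group-sequential test or BIC-type sparse-loading criterion the authors propose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the paper's dimensions: p = 59 cancer types,
# n = 39 annual observations, generated from an assumed 3-factor model.
p, n, k = 59, 39, 3
L = rng.normal(size=(p, k))          # hypothetical factor loadings
F = rng.normal(size=(n, k))          # latent factor scores
X = F @ L.T + rng.normal(scale=0.5, size=(n, p))

# Sample correlation matrix; with p > n it is rank-deficient,
# having at most n - 1 nonzero eigenvalues.
R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Crude retention heuristic: keep eigenvalues exceeding the 95th
# percentile of the largest eigenvalue under column-permuted (null) data,
# which breaks the correlation structure while preserving the margins.
null_max = []
for _ in range(200):
    Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    null_max.append(np.linalg.eigvalsh(np.corrcoef(Xp, rowvar=False)).max())
threshold = np.quantile(null_max, 0.95)
n_factors = int(np.sum(eigvals > threshold))
print("estimated number of factors:", n_factors)
```

Under this simulation the three factor eigenvalues sit far above the null threshold while the remaining (noise) eigenvalues fall well below it; the paper's contribution is a formal test and stopping rule for exactly this retention decision when p > n.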
