Review Articles

A selective review of statistical methods using calibration information from similar studies

Jing Qin

National Institute of Allergy and Infectious Diseases, National Institutes of Health, Frederick, MD, USA

Yukun Liu

KLATASDS – MOE, School of Statistics, East China Normal University, Shanghai, People's Republic of China

Pengfei Li

Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada

ykliu@sfs.ecnu.edu.cn

Received 04 Jan. 2021, Accepted 10 Jan. 2022, Published online: 17 Feb. 2022

In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to effectively use the calibration information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that the methods based on summarized data usually have little or nearly no efficiency loss compared with the corresponding methods based on all-individual data. Finally, we review some recently developed big data analysis methods including communication-efficient distributed approaches, renewal estimation, and incremental inference as examples of the latest developments in methods using calibration information.
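The claim that summary-based methods can lose little or no efficiency can be illustrated in the simplest meta-analysis setting. The sketch below is not taken from the paper; it assumes each site holds normal data with a known common variance of 1 and shares only its sample mean and the variance of that mean. The function names `site_summary` and `inverse_variance_pool` and the simulated sites are hypothetical. Under these assumptions, the inverse-variance weighted combination of the site means coincides with the mean computed from all individual observations.

```python
import random
import statistics


def site_summary(values):
    """Summary one site would share: (sample mean, variance of that mean)."""
    n = len(values)
    # Assumption: known common unit variance, so Var(mean) = 1 / n.
    return statistics.fmean(values), 1.0 / n


def inverse_variance_pool(summaries):
    """Fixed-effect meta-analysis: weight each site mean by 1 / variance."""
    weights = [1.0 / var for _, var in summaries]
    total = sum(weights)
    estimate = sum(w * m for w, (m, _) in zip(weights, summaries)) / total
    return estimate, 1.0 / total


random.seed(0)
# Three hypothetical sites of unequal size drawn from N(2, 1).
sites = [[random.gauss(2.0, 1.0) for _ in range(n)] for n in (50, 200, 1000)]

# Estimate built from the shared summaries only.
summaries = [site_summary(s) for s in sites]
meta_estimate, meta_variance = inverse_variance_pool(summaries)

# Estimate built from all individual-level data.
pooled = [x for s in sites for x in s]
pooled_estimate = statistics.fmean(pooled)
pooled_variance = 1.0 / len(pooled)

print(meta_estimate, pooled_estimate)
print(meta_variance, pooled_variance)
```

In this special case the two estimators agree up to floating-point error. With estimated variances or more complex models, the agreement would generally hold only approximately, which is the "little or nearly no efficiency loss" the abstract refers to.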

To cite this article: Jing Qin, Yukun Liu & Pengfei Li (2022): A selective review of statistical methods using calibration information from similar studies, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2022.2037201

To link to this article: https://doi.org/10.1080/24754269.2022.2037201