A selective review of statistical methods using calibration information from similar studies

ISSN 2475-4269

CN 31-2182/O1

Yukun Liu

KLATASDS – MOE, School of Statistics, East China Normal University, Shanghai, People's Republic of China

ykliu@sfs.ecnu.edu.cn

Pengfei Li

Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada

Received 23 Jun. 2022, Accepted 24 Jun. 2022, Published online: 27 Jul. 2022

Abstract

In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to use the calibration information from each machine in parallel computation effectively has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that methods based on summarized data usually lose little or no efficiency compared with the corresponding methods based on full individual-level data. Finally, we review some recently developed big data analysis methods, including communication-efficient distributed approaches, renewable estimation, and incremental inference, as examples of the latest developments in methods using calibration information.
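
As a rough illustration of the efficiency claim above, the following minimal sketch combines per-machine summaries with a fixed-effect, inverse-variance-weighted average and compares the result with the estimate from the pooled individual-level data. It is an assumption-laden toy example, not a method taken from the paper: the data are simulated, the target is a simple mean, and the names `summarize` and `combine` are hypothetical.

```python
import random
import statistics

def summarize(block):
    """Reduce one machine's data to (sample mean, estimated variance of that mean)."""
    n = len(block)
    mean = statistics.fmean(block)
    var_of_mean = statistics.variance(block) / n
    return mean, var_of_mean

def combine(summaries):
    """Fixed-effect meta-analysis: weight each mean by its inverse variance."""
    weights = [1.0 / v for _, v in summaries]
    total = sum(weights)
    estimate = sum(w * m for w, (m, _) in zip(weights, summaries)) / total
    variance = 1.0 / total  # variance of the combined estimate
    return estimate, variance

# Simulated setting (assumption): 5 machines, 2000 observations each, true mean 2.0.
rng = random.Random(0)
blocks = [[rng.gauss(2.0, 1.0) for _ in range(2000)] for _ in range(5)]

# Each machine communicates only two numbers.
summaries = [summarize(b) for b in blocks]
combined, combined_var = combine(summaries)

# Benchmark: the estimate that uses all individual-level data at once.
pooled = [x for b in blocks for x in b]
full_mean = statistics.fmean(pooled)
full_var = statistics.variance(pooled) / len(pooled)

print("summary-based estimate:", combined, "variance:", combined_var)
print("individual-data estimate:", full_mean, "variance:", full_var)
```

In this simple setting the two estimates and their estimated variances are nearly identical, even though each machine shares only a mean and a variance. More complex models may behave differently, which is what the methods reviewed in the paper address.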

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002, June 3–5). Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, Madison, Wisconsin, USA (pp. 1–16). ACM.
Back, K., & Brown, D. P. (1992). GMM, maximum likelihood, and nonparametric efficiency. Economics Letters, 39(1), 23–28. https://doi.org/10.1016/0165-1765(92)90095-G
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. (2009). Introduction to meta-analysis. Wiley.
Braverman, M., Garg, A., Ma, T., Nguyen, H., & Woodruff, D. (2016). Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on theory of computing (pp. 1011–1020). ACM.
Chatterjee, N., Chen, Y.-H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
Chaudhuri, S., Handcock, M. S., & Rendall, M. S. (2008). Generalized linear models incorporating population level information: an empirical likelihood based approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2), 311–328. https://doi.org/10.1111/rssb.2008.70.issue-2
Chen, J., & Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80(1), 107–116. https://doi.org/10.1093/biomet/80.1.107
Chen, J., Sitter, R., & Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89(1), 230–237. https://doi.org/10.1093/biomet/89.1.230
Cochran, W. G. (1977). Sampling techniques (3rd ed.). Wiley.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177–188. https://doi.org/10.1016/0197-2456(86)90046-2
Duan, R., Ning, Y., & Chen, Y. (2020). Heterogeneity-aware and communication-efficient distributed statistical inference. arXiv:1912.09623v1.
Duchi, J., Jordan, M., Wainwright, M., & Zhang, Y. (2015). Optimality guarantees for distributed statistical estimation. arXiv:1405.0782.
Han, P., & Lawless, J. (2016). Comment. Journal of the American Statistical Association, 111(513), 118–121. https://doi.org/10.1080/01621459.2016.1149399
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4), 1029–1054. https://doi.org/10.2307/1912775
Hartley, H. O., & Rao, J. N. K. (1968). A new estimation theory for sample surveys. Biometrika, 55(3), 547–557. https://doi.org/10.1093/biomet/55.3.547
Imbens, G., & Lancaster, T. (1994). Combining micro and macro data in microeconometric models. Review of Economic Studies, 61(4), 655–680. https://doi.org/10.2307/2297913
Jordan, M. I., Lee, J. D., & Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526), 668–681. https://doi.org/10.1080/01621459.2018.1429274
Lee, J., Liu, Q., Sun, Y., & Taylor, J. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18(2017), 1–30.
Lin, D. Y., & Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2), 321–332. https://doi.org/10.1093/biomet/asq006
Luo, L., & Song, P. X. K. (2020). Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society, Series B, 82(1), 69–97. https://doi.org/10.1111/rssb.12352
Neiswanger, W., Wang, C., & Xing, E. (2015). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the thirtieth conference on uncertainty in artificial intelligence (pp. 623–632). AUAI Press.
Nguyen, T. D., Shih, M. H., Srivastava, D., Tirthapura, S., & Xu, B. (2021). Stratified random sampling from streaming and stored data. Distributed and Parallel Databases, 39, 665–710. https://doi.org/10.1007/s10619-020-07315-w
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
Owen, A. B. (1990). Empirical likelihood ratio confidence regions. Annals of Statistics, 18(1), 90–120. https://doi.org/10.1214/aos/1176347494
Owen, A. B. (2001). Empirical likelihood. CRC.
Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika, 87(2), 484–490. https://doi.org/10.1093/biomet/87.2.484
Qin, J. (2017). Biased sampling, over-identified parameter problems and beyond. Springer-Verlag.
Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. Annals of Statistics, 22(1), 300–325. https://doi.org/10.1214/aos/1176325370
Qin, J., Zhang, H., Li, P., Albanes, D., & Yu, K. (2015). Using covariate specific disease prevalence information to increase the power of case-control study. Biometrika, 102(1), 169–180. https://doi.org/10.1093/biomet/asu048
Schennach, S. M. (2007). Point estimation with exponentially tilted empirical likelihood. Annals of Statistics, 35(2), 634–672. https://doi.org/10.1214/009053606000001208
Tian, L., & Gu, Q. (2016). Communication-efficient distributed sparse linear discriminant analysis. arXiv:1610.04798.
van de Geer, S., Bühlmann, P., Ritov, Y., & Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high dimensional models. Annals of Statistics, 42(3), 1166–1202. https://doi.org/10.1214/14-AOS1221
van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press.
Wang, J., Kolar, M., Srebro, N., & Zhang, T. (2017). Efficient distributed learning with sparsity. In Proceedings of the 34th international conference on machine learning, 70 (pp. 3636–3645). PMLR.
Wang, X., & Dunson, D. (2015). Parallelizing MCMC via Weierstrass sampler. arXiv:1312.4605.
Wu, C., & Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96(453), 185–193. https://doi.org/10.1198/016214501750333054
Wu, C., & Thompson, M. E. (2020). Sampling theory and practice. Springer.
Zeng, D., & Lin, D. Y. (2015). On random-effects meta-analysis. Biometrika, 102(2), 281–294. https://doi.org/10.1093/biomet/asv011
Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
Zhang, Y., Duchi, J., & Wainwright, M. (2013). Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14, 3321–3363.

To cite this article: Jing Qin, Yukun Liu & Pengfei Li (2022): A selective review of statistical methods using calibration information from similar studies, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2022.2096426

To link to this article: https://doi.org/10.1080/24754269.2022.2096426
