Review Articles

Empirical likelihood inference and goodness-of-fit test for logistic regression model under two-phase case-control sampling

Zhen Sheng

KLATASDS-MOE, School of Statistics, East China Normal University, Shanghai, People’s Republic of China

Yukun Liu

KLATASDS-MOE, School of Statistics, East China Normal University, Shanghai, People’s Republic of China

Jing Qin

National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, USA

Pages 265–276 | Received 02 Mar. 2021, Accepted 18 Jun. 2021, Published online: 08 Jul. 2021

Owing to its cost-effectiveness and high efficiency, two-phase case-control sampling has been widely used in epidemiological studies. We develop a semi-parametric empirical likelihood approach to two-phase case-control data under the logistic regression model. We show that the maximum empirical likelihood estimator is asymptotically normal and that the empirical likelihood ratio asymptotically follows a central chi-square distribution. We find that the maximum empirical likelihood estimator coincides with the maximum likelihood estimator of Breslow and Holubkov (1997). Even so, the limiting distribution of the likelihood ratio and the likelihood-ratio-based interval and test are all new. Furthermore, we construct new Kolmogorov–Smirnov-type goodness-of-fit tests to assess the validity of the underlying logistic regression model. Our simulation results and a real-data application show that the likelihood-ratio-based interval and test have certain merits over their Wald-type counterparts and that the proposed goodness-of-fit test is valid.

References

  • Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika, 59(1), 19–35. https://doi.org/10.1093/biomet/59.1.19
  • Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika, 66, 17–26. https://doi.org/10.2307/2335237
  • Breslow, N. E., & Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika, 75(1), 11–20. https://doi.org/10.1093/biomet/75.1.11
  • Breslow, N. E., & Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Applied Statistics, 48(4), 457–468. https://doi.org/10.1111/1467-9876.00165
  • Breslow, N., & Day, N. E. (1980). Statistical methods in cancer research. Volume 1. The analysis of case-control studies. IARC Scientific Publications.
  • Breslow, N. E., & Holubkov, R. (1997). Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society Series B, 59(2), 447–461. https://doi.org/10.1111/rssb.1997.59.issue-2
  • Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E., & Kulich, M. (2009). Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: Applications in epidemiology. Statistics in Biosciences, 1(1), 32–49. https://doi.org/10.1007/s12561-009-9001-6
  • Cai, S., Chen, J., & Zidek, J. V. (2017). Hypothesis testing in the presence of multiple samples under density ratio models. Statistica Sinica, 27, 761–783. https://doi.org/10.5705/ss.2014.168
  • Chen, J., & Liu, Y. (2013). Quantile and quantile-function estimations under density ratio model. Annals of Statistics, 41(3), 1669–1692. https://doi.org/10.1214/13-AOS1129
  • Chen, J., & Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80(1), 107–116. https://doi.org/10.1093/biomet/80.1.107
  • Chen, J., Sitter, R. R., & Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89(1), 230–237. https://doi.org/10.1093/biomet/89.1.230
  • D'Angio, G. J., Breslow, N., Beckwith, J. B., Evans, A., Baum, H., Fernbach, D., Hrabovsky, E., Jones, B., & Kelalis, P. (1989). Treatment of Wilms' tumour. Results of the third national Wilms' tumor study. Cancer, 64(2), 349–360. https://doi.org/10.1002/(ISSN)1097-0142
  • Diao, G., Ning, J., & Qin, J. (2012). Maximum likelihood estimation for semiparametric density ratio model. The International Journal of Biostatistics, 8(1). https://doi.org/10.1515/1557-4679.1372
  • DiCiccio, T., Hall, P., & Romano, J. (1991). Empirical likelihood is Bartlett-correctable. The Annals of Statistics, 19(2), 1053–1061. https://doi.org/10.1214/aos/1176348137
  • Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. https://doi.org/10.1214/aos/1176344552
  • Farewell, V. (1979). Some results on the estimation of logistic models based on retrospective data. Biometrika, 66(1), 27–32. https://doi.org/10.1093/biomet/66.1.27
  • Flanders, W. D., & Greenland, S. (1991). Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine, 10(5), 739–747. https://doi.org/10.1002/(ISSN)1097-0258
  • Green, D. M., Breslow, N. E., Beckwith, J. B., Finklestein, J. Z., Grundy, P. E., Thomas, P. R., Kim, T., Shochat, S. J., Haase, G. M., Ritchey, M. L., Kelalis, P. P., & D'Angio, G. J. (1998). Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms' tumor: A report from the national Wilms' tumor study group. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 16(1), 237–245. https://doi.org/10.1200/JCO.1998.16.1.237
  • Kitamura, Y. (2006). Empirical likelihood methods in econometrics: theory and practice. Discussion Paper 1569. Cowles Foundation.
  • Lawless, J. F., Kalbfleisch, J. D., & Wild, C. J. (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society Series B, 61(2), 413–438. https://doi.org/10.1111/rssb.1999.61.issue-2
  • Liu, Y., & Chen, J. (2010). Adjusted empirical likelihood with high-order precision. The Annals of Statistics, 38(3), 1341–1362. https://doi.org/10.1214/09-AOS750
  • Luo, X., & Tsai, W. Y. (2012). A proportional likelihood ratio model. Biometrika, 99(1), 211–222. https://doi.org/10.1093/biomet/asr060
  • Neyman, J. (1938). Contribution to the theory of sampling human populations. Journal of the American Statistical Association, 33(201), 101–116. https://doi.org/10.1080/01621459.1938.10503378
  • Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
  • Owen, A. B. (2001). Empirical likelihood. Chapman and Hall.
  • Prentice, R. L., & Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66(3), 403–411. https://doi.org/10.1093/biomet/66.3.403
  • Qin, J. (1998). Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3), 619–630. https://doi.org/10.1093/biomet/85.3.619
  • Qin, J., & Zhang, B. (1997). A goodness-of-fit test for logistic regression models based on case-control data. Biometrika, 84(3), 609–618. https://doi.org/10.1093/biomet/84.3.609
  • Qin, J., & Zhang, B. (2005). Density estimation under a two-sample semiparametric model. Journal of Nonparametric Statistics, 17(6), 665–683. https://doi.org/10.1080/10485250500039346
  • Qin, J., Zhang, H., Li, P., Albanes, D., & Yu, K. (2015). Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika, 102(1), 169–180. https://doi.org/10.1093/biomet/asu048
  • Saegusa, T., & Wellner, J. A. (2013). Weighted likelihood estimation under two-phase sampling. Annals of Statistics, 41(1), 269–295. https://doi.org/10.1214/12-AOS1073
  • Schaid, D. J., Jenkins, G. D., Ingle, J. N., & Weinshilboum, R. M. (2013). Two-phase designs to follow-up genome-wide association signals with DNA resequencing studies. Genetic Epidemiology, 37(3), 229–238. https://doi.org/10.1002/gepi.2013.37.issue-3
  • Schill, W., Jöckel, K. H., Drescher, K., & Timm, J. (1993). Logistic analysis in case-control studies under validation sampling. Biometrika, 80(2), 339–352. https://doi.org/10.1093/biomet/80.2.339
  • Scott, A. J., & Wild, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika, 84(1), 57–71. https://doi.org/10.1093/biomet/84.1.57
  • Thomas, D. C., Yang, Z., & Yang, F. (2013). Two-phase and family-based designs for next-generation sequencing studies. Frontiers in Genetics, 4, 276. https://doi.org/10.3389/fgene.2013.00276
  • Walker, A. M. (1982). Anamorphic analysis: sampling and estimation for covariate effects when both exposure and disease are known. Biometrics, 38(4), 1025–1032. https://doi.org/10.2307/2529883
  • White, J. E. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115(1), 119–128. https://doi.org/10.1093/oxfordjournals.aje.a113266
  • Wu, C., & Thompson, M. E. (2020). Sampling theory and practice. Springer.
  • Zhao, P., & Wu, C. (2019). Some theoretical and practical aspects of empirical likelihood methods for complex surveys. International Statistical Review, 87(1), S239–S256. https://doi.org/10.1111/insr.v87.S1
  • Zhou, H., Song, R., Wu, Y., & Qin, J. (2011). Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics, 67(1), 194–202. https://doi.org/10.1111/j.1541-0420.2010.01446.x

To cite this article: Zhen Sheng, Yukun Liu & Jing Qin (2022) Empirical likelihood inference and goodness-of-fit test for logistic regression model under two-phase case-control sampling, Statistical Theory and Related Fields, 6:4, 265-276, DOI: 10.1080/24754269.2021.1946373

To link to this article: https://doi.org/10.1080/24754269.2021.1946373