Review Articles

Kernel regression utilizing heterogeneous datasets

Chi-Shian Dai

Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

Jun Shao

Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA; School of Statistics, East China Normal University, Shanghai, People's Republic of China

jshao@wisc.edu

Received 05 Dec. 2022, Accepted 08 Apr. 2023, Published online: 28 Apr. 2023

Data analysis in modern scientific research and practice has shifted from analysing a single dataset to coupling several datasets. We propose and study a kernel regression method that can handle the challenge of heterogeneous populations. It greatly extends the constrained kernel regression of Dai and Shao [Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press], which requires the different datasets to come from a homogeneous population. The asymptotic normality of the proposed estimators is established under some conditions, and simulation results are presented to confirm our theory and to quantify the improvements gained from datasets with heterogeneous populations.

References

  • Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in Econometrics: Fifth World Congress (Vol. 1, pp. 99–144). Cambridge University Press. 
  • Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
  • Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414), 328–332. https://doi.org/10.2307/2290564
  • Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press. https://doi.org/10.5705/ss.202021.0446
  • Fan, J., Farmen, M., & Gijbels, I. (1998). Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 591–608. https://doi.org/10.1111/1467-9868.00142
  • Fan, J., Gasser, T., Gijbels, I., Brockmann, M., & Engel, J. (1997). Local polynomial regression: Optimal kernels and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49(1), 79–99. https://doi.org/10.1023/A:1003162622169
  • Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer. 
  • Kim, H. J., Wang, Z., & Kim, J. K. (2021). Survey data integration for regression analysis using model calibration. arXiv 2107.06448
  • Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
  • Lohr, S. L., & Raghunathan, T. E. (2017). Combining survey data with other data sources. Statistical Science, 32(2), 293–312. https://doi.org/10.1214/16-STS584
  • Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
  • Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. Journal of the American Statistical Association, 99(468), 1131–1139. https://doi.org/10.1198/016214504000000601
  • Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. https://doi.org/10.1137/1109020
  • Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2), 1–21. https://doi.org/10.1017/S0266466600008409
  • Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. https://doi.org/10.1016/S1573-4412(05)80005-4
  • Rao, J. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhya B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
  • Shao, J. (2003). Mathematical statistics (2nd ed.). Springer.
  • Shao, Y., Cook, R. D., & Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika, 94(2), 285–296. https://doi.org/10.1093/biomet/asm021
  • Wand, M. P., & Jones, M. C. (1994, December). Kernel smoothing. Number 60 in Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Boca Raton. 
  • Wasserman, L. (2006). All of nonparametric statistics. Springer. 
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/1467-9868.03411
  • Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
  • Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
  • Zhang, Y., Ouyang, Z., & Zhao, H. (2017). A statistical framework for data integration through graphical models with application to cancer genomics. The Annals of Applied Statistics, 11(1), 161–184. https://doi.org/10.1214/16-AOAS998

To cite this article: Chi-Shian Dai & Jun Shao (2024) Kernel regression utilizing heterogeneous datasets, Statistical Theory and Related Fields, 8:1, 51-68, DOI: 10.1080/24754269.2023.2202579

To link to this article: https://doi.org/10.1080/24754269.2023.2202579