Review Articles

FragmGAN: generative adversarial nets for fragmentary data imputation and prediction

Fang Fang

KLATASDS – MOE, School of Statistics, East China Normal University, Shanghai, People's Republic of China

ffang@sfs.ecnu.edu.cn

Shenliao Bao

KLATASDS – MOE, School of Statistics, East China Normal University, Shanghai, People's Republic of China

Received 25 Dec. 2022, Accepted 12 Oct. 2023, Published online: 27 Oct. 2023

Modern scientific research and applications very often encounter ‘fragmentary data’, which pose major challenges for imputation and prediction. By leveraging the structure of response patterns, we propose a unified and flexible framework based on Generative Adversarial Nets (GAN) that handles fragmentary data imputation and label prediction at the same time. Unlike most other generative-model-based imputation methods, which either have no theoretical guarantee or only consider data Missing Completely At Random (MCAR), the proposed FragmGAN has theoretical guarantees for imputation with data Missing At Random (MAR), and no hint mechanism is needed. FragmGAN trains a predictor simultaneously with the generator and discriminator. This linkage mechanism shows significant advantages in predictive performance in extensive experiments.
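To make the notion of a 'response pattern' concrete, the following minimal Python sketch groups the rows of a small toy data set by which variables are observed. It is only an illustration of the data structure and not the FragmGAN method itself; the function name `response_patterns`, the use of NaN to mark missing entries, and the toy values are assumptions made for this example.

```python
import math

def response_patterns(rows):
    """Group row indices by their observation pattern.

    Hypothetical helper for illustration: a pattern is a tuple of 0/1
    flags, with 1 where the value is observed and 0 where it is NaN.
    """
    groups = {}
    for i, row in enumerate(rows):
        # Build the 0/1 indicator of observed entries for this row.
        pattern = tuple(0 if math.isnan(v) else 1 for v in row)
        groups.setdefault(pattern, []).append(i)
    return groups

nan = float("nan")

# Toy fragmentary data: each row observes a different block of variables.
X = [
    [1.0, 2.0, nan],
    [0.5, nan, nan],
    [3.0, 1.0, nan],
    [2.0, 4.0, 6.0],
]

groups = response_patterns(X)
for pattern, idx in groups.items():
    print(pattern, idx)
```

In this toy example, rows 0 and 2 share one pattern, while rows 1 and 3 each form their own. With fragmentary data, the number of distinct patterns is typically small relative to the number of rows, which is what makes pattern-level structure usable.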


To cite this article: Fang Fang & Shenliao Bao (2024) FragmGAN: generative adversarial nets for fragmentary data imputation and prediction, Statistical Theory and Related Fields, 8:1, 15-28, DOI: 10.1080/24754269.2023.2272554

To link to this article: https://doi.org/10.1080/24754269.2023.2272554