Review Articles

A selective review on statistical methods for massive data computation: distributed computing, subsampling, and minibatch techniques

Xuetong Li

Guanghua School of Management, Peking University, Beijing, People's Republic of China

Yuan Gao

Guanghua School of Management, Peking University, Beijing, People's Republic of China

yuan_gao@pku.edu.cn; ygao_stat@outlook.com

Hong Chang

Guanghua School of Management, Peking University, Beijing, People's Republic of China

Danyang Huang

Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, People's Republic of China

Yingying Ma

School of Economics and Management, Beihang University, Beijing, People's Republic of China

Rui Pan

School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, People's Republic of China

Haobo Qi

School of Statistics, Beijing Normal University, Beijing, People's Republic of China

Feifei Wang

Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, People's Republic of China

Shuyuan Wu

School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, People's Republic of China

Ke Xu

School of Statistics, University of International Business and Economics, Beijing, People's Republic of China

Jing Zhou

Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, People's Republic of China

Xuening Zhu

School of Data Science and MOE Laboratory for National Development and Intelligent Governance, Fudan University, Shanghai, People's Republic of China

Yingqiu Zhu

School of Statistics, University of International Business and Economics, Beijing, People's Republic of China

Hansheng Wang

Guanghua School of Management, Peking University, Beijing, People's Republic of China

Received 28 January 2024; Accepted 05 April 2024; Published online 23 April 2024.

This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been developed rapidly over the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature concerns distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer; in this case, a distributed computing system with multiple machines has to be utilized. The second class of literature concerns subsampling methods and addresses the situation where the sample size of the dataset is small enough to be stored on a single computer but too large to be easily processed by its memory as a whole. The last class of literature studies minibatch-gradient-related optimization techniques, which have been used extensively for training various deep learning models.
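To make the three categories concrete, the following minimal sketch (ours, not taken from the paper) contrasts them on a simulated linear regression problem: a one-shot averaging estimator computed across data partitions, an estimator fitted on a uniform random subsample, and a plain minibatch stochastic gradient descent loop. All sample sizes, learning rates, and helper names below are illustrative assumptions rather than the methods studied in any specific reference.

```python
# Illustrative sketch of the three computation strategies surveyed in the review,
# applied to ordinary least squares on simulated data. Settings are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100_000, 5                       # full sample size and dimension
beta_true = np.ones(p)
X = rng.normal(size=(N, p))
y = X @ beta_true + rng.normal(size=N)

def ols(X, y):
    """Closed-form least squares estimator."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# (1) Distributed computing: split the data across K "workers",
#     estimate locally, and combine by one-shot averaging.
K = 10
parts = np.array_split(np.arange(N), K)
beta_dc = np.mean([ols(X[idx], y[idx]) for idx in parts], axis=0)

# (2) Subsampling: estimate on a small uniform subsample of size n << N.
n = 2_000
sub = rng.choice(N, size=n, replace=False)
beta_sub = ols(X[sub], y[sub])

# (3) Minibatch gradient descent: stochastic gradient updates on random
#     minibatches, the workhorse of deep learning optimizers.
beta_sgd = np.zeros(p)
lr, batch_size = 0.01, 64
for epoch in range(5):
    perm = rng.permutation(N)
    for start in range(0, N, batch_size):
        b = perm[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ beta_sgd - y[b]) / len(b)
        beta_sgd -= lr * grad

for name, est in [("one-shot averaging", beta_dc),
                  ("uniform subsampling", beta_sub),
                  ("minibatch SGD", beta_sgd)]:
    print(f"{name:20s} max abs error: {np.max(np.abs(est - beta_true)):.4f}")
```

In this toy setting all three estimators recover the true coefficients closely; the review surveys when and why their statistical efficiency and computational cost differ in realistic massive-data regimes.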


To cite this article: Xuetong Li, Yuan Gao, Hong Chang, Danyang Huang, Yingying Ma, Rui Pan, Haobo Qi, Feifei Wang, Shuyuan Wu, Ke Xu, Jing Zhou, Xuening Zhu, Yingqiu Zhu & Hansheng Wang (23 Apr 2024): A selective review on statistical methods for massive data computation: distributed computing, subsampling, and minibatch techniques, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2024.2343151

To link to this article: https://doi.org/10.1080/24754269.2024.2343151