Review Articles

A selective review on statistical methods for massive data computation: distributed computing, subsampling, and minibatch techniques

Xuetong Li

Guanghua School of Management, Peking University, Beijing, People's Republic of China

Yuan Gao

Guanghua School of Management, Peking University, Beijing, People's Republic of China

yuan_gao@pku.edu.cn; ygao_stat@outlook.com

Hong Chang

Guanghua School of Management, Peking University, Beijing, People's Republic of China

Danyang Huang

Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, People's Republic of China

Yingying Ma

School of Economics and Management, Beihang University, Beijing, People's Republic of China

Rui Pan

School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, People's Republic of China

Haobo Qi

School of Statistics, Beijing Normal University, Beijing, People's Republic of China

Feifei Wang

Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, People's Republic of China

Shuyuan Wu

School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, People's Republic of China

Ke Xu

School of Statistics, University of International Business and Economics, Beijing, People's Republic of China

Jing Zhou

Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, People's Republic of China

Xuening Zhu

School of Data Science and MOE Laboratory for National Development and Intelligent Governance, Fudan University, Shanghai, People's Republic of China

Yingqiu Zhu

School of Statistics, University of International Business and Economics, Beijing, People's Republic of China

Hansheng Wang

Guanghua School of Management, Peking University, Beijing, People's Republic of China

Received 28 January 2024; Accepted 05 April 2024; Published online 23 April 2024.

This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been developed rapidly over the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature concerns distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer; in this case, a distributed computing system with multiple machines has to be utilized. The second class of literature concerns subsampling methods and addresses the situation where the sample size of the dataset is small enough to be stored on a single computer but too large to be easily processed by its memory as a whole. The last class of literature studies minibatch-gradient-related optimization techniques, which have been used extensively for training various deep learning models.
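To make the three categories concrete, the following minimal sketch (ours, not taken from the paper) contrasts them on a simulated linear regression problem: a one-shot averaging estimator computed across data partitions, an estimator fitted on a uniform random subsample, and a plain minibatch stochastic gradient descent loop. All sample sizes, learning rates, and helper names below are illustrative assumptions rather than the methods studied in any specific reference.

```python
# Illustrative sketch of the three computation strategies surveyed in the review,
# applied to ordinary least squares on simulated data. Settings are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100_000, 5                       # full sample size and dimension
beta_true = np.ones(p)
X = rng.normal(size=(N, p))
y = X @ beta_true + rng.normal(size=N)

def ols(X, y):
    """Closed-form least squares estimator."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# (1) Distributed computing: split the data across K "workers",
#     estimate locally, and combine by one-shot averaging.
K = 10
parts = np.array_split(np.arange(N), K)
beta_dc = np.mean([ols(X[idx], y[idx]) for idx in parts], axis=0)

# (2) Subsampling: estimate on a small uniform subsample of size n << N.
n = 2_000
sub = rng.choice(N, size=n, replace=False)
beta_sub = ols(X[sub], y[sub])

# (3) Minibatch gradient descent: stochastic gradient updates on random
#     minibatches, the workhorse of deep learning optimizers.
beta_sgd = np.zeros(p)
lr, batch_size = 0.01, 64
for epoch in range(5):
    perm = rng.permutation(N)
    for start in range(0, N, batch_size):
        b = perm[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ beta_sgd - y[b]) / len(b)
        beta_sgd -= lr * grad

for name, est in [("one-shot averaging", beta_dc),
                  ("uniform subsampling", beta_sub),
                  ("minibatch SGD", beta_sgd)]:
    print(f"{name:20s} max abs error: {np.max(np.abs(est - beta_true)):.4f}")
```

In this toy setting all three estimators recover the true coefficients closely; the review surveys when and why their statistical efficiency and computational cost differ in realistic massive-data regimes.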


To cite this article: Xuetong Li, Yuan Gao, Hong Chang, Danyang Huang, Yingying Ma, Rui Pan, Haobo Qi, Feifei Wang, Shuyuan Wu, Ke Xu, Jing Zhou, Xuening Zhu, Yingqiu Zhu & Hansheng Wang (23 Apr 2024): A selective review on statistical methods for massive data computation: distributed computing, subsampling, and minibatch techniques, Statistical Theory and Related Fields, DOI: 10.1080/24754269.2024.2343151

To link to this article: https://doi.org/10.1080/24754269.2024.2343151