计算机科学

基于Map/Reduce的分布式数据排序算法分析

  • 余晟隽 ,
  • 宫学庆 ,
  • 祝 君 ,
  • 钱卫宁
展开
  • 华东师范大学 数据科学与工程研究院, 上海 200062

收稿日期: 2016-06-27

  网络出版日期: 2016-11-29

基金资助

国家自然科学基金(61332006); 国家 863 计划项目(2015AA015307)

Sorting algorithm analysis of distributed data based on Map/Reduce

  • YU Sheng-jun ,
  • GONG Xue-qing ,
  • ZHU jun ,
  • QIAN Wei-ning
Expand
  • Institute for Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2016-06-27

  Online published: 2016-11-29

摘要

为了解决大规模数据的存储与计算, 近年来分布式系统得到了大量的应用. 如何在分布式系统中对大规模数据集进行排序是影响许多应用性能的基础问题, 其中不仅涉及每个节点上排序算法的选择, 更重要的是设计协调各节点的分布式算法. 本文总结了分布式系统中常用的分布式排序算法, 对每种算法的执行流程、代价模型和适用场景进行了分析, 并通过实验对分析结果进行了验证. 本文的工作可以帮助开发人员选择和优化分布式环境下大规模数据排序的算法.

本文引用格式

余晟隽 , 宫学庆 , 祝 君 , 钱卫宁 . 基于Map/Reduce的分布式数据排序算法分析[J]. 华东师范大学学报(自然科学版), 2016 , 2016(5) : 121 -130 . DOI: 10.3969/j.issn.1000-5641.2016.05.014

Abstract

Distributed system has been widely applied in recent years to tackle the storage and calculation of big data. Sorting of large-scale dataset in the distributed system has become the fundamental problem to affect a varieties of application performances which is not only concerning about the selection of sorting algorithm at each node, but also about the development of distributed algorithms to coordinate at each node. This paper summarizes the common distributed sorting algorithms which are applied in the distributed system. Analysis has been conducted to the implementation process, cost model and applicable field of each algorithm. And the analysis results have been verified by experiments. This work can help developers choose and optimize the big data sorting
algorithm in distributed environments.

参考文献

[ 1 ] KNUTH D E. The Art of Computer Programming: Sorting and Searching [M]. 2nd ed. Indianapolis: Addison-Wesley Professional, 1998.
[ 2 ] BORTHAKUR D. The hadoop distributed file system: Architecture and design [J]. Hadoop Project Website, 2007, 11: 1-10.
[ 3 ] DEAN J, GHEMAWAT S. MapReduce: Simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113.
[ 4 ] CHRIS NYBERG, MEHUL SHAH. Sort Benchmark Home Page [EB/OL]. (2015) [2016-04-20]. http://sortbenchmark.org/.
[ 5 ] BORTHAKUR D, GRAY J, SARMA J S, et al. Apache Hadoop goes realtime at Facebook [C]//Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011: 1071-1080.
[ 6 ] MANE S B, SAWANT Y, KAZI S, et al. Real time sentiment analysis of twitter data using hadoop [J]. International Journal of Computer Science and Information Technolo, 2014, 5(3): 3098-3100.
[ 7 ] O’MALLEY O, MURTHY A C. Winning a 60 second dash with a yellow elephant [J]. Proceedings of Sort Benchmark, 2009, 1810(9): 1-9.
[ 8 ] WANG J, WU Y, CAI H, et al. Fuxi Sort [EB/OL]. (2015) [2016-04-20]. http://sortbenchmark.org/Fux-iSort2015.pdf.
[ 9 ] GRIFFITHS N. Nmon performance: A free tool to analyze AIX and Linux performance [EB/OL]. (2003-11-04)[2016-04-20]. http://www.ibm.com/developerworks/aix/library/au-analyze aix/.

文章导航

/