华东师范大学学报(自然科学版)

• 计算机科学 • 上一篇    下一篇

基于Map/Reduce的分布式数据排序算法分析

余晟隽, 宫学庆, 祝 君, 钱卫宁   

  1. 华东师范大学 数据科学与工程研究院, 上海 200062
  • 收稿日期:2016-06-27 出版日期:2016-09-25 发布日期:2016-11-29
  • 通讯作者: 余晟隽, 男, 硕士研究生, 研究方向为分布式数据库. E-mail: sjyu@obase.com.cn.
  • 基金资助:

    国家自然科学基金(61332006); 国家 863 计划项目(2015AA015307)

Sorting algorithm analysis of distributed data based on Map/Reduce

YU Sheng-jun, GONG Xue-qing, ZHU jun, QIAN Wei-ning   

  1. Institute for Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received:2016-06-27 Online:2016-09-25 Published:2016-11-29

摘要:

为了解决大规模数据的存储与计算, 近年来分布式系统得到了大量的应用. 如何在分布式系统中对大规模数据集进行排序是影响许多应用性能的基础问题, 其中不仅涉及每个节点上排序算法的选择, 更重要的是设计协调各节点的分布式算法. 本文总结了分布式系统中常用的分布式排序算法, 对每种算法的执行流程、代价模型和适用场景进行了分析, 并通过实验对分析结果进行了验证. 本文的工作可以帮助开发人员选择和优化分布式环境下大规模数据排序的算法.

关键词: 分布式系统, 排序算法, 代价模型

Abstract:

Distributed system has been widely applied in recent years to tackle the storage and calculation of big data. Sorting of large-scale dataset in the distributed system has become the fundamental problem to affect a varieties of application performances which is not only concerning about the selection of sorting algorithm at each node, but also about the development of distributed algorithms to coordinate at each node. This paper summarizes the common distributed sorting algorithms which are applied in the distributed system. Analysis has been conducted to the implementation process, cost model and applicable field of each algorithm. And the analysis results have been verified by experiments. This work can help developers choose and optimize the big data sorting
algorithm in distributed environments.

Key words: distributed system, sorting algorithm, cost model