Journal of East China Normal University (Natural Science) ›› 2018, Vol. 2018 ›› Issue (5): 56-66. doi: 10.3969/j.issn.1000-5641.2018.05.005

• High-Performance Database Management •

Implementation of the parallel GroupBy and Aggregation functions in a distributed database system

XU Shi-lei, WEI Xing, JIANG Hong, QIAN Wei-ning, ZHOU Ao-ying

  1. School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2018-07-04 Online: 2018-09-25 Published: 2018-09-26
  • Corresponding author: JIANG Hong, female, associate professor; research interests: modern information systems and their development technologies. E-mail: hjiang@cc.ecnu.edu.cn
  • About the author: XU Shi-lei, male, master's student; research interests: data storage and data mining. E-mail: xsl118857@sina.com
  • Supported by:
    Shanghai Sailing Program (17YF1427800)

Abstract: With the growing demand for data statistics and analysis in new Internet applications, grouping and aggregation have become among the most frequent requests in data analysis applications. This paper analyzes the principles of the Aggregation and GroupBy operations common in OLAP-like (on-line analytical processing) applications. To address the drawbacks of the sort-based grouping used by general transactional databases, two concrete Hash GroupBy implementations are proposed, together with a strategy that uses statistical information to dynamically choose the number of Hash buckets and the Hash GroupBy scheme. Exploiting the multiple replicas kept by a distributed database, a node-level parallel Hash GroupBy scheme based on operator push-down is also proposed. Finally, the schemes are implemented in the open-source database OceanBase. Experiments show that choosing the Hash GroupBy scheme dynamically from statistical information is considerably more efficient than sort-based grouping.

Key words: OceanBase, GroupBy, Hash, data distribution
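
The contrast the abstract draws between sort-based and Hash-based grouping, and the idea of aggregating on each node before combining results, can be illustrated with a minimal Python sketch. This is not the paper's OceanBase implementation: the function names, the `(key, value)` row layout, the choice of SUM as the aggregate, and the two-node split are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_group_sum(rows):
    """Sort-based grouping: sort by key, then sum runs of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))  # O(n log n) sort dominates
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_group_sum(rows):
    """Hash-based grouping: single pass, one running sum per distinct key."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def merge_partials(partials):
    """Combine per-node partial sums (valid because SUM is decomposable)."""
    final = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            final[key] += value
    return dict(final)


# Hypothetical data: rows as (group key, value), split across two nodes.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
node_1, node_2 = rows[:3], rows[3:]

local_results = [hash_group_sum(node_1), hash_group_sum(node_2)]
print(merge_partials(local_results))
```

Both grouping functions return the same totals; the Hash variant avoids the sort, and the merge step shows why a decomposable aggregate can be computed locally on each node first.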
