Implementation of parallel grouping and aggregation in distributed database systems

XU Shi-lei; WEI Xing; JIANG Hong; QIAN Wei-ning; ZHOU Ao-ying

doi:10.3969/j.issn.1000-5641.2018.05.005

Journal of East China Normal University (Natural Science), 2018, Vol. 2018, Issue 5: 56-66

High-Performance Database Management

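School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China

XU Shi-lei, male, master's student; research interests: data storage and data mining. E-mail: xsl118857@sina.com.

Received: 2018-07-04

Published online: 2018-09-26

Funding

Shanghai Sailing Program for Young Scientific and Technological Talents (17YF1427800)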

Implementation of the parallel GroupBy and Aggregation functions in a distributed database system

XU Shi-lei, WEI Xing, JIANG Hong, QIAN Wei-ning, ZHOU Ao-ying

School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China

Received date: 2018-07-04

Online published: 2018-09-26

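Summary

As demand for data statistics and analysis grows in new Internet applications, grouping and aggregation have become among the most frequent requests in data analysis workloads. This paper analyzes the principles of Aggregation and GroupBy as they commonly appear in OLAP-like (on-line analytical processing) applications. To address the shortcomings of the sort-based grouping used by typical transactional databases, it proposes two concrete implementations of Hash grouping and aggregation, together with a strategy that uses statistical information to dynamically decide the number of Hash buckets and which Hash grouping and aggregation scheme to use. Exploiting the multi-replica nature of distributed databases, it further proposes a node-level parallel scheme for Hash grouping and aggregation. Finally, the approach is implemented in the open-source database OceanBase. Experiments show that the proposed strategy of using statistical information to dynamically choose the Hash grouping and aggregation scheme is substantially more efficient than sort-based grouping.

Keywords: OceanBase; GroupBy; Hash; data distribution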
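The contrast between sort-based and Hash-based grouping, and the idea of sizing the Hash table from statistics, can be illustrated with a minimal Python sketch. This is not the OceanBase implementation: the function names, the `load_factor` parameter, and the power-of-two sizing rule are all assumptions chosen for illustration, and the paper's actual decision strategy is not given in this record.

```python
from itertools import groupby
from operator import itemgetter


def sort_group_sum(rows):
    """Sort-based grouping: sort by key, then sum each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))
    return {
        key: sum(value for _, value in run)
        for key, run in groupby(ordered, key=itemgetter(0))
    }


def choose_bucket_count(estimated_groups, load_factor=0.75, max_buckets=1 << 20):
    """Hypothetical sizing rule: smallest power of two that holds the
    estimated number of groups at the given load factor."""
    needed = max(1, int(estimated_groups / load_factor))
    buckets = 1
    while buckets < needed and buckets < max_buckets:
        buckets <<= 1
    return buckets


def hash_group_sum(rows, estimated_groups):
    """Hash-based grouping: a single pass with no sort; each bucket maps
    a group key to its running sum."""
    n = choose_bucket_count(estimated_groups)
    buckets = [dict() for _ in range(n)]
    for key, value in rows:
        slot = buckets[hash(key) % n]
        slot[key] = slot.get(key, 0) + value
    result = {}
    for slot in buckets:
        result.update(slot)
    return result


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
by_sort = sort_group_sum(rows)
by_hash = hash_group_sum(rows, estimated_groups=3)
```

Both functions return the same groups; the Hash variant avoids the sort, and an estimate of the number of distinct groups (here passed in directly as `estimated_groups`) stands in for the statistical information the paper refers to.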

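Cite this article

XU Shi-lei, WEI Xing, JIANG Hong, QIAN Wei-ning, ZHOU Ao-ying. Implementation of the parallel GroupBy and Aggregation functions in a distributed database system[J]. Journal of East China Normal University (Natural Science), 2018, 2018(5): 56-66. DOI: 10.3969/j.issn.1000-5641.2018.05.005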

Abstract

As demand for data statistics and analysis grows in new Internet applications, grouping and aggregation have become among the most common operations in data analysis applications. This paper analyzes the operating principles of the Aggregation and GroupBy functions commonly used in analytical applications. To address the disadvantages of sort-based grouping in general transactional databases, two Hash GroupBy implementations are proposed, along with a strategy that uses statistical information to dynamically determine the number of Hash buckets and the Hash GroupBy scheme. Based on the characteristics of distributed clusters, an implementation that pushes the Hash GroupBy operator down to the nodes is also proposed. Experiments show that using statistical information to dynamically choose the Hash GroupBy scheme improves efficiency over sort-based grouping.

Keywords: OceanBase; GroupBy; Hash; data distribution
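Pushing the GroupBy operator down to the nodes can be pictured as a two-phase aggregation: each node aggregates its own rows, and a coordinator merges the partial results. The Python sketch below illustrates that general pattern only. It assumes each node holds a disjoint slice of the rows, uses threads to stand in for nodes, and all names (`partial_aggregate`, `merge_partials`, `parallel_group_avg`) are hypothetical rather than taken from the paper or from OceanBase.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(rows):
    """Runs on one node: per-key (sum, count) over that node's rows only."""
    partial = {}
    for key, value in rows:
        s, c = partial.get(key, (0, 0))
        partial[key] = (s + value, c + 1)
    return partial


def merge_partials(partials):
    """Runs on the coordinator: combine per-node (sum, count) pairs by key."""
    merged = {}
    for partial in partials:
        for key, (s, c) in partial.items():
            ms, mc = merged.get(key, (0, 0))
            merged[key] = (ms + s, mc + c)
    return merged


def parallel_group_avg(node_rows):
    """node_rows holds one list of (key, value) rows per node."""
    with ThreadPoolExecutor(max_workers=max(1, len(node_rows))) as pool:
        partials = list(pool.map(partial_aggregate, node_rows))
    return {key: s / c for key, (s, c) in merge_partials(partials).items()}


node_rows = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 6)],
]
result = parallel_group_avg(node_rows)
```

Carrying `(sum, count)` pairs instead of per-node averages is what keeps the merge correct: averages of averages would be wrong whenever nodes hold different numbers of rows for a key.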
