With the growing demand for statistics and analysis in new Internet applications, data grouping and aggregation have become among the most common operations in data analysis. This paper analyzes the operating principles of the Aggregation and GroupBy functions commonly used in analytical applications. To address the disadvantages of sort-based grouping in general transactional databases, two Hash GroupBy implementations are proposed, together with a strategy that uses statistical information to dynamically determine the number of hash buckets and to select the Hash GroupBy scheme. Based on the characteristics of distributed clusters, an implementation of Hash GroupBy operator push-down is also proposed. Experiments show that using statistical information to dynamically choose the Hash GroupBy scheme improves efficiency.
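As a rough illustration of the ideas summarized above, the following Python sketch shows hash-based grouping with the aggregation pushed down to each data node and the partial results merged at a coordinator. It is not the paper's implementation: the function names, the (sum, count) aggregate state, the load factor, and the power-of-two bucket heuristic are all assumptions made for this example.

```python
from collections import defaultdict


def choose_bucket_count(estimated_groups, load_factor=0.75):
    """Hypothetical heuristic: smallest power of two that keeps the table
    below the given load factor for the estimated number of groups.
    (Python's dict sizes itself, so this value is illustrative only.)"""
    buckets = 1
    while buckets * load_factor < max(estimated_groups, 1):
        buckets *= 2
    return buckets


def partial_hash_groupby(rows):
    """Pushed-down step, run on each data node: fold local (key, value)
    rows into a per-key [sum, count] state using a hash table."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in rows:
        state = partial[key]
        state[0] += value
        state[1] += 1
    return dict(partial)


def merge_partials(partials):
    """Coordinator step: combine the per-node states and finalize."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {
        key: {"sum": total, "count": count, "avg": total / count}
        for key, (total, count) in merged.items()
    }


# Two hypothetical data nodes, each holding part of the table.
node_a = [("east", 10), ("west", 5), ("east", 20)]
node_b = [("west", 15), ("north", 7)]

result = merge_partials(
    [partial_hash_groupby(node_a), partial_hash_groupby(node_b)]
)
```

Because each node sends at most one state per group rather than every row, push-down reduces the data transferred to the coordinator when the number of groups is small relative to the number of rows. The average is carried as a (sum, count) pair because per-node averages cannot be merged directly.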
XU Shi-lei, WEI Xing, JIANG Hong, QIAN Wei-ning, ZHOU Ao-ying. Implementation of the parallel GroupBy and Aggregation functions in a distributed database system[J]. Journal of East China Normal University(Natural Science), 2018, 2018(5): 56-66.
DOI: 10.3969/j.issn.1000-5641.2018.05.005