数据管理

基于布谷鸟过滤器的外连接算法

  • 于洋 ,
  • 周敏奇 ,
  • 方祝和
展开
  • 华东师范大学 数据科学与工程学院, 上海 200062
于洋,男,硕士研究生,研究方向为内存数据库系统.E-mail:youngfish93@hotmail.com

收稿日期: 2017-06-19

  网络出版日期: 2017-09-25

基金资助

国家自然科学基金重点项目(61332006)

An outer join algorithm based on Cuckoo filter

  • YU Yang ,
  • ZHOU Min-qi ,
  • FANG Zhu-he
Expand
  • School of Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2017-06-19

  Online published: 2017-09-25

摘要

近十几年,由于互联网的发展异常迅猛,数据规模不断增加,分布式数据库的分析效率亟待优化,其中连接操作更是分布式数据库的主要性能瓶颈.外连接在商业中运用非常广泛.分布式外连接算法涉及到大量的网络传输,严重影响系统性能,虽然有一些研究针对内连接进行了优化,但这些优化方法并不能直接应用于外连接.文章中基于Cuckoofilter(布谷鸟过滤器)的分布式外连接算法,通过构建Cuckoofilter对数据进行筛选和分配,减少数据传输量的同时,提高执行的并行度,使得查询性能得到提升.通过在Ginkgo上实现该算法,并加以充分实验,验证得出该算法提高了分布式外连接操作的效率.

关键词: Cuckoo filter; Ginkgo; 外连接

本文引用格式

于洋 , 周敏奇 , 方祝和 . 基于布谷鸟过滤器的外连接算法[J]. 华东师范大学学报(自然科学版), 2017 , 2017(5) : 40 -51 . DOI: 10.3969/j.issn.1000-5641.2017.05.005

Abstract

In recent years, due to the development of the Internet, data size has been increased rapidly. At the era of big data, the analysis efficiency of distributed database system needs to be optimized urgently. Nevertheless, the join operation is the main performance bottleneck of a distributed database system. Join operations are mainly divided into inner join and outer join, and outer join is widely used in the business situations. Distributed join algorithm involves a large amount of network transmission, which affects the performance of the system severely. Although there are some studies in the literature of inner join optimized, these optimization methods cannot be directly applied to outer join. This paper proposes a distributed outer join algorithm based on Cuckoo filter. By building a Cuckoo filter with replication subdivision technology for data filtering and allocation, it reduces the amount of data transmission and improves the degree of parallelism accordingly. Finally, it improves the query performance. We implement this algorithm in Ginkgo. Based on the given extensive experimental verification, the algorithm largely improves the efficiency of the outer join.

参考文献

[1] STONEBRAKER M. The case for shared nothing[J]. Database Engineering, 1986, 9:4-9.
[2] 王立. 分布式内存数据库系统的查询处理与优化[D]. 上海:华东师范大学, 2015.
[3] 邹先霞, 贾维嘉, 潘久辉. 实化外连接视图的增量计算[J]. 系统工程与电子技术, 2011, 33(4):938-942.
[4] 张磊, 方祝和, 周敏奇, 等. 面向内存计算的连接算法[J]. 华东师范大学学报(自然科学版), 2014(5):180-191.
[5] BERNSTEIN P A, CHIU D-M W. Using semi-joins to solve relational queries[J] Journal of the Association for Computing Machinery, 1981, 28(1):25-40.
[6] 樊秋实, 周敏奇, 周傲英. 基线与增量数据分离架构下的分布式连接算法[J]. 计算机学报, 2016, 39(10):2102-2113
[7] 茅潇潇, 段惠超, 高明. OceanBase中基于布隆过滤器的连接算法[J]. 华东师范大学学报(自然科学版), 2016(5):67-74.
[8] BLASGEN M W, ESWARAN K P. Storage and access in relational data bases[J]. IBM Systems Journal, 1977, 16(4):363-377.
[9] 赵宇兰. 分布式数据库查询优化研究[M]. 成都:电子科技大学出版社, 2016.
[10] LI J T, XIA X L. Research on outer join optimization in parallel DBMS[C]//Proceedings of the 2011 International Conference on Computer Science and Information Technology (ICCSIT 2011). 2011:502-507.
[11] XU Y, KOSTAMAA P. A new algorithm for small-large table outer joins in parallel DBMS[C]//Proceedings of the IEEE 29th International Conference on Data Engineering (ICDE) (2010). IEEE, 2010:1018-1024.
[12] CHENG L, KOTOULAS S, WARD T E, et al. Robust and efficient large-large table outer joins on distributed infrastructures[C]//European Conference on Parallel Processing. Berlin:Springer International Publishing, 2014:258-269.
[13] XU Y, KOSTAMAA P, ZHOU X, et al. Handling data skew in parallel joins in shared-nothing systems[C]//Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008:1043-1052.
[14] CHENG L, KOTOULAS S, WARD T E, et al. Efficiently handling skew in outer joins on distributed systems[C]//Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 2014:295-304.
[15] FAN B, ANDERSEN D G, KAMINSKY M, et al. Cuckoo Filter:Practically better than bloom[C]//Proceedings of the ACM International on Conference on Emerging Networking Experiments & Technologies. ACM, 2014:75-88.
[16] PAGH R, RODLER F F. Cuckoo Hashing[J]. Journal of Algorithms, 2004, 51(2):122-144.
文章导航

/