基于分布式系统OceanBase的并行连接

徐石磊; 王雷; 胡卉芪; 钱卫宁; 周傲英

doi:10.3969/j.issn.1000-5641.2017.05.001

华东师范大学学报（自然科学版） >

2017 , Vol. 2017 >Issue 5: 1 - 10

DOI: https://doi.org/10.3969/j.issn.1000-5641.2017.05.001

数据管理

基于分布式系统OceanBase的并行连接

徐石磊 ,
王雷 ,
胡卉芪 ,
钱卫宁 ,
周傲英

展开

华东师范大学计算机科学与软件工程学院, 上海 200062

徐石磊,男,硕士研究生,研究方向为数据存储与数据挖掘.E-mail:xsl118857@sina.com

收稿日期: 2017-06-19

网络出版日期: 2017-09-25

基金资助

2017年上海市青年科技英才扬帆计划（17YF1427800）

收起

Parallel join based on distributed system OceanBase

XU Shi-lei ,
WANG Lei ,
HU Hui-qi ,
QIAN Wei-ning ,
ZHOU Ao-ying

Expand

School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China

Received date: 2017-06-19

Online published: 2017-09-25

Fold

摘要

随着应用数据的飞速增长以及分布式数据库系统的不断涌现，数据存储在物理独立的节点已经成为一种趋势.在这种情况下，当应用需要进行复杂join查询时，就会不可避免地产生非常多的网络传输代价.所以，如何提高分布式系统中join查询的效率成为研究热点.本文在分析分布式数据库系统OceanBase执行nested loop join、Hashjoin、semi-join等算法的基础上，提出了合理利用硬件资源采用多线程并行执行join操作的优化思想，并在OceanBase数据库中分别对nested loop join、Hashjoin、semi-join等算法进行了并行改造.实验结果表明，在一定线程数内join算法执行效率与并行度呈正相关.

关键词： 查询; semi-join; OceanBase; 并行连接

本文引用格式

徐石磊 , 王雷 , 胡卉芪 , 钱卫宁 , 周傲英 . 基于分布式系统OceanBase的并行连接[J]. 华东师范大学学报（自然科学版）, 2017 , 2017(5) : 1 -10 . DOI: 10.3969/j.issn.1000-5641.2017.05.001

Abstract

With the rapid growth of application data and the continued development of distributed database systems, data storage in physical independent nodes has become a trend. In this trend, when the application needs to perform complex join queries, it inevitably generates a lot of network traffic. Therefore, improving the efficiency of join query in distributed system is a hot topic. Based on the analysis of the nested loop join, Hash join, semi-join in the OceanBase, this paper puts forward the optimization idea of using hardware resources reasonably and using multithread to execute join operations in parallel. We implement experiment on OceanBase with nested loop join algorithm, Hash join algorithm, semi-join algorithm respectively. The experimental results confirm that the efficiency of join algorithm is positively related to parallelism in a certain number of threads.

Key words： query; semi-join; OceanBase; parallel join

参考文献

[1] 杨传辉. 大规模分布式存储系统[M]. 北京:机械工业出版社, 2013.
[2] BERNSTEIN P A, GOODMAN N, WONG E, et al. Query processing in a system for distributed databases (SDD-1)[J]. ACM Transactions on Database Systems, 1981, 6(4):602-625.
[3] ZHANG X F, CHEN L, WANG M. Efficient multi-way theta-join processing using MapReduce[J]. Proceedings of the VLDB Endowment, 2012, 11(5):1184-1195.
[4] BLASGEN M W, ESWARAN K P. Storage and access in relational databases[J]. IBM Systems Journal, 1977, 16(4):363-377.
[5] ZHOU J R, ROSS K A. Implementing database operations using SIMD instructions[C]//Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 2002:145-156.
[6] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark:Cluster computing with working sets[C/OL]//Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. (2010-06-25)[2017-04-01]. https://www.usenix.org/legacy/events/hotcloud10/tech/fullpapers/Zaharia.pdf?CFID=973306186&CFTOKEN=67460167.
[7] MERRETT T H. Why sort-merge gives the best implementation of the natural join[J]. ACM Sigmod Record, 1983, 13(2):39-51.
[8] KIM C, PARK J, SATISH N, et al. CloudRAMSort:Fast and efficient large-scale distributed RAM sort on shared-nothing cluster[C]//ACM SIGMOD International Conference on Management of Data. ACM, 2012:841-850.
[9] BABB E. Implementing a relational database by means of specialzed hardware[J]. ACM Transactions on Database Systems, 1979, 4(1):1-29.
[10] BONCZ P A, ZUKOWSKI M, NES N. MonetDB/X100:Hyper-pipelining query execution[C/OL]//Proceedings of the 2005 CIDR Conference on Innovative Data Systems Research. 2005:225-237[2017-04-01]. https://www.researchgate.net/publication/45338800 MonetDBX 100 Hyper-Pipelining Query Execution.

Options

文章导航

模态框（Modal）标题

摘要

本文引用格式

Abstract

参考文献