With the rapid growth of application data and the continued development of distributed database systems, data storage in physical independent nodes has become a trend. In this trend, when the application needs to perform complex join queries, it inevitably generates a lot of network traffic. Therefore, improving the efficiency of join query in distributed system is a hot topic. Based on the analysis of the nested loop join, Hash join, semi-join in the OceanBase, this paper puts forward the optimization idea of using hardware resources reasonably and using multithread to execute join operations in parallel. We implement experiment on OceanBase with nested loop join algorithm, Hash join algorithm, semi-join algorithm respectively. The experimental results confirm that the efficiency of join algorithm is positively related to parallelism in a certain number of threads.
XU Shi-lei
,
WANG Lei
,
HU Hui-qi
,
QIAN Wei-ning
,
ZHOU Ao-ying
. Parallel join based on distributed system OceanBase[J]. Journal of East China Normal University(Natural Science), 2017
, 2017(5)
: 1
-10
.
DOI: 10.3969/j.issn.1000-5641.2017.05.001
[1] 杨传辉. 大规模分布式存储系统[M]. 北京:机械工业出版社, 2013.
[2] BERNSTEIN P A, GOODMAN N, WONG E, et al. Query processing in a system for distributed databases (SDD-1)[J]. ACM Transactions on Database Systems, 1981, 6(4):602-625.
[3] ZHANG X F, CHEN L, WANG M. Efficient multi-way theta-join processing using MapReduce[J]. Proceedings of the VLDB Endowment, 2012, 11(5):1184-1195.
[4] BLASGEN M W, ESWARAN K P. Storage and access in relational databases[J]. IBM Systems Journal, 1977, 16(4):363-377.
[5] ZHOU J R, ROSS K A. Implementing database operations using SIMD instructions[C]//Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 2002:145-156.
[6] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark:Cluster computing with working sets[C/OL]//Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. (2010-06-25)[2017-04-01]. https://www.usenix.org/legacy/events/hotcloud10/tech/fullpapers/Zaharia.pdf?CFID=973306186&CFTOKEN=67460167.
[7] MERRETT T H. Why sort-merge gives the best implementation of the natural join[J]. ACM Sigmod Record, 1983, 13(2):39-51.
[8] KIM C, PARK J, SATISH N, et al. CloudRAMSort:Fast and efficient large-scale distributed RAM sort on shared-nothing cluster[C]//ACM SIGMOD International Conference on Management of Data. ACM, 2012:841-850.
[9] BABB E. Implementing a relational database by means of specialzed hardware[J]. ACM Transactions on Database Systems, 1979, 4(1):1-29.
[10] BONCZ P A, ZUKOWSKI M, NES N. MonetDB/X100:Hyper-pipelining query execution[C/OL]//Proceedings of the 2005 CIDR Conference on Innovative Data Systems Research. 2005:225-237[2017-04-01]. https://www.researchgate.net/publication/45338800 MonetDBX 100 Hyper-Pipelining Query Execution.