随着应用数据的飞速增长以及分布式数据库系统的不断涌现,数据存储在物理独立的节点已经成为一种趋势.在这种情况下,当应用需要进行复杂join查询时,就会不可避免地产生非常多的网络传输代价.所以,如何提高分布式系统中join查询的效率成为研究热点.本文在分析分布式数据库系统OceanBase执行nested loop join、Hashjoin、semi-join等算法的基础上,提出了合理利用硬件资源采用多线程并行执行join操作的优化思想,并在OceanBase数据库中分别对nested loop join、Hashjoin、semi-join等算法进行了并行改造.实验结果表明,在一定线程数内join算法执行效率与并行度呈正相关.
With the rapid growth of application data and the continued development of distributed database systems, data storage in physical independent nodes has become a trend. In this trend, when the application needs to perform complex join queries, it inevitably generates a lot of network traffic. Therefore, improving the efficiency of join query in distributed system is a hot topic. Based on the analysis of the nested loop join, Hash join, semi-join in the OceanBase, this paper puts forward the optimization idea of using hardware resources reasonably and using multithread to execute join operations in parallel. We implement experiment on OceanBase with nested loop join algorithm, Hash join algorithm, semi-join algorithm respectively. The experimental results confirm that the efficiency of join algorithm is positively related to parallelism in a certain number of threads.
[1] 杨传辉. 大规模分布式存储系统[M]. 北京:机械工业出版社, 2013.
[2] BERNSTEIN P A, GOODMAN N, WONG E, et al. Query processing in a system for distributed databases (SDD-1)[J]. ACM Transactions on Database Systems, 1981, 6(4):602-625.
[3] ZHANG X F, CHEN L, WANG M. Efficient multi-way theta-join processing using MapReduce[J]. Proceedings of the VLDB Endowment, 2012, 11(5):1184-1195.
[4] BLASGEN M W, ESWARAN K P. Storage and access in relational databases[J]. IBM Systems Journal, 1977, 16(4):363-377.
[5] ZHOU J R, ROSS K A. Implementing database operations using SIMD instructions[C]//Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 2002:145-156.
[6] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark:Cluster computing with working sets[C/OL]//Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. (2010-06-25)[2017-04-01]. https://www.usenix.org/legacy/events/hotcloud10/tech/fullpapers/Zaharia.pdf?CFID=973306186&CFTOKEN=67460167.
[7] MERRETT T H. Why sort-merge gives the best implementation of the natural join[J]. ACM Sigmod Record, 1983, 13(2):39-51.
[8] KIM C, PARK J, SATISH N, et al. CloudRAMSort:Fast and efficient large-scale distributed RAM sort on shared-nothing cluster[C]//ACM SIGMOD International Conference on Management of Data. ACM, 2012:841-850.
[9] BABB E. Implementing a relational database by means of specialzed hardware[J]. ACM Transactions on Database Systems, 1979, 4(1):1-29.
[10] BONCZ P A, ZUKOWSKI M, NES N. MonetDB/X100:Hyper-pipelining query execution[C/OL]//Proceedings of the 2005 CIDR Conference on Innovative Data Systems Research. 2005:225-237[2017-04-01]. https://www.researchgate.net/publication/45338800 MonetDBX 100 Hyper-Pipelining Query Execution.