Equi-join optimization on Spark


Journal of East China Normal University(Natural Sc, 2014, Vol. 2014, Issue (5): 261-270. doi: 10.3969/j.issn.10005641.2014.05.023



BIAN Hao-Qiong^1,2, CHEN Yue-Guo^1,2, DU Xiao-Yong^1,2, GAO Yan-Jie^1,2

1. Key Laboratory of Data Engineering and Knowledge Engineering(Renmin University of China), MOE, Beijing 100872, China; 
2. School of Information, Renmin University of China, Beijing 100872, China

Online: 2014-09-25    Published: 2014-11-27

Abstract

Equi-join is one of the most common and costly operations in data analysis. Implementations of equi-join on Spark differ from those in parallel databases: equi-join algorithms based on data pre-partitioning, which are commonly used in parallel databases, can hardly be implemented on Spark. The equi-join algorithms currently in common use on Spark, such as broadcast join and repartition join, are not efficient. Improving equi-join performance on Spark is therefore key to big data analysis on Spark. This work combines the advantages of Semi-Join and Partition Join and provides an optimized equi-join algorithm based on the features of Spark. Cost analysis and experiments indicate that this algorithm is 12 times faster than the algorithms used in state-of-the-art data analysis systems on Spark.
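
For readers unfamiliar with the two baseline strategies named in the abstract, the following is a minimal, single-process Python sketch of the general ideas behind a broadcast join, a repartition join, and a semi-join style key filter. It is an illustration under stated assumptions only: it does not use Spark, it is not the algorithm proposed in this paper, and all function names (`broadcast_join`, `repartition_join`, `semi_join_filter`) are hypothetical. Tables are modeled as lists of `(key, value)` pairs.

```python
from collections import defaultdict


def broadcast_join(small, large_partitions):
    """Broadcast join sketch: every partition of the large table
    receives a full copy of the small table and joins locally."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    out = []
    for partition in large_partitions:  # each partition is joined on its own
        for key, value in partition:
            for small_value in lookup.get(key, []):
                out.append((key, small_value, value))
    return out


def repartition_join(left, right, num_partitions=4):
    """Repartition join sketch: both inputs are hashed on the join key
    (the "shuffle"), then matching buckets are joined pairwise."""
    left_buckets = [defaultdict(list) for _ in range(num_partitions)]
    right_buckets = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_buckets[hash(key) % num_partitions][key].append(value)
    for key, value in right:
        right_buckets[hash(key) % num_partitions].append((key, value))
    out = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        for key, right_value in right_bucket:
            for left_value in left_bucket.get(key, []):
                out.append((key, left_value, right_value))
    return out


def semi_join_filter(rows, other):
    """Semi-join style filter: keep only rows whose key also occurs in
    `other`, so fewer rows need to be moved before the actual join."""
    keys = {key for key, _ in other}
    return [(key, value) for key, value in rows if key in keys]
```

In this toy model, the broadcast variant copies the whole small table to every partition, while the repartition variant moves every row of both tables once. Filtering one input by the other's keys first reduces the rows that have to be moved, which is the general motivation for semi-join based approaches.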

Key words: Spark, SQL, big data analysis, equi-join, in-memory computation

CLC Number: TP392

BIAN Hao-Qiong, CHEN Yue-Guo, DU Xiao-Yong, GAO Yan-Jie. Equi-join optimization on Spark[J]. Journal of East China Normal University(Natural Sc, 2014, 2014(5): 261-270.

