Equi-join optimization on Spark

BIAN  Hao-Qiong; CHEN  Yue-Guo; DU  Xiao-Yong; GAO  Yan-Jie

doi:10.3969/j.issn.10005641.2014.05.023

Journal of East China Normal University (Natural Science), 2014, Vol. 2014, Issue 5: 261-270

Article

1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), MOE, Beijing 100872, China;
2. School of Information, Renmin University of China, Beijing 100872, China

Online published: 2014-11-27

Abstract

Equi-join is one of the most common and costly operations in data analysis. Equi-join implementations on Spark differ from those in parallel databases: the equi-join algorithms based on data pre-partitioning that parallel databases commonly use can hardly be implemented on Spark. The equi-join algorithms commonly used on Spark today, such as broadcast join and repartition join, are not efficient. Improving equi-join performance has therefore become a key problem for big data analysis on Spark. This work combines the advantages of SimiJoin and Partition Join and provides an optimized equi-join algorithm based on the features of Spark. Cost analysis and experiments indicate that this algorithm is 12 times faster than the algorithms used in state-of-the-art data analysis systems on Spark.

Key words: Spark; SQL; big data analysis; equi-join; in-memory computation
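
For readers unfamiliar with the two baseline strategies named in the abstract, the following is a minimal sketch of broadcast join and repartition join, simulated in plain Python with lists standing in for distributed partitions. It is not the optimized algorithm proposed in the article, and it does not use the Spark API; the function names `broadcast_join` and `repartition_join` and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def broadcast_join(large_partitions, small_table):
    """Join without shuffling the large table: every partition of the
    large table probes a full in-memory copy of the small table."""
    # Build a hash table on the small side (this is what would be broadcast).
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    result = []
    for partition in large_partitions:  # each partition stands in for one worker
        for key, value in partition:
            for other in lookup.get(key, []):
                result.append((key, value, other))
    return result


def repartition_join(left, right, num_partitions):
    """Shuffle both inputs by the hash of the join key, then join each
    pair of co-located partitions locally."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Rows with equal keys landed in the same partition index,
        # so a local hash join per partition pair is sufficient.
        result.extend(broadcast_join([left_part], right_part))
    return result


# Hypothetical sample data: (join_key, payload) pairs.
orders = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
users = [(1, "x"), (2, "y"), (4, "z")]

via_broadcast = sorted(broadcast_join([orders[:2], orders[2:]], users))
via_repartition = sorted(repartition_join(orders, users, 3))
```

Both functions return the same joined rows. The difference lies in data movement: the broadcast variant copies the whole small table to every worker, while the repartition variant moves every row of both tables once during the shuffle.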

Cite this article

BIAN Hao-Qiong, CHEN Yue-Guo, DU Xiao-Yong, GAO Yan-Jie. Equi-join optimization on Spark[J]. Journal of East China Normal University (Natural Science), 2014, 2014(5): 261-270. DOI: 10.3969/j.issn.10005641.2014.05.023
