Journal of East China Normal University(Natural Sc ›› 2014, Vol. 2014 ›› Issue (5): 261-270. doi: 10.3969/j.issn.10005641.2014.05.023

• Article

Equi-join optimization on Spark

BIAN Hao-Qiong 1,2, CHEN Yue-Guo 1,2, DU Xiao-Yong 1,2, GAO Yan-Jie 1,2

  1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), MOE, Beijing 100872, China;
  2. School of Information, Renmin University of China, Beijing 100872, China
  • Online: 2014-09-25 Published: 2014-11-27

Abstract: Equi-join is one of the most common and costly operations in data analysis. Implementations of equi-join on Spark differ from those in parallel databases. Equi-join algorithms based on data pre-partitioning, which are commonly used in parallel databases, are difficult to implement on Spark. The equi-join algorithms commonly used on Spark today, such as broadcast join and repartition join, are not efficient. Improving equi-join performance is therefore key to big data analysis on Spark. This work combines the advantages of SimiJoin and Partition Join and provides an optimized equi-join algorithm based on the features of Spark. Cost analysis and experiments indicate that this algorithm is 12 times faster than the algorithms used in state-of-the-art data analysis systems on Spark.

Key words: Spark, SQL, big data analysis, equi-join, in-memory computation
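
The two baseline strategies named in the abstract, broadcast join and repartition join, can be illustrated with a minimal sketch in plain Python. This is not the optimized algorithm proposed in the paper and it does not use the Spark API; the function names, the list-of-tuples data layout, and the `num_partitions` parameter are assumptions made only for illustration.

```python
from collections import defaultdict


def broadcast_join(large_partitions, small_table):
    """Model of a broadcast join (hypothetical helper, not a Spark call).

    The small relation is copied to every task as a hash table, and each
    partition of the large relation probes it locally, so the large
    relation is never shuffled.
    """
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    result = []
    for partition in large_partitions:  # each partition stands for one task
        for key, left_value in partition:
            for right_value in lookup.get(key, []):
                result.append((key, left_value, right_value))
    return result


def repartition_join(left, right, num_partitions=4):
    """Model of a repartition join (hypothetical helper, not a Spark call).

    Both relations are redistributed by the hash of the join key, so rows
    with equal keys land in the same bucket; each bucket pair is then
    joined locally.
    """
    left_buckets = [[] for _ in range(num_partitions)]
    right_buckets = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_buckets[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_buckets[hash(key) % num_partitions].append((key, value))

    result = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        # Local join of one bucket pair, reusing the hash-probe logic above.
        result.extend(broadcast_join([left_bucket], right_bucket))
    return result


# Toy data: (join key, payload) pairs.
orders = [(1, "o1"), (2, "o2"), (1, "o3"), (3, "o4")]
users = [(1, "alice"), (2, "bob")]

broadcast_rows = sorted(broadcast_join([orders[:2], orders[2:]], users))
repartition_rows = sorted(repartition_join(orders, users))
```

Both functions return the same rows; they differ in what has to move. The broadcast variant replicates the whole small relation to every task, while the repartition variant moves every row of both relations once, which is the kind of cost trade-off the abstract refers to.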
