Article

Equi-join optimization on Spark

  • BIAN Hao-Qiong
  • CHEN Yue-Guo
  • DU Xiao-Yong
  • GAO Yan-Jie
  • 1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), MOE, Beijing 100872, China;
    2. School of Information, Renmin University of China, Beijing 100872, China

Online published: 2014-11-27

Abstract

Equi-join is one of the most common and costly operations in data analysis. Equi-join implementations on Spark differ from those in parallel databases. Equi-join algorithms based on data pre-partitioning, which are commonly used in parallel databases, can hardly be implemented on Spark. The equi-join algorithms currently in common use on Spark, such as broadcast join and repartition join, are not efficient. Improving equi-join performance on Spark has therefore become key to big data analysis on Spark. This work combines the advantages of SimiJoin and Partition Join and provides an optimized equi-join algorithm based on the features of Spark. Cost analysis and experiments indicate that this algorithm is 12 times faster than the algorithms used in state-of-the-art data analysis systems on Spark.
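
For readers unfamiliar with the two baseline strategies named in the abstract, the following is a minimal sketch in plain Python of the general idea behind a broadcast join and a repartition join. It does not use the Spark API and is not the optimized algorithm proposed in the article. The function names `broadcast_join` and `repartition_join`, and the representation of tables as lists of `(key, value)` pairs, are assumptions made only for illustration.

```python
from collections import defaultdict


def broadcast_join(small, large_partitions):
    """Hypothetical sketch: hash the small table once, then probe it from
    every partition of the large table, so the large side is never shuffled."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)

    out = []
    for partition in large_partitions:
        for key, value in partition:
            # .get() avoids inserting empty entries for unmatched keys
            for small_value in table.get(key, ()):
                out.append((key, small_value, value))
    return out


def repartition_join(left, right, num_partitions):
    """Hypothetical sketch: route both inputs to buckets by hash of the join
    key (the 'shuffle'), then join each pair of buckets locally."""
    left_buckets = [[] for _ in range(num_partitions)]
    right_buckets = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_buckets[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_buckets[hash(key) % num_partitions].append((key, value))

    out = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        # Matching keys always land in the same bucket index on both sides
        out.extend(broadcast_join(left_bucket, [right_bucket]))
    return out
```

In this sketch, the broadcast variant copies the whole small table to wherever the large table's partitions live, while the repartition variant moves both tables. Which cost dominates depends on the table sizes, which is the kind of trade-off the abstract refers to.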

Cite this article

BIAN Hao-Qiong, CHEN Yue-Guo, DU Xiao-Yong, GAO Yan-Jie. Equi-join optimization on Spark[J]. Journal of East China Normal University (Natural Science), 2014, 2014(5): 261-270. DOI: 10.3969/j.issn.1000-5641.2014.05.023
