Journal of East China Normal University (Natural Science) ›› 2014, Vol. 2014 ›› Issue (5): 261-270. doi: 10.3969/j.issn.10005641.2014.05.023

• Computer Science and Technology •

Equi-join optimization on Spark

BIAN Hao-Qiong1,2, CHEN Yue-Guo1,2, DU Xiao-Yong1,2, GAO Yan-Jie1,2

  1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), MOE, Beijing 100872, China;
  2. School of Information, Renmin University of China, Beijing 100872, China
  • Online: 2014-09-25 Published: 2014-11-27
  • Corresponding author: CHEN Yue-Guo, male, associate professor, master's supervisor; research interests: databases and information retrieval. E-mail: chenyueguo@gmail.com
  • About the author: BIAN Hao-Qiong, male, PhD candidate; research interest: databases. E-mail: bianhaoqiong@gmal.com.
  • Supported by:

    Research Funds of Renmin University of China (Fundamental Research Funds for the Central Universities) (10XNI018)

Abstract: Equi-join is one of the most common and costly operations in data analysis. Its implementation and optimization on Spark differ substantially from those in traditional parallel databases: the equi-join algorithms based on data pre-partitioning that are commonly used in parallel data warehouses are difficult to implement on Spark, while the widely adopted broadcast join and repartition join perform poorly. Improving join performance is therefore key to large-scale data analysis on Spark. This work combines the advantages of SimiJoin and Partition Join and proposes an optimized equi-join algorithm that exploits the features of Spark. Cost analysis and experiments show that the algorithm improves performance by 1 to 2 times over the join algorithms used in existing Spark-based data analysis systems.

Key words: Spark, SQL, big data analysis, equi-join, in-memory computation
