Equi-Join Optimization on Spark


Journal of East China Normal University (Natural Science), 2014, Vol. 2014, Issue (5): 261-270. doi: 10.3969/j.issn.10005641.2014.05.023


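BIAN Hao-Qiong^1,2, CHEN Yue-Guo^1,2, DU Xiao-Yong^1,2, GAO Yan-Jie^1,2

1. Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education;  2. School of Information, Renmin University of China, Beijing 100872, China

Online: 2014-09-25  Published: 2014-11-27
Corresponding author: CHEN Yue-Guo, male, associate professor and master's supervisor; research interests: databases and information retrieval. E-mail: chenyueguo@gmail.com
About the author: BIAN Hao-Qiong, male, PhD candidate; research interest: databases. E-mail: bianhaoqiong@gmal.com
Funding:
Research Funds of Renmin University of China (supported by the Fundamental Research Funds for the Central Universities) (10XNI018)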


Abstract

Equi-join is one of the most common and most costly operations in data analysis, and its implementation and optimization on Spark differ considerably from those in traditional parallel databases. Equi-join algorithms based on data pre-partitioning, which are commonly used in parallel data warehouses, are hard to implement on Spark, while the widely adopted Broadcast Join and Repartition Join perform poorly. Improving join performance is therefore key to massive data analysis on Spark. This work combines the advantages of SimiJoin and Partition Join and, building on the features of Spark, proposes an optimized equi-join algorithm. Cost analysis and experiments indicate that the algorithm is 1 to 2 times faster than the join algorithms used in existing Spark-based data analysis systems.

Key words: Spark, SQL, big data analysis, equi-join, in-memory computation
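
For readers unfamiliar with the two baseline strategies named in the abstract, the following is a minimal sketch in plain Python (not the Spark API, and not the algorithm proposed in the paper). It assumes each table is a list of partitions holding `(key, value)` pairs; the function names `broadcast_join` and `repartition_join` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def broadcast_join(large_partitions, small_table):
    """Broadcast join sketch: each partition of the large table probes
    a full in-memory copy of the small table, so the large table is
    never shuffled."""
    # Build one hash table over the small table. In a cluster, this
    # copy would be shipped to every worker (assumption of the sketch).
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    result = []
    for partition in large_partitions:
        for key, left_value in partition:
            for right_value in lookup.get(key, []):
                result.append((key, left_value, right_value))
    return result


def repartition_join(left_partitions, right_partitions, num_partitions):
    """Repartition join sketch: both inputs are redistributed by
    hash(key) so matching keys meet in the same bucket, then each
    bucket is joined locally."""
    left_buckets = [defaultdict(list) for _ in range(num_partitions)]
    right_buckets = [[] for _ in range(num_partitions)]

    # "Shuffle" phase: route every record to the bucket of its key.
    for partition in left_partitions:
        for key, value in partition:
            left_buckets[hash(key) % num_partitions][key].append(value)
    for partition in right_partitions:
        for key, value in partition:
            right_buckets[hash(key) % num_partitions].append((key, value))

    # Local join phase: probe the left hash table of the same bucket.
    result = []
    for build, probe in zip(left_buckets, right_buckets):
        for key, right_value in probe:
            for left_value in build.get(key, []):
                result.append((key, left_value, right_value))
    return result


if __name__ == "__main__":
    orders = [[(1, "o1"), (2, "o2")], [(1, "o3"), (3, "o4")]]
    users = [[(1, "alice")], [(2, "bob"), (4, "dan")]]
    small = [row for part in users for row in part]
    print(sorted(broadcast_join(orders, small)))
    print(sorted(repartition_join(orders, users, num_partitions=2)))
```

Both functions return the same matches in this sketch. What differs is the data movement: the broadcast variant copies the small table to every worker, while the repartition variant moves both tables across the network.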

CLC number:

TP392

BIAN Hao-Qiong, CHEN Yue-Guo, DU Xiao-Yong, GAO Yan-Jie. Equi-join optimization on Spark[J]. Journal of East China Normal University (Natural Science), 2014, 2014(5): 261-270.

