华东师范大学学报(自然科学版) ›› 2017, Vol. 2017 ›› Issue (5): 11-19.doi: 10.3969/j.issn.1000-5641.2017.05.002

• 数据管理 • 上一篇    下一篇

支持非等值连接的分布式数据流处理系统

陈明珠, 王晓桐, 房俊华, 张蓉   

  1. 华东师范大学 计算机科学与软件工程学院 上海高可信计算重点实验室, 上海 200062
  • 收稿日期:2017-06-28 出版日期:2017-09-25 发布日期:2017-09-25
  • 通讯作者: 张蓉,女,教授,研究方向为分布式数据管理.E-mail:rzhang@sei.ecnu.edu.cn E-mail:rzhang@sei.ecnu.edu.cn
  • 作者简介:陈明珠,女,本科生,专业为计算机科学.E-mail:101521300140@stu.ecnu.edu.cn
  • 基金资助:
    国家大学生创新创业训练计划(20160269127);国家自然科学基金(61232002);国家863计划(2015AA015307);国家自然基金委项目(61672233)

Distributed stream processing system for join operations

CHEN Ming-zhu, WANG Xiao-tong, FANG Jun-hua, ZHANG Rong   

  1. School of Computer Science and Software Engineering, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
  • Received:2017-06-28 Online:2017-09-25 Published:2017-09-25

摘要: 实时处理的分布式数据流系统在当今大数据时代扮演着越来越重要的角色.其中,连接查询是大数据分析处理中最为重要且开销较大的操作之一.然而,由于现实应用产生的数据普遍存在倾斜分布现象,加之数据流本身的无界性与不可预知性,给在分布式数据流系统上进行连接查询处理提出了严峻的挑战.目前工业界较为主流的数据流系统处理连接查询的通用性较低,没有提供专门针对连接操作的接口;学术界推出的数据流连接查询原型系统虽然提供了接口,但大多面向等值连接,或仅能支持部分theta连接,且存在资源开销大、负载均衡性能低等问题.本文对比分析三种典型数据流系统,将基于Join-Matrix的连接处理技术与Storm系统相结合,设计并实现了通用的、可支持任意连接查询的数据流处理系统.实验展示了本文设计的系统具有更加良好的吞吐量与资源优化表现.

关键词: 数据流处理系统, 连接处理, 分布式计算

Abstract: Real-time stream processing system plays an increasingly important role in practical applications. Stream Join constitutes one of the most important and expensive operation in big data analysis. However, skewed data distribution in real-world applications and inherent features of streaming data, such as infinity and unpredictability, put great pressure on the join processing in distributed stream systems. Mainstream industrial stream systems have low versatility on join processing, providing no programming interface; though several academic stream prototype systems solve such a problem to a certain extent, they support equi-join processing only, or results in high resource utilization and severe load imbalance. In this paper, after analyzing three typical distributed stream systems, we integrate the techniques based on Join-Matrix into Storm, design and implement a general stream processing system which supports arbitrary theta joins. Experiments demonstrate that the system proposed in this paper outperforms the static-of-the-art strategies.

Key words: stream processing system, join processing, distributed computing

中图分类号: