数据管理

支持非等值连接的分布式数据流处理系统

  • 陈明珠 ,
  • 王晓桐 ,
  • 房俊华 ,
  • 张蓉
展开
  • 华东师范大学 计算机科学与软件工程学院 上海高可信计算重点实验室, 上海 200062
陈明珠,女,本科生,专业为计算机科学.E-mail:101521300140@stu.ecnu.edu.cn

收稿日期: 2017-06-28

  网络出版日期: 2017-09-25

基金资助

国家大学生创新创业训练计划(20160269127);国家自然科学基金(61232002);国家863计划(2015AA015307);国家自然基金委项目(61672233)

Distributed stream processing system for join operations

  • CHEN Ming-zhu ,
  • WANG Xiao-tong ,
  • FANG Jun-hua ,
  • ZHANG Rong
Expand
  • School of Computer Science and Software Engineering, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China

Received date: 2017-06-28

  Online published: 2017-09-25

摘要

实时处理的分布式数据流系统在当今大数据时代扮演着越来越重要的角色.其中,连接查询是大数据分析处理中最为重要且开销较大的操作之一.然而,由于现实应用产生的数据普遍存在倾斜分布现象,加之数据流本身的无界性与不可预知性,给在分布式数据流系统上进行连接查询处理提出了严峻的挑战.目前工业界较为主流的数据流系统处理连接查询的通用性较低,没有提供专门针对连接操作的接口;学术界推出的数据流连接查询原型系统虽然提供了接口,但大多面向等值连接,或仅能支持部分theta连接,且存在资源开销大、负载均衡性能低等问题.本文对比分析三种典型数据流系统,将基于Join-Matrix的连接处理技术与Storm系统相结合,设计并实现了通用的、可支持任意连接查询的数据流处理系统.实验展示了本文设计的系统具有更加良好的吞吐量与资源优化表现.

本文引用格式

陈明珠 , 王晓桐 , 房俊华 , 张蓉 . 支持非等值连接的分布式数据流处理系统[J]. 华东师范大学学报(自然科学版), 2017 , 2017(5) : 11 -19 . DOI: 10.3969/j.issn.1000-5641.2017.05.002

Abstract

Real-time stream processing system plays an increasingly important role in practical applications. Stream Join constitutes one of the most important and expensive operation in big data analysis. However, skewed data distribution in real-world applications and inherent features of streaming data, such as infinity and unpredictability, put great pressure on the join processing in distributed stream systems. Mainstream industrial stream systems have low versatility on join processing, providing no programming interface; though several academic stream prototype systems solve such a problem to a certain extent, they support equi-join processing only, or results in high resource utilization and severe load imbalance. In this paper, after analyzing three typical distributed stream systems, we integrate the techniques based on Join-Matrix into Storm, design and implement a general stream processing system which supports arbitrary theta joins. Experiments demonstrate that the system proposed in this paper outperforms the static-of-the-art strategies.

参考文献

[1] ANKIT T, SIDDARTH T, AMIT S, et al. Storm@Twitter[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2014:147-156.
[2] LEONARDO N, BRUCE R, ANISH N, et al. S4:Distributed stream computing platform[C]//Proceedings of the International Conference on Data Mining Workshops, 2010:170-177.
[3] CHEN G J, WIENER J L, IYER S, et al. Realtime data processing at Facebook[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2016:1087-1098.
[4] WILSCHUT A N, APERS P M G. Dataflow query execution in a parallel main-memory environment[J]. Distributed and Parallel Databases, 1993(1):103-123.
[5] URHAN T, FRANKLINM J. Dynamic pipeline scheduling for improving interactive query performance[C]//Proceedings of International Conference on Very Large Data Bases. 2001:501-510.
[6] IVES Z G, FLORESCU D, FRIEDMAN M, et al. An adaptive query execution system for data integration[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 1999:299-310.
[7] TAO Y F, YIU M L, PAPADIAS D, et al. RPJ:Producing fast join results on streams through rate-based optimization[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2005:371-382.
[8] MOKBEL M F, LU M, AREF W G. Hash-merge join:A non-blocking join algorithm for producing fast and early join results[C]//Proceedings of the 20th International Conference on Data Engineering. 2004:251-262.
[9] ANANTHANARAYANAN R, BASKER V, DAS S, et al. Photon:Fault-tolerant and scalable joining of continuous data streams[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2013:577-588.
[10] ZAHARIA M, DAS T, LI H Y, et al. Discretized streams:Fault-tolerant streaming computation at scale[C]//Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013:423-438.
[11] QIAN Z P, HE Y, SU C Z, et al. TimeStream:Reliable stream computation in the cloud[C]//Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013:1-14.
[12] ELSEIDY M, ELGUINDY A, VITOROVIC A, et al. Scalable and adaptive online joins[C]//Proceedings of International Conference on Very Large Data Bases, 2014(7):441-452.
[13] LIN Q, OOI B C, WANG Z K, et al. Scalable distributed stream join processing[C]//Proceedings of ACM SIGMOD International Conference on Management of Data. ACM, 2015:811-825.
[14] GOODHOPE K, KOSHY J, KREPS J, et al. Building linkedin's real-time activity data pipeline[J]. IEEE Data Eng Bull, 2012, 35(2):33-45.
[15] REDIS.[DB/OL].[2017-06-01]. https://redis.io/.
[16] ANGULAR JS.[EB/OL].[2017/06-01]. https://angularjs.org/.
[17] FANG J H, ZHANG R, WANG X T, et al. Distributed stream join under workload variance[J]. World Wide Web Journal, 2017:1-22.
[18] FANG J H, WANG X T, ZHANG R, et al. Flexible and adaptive stream join algorithm[C]//Proceedings of International Conference on Asia-Pacific Web, 2016:3-16.
[19] FANG J H, ZHANG R, WANG X T, et al. Cost-effective stream join algorithm on cloud system[C]//Proceedings of CIKM International Conference on Information and Knowledge Management. ACM, 2016:1773-1782.
[20] TPC-H BENCHMARK.[EB/OL].[2017-06-01]. http://www.tpc.org/tpch.
文章导航

/