Data Management

Distributed stream processing system for join operations

  • CHEN Ming-zhu ,
  • WANG Xiao-tong ,
  • FANG Jun-hua ,
  • ZHANG Rong
Expand
  • School of Computer Science and Software Engineering, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China

Received date: 2017-06-28

  Online published: 2017-09-25

Abstract

Real-time stream processing system plays an increasingly important role in practical applications. Stream Join constitutes one of the most important and expensive operation in big data analysis. However, skewed data distribution in real-world applications and inherent features of streaming data, such as infinity and unpredictability, put great pressure on the join processing in distributed stream systems. Mainstream industrial stream systems have low versatility on join processing, providing no programming interface; though several academic stream prototype systems solve such a problem to a certain extent, they support equi-join processing only, or results in high resource utilization and severe load imbalance. In this paper, after analyzing three typical distributed stream systems, we integrate the techniques based on Join-Matrix into Storm, design and implement a general stream processing system which supports arbitrary theta joins. Experiments demonstrate that the system proposed in this paper outperforms the static-of-the-art strategies.

Cite this article

CHEN Ming-zhu , WANG Xiao-tong , FANG Jun-hua , ZHANG Rong . Distributed stream processing system for join operations[J]. Journal of East China Normal University(Natural Science), 2017 , 2017(5) : 11 -19 . DOI: 10.3969/j.issn.1000-5641.2017.05.002

References

[1] ANKIT T, SIDDARTH T, AMIT S, et al. Storm@Twitter[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2014:147-156.
[2] LEONARDO N, BRUCE R, ANISH N, et al. S4:Distributed stream computing platform[C]//Proceedings of the International Conference on Data Mining Workshops, 2010:170-177.
[3] CHEN G J, WIENER J L, IYER S, et al. Realtime data processing at Facebook[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2016:1087-1098.
[4] WILSCHUT A N, APERS P M G. Dataflow query execution in a parallel main-memory environment[J]. Distributed and Parallel Databases, 1993(1):103-123.
[5] URHAN T, FRANKLINM J. Dynamic pipeline scheduling for improving interactive query performance[C]//Proceedings of International Conference on Very Large Data Bases. 2001:501-510.
[6] IVES Z G, FLORESCU D, FRIEDMAN M, et al. An adaptive query execution system for data integration[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 1999:299-310.
[7] TAO Y F, YIU M L, PAPADIAS D, et al. RPJ:Producing fast join results on streams through rate-based optimization[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2005:371-382.
[8] MOKBEL M F, LU M, AREF W G. Hash-merge join:A non-blocking join algorithm for producing fast and early join results[C]//Proceedings of the 20th International Conference on Data Engineering. 2004:251-262.
[9] ANANTHANARAYANAN R, BASKER V, DAS S, et al. Photon:Fault-tolerant and scalable joining of continuous data streams[C]//Proceedings of SIGMOD International Conference on Management of Data. ACM, 2013:577-588.
[10] ZAHARIA M, DAS T, LI H Y, et al. Discretized streams:Fault-tolerant streaming computation at scale[C]//Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013:423-438.
[11] QIAN Z P, HE Y, SU C Z, et al. TimeStream:Reliable stream computation in the cloud[C]//Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013:1-14.
[12] ELSEIDY M, ELGUINDY A, VITOROVIC A, et al. Scalable and adaptive online joins[C]//Proceedings of International Conference on Very Large Data Bases, 2014(7):441-452.
[13] LIN Q, OOI B C, WANG Z K, et al. Scalable distributed stream join processing[C]//Proceedings of ACM SIGMOD International Conference on Management of Data. ACM, 2015:811-825.
[14] GOODHOPE K, KOSHY J, KREPS J, et al. Building linkedin's real-time activity data pipeline[J]. IEEE Data Eng Bull, 2012, 35(2):33-45.
[15] REDIS.[DB/OL].[2017-06-01]. https://redis.io/.
[16] ANGULAR JS.[EB/OL].[2017/06-01]. https://angularjs.org/.
[17] FANG J H, ZHANG R, WANG X T, et al. Distributed stream join under workload variance[J]. World Wide Web Journal, 2017:1-22.
[18] FANG J H, WANG X T, ZHANG R, et al. Flexible and adaptive stream join algorithm[C]//Proceedings of International Conference on Asia-Pacific Web, 2016:3-16.
[19] FANG J H, ZHANG R, WANG X T, et al. Cost-effective stream join algorithm on cloud system[C]//Proceedings of CIKM International Conference on Information and Knowledge Management. ACM, 2016:1773-1782.
[20] TPC-H BENCHMARK.[EB/OL].[2017-06-01]. http://www.tpc.org/tpch.
Outlines

/