Journal of East China Normal University(Natural Science) >
Distributed and scalable stream join algorithm
Received date: 2016-06-24
Online published: 2016-11-29
Join-Matrix is a high-performance model for stream join processing in a parallel shared-nothing environment, which supports arbitrary join operations and is
resilient to data skew for taking random tuple distribution as its routing policy. To evenly distribute workload and minimize network communication cost, designing an efficient partitioning policy on the matrix is particularly essential. In this paper, we propose a novel stream join operator that continuously adjust its partitioning scheme to real-time data dynamics. Specifically, based on the sample statistics of streams and rated load of each physical machine, a lightweight scheme generator produces a partitioning scheme; then the corresponding solutions for state relocation are generated by a migration plan generator to minimize migration cost while ensuring result correctness.Our experiments on different kinds of data sets demonstrate that our operator outperforms the static-of-the-art strategies in resource utilization, throughput and system latency.
WANG Xiao-tong , FANG Jun-hua , ZHANG Rong . Distributed and scalable stream join algorithm[J]. Journal of East China Normal University(Natural Science), 2016 , 2016(5) : 81 -88 . DOI: 10.3969/j.issn.1000-5641.2016.05.010
[ 1 ] DITTRICH J-P, SEEGER B, TAYLOR D S, et al. Progressive merge join: A generic and non-blocking sort-based join algorithm [C]//Proceedings of the 28th VLDB Conference. 2002: 299-310.
[ 2 ] URHAN T, FRANKLIN M J. XJoin: A reactively-scheduled pipelined join operator [J]. IEEE Data Eng Bull, 2000, 23(2): 27-33.
[ 3 ] WANG S, RUNDENSTEINER E. Scalable stream join processing with expensive predicates: Workload distribution and adaptation by time-slicing [C]//Proceedings of the 12th Conference on EDBT. 2009: 299-310.
[ 4 ] GOUNARIS A, TSAMOURA E, MANOLOPOULOS Y. Adaptive query processing in distributed settings [J]. Intelligent Systems Reference Library, 2013, 36: 211-236.
[ 5 ] LIU B, JBANTOVA M, RUNDENSTEINER E A. Optimizing state-intensive non-blocking queries using run-time adaptation [C]//Proceedings of the 2007 IEEE 23rd ICDEW. IEEE, 2007: 614-623.
[ 6 ] PATON N W, BUENABAD-CHAVEZ J, CHEN M, et al. Autonomic query parallelization using non-dedicated computers: An evaluation of adaptivity options [J]. The VLDB Journal, 2009, 18(1): 119-140.
[ 7 ] STAMOS J W, YOUNG H C. A symmetric fragment and replicate algorithm for distributed joins [J]. IEEE Transactions on Parallel & Distributed Systems, 1993, 4(12): 1345-1354.
[ 8 ] EPSTEIN R, STONEBRAKER M, WONG E. Distributed query processing in a relational data base system [C]//Proceedings of ACM SIGMOD Conference on Management of Data. 1978: 169-180.
[ 9 ] OKCAN A, RIEDEWALD M. Processing theta-joins using MapReduce [C]//Proceedings of ACM SIGMOD Conference on Management of Data. 2011: 949-960.
[10] ELSEIDY M, ELGUINDY A. Scalable and adaptive online joins [J]. The VLDB Endowment, 2014, 7(6): 441-452.
[11] GEDIK B. Partitioning functions for stateful data parallelism in stream processing [J]. The VLDB Journal, 2013, 23(4): 517-539.
[12] Apache storm[EB/OL]. [2016-06-10]. http://storm.apache.org.
[13] The TPC-H benchmark[EB/OL]. [2016-06-10]. http://www.tpc.org/tpch.
/
〈 |
|
〉 |