Journal of East China Normal University(Natural Sc ›› 2014, Vol. 2014 ›› Issue (5): 228-239.doi: 10.3969/j.issn.10005641.2014.05.020

• Article • Previous Articles     Next Articles

LCDJ: Locality conscious join algorithm for in-memory cluster computing

 ZHANG  Lei, ZHOU  Min-Qi, WANG  Li   

  1. Data Science and Engineering Institute, Software Engineering Institute, East China Normal University, Shanghai 200062, China
  • Online:2014-09-25 Published:2014-11-27

Abstract: Equal join is one of the most important operators in database systems. Hash join is an efficient algorithm for equal join. In distributed inmemory database system, data tables are partitioned across multiple nodes. Hash join needs two input tables to be repartitioned on the joined attributes under the same hash function before local join, to make sure that tuples from the two tables with the same join values are transferred to the same node. Since the speed of data processing in memory is much faster than the speed of network, data repartition occupies a large amount of time and becomes the bottleneck of equal join in distributed in-memory database. This paper proposes a novel equal join algorithm, which takes full advantages of in-memory computing and reduces the volume of data to be transferred. The algorithm first accumulates accurate statistics on the joined attributes of two tables, and then builds a cost model to measure the cost of different schedule strategies, and generates the optimized schedule strategy. Furthermore, the degree of parallelism and computing load balance are taken into consideration in our cost model. The proposed algorithm is implemented on our in-memory distributed prototype system Claims. Extensive experiment confirms that the algorithm effectively reduces the cost of network communication, improves the query response time by a huge margin, and gets a higher performance than Hive and Shark.

Key words: distributed hash join, in-memory database, network communication optimization, computing load balance, distributed system

CLC Number: