Parallel join based on distributed system OceanBase

XU Shi-lei, WANG Lei, HU Hui-qi, QIAN Wei-ning, ZHOU Ao-ying

Journal of East China Normal University(Natural Sc 2017, 2017 (5): 1-10. DOI: 10.3969/j.issn.1000-5641.2017.05.001

With the rapid growth of application data and the continued development of distributed database systems, storing data on physically independent nodes has become a trend. Under this trend, complex join queries inevitably generate a large amount of network traffic, so improving the efficiency of join queries in distributed systems is a hot topic. Based on an analysis of the nested loop join, hash join, and semi-join in OceanBase, this paper proposes an optimization that makes reasonable use of hardware resources and uses multiple threads to execute join operations in parallel. We run experiments on OceanBase with the nested loop join, hash join, and semi-join algorithms respectively. The experimental results confirm that, up to a certain number of threads, the efficiency of the join algorithms is positively correlated with the degree of parallelism.
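
As a rough illustration of the idea of running a join across several threads, the following Python sketch partitions both inputs by a hash of the join key and joins each pair of partitions in its own worker. It is not OceanBase code and not the paper's implementation: the function names, the `(key, value)` row layout, and the thread count are all assumptions made for this example. In CPython the global interpreter lock limits the speed-up for CPU-bound work, so the sketch shows the structure of the technique rather than its performance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Split (key, value) rows into n partitions by hash of the join key."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[hash(key) % n].append((key, value))
    return parts


def hash_join(build_rows, probe_rows):
    """Classic in-memory hash join on one partition pair."""
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    # .get() does not create entries, so missing keys simply yield no rows.
    return [(key, b, p) for key, p in probe_rows for b in table.get(key, ())]


def parallel_hash_join(r, s, num_threads=4):
    """Join r and s on their keys, one partition pair per worker thread."""
    r_parts = partition(r, num_threads)
    s_parts = partition(s, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(hash_join, r_parts, s_parts)
        return [row for part in results for row in part]
```

Because equal keys always hash to the same partition, each worker can join its pair independently and the partial results only need to be concatenated.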

Distributed stream processing system for join operations

CHEN Ming-zhu, WANG Xiao-tong, FANG Jun-hua, ZHANG Rong

Journal of East China Normal University(Natural Sc 2017, 2017 (5): 11-19. DOI: 10.3969/j.issn.1000-5641.2017.05.002

Real-time stream processing systems play an increasingly important role in practical applications. The stream join is one of the most important and expensive operations in big data analysis. However, skewed data distributions in real-world applications and the inherent features of streaming data, such as infinity and unpredictability, put great pressure on join processing in distributed stream systems. Mainstream industrial stream systems have low versatility in join processing and provide no programming interface for it. Several academic prototype systems solve this problem to a certain extent, but they either support equi-joins only or result in high resource usage and severe load imbalance. In this paper, after analyzing three typical distributed stream systems, we integrate techniques based on the Join-Matrix model into Storm, and design and implement a general stream processing system that supports arbitrary theta joins. Experiments demonstrate that the proposed system outperforms state-of-the-art strategies.
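
The general join-matrix idea can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not the system described above and not Storm code: the class name `JoinMatrix`, its methods, and the random routing are hypothetical. Each tuple from stream R is assigned to one random row of a grid of cells and stored in every cell of that row; each tuple from stream S is assigned to one random column. Any (r, s) pair therefore meets in exactly one cell, so an arbitrary predicate can be evaluated without hashing on a join key.

```python
import random


class JoinMatrix:
    """Toy join matrix: rows x cols cells, each storing R and S tuples."""

    def __init__(self, rows, cols, predicate, seed=0):
        self.rows, self.cols, self.predicate = rows, cols, predicate
        self.rng = random.Random(seed)
        # cells[i][j] = (stored R tuples, stored S tuples)
        self.cells = [[([], []) for _ in range(cols)] for _ in range(rows)]

    def on_r(self, r):
        """Route an R tuple to one random row; probe, then store, in each cell."""
        i = self.rng.randrange(self.rows)
        out = []
        for j in range(self.cols):
            r_store, s_store = self.cells[i][j]
            out.extend((r, s) for s in s_store if self.predicate(r, s))
            r_store.append(r)
        return out

    def on_s(self, s):
        """Route an S tuple to one random column; probe, then store, in each cell."""
        j = self.rng.randrange(self.cols)
        out = []
        for i in range(self.rows):
            r_store, s_store = self.cells[i][j]
            out.extend((r, s) for r in r_store if self.predicate(r, s))
            s_store.append(s)
        return out
```

In a distributed deployment each cell would live on its own worker, and the random row or column choice spreads skewed keys evenly at the cost of replicating every tuple along one dimension of the grid.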

Storage and load balancing for large-scale comment data on heterogeneous Redis cluster

ZHANG Jing-wei, DING Zhi-jun, YANG Qing, ZHANG Hui-bing, ZHANG Hai-tao, ZHOU Ya

Journal of East China Normal University(Natural Sc 2017, 2017 (5): 20-29. DOI: 10.3969/j.issn.1000-5641.2017.05.003

The storage and query performance of large-scale comment data has a great influence on the applications built on that data. In a heterogeneous computing environment, nodes differ in storage and computing performance, so a key challenge is to optimize the storage and query performance of large-scale comment data by taking full advantage of each node's capabilities. Based on the capabilities of Redis clusters, we first design a storage model for large-scale comment data in a homogeneous Redis cluster, which balances storage across Redis slots. We then discuss the relationship between the number of Redis slots and query efficiency, and design a method for heterogeneous Redis clusters that allocates storage according to the real load of each computing node. The method makes full use of each node's performance and guides the allocation of slots to nodes by balancing query performance against storage load. Our experimental results show that the proposed model balances storage load well and improves the query efficiency of heterogeneous Redis clusters.
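
One simple way to picture capacity-aware slot allocation is to divide the 16384 hash slots of a Redis Cluster among nodes in proportion to a per-node weight. The Python sketch below does this with the largest-remainder method. It is only an assumption about what such an allocation could look like, not the method of the paper: the function name `allocate_slots`, the weight values, and the proportional rule are all hypothetical, and every weight is assumed to be positive and large enough to earn at least one slot.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in a Redis Cluster


def allocate_slots(weights, total=TOTAL_SLOTS):
    """Map node name -> (first_slot, last_slot), proportional to its weight.

    weights: dict of node name -> positive capacity weight (hypothetical
    measure, e.g. derived from benchmarked throughput or memory).
    """
    weight_sum = sum(weights.values())
    exact = {node: total * w / weight_sum for node, w in weights.items()}
    counts = {node: int(share) for node, share in exact.items()}

    # Hand the slots lost to rounding down to the largest fractional parts.
    leftover = total - sum(counts.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for node in by_fraction[:leftover]:
        counts[node] += 1

    # Turn counts into contiguous, inclusive slot ranges.
    ranges, start = {}, 0
    for node, count in counts.items():
        ranges[node] = (start, start + count - 1)
        start += count
    return ranges
```

For example, `allocate_slots({"a": 1, "b": 1, "c": 2})` gives node `c` half of the slot space and nodes `a` and `b` a quarter each.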

Design and implementation of Smart materialization for column-store in CLAIMS

ZHANG Han, ZHOU Min-qi

Journal of East China Normal University(Natural Sc 2017, 2017 (5): 30-39. DOI: 10.3969/j.issn.1000-5641.2017.05.004

Materialization is a necessary operation in query execution, and both the materialization strategy and the materialization technique play an important role in it. It is therefore necessary to design a materialization strategy for column-store databases. To address the shortcomings of early materialization and late materialization, we propose a strategy named Smart materialization that differs from both. To this end, we define a concept in the logical query plan, the projection: a structure used to select the desired attributes. The physical table is split by column so that, at the beginning of a query, this structure reduces the amount of data loaded directly into memory and avoids additional overhead. In the logical query plan, projections are divided by column, the columns needed next are predicted from the correlation among queries in a query set, and the required columns are kept in the most appropriate projection. We verify the validity of the strategy on the distributed in-memory database CLAIMS using the TPC-H data set.

An outer join algorithm based on Cuckoo filter

YU Yang, ZHOU Min-qi, FANG Zhu-he

Journal of East China Normal University(Natural Sc 2017, 2017 (5): 40-51. DOI: 10.3969/j.issn.1000-5641.2017.05.005

In recent years, with the development of the Internet, data volumes have grown rapidly. In the era of big data, the analysis efficiency of distributed database systems urgently needs to be optimized, and the join operation is their main performance bottleneck. Join operations are mainly divided into inner joins and outer joins, and outer joins are widely used in business scenarios. Distributed join algorithms involve a large amount of network transmission, which severely affects system performance. Although the literature contains studies on optimizing inner joins, these optimization methods cannot be applied directly to outer joins. This paper proposes a distributed outer join algorithm based on the Cuckoo filter. By building a Cuckoo filter combined with a replication subdivision technique for data filtering and allocation, the algorithm reduces the amount of data transmitted and improves the degree of parallelism accordingly, which in turn improves query performance. We implement this algorithm in Ginkgo. Extensive experiments show that the algorithm substantially improves the efficiency of outer joins.
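
To illustrate why a Cuckoo filter can reduce data transfer for an outer join, the following single-process Python sketch builds a small Cuckoo filter over the right-hand join keys and uses it to split the left-hand rows. Rows whose key is definitely absent can be padded with NULL locally, and only the remaining candidates would need to be sent over the network. This is not the algorithm of the paper and does not model its replication subdivision technique or Ginkgo: the class, the function `left_outer_join`, and all parameters are assumptions made for the example.

```python
import hashlib
import random


class CuckooFilter:
    """Minimal Cuckoo filter: 8-bit fingerprints, 4 slots per bucket."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        assert num_buckets & (num_buckets - 1) == 0, "must be a power of two"
        self.m, self.bucket_size, self.max_kicks = num_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _h(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, key):
        return self._h(b"fp:" + repr(key).encode()) % 255 + 1  # 1..255

    def _index(self, key):
        return self._h(b"ix:" + repr(key).encode()) % self.m

    def _alt(self, index, fp):
        # XOR with the fingerprint hash; applying it twice returns index.
        return (index ^ self._h(bytes([fp]))) % self.m

    def insert(self, key):
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            # Evict a random fingerprint and move it to its alternate bucket.
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is too full

    def might_contain(self, key):
        fp = self._fingerprint(key)
        i1 = self._index(key)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]


def left_outer_join(left, right):
    """Return (rows, shipped): join rows and how many left rows needed shipping."""
    cf = CuckooFilter()
    for key in {k for k, _ in right}:
        if not cf.insert(key):
            raise RuntimeError("filter full; use more buckets")
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    rows, shipped = [], 0
    for k, v in left:
        if not cf.might_contain(k):
            rows.append((k, v, None))  # definitely unmatched: pad locally
            continue
        shipped += 1  # candidate: would be sent to the node holding `right`
        matches = index.get(k)
        if matches:
            rows.extend((k, v, rv) for rv in matches)
        else:
            rows.append((k, v, None))  # filter false positive
    return rows, shipped
```

The filter never reports a stored key as absent, so no matching row is lost. False positives only cost an unnecessary transfer, since the final probe against the real keys still decides the output.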

Content of Data Management in our journal