Chinese core journal in comprehensive science and technology (Peking University core list)
Source journal of the Chinese Science Citation Database (CSCD)
Indexed by Chemical Abstracts (CA, USA)
Indexed by Mathematical Reviews (MR, USA)
Indexed by Referativny Zhurnal (Abstract Journal, Russia)
Special topic in this journal: Data Management Techniques in the New Era
Efficient data loading for log-structured data stores
DING Guo-hao, XU Chen, QIAN Wei-ning
Journal of East China Normal University (Natural Science), 2019, 2019(5): 143-158. DOI: 10.3969/j.issn.1000-5641.2019.05.012
With the rapid development of Internet technology in recent years, the number of users and the volume of data processed by Internet companies and traditional financial institutions are growing rapidly. Traditionally, businesses have tackled this scalability problem by adding servers and sharding their databases; however, this leads to significant manual maintenance expense and hardware overhead. To reduce this overhead and the problems caused by sharding, businesses commonly replace legacy systems with new database systems. In this context, new databases based on log-structured merge-tree (LSM-tree) storage, such as OceanBase, are being widely adopted; the data blocks stored on disk in such systems are globally ordered. When switching from a traditional database to a new one, a large amount of data must be transferred, and database nodes may fail during a long-running load. To reduce the total time for loading and failure recovery, we propose a data loading method that supports load balancing and efficient fault tolerance. To balance the load, we pre-calculate the number of partitions from the file size and the target system's default block size rather than using a pre-determined partition count. In addition, because data exported from a sharded database is usually sorted, we determine the split points between partitions by selecting a subset of sampling blocks and drawing samples from each block at equal intervals, avoiding the high overhead of global sampling and of random or head-of-block sampling. To speed up recovery, we propose replica-based partial recovery in place of restart-based complete reloading; this method exploits the multiple replicas of an LSM-tree system to reduce the amount of data that must be reloaded.
Experimental results show that pre-calculating the number of partitions, sampling only a subset of blocks, and sampling at equal intervals all accelerate data loading relative to a pre-determined partition count, global block sampling, and random or head-of-block sampling, respectively. We also demonstrate the efficiency of replica-based partial recovery compared to restart-based complete reloading.
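The partitioning idea in the abstract above can be sketched briefly. This is a hypothetical illustration, not the paper's implementation: the partition count is derived from the input size and a target block size, and split keys are chosen by equal-interval sampling over an already-sorted key sample. All function names and constants are illustrative.

```python
# Illustrative sketch: derive the partition count from file size, then pick
# split keys by equal-interval sampling from sorted input (as the abstract
# describes), instead of global or random sampling.

def plan_partitions(file_size, block_size):
    """Pre-calculate the partition count from the input size (ceiling division)."""
    return max(1, -(-file_size // block_size))

def split_points(sorted_keys, num_partitions, num_samples=64):
    """Pick split keys at equal intervals from a sorted key sample."""
    step = max(1, len(sorted_keys) // num_samples)
    sample = sorted_keys[::step]                 # equal-interval sampling
    stride = max(1, len(sample) // num_partitions)
    return sample[stride::stride][:num_partitions - 1]

keys = list(range(0, 10_000, 7))                 # stand-in for exported, sorted keys
parts = plan_partitions(file_size=1_000_000, block_size=250_000)
splits = split_points(keys, parts)
assert len(splits) == parts - 1                  # one split key between each pair
assert splits == sorted(splits)                  # splits preserve the sort order
```

Because the exported data is already sorted, the equal-interval sample is itself sorted, so the chosen splits yield roughly balanced partitions without a global scan.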
Implementation of LevelDB-based secondary index on two-dimensional data
LIU Zi-hao, HU Hui-qi, XU Rui, ZHOU Xuan
Journal of East China Normal University (Natural Science), 2019, 2019(5): 159-167. DOI: 10.3969/j.issn.1000-5641.2019.05.013
With the growth of spatial data generated by scientific research, especially two-dimensional data, and the ongoing development of NoSQL database technology, more and more spatial data is now stored in NoSQL databases. LevelDB is an open-source key-value NoSQL database that is widely used because its LSM-based architecture offers excellent write performance. Because of the limitations of the key-value structure, however, it cannot index spatial data effectively. To address this problem, a secondary index based on LevelDB and the R-tree is proposed to support indexing and neighbor queries over two-dimensional spatial data. Experimental results show that the structure has good usability.
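To make the secondary-index idea concrete, here is a deliberately simplified sketch. The paper builds an R-tree on top of LevelDB; in this toy version a plain dict stands in for the key-value store and a coarse grid stands in for the R-tree, purely to show how a spatial secondary index maps regions back to primary keys. Every name here is invented for illustration.

```python
# Toy secondary index for 2D points over a key-value store: the primary
# store maps key -> point, and the index maps a grid cell -> primary keys,
# so a neighbor query only touches nearby cells instead of scanning all keys.

CELL = 10.0  # grid cell width; an illustrative tuning knob

def cell_of(x, y):
    return (int(x // CELL), int(y // CELL))

store = {}      # primary data: key -> (x, y)
index = {}      # secondary index: grid cell -> set of primary keys

def put(key, x, y):
    store[key] = (x, y)
    index.setdefault(cell_of(x, y), set()).add(key)

def neighbors(x, y):
    """Collect candidate keys from the query cell and its 8 surrounding cells."""
    cx, cy = cell_of(x, y)
    out = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out |= index.get((cx + dx, cy + dy), set())
    return out

put("a", 3.0, 4.0)
put("b", 12.0, 4.0)
put("c", 55.0, 60.0)
assert neighbors(5.0, 5.0) == {"a", "b"}   # distant point "c" is pruned
```

An R-tree replaces the fixed grid with adaptive bounding rectangles, which handles skewed data far better, but the index-to-primary-key mapping works the same way.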
Implementation and optimization of a distributed consistency algorithm based on Paxos
ZHU Chao-fan, GUO Jin-wei, CAI Peng
Journal of East China Normal University (Natural Science), 2019, 2019(5): 168-177. DOI: 10.3969/j.issn.1000-5641.2019.05.014
With the ongoing development of the Internet, the degree of informatization in enterprises is continuously increasing, and more and more data needs to be processed in a timely manner. In this context, unstable network environments may lead to data loss and node downtime, with potentially serious consequences; building distributed fault-tolerant storage systems is therefore becoming increasingly popular. To ensure high availability and consistency across such a system, a distributed consistency algorithm must be introduced. To improve performance in unstable networks, traditional Paxos-based distributed systems allow holes to exist in the log. However, when a node enters the recovery state, these systems typically require a large amount of network interaction to fill the holes in the log; this greatly increases node recovery time and thereby affects system availability. To reduce the complexity of recovering a node whose log contains holes, this paper redesigns the log entry structure and optimizes the data recovery process. The effectiveness of the improved Paxos-based consistency algorithm is verified through simulation experiments.
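The "hole" problem described above can be sketched in a few lines. This is a hedged illustration of the general idea, not the paper's redesigned log structure: a Paxos log that allows out-of-order commits is represented as a sparse dict, and recovery fetches only the missing slots from a peer replica instead of replaying the whole log.

```python
# Minimal sketch: find the missing slots ("holes") in a sparse log and fill
# only those from a peer replica, rather than performing a full reload.

def find_holes(log, last_index):
    """Return the missing slot indices in a sparse log dict."""
    return [i for i in range(1, last_index + 1) if i not in log]

def recover(log, last_index, peer_log):
    """Fill only the holes from a peer replica's log."""
    for i in find_holes(log, last_index):
        if i in peer_log:
            log[i] = peer_log[i]
    return log

local = {1: "set x=1", 3: "set y=2", 5: "set z=3"}   # slots 2 and 4 are holes
peer = {i: f"entry-{i}" for i in range(1, 6)}         # a fully caught-up peer
assert find_holes(local, 5) == [2, 4]
recover(local, 5, peer)
assert find_holes(local, 5) == []
```

The cost of this style of recovery is proportional to the number of holes, not the log length, which is why reducing the round trips needed per hole matters for availability.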
Implementation and optimization of GPU-based relational streaming processing systems
HUANG Hao, LI Zhi-fang, WANG Jia-lun, WENG Chu-liang
Journal of East China Normal University (Natural Science), 2019, 2019(5): 178-189. DOI: 10.3969/j.issn.1000-5641.2019.05.015
State-of-the-art CPU-based streaming processing systems support complex queries on large-scale datasets. Limited by CPU computational capability, however, these systems face a performance trade-off between throughput and response time and cannot achieve the best of both. In this paper, we propose a GPU-based streaming processing system, named Serval, which co-utilizes CPU and GPU resources and efficiently processes streaming queries through micro-batching. Serval adopts a pipeline model and uses a streaming execution cache to optimize throughput and response time on large-scale datasets. To meet the demands of various scenarios, Serval implements multiple tuning policies by scaling the micro-batch size dynamically. Experiments show that a single-server Serval outperforms a three-server distributed Spark Streaming deployment by 3.87x in throughput while averaging 91% of its response time, reflecting the efficiency of the optimization.
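One way to picture the dynamic micro-batch scaling mentioned above is a simple feedback controller: grow the batch when measured latency is under budget (favoring throughput), shrink it when the budget is exceeded. The policy, constants, and function below are illustrative assumptions; the abstract does not describe Serval's actual controller.

```python
# Hedged sketch of a micro-batch tuning policy: multiplicative grow/shrink
# of the batch size driven by an observed-latency budget, clamped to bounds.

def tune_batch(batch_size, observed_ms, budget_ms,
               grow=1.5, shrink=0.5, lo=32, hi=65536):
    if observed_ms > budget_ms:
        batch_size = int(batch_size * shrink)   # latency too high: shrink
    else:
        batch_size = int(batch_size * grow)     # headroom left: grow
    return max(lo, min(hi, batch_size))

size = 1024
size = tune_batch(size, observed_ms=40, budget_ms=100)   # under budget: grows
size = tune_batch(size, observed_ms=140, budget_ms=100)  # over budget: shrinks
assert 32 <= size <= 65536
```

Larger batches amortize GPU kernel-launch and transfer overhead (throughput), while smaller batches bound queueing delay (response time), which is the trade-off such a policy navigates.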
Woodpecker+: Customized workload performance evaluation based on data characteristics
ZHANG Tao, ZHANG Xiao-lei, LI Yu-ming, ZHANG Chun-xi, ZHANG Rong
Journal of East China Normal University (Natural Science), 2019, 2019(5): 190-202. DOI: 10.3969/j.issn.1000-5641.2019.05.016
There are a number of performance testing tools, such as Sysbench and OLTPBench, that can be used to benchmark database performance. However, because standard benchmark workloads are fixed and do not always represent users' application scenarios, they cannot accurately characterize system performance. Moreover, if users must implement a test workload in a high-level programming language separately for each application, this introduces a substantial amount of repetitive work and makes testing inefficient. To address these issues, this paper designs and implements a tool for user-defined performance test workloads. Its main benefits can be summarized as follows: it is easy to use and extensible; it provides a test definition language (TDL) for efficiently constructing test cases; and it offers flexible control over the mixed execution of transactions and data access distributions, lightweight and fine-grained collection and analysis of statistics, and support for multiple mainstream DBMSs as well as other databases that provide database access interfaces. We highlight the tool's features through a detailed set of customized workload experiments on mainstream DBMSs.
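To give a flavor of what a declarative workload definition in the spirit of a TDL might look like, here is an invented sketch: transactions are declared with mix weights and then expanded into an execution schedule. The spec format, field names, and SQL stubs are all assumptions for illustration, not the tool's actual TDL syntax.

```python
# Illustrative sketch: a declarative workload spec (transaction mix with
# weights) expanded into a randomized execution schedule honoring the mix.

import random

workload = {
    "transactions": {
        "new_order": {"weight": 45, "sql": "INSERT INTO orders ..."},
        "payment":   {"weight": 43, "sql": "UPDATE accounts ..."},
        "stock":     {"weight": 12, "sql": "SELECT qty FROM stock ..."},
    },
    "total": 1000,
}

def schedule(spec, seed=7):
    """Draw a transaction sequence according to the declared weights."""
    rng = random.Random(seed)
    names = list(spec["transactions"])
    weights = [spec["transactions"][n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=spec["total"])

plan = schedule(workload)
assert len(plan) == 1000
assert set(plan) <= {"new_order", "payment", "stock"}
```

Keeping the mix declarative is what removes the per-application reimplementation work the abstract criticizes: changing the workload means editing weights, not rewriting driver code.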