J* E* C* N* U* N* S*

Persistent memory- and shared cache architecture-based high-performance database

Congcong WANG, Huiqi HU

2023, 2023 (5): 1-10. doi: 10.3969/j.issn.1000-5641.2023.05.001

The upsurge in cloud-native databases has been drawing attention to shared architectures. Although a shared cache architecture can effectively address cache consistency issues among multiple read-write nodes, problems remain, such as slow persistence, high latency in maintaining cache directories, and timestamp bottlenecks. To address these issues, this study combines a shared cache architecture with novel persistent memory hardware to realize TampoDB, a three-layer shared-architecture database comprising memory, persistent memory, and storage layers. The transaction execution process was redesigned for this architecture, with optimized timestamps and directories, thereby resolving the aforementioned problems. Experimental results show that TampoDB effectively enhances the persistence speed of transactions.

An HTAP database prototype with an adaptive data synchronization

Rong YU, Panfei YANG, Qingshuai WANG, Rong ZHANG

2023, 2023 (5): 11-25. doi: 10.3969/j.issn.1000-5641.2023.05.002

In hybrid transactional and analytical processing (HTAP) databases, resource isolation and data sharing are difficult problems. Although vendors achieve resource isolation through different architectures, the freshness that users care about, that is, the gap between the version written by online transactional processing (OLTP) and the version read by online analytical processing (OLAP), is determined by the consistency model of data sharing. However, existing HTAP databases apply only one consistency synchronization model for ease of implementation, which conflicts with the varied consistency requirements of user applications; overall system performance is sacrificed so that the highest consistency level covers all weaker requirements. This paper constructs a cost model of the tradeoff between freshness and performance, proposes a consistency switching algorithm together with a strategy for processing synchronized data before and after switching, and realizes an HTAP database prototype that adaptively switches between sequential consistency synchronization and linear consistency synchronization. The prototype can thus support query loads with different consistency (freshness) requirements and maximize system performance without adjusting the HTAP architecture. The effectiveness of adaptive switching is verified by extensive experiments.

Hybrid granular buffer management scheme for storage and computing separation architecture

Wenjuan MEI, Peng CAI

2023, 2023 (5): 26-39. doi: 10.3969/j.issn.1000-5641.2023.05.003

Storage-compute separation has emerged as an architecture for improving the performance and efficiency of large-scale data processing. However, this approach has notable performance bottlenecks, primarily due to the low access efficiency of object storage and significant network overhead. Additionally, object storage exhibits low storage efficiency for small files. For instance, ClickHouse, a MergeTree-based database, generates a large number of small files when storing data. To address these challenges, HG-Buffer (hybrid granularity buffer) is introduced as an SSD (solid-state drive)-based cache management solution for optimizing storage-compute separation with ClickHouse and S3, while also tackling the small-file issue in object storage. The primary objective of HG-Buffer is to minimize network transmission overhead and enhance system access efficiency. This is achieved by introducing SSD as a caching layer between the compute and storage layers and organizing the SSD buffer into two granularities: object buffer and block buffer. The object buffer granularity corresponds to the data granularity in object storage, while the block buffer granularity represents the data granularity accessed by the system, with the block buffer granularity being a subset of the object buffer granularity. By statistically analyzing data hotness information, HG-Buffer adaptively selects the storage location for data, improving SSD space utilization and system performance. Experimental evaluations conducted on ClickHouse and S3 demonstrate the effectiveness and robustness of HG-Buffer.

Separate management strategies for Part metadata under the storage-computing separation architecture

Danqi LIU, Peng CAI

2023, 2023 (5): 40-50. doi: 10.3969/j.issn.1000-5641.2023.05.004

To address the deficiencies of ClickHouse, including underutilization of hardware resources, lack of flexibility, and slow node startup, this paper proposes metadata management strategies under the storage-compute separation architecture, which focuses on the description of data information through Part metadata. Part metadata are the most crucial component of metadata. To effectively manage data on remote shared storage, this study collected all Part metadata files and merged them. After key-value mapping, serialization, and deserialization processes, the merged metadata were stored in a distributed key-value database. Furthermore, a synchronization strategy was designed to ensure consistency between the data on remote shared storage and the metadata in the distributed key-value database. By implementing the above strategies, a metadata management system was developed for Part metadata, which effectively addressed the slow node startup issue in ClickHouse and supported efficient dynamic scaling of nodes.

Generating diverse database isolation level test cases with fuzzy testing

Xiyu LU, Wei LIU, Siyang WENG, Keqiang LI, Rong ZHANG

2023, 2023 (5): 51-64. doi: 10.3969/j.issn.1000-5641.2023.05.005

Database management systems play a vital role in modern information systems. Isolation level testing is important for database management systems: it verifies the isolation of concurrent operations and data consistency, which prevents data corruption, inconsistency, and security risks and provides reliable data access to users. Fuzz testing is a method widely used in software and system testing. By searching the test space and generating diverse test cases, it explores the boundary conditions, anomalies, and potential problems of the system to find possible vulnerabilities. This article introduces SilverBlade, a tool for fuzz testing of database isolation levels that aims to improve the diversity of generated test cases and explore the isolation level test space in depth. To search the huge test space effectively, this study designed a structured test input that splits the test space into two subspaces, concurrent transaction combinations and execution interaction modes. To test the core implementation of isolation levels more comprehensively, an adaptive search method based on depth and breadth was also designed for effectively mutating test cases. The experimental results show that SilverBlade is able to generate diverse test cases and provide broader coverage of the core isolation level implementation code in the popular database management system PostgreSQL. Compared to similar tools, SilverBlade performed better at improving test coverage in critical areas of the isolation level.

FeaDB: In-memory based multi-version online feature store

Ge GAO, Huiqi HU

2023, 2023 (5): 65-76. doi: 10.3969/j.issn.1000-5641.2023.05.006

Feature management plays an important role in the AI (artificial intelligence) pipeline. Feature stores are designed to offer effective versioning of features during the model training and inference stages. Feature stores must ensure real-time feature updates and version management to collaborate with the upstream data ingestion tasks and power the model serving system. In AI-powered online decision augmentation applications, the model serving system responds to requests in real time to provide a better user experience, and feature stores face the challenge of low-latency online feature retrieval. Focusing on this challenge, we developed FeaDB, an in-memory multi-version online feature store, which adopts a time series model and provides feature versioning semantics to automatically manage features from ingestion to serving. Moreover, an append-write operation is applied to ensure ingestion performance, and version indexing is optimized to improve read operations. A snapshot mechanism is also proposed, and experiments show that snapshot reads improve the performance of lookup and range lookup.
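As an illustration of the append-write and version-indexing ideas in this abstract, the following is a minimal sketch and not FeaDB's actual design. The class `VersionedFeatureStore`, its method names, and the use of a sorted timestamp list with binary search are all assumptions made for this example.

```python
import bisect


class VersionedFeatureStore:
    """Toy in-memory multi-version store (hypothetical, for illustration only).

    Writes are append-only per feature key; reads locate a version by
    binary search over the key's timestamp list.
    """

    def __init__(self):
        self._timestamps = {}  # key -> list of timestamps, non-decreasing
        self._values = {}      # key -> list of values, parallel to timestamps

    def append(self, key, timestamp, value):
        ts = self._timestamps.setdefault(key, [])
        vals = self._values.setdefault(key, [])
        if ts and timestamp < ts[-1]:
            raise ValueError("timestamps must be appended in non-decreasing order")
        ts.append(timestamp)
        vals.append(value)

    def lookup(self, key, as_of):
        """Return the latest value with timestamp <= as_of, or None."""
        ts = self._timestamps.get(key, [])
        i = bisect.bisect_right(ts, as_of)
        return self._values[key][i - 1] if i else None

    def range_lookup(self, key, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        ts = self._timestamps.get(key, [])
        vals = self._values.get(key, [])
        lo = bisect.bisect_left(ts, start)
        hi = bisect.bisect_right(ts, end)
        return list(zip(ts[lo:hi], vals[lo:hi]))


store = VersionedFeatureStore()
store.append("user_1:clicks", 10, 0.1)
store.append("user_1:clicks", 20, 0.2)
```

With this data, `store.lookup("user_1:clicks", 15)` returns the version written at timestamp 10, which is the kind of point-in-time read a serving system would issue.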

Privacy-preserving cloud-end collaborative training

Xiangyun GAO, Dan MENG, Mingkai LUO, Jun WANG, Liping ZHANG, Chao KONG

2023, 2023 (5): 77-89. doi: 10.3969/j.issn.1000-5641.2023.05.007

China has advantages of scale and diversity in data resources, and mobile internet applications generate massive amounts of data in diverse scenarios. Recommendation systems can extract valuable information from these data, thereby mitigating the problem of information overload. Most existing research has focused on centralized recommender systems, which train on data collected centrally in the cloud. However, with increasingly prominent data security and privacy protection issues, collecting user data has become increasingly difficult, making centralized recommendation methods infeasible. This study focuses on privacy-preserving cloud-end collaborative training in a decentralized manner for personalized recommender systems. To fully utilize the advantages of end devices and cloud servers while considering privacy and security issues, a cloud-end collaborative training method named FedMNN (federated machine learning and mobile neural network) is proposed for recommender systems based on federated machine learning (FedML) and a mobile neural network (MNN). The proposed method is divided into three parts. First, cloud-based models implemented in various deep learning frameworks are converted into general MNN models for end-device training using the ONNX (open neural network exchange) intermediate framework and an MNN model conversion tool. Second, the cloud server sends the model to the end-side devices, which initialize it and obtain local data for training and loss calculation, followed by gradient back-propagation. Finally, the end-side models are fed back to the cloud server for model aggregation and updating. The cloud model is then deployed on end-side devices as required, achieving end-cloud collaboration. Experiments comparing the power consumption of the proposed FedMNN and FLTFlite (Flower and TensorFlow Lite) frameworks on benchmark tasks showed that the power consumption of FedMNN is 32% to 51% lower than that of FLTFlite. Experimental results with DSSM (deep structured semantic model) and deep and wide recommendation models demonstrated the effectiveness of the proposed cloud-end collaborative training method.
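The abstract does not say which aggregation rule the cloud server applies. As a hedged illustration of the model aggregation step, the sketch below uses sample-size-weighted averaging in the style of FedAvg, which is an assumption and not necessarily what FedMNN does. The function name `federated_average` and the flat-list representation of model weights are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples per client, used as
                    the averaging weight (FedAvg-style; assumed here).
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one non-empty weight vector per client size")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients; the second holds three times as much local data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Weighting by local sample count lets devices with more data influence the global model more; an unweighted mean would be the other simple choice.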

Acceleration technique for heterogeneous operators based on openGauss

Xiansen CHEN, Chen XU

2023, 2023 (5): 90-99. doi: 10.3969/j.issn.1000-5641.2023.05.008

The high parallelism and throughput of graphics processing units (GPUs) can improve the performance of online analytical processing (OLAP) queries in databases. However, openGauss currently cannot take advantage of heterogeneous computing hardware such as GPUs. Therefore, in this study, we explore using GPUs to accelerate OLAP processing in the system and achieve higher performance. The focus is on how to implement and optimize GPU acceleration modules for openGauss. To address the difference in execution granularity between openGauss and PostgreSQL, we propose a CPU (central processing unit)-GPU collaborative parallel solution based on chunked reading and key distribution. This solution reduces the I/O (input/output) time of the GPU Scan operator, and hence idle waiting time, and runs multiple instances of GPU Join to support multi-GPU environments. To address the architectural differences between openGauss and PostgreSQL, a heterogeneous operator acceleration technology compatible with vectorized engines is proposed. A custom operator framework is implemented that can embed a vectorized execution engine, and a vectorized GPU Scan operator capable of processing openGauss columnar data is built on this framework. A prototype system is implemented to verify the effectiveness of the proposed approach.

Automatic generation of Web front-end code based on UI images

Jin GE, Xuesong LU

2023, 2023 (5): 100-109. doi: 10.3969/j.issn.1000-5641.2023.05.009

User interfaces (UIs) play a vital role in the interactions between an application and its users. The current popularity of mobile Internet has led to the large-scale migration of web-based applications from desktop to mobile. Web front-end development has become more extensive and in-depth in application development. Traditional web front-end development relies on designers to give initial design drafts and then programmers to write the corresponding UI code. This method has high industry barriers and slow development, which are not conducive to rapid product iteration. The development of deep learning makes it possible to automatically generate web front-end code based on UI images. Existing methods poorly capture the features of UI images, and the accuracy of the generated code is low. To mitigate these problems, we propose an encoder–decoder model, called image2code, based on the Swin Transformer, which is used to generate web front-end code from UI images. Image2code regards the process of generating web front-end code from UI images as an image captioning task and uses Swin Transformer with a sliding window design as the backbone network of the encoder and decoder. The sliding window operation limits the attention calculation to one window, which reduces the amount of calculation by the attention mechanism while simultaneously ensuring that feature connections remain across windows. In addition, image2code generates Emmet code, which is much simpler and can be directly converted to HTML code, improving the efficiency of model training. Experimental results show that image2code performs better than existing representative models, such as pix2code and image2emmet, in the task of web front-end code generation on existing and newly constructed datasets.

Heterogeneous coding-based federated learning

Hongwei SHI, Daocheng HONG, Lianmin SHI, Yingyao YANG

2023, 2023 (5): 110-121. doi: 10.3969/j.issn.1000-5641.2023.05.010

In heterogeneous federated learning systems, which span a variety of edge devices such as personal computers and embedded devices, resource-constrained devices (stragglers) reduce the training efficiency of the whole system. This paper proposes a heterogeneous coded federated learning (HCFL) system to ① improve the training efficiency of the system and speed up the training of heterogeneous federated learning (FL) in the presence of multiple stragglers, and ② provide a certain level of data privacy protection. The HCFL scheme designs scheduling strategies from the perspectives of both client and server to accelerate model computation for multiple stragglers in a general environment. In addition, a linear coded computing (LCC) scheme is designed to provide data protection for task distribution. The experimental results show that HCFL can reduce training time by 89.85% when the performance difference between devices is large.

Parallel deep-forest-based abnormal traffic detection for power distribution communication networks

Zhenglei ZHOU, Jun CHEN, Juntao PAN, Peisen YUAN

2023, 2023 (5): 122-134. doi: 10.3969/j.issn.1000-5641.2023.05.011

With the continuous development of network attack methods, it is becoming increasingly difficult to protect the security of power communication networks. Currently, the detection accuracy of abnormal traffic in distribution communication networks is insufficient and the efficiency of abnormal traffic detection is low. To address these issues, a new method for abnormal traffic detection in distribution communication networks is proposed, in which feature extraction and traffic classification are improved. The proposed method utilizes a time-frequency domain feature extraction method, using an adaptive redundancy boosting multiwavelet packet transform to quickly extract frequency-domain features, while time-domain features are extracted using the communication characteristics of the distribution network. To improve traffic classification and detection, a parallel deep forest classification algorithm is proposed based on a distributed computing framework, and the training and classification task scheduling strategies are optimized. The experimental results show that the false alarm rate of the proposed method is only 2.63% and the accuracy rate for the detection of abnormal traffic in distribution networks is 98.29%.

Research on Autoformer-based electricity load forecasting and analysis

Litao TANG, Zhiyong ZHANG, Jun CHEN, Linna XU, Jiachen ZHONG, Peisen YUAN

2023, 2023 (5): 135-146. doi: 10.3969/j.issn.1000-5641.2023.05.012

Next-generation power grids are the main direction of future smart grid development, and the accurate prediction of power loads is an important basic task of smart grids. To improve the accuracy of load prediction in smart power systems, this work characterized the load dataset based on Autoformer, a prediction model with an autocorrelation mechanism; added a feature extraction layer to the original model; optimized the model parameters in terms of the number of coding layers, decoding layers, learning rate, and batch size; and achieved cycle-flexible load prediction. The experimental results show that the model predicts well, with a mean absolute error (MAE), mean squared error (MSE), and coefficient of determination of 0.2512, 0.1915, and 0.9832, respectively. Compared with other methods, this method produces better load prediction results.
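For readers unfamiliar with the three reported metrics, the sketch below computes them using their standard definitions on a tiny made-up example. The function names and sample values are illustrative only and are unrelated to the paper's dataset or results.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up loads and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
scores = (mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
```

Lower MAE and MSE indicate smaller errors, while a coefficient of determination closer to 1 indicates that the predictions explain more of the variance in the observed loads.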

Smoke detection based on spatial and frequency domain methods

Lianjun SHENG, Zhixuan TANG, Xiaoliang MAO, Fan BAI, Dingjiang HUANG

2023, 2023 (5): 147-163. doi: 10.3969/j.issn.1000-5641.2023.05.013

In industrial scenarios such as substations, video-based visual smoke detection has been adopted as a new environmental monitoring method to assist or replace smoke sensors. However, in industrial applications, visual smoke detection algorithms are required to maintain a low false detection rate while minimizing the missed detection rate. To address this, this study proposes a smoke detection algorithm based on spatial and frequency domain methods, which performs smoke detection in both domains. In the spatial domain, in addition to extracting smoke motion characteristics, this study designed a method for extracting smoke mask characteristics, which effectively ensures a low missed detection rate. In the frequency domain, this study combined filtering and neural network modules to further reduce the false detection rate. Finally, a fusion domain post-processing strategy was designed to obtain the final detection results. In experiments conducted on a test dataset, the smoke detection algorithm achieved a false detection rate of 0.053 and a missed detection rate of 0.113, demonstrating a good balance between false alarms and missed detections that makes it suitable for smoke detection in substation industrial scenes.

A multimode data management method based on Data Fabric

Xinjun ZHENG, Guoliang TIAN, Feihu HUANG

2023, 2023 (5): 164-181. doi: 10.3969/j.issn.1000-5641.2023.05.014

In the process of government and enterprise evolution, as information technology deepens from informatization into digitization, the data generated by various applications are becoming increasingly multimode, multisource, and massive, thereby posing new challenges to data management. To address these challenges, many new technologies and concepts have emerged in the field of data management. Data Fabric is a method that integrates distributed data storage, processing, and applications into a whole, providing a set of visual interfaces for management. First, we analyzed the technical architecture, characteristics, value, and complete process of managing and applying the multimode data of Data Fabric. Subsequently, we proposed anomaly monitoring methods based on time series indicators as well as log data for multimode and multisource data, whereby the processing speed improved by 33.3% and 42.2%, and F1 score improved by 12.2 pps (percentage points) and 14.8 pps, respectively, using Data Fabric technology. This further demonstrates the efficiency and application value of Data Fabric technology in the newly proposed methods.

Network security assessment based on hidden Markov and artificial immunization in new power systems

Zhi XU, Jun CHEN, Zhiyong ZHANG, Junling WAN, Peisen YUAN

2023, 2023 (5): 182-192. doi: 10.3969/j.issn.1000-5641.2023.05.015

Advanced metering infrastructure is an important component in the construction of new power systems; however, advanced metering systems rely on network information infrastructure and present major security issues. This study evaluates the network security posture of advanced metering systems by combining a hidden Markov model with an artificial immunity algorithm. First, a counting algorithm is used to obtain the security observation data in the power network. Subsequently, the Markov model is used to describe the change in the network security state, and the artificial immunity algorithm calculates the transfer probability matrix between different states. The state transfer matrix is then modified based on the state assessment error. The probability of being in each security state at each time is calculated and combined with the risk loss vector to obtain the final security posture value. The experiments demonstrate that the proposed method assesses security posture well and captures security defects in the system more accurately, which supports the safe, smooth, and reliable operation of advanced metering systems in the new grid environment.
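The final step described in this abstract, combining state probabilities with a risk loss vector, can be sketched as follows. This is a minimal illustration under stated assumptions: the transition matrix is simply given here (the paper derives it with an artificial immunity algorithm, which is not reproduced), and the two-state example, the function names, and all numbers are hypothetical.

```python
def propagate(dist, transition, steps):
    """Advance a state distribution through a Markov chain.

    dist:       probabilities over security states at time 0.
    transition: row-stochastic matrix, transition[i][j] = P(state j | state i).
    """
    for _ in range(steps):
        dist = [
            sum(dist[i] * transition[i][j] for i in range(len(dist)))
            for j in range(len(transition[0]))
        ]
    return dist


def posture(dist, risk_loss):
    """Expected risk: state probabilities weighted by per-state loss."""
    return sum(p * loss for p, loss in zip(dist, risk_loss))


# Hypothetical two-state example: state 0 = secure, state 1 = compromised.
transition = [[0.9, 0.1],
              [0.2, 0.8]]
risk_loss = [0.0, 10.0]
dist_t2 = propagate([1.0, 0.0], transition, steps=2)
value_t2 = posture(dist_t2, risk_loss)
```

In this toy setup, the posture value rises over time as probability mass moves toward the compromised state, which is the behavior a posture score is meant to capture.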

Identifying electricity theft based on residual network and depthwise separable convolution enhanced self attention

Zhishang DUAN, Yi RAN, Duliang LYU, Jie QI, Jiachen ZHONG, Peisen YUAN

2023, 2023 (5): 193-204. doi: 10.3969/j.issn.1000-5641.2023.05.016

Power theft seriously endangers power equipment and personal safety, and causes significant economic losses for energy suppliers. Hence, it is important for these suppliers to accurately identify instances of power theft to reduce losses and increase efficiency. In this paper, based on the residual network (ResNet) structure, a 2D convolutional neural network is combined with a depthwise separable convolution enhanced self-attention (DSCAttention) mechanism to increase the number of correctly classified electricity theft users. In addition, electricity theft data often contain missing values, outliers, and an imbalance between positive and negative samples. These three problems are treated using the zero-completion method, quantile transformation, and hierarchical splitting method, respectively. The proposed model has been extensively tested on real power theft datasets. The results show that the area under the curve (AUC) of the proposed model reaches 91.92%, while the mean average precision values MAP@100 and MAP@200 reach 98.58% and 96.77%, respectively. Compared with other electricity theft classification models, the proposed model performs the classification task better. The method in this paper can be extended to intelligent identification of electricity theft.
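Two of the preprocessing steps named in this abstract can be sketched in a few lines. The sketch assumes that "zero-completion" means replacing missing readings with 0 and that the "hierarchical splitting method" is a stratified train/test split that preserves the class ratio; both readings are assumptions, and the function names are hypothetical. Quantile transformation is omitted.

```python
import math
import random


def zero_fill(series):
    """Replace missing readings (None or NaN) with 0.0."""
    return [
        0.0 if v is None or (isinstance(v, float) and math.isnan(v)) else float(v)
        for v in series
    ]


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into train/test so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


readings = zero_fill([1.0, None, float("nan"), 2])
labels = [0] * 8 + [1] * 2  # imbalanced: 8 normal users, 2 theft users
train_idx, test_idx = stratified_split(labels, test_fraction=0.5)
```

Splitting per class keeps the rare theft samples represented in both partitions, which a plain random split on such imbalanced data does not guarantee.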
