Table of Contents

    25 September 2021, Volume 2021 Issue 5
    Financial Knowledge Graph
    Research progress in Chinese named entity recognition in the financial field
    Qiurong XU, Peng ZHU, Yifeng LUO, Qiwen DONG
    2021, 2021 (5):  1-13.  doi: 10.3969/j.issn.1000-5641.2021.05.001

    As one of the basic components of natural language processing, named entity recognition (NER) has been an active area of research both domestically in China and abroad. With the rapid development of financial applications, Chinese NER has improved over time and been applied successfully throughout the financial industry. This paper provides a summary of the current state of research and future development trends for Chinese NER methods in the financial field. Firstly, the paper introduces concepts related to NER and the characteristics of Chinese NER in the financial field. Then, based on the development process, the paper provides an overview of detailed characteristics and typical models for dictionary and rule-based methods, statistical machine learning-based methods, and deep learning-based methods. Next, the paper summarizes public data collection tools, evaluation methods, and applications of Chinese NER in the financial industry. Finally, the paper explores current challenges and future development trends.

    Data augmentation technology for named entity recognition
    Xiaoqin MA, Xiaohe GUO, Yufeng XUE, Lin YANG, Yuanzhe CHEN
    2021, 2021 (5):  14-23.  doi: 10.3969/j.issn.1000-5641.2021.05.002

    Named entity recognition (NER) is the task of extracting instances of named entities from continuous natural language text. It plays an important role in information extraction and is closely related to other information extraction tasks. In recent years, deep learning methods have been widely used for named entity recognition and have achieved good performance. The most common named entity recognition models use sequence tagging, which relies on the availability of a high-quality annotated corpus. However, the annotation cost of sequence data is high; this leads to small training sets and, in turn, seriously limits the final performance of named entity recognition models. To enlarge training sets for named entity recognition without increasing the associated labor cost, this paper proposes a data augmentation method for named entity recognition based on EDA, distant supervision, and bootstrap. Experiments on the FIND-2019 dataset show that the proposed data augmentation techniques, and combinations thereof, can significantly improve the overall performance of named entity recognition models.

    Joint extraction of entities and relations for domain knowledge graph
    Rui FU, Jianyu LI, Jiahui WANG, Kun YUE, Kuang HU
    2021, 2021 (5):  24-36.  doi: 10.3969/j.issn.1000-5641.2021.05.003

    Extraction of entities and relationships from text data is used to construct and update domain knowledge graphs. In this paper, we propose a method to jointly extract entities and relations by incorporating the concept of active learning; the proposed method addresses two problems that traditional approaches face on financial technology domain text: the overlap of vertical domain data and the lack of labeled samples. First, we incrementally select informative samples as training data. Next, we transform the joint extraction of entities and relations into a sequence labeling problem by labeling the main entities. Finally, we perform the joint extraction using an improved BERT-BiGRU-CRF model to construct a knowledge graph, which facilitates financial analysis, investment, and transaction operations based on domain knowledge and thereby reduces investment risks. Experimental results on financial text data show the effectiveness of the proposed method and verify that it can be successfully used to construct financial knowledge graphs.

    Key Technologies for System
    Optimization of LSM-tree storage systems based on non-volatile memory
    Yang YU, Huiqi HU, Xuan ZHOU
    2021, 2021 (5):  37-47.  doi: 10.3969/j.issn.1000-5641.2021.05.004

    With the advent of the big data era, the financial industry has been generating increasing volumes of data, exerting pressure on database systems. LevelDB is a key-value database, developed by Google, based on the LSM-tree architecture. It offers fast writes and a small footprint, and is widely used in the financial industry. In this paper, we propose a design for the L0 layer, based on non-volatile memory and machine learning, with the aim of addressing the shortcomings of the LSM-tree architecture, including write stalls, write amplification, and poor read performance. The proposed solution can mitigate or even eliminate these problems; the experimental results demonstrate that the design achieves better read and write performance.

    Erasure code partition storage based on the CITA blockchain
    Furong YIN, Chengyu ZHU, Bin ZHAO, Zhao ZHANG
    2021, 2021 (5):  48-59.  doi: 10.3969/j.issn.1000-5641.2021.05.005

    Blockchain systems adopt a full-replication storage mechanism in which every node retains a complete copy of the whole blockchain; as a result, the scalability of the system is poor. Because Byzantine nodes may exist in a blockchain system, the sharding schemes used in traditional distributed systems cannot be directly applied. In this paper, the storage consumption of each block is reduced from O(n) to O(1) by combining erasure coding with a Byzantine fault-tolerant algorithm, which enhances the scalability of the system. This paper proposes a method to partition block data that reduces storage redundancy with little impact on query efficiency. A coded-block storage method that requires no network communication is proposed to reduce the system's storage and communication overhead. In addition, a dynamic re-encoding method for nodes joining and leaving the blockchain is proposed, which both ensures the reliability of the system and reduces the re-encoding overhead. Finally, the system is implemented on the open source blockchain system CITA, and extensive experiments show that it improves scalability, availability, and storage efficiency.
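
    The O(n) to O(1) storage claim in this abstract can be illustrated with simple arithmetic. The sketch below is not the paper's implementation; it assumes, purely for illustration, that each block is split into k = n - 2f data chunks (with f the number of tolerated Byzantine nodes) and that every node stores exactly one chunk. The function name and the choice of k are hypothetical.

```python
def storage_per_block(n: int, block_size: float) -> dict:
    """Compare total storage for one block under full replication and erasure coding.

    Assumptions (illustrative, not taken from the paper):
      - f = floor((n - 1) / 3) Byzantine nodes are tolerated;
      - the block is encoded into n chunks, any k = n - 2f of which rebuild it;
      - each node keeps exactly one chunk of size block_size / k.
    """
    f = (n - 1) // 3          # standard BFT bound on faulty nodes
    k = n - 2 * f             # hypothetical number of data chunks
    chunk_size = block_size / k
    return {
        "f": f,
        "k": k,
        "full_replication": n * block_size,  # grows linearly with n
        "erasure_coded": n * chunk_size,     # stays below 3 * block_size
    }


if __name__ == "__main__":
    for n in (4, 16, 100):
        print(n, storage_per_block(n, 1.0))
```

    Under these assumptions the total storage for a block stays below three times the block size no matter how many nodes join, whereas full replication grows linearly with n.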

    Blockchain-oriented data management middleware
    Sijia DENG, Xing TONG, Haibo TANG, Zhao ZHANG, Cheqing JIN
    2021, 2021 (5):  60-73.  doi: 10.3969/j.issn.1000-5641.2021.05.006

    As a decentralized distributed ledger, blockchain technology is widely used to share data between untrusted parties. Compared with traditional databases that have been refined over many years, blockchains cannot support rich queries, are limited to single query interfaces, and suffer from slow response. Simple organizational structures and discrete storage limits are the main barriers that limit the expression of transaction data. In order to make up for the shortcomings of existing blockchain technology and achieve efficient application development, users can build abstract models, encapsulate easy-to-use interfaces, and improve the efficiency of queries. We also propose a general data management middleware for blockchain, which has the following characteristics: ① Support for custom construction of data models and the flexibility to add new abstractions to transaction data; ② Provide multiple data access interfaces to support rich queries and use optimization methods such as synchronous caching mechanisms to improve query efficiency; ③ Design advance hash calculation and asynchronous batch processing strategies to optimize transaction latency and throughput. We integrated the proposed data management middleware with the open source blockchain CITA and verified its ease of use and efficiency through experiments.

    A fuzzer for query processing functionality of OLAP databases
    Zhaokun XIANG, Ting CHEN, Qian SU, Rong ZHANG
    2021, 2021 (5):  74-83.  doi: 10.3969/j.issn.1000-5641.2021.05.007

    Query processing, including optimization and execution, is one of the most critical functionalities of modern relational database management systems (DBMS). The complexity of query processing functionalities, however, leads to high testing costs. It hinders rapid iterations during the development process and can lead to severe errors when deployed in production environments. In this paper, we propose a tool to better serve the testing and evaluation of DBMS query processing functionalities; the tool uses a fuzzing approach to generate random data that is highly associated with primary keys and generates valid complex analytical queries. The tool constructs constrained optimization problems to efficiently compute the exact cardinalities of operators in queries and furnish the results. We launched small-scale testing of our method on different versions of TiDB and demonstrated that the tool can effectively detect bugs in different versions of TiDB.

    Partition-based concurrency control in a multi-master database
    Wenxin LIU, Peng CAI
    2021, 2021 (5):  84-93.  doi: 10.3969/j.issn.1000-5641.2021.05.008

    In the era of big data, the single-write multi-read process with separate storage and computing architectures can no longer meet the demands for efficient reading and writing of massive datasets. Multiple computing nodes providing write services concurrently can also cause cache inconsistencies. Some studies have proposed a global ordered transaction log to detect conflicts and maintain data consistency for the whole system using broadcast and playback of the transaction log. However, this scheme has poor scalability because it maintains the global write log at each write node. To solve this problem, this paper proposes a partition-based concurrency control scheme, which reduces the transaction log maintained by each write node by partitioning, and effectively improves the system’s overall expansion ability.

    Query optimization technology based on an LSM-tree
    Jiabo SUN, Peng CAI
    2021, 2021 (5):  94-103.  doi: 10.3969/j.issn.1000-5641.2021.05.009

    Given challenges with poor query performance for databases using LSM-trees, the present research explores the use of index and cache technologies to improve the query performance of LSM-trees. First, the paper introduces the basic structure of an LSM-tree and analyzes the factors that affect query performance. Second, we analyze current query optimization technologies for LSM-trees, including index optimization technology and cache optimization technology. Third, we analyze how index and cache, in particular, can improve the query performance of databases using LSM-trees and summarize existing research in this area. Finally, we present possible avenues for further research.

    Data Analysis and Applications
    Electricity theft detection based on t-LeNet and time series classification
    Xiaoqin MA, Xiaohui XUE, Hongjiao LUO, Tongyu LIU, Peisen YUAN
    2021, 2021 (5):  104-114.  doi: 10.3969/j.issn.1000-5641.2021.05.010

    Electricity theft results in significant losses in both electric energy and economic benefits for electric power enterprises. This paper proposes a method to detect electricity theft based on t-LeNet and time series classification. First, a user’s power consumption time series data is obtained, and down-sampling is used to generate a training set. A t-LeNet neural network is then trained to predict whether the user exhibits behavior indicative of electricity theft. Lastly, real user power consumption data from the state grid is used to conduct experiments. The results show that, compared with time series classification methods based on Time-CNN (Time Convolutional Neural Network) and MLP (Multi-Layer Perceptron), the proposed method offers improvements in the comprehensive evaluation index, accuracy rate, and recall rate. Hence, the proposed method can successfully detect electricity theft.

    Survey of early time series classification methods
    Mengchen YANG, Xudong CHEN, Peng CAI, Lyu NI
    2021, 2021 (5):  115-133.  doi: 10.3969/j.issn.1000-5641.2021.05.011

    With the increasing popularity of sensors, time-series data have attracted significant attention. Early time series classification (ETSC) aims to classify time-series data with the highest possible accuracy using the shortest possible portion of the series. ETSC, in particular, plays a critical role in fintech. First, this paper summarizes the common classifiers for time-series data and reviews the current research progress on minimum prediction length-based, shapelet-based, and model-based ETSC frameworks. For each framework, we discuss the pivotal technologies, advantages, and disadvantages of the representative ETSC methods. Next, we review public time-series datasets in fintech and commonly used performance evaluation criteria. Lastly, we explore future research directions pertinent to ETSC.

    YOLO-S: A new lightweight helmet wearing detection model
    Hongcheng ZHAO, Xiuxia TIAN, Zesen YANG, Wanrong BAI
    2021, 2021 (5):  134-145.  doi: 10.3969/j.issn.1000-5641.2021.05.012

    Traditional worker helmet wearing detection models commonly used at construction sites suffer from long processing times and high hardware requirements; moreover, the limited number of training data sets available for complex and changing environments contributes to poor model robustness. In this paper, we propose a lightweight helmet wearing detection model, named YOLO-S, to address these challenges. First, to handle unbalanced data set categories, a hybrid scene data augmentation method is used to balance the categories and improve the robustness of the model in complex construction environments; the original YOLOv5s backbone network is replaced with MobileNetV2, which reduces the computational complexity of the network. Second, the model is compressed: a scaling factor is introduced in the BN layer for sparse training, the importance of each channel is judged, and redundant channels are pruned, which further reduces the volume of inference computation and increases the overall detection speed. Finally, YOLO-S is obtained by fine-tuning the auxiliary model with knowledge distillation. The experimental results show that, compared with YOLOv5s, the recall rate of YOLO-S is increased by 1.9% and the mAP by 1.4%, while the number of model parameters is compressed to 1/3, the model volume to 1/4, and the FLOPs to 1/3; the inference speed is faster than that of other models, and the portability is higher.
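
    The channel-pruning step described in this abstract (judging channel importance from BN scaling factors and pruning redundant channels) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the global-threshold rule, the function name `select_channels`, and the `prune_ratio` parameter are assumptions made for the example.

```python
from typing import List


def select_channels(gammas: List[List[float]], prune_ratio: float) -> List[List[bool]]:
    """Return a keep-mask per layer from BN scaling factors (gamma).

    Assumption for illustration: channels are ranked globally by |gamma|,
    and those below the value found at the prune_ratio position are pruned.
    At least one channel is always kept in every layer.
    """
    ranked = sorted(abs(g) for layer in gammas for g in layer)
    cut = int(len(ranked) * prune_ratio)
    threshold = ranked[cut] if cut < len(ranked) else float("inf")

    masks = [[abs(g) >= threshold for g in layer] for layer in gammas]

    # Guard against removing a whole layer: keep its strongest channel.
    for layer, mask in zip(gammas, masks):
        if layer and not any(mask):
            strongest = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[strongest] = True
    return masks


if __name__ == "__main__":
    # Two hypothetical BN layers after sparse training.
    gammas = [[0.9, 0.01, 0.5], [0.02, 0.03, 0.8]]
    print(select_channels(gammas, prune_ratio=0.5))
```

    In a real network the kept channels would then be copied into a smaller model and fine-tuned; that step is omitted here.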

    Adaptive competitive equilibrium optimizer for power system customer classification
    Sida ZHENG, Yan LIU, Xiaokun YANG, Chengfei QI, Peisen YUAN
    2021, 2021 (5):  146-156.  doi: 10.3969/j.issn.1000-5641.2021.05.013

    Accurate classification of power system customers can enable differentiated management and personalized services for customers. In order to address the challenges associated with accurate customer classification, this paper proposes a classification method based on an equilibrium optimizer and an extreme learning machine. In this method, an adaptive competition mechanism is proposed to balance the global exploration and local exploitation abilities of the equilibrium optimizer, improving its performance in finding optimal solutions. Thereafter, the proposed equilibrium optimizer is integrated with an extreme learning machine to classify the customers of a power system. Experiments on real data sets showed that the proposed algorithm integrated with an extreme learning machine achieves better results on different classification indexes; hence, the proposed method can provide an effective technical means for power system customer management and service.

    Query processing of large-scale product knowledge in a CPU-GPU heterogeneous environment
    Chuangxin FANG, Hao SONG, Yuming LIN, Ya ZHOU
    2021, 2021 (5):  157-168.  doi: 10.3969/j.issn.1000-5641.2021.05.014

    Knowledge graphs are an effective way to structurally represent and organize unstructured knowledge; in fact, these graphs are commonly used to support many intelligent applications. However, product-related knowledge is typically massive in scale, heterogeneous, and hierarchical; these characteristics present a challenge for traditional knowledge query processing methods based on relational and graph models. In this paper, we propose a solution to address these challenges by designing and implementing a product knowledge query processing method using CPU and GPU collaborative computing. Firstly, in order to leverage the full parallel computing capability of the GPU, a product knowledge storage strategy based on a sparse matrix is proposed and optimized for the scale of the task. Secondly, based on the storage structure of the sparse matrix, a query conversion method is designed, which transforms a SPARQL query into a corresponding matrix calculation and extends the join query algorithm to the GPU for acceleration. In order to verify the effectiveness of the proposed method, we conducted a series of experiments on the LUBM dataset and a semisynthetic dataset of products. The experimental results showed that the proposed method not only improves retrieval efficiency for large-scale product knowledge datasets compared with existing RDF query engines, but also achieves better retrieval performance on a general RDF standard dataset.

    Power marketing big data access control scheme based on a multi-level strategy
    Yue ZHANG, Xiuxia TIAN, Yuncheng YAN, Guanyu LU
    2021, 2021 (5):  169-184.  doi: 10.3969/j.issn.1000-5641.2021.05.015

    With the rapid proliferation of technology, the degree of informatization in the financial industry continues to increase. The integration of financial data with power marketing platforms, moreover, is accelerating the interaction between users and power marketing platform data (e.g., basic customer details, energy metering data, electricity fee recovery data). The increased interaction, however, raises the risk of data leakage during transmission, which can result in incorrect formulation of power usage strategies and electricity prices. Therefore, to satisfy the security requirements for data interaction in power marketing systems and ensure economic benefits for the power company, we propose an Ordered Binary Decision Diagram (OBDD) based on Ciphertext Policy Attribute Based Encryption (CP-ABE). This multi-level access approach can reduce the autonomy of shared data authority control in the remote terminal unit and improve the efficiency of data access. In addition, security and performance analysis shows that the proposed access control scheme is both more efficient and more secure than other schemes.

    Research on multi-objective cargo allocation based on an improved genetic algorithm
    Ping YU, Huiqi HU, Weining QIAN
    2021, 2021 (5):  185-198.  doi: 10.3969/j.issn.1000-5641.2021.05.016

    In this paper, we propose a mathematical model to solve the multi-objective cargo allocation problem with greater stability and efficiency; the model for cargo allocation maximizes the total cargo weight, minimizes the total number of trips, minimizes the number of cargo loading and unloading points, and offers fast convergence based on the elitism genetic algorithm (FEGA). First, a hierarchical structure with the Pareto dominance relation and an elitism retention strategy were added on the basis of the genetic algorithm. This helped to improve the population diversity while accelerating the local search ability of the algorithm. Then, the random structure of the initial population was modified, and a double population strategy was designed. An adaptive operation was subsequently added to sequentially improve the global search ability of the algorithm and accelerate the convergence speed of the population. Based on the new algorithm, real cargo data were used to demonstrate the feasibility and optimization potential of the new method. The results show that compared with the traditional genetic algorithm, the proposed algorithm has a better optimization effect in solving the cargo allocation process with strong constraints and a large search space; the search performance and convergence, moreover, are also improved.
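
    The Pareto dominance relation mentioned in this abstract can be made concrete for the three stated objectives (maximize total cargo weight, minimize trips, minimize loading and unloading points). The sketch below is only an illustration of the dominance test and non-dominated filtering; the `Plan` type and its field names are hypothetical, and it does not reproduce the paper's FEGA algorithm.

```python
from typing import List, NamedTuple, Tuple


class Plan(NamedTuple):
    """A candidate allocation, described only by its three objective values."""
    weight: float  # total cargo weight (to maximize)
    trips: int     # total number of trips (to minimize)
    stops: int     # loading/unloading points (to minimize)


def as_costs(p: Plan) -> Tuple[float, int, int]:
    # Negate weight so that every objective becomes "smaller is better".
    return (-p.weight, p.trips, p.stops)


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is no worse than b on all objectives and better on at least one."""
    ca, cb = as_costs(a), as_costs(b)
    return all(x <= y for x, y in zip(ca, cb)) and any(x < y for x, y in zip(ca, cb))


def pareto_front(plans: List[Plan]) -> List[Plan]:
    """Keep the plans that no other plan dominates (the first non-dominated layer)."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


if __name__ == "__main__":
    candidates = [Plan(100, 3, 2), Plan(90, 3, 2), Plan(100, 2, 4), Plan(80, 5, 5)]
    print(pareto_front(candidates))
```

    Repeating the filter on the remaining plans would yield the successive layers of a hierarchical (non-dominated sorting) structure of the kind the abstract describes.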
