Journal of East China Normal University(Natural Science)

Research progress in Chinese named entity recognition in the financial field

Qiurong XU, Peng ZHU, Yifeng LUO, Qiwen DONG

2021, 2021 (5): 1-13. doi: 10.3969/j.issn.1000-5641.2021.05.001

As one of the basic components of natural language processing, named entity recognition (NER) has been an active area of research both domestically in China and abroad. With the rapid development of financial applications, Chinese NER has improved over time and been applied successfully throughout the financial industry. This paper provides a summary of the current state of research and future development trends for Chinese NER methods in the financial field. Firstly, the paper introduces concepts related to NER and the characteristics of Chinese NER in the financial field. Then, based on the development process, the paper provides an overview of detailed characteristics and typical models for dictionary and rule-based methods, statistical machine learning-based methods, and deep learning-based methods. Next, the paper summarizes public data collection tools, evaluation methods, and applications of Chinese NER in the financial industry. Finally, the paper explores current challenges and future development trends.

Data augmentation technology for named entity recognition

Xiaoqin MA, Xiaohe GUO, Yufeng XUE, Lin YANG, Yuanzhe CHEN

2021, 2021 (5): 14-23. doi: 10.3969/j.issn.1000-5641.2021.05.002

Named entity recognition is the task of extracting instances of named entities from continuous natural language text. It plays an important role in information extraction and is closely related to other information extraction tasks. In recent years, deep learning methods have been widely used for named entity recognition and have achieved good performance. The most common named entity recognition models use sequence tagging, which relies on the availability of a high-quality annotated corpus. However, the annotation cost of sequence data is high; this leads to small training sets and, in turn, seriously limits the final performance of named entity recognition models. To enlarge training sets for named entity recognition without increasing the associated labor cost, this paper proposes a data augmentation method based on EDA, distant supervision, and bootstrap. Experiments on the FIND-2019 dataset show that the proposed data augmentation techniques, and combinations thereof, can significantly improve the overall performance of named entity recognition models.
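
One simple form of augmentation for sequence-tagged data can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: each entity mention is swapped for another mention of the same type, and the BIO tags are regenerated so that labels stay aligned with tokens. The function name, the toy lexicon (standing in for a dictionary such as distant supervision might provide), and the example sentence are all hypothetical.

```python
import random

def replace_mentions(tokens, tags, lexicon, rng):
    """Swap each entity span for a random same-type mention, keeping BIO tags aligned."""
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the current entity span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            original = tokens[i:j]
            # Fall back to the original mention if the type is not in the lexicon.
            mention = rng.choice(lexicon.get(etype, [original]))
            new_tokens.extend(mention)
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(mention) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags

# Hypothetical toy lexicon and sentence (assumptions for illustration only).
LEXICON = {"ORG": [["Acme", "Bank"], ["Globex"]]}
tokens = ["Shares", "of", "Initech", "Corp", "rose"]
tags = ["O", "O", "B-ORG", "I-ORG", "O"]
aug_tokens, aug_tags = replace_mentions(tokens, tags, LEXICON, random.Random(0))
```

Regenerating the tags from the length of the replacement mention, rather than copying the old ones, is what keeps the augmented sample valid when the new mention has a different number of tokens.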

Joint extraction of entities and relations for domain knowledge graph

Rui FU, Jianyu LI, Jiahui WANG, Kun YUE, Kuang HU

2021, 2021 (5): 24-36. doi: 10.3969/j.issn.1000-5641.2021.05.003

Extraction of entities and relations from text data is used to construct and update domain knowledge graphs. In this paper, we propose a method to jointly extract entities and relations by incorporating the concept of active learning; the proposed method addresses two problems that traditional approaches encounter with financial technology text data: the overlap of vertical domain data and the lack of labeled samples. First, we select informative samples incrementally as training data sets. Next, we transform the joint extraction of entities and relations into a sequence labeling problem by labeling the main entities. Finally, we perform the joint extraction using an improved BERT-BiGRU-CRF model to construct a knowledge graph, which facilitates financial analysis, investment, and transaction operations based on domain knowledge and thereby reduces investment risks. Experimental results on financial text data show the effectiveness of the proposed method and verify that it can be successfully used to construct financial knowledge graphs.

Optimization of LSM-tree storage systems based on non-volatile memory

Yang YU, Huiqi HU, Xuan ZHOU

2021, 2021 (5): 37-47. doi: 10.3969/j.issn.1000-5641.2021.05.004

With the advent of the big data era, the financial industry has been generating increasing volumes of data, exerting pressure on database systems. LevelDB is a key-value database, developed by Google, based on the LSM-tree architecture. It offers fast writes and a small footprint, and is widely used in the financial industry. In this paper, we propose a design method for the L₀ layer, based on non-volatile memory and machine learning, with the aim of addressing the shortcomings of the LSM-tree architecture, including write pauses, write amplification, and inefficient reads. The proposed solution can mitigate or even eliminate these problems; the experimental results demonstrate that the design achieves better read and write performance.

Erasure code partition storage based on the CITA blockchain

Furong YIN, Chengyu ZHU, Bin ZHAO, Zhao ZHANG

2021, 2021 (5): 48-59. doi: 10.3969/j.issn.1000-5641.2021.05.005

Blockchain systems adopt a full-replication storage mechanism in which every node retains a complete copy of the entire chain; as a result, the scalability of the system is poor. Because Byzantine nodes may exist in a blockchain system, the sharding schemes used in traditional distributed systems cannot be applied directly. In this paper, the storage consumption of each block is reduced from O(n) to O(1) by combining erasure codes with a Byzantine fault-tolerant algorithm, which enhances the scalability of the system. The paper proposes a method for partitioning block data that reduces storage redundancy with little impact on query efficiency. A coded-block storage method that requires no network communication is proposed to reduce storage and communication overhead. In addition, a dynamic recoding method for nodes joining and leaving the blockchain is proposed, which ensures the reliability of the system while reducing recoding overhead. Finally, the system is implemented on the open source blockchain system CITA, and experiments show that it improves scalability, availability, and storage efficiency.
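
The storage saving that erasure coding offers over full replication can be illustrated with the simplest possible code, a single XOR parity shard. This is a sketch under stated assumptions, not the scheme in the paper: it tolerates the loss of only one shard, does not model Byzantine behavior, and uses hypothetical function names and block contents. Production systems typically use codes such as Reed-Solomon that tolerate several losses.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(block, k):
    """Split a block into k equal data shards plus one XOR parity shard."""
    shard_len = -(-len(block) // k)  # ceiling division
    padded = block.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]

def recover(shards, missing):
    """Rebuild the shard at index `missing` by XOR-ing all the other shards."""
    others = [s for i, s in enumerate(shards) if i != missing]
    return reduce(xor_bytes, others)

# Hypothetical block contents; 4 data shards + 1 parity shard, one per node.
block = b"tx1;tx2;tx3;tx4;tx5;tx6"
shards = encode(block, k=4)
rebuilt = recover(shards, missing=2)  # the node holding shard 2 is unavailable
```

Here each of the five nodes stores a quarter of the (padded) block instead of a full copy, so the total stored is 5/4 of the block size rather than 5 times the block size.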

Blockchain-oriented data management middleware

Sijia DENG, Xing TONG, Haibo TANG, Zhao ZHANG, Cheqing JIN

2021, 2021 (5): 60-73. doi: 10.3969/j.issn.1000-5641.2021.05.006

As a decentralized distributed ledger, blockchain technology is widely used to share data between untrusted parties. Compared with traditional databases that have been refined over many years, blockchains cannot support rich queries, are limited to single query interfaces, and suffer from slow response. Simple organizational structures and discrete storage are the main barriers that limit the expression of transaction data. To make up for the shortcomings of existing blockchain technology and achieve efficient application development, users need to be able to build abstract models, encapsulate easy-to-use interfaces, and improve the efficiency of queries. We therefore propose a general data management middleware for blockchain, which has the following characteristics: ① it supports custom construction of data models and offers the flexibility to add new abstractions to transaction data; ② it provides multiple data access interfaces to support rich queries and uses optimization methods such as synchronous caching mechanisms to improve query efficiency; ③ it uses advance hash calculation and asynchronous batch processing strategies to optimize transaction latency and throughput. We integrated the proposed data management middleware with the open source blockchain CITA and verified its ease of use and efficiency through experiments.

A fuzzer for query processing functionality of OLAP databases

Zhaokun XIANG, Ting CHEN, Qian SU, Rong ZHANG

2021, 2021 (5): 74-83. doi: 10.3969/j.issn.1000-5641.2021.05.007

Query processing, including optimization and execution, is one of the most critical functionalities of modern relational database management systems (DBMS). The complexity of query processing functionalities, however, leads to high testing costs. It hinders rapid iterations during the development process and can lead to severe errors when deployed in production environments. In this paper, we propose a tool to better serve the testing and evaluation of DBMS query processing functionalities; the tool uses a fuzzing approach to generate random data that is highly associated with primary keys and generates valid complex analytical queries. The tool constructs constrained optimization problems to efficiently compute the exact cardinalities of operators in queries and furnish the results. We launched small-scale testing of our method on different versions of TiDB and demonstrated that the tool can effectively detect bugs in different versions of TiDB.

Partition-based concurrency control in a multi-master database

Wenxin LIU, Peng CAI

2021, 2021 (5): 84-93. doi: 10.3969/j.issn.1000-5641.2021.05.008

In the era of big data, the single-write multi-read process with separate storage and computing architectures can no longer meet the demands for efficient reading and writing of massive datasets. Multiple computing nodes providing write services concurrently can also cause cache inconsistencies. Some studies have proposed a global ordered transaction log to detect conflicts and maintain data consistency for the whole system using broadcast and playback of the transaction log. However, this scheme has poor scalability because it maintains the global write log at each write node. To solve this problem, this paper proposes a partition-based concurrency control scheme, which uses partitioning to reduce the transaction log maintained by each write node and effectively improves the overall scalability of the system.

Query optimization technology based on an LSM-tree

Jiabo SUN, Peng CAI

2021, 2021 (5): 94-103. doi: 10.3969/j.issn.1000-5641.2021.05.009

Given challenges with poor query performance for databases using LSM-trees, the present research explores the use of index and cache technologies to improve the query performance of LSM-trees. First, the paper introduces the basic structure of an LSM-tree and analyzes the factors that affect query performance. Second, we analyze current query optimization technologies for LSM-trees, including index optimization technology and cache optimization technology. Third, we analyze how index and cache, in particular, can improve the query performance of databases using LSM-trees and summarize existing research in this area. Finally, we present possible avenues for further research.
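
Why point lookups in an LSM-tree can be slow is easiest to see in a toy model. The sketch below is an illustration under stated assumptions, not any system surveyed in the paper: the class name, the flush threshold, and the use of plain dictionaries for immutable runs are hypothetical simplifications, and compaction, Bloom filters, and caches are omitted. A read checks the in-memory table first and then each flushed run from newest to oldest, so a key that lives in an old run costs several probes.

```python
class TinyLSM:
    """Toy LSM-tree: one mutable memtable plus immutable runs, newest first."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable as a sorted run and start a new one.
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Return (value, number of structures probed)."""
        probes = 1
        if key in self.memtable:
            return self.memtable[key], probes
        for run in self.runs:  # newest run first, so the latest value wins
            probes += 1
            if key in run:
                return run[key], probes
        return None, probes

db = TinyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 9)]:
    db.put(k, v)
```

In this run, `db.get("a")` is answered from the memtable in one probe, while `db.get("b")` has to fall through to the oldest run; indexes and caches of the kind the paper surveys aim to cut down those extra probes.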

Electricity theft detection based on t-LeNet and time series classification

Xiaoqin MA, Xiaohui XUE, Hongjiao LUO, Tongyu LIU, Peisen YUAN

2021, 2021 (5): 104-114. doi: 10.3969/j.issn.1000-5641.2021.05.010

Electricity theft results in significant losses in both electric energy and economic benefits for electric power enterprises. This paper proposes a method to detect electricity theft based on t-LeNet and time series classification. First, a user’s power consumption time series data is obtained, and down-sampling is used to generate a training set. A t-LeNet neural network is then trained and used to predict whether the user exhibits behavior indicative of electricity theft. Lastly, experiments are conducted on real user power consumption data from the state grid. The results show that, compared with time series classification methods based on Time-CNN (Time Convolutional Neural Network) and MLP (Multi-Layer Perceptron), the proposed method offers improvements in the comprehensive evaluation index, accuracy rate, and recall rate. Hence, the proposed method can successfully detect electricity theft.

Survey of early time series classification methods

Mengchen YANG, Xudong CHEN, Peng CAI, Lyu NI

2021, 2021 (5): 115-133. doi: 10.3969/j.issn.1000-5641.2021.05.011

With the increasing popularity of sensors, time-series data have attracted significant attention. Early time series classification (ETSC) aims to classify time-series data with the highest possible accuracy using as little of the series as possible. ETSC, in particular, plays a critical role in fintech. First, this paper summarizes the common classifiers for time-series data and reviews the current research progress on minimum prediction length-based, shapelet-based, and model-based ETSC frameworks. It then discusses the key technologies, advantages, and disadvantages of the representative ETSC methods in each framework. Next, we review public time-series datasets in fintech and commonly used performance evaluation criteria. Lastly, we explore future research directions pertinent to ETSC.

YOLO-S: A new lightweight helmet wearing detection model

Hongcheng ZHAO, Xiuxia TIAN, Zesen YANG, Wanrong BAI

2021, 2021 (5): 134-145. doi: 10.3969/j.issn.1000-5641.2021.05.012

Traditional worker helmet wearing detection models commonly used at construction sites suffer from long processing times and high hardware requirements; the limited number of available training data sets for complex and changing environments, moreover, contributes to poor model robustness. In this paper, we propose a lightweight helmet wearing detection model, named YOLO-S, to address these challenges. First, for the case of unbalanced data set categories, a hybrid scene data augmentation method is used to balance the categories and improve the robustness of the model in complex construction environments; the original YOLOv5s backbone network is replaced with MobileNetV2, which reduces the computational complexity of the network. Second, the model is compressed: a scaling factor is introduced in the BN layer for sparse training, the importance of each channel is judged, and redundant channels are pruned, which further reduces the volume of inference calculations and increases the overall detection speed. Finally, YOLO-S is obtained by fine-tuning the auxiliary model for knowledge distillation. The experimental results show that, compared with YOLOv5s, the recall rate of YOLO-S is increased by 1.9% and the mAP by 1.4%, while the number of model parameters is compressed to 1/3, the model volume to 1/4, and the FLOPs to 1/3; the inference speed is faster than that of other models, and the portability is higher.
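
The channel selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes channel importance is measured by the magnitude of the BN scaling factor and that a fixed fraction of channels is pruned. The function name, the example scaling factors, and the prune ratio are hypothetical.

```python
def select_channels(gammas, prune_ratio):
    """Return indices of channels to keep, pruning those with the smallest |gamma|."""
    n_prune = int(len(gammas) * prune_ratio)
    # Rank channels by the magnitude of their BN scaling factor, smallest first.
    order = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    pruned = set(order[:n_prune])
    return [i for i in range(len(gammas)) if i not in pruned]

# Hypothetical scaling factors after sparse training (assumed values).
gammas = [0.91, 0.02, -0.40, 0.001, 0.75, -0.03]
keep = select_channels(gammas, prune_ratio=0.5)
```

Sparse training pushes many scaling factors toward zero, so a channel with a near-zero factor contributes little to the layer output and is a natural candidate for removal.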

Adaptive competitive equilibrium optimizer for power system customer classification

Sida ZHENG, Yan LIU, Xiaokun YANG, Chengfei QI, Peisen YUAN

2021, 2021 (5): 146-156. doi: 10.3969/j.issn.1000-5641.2021.05.013

Accurate classification of power system customers can enable differentiated management and personalized services for customers. To address the challenges associated with accurate customer classification, this paper proposes a classification method based on an equilibrium optimizer and an extreme learning machine. In this method, an adaptive competition mechanism is proposed to balance the global exploration and local exploitation abilities of the equilibrium optimizer, improving the algorithm's performance in finding optimal solutions. Thereafter, the proposed equilibrium optimizer is integrated with an extreme learning machine to classify the customers of a power system. Experiments on real data sets showed that the proposed algorithm, integrated with an extreme learning machine, achieves better results on different classification indexes; hence, the proposed method can provide an effective technical means for power system customer management and service.

Query processing of large-scale product knowledge in a CPU-GPU heterogeneous environment

Chuangxin FANG, Hao SONG, Yuming LIN, Ya ZHOU

2021, 2021 (5): 157-168. doi: 10.3969/j.issn.1000-5641.2021.05.014

Knowledge graphs are an effective way to structurally represent and organize unstructured knowledge; in fact, these graphs are commonly used to support many intelligent applications. However, product-related knowledge is typically massive in scale, heterogeneous, and hierarchical; these characteristics present a challenge for traditional knowledge query processing methods based on relational and graph models. In this paper, we propose a solution to address these challenges by designing and implementing a product knowledge query processing method using CPU and GPU collaborative computing. Firstly, in order to leverage the full parallel computing capability of the GPU, a product knowledge storage strategy based on a sparse matrix is proposed and optimized for the scale of the task. Secondly, based on the storage structure of the sparse matrix, a query conversion method is designed, which transforms a SPARQL query into a corresponding matrix calculation and extends the join query algorithm to the GPU for acceleration. In order to verify the effectiveness of the proposed method, we conducted a series of experiments on an LUBM dataset and a semisynthetic dataset of products. The experimental results showed that the proposed method not only improves retrieval efficiency for large-scale product knowledge datasets compared with existing RDF query engines, but also achieves better retrieval performance on a general RDF standard dataset.

Power marketing big data access control scheme based on a multi-level strategy

Yue ZHANG, Xiuxia TIAN, Yuncheng YAN, Guanyu LU

2021, 2021 (5): 169-184. doi: 10.3969/j.issn.1000-5641.2021.05.015

With the rapid proliferation of technology, the degree of informatization in the financial industry continues to increase. The integration of financial data with power marketing platforms, moreover, is accelerating the interaction between users and power marketing platform data (e.g., basic customer details, energy metering data, electricity fee recovery data). The increased interaction, however, raises the risk of data leakage during transmission, which can result in incorrect formulation of power usage strategies and electricity prices. Therefore, to satisfy the security requirements for data interaction in power marketing systems and ensure economic benefits for the power company, we propose an Ordered Binary Decision Diagram (OBDD) based on Ciphertext Policy Attribute Based Encryption (CP-ABE). This multi-level access approach can reduce the autonomy of shared data authority control in the remote terminal unit and improve the efficiency of data access. In addition, security and performance analysis shows that the proposed access control scheme is both more efficient and more secure than other schemes.

Research on multi-objective cargo allocation based on an improved genetic algorithm

Ping YU, Huiqi HU, Weining QIAN

2021, 2021 (5): 185-198. doi: 10.3969/j.issn.1000-5641.2021.05.016

In this paper, we propose a mathematical model to solve the multi-objective cargo allocation problem with greater stability and efficiency; the model for cargo allocation maximizes the total cargo weight, minimizes the total number of trips, minimizes the number of cargo loading and unloading points, and offers fast convergence based on the elitism genetic algorithm (FEGA). First, a hierarchical structure with the Pareto dominance relation and an elitism retention strategy were added on the basis of the genetic algorithm. This helped to improve the population diversity while strengthening the local search ability of the algorithm. Then, the random structure of the initial population was modified, and a double population strategy was designed. An adaptive operation was subsequently added to improve the global search ability of the algorithm and accelerate the convergence of the population. Real cargo data were then used to demonstrate the feasibility and optimization potential of the new algorithm. The results show that, compared with the traditional genetic algorithm, the proposed algorithm has a better optimization effect in solving the cargo allocation problem with strong constraints and a large search space; its search performance and convergence are also improved.
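
The Pareto dominance relation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FEGA algorithm itself: all three objectives are expressed as minimization by negating the total cargo weight, and the function names and candidate values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Each candidate is (-total_weight, trips, loading_points); values are assumed.
candidates = [(-120, 5, 3), (-120, 4, 3), (-100, 4, 2), (-90, 6, 4)]
front = pareto_front(candidates)
```

Two candidates survive here because neither is better on every objective: one carries more weight, the other uses fewer loading and unloading points. An elitism strategy would carry such non-dominated solutions into the next generation.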
