Table of Contents

    25 September 2019, Volume 2019 Issue 5
    Data-driven Computational Education
    A review of knowledge tracking
    LIU Heng-yu, ZHANG Tian-cheng, WU Pei-wen, YU Ge
    2019, 2019 (5):  1-15.  doi: 10.3969/j.issn.1000-5641.2019.05.001
    In the field of education, scientifically and purposefully tracking the progression of student knowledge is a topic of great significance. With a student's historical learning trajectory and a model for the interaction process between students and exercises, knowledge tracking can automatically track the progression of a student's learning at each stage. This provides a technical basis for predicting student performance and achieving personalized guidance and adaptive learning. This paper first introduces the background of knowledge tracking and summarizes the pedagogy and data mining theory involved in knowledge tracking. Then, the paper summarizes the research status of knowledge tracking based on probability graphs, matrix factorization, and deep learning; we use these tools to classify the tracking methods according to different characteristics. Finally, the paper analyzes and compares the latest knowledge tracking technologies, and looks ahead to the future direction of ongoing research.
    A survey on coreference resolution
    CHEN Yuan-zhe, KUANG Jun, LIU Ting-ting, GAO Ming, ZHOU Ao-ying
    2019, 2019 (5):  16-35.  doi: 10.3969/j.issn.1000-5641.2019.05.002
    Coreference resolution is the task of finding all expressions that point to the same entity in a text; this technique is widely used for text summarization, machine translation, question answering systems, and knowledge graphs. As a classic problem in natural language processing, it is considered NP-Hard. This paper first introduces the basic concepts of coreference resolution, analyzes some confusing concepts related thereto, and discusses the research significance and difficulties of the technique. Then, we summarize research advances in coreference resolution, divide them into stages from a technical standpoint, introduce the representative approaches for each stage, and discuss the advantages and disadvantages of various methods. The summarized approaches are five-fold: rule-based, machine learning, global optimization, knowledge base, and deep learning. Next, we introduce benchmark conferences for the problem of coreference resolution; in this context, we explain and compare their corpora and common evaluation metrics. Finally, this paper highlights the open problems for coreference resolution, and discusses trends and directions of future research.
    A review of machine reading comprehension for automatic QA
    YANG Kang, HANG Ding-jiang, GAO Ming
    2019, 2019 (5):  36-52.  doi: 10.3969/j.issn.1000-5641.2019.05.003
    Artificial Intelligence (AI) is affecting every industry. Applying AI to education accelerates the structural reform of education and transforms traditional education into intelligent adaptive education. The automatic Question Answer system, based on deep learning, not only helps students to answer questions and acquire knowledge in real-time, but can also quickly gather student behavioral data and accelerate personalization of the educational process. Machine reading comprehension is the core module of an automatic Question Answer system, and it is an important technology to understand student problems, document content, and acquire knowledge quickly. With the revival of deep learning and the availability of large-scale reading comprehension datasets, a number of neural network-based machine reading models have been proposed over the past few years. The purpose of this review is three-fold: to introduce and review progress in machine reading comprehension; to compare and analyze the advantages and disadvantages between various neural machine reading models; and to summarize the relevant datasets and evaluation methods in the field of machine reading.
    Research on knowledge point relationship extraction for elementary mathematics
    YANG Dong-ming, YANG Da-wei, GU Hang, HONG Dao-cheng, GAO Ming, WANG Ye
    2019, 2019 (5):  53-65.  doi: 10.3969/j.issn.1000-5641.2019.05.004
    With the development of Internet technology, online education has changed the learning style of students. However, given the lack of a complete knowledge system, online education has a low degree of intelligence and a "knowledge trek" problem. The relation-extraction concept is one of the key elements of knowledge system construction. Therefore, building knowledge systems has become the core technology of online education platforms. At present, the more efficient relationship extraction algorithms are usually supervised. However, such methods suffer from low text quality, scarcity of corpus, difficulty in labeling data, low efficiency of feature engineering, and difficulty in extracting directional relationships. Therefore, this paper studies the relation-extraction algorithm between concepts based on an encyclopedic corpus and distant supervision methods. An attention mechanism based on relational representation is proposed, which can extract the forward relationship information between knowledge points. Combining the advantages of GCN and LSTM, GCLSTM is proposed, which better extracts multipoint information in sentences. Based on the attention mechanism of the Transformer architecture and relational representation, a BTRE model suitable for the extraction of directional relationships is proposed, which reduces the complexity of the model. Hence, a knowledge point relationship extraction system is designed and implemented. The performance and efficiency of the model are verified by designing three sets of comparative experiments.
    Performance prediction based on fuzzy clustering and support vector regression
    SHEN Hang-jie, JU Sheng-gen, SUN Jie-ping
    2019, 2019 (5):  66-73,84.  doi: 10.3969/j.issn.1000-5641.2019.05.005
    Existing performance prediction models tend to overuse different types of attributes, leading to either overly complex prediction methods or models that require manual participation. To improve the accuracy and interpretation of student performance prediction, a method based on fuzzy clustering and support vector regression is proposed. Firstly, fuzzy logic is introduced to calculate the membership matrix, and students are clustered according to their past performance. Then, we use Support Vector Regression (SVR) theory to fit and model performance trajectory for each cluster. Lastly, the final prediction results are adjusted in combination with the students' learning behavior and other related attributes. Experimental results on several baseline datasets demonstrate the validity of the proposed approach.
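The membership-matrix step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state which fuzzy clustering variant is used, so the sketch assumes the standard fuzzy c-means membership formula; the function name `fcm_membership`, the one-dimensional scores, the fixed cluster centers, and the fuzzifier `m = 2` are all hypothetical choices for illustration, not the authors' method.

```python
from typing import List


def fcm_membership(points: List[float], centers: List[float], m: float = 2.0) -> List[List[float]]:
    """Return the fuzzy c-means membership matrix (one row per point).

    Assumption: standard FCM update u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)),
    with absolute difference as the distance on 1-D scores.
    """
    exponent = 2.0 / (m - 1.0)
    matrix = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if any(d == 0.0 for d in dists):
            # A point sitting exactly on a center belongs fully to that center
            # (shared equally if several centers coincide).
            zeros = [d == 0.0 for d in dists]
            row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
        else:
            row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        matrix.append(row)
    return matrix


# Hypothetical past scores for three students and two cluster centers.
memberships = fcm_membership([60.0, 75.0, 90.0], [60.0, 90.0])
```

Each row sums to 1, so a student halfway between two centers receives a membership of 0.5 in each cluster rather than a hard assignment.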
    Computational Intelligence in Emergent Applications
    Transfer learning based QA model of FAQ using CQA data
    SHAO Ming-rui, MA Deng-hao, CHEN Yue-guo, QIN Xiong-pai, DU Xiao-yong
    2019, 2019 (5):  74-84.  doi: 10.3969/j.issn.1000-5641.2019.05.006
    Building an intelligent customer service system based on FAQ (frequently asked questions) is a technique commonly used in industry. Question answering systems based on FAQ offer numerous advantages including stability, reliability, and quality. However, given the practical limitations of scaling a manually annotated knowledge base, models often have limited recognition ability and can easily encounter bottlenecks. In order to address the problem of limited scale with FAQ datasets, this paper offers a solution at both the data level and the model level. At the data level, we use Baidu Knows to crawl relevant data and mine semantically equivalent questions, ensuring the relevance and consistency of the data. At the model level, we propose a deep neural network with transAT oriented transfer learning, which combines a transformer network and an attention network, and is suitable for semantic similarity calculations between sentence pairs. Experiments show that the proposed solution can significantly improve the impact of the model on FAQ datasets and to a certain extent resolve the issues with the limited scale of FAQ datasets.
    Research on artificial intelligence assisted decision-making algorithms for lawyers based on legal-computing theory
    CHEN Liang, GUO Jia-wen, WU Jian-gong, WANG Zhan-quan, SHI Ling
    2019, 2019 (5):  85-99.  doi: 10.3969/j.issn.1000-5641.2019.05.007
    At present, there is a lack of intelligent decision-making tools applied to legal theory and practice. Given the characteristics of data in this field, we establish an intelligent decision-making algorithm using a variety of data analysis models. Legal-computing is focused on data-based mechanization of legal reasoning. It establishes a relationship between legal research and applications using the characteristics and data features of computer science. On this basis, the method of "implication classification" is formed, the decision tree and Naive Bayes algorithms are improved for application to the legal arena, and a coordinate system of legal relationships is established to transfer traditional legal relationship analysis into a spatial geometric system. Experimental results show that the algorithm is consistent with a lawyer's handling strategy and results, and has the feasibility of assisting lawyers more broadly in decision-making.
    Optimal route search based on user preferences
    JIANG Qun, DAI Ge-nan, ZHANG Sen, GE You-ming, LIU Yu-bao
    2019, 2019 (5):  100-112.  doi: 10.3969/j.issn.1000-5641.2019.05.008
    This paper studies the methodology for optimal route search based on user preferences, such as keyword and weight preferences, under a constraint. The research problem is NP-hard. To solve the query efficiently, we propose two new index building methods and select candidate nodes for retrieving the established indices. This paper subsequently proposes an A* based route search algorithm to identify the optimal route and uses several effective pruning strategies to speed up execution. Experimental results on two real-world check-in datasets demonstrate the effectiveness of the proposed method. When the budget ranges from 4 hours to 7 hours, our algorithm performs better than the state-of-the-art PACER algorithm.
    Self-attention based neural networks for product titles compression
    FU Yu, LI You, LIN Yu-ming, ZHOU Ya
    2019, 2019 (5):  113-122,167.  doi: 10.3969/j.issn.1000-5641.2019.05.009
    E-commerce product title compression has received significant attention in recent years, since it can facilitate more specific information for cross-platform knowledge alignment and multi-source data fusion. Product titles usually contain redundant descriptions, which can lead to inconsistencies. In this paper, we propose self-attention based neural networks for this task. Given the fact that self-attention mechanism networks cannot directly capture sequence features of product names, we enhance the mapping networks with a dot-attention structure, which was computed for the query and key-value pairs by a gated recurrent unit (GRU) based recurrent neural network. The proposed method improves the analytical capability of the model at a lower relative computational cost. Based on data from LESD4EC, we built two E-commerce datasets of product core phrases named LESD4EC L and LESD4EC S; we subsequently tested the model on these two datasets. A series of experiments show that the proposed model achieves better performance in product title compression than existing techniques.
    Electric energy abnormal data detection based on Isolation Forests
    HUANG Fu-xing, ZHOU Guang-shan, DING Hong, ZHANG Luo-ping, QIAN Shu-yun, YUAN Pei-sen
    2019, 2019 (5):  123-132.  doi: 10.3969/j.issn.1000-5641.2019.05.010
    With the development of power information systems, users' requirements for the quality of power data have gradually increased. Hence, it is important to ensure the accuracy, reliability, and integrity of massive power data. In this paper, an anomaly detection algorithm based on Isolation Forests is used to realize anomaly detection of large-scale electric energy data. Isolation Forest algorithms generate random binary trees and isolated forest models by dividing training samples and detecting abnormal data points. The algorithm can not only process massive data quickly, but it also offers accurate results and a high degree of reliability. In this paper, the positive active total power (PAP) and reverse active total power (RAP) fields of large-scale electric energy data are determined. The experimental results show that the algorithm has high detection efficiency and accuracy.
    Prediction of power network stability based on an adaptive neural network
    ZHAO Bo, TIAN Xiu-xia, LI Can
    2019, 2019 (5):  133-142.  doi: 10.3969/j.issn.1000-5641.2019.05.011
    The safety and stability of the power grid serve as the basis for reform, development, and stability of power enterprises as well as for broader society. With the increasing complexity of power grid structures, safety and stability of the power grid are important for ensuring the rapid and effective development of the national economy. In this paper, we propose an optimal neural network stability prediction model and compare performance with classical machine learning methods. By analyzing the UCI2018 grid stability simulation dataset, the experimental results show that the proposed method can achieve higher prediction accuracy and provide a new approach for research of power big data.
    Data Management Techniques In the New Era
    Efficient data loading for log-structured data stores
    DING Guo-hao, XU Chen, QIAN Wei-ning
    2019, 2019 (5):  143-158.  doi: 10.3969/j.issn.1000-5641.2019.05.012
    With the rapid development of Internet technology in recent years, the number of users and the data processed by Internet companies and traditional financial institutions are growing rapidly. Traditionally, businesses have tackled this scalability problem by adding servers and adopting methods based on database sharding; however, this can lead to significant manual maintenance expenses and hardware overhead. To reduce overhead and the problems caused by database sharding, businesses commonly replace heritage equipment with new database systems. In this context, new databases based on log-structured merge tree storage (such as OceanBase) are being widely used; the data blocks stored on the disks of such systems exhibit global orderly features. In the process of switching from a traditional database to a new database, a large amount of data must be transferred, and database node downtime may occur when there is extended loading. In order to reduce the total time for loading and failure recovery, we propose a data loading method that supports load balancing and efficient fault tolerance. To support balanced data loading, we pre-calculate the number of partitions based on the file size and the default block size of a target system rather than using a pre-determined number of partitions. In addition, we use the feature that the data exported from the sharding database is usually sorted to determine the split points between partitions by selecting partial sampling blocks and selecting samples in sampling blocks at equal intervals, avoiding the high overhead caused by global sampling and selecting samples randomly or at the head in sampling blocks. To speed up the recovery process, we propose a replica-based partial recovery to avoid restart-based complete reloading; this method uses the multi-replica of an LSM-tree system to reduce the amount of reloaded data. 
    Experimental results show that by pre-calculating the number of partitions and partial sampling blocks and by using equal-interval sampling, we can accelerate data loading relative to pre-determining the number of partitions and global sampling blocks as well as relative to random or head sampling strategies. Hence, we demonstrate the efficiency of replica-based partial failure recovery compared to restart-based complete reloading.
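The two loading ideas in this abstract, deriving the partition count from file size and block size and choosing split points from equal-interval samples of a subset of blocks, can be sketched as follows. This is an illustrative sketch only, assuming the input is already globally sorted and held as in-memory lists of keys; the function names, the `block_stride` parameter, and the quantile-style choice of split points are assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence


def partition_count(file_size: int, block_size: int) -> int:
    """Derive the number of partitions from the input size and the
    target system's block size instead of using a fixed number."""
    return max(1, math.ceil(file_size / block_size))


def equal_interval_samples(block: Sequence[int], k: int) -> List[int]:
    """Pick up to k keys from one block at evenly spaced positions."""
    if k <= 0 or not block:
        return []
    step = max(1, len(block) // k)
    return list(block[::step])[:k]


def split_points(blocks: Sequence[Sequence[int]], num_partitions: int,
                 block_stride: int, samples_per_block: int) -> List[int]:
    """Sample only every `block_stride`-th block (partial sampling),
    then take evenly spaced sample keys as partition boundaries."""
    samples: List[int] = []
    for block in blocks[::block_stride]:
        samples.extend(equal_interval_samples(block, samples_per_block))
    samples.sort()
    if num_partitions <= 1 or not samples:
        return []
    n = len(samples)
    return [samples[(i * n) // num_partitions] for i in range(1, num_partitions)]


# Hypothetical input: keys 0..99 exported in sorted order as 10 blocks of 10 keys.
blocks = [list(range(i, i + 10)) for i in range(0, 100, 10)]
parts = partition_count(file_size=1000, block_size=256)
bounds = split_points(blocks, parts, block_stride=2, samples_per_block=2)
```

Because the exported data is sorted, a few evenly spaced keys per sampled block already approximate the key distribution, which is why the sketch reads only half of the blocks.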
    Implementation of LevelDB-based secondary index on two-dimensional data
    LIU Zi-hao, HU Hui-qi, XU Rui, ZHOU Xuan
    2019, 2019 (5):  159-167.  doi: 10.3969/j.issn.1000-5641.2019.05.013
    With the growth of spatial data generated by scientific research, especially two-dimensional data, and the ongoing development of NoSQL-type database technology, more and more spatial data is now stored in NoSQL databases. LevelDB is an open source Key-Value NoSQL database that is widely used because it offers excellent write performance based on LSM architecture. Given the limitations of the Key-Value structure, it is impossible to index spatial data effectively. For this problem, a LevelDB and R-tree-based secondary index was proposed to support spatial two-dimensional data indexing and neighbor queries. Experimental results show that the structure has good usability.
    Implementation and optimization of a distributed consistency algorithm based on Paxos
    ZHU Chao-fan, GUO Jin-wei, CAI Peng
    2019, 2019 (5):  168-177.  doi: 10.3969/j.issn.1000-5641.2019.05.014
    With the ongoing development of the Internet, the degree of informationization in enterprises is continuously increasing, and more and more data needs to be processed in a timely manner. In this context, the instability of network environments may lead to data loss and node downtime, which can have potentially serious consequences. Therefore, building distributed fault-tolerant storage systems is becoming increasingly popular. In order to ensure high availability and consistency across the system, a distributed consistency algorithm needs to be introduced. To improve the performance of unstable networks, traditional distributed systems based on Paxos allow for the existence of holes in the log. However, when a node enters a recovery state, these systems typically require a large amount of network interaction to complete the holes in the log; this greatly increases the time for node recovery and thereby affects system availability. To address the complexity of the node recovery process after completing a hole log, this paper proposes a redesigned log entry structure and optimized data recovery process. The effectiveness of the improved Paxos-based consistency algorithm is verified with experimental simulation.
    Implementation and optimization of GPU-based relational streaming processing systems
    HUANG Hao, LI Zhi-fang, WANG Jia-lun, WENG Chu-liang
    2019, 2019 (5):  178-189.  doi: 10.3969/j.issn.1000-5641.2019.05.015
    State-of-the-art CPU-based streaming processing systems support complex queries on large-scale datasets. However, limited by CPU computational capability, these systems suffer from the performance tradeoff between throughput and response time, and cannot achieve the best of both. In this paper, we propose a GPU-based streaming processing system, named Serval, that co-utilizes CPU and GPU resources and efficiently processes streaming queries by micro-batching. Serval adopts the pipeline model and uses streaming execution cache to optimize throughput and response time on large scale datasets. To meet the demands of various scenarios, Serval implements multiple tuning policies by scaling the micro-batch size dynamically. Experiments show that a single-server Serval outperforms a 3-server distributed Spark Streaming by 3.87x throughput with a 91% response time on average, reflecting the efficiency of the optimization.
    Woodpecker+: Customized workload performance evaluation based on data characteristics
    ZHANG Tao, ZHANG Xiao-lei, LI Yu-ming, ZHANG Chun-xi, ZHANG Rong
    2019, 2019 (5):  190-202.  doi: 10.3969/j.issn.1000-5641.2019.05.016
    There are a number of performance testing tools, like Sysbench and OLTPBench, that can be used to benchmark the testing of database performance. However, because the standard benchmark workload is fixed and application scenarios for users are not always representative, it is impossible to accurately determine system performance. Moreover, if users are required to use a high-level programming language to implement a test workload separately for each application, this will undoubtedly introduce a substantial amount of repetitive work, resulting in inefficient testing. To address these issues, this paper designs and implements a user-defined performance test workload tool. The main benefits of this tool can be summarized as follows: it is easy to use and expandable; it provides a test definition language (TDL) for efficient construction of test cases; and it offers flexible control for mixed execution of transactions, data access distribution, lightweight and granular statistical information collection and analysis, and support for multiple mainstream DBMSs and other databases that provide database access interfaces. We highlight the tool's features through a detailed set of customized workload experiments running on the mainstream DBMS.