Table of Contents

    25 September 2018, Volume 2018 Issue 5
    Survey on scene text detection based on deep learning
    YU Ruo-nan, HUANG Ding-jiang, DONG Qi-wen
    2018, 2018 (5):  1-16.  doi: 10.3969/j.issn.1000-5641.2018.05.001
    With improvements in computer hardware performance, deep learning-based object detection and image segmentation algorithms have broken through the bottlenecks of traditional algorithms in big data-driven applications and have become the mainstream algorithms in computer vision. In this context, scene text detection algorithms have made great breakthroughs in recent years. The objectives of this survey are threefold: to introduce the progress of scene text detection over the past 5 years, to compare and analyze the advantages and limitations of advanced algorithms, and to summarize the relevant benchmark datasets and evaluation methods in the field.
    A review of non-intrusive sensing based personalized resource recommendations for help-seekers in education
    TANG Lu-min, YU Ruo-nan, DONG Qi-wen, HONG Dao-cheng, FU Yun-bin
    2018, 2018 (5):  17-29.  doi: 10.3969/j.issn.1000-5641.2018.05.002
    Mobile devices, data storage, and computing platforms of modern information technology have accelerated the integration of the information technology and education disciplines, promoted the "Education Informatization 2.0" Plan, and provided a solid technical foundation for academic help-seeking. With the help of new sensing mechanisms and techniques, non-intrusive sensing of help-seeking and personalized recommendation methods can now be used for teaching practices in academia. This study reviews the research progress of non-intrusive sensing based personalized resource recommendations, offers detailed analysis, and lists possible directions for research; potential future research topics include non-intrusive sensing for help-seeking, continuous association analysis and integration for multidimensional data, and personalized resource recommendations for help-seekers. This study also makes contributions to precision education and personalized education for the China Education Informatization 2.0 Plan by providing solutions for non-intrusive sensing based personalized resource recommendations for help-seekers.
    Data processing software technology for new hardware
    TU Yun-shan, CHU Jia-jia, ZHANG Yao, WENG Chu-liang
    2018, 2018 (5):  30-40,78.  doi: 10.3969/j.issn.1000-5641.2018.05.003
    The rapid development of computer hardware in recent years has brought profound technological progress. New hardware for high-performance and low-latency applications (e.g., heterogeneous processors, programmable high-speed NICs/switches, and volatile/non-volatile memory) has been emerging, bringing both new opportunities and challenges to traditional computer architectures and systems. In the case of big data processing, it is difficult to adapt traditional software technology directly to new hardware, which makes it challenging to realize the full potential of breakthroughs in hardware technology. Hence, a rethink of traditional software technology can help unlock the benefits brought by advancements in hardware technology. This paper reviews data processing technology for new hardware from the perspectives of computing, transmission, and storage; analyzes related work in the field; summarizes the progress made to date; and discusses the new problems and challenges that exist. The study also provides a valuable reference point for future research on exploring the performance ceiling of systems for data processing.
    A survey of entity matching algorithms in heterogeneous networks
    LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming
    2018, 2018 (5):  41-55.  doi: 10.3969/j.issn.1000-5641.2018.05.004
    The continuous integration of Internet, Internet of Things, and cloud computing technologies has been improving digitization across different industries, but it has also increased data fragmentation. Fragmented data is characterized by massive volume, heterogeneity, privacy, dependence, and low quality, resulting in poor data availability. As a result, it is often difficult to obtain accurate and complete information for many analytical tasks. To make effective use of data, entity matching, fusion, and disambiguation are of particular significance. In this paper, we summarize data preprocessing, similarity measurements, and entity matching algorithms for heterogeneous networks. In addition, particularly for large datasets, we investigate scalable entity matching algorithms. Existing entity matching algorithms can be categorized into two groups: supervised and unsupervised learning-based algorithms. We conclude the study with research progress on entity matching and topics for future research.
    Implementation of the parallel GroupBy and Aggregation functions in a distributed database system
    XU Shi-lei, WEI Xing, JIANG Hong, QIAN Wei-ning, ZHOU Ao-ying
    2018, 2018 (5):  56-66.  doi: 10.3969/j.issn.1000-5641.2018.05.005
    With the increase in demand for data statistics and analysis in new Internet applications, data grouping and aggregation have become among the most common operations in data analysis applications. This paper analyzes the operating principles of the Aggregation and GroupBy functions commonly used in analytical applications. To address the disadvantages of sort-based grouping in general-purpose transactional databases, two kinds of Hash GroupBy implementations are proposed; in addition, a strategy for dynamically determining the number of Hash buckets and the Hash GroupBy scheme, based on statistical information, is proposed. Based on the characteristics of distributed clusters, an implementation of Hash GroupBy operator push-down is also proposed. Experiments show that using statistical information to dynamically determine the Hash group option improves efficiency.
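As a rough illustration of the ideas named in this abstract, the following minimal single-process sketch groups rows with a hash table instead of sorting, sizes the table from an estimated group count, and merges per-node partial aggregates in the way an operator push-down might. It is not the paper's implementation: the function names `choose_bucket_count`, `hash_group_by`, and `merge_partials`, the load factor, and the power-of-two sizing rule are all assumptions made for the example.

```python
from collections import defaultdict


def choose_bucket_count(estimated_groups, load_factor=0.75):
    """Pick a power-of-two bucket count from an estimated number of groups.

    Hypothetical heuristic (not from the paper): size the table so that the
    estimated groups fill at most `load_factor` of the buckets.
    """
    needed = max(1, int(estimated_groups / load_factor) + 1)
    buckets = 1
    while buckets < needed:
        buckets *= 2
    return buckets


def hash_group_by(rows, key, value):
    """Group rows by `key` and compute SUM and COUNT of `value` in one pass.

    A Python dict is a hash table, so the input never has to be sorted.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: {"sum": sums[k], "count": counts[k]} for k in sums}


def merge_partials(partials):
    """Merge per-node partial aggregates, as a coordinator would after push-down."""
    merged = {}
    for partial in partials:
        for k, agg in partial.items():
            slot = merged.setdefault(k, {"sum": 0, "count": 0})
            slot["sum"] += agg["sum"]
            slot["count"] += agg["count"]
    return merged


# Two hypothetical storage nodes, each holding part of the table.
node_a = [{"region": "east", "amount": 10}, {"region": "west", "amount": 5}]
node_b = [{"region": "east", "amount": 7}]

pushed_down = merge_partials(
    [hash_group_by(node_a, "region", "amount"), hash_group_by(node_b, "region", "amount")]
)
centralized = hash_group_by(node_a + node_b, "region", "amount")
```

Here `pushed_down` and `centralized` hold the same aggregates, but in the pushed-down case only one small partial result per node would cross the network rather than every row.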
    The designs and implementations of columnar storage in Cedar
    YU Wen-qian, HU Shuang, HU Hui-qi
    2018, 2018 (5):  67-78.  doi: 10.3969/j.issn.1000-5641.2018.05.006
    With the growing size of data and analytical needs, the query performance of databases for OLAP (On-Line Analytical Processing) applications has become increasingly important. Cedar is a distributed relational database based on read-write decoupled architecture. Since Cedar is mainly oriented to the needs of OLTP (On-Line Transaction Processing) applications, it has insufficient performance for handling analytical processing workloads. To address this issue, many studies have shown that column storage technology can effectively improve the efficiency of I/O (Input/Output) and enhance the performance of analytical processing. This paper presents a column-based storage mechanism in Cedar. The study analyzes applicable scenarios and improves Cedar's data query and batch update methods for this mechanism. The results of an experiment demonstrate that the proposed mechanism can enhance the performance of analytical processing substantially, while limiting the negative impacts on transaction processing performance to within 10%.
    Primary key management in distributed log-structured database systems
    HUANG Jian-wei, ZHANG Zhao, QIAN Wei-ning
    2018, 2018 (5):  79-90,119.  doi: 10.3969/j.issn.1000-5641.2018.05.007
    At present, many applications in e-commerce, social networking, the mobile Internet, and other areas generate write-intensive workloads (e.g., e-commerce flash sales, also called "second-killing", and user-generated social data streams), which makes log-structured storage a popular technique for the back-end storage of modern database systems. However, log-structured storage only supports the append operation; efficient primary key management (primary key generation and update) can therefore improve the performance of database append operations. In a distributed and concurrent environment, implementing primary key maintenance faces challenges such as primary key unique constraints, transactional maintenance, and high-performance requirements. In light of the characteristics of log-structured storage, this paper explores how to implement efficient primary key management in distributed log-structured database systems. First, we propose two kinds of concurrency control models for WAR (Write After Read) operations; second, we adopt these two models to design efficient primary key management algorithms; and finally, we integrate these algorithms into our distributed log-structured database, CEDAR, and verify the effectiveness of the proposed methods through a series of experiments.
    Application of the consistency protocol in distributed database systems
    ZHAO Chun-yang, XIAO Bing, GUO Jin-wei, QIAN Wei-ning
    2018, 2018 (5):  91-106.  doi: 10.3969/j.issn.1000-5641.2018.05.008
    In recent years, many distributed database products have emerged in the market; yet, distributed databases are still more complex than centralized databases. To make the system usable, designers need to adopt a consistency protocol to ensure two important features of distributed database systems: availability and consistency. The protocol ensures consistency by determining the global execution order of operations for concurrent transactions and by coordinating local and global states to achieve continuous dynamic agreement; it ensures availability by coordinating consistency between multiple copies to achieve seamless switching between master and standby nodes. Hence, the distributed consensus protocol is the fundamental basis of a distributed database system. This paper reviews, in detail, the classic distributed consistency protocols and the application of consistency protocols in current mature distributed databases. The study also provides an analysis and a comparison between the two approaches, considering factors such as read-write operations, node types, and network communication.
    Technology and implementation of a new OLTP system
    HE Xiao-long, MA Hai-xin, HE Yu-kun, PANG Tian-ze, ZHAO Qiong
    2018, 2018 (5):  107-119.  doi: 10.3969/j.issn.1000-5641.2018.05.009
    Since the 1970s, there has been considerable progress in hardware development; in particular, high-performance servers are now equipped with TB-level memory capacity and dozens of physical cores. Traditional OLTP systems, however, are still based on disk storage and designed for hardware with a small number of physical cores; hence, these systems are unable to fully exploit the computing power offered by new hardware. With the development of the Internet, applications commonly have high performance requirements for transactional systems. In extreme cases, some applications serve millions of concurrent access requests, which traditional database systems cannot satisfy. Hence, the redesign and implementation of transactional database systems on high-performance hardware has become an important research topic. In this study, we focus on recent work on transactional database systems in large-memory and multi-core environments. We use OceanBase, an open source database developed by Alibaba, as an example to analyze the design of a new OLTP system.
    Distributed spatio-textual analytics based on the Spark platform
    XU Yang, WANG Zhi-jie, QIAN Shi-you
    2018, 2018 (5):  120-134,153.  doi: 10.3969/j.issn.1000-5641.2018.05.010
    With the rapid development of location-based services, spatio-textual data analytics is becoming increasingly important. For instance, it is widely used in social recommendation applications. However, performing efficient analysis on large spatio-textual datasets in a centralized environment remains a big challenge. This paper explores distributed algorithms for spatio-textual analytics based on the Spark platform. Specifically, we propose a scalable two-level index framework, which processes spatio-textual queries in two steps. The global index is highly scalable and can retrieve candidate partitions with only a few false positives. The local index is designed around the pruning ability of infrequent keywords and is used within each candidate partition. We implemented the proposed distributed algorithms in Spark. Extensive experiments demonstrate promising performance for the proposed solution.
    Blockchain-based smart meter authentication scheme
    TIAN Fu-liang, TIAN Xiu-xia, CHEN Xi
    2018, 2018 (5):  135-143,171.  doi: 10.3969/j.issn.1000-5641.2018.05.011
    The development of power networks is a trend of the future. These networks allow the two-way flow of power resources between users and power systems. As the key node connecting users and energy systems, smart meters hold a large volume of users' power transaction data and identity information, which creates the potential for a breach of private user data. To protect user privacy, an identity authentication scheme for smart meters, based on blockchain technology, is proposed. The smart meter's identity information is processed using the Merkle-tree principle and stored in the blockchain; in this way, the smart meter's identity can be authenticated and the identity information cannot be modified. Additionally, the scheme breaks the connection between the user's identity and their power data, preventing internal and external hackers from obtaining private user data. The integrity and validity of the transaction data are guaranteed by the characteristics of blockchain technology.
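The Merkle-tree principle mentioned in this abstract can be sketched in a few lines. The abstract does not say which hash function or tree layout the scheme uses, so SHA-256, the rule of duplicating the last node on odd-sized levels, and the meter identifiers below are all assumptions chosen for illustration.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw 32-byte SHA-256 digest of `data`."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Compute a Merkle root over a non-empty list of byte strings.

    Assumed convention: when a level has an odd number of nodes, the last
    node is duplicated before pairing.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        # Hash each adjacent pair to build the next level up.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical meter identities; only the root would need to be stored on chain.
meter_ids = [b"meter-001", b"meter-002", b"meter-003"]
root = merkle_root(meter_ids)
tampered_root = merkle_root([b"meter-001", b"meter-00X", b"meter-003"])
```

Because every leaf contributes to the root, changing any single identity record produces a different root, which is what makes later modification detectable.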
    A warehouse receipt management system based on blockchain technology
    QI Xue-cheng, ZHU Yan-chao, SHAO Qi-feng, ZHANG Zhao, JIN Che-qing
    2018, 2018 (5):  144-153.  doi: 10.3969/j.issn.1000-5641.2018.05.012
    Currently, in warehouse receipt e-business, the authenticity of warehouse receipts requires endorsement by a third party. However, fraud cases in which receipts are repeatedly pledged often occur, causing huge losses to the country. Moreover, the data is often centrally managed and unavailable to the public, making it difficult to trace records for goods. To solve these problems, this paper designs and implements a warehouse receipt management system based on blockchain technology, which offers a high degree of transparency, decentralization, trust, and an unchangeable historical record. The system can ensure the accuracy and authenticity of warehouse receipts. The paper builds an inverted index structure on the blockchain system to improve query efficiency and to support complex queries. The paper also proposes a REST (Representational State Transfer)-based service architecture, which provides flexible, multiple access interfaces, making it easy to integrate with existing systems and to support the development of web and mobile applications.
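To make the role of the inverted index concrete, here is a minimal in-memory sketch that maps each field value to the set of receipt identifiers containing it and answers conjunctive queries by intersecting those sets. The class name `InvertedIndex`, the field names, and the receipt identifiers are hypothetical; the abstract does not describe how the paper's index is laid out on the blockchain.

```python
from collections import defaultdict


class InvertedIndex:
    """Map (field, value) pairs to the ids of the records that contain them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id, fields):
        """Index one record, given as a dict of field name to value."""
        for name, value in fields.items():
            self._postings[(name, value)].add(record_id)

    def query(self, **conditions):
        """Return ids of records matching ALL field=value conditions."""
        if not conditions:
            return set()
        # .get() avoids creating empty posting lists for unseen values.
        sets = [self._postings.get(item, set()) for item in conditions.items()]
        return set.intersection(*sets)


index = InvertedIndex()
index.add("r1", {"owner": "A", "goods": "steel"})
index.add("r2", {"owner": "A", "goods": "copper"})
index.add("r3", {"owner": "B", "goods": "steel"})

steel_owned_by_a = index.query(owner="A", goods="steel")
```

A multi-condition lookup such as `steel_owned_by_a` touches only two posting sets instead of scanning every stored receipt, which is the kind of saving an index of this shape is meant to provide.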
    Optimization of the Levenshtein algorithm and its application in repeatability judgment for test bank
    ZHANG Heng, CHEN Liang-yu
    2018, 2018 (5):  154-163.  doi: 10.3969/j.issn.1000-5641.2018.05.013
    To overcome the disadvantages of the Levenshtein distance algorithm for long texts and large-scale matching, we propose an early termination strategy for the algorithm. First, we derive a recurrence relation from the intrinsic relationship between elements in the Levenshtein distance matrix. Based on this relation, an early termination strategy is proposed to determine early on whether two texts satisfy the predefined similarity threshold. Tests on several different subjects demonstrate that the early termination strategy can significantly reduce calculation time.
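One common way to terminate a Levenshtein computation early is sketched below. It relies on a standard property of the distance matrix (the minimum of each row never decreases from one row to the next), which may or may not be the recurrence relation the paper derives; the function name `levenshtein_within` and the `max_dist` parameter are assumptions for this example.

```python
def levenshtein_within(a, b, max_dist):
    """Return the edit distance between a and b if it is <= max_dist, else None.

    The matrix is filled row by row. Once every entry of a row exceeds
    max_dist, the final distance must exceed it too, so the loop stops.
    """
    if abs(len(a) - len(b)) > max_dist:
        return None  # the length gap alone already exceeds the budget
    prev = list(range(len(b) + 1))  # row 0: distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        if min(curr) > max_dist:
            return None  # early termination: no later row can recover
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None


close_enough = levenshtein_within("kitten", "sitting", 3)
too_far = levenshtein_within("kitten", "sitting", 2)
```

If similarity is defined as 1 - d / max(len(a), len(b)), which is an assumption here rather than the paper's stated definition, a similarity threshold translates directly into a `max_dist` budget.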
    A K-Means based clustering algorithm with balanced constraints
    TANG Hai-bo, LIN Yu-ming, LI You
    2018, 2018 (5):  164-171.  doi: 10.3969/j.issn.1000-5641.2018.05.014
    Clustering is a core data processing technology that has been applied widely in various applications. Traditional clustering algorithms, however, cannot guarantee balanced clustering results, which is inconsistent with the needs of real-world applications. To address this problem, this paper proposes a K-Means based clustering algorithm with balanced constraints. The proposed algorithm changes the data point assignment process in each iteration to achieve balanced data blocks. The maximum cluster size threshold can be customized to meet different division requirements. The algorithm is simple and efficient: only two parameters need to be specified, namely the number of clusters and the threshold for the maximum cluster size. A series of experiments carried out on six real UCI (University of California Irvine) datasets shows that the proposed algorithm has better clustering performance and efficiency than other balanced clustering algorithms.
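The sketch below shows one simple way to add a maximum cluster size to the K-Means assignment step: all point-centroid pairs are visited from closest to farthest, and a point joins a cluster only if that cluster still has room. This greedy rule, the deterministic initialization from the first k points, and the names `balanced_kmeans` and `sq_dist` are assumptions for illustration; the paper's actual assignment procedure is not given in the abstract.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def balanced_kmeans(points, k, max_size, iterations=10):
    """K-Means with a capacity limit of `max_size` points per cluster."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    if k * max_size < len(points):
        raise ValueError("k * max_size must be at least the number of points")
    centroids = [list(p) for p in points[:k]]  # simple deterministic init
    labels = [0] * len(points)
    for _ in range(iterations):
        # Every (distance, point index, cluster index) pair, closest first.
        pairs = sorted(
            (sq_dist(p, c), i, j)
            for i, p in enumerate(points)
            for j, c in enumerate(centroids)
        )
        sizes = [0] * k
        assigned = [False] * len(points)
        for _, i, j in pairs:
            if not assigned[i] and sizes[j] < max_size:
                labels[i] = j
                assigned[i] = True
                sizes[j] += 1
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (0.5, 0.5)]
labels, centroids = balanced_kmeans(points, k=2, max_size=3)
cluster_sizes = [labels.count(j) for j in range(2)]
```

With six points, two clusters, and a cap of three, the capacity check forces both clusters to hold exactly three points, even though four of the points lie close together.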
    Research on trademark image retrieval based on deep Hashing
    YUAN Pei-sen, ZHANG Yong, LI Mei-ling, GU Xing-jian
    2018, 2018 (5):  172-182.  doi: 10.3969/j.issn.1000-5641.2018.05.015
    Large-scale image retrieval has great potential for a vast number of applications. The fundamentals of image retrieval lie in image feature extraction and high-efficiency similarity evaluation. Deep learning has great capability for feature representation in image objects, while the Hashing technique has better efficiency for high-dimensional data approximation queries. At present, hash learning technology has been widely researched and applied in large-scale image retrieval for the similarity query. This paper extracts trademark image features using convolutional neural network techniques; the data objects are then encoded with bit codes and an approximate query is applied in Hamming space with high efficiency. In this paper, the convolutional neural network is employed and a deep learning based Hash algorithm is proposed; in addition, the loss function and optimizer for the trademark dataset are studied. By obtaining the bit codes that satisfy the Hash coding criterion, the retrieval of trademark data is efficient. Our method can be divided into offline deep Hash learning and online query stages. Experiments are conducted on real trademark data sets, and the results show that our method can obtain high-quality bit code, which has high retrieval accuracy and online query efficiency.
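The online query stage described in this abstract, an approximate search over bit codes in Hamming space, can be illustrated without the neural network itself. In the sketch below the feature vectors are placeholders standing in for CNN outputs, and the sign-based binarization, the function names `to_bit_code`, `hamming`, and `search`, and the sample database are assumptions rather than the paper's method.

```python
def to_bit_code(features):
    """Binarize a real-valued feature vector: one bit per dimension (1 if > 0)."""
    code = 0
    for x in features:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer-encoded bit codes."""
    return bin(a ^ b).count("1")


def search(query_code, database, top_k=3):
    """Return the names of the top_k entries closest to the query in Hamming space.

    Ties are broken by name so the ranking is deterministic.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: (hamming(query_code, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical database of precomputed 4-bit codes (real codes would be longer).
database = {"logo_a": 0b1011, "logo_b": 0b0100, "logo_c": 0b1001}
query_code = to_bit_code([0.3, -1.2, 0.8, 0.1])
nearest = search(query_code, database, top_k=2)
```

Comparing two codes costs one XOR and a bit count, which is why short binary codes suit large-scale similarity queries better than comparing high-dimensional floating-point vectors directly.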
    Extraction of social media data based on the knowledge graph and LDA model
    MA You, YUE Kun, ZHANG Zi-chen, WANG Xiao-yi, GUO Jian-bin
    2018, 2018 (5):  183-194.  doi: 10.3969/j.issn.1000-5641.2018.05.016
    Social media data extraction forms the basis of research and applications related to public opinion, news dissemination, corporate brand promotion, commercial marketing development, etc. Accurate extraction results are critical to guarantee the effectiveness of the data analysis. In this paper, we analyze the underlying topics in data based on the LDA (Latent Dirichlet Allocation) model; we further implement data extraction in specific domains by adopting featured word sequences and knowledge graphs that describe entities and relevant relationships. Experimental results using "Headline Today" news and Sina Weibo data show that our proposed method can be used to extract social media data effectively.