Journal of East China Normal University (Natural Science)

Parallel join based on distributed system OceanBase

XU Shi-lei, WANG Lei, HU Hui-qi, QIAN Wei-ning, ZHOU Ao-ying

2017, 2017 (5): 1-10. doi: 10.3969/j.issn.1000-5641.2017.05.001

With the rapid growth of application data and the continued development of distributed database systems, storing data across physically independent nodes has become a trend. As a result, when an application performs complex join queries, it inevitably generates heavy network traffic, so improving the efficiency of join queries in distributed systems is a hot topic. Based on an analysis of the nested loop join, hash join, and semi-join in OceanBase, this paper proposes an optimization that makes reasonable use of hardware resources and executes join operations in parallel with multiple threads. We run experiments on OceanBase with the nested loop join, hash join, and semi-join algorithms respectively. The experimental results confirm that, up to a certain number of threads, the efficiency of the join algorithms is positively correlated with the degree of parallelism.
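
As a rough illustration of the general idea of running a join in parallel across threads (not OceanBase's implementation, which the abstract does not detail), the sketch below hash-partitions two in-memory relations on the join key and joins each partition in its own worker thread. The function name `parallel_hash_join`, the list-of-dicts row format, and the `workers` parameter are assumptions made for this example. In CPython the global interpreter lock limits CPU speedup from threads, so the code shows the structure of the technique rather than its performance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def parallel_hash_join(left, right, key_left, key_right, workers=4):
    """Equi-join two lists of dicts by hash-partitioning on the join key.

    Hypothetical helper for illustration only.
    """
    # Partition both inputs so rows with equal keys land in the same partition.
    left_parts = [[] for _ in range(workers)]
    right_parts = [[] for _ in range(workers)]
    for row in left:
        left_parts[hash(row[key_left]) % workers].append(row)
    for row in right:
        right_parts[hash(row[key_right]) % workers].append(row)

    def join_partition(i):
        # Build phase: hash table over the left partition.
        table = defaultdict(list)
        for row in left_parts[i]:
            table[row[key_left]].append(row)
        # Probe phase: look up each right row and merge matching rows.
        return [
            {**l_row, **r_row}
            for r_row in right_parts[i]
            for l_row in table.get(r_row[key_right], [])
        ]

    # Each partition is independent, so partitions can be joined concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(join_partition, range(workers)))
    return [row for part in parts for row in part]
```

Because partitions share no state, the degree of parallelism in this sketch is bounded by the number of partitions, which is one simple way to see why adding threads helps only up to a point.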

Distributed stream processing system for join operations

CHEN Ming-zhu, WANG Xiao-tong, FANG Jun-hua, ZHANG Rong

2017, 2017 (5): 11-19. doi: 10.3969/j.issn.1000-5641.2017.05.002

Real-time stream processing systems play an increasingly important role in practical applications. Stream join is one of the most important and expensive operations in big data analysis. However, skewed data distribution in real-world applications and inherent features of streaming data, such as infinity and unpredictability, put great pressure on join processing in distributed stream systems. Mainstream industrial stream systems have low versatility in join processing and provide no programming interface; although several academic prototype stream systems solve this problem to a certain extent, they support only equi-join processing, or result in high resource utilization and severe load imbalance. In this paper, after analyzing three typical distributed stream systems, we integrate Join-Matrix-based techniques into Storm and design and implement a general stream processing system that supports arbitrary theta joins. Experiments demonstrate that the proposed system outperforms state-of-the-art strategies.

Storage and load balancing for large-scale comment data on heterogeneous Redis cluster

ZHANG Jing-wei, DING Zhi-jun, YANG Qing, ZHANG Hui-bing, ZHANG Hai-tao, ZHOU Ya

2017, 2017 (5): 20-29. doi: 10.3969/j.issn.1000-5641.2017.05.003

The storage and query performance for large-scale comment data greatly influence the applications built on that data. In a heterogeneous computing environment, nodes differ in storage and computing performance, so a key challenge is to optimize storage and query performance for large-scale comment data by taking full advantage of each node's performance. Based on the capabilities of Redis cluster, we design a storage model for large-scale comment data in a homogeneous Redis cluster, which balances storage across Redis slots. We then discuss the relationship between the number of Redis slots and query efficiency, and design a method for allocating storage according to the real load of each computing node in heterogeneous Redis clusters. The method makes full use of each node's performance and guides the allocation of slots to nodes by balancing query performance and storage load. Our experimental results show that the proposed model has a good effect on storage load and improves the query efficiency of heterogeneous Redis clusters.

Design and implementation of Smart materialization for column-store in CLAIMS

ZHANG Han, ZHOU Min-qi

2017, 2017 (5): 30-39. doi: 10.3969/j.issn.1000-5641.2017.05.004

Materialization is a necessary operation in query execution, and materialization strategies and techniques play an important role in it. Therefore, it is necessary to design a materialization strategy for column-store databases. To address the shortcomings of early materialization and late materialization, we propose a strategy named Smart materialization that differs from both. We first define a concept in the logical query plan, the projection, which is a structure used to select the desired attributes. The physical table is cut by column so that, at the start of a query, this structure reduces the amount of data loaded directly into memory and avoids additional overhead. In the logical query plan, projections are divided by columns, the columns required next are predicted according to the relevance of the queries in a query set, and the required columns are kept in the most appropriate projection. We use the TPC-H data set to verify the strategy's validity on the distributed in-memory database CLAIMS.

An outer join algorithm based on Cuckoo filter

YU Yang, ZHOU Min-qi, FANG Zhu-he

2017, 2017 (5): 40-51. doi: 10.3969/j.issn.1000-5641.2017.05.005

In recent years, with the development of the Internet, data sizes have increased rapidly. In the era of big data, the analysis efficiency of distributed database systems urgently needs to be optimized, and the join operation is the main performance bottleneck of a distributed database system. Join operations are mainly divided into inner joins and outer joins, and outer joins are widely used in business scenarios. Distributed join algorithms involve a large amount of network transmission, which severely affects system performance. Although the literature contains some studies on inner join optimization, these methods cannot be directly applied to outer joins. This paper proposes a distributed outer join algorithm based on the Cuckoo filter. By building a Cuckoo filter with replication subdivision technology for data filtering and allocation, the algorithm reduces the amount of data transmitted and improves the degree of parallelism, and thereby improves query performance. We implement this algorithm in Ginkgo. Extensive experiments show that the algorithm largely improves the efficiency of outer joins.

Survey on distributed word embeddings based on neural network language models

YU Ke-ren, FU Yun-bin, DONG Qi-wen

2017, 2017 (5): 52-65,79. doi: 10.3969/j.issn.1000-5641.2017.05.006

Distributed word embedding is one of the most important research topics in natural language processing; its core idea is to use low-dimensional vectors to represent words in text. There are many ways to generate such vectors, among which methods based on neural network language models perform best. A representative example is Word2vec, an open source tool developed by Google Inc. in 2012. Distributed word embeddings can be used in many natural language processing tasks, such as text clustering, named entity tagging, and part-of-speech analysis. Distributed word embeddings rely heavily on the performance of the neural network language model they are based on and on the specific task they are applied to. This paper gives an overview of distributed word embeddings based on neural networks, summarized from three aspects: the construction of classical neural network language models, optimization methods for the multi-class classification problem in language models, and how to use auxiliary structures to train word embeddings.

The auto-question answering system based on convolution neural network

JING Li-jiao, FU Yun-bin, DONG Qi-wen

2017, 2017 (5): 66-79. doi: 10.3969/j.issn.1000-5641.2017.05.007

Question answering is a hot research field in natural language processing; it gives users concise and precise answers to questions posed in natural language and provides more accurate information services. Two key problems must be solved in a question answering system: one is the semantic representation of natural language questions and answers, and the other is learning semantic matching between questions and answers. The convolutional neural network is a classic deep network structure that has shown a strong ability to express semantics in natural language processing in recent years, and it is widely used in automatic question answering. This paper reviews techniques in question answering systems based on convolutional neural networks, focusing on knowledge-based and text-oriented Q&A techniques from the two main perspectives of semantic representation and semantic matching, and points out current research difficulties.

Study of click through rate prediction in online advertisement

XIAO YAO, BI Jun-fang, HAN YI, DONG Qi-wen

2017, 2017 (5): 80-86,100. doi: 10.3969/j.issn.1000-5641.2017.05.008

With the development of the Internet and the growth in users, the advertising industry, which originated in the traditional offline advertising model, is gradually shifting to an online advertising model. At the same time, thanks to big data analysis technology, online advertising shows great advantages over traditional advertising. Advertisers place their advertisements in specific positions on a platform through competitive auctions against their counterparts. It is therefore important to predict the click-through rate (CTR) of a given advertisement before the auction, which helps advertisers reduce costs and expand their likely revenue. This paper introduces commonly used CTR prediction models, uses information from different advertisers, advertisements, and media platforms as machine learning features, and uses real data sets to illustrate the advantages of the various models and the impact of different features on CTR.

Smart meter: Privacy-preserving power request scheme

TIAN Xiu-xia, LI Li-sha, ZHAO Chuan-qiang, TIAN Fu-liang, SONG Qian

2017, 2017 (5): 87-100. doi: 10.3969/j.issn.1000-5641.2017.05.009

A privacy-preserving power request scheme is proposed. The scheme combines the Shamir (t, n) threshold secret sharing scheme with a Laplace noise perturbation algorithm to support time-of-use (TOU) billing while protecting user privacy. Experiments were performed on four aspects: analyzing security quantitatively and determining the optimal threshold t, testing efficiency, verifying ε-differential privacy under Laplace noise perturbation, and comparing the feasibility of the scheme. Experimental results show that the proposed scheme is effective and feasible.
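
The two building blocks named in this abstract can be sketched independently. The code below is a minimal illustration, not the authors' protocol: the abstract does not say how the pieces are combined, so the field modulus `PRIME`, the function names, and the use of integer readings are all assumptions for this example. `make_shares` splits a secret into n shares such that any t of them recover it, and `laplace_noise` draws a Laplace(0, scale) sample as the difference of two exponential samples.

```python
import random
import secrets

# Assumed field modulus for this sketch: the Mersenne prime 2**61 - 1.
PRIME = 2**61 - 1


def make_shares(secret, t, n):
    """Split an integer secret (0 <= secret < PRIME) into n Shamir shares."""
    # Random polynomial of degree t - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]

    def evaluate(x):
        # Horner's rule, modulo PRIME.
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def reconstruct(shares):
    """Recover the secret from at least t shares by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def laplace_noise(scale):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)
```

For instance, `reconstruct(make_shares(1234, t=3, n=5)[:3])` recovers the value from any three of the five shares. For ε-differential privacy the noise scale is typically set to the query sensitivity divided by ε; the values the paper uses are not given in the abstract.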

Techniques for cross-domain recommendation: A survey

CHEN Lei-hui, KUANG Jun, CHEN Hui, ZENG Wei, ZHENG Jian-bing, GAO Ming

2017, 2017 (5): 101-116,137. doi: 10.3969/j.issn.1000-5641.2017.05.010

With the rapid development of information technology and the Internet, the information available online has overwhelmed human processing capabilities in some commercial applications. Personalized recommendation systems are a popular technology for dealing with information overload, and recommendation algorithms are their core. In the past decades, single-domain collaborative filtering recommendation algorithms have been widely used in many applications. However, the cold start and data sparsity problems usually result in overfitting and unsatisfactory performance. Cross-domain recommendation techniques, which aim to utilize knowledge from related domains to perform or improve recommendation in the target domain, have become a hot topic in the field of recommender systems. This paper carries out a systematic study and analysis of cross-domain recommendation techniques. First, we summarize the related concepts and the technical difficulties of cross-domain recommendation algorithms. Second, we present a general categorization of cross-domain recommendation techniques and sum up their respective advantages and disadvantages. Finally, we introduce in detail the methods for analyzing the performance of cross-domain recommendation algorithms.

Research of personalized knowledge search for food safety system

YUAN Pei-sen, REN Wu-bei, REN Shou-gang, ZHU Shu-xin, XU Huan-liang

2017, 2017 (5): 117-124,137. doi: 10.3969/j.issn.1000-5641.2017.05.011

In the era of big data, discovering knowledge from massive data is an important research problem, especially knowledge customized for the user. In this paper, an integrated search system for personalized re-ranking of food safety knowledge, PROSK for short, is designed and implemented. First, a meta-search engine technique is employed to integrate the results of multiple existing search engines; then, according to users' click-through results and the ontology of the food safety domain, a ranking-based learning algorithm is applied to sort search results adaptively according to preference profiles. The system integrates agricultural information from multiple search engines and ranks query results adaptively and intelligently. This study proposes a feasible solution for adaptively ranking food safety information and knowledge from multiple search engines.

Fraudulent medical behavior detection based on hybrid approach

PAN Song-song, ZHANG Wei-jia

2017, 2017 (5): 125-137. doi: 10.3969/j.issn.1000-5641.2017.05.012

With the continuous improvement of the medical insurance system, medical insurance coverage continues to expand. The normal operation of medical insurance funds is closely tied to the vital interests of the people. However, the frequent occurrence of fraudulent behaviors such as frequent hospitalization, hospitalization decomposition, and abnormal fees threatens the normal operation of the funds. This paper first uses the random forest method to select different features for different diseases, then applies CBLOF-based and improved CBLOF methods to detect abnormal fees. In addition, we use a rule-based method to identify frequent hospitalization and hospitalization decomposition. Extensive experiments on real medical claim datasets demonstrate the effectiveness and efficiency of our proposal. Finally, this paper proposes a medical insurance fund supervisory system, which displays the results of pivot analysis with the help of Echarts.

Modeling multi-dimensional user preference based on the latent variable model

WANG Shan-lei, YUE Kun, WU Hao, TIAN Kai-lin

2017, 2017 (5): 138-153. doi: 10.3969/j.issn.1000-5641.2017.05.013

Modeling user preference from user behavior data is the basis of personalized services, score prediction, user behavior targeting, etc. In this paper, multi-dimensional preferences in rating data are described by multiple latent variables, and a Bayesian network with multiple latent variables is adopted as the preliminary knowledge framework of user preference. Constraints are given according to the inherent properties of user preference and latent variables, upon which we propose a method for modeling user preference. Parameters are computed by the EM algorithm and the structure is established by the SEM algorithm subject to the given constraints. With multiple latent variables, modeling generates a large amount of intermediate data, which increases computational complexity. Therefore, we implement the modeling method on the Spark computing framework. Experimental results on the MovieLens dataset verify that the proposed method is effective.

A hybrid collaborative filtering recommendation model based on complex attribute of goods

ZHOU Lan-feng, MA Shuang-ke, FU Zheng, ZHANG Qing

2017, 2017 (5): 154-161,185. doi: 10.3969/j.issn.1000-5641.2017.05.014

Collaborative filtering is the most widely used recommendation algorithm, but it has inherent shortcomings such as data sparsity, cold start, and poor data quality, and few studies have used commodity prices to improve prediction accuracy. At the same time, in an e-commerce market full of paid posters (the "network navy"), ratings and reviews also indirectly lower prediction accuracy. Therefore, this paper jointly considers users' subjective ratings and objective product scores and, on this basis, combines contextual pre-filtering, social network theory, and expert opinions to propose a hybrid collaborative filtering recommendation model that alleviates the above shortcomings to some extent. Experiments with real online car sales data show that the model has higher prediction accuracy than traditional collaborative filtering and is better suited to commodities with complex attributes.

Constrained route planning based on the regular expression

WANG Jing, LIU Hui-ping, JIN Che-qing

2017, 2017 (5): 162-173,235. doi: 10.3969/j.issn.1000-5641.2017.05.015

Traditional route planning algorithms, which mainly focus on metrics such as distance, time, and cost to find the optimal route from source to destination, are not suitable for route planning requirements with location constraints, for example, finding the shortest path that passes through all or some of the user-defined location categories, in a specified order or in any order. Focusing on these scenarios, this paper formalizes the constrained route planning problem on the basis of a regular expression generated from user requirements and gives a general framework for solving it. Based on this framework, a basic constrained route planning algorithm (BCRP) and an improved constrained route planning algorithm (ICRP) are proposed; ICRP reduces the search space using pruning rules. Finally, extensive experiments on real road network datasets demonstrate the efficiency of our proposal.

Privacy preserving method of spatio-temporal data based on k-generalization technology

YANG Zi, NING Bo, LI Yi

2017, 2017 (5): 174-185. doi: 10.3969/j.issn.1000-5641.2017.05.016

In recent years, more and more devices have come to rely on positioning systems, so a large amount of location information is generated, accessed, and used through mobile devices. From the perspective of data mining, this data is of immeasurable value, but in terms of personal privacy, people do not want their information to be leaked and used, which has sparked strong privacy concerns. Many papers have proposed privacy protection techniques to solve this problem; generally speaking, they fall into several categories: perturbation, suppression, and generalization. To protect the privacy of personal spatio-temporal data, this paper proposes a k-generalization method. The method limits the range in which the user may appear in order to improve data availability, selects the nodes to generalize so as to maximize the user's security, and considers the case where multiple sensitive nodes exist, optimizing over those nodes to improve data utility. Finally, experiments evaluate the performance of the algorithm and show that it is effective in protecting personal privacy.

Top-k hotspots recommendation algorithm based on real-time traffic

WU Tao, MAO Jia-li, XIE Qing-cheng, YANG Yan-qiu, WANG Jin

2017, 2017 (5): 186-200. doi: 10.3969/j.issn.1000-5641.2017.05.017

To cut down the no-load rate of taxis and relieve traffic pressure, an effective method for recommending passenger pick-up hotspots is needed. To address the low precision of traditional recommendation techniques, which ignore actual road conditions, we propose a two-phase real-time hotspot recommendation approach for picking up passengers. In the offline mining phase, time-based hotspots are extracted by mining the historical taxi trajectory dataset. In the online recommendation phase, according to the position and time of taxi requests, a potential no-passenger time cost evaluation function based on real-time road conditions is presented to evaluate and rank hotspots and obtain the top-k pick-up hotspots. Experimental results on taxi trajectory data show that our proposal ensures a smaller potential no-load time overhead by considering real-time traffic conditions, and hence is more effective and robust than traditional recommendation approaches.

Individual station estimation from smart card transactions

WANG Yi-lin, ZHANG Zhi-gang, JIN Che-qing

2017, 2017 (5): 201-212. doi: 10.3969/j.issn.1000-5641.2017.05.018

With the fast development of public transportation networks and the widespread use of smart cards, increasingly rich semantic information about human mobility behavior is hidden in smart card transaction data. However, a great number of current smart cards were originally designed for charging and do not record detailed information about where and when a passenger gets on or off a bus, which makes it difficult to analyze and mine transaction data and to provide more precise location-based services. This paper presents the Space-Time Adjacency algorithm (STA) and the Historical Trip Based algorithm (HTB) to estimate the bus station for each card's transaction records with the aid of integral historical data, including complete subway transaction data. Specifically, STA does the initial reconstruction work according to the space-time proximity of adjacent transaction records. HTB then cuts the collection of records into trips with explicit trip purposes, extracts taken lines and transfer lines using historical data, generates candidate stations for each taken line, and finally uses them to recover the transaction records again. Experiments show that the proposed algorithms work well, narrow the range of candidate stations for bus lines, and have good time efficiency.

Design and implementation of graduate student system

LI Yan-bin, PAN Yan-hong, GU Hang, WANG Lei, SHI Bing, SUN Chen, XIA Fan, DONG Qi-wen, SONG Shu-bin

2017, 2017 (5): 213-224. doi: 10.3969/j.issn.1000-5641.2017.05.019

In the recent decade, as the number of enrolled graduate students has increased, the requirements of graduate student management have kept changing, so existing graduate student management systems cannot keep up with practical requirement changes. From the business viewpoint, the current graduate student management system at ECNU does not provide workflow-based status change management. From the system viewpoint, the existing system suffers from browser compatibility problems and slow response times, and supports neither mobile-side service access nor secondary development on the original code base. To meet the new requirements, boost system performance, and improve system usability, the school of graduate management at ECNU organized a development team and used popular open source frameworks such as AngularJS and Spring Boot to design and implement the next-generation graduate student management system. This system aims to solve the problem that different subsystems lack a unified platform. The new system offers good response times and mobile-side service access; most importantly, the self-organized development team can carry out secondary development as new requirements appear, using an agile development and deployment mode.

Design and implementation of operation subsystem in graduate student management system

SHI Bing, XIA Fan, SONG Shu-bin, XIAO Li-min, DONG Qi-wen, ZHOU Ao-ying, XU Lin-hao

2017, 2017 (5): 225-235. doi: 10.3969/j.issn.1000-5641.2017.05.020

An operation scheme based on log collection and analysis is very important in the design of modern software systems; it helps system administrators enhance data security and service stability. First, this paper discusses existing operation approaches based on log collection and analysis. Second, using the open source ELK framework, this paper gives a detailed design of the operation subsystem in the graduate student information platform. The proposed design effectively fulfills the practical requirements of the graduate school, such as performance and workload monitoring, user behavior tracking, and error tracing, through real-time interactive data analysis of system and application logs. Finally, this paper demonstrates the interactive dashboard implementation of our operation subsystem under different business scenarios.
