基于分布式平台Spark的空间文本查询分析

徐阳; 王志杰; 钱诗友

doi:10.3969/j.issn.1000-5641.2018.05.010

华东师范大学学报（自然科学版） >

2018 , Vol. 2018 >Issue 5: 120 - 134,153

DOI: https://doi.org/10.3969/j.issn.1000-5641.2018.05.010

新型互联网应用技术

基于分布式平台Spark的空间文本查询分析

徐阳 ,
王志杰 ,
钱诗友

展开

1. 上海交通大学计算机科学与工程系, 上海 200240;
2. 中山大学数据科学与计算机学院, 广州 510006

徐阳,男,硕士研究生,研究方向为分布式计算、大数据处理.E-mail:xuyangit@sjtu.edu.cn.

收稿日期: 2018-07-09

网络出版日期: 2018-09-26

基金资助

国家重点研发计划项目（2017YFC0803700）；广东省科技计划项目（2015A030401057，2016B030307002）

收起

Distributed spatio-textual analytics based on the Spark platform

XU Yang ,
WANG Zhi-jie ,
QIAN Shi-you

Expand

1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China;
2. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China

Received date: 2018-07-09

Online published: 2018-09-26

Fold

摘要

随着基于位置服务应用的不断推广，空间文本数据查询的应用价值（例如结合地理位置和用户标签的社交推荐）也在不断提高.但是，随着数据规模的迅速增长，传统的基于单机环境实现的技术难以为用户提供低延时和高吞吐量的服务.为此，本文基于Spark平台对分布式环境下的空间文本查询算法进行了探究.采用了面向海量空间文本数据的两层索引框架（包括全局索引和局部索引），该框架利用了分阶段过滤的策略来处理分布式下的布尔范围查询问题.同时，针对空间文本相似连接提出了Prefix-RI结构并提出了相应的分布式算法.基于Spark平台实现了所提出的分布式算法，并通过大量的实验对比验证了所提出方法的优越性.

关键词： 分布式计算; 空间文本分析; 相似连接

本文引用格式

徐阳 , 王志杰 , 钱诗友 . 基于分布式平台Spark的空间文本查询分析[J]. 华东师范大学学报（自然科学版）, 2018 , 2018(5) : 120 -134,153 . DOI: 10.3969/j.issn.1000-5641.2018.05.010

Abstract

With the rapid development of location-based services, spatio-textual data analytics is becoming increasingly important. For instance, it is widely used in social recommendation applications. However, performing efficient analysis on large spatio-textual datasets in a central environment remains a big challenge. This paper explored distributed algorithms for spatio-textual analytics based on the Spark platform. Speciffically, we proposed a scalable two-level index framework, which processes spatio-textual queries in two steps. The global index is highly scalable and it can retrieve candidate partitions with only a few false positives. The local index is designed based on pruning ability of infrequent keywords and used for each candidate partition. We implemented the proposed distributed algorithms in Spark. Extensive experiments demonstrated promising performance for the proposed solution.

Key words： distributed processing; spatio-textual analytics; similarity join

参考文献

[1] CHEN L, CONG G, JENSEN C S, et al. Spatial keyword query processing:An experimental evaluation[C]//International Conference on Very Large Data Bases. Trondheim, Norway:VLDB Endowment, 2013, 6(3):217-228.
[2] BOUROS P, GE S, MAMOULIS N. Spatio-textual similarity joins[J]. Proceedings of the VLDB Endowment, 2012, 6(1):1-12.
[3] ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing[C]//USENIX Association Usenix Conference on Networked Systems Design and Implementation. New York:ACM, 2012,70(2):2.
[4] XIE D, LI F, YAO B, et al. Simba:Effcient in-memory spatial analytics[C]//International Conference on Management of Data. New York:ACM, 2016:1071-1085.
[5] ELDAWY A, MOKBEL M F. Spatial Hadoop:A MapReduce framework for spatial data[C]//International Conference on Data Engineering. New York:IEEE, 2016:1352-1363.
[6] XIE D, LI F, YAO B, et al. Simba:Spatial in-memory big data analysis[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York:ACM, 2016:86-89.
[7] YAO B, ZHANG W, WANG Z J, et al. Distributed in-memory analytics for big temporal data[C]//International Conference on Database Systems for Advanced Applications. Berlin:Springer, 2018:549-565.
[8] AJI A, WANG F, VO H, et al. Hadoop-GIS:A high performance spatial data warehousing system over MapReduce[J]. Proceedings of the VLDB Endowment, 2013, 6(11):1009-1020.
[9] YAO B, LI F, HADJIELEFTHERIOU M, et al. Approximate string search in spatial databases[C]//International Conference on Data Engineering. New York:IEEE, 2010:545-556.
[10] YAO B, TANG M, LI F. Multi-approximate-keyword routing in GIS data[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York:ACM, 2011:201-210.
[11] LI F, YAO B, TANG M, et al. Spatial approximate string search[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6):1394-1409.
[12] ZHOU Y, XIE X, WANG C, et al. Hybrid index structures for location-based web search[C]//International Conference on Information and Knowledge Management. New York:ACM, 2005:155-162.
[13] BECKMANN N, KRIEGEL H P, SCHNEIDER R, et al. The R*-tree:An effcient and robust access method for points and rectangles[J]. ACM Sigmod Record, 1990, 19(2):322-331.
[14] GUTTMAN A. R-trees:A dynamic index structure for spatial searching[C]//International Conference on Management of Data. New York:ACM, 1984:47-57.
[15] WU D, JENSEN C S. A density-based approach to the retrieval of top-k spatial textual clusters[C]//International Conference on Information and Knowledge Management. New York:ACM, 2016:2095-2100.
[16] CHOUDHURY F M, CULPEPPER J S, SELLIS T. Batch processing of top-k spatial-textual queries[C]//International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data. New York:ACM, 2015:7-12.
[17] LUO C, LI J, LI G, et al. Effcient reverse spatial and textual k nearest neighbor queries on road networks[J].Knowledge-Based Systems, 2016, 93:121-134.
[18] CHEN Z, CONG G, ZHANG Z, et al. Distributed publish/subscribe query processing on the spatio-textual data stream[C]//International Conference on Data Engineering. New York:IEEE, 2017:1095-1106.
[19] YAO B, LI F, KUMAR P. K nearest neighbor queries and knn-joins in large relational databases (almost) for free[C]//International Conference on Data Engineering. New York:IEEE, 2010:4-15.
[20] 徐石磊,王雷,胡卉芪. 基于分布式系统OceanBase的并行连接[J]. 华东师范大学学报(自然科学版), 2017(5):1-10.
[21] SARAWAGI S, KIRPAL A. Effcient set joins on similarity predicates[C]//International Conference on Management of Data. New York:ACM, 2004:743-754.
[22] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//International Conference on Data Engineering. New York:IEEE, 2006:5.
[23] BAYARDO R J, MA Y, SRIKANT R. Scaling up all pairs similarity search[C]//International Conference on World Wide Web. New York:ACM, 2007:131-140.
[24] FAN L, CAO P, ALMEIDA J, et al. Summary cache:A scalable wide-area web cache sharing protocol[J].IEEE/ACM Transactions on Networking, 2000, 8(3):281-293.
[25] XU Y, YAO B, WANG Z J, et al. Skia:Scalable and effcient in-memory analytics for big spatialtextualdata[EB/OL]. (2018-05-15)[2018-06-18]. https://www.researchgate.net/publication/326352693.
[26] LEUTENEGGER S T, LOPEZ M A, EDGINGTON J. STR:a simple and effcient algorithm for R-tree packing[C]//International Conference on Data Engineering. New York:IEEE, 1997:497-506.
[27] OpenStreetMap Foundation. Openstreetmap project[EB/OL]. (2016-02-12)[2018-06-18]. http://www.openstreetmap.org.
[28] LESKOVEC J, KREYL A. SNAP Datasets:Stanford large network dataset collection[EB/OL]. (2014-06-01)[2018-06-18]. http://snap.stanford.edu/data.
[29] SAHAMI M, HEILMAN T D. A web-based kernel function for measuring the similarity of short text snippets[C]//International Conference on World Wide Web. New York:ACM, 2006:377-386.
[30] YAO B, CHEN Z, GAO X, et al. Flexible aggregate nearest neighbor queries in road networks[C]//International Conference on Data Engineering. New York:IEEE, 2018:1-12.
[31] MA J, YAO B, GAO X, et al. Top-k Critical Vertices Query on Shortest Path[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 99(1):1-13.
[32] XIE D, LI G, YAO B, et al. Practical private shortest path computation based on oblivious storage[C]//International Conference on Data Engineering. New York:IEEE, 2016:361-372.
[33] YAO B, XIAO X, LI F, et al. Dynamic monitoring of optimal locations in road network databases[J]. The International Journal on Very Large Data Bases, 2014, 23(5):697-720.
[34] XIAO X, YAO B, LI F. Optimal location queries in road network databases[C]//International Conference on Data Engineering. New York:IEEE, 2011:804-815.

Options

文章导航

模态框（Modal）标题

摘要

本文引用格式

Abstract

参考文献