With the rapid development of location-based services, spatio-textual data analytics is becoming increasingly important. For instance, it is widely used in social recommendation applications. However, performing efficient analysis on large spatio-textual datasets in a central environment remains a big challenge. This paper explored distributed algorithms for spatio-textual analytics based on the Spark platform. Speciffically, we proposed a scalable two-level index framework, which processes spatio-textual queries in two steps. The global index is highly scalable and it can retrieve candidate partitions with only a few false positives. The local index is designed based on pruning ability of infrequent keywords and used for each candidate partition. We implemented the proposed distributed algorithms in Spark. Extensive experiments demonstrated promising performance for the proposed solution.
[1] CHEN L, CONG G, JENSEN C S, et al. Spatial keyword query processing:An experimental evaluation[C]//International Conference on Very Large Data Bases. Trondheim, Norway:VLDB Endowment, 2013, 6(3):217-228.
[2] BOUROS P, GE S, MAMOULIS N. Spatio-textual similarity joins[J]. Proceedings of the VLDB Endowment, 2012, 6(1):1-12.
[3] ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing[C]//USENIX Association Usenix Conference on Networked Systems Design and Implementation. New York:ACM, 2012,70(2):2.
[4] XIE D, LI F, YAO B, et al. Simba:Effcient in-memory spatial analytics[C]//International Conference on Management of Data. New York:ACM, 2016:1071-1085.
[5] ELDAWY A, MOKBEL M F. Spatial Hadoop:A MapReduce framework for spatial data[C]//International Conference on Data Engineering. New York:IEEE, 2016:1352-1363.
[6] XIE D, LI F, YAO B, et al. Simba:Spatial in-memory big data analysis[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York:ACM, 2016:86-89.
[7] YAO B, ZHANG W, WANG Z J, et al. Distributed in-memory analytics for big temporal data[C]//International Conference on Database Systems for Advanced Applications. Berlin:Springer, 2018:549-565.
[8] AJI A, WANG F, VO H, et al. Hadoop-GIS:A high performance spatial data warehousing system over MapReduce[J]. Proceedings of the VLDB Endowment, 2013, 6(11):1009-1020.
[9] YAO B, LI F, HADJIELEFTHERIOU M, et al. Approximate string search in spatial databases[C]//International Conference on Data Engineering. New York:IEEE, 2010:545-556.
[10] YAO B, TANG M, LI F. Multi-approximate-keyword routing in GIS data[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York:ACM, 2011:201-210.
[11] LI F, YAO B, TANG M, et al. Spatial approximate string search[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6):1394-1409.
[12] ZHOU Y, XIE X, WANG C, et al. Hybrid index structures for location-based web search[C]//International Conference on Information and Knowledge Management. New York:ACM, 2005:155-162.
[13] BECKMANN N, KRIEGEL H P, SCHNEIDER R, et al. The R*-tree:An effcient and robust access method for points and rectangles[J]. ACM Sigmod Record, 1990, 19(2):322-331.
[14] GUTTMAN A. R-trees:A dynamic index structure for spatial searching[C]//International Conference on Management of Data. New York:ACM, 1984:47-57.
[15] WU D, JENSEN C S. A density-based approach to the retrieval of top-k spatial textual clusters[C]//International Conference on Information and Knowledge Management. New York:ACM, 2016:2095-2100.
[16] CHOUDHURY F M, CULPEPPER J S, SELLIS T. Batch processing of top-k spatial-textual queries[C]//International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data. New York:ACM, 2015:7-12.
[17] LUO C, LI J, LI G, et al. Effcient reverse spatial and textual k nearest neighbor queries on road networks[J].Knowledge-Based Systems, 2016, 93:121-134.
[18] CHEN Z, CONG G, ZHANG Z, et al. Distributed publish/subscribe query processing on the spatio-textual data stream[C]//International Conference on Data Engineering. New York:IEEE, 2017:1095-1106.
[19] YAO B, LI F, KUMAR P. K nearest neighbor queries and knn-joins in large relational databases (almost) for free[C]//International Conference on Data Engineering. New York:IEEE, 2010:4-15.
[20] 徐石磊,王雷,胡卉芪. 基于分布式系统OceanBase的并行连接[J]. 华东师范大学学报(自然科学版), 2017(5):1-10.
[21] SARAWAGI S, KIRPAL A. Effcient set joins on similarity predicates[C]//International Conference on Management of Data. New York:ACM, 2004:743-754.
[22] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//International Conference on Data Engineering. New York:IEEE, 2006:5.
[23] BAYARDO R J, MA Y, SRIKANT R. Scaling up all pairs similarity search[C]//International Conference on World Wide Web. New York:ACM, 2007:131-140.
[24] FAN L, CAO P, ALMEIDA J, et al. Summary cache:A scalable wide-area web cache sharing protocol[J].IEEE/ACM Transactions on Networking, 2000, 8(3):281-293.
[25] XU Y, YAO B, WANG Z J, et al. Skia:Scalable and effcient in-memory analytics for big spatialtextualdata[EB/OL]. (2018-05-15)[2018-06-18]. https://www.researchgate.net/publication/326352693.
[26] LEUTENEGGER S T, LOPEZ M A, EDGINGTON J. STR:a simple and effcient algorithm for R-tree packing[C]//International Conference on Data Engineering. New York:IEEE, 1997:497-506.
[27] OpenStreetMap Foundation. Openstreetmap project[EB/OL]. (2016-02-12)[2018-06-18]. http://www.openstreetmap.org.
[28] LESKOVEC J, KREYL A. SNAP Datasets:Stanford large network dataset collection[EB/OL]. (2014-06-01)[2018-06-18]. http://snap.stanford.edu/data.
[29] SAHAMI M, HEILMAN T D. A web-based kernel function for measuring the similarity of short text snippets[C]//International Conference on World Wide Web. New York:ACM, 2006:377-386.
[30] YAO B, CHEN Z, GAO X, et al. Flexible aggregate nearest neighbor queries in road networks[C]//International Conference on Data Engineering. New York:IEEE, 2018:1-12.
[31] MA J, YAO B, GAO X, et al. Top-k Critical Vertices Query on Shortest Path[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 99(1):1-13.
[32] XIE D, LI G, YAO B, et al. Practical private shortest path computation based on oblivious storage[C]//International Conference on Data Engineering. New York:IEEE, 2016:361-372.
[33] YAO B, XIAO X, LI F, et al. Dynamic monitoring of optimal locations in road network databases[J]. The International Journal on Very Large Data Bases, 2014, 23(5):697-720.
[34] XIAO X, YAO B, LI F. Optimal location queries in road network databases[C]//International Conference on Data Engineering. New York:IEEE, 2011:804-815.