Journal of East China Normal University (Natural Science) ›› 2018, Vol. 2018 ›› Issue (5): 120-134,153. doi: 10.3969/j.issn.1000-5641.2018.05.010

• New Internet Application Technologies •

Distributed spatio-textual analytics based on the Spark platform

XU Yang1, WANG Zhi-jie2, QIAN Shi-you1

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China;
  2. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China
  • Received: 2018-07-09 Online: 2018-09-25 Published: 2018-09-26
  • Corresponding author: WANG Zhi-jie, male, Ph.D., associate researcher; research interests include data mining. E-mail: wangzhij5@mail.sysu.edu.cn
  • About the author: XU Yang, male, master's student; research interests include distributed computing and big data processing. E-mail: xuyangit@sjtu.edu.cn
  • Funding:
    National Key Research and Development Program of China (2017YFC0803700); Science and Technology Planning Project of Guangdong Province (2015A030401057, 2016B030307002)

Abstract: With the spread of location-based services, spatio-textual data analytics is becoming increasingly valuable; a typical use is social recommendation that combines geographic location with user tags. As data volumes grow rapidly, however, techniques built for a single-machine (centralized) environment struggle to deliver low latency and high throughput. This paper explores distributed algorithms for spatio-textual queries on the Spark platform. Specifically, we use a scalable two-level index framework, consisting of a global index and a local index, that answers Boolean range queries in a distributed setting through staged filtering. The global index is highly scalable and retrieves candidate partitions with only a few false positives. The local index exploits the pruning power of infrequent keywords and is applied within each candidate partition. For spatio-textual similarity joins, we propose the Prefix-RI structure together with a corresponding distributed algorithm. We implemented the proposed algorithms on Spark, and extensive comparative experiments confirm the advantages of the proposed methods.

Key words: distributed processing, spatio-textual analytics, similarity join
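
The two-step idea in the abstract (a global index that selects candidate partitions, then a local index that prunes with infrequent keywords) can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the `Partition` class, the bounding-box-plus-keyword-set global filter, the inverted-list local index, and all function names are assumptions made for illustration, and the query is assumed to ask for objects inside a rectangle that contain every query keyword.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical partition: bounding box, objects, and a keyword inverted index."""
    mbr: tuple  # (xmin, ymin, xmax, ymax), assumed to cover all objects added
    objects: dict = field(default_factory=dict)  # id -> (x, y, keyword set)
    inverted: dict = field(default_factory=lambda: defaultdict(set))  # keyword -> ids

    def add(self, oid, x, y, keywords):
        self.objects[oid] = (x, y, frozenset(keywords))
        for kw in keywords:
            self.inverted[kw].add(oid)


def intersects(mbr, rect):
    """True if two axis-aligned rectangles overlap (touching edges count)."""
    return not (mbr[2] < rect[0] or rect[2] < mbr[0]
                or mbr[3] < rect[1] or rect[3] < mbr[1])


def global_filter(partitions, rect, keywords):
    """Step 1: keep partitions that overlap the query box and hold every keyword."""
    return [p for p in partitions
            if intersects(p.mbr, rect) and all(kw in p.inverted for kw in keywords)]


def local_search(p, rect, keywords):
    """Step 2: scan only the rarest keyword's posting list, then verify each hit."""
    rarest = min(keywords, key=lambda kw: len(p.inverted.get(kw, ())))
    hits = []
    for oid in p.inverted.get(rarest, ()):
        x, y, kws = p.objects[oid]
        inside = rect[0] <= x <= rect[2] and rect[1] <= y <= rect[3]
        if inside and kws.issuperset(keywords):
            hits.append(oid)
    return hits


def boolean_range_query(partitions, rect, keywords):
    """Return sorted ids of objects inside rect that contain all query keywords."""
    keywords = set(keywords)
    if not keywords:
        raise ValueError("this sketch assumes at least one query keyword")
    result = []
    for p in global_filter(partitions, rect, keywords):
        result.extend(local_search(p, rect, keywords))
    return sorted(result)
```

Starting from the least frequent keyword keeps the candidate list short before the more expensive verification. In a Spark job, the global filter would plausibly run on the driver and the local search inside each selected partition, but that mapping is likewise an assumption rather than something the abstract specifies.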

CLC number: