Journal of East China Normal University (Natural Science), 2018, Vol. 2018, Issue (1): 76-90. doi: 10.3969/j.issn.1000-5641.2018.01.008

• Computer Science •

Data cleaning on probabilistic RDF database

WANG Zhen, LIN Xin

  1. Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai 200062, China
  • Received: 2016-12-03 Online: 2018-01-25 Published: 2018-01-11
  • Corresponding author: LIN Xin, male, associate professor; research interests: spatio-temporal databases and data cleaning. E-mail: xlin@cs.ecnu.edu.cn
  • About the author: WANG Zhen, male, master's student; research interest: data cleaning. E-mail: zhenwangemail@163.com
  • Funding:
    National Natural Science Foundation of China (61572193)

Abstract: Errors and noise introduced while acquiring and parsing data make data uncertain in many domains, and this uncertainty has become an important factor affecting data performance. Probabilistic databases can store uncertain data and return query answers annotated with a confidence. However, the accumulation and propagation of uncertainty can reduce the usability of query results, so it is desirable to reduce the uncertainty of the data held in a probabilistic database. This paper addresses how to improve the certainty of answers to graph queries over a probabilistic RDF (Resource Description Framework) database by means of crowdsourcing. The basic idea is to ask crowd workers to decide whether the relationships represented by selected edges are correct, thereby reducing the uncertainty of the whole query. We propose three algorithms for selecting the edge whose verification yields the largest reduction in the uncertainty of the query result. Finally, experiments verify the proposed algorithms and show that the unstable pruning algorithm and the stable pruning algorithm perform better in terms of efficiency.

Key words: probabilistic RDF graph, crowdsourcing, data cleaning
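
The edge-selection idea in the abstract can be illustrated with a small brute-force sketch. It is not one of the paper's three algorithms, and none of its pruning is reproduced. The following are all assumptions made for illustration: edges are independent, a query answer exists with probability equal to the product of its edge probabilities, uncertainty is measured as the sum of binary Shannon entropies of the answer probabilities, and the crowd always answers correctly. The function names and the example triples are hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Optional, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object) triple


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a yes/no event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def answer_probability(answer: FrozenSet[Edge], edge_prob: Dict[Edge, float]) -> float:
    """Assumption: edges are independent, so an answer's probability is a product."""
    prob = 1.0
    for edge in answer:
        prob *= edge_prob[edge]
    return prob


def total_uncertainty(answers: List[FrozenSet[Edge]], edge_prob: Dict[Edge, float]) -> float:
    """Assumed uncertainty measure: sum of binary entropies over all answers."""
    return sum(binary_entropy(answer_probability(a, edge_prob)) for a in answers)


def expected_reduction(edge: Edge, answers: List[FrozenSet[Edge]],
                       edge_prob: Dict[Edge, float]) -> float:
    """Expected drop in uncertainty if a (perfect) crowd verifies this edge."""
    p = edge_prob[edge]
    before = total_uncertainty(answers, edge_prob)
    confirmed = {**edge_prob, edge: 1.0}  # crowd says the relationship holds
    rejected = {**edge_prob, edge: 0.0}   # crowd says it does not
    after = (p * total_uncertainty(answers, confirmed)
             + (1 - p) * total_uncertainty(answers, rejected))
    return before - after


def select_edge(answers: List[FrozenSet[Edge]],
                edge_prob: Dict[Edge, float]) -> Optional[Edge]:
    """Exhaustively pick the uncertain edge with the largest expected reduction."""
    candidates = sorted({e for a in answers for e in a if 0.0 < edge_prob[e] < 1.0})
    if not candidates:
        return None
    return max(candidates, key=lambda e: expected_reduction(e, answers, edge_prob))


# Hypothetical example data: three uncertain triples and two query answers.
e1 = ("alice", "worksAt", "ecnu")
e2 = ("ecnu", "locatedIn", "shanghai")
e3 = ("bob", "worksAt", "ecnu")
edge_prob = {e1: 0.5, e2: 0.9, e3: 0.8}
answers = [frozenset({e1, e2}), frozenset({e3, e2})]

best = select_edge(answers, edge_prob)
print(best, round(expected_reduction(best, answers, edge_prob), 4))
```

Scoring every candidate edge against every answer is the naive baseline. The pruning algorithms named in the abstract presumably avoid part of this work, but the abstract does not give enough detail to sketch them.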
