华东师范大学学报(自然科学版) ›› 2018, Vol. 2018 ›› Issue (5): 154-163.doi: 10.3969/j.issn.1000-5641.2018.05.013

• 新型互联网应用技术 • 上一篇    下一篇

Levenshtein算法优化及在题库判重中的应用

张衡, 陈良育   

  1. 华东师范大学 上海市高可信计算重点实验室, 上海 200062
  • 收稿日期:2018-07-04 出版日期:2018-09-25 发布日期:2018-09-26
  • 通讯作者: 陈良育,男,副教授,研究方向为计算机软件与理论.E-mail:lychen@sei.ecnu.edu.cn. E-mail:lychen@sei.ecnu.edu.cn
  • 作者简介:张衡,男,硕士研究生,研究方向为自然语言处理.E-mail:51174500154@stu.ecnu.edu.cn.
  • 基金资助:
    国家自然科学基金(11471209)

Optimization of the Levenshtein algorithm and its application in repeatability judgment for test bank

ZHANG Heng, CHEN Liang-yu   

  1. Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
  • Received:2018-07-04 Online:2018-09-25 Published:2018-09-26

摘要: 为了解决Levenshtein距离算法在长文本和大规模匹配效率的不足,本文针对Levenshtein距离算法提出一种提前终止的优化策略.首先根据Levenshtein距离矩阵中元素内在的联系,归纳总结出一个递推关系式.再依据此递推关系式,提出一种提前终止策略,可提前判断两个文本是否满足预先设定的相似度阈值.经过多个学科题库判重实验的佐证,本文的提前终止策略能显著减少计算时间.

关键词: 题库匹配, 文本相似度, Levenshtein编辑距离

Abstract: In order to overcome the disadvantages of the Levenshtein distance algorithm for long text and large-scale matching, we propose an early termination strategy for the Levenshtein distance algorithm. Firstly, according to the intrinsic relationship between elements in the Levenshtein distance matrix, we sum up a recurrence relation. Based on this relation, an early termination strategy is proposed to determine early-on whether two texts satisfy the predefined similarity threshold. Through several tests on different subjects, it is demonstrated that the early termination strategy can significantly reduce calculation time.

Key words: bank match, text similarity, Levenshtein edit distance

中图分类号: