Journal of East China Normal University (Natural Science) ›› 2019, Vol. 2019 ›› Issue (3): 78-85. doi: 10.3969/j.issn.1000-5641.2019.03.009

• Computer Science •

Mathematical formula recognition based on support vector machines

LIU Ting-ting1, CHENG Tao1, JIN Gang-zeng1, WANG Xi-kun2, GAO Ming1

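  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China;
  2. The High School Affiliated to Liaoning Normal University, Dalian, Liaoning 164500, China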
  • Received: 2018-08-06 Online: 2019-05-25 Published: 2019-05-30
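  • Corresponding author: GAO Ming, male, Ph.D., professor; research interests: knowledge graphs, social media data mining and management. E-mail: mgao@dase.ecnu.edu.cn
  • About the author: LIU Ting-ting, female, Ph.D. candidate; research interest: knowledge graphs. E-mail: 14111205040@ahnu.edu.cn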
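  • Funding:
    National Key Research and Development Program of China (2016YFB1000905); Key Project of the Joint Fund of the National Natural Science Foundation of China and Guangdong Province (U1811264); National Natural Science Foundation of China (61877018, 61672234, 61502236); Shanghai Agricultural Science and Technology Promotion Project (T20170303)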

Recognition of mathematical formulas based on support vector machines

LIU Ting-ting1, CHENG Tao1, JIN Gang-zeng1, WANG Xi-kun2, GAO Ming1   

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China;
  2. The High School Affiliated to Liaoning Normal University, Dalian, Liaoning 164500, China
  • Received: 2018-08-06 Online: 2019-05-25 Published: 2019-05-30

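Abstract: Mathematical formula recognition is widely used in smart-education tasks such as photo-based question search, automatic grading, and question-bank construction. Because the formulas in these applications mostly exist as images, recognizing formulas in images has become an important research problem in smart education. Mathematical formulas are structurally complex, so recognizing them from images is far harder than general optical character recognition. This paper divides formula recognition into three steps: character segmentation, symbol recognition, and formula reconstruction. First, characters are segmented from the image by combining projection and connected-component methods. Second, character features are extracted from the ratio of the number of pixels in each region of a single character to the total number of pixels, and a supervised learning model is built to recognize the characters. Finally, the formula is reconstructed from the position at which each character appears in the formula. Experiments on a real data set show that the proposed method reaches an accuracy as high as 98.0%.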

Key words: mathematical formula recognition, support vector machine, optical character recognition

Abstract: The recognition of mathematical formulas is widely used in intelligent education applications, such as searching for answers to questions in image format, automatic marking, and constructing a database of questions. Mathematical formulas usually exist as images in these applications; hence, identifying the formulas in these images is an important research topic in the field of intelligent education. Given the complex structure of mathematical formulas, however, recognizing them in images is far more complicated than a general optical character recognition task. This paper decomposes formula recognition into three steps: character segmentation, character recognition, and formula reconstruction. First, the characters are separated from an image by using a combination of projection and connected-component methods. Second, the features of each character are extracted based on the proportion of pixels in each region of the character relative to the total number of pixels, and a supervised learning model is established to identify each character. Finally, the mathematical formula is reconstructed based on the location of each character in the formula. Experimental results on a real data set show that the proposed mathematical formula recognition method can achieve an accuracy of up to 98.0%.

Key words: mathematical formula recognition, support vector machine (SVM), optical character recognition (OCR)
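
The first two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a binarized image stored as nested lists (1 = ink, 0 = background), covers only the vertical-projection part of segmentation (not the connected-component part), and interprets the pixel-ratio features as the share of a glyph's ink falling in each cell of a grid. The names `split_columns`, `zone_features`, `crop`, and the `grid` parameter are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def split_columns(img: Image) -> List[Tuple[int, int]]:
    """Vertical projection: return [start, end) column ranges that contain ink."""
    if not img:
        return []
    width = len(img[0])
    # Ink count per column; blank columns separate adjacent characters.
    profile = [sum(row[x] for row in img) for x in range(width)]
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                    # entering a run of inked columns
        elif count == 0 and start is not None:
            spans.append((start, x))     # run ends at the first blank column
            start = None
    if start is not None:
        spans.append((start, width))     # run reaches the right edge
    return spans


def crop(img: Image, x0: int, x1: int) -> Image:
    """Cut columns [x0, x1) out of every row."""
    return [row[x0:x1] for row in img]


def zone_features(img: Image, grid: int = 2) -> List[float]:
    """Share of the glyph's ink pixels in each cell of a grid x grid partition."""
    height, width = len(img), len(img[0])
    total = sum(map(sum, img))
    counts = [0] * (grid * grid)
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            if value:
                zone = (y * grid // height) * grid + (x * grid // width)
                counts[zone] += 1
    return [c / total if total else 0.0 for c in counts]


if __name__ == "__main__":
    image = [
        [1, 0, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 0, 0, 1, 1],
    ]
    for x0, x1 in split_columns(image):
        print((x0, x1), zone_features(crop(image, x0, x1)))
```

Feature vectors of this kind could then be passed to a supervised classifier such as scikit-learn's `sklearn.svm.SVC`; the kernel and parameters used in the paper are not stated in the abstract.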
