Journal of East China Normal University (Natural Science), 2018, Vol. 2018, Issue (4): 70-79. doi: 10.3969/j.issn.1000-5641.2018.04.007

• Computer Science •

An algorithm for natural language generation via text extracting

AI Li-si1, TANG Wei-hong2, FU Yun-bin1, DONG Qi-min3, ZHENG Jian-bing1, GAO Ming1

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China;
  2. Shanghai Agricultural Technology Extension and Service Center, Shanghai 201103, China;
  3. Vocational and Technical Education Center of Linxi County, Linxi, Inner Mongolia 025250, China
  • Received: 2017-06-19 Online: 2018-07-25 Published: 2018-07-19
  • Corresponding author: DONG Qi-min, male, first-grade secondary school teacher; research interest: information processing technology. E-mail: 418976195@qq.com
  • About the author: AI Li-si, female, master's student; research interest: natural language processing. E-mail: irisinsh@163.com
  • Funding:
    National Key Research and Development Program of China (2016YFB1000905); Key Program of the Joint Fund of the National Natural Science Foundation of China and Guangdong Province (U1401256); National Natural Science Foundation of China (61402177, 61672234, 61402180, 61363005, 61472321).

Abstract: Natural language generation aims to enable machines to write as humans do, reducing the workload of language professionals and delivering real-time, concise news coverage to readers. It can be applied to question answering and dialogue systems, automatic news writing, and reporting on breaking events, and it remains an open problem that both academia and industry are working to solve. In this paper, we model natural language generation as a keyword set covering problem and propose an unsupervised extractive algorithm for generating text. The algorithm also improves the structure of the generated text, so that the output is no longer a single paragraph. Experimental results show that the algorithm performs well on a large-scale corpus: the generated text covers information more comprehensively and is closer in meaning to human-written text.

Key words: natural language generation, keyword cover problem, informativeness, redundancy
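
To make the keyword covering formulation concrete, the following is a minimal sketch of extractive selection as greedy set cover. It is not the algorithm from the paper, whose details are not given in this abstract; the function name `greedy_keyword_cover`, its parameters, and the greedy heuristic itself are assumptions chosen for illustration. Each candidate sentence is represented only by the set of keywords it contains, and sentences are picked one at a time by how many still-uncovered keywords they add.

```python
from typing import List, Set


def greedy_keyword_cover(
    sentence_keywords: List[Set[str]],
    target_keywords: Set[str],
    max_sentences: int = 5,
) -> List[int]:
    """Return indices of sentences chosen by a greedy set-cover heuristic.

    Hypothetical helper for illustration only: sentence_keywords[i] is the
    set of keywords found in candidate sentence i.
    """
    uncovered = set(target_keywords)  # keywords not yet covered by any pick
    chosen: List[int] = []

    while uncovered and len(chosen) < max_sentences:
        best_idx, best_gain = -1, 0
        for idx, keywords in enumerate(sentence_keywords):
            if idx in chosen:
                continue
            # Gain counts only new keywords, so a sentence that repeats
            # already-covered keywords (redundancy) scores nothing for them.
            gain = len(keywords & uncovered)
            if gain > best_gain:  # strict '>' keeps the earliest sentence on ties
                best_idx, best_gain = idx, gain
        if best_idx < 0:
            break  # no remaining sentence covers any uncovered keyword
        chosen.append(best_idx)
        uncovered -= sentence_keywords[best_idx]

    # Return picks in original document order so the extract reads naturally.
    return sorted(chosen)
```

For example, with candidate keyword sets `[{"flood", "city"}, {"flood", "rescue", "city"}, {"rescue"}, {"weather", "forecast"}]` and targets `{"flood", "rescue", "city", "forecast"}`, the sketch selects sentences 1 and 3: sentence 1 covers three targets at once, and sentence 3 is the only one that adds "forecast". Counting only uncovered keywords is one simple way to trade informativeness against redundancy; a fuller treatment would also weight keywords and group sentences into paragraphs.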
