An algorithm for natural language generation via text extracting

AI Li-si, TANG Wei-hong, FU Yun-bin, DONG Qi-min, ZHENG Jian-bing, GAO Ming

Journal of East China Normal University (Natural Science), 2018, Vol. 2018, Issue 4: 70-79

DOI: https://doi.org/10.3969/j.issn.1000-5641.2018.04.007

Computer Science

1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China;
2. Shanghai Agricultural Technology Extension and Service Center, Shanghai 201103, China;
3. Vocational and Technical Education Center of Linxi County, Linxi, Inner Mongolia 025250, China

AI Li-si, female, master's student; research interest: natural language processing. E-mail: irisinsh@163.com

Received: 2017-06-19

Published online: 2018-07-19

Funding

National Key Research and Development Program of China (2016YFB1000905); Joint Key Project of the National Natural Science Foundation of China and Guangdong Province (U1401256); National Natural Science Foundation of China (61402177, 61672234, 61402180, 61363005, 61472321).

Cite this article

AI Li-si, TANG Wei-hong, FU Yun-bin, DONG Qi-min, ZHENG Jian-bing, GAO Ming. An algorithm for natural language generation via text extracting[J]. Journal of East China Normal University (Natural Science), 2018, 2018(4): 70-79. DOI: 10.3969/j.issn.1000-5641.2018.04.007

Abstract

Natural language generation aims to enable machines to write as people do, reducing the workload of language professionals and delivering real-time, concise news coverage to readers. It can be applied to intelligent question answering and dialogue, automatic news writing, reporting on breaking events, and similar applications, and it has long been an open problem that both academia and industry seek to solve. In this paper, we model automatic text generation as a keyword set cover problem and propose an unsupervised extractive algorithm for it. The algorithm also improves the structure of the generated text, which is no longer a single-paragraph text. Experimental results show that the algorithm performs well on a large-scale corpus: the generated text covers information more comprehensively and is closer in meaning to human-written text.

Keywords: natural language generation; keyword cover problem; informativeness; redundancy
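
The abstract frames extractive generation as covering a set of keywords with selected sentences. The abstract does not give the paper's actual procedure, so the following is only a minimal sketch of the general idea, assuming the textbook greedy approximation to set cover. The function name `greedy_keyword_cover`, the `max_sentences` budget, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def greedy_keyword_cover(
    sentences: List[Set[str]], keywords: Set[str], max_sentences: int
) -> List[int]:
    """Greedily pick sentence indices that cover the target keywords.

    Assumption: each candidate sentence is already reduced to its set of words.
    This is a generic greedy set-cover heuristic, not the authors' algorithm.
    """
    uncovered = set(keywords)
    chosen: List[int] = []
    while uncovered and len(chosen) < max_sentences:
        best_idx, best_gain = -1, 0
        for idx, words in enumerate(sentences):
            if idx in chosen:
                continue
            # Gain = number of still-uncovered keywords this sentence adds.
            gain = len(words & uncovered)
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx < 0:
            # No remaining sentence covers any uncovered keyword.
            break
        chosen.append(best_idx)
        uncovered -= sentences[best_idx]
    return chosen


# Hypothetical toy input: four candidate sentences as word sets.
sentences = [
    {"storm", "coast", "evacuation"},
    {"storm", "coast"},
    {"power", "outage", "evacuation"},
    {"weather", "forecast"},
]
keywords = {"storm", "coast", "evacuation", "power", "outage"}
selected = greedy_keyword_cover(sentences, keywords, max_sentences=3)
print(selected)
```

In this toy run the second sentence is never selected, because every keyword it contains is already covered by the first. That is one simple way a coverage objective can favor informative sentences and discourage redundant ones, the two properties named in the keywords.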
