Forward stagewise additive modeling for entity ranking in documents

WANG Yan-hua

doi:10.3969/j.issn.1000-5641.2018.01.009

Journal of East China Normal University (Natural Science)

2018, Vol. 2018, Issue 1: 91-102, 145

Computer Science

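School of Data Science and Engineering, East China Normal University, Shanghai 200062, China

WANG Yan-hua, male, master's student; research interest: machine learning. E-mail: yhwang917@gmail.com.

Received date: 2016-12-01

Online published: 2018-01-11

Funding

Shanghai Science and Technology for Agriculture Promotion Project (2015, No. 3-2)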

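Cite this article

WANG Yan-hua. Forward stagewise additive modeling for entity ranking in documents[J]. Journal of East China Normal University (Natural Science), 2018, 2018(1): 91-102, 145. DOI: 10.3969/j.issn.1000-5641.2018.01.009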

Abstract

Key entities in a document summarize the subjects of the events or topics that the document describes, and they support applications such as entity-oriented information retrieval and question answering. However, the entities in free text are unordered, so ranking the entities of a document is an important task. In this paper, we first extract entity features from the document itself and derive external features from Wikipedia and word embeddings. We then propose LA-FSAM (FSAM based on AUC Metric and Logistic Function), a ranking model built on forward stagewise additive modeling (FSAM). LA-FSAM uses the AUC (Area Under the Curve) metric to construct its loss function and the logistic function to integrate entity features; the model parameters are optimized with stochastic gradient descent. An experimental comparison with baseline methods demonstrates the effectiveness of the proposed model.

Keywords: entity ranking; forward stagewise additive modeling; area under the curve; logistic function; stochastic gradient descent
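
The abstract names the ingredients of LA-FSAM (an AUC-based loss, a logistic function over entity features, stochastic gradient descent, and forward stagewise additive modeling) but does not give the formulas. The following Python sketch shows one plausible way these pieces can fit together; it is not the paper's implementation. It assumes a pairwise logistic surrogate for AUC, one logistic base learner per stage, and a unit stage weight. The names `fit_stage`, `fit_additive`, `TOY_X`, and `TOY_Y` are hypothetical and exist only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))


def auc(scores, labels):
    """Fraction of (key, non-key) entity pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one entity of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_stage(X, labels, prev, epochs=50, lr=0.1, seed=0):
    """Fit one logistic base learner by SGD on a smooth pairwise AUC surrogate.

    prev holds the scores of the additive model built so far (assumed design).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    pairs = [(i, j)
             for i, yi in enumerate(labels) if yi == 1
             for j, yj in enumerate(labels) if yj == 0]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            hi = sigmoid(dot(w, X[i]))
            hj = sigmoid(dot(w, X[j]))
            margin = (prev[i] + hi) - (prev[j] + hj)
            # Surrogate loss: log(1 + exp(-margin)); d(loss)/d(margin) = -sigmoid(-margin)
            g = -sigmoid(-margin)
            for k in range(len(w)):
                grad_k = g * (hi * (1 - hi) * X[i][k] - hj * (1 - hj) * X[j][k])
                w[k] -= lr * grad_k
    return w


def fit_additive(X, labels, stages=3):
    """Forward stagewise fitting: add one base learner at a time, keep earlier ones fixed."""
    total = [0.0] * len(X)
    models = []
    for m in range(stages):
        w = fit_stage(X, labels, total, seed=m)
        models.append(w)
        total = [t + sigmoid(dot(w, x)) for t, x in zip(total, X)]
    return models, total


# Hypothetical toy data: [bias, one entity feature]; label 1 marks a key entity.
TOY_X = [[1.0, 0.9], [1.0, 0.8], [1.0, 0.3], [1.0, 0.2], [1.0, 0.1]]
TOY_Y = [1, 1, 0, 0, 0]
```

Calling `fit_additive(TOY_X, TOY_Y)` returns the per-stage weight vectors and the final additive scores, which can be passed to `auc` to check the ranking quality. The surrogate is used because AUC itself is a step function of the scores and has no useful gradient.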
