Journal of East China Normal University (Natural Science) ›› 2021, Vol. 2021 ›› Issue (5): 24-36. doi: 10.3969/j.issn.1000-5641.2021.05.003

Authors: Rui FU, Jianyu LI, Jiahui WANG, Kun YUE*, Kuang HU
Received: 2021-08-05
Online: 2021-09-25
Published: 2021-09-28
Contact: Kun YUE
E-mail: kyue@ynu.edu.cn
Abstract:
Extraction of entities and relations from text data is the source for constructing and updating domain knowledge graphs. To address the problems of overlapping relations in fintech-domain text and the shortage of labeled samples in the training data, we propose a method for joint extraction of entities and relations that incorporates the idea of active learning. First, based on active learning, informative samples are incrementally selected as training data. Second, a main-entity-oriented tagging strategy is adopted to transform joint entity-relation extraction into a sequence labeling problem. Finally, joint extraction of domain entities and relations is performed with an improved BERT-BiGRU-CRF model, providing supporting technology for knowledge graph construction and helping financial practitioners analyze, invest, and trade on the basis of domain knowledge, thereby reducing investment risk. Experiments on financial-domain text data show that the proposed method is effective and verify that it can subsequently be used to construct a financial knowledge graph.
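To make the tagger architecture named in the abstract concrete, the following is a minimal sketch of a plain BERT-BiGRU-CRF sequence labeler (not the authors' improved variant or their code). It assumes PyTorch, the Hugging Face transformers library, and the pytorch-crf package; the bert-base-chinese checkpoint name and all hyperparameters are illustrative placeholders.

```python
# Minimal BERT-BiGRU-CRF tagger sketch (illustrative; not the authors' implementation).
# Assumes: torch, transformers (BertModel), and the pytorch-crf package (CRF).
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class BertBiGruCrfTagger(nn.Module):
    def __init__(self, num_tags: int, bert_name: str = "bert-base-chinese",
                 gru_hidden: int = 256, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)        # contextual token encoder
        self.bigru = nn.GRU(self.bert.config.hidden_size, gru_hidden,
                            batch_first=True, bidirectional=True)  # BiGRU over BERT outputs
        self.dropout = nn.Dropout(dropout)
        self.emit = nn.Linear(2 * gru_hidden, num_tags)          # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)               # transition-aware decoding

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        hidden, _ = self.bigru(hidden)
        emissions = self.emit(self.dropout(hidden))
        mask = attention_mask.bool()
        if tags is not None:                                     # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)             # inference: best tag sequence per sentence
```

With a main-entity-oriented tagging scheme, each token's tag would encode both its entity role and its relation to the main entity, so decoding the tag sequence yields entity-relation triples directly.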
Rui FU, Jianyu LI, Jiahui WANG, Kun YUE, Kuang HU. Joint extraction of entities and relations for domain knowledge graph [J]. Journal of East China Normal University (Natural Science), 2021, 2021(5): 24-36.
Table 3  Effectiveness of joint entity and relation extraction on the financial domain dataset (NER: named entity recognition; RE: relation extraction)

Model | NER P | NER R | NER F1 | RE P | RE R | RE F1
BERT-BiLSTM-CRF+RBert | 0.7163 | 0.7130 | 0.7146 | 0.5001 | 0.3439 | 0.4075
Word2vec-BiLSTM-CRF | 0.6932 | 0.6313 | 0.6608 | 0.4185 | 0.4231 | 0.4207
Word2vec-BiGRU-CRF | 0.6770 | 0.7083 | 0.6923 | 0.4611 | 0.3824 | 0.4182
Word2vec-BiGRU*-CRF | 0.6983 | 0.6874 | 0.6928 | 0.4523 | 0.4111 | 0.4307
BERT-BiLSTM-CRF | 0.7346 | 0.7176 | 0.7260 | 0.4806 | 0.4859 | 0.4832
BERT-BiGRU-CRF | 0.7600 | 0.7037 | 0.7308 | 0.5032 | 0.4593 | 0.4802
BERT-BiGRU*-CRF | 0.7530 | 0.7353 | 0.7440 | 0.4952 | 0.4787 | 0.4868
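Here P, R, and F1 denote precision, recall, and their harmonic mean, the standard metrics for this task. As an illustrative check (not taken from the paper), the NER columns of the last row of Table 3 are consistent with this definition:

```latex
% F1 is the harmonic mean of precision (P) and recall (R); e.g., for the NER
% columns of the last row of Table 3:
F_1 = \frac{2PR}{P + R}, \qquad
\frac{2 \times 0.7530 \times 0.7353}{0.7530 + 0.7353} \approx 0.7440
```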
Table 4  Effectiveness of joint entity and relation extraction on the ethnic minority domain dataset

Model | NER P | NER R | NER F1 | RE P | RE R | RE F1
BERT-BiLSTM-CRF+RBert | 0.8579 | 0.8513 | 0.8546 | 0.7578 | 0.3715 | 0.4986
Word2vec-BiLSTM-CRF | 0.7549 | 0.7821 | 0.7682 | 0.5461 | 0.6132 | 0.5777
Word2vec-BiGRU-CRF | 0.7649 | 0.7713 | 0.7680 | 0.6401 | 0.5258 | 0.5773
Word2vec-BiGRU*-CRF | 0.7761 | 0.7894 | 0.7827 | 0.6261 | 0.5739 | 0.5988
BERT-BiLSTM-CRF | 0.8674 | 0.9173 | 0.8916 | 0.7761 | 0.6194 | 0.6889
BERT-BiGRU-CRF | 0.8688 | 0.9285 | 0.8976 | 0.6561 | 0.7188 | 0.6860
BERT-BiGRU*-CRF | 0.9121 | 0.9313 | 0.9216 | 0.7209 | 0.6983 | 0.7094
Table 5  Effectiveness of overlapping relation extraction (Financial: financial domain; Minority: ethnic minority domain)

Model | Financial P | Financial R | Financial F1 | Minority P | Minority R | Minority F1
BERT-BiLSTM-CRF+RBert | 0.3526 | 0.1528 | 0.2132 | 0.5714 | 0.3636 | 0.4444
Word2vec-BiLSTM-CRF | 0.4259 | 0.3126 | 0.3606 | 0.4914 | 0.5969 | 0.5390
Word2vec-BiGRU-CRF | 0.4116 | 0.3150 | 0.3569 | 0.4656 | 0.6250 | 0.5351
Word2vec-BiGRU*-CRF | 0.4234 | 0.3203 | 0.3647 | 0.4815 | 0.6500 | 0.5532
BERT-BiLSTM-CRF | 0.5107 | 0.3495 | 0.4150 | 0.5492 | 0.7114 | 0.6199
BERT-BiGRU-CRF | 0.4923 | 0.3495 | 0.4135 | 0.6111 | 0.6250 | 0.6180
BERT-BiGRU*-CRF | 0.5312 | 0.3564 | 0.4213 | 0.6200 | 0.6566 | 0.6378
Table 6  Comparison of relation extraction performance with different proportions of labeled data

Labeled data / % | Financial P | Financial R | Financial F1 | Minority P | Minority R | Minority F1
10 | 0.1304 | 0.1071 | 0.1176 | 0.2857 | 0.1333 | 0.1818
20 | 0.3230 | 0.2456 | 0.2790 | 0.4286 | 0.3333 | 0.3750
30 | 0.4328 | 0.3275 | 0.3728 | 0.6250 | 0.5556 | 0.5882
40 | 0.4514 | 0.3936 | 0.4205 | 0.6667 | 0.5714 | 0.6154
50 | 0.4667 | 0.4514 | 0.4589 | 0.7143 | 0.6822 | 0.6978
60 | 0.4936 | 0.4734 | 0.4833 | 0.7163 | 0.6875 | 0.7016
100 | 0.4952 | 0.4787 | 0.4868 | 0.7209 | 0.6983 | 0.7094
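Table 6 reflects the active-learning setting sketched in the abstract, in which labeled data are added incrementally. The acquisition strategy is not spelled out on this page; the sketch below shows a generic uncertainty-sampling loop of that kind, with `train`, `uncertainty`, and `annotate` as hypothetical user-supplied callbacks rather than functions from the paper.

```python
# Illustrative active-learning loop (a sketch under stated assumptions, not the authors' procedure).
# Assumed callbacks:
#   train(labeled)            -> model trained on the current labeled set
#   uncertainty(model, text)  -> higher value = more informative sample
#   annotate(texts)           -> labeled examples obtained from a human annotator
from typing import Callable, List, Sequence, Tuple


def select_informative(model, unlabeled: Sequence[str],
                       uncertainty: Callable[[object, str], float],
                       batch_size: int) -> Tuple[List[str], List[str]]:
    """Rank unlabeled sentences by uncertainty and pick the top batch for annotation."""
    ranked = sorted(unlabeled, key=lambda s: uncertainty(model, s), reverse=True)
    return ranked[:batch_size], ranked[batch_size:]


def active_learning_loop(labeled: list, unlabeled: List[str],
                         train: Callable, uncertainty: Callable, annotate: Callable,
                         rounds: int, batch_size: int):
    """Incrementally grow the labeled set and retrain after each selection round."""
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        picked, unlabeled = select_informative(model, unlabeled, uncertainty, batch_size)
        labeled = labeled + annotate(picked)   # human-in-the-loop annotation
        model = train(labeled)                 # retrain on the enlarged labeled set
    return model
```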