Journal of East China Normal University (Natural Science), 2017, Vol. 2017, Issue (5): 52-65, 79. DOI: 10.3969/j.issn.1000-5641.2017.05.006

• Big Data Analysis •

Survey on distributed word embeddings based on neural network language models

YU Ke-ren, FU Yun-bin, DONG Qi-wen

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  • Received: 2017-05-01  Online: 2017-09-25  Published: 2017-09-25
  • Corresponding author: FU Yun-bin, male, postdoctoral fellow; research interests: data science and machine learning. E-mail: fuyunbin2012@163.com
  • About the author: YU Ke-ren, male, master's student; research interest: natural language processing. E-mail: yu_void@qq.com
  • Supported by:
    the National Key Research and Development Program of China (2016YFB1000905); the Joint Key Project of the National Natural Science Foundation of China and Guangdong Province (U1401256); the National Natural Science Foundation of China (61672234, 61402177); and the East China Normal University Informatization Soft Research Project

Abstract: Word embedding is one of the most important research topics in natural language processing; its core idea is to model the words in a text by representing each word with a low-dimensional vector. There are many ways to generate such vectors, and at present the best-performing ones are the distributed word embeddings produced by neural network language models; a representative example is Word2vec, an open-source tool released by Google in 2012. Distributed word embeddings have been applied to natural language processing tasks such as text clustering, named entity recognition, and part-of-speech tagging; their quality depends both on the performance of the underlying neural network language model and on the specific task that model is trained for. This paper surveys distributed word embeddings based on neural networks from three aspects: the construction of classical neural network language models; optimization methods for the large-vocabulary multi-class classification problem in language models; and how auxiliary structures can be used to train word embeddings.
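
To make the core idea concrete, below is a minimal, illustrative sketch of training distributed word embeddings with the skip-gram model. It uses the gensim implementation of Word2vec, which is an assumption of this note (gensim 4.x API), not a tool discussed in the paper; the toy corpus and all hyperparameters are likewise invented for illustration. Setting hs=1 instead of negative would switch to hierarchical softmax, the other common workaround for the full-vocabulary multi-class softmax problem mentioned in the abstract.

    # A minimal sketch (assumptions: gensim 4.x API; toy corpus and
    # hyperparameters are illustrative, not from the paper).
    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens.
    corpus = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["each", "word", "is", "mapped", "to", "a", "low", "dimensional", "vector"],
    ]

    # Skip-gram (sg=1) with negative sampling (negative=5); negative
    # sampling and hierarchical softmax (hs=1) are the usual optimizations
    # for the large-vocabulary multi-class softmax in language models.
    model = Word2Vec(
        corpus,
        vector_size=50,  # dimensionality of each word vector
        window=2,        # context window size
        min_count=1,     # keep every token in this tiny corpus
        sg=1,            # 1 = skip-gram, 0 = CBOW
        negative=5,      # negative samples per (word, context) pair
        epochs=50,
    )

    # Every word is now a dense 50-dimensional vector, and geometric
    # closeness reflects distributional similarity.
    print(model.wv["king"].shape)             # (50,)
    print(model.wv.most_similar("king", topn=3))

On a realistic corpus, most_similar("king") would typically return semantically related words such as "queen"; on this toy corpus the output is only illustrative.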

Key words: word embedding, language model, neural network

CLC number: