Survey on distributed word embeddings based on neural network language models

doi:10.3969/j.issn.1000-5641.2017.05.006

Abstract

Abstract: Distributed word embedding is one of the most important research topics in the field of Natural Language Processing, whose core idea is using lower dimensional vectors to represent words in text. There are many ways to generate such vectors, among which the methods based on neural network language models perform best. And the respective case is Word2vec, which is an open source tool developed by Google inc. in 2012. Distributed word embeddings can be used to solve many Natural Language Processing tasks such as text clusting, named entity tagging, part of speech analysing and so on. Distributed word embeddings rely heavily on the performance of the neural network language model it based on and the specific task it processes. This paper gives an overview of the distributed word embeddings based on neural network and can be summarized from three aspects, including the construction of classical neural network language models, the optimization method for multi-classification problem in language model, and how to use auxiliary structure to train word embeddings.

Key words: word embedding, language model, neural network

CLC Number:

TP391

YU Ke-ren, FU Yun-bin, DONG Qi-wen. Survey on distributed word embeddings based on neural network language models[J]. Journal of East China Normal University(Natural Sc, 2017, 2017(5): 52-65,79.

References

[1] HARRIS Z S. Distributional structure[J]. Word, 1954, 10(2/3):146-162.
[2] FIRTH J R. A synopsis of linguistic theory, 1930-1955[J]. Studies in linguistic analysis, 1957(S):1-31
[3] 来斯惟. 基于神经网络的词和文档语义向量表示方法研究[D]. 北京:中国科学院大学, 2016.
[4] TURIAN J, RATINOV L, BENGIO Y. Word representations:a simple and general method for semi-supervised learning[C]//ACL 2010, Proceedings of the Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden. DBLP, 2010:384-394.
[5] DEERWESTER S, DUMAIS S T, FURNAS G W, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6):391.
[6] PENNINGTON J, SOCHER R, MANNING C. Glove:Global vectors for word representation[C]//Conference on Empirical Methods in Natural Language Processing, 2014:1532-1543.
[7] BROWN P F, DESOUZA P V, MERCER R L, et al. Class-based n-gram models of natural language[J]. Computational linguistics, 1992, 18(4):467-479.
[8] GUO J, CHE W, WANG H, et al. Revisiting embedding features for simple semi-supervised learning[C]//Conference on Empirical Methods in Natural Language Processing, 2014:110-120.
[9] CHEN X, XU L, LIU Z, et al. Joint learning of character and word embeddings[C]//International Conference on Artificial Intelligence. AAAI Press, 2015:1236-1242.
[10] HINTON G E. Learning distributed representations of concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986:12.
[11] MⅡKKULAINEN R, DYER M G. Natural language processing with modular neural networks and distributed lexicon[C]//Cognitive Science, 1991:343-399.
[12] ALEXRUDNICKY. Can artificial neural networks learn language models?[C]//International Conference on Spoken Language Processing. DBLP, 2000:202-205.
[13] BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3(6):1137-1155.
[14] MNIH A, HINTON G. Three new graphical models for statistical language modelling[C]//Machine Learning, Proceedings of the Twenty-Fourth International Conference. DBLP, 2007:641-648.
[15] SUTSKEVER I, HINTON G E. Learning multilevel distributed representations for high-dimensional sequences[J]. Journal of Machine Learning Research, 2007(2):548-555.
[16] MNIH A, HINTON G. A scalable hierarchical distributed language model[C]//Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December. DBLP, 2008:1081-1088.
[17] MNIH A, KAVUKCUOGLU K. Learning word embeddings efficiently with noise-contrastive estimation[C]//Advances in Neural Information Processing Systems, 2013:2265-2273.
[18] MIKOLOV T, KARAFIÁT M, BURGET L, et al. Recurrent neural network based language model[C]//INTERSPEECH 2010, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September. DBLP, 2010:1045-1048.
[19] MIKOLOV T, KOMBRINK S, DEORAS A, et al. Rnnlm-recurrent neural network language modeling toolkit[C]//Processingof the 2011 ASRU Workshop, 2011:196-201.
[20] BENGIO Y, SIMARD P, FRASCONI P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks, 2002, 5(2):157-166.
[21] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8):1735-1780.
[22] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoderdecoder for statistical machine translation[C]//Empirical Methods in Natural Language Processing, 2014:1724-1734.
[23] CHO K, VAN MERRIËNBOER B, BAHDANAU D, et al. On the properties of neural machine translation:Encoder-decoder approaches[J]. ArXiv preprint arXiv:1409.1259, 2014.
[24] CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J]. ArXiv preprint arXiv:1412.3555, 2014.
[25] GREFF K, SRIVASTAVA R K, KOUTNÍK J, et al. LSTM:A search space odyssey[J]. IEEE Transactions on Neural Networks & Learning Systems, 2015(99):1-11.
[26] JOZEFOWICZ R, ZAREMBA W, SUTSKEVER I, et al. An empirical exploration of recurrent network architectures[C]//International Conference on Machine Learning, 2015:2342-2350.
[27] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. ArXiv preprint arXiv:1301.3781, 2013.
[28] MORIN F, BENGIO Y. Hierarchical probabilistic neural network language model[C]//Aistats, 2005:246-252.
[29] GOODMAN J. Classes for fast maximum entropy training[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2001:561-564.
[30] FELLBAUM C, MILLER G. WordNet:An Electronic Lexical Database[M].Cambridge, MA:MIT Press, 1998.
[31] MNIH A, HINTON G. A scalable hierarchical distributed language model[C]//International Conference on Neural Information Processing Systems. Curran Associates Inc, 2008:1081-1088.
[32] LE H S, OPARIN I, ALLAUZEN A, et al. Structured Output Layer neural network language model[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2011:5524-5527.
[33] MIKOLOV T, KOMBRINK S, BURGET L, et al. Extensions of recurrent neural network language model[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2011:5528-5531.
[34] COLLOBERT R, WESTON J. A unified architecture for natural language processing:Deep neural networks with multitask learning[C]//International Conference. DBLP, 2008:160-167.
[35] COLLOBERT R, WESTON J, BOTTOU L, et al. Natural Language processing (almost) from scratch[J]. Journal of Machine Learning Research, 2011, 12(1):2493-2537.
[36] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation:A new estimationp rinciple for unnormalized statistical models[J]. Journal of Machine Learning Research, 2010(9):297-304.
[37] GUTMANN M U, HYVARINEN A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics[J]. Journal of Machine Learning Research, 2012, 13(1):307-361.
[38] MNIH A, TEH Y W. A fast and simple algorithm for training neural probabilistic language models[C]//International Conference on Machine Learning, 2012:1751-1758.
[39] BENGIO Y, SENÉCAL J S. Quick Training of Probabilistic Neural Nets by Impo rtance Sampling[C]//AISTATS, 2003:1-9.
[40] ZOPH B, VASWANI A, MAY J, et al. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, 2016:1217-1222.
[41] DYER C. Notes on noise contrastive estimation and negative sampling[J]. ArXiv preprint arXiv:1410.8251, 2014.
[42] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26:3111-3119.
[43] CHEN W, GRANGIER D, AULI M, et al. Strategies for training large vocabulary neural language models[C]//Meeting of the Association for Computational Linguistics, 2015:1975-1985.
[44] DEVLIN J, ZBIB R, HUANG Z, et al. Fast and robust neural network joint models for statistical machine translation[C]//Meeting of the Association for Computational Linguistics, 2014:1370-1380.
[45] ANDREAS J, DAN K. When and why are log-linear models self-normalizing?[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, 2015:244-249.
[46] MIKOLOV T, KOPECKY J, BURGET L, et al. Neural network based language models for highly inflective languages[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009:4725-4728.
[47] SANTOS C D, ZADROZNY B. Learning character-level representations for part-of-speech tagging[C]//Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014:1818-1826.
[48] COTTERELL R, SCHÜTZE H. Morphological word-embeddings[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, 2015:1287-1292.
[49] BOJANOWSKI P, GRAVE E, JOULIN A, et al. Enriching word vectors with subword information[J]. ArXiv preprint arXiv:1607.04606, 2016.
[50] LI Y, LI W, SUN F, et al. Component-enhanced Chinese character embeddings[C]//Empirical Methods in Natural Language Processing, 2015:829-834.
[51] YU M, DREDZE M. Improving lexical embeddings with semantic knowledge[C]//Meeting of the Association for Computational Linguistics, 2014:545-550.
[52] WANG Z, ZHANG J, FENG J, et al. Knowledge graph and text jointly embedding[C]//Conference on Empirical Methods in Natural Language Processing, 2014:1591-1601.
[53] REISINGER J, MOONEY R J. Multi-prototype vector-space models of word meaning[C]//Human Language Technologies:The 2010 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010:109-117.
[54] HUANG E H, SOCHER R, MANNING C D, et al. Improving word representations via global context and multiple word prototypes[C]//Meeting of the Association for Computational Linguistics:Long Papers. Association for Computational Linguistics, 2012:873-882.
[55] VILNIS L, MCCALLUM A. Word representations via gaussian embedding[R]. University of Massachusetts Amherst, 2014.
[56] HILL F, REICHART R, KORHONEN A, et al. Simlex-999:Evaluating semantic models with genuine similarity estimation[J]. Computational Linguistics, 2015, 41(4):665-695.
[57] FINKELSTEIN R L. Placing search in context:the concept revisited[J]. Acm Transactions on Information Systems, 2002, 20(1):116-131.
[58] ZWEIG G, BURGES C J C. The Microsoft Research sentence completion challenge[R]. Technical Report MSRTR-2011-129, Microsoft, 2011.
[59] GLADKOVA A, DROZD A, MATSUOKA S. Analogy-based detection of morphological and semantic relations with word embeddings:what works and what doesn't[C]//HLT-NAACL, 2016:8-15.
[60] MIKOLOV T, YIH W, ZWEIG G. Linguistic regularities in continuous space word representations[C]//HLTNAACL, 2013:746-751.
[61] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[C]//ICLR, 2015:1-15.
[62] GROVER A, LESKOVEC J. Node2vec:Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016:855-864.