Extraction of social media data based on the knowledge graph and LDA model

  • MA You
  • YUE Kun
  • ZHANG Zi-chen
  • WANG Xiao-yi
  • GUO Jian-bin
  • 1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China;
    2. School of Ethnology and Sociology, Yunnan University, Kunming 650500, China

Received date: 2018-07-10

Online published: 2018-09-26

Abstract

Social media data extraction underpins research and applications in public opinion analysis, news dissemination, corporate brand promotion, and commercial marketing. Accurate extraction is critical to the effectiveness of the downstream data analysis. In this paper, we use the LDA (Latent Dirichlet Allocation) model to analyze the underlying topics in the data; we then extract data for specific domains by using featured word sequences together with knowledge graphs that describe entities and their relationships. Experimental results on "Headline Today" news and Sina Weibo data show that the proposed method extracts social media data effectively.
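
The abstract names LDA as the topic-analysis step but gives no inference procedure or hyperparameters. As an illustration only, the following is a minimal sketch of standard collapsed Gibbs sampling for LDA in plain Python. The function names (lda_gibbs, top_words), the values of alpha, beta, and the iteration count, and the toy posts are all assumptions made for this example and are not taken from the paper. The knowledge-graph and featured-word-sequence steps are not shown.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (theta, phi, vocab): per-document topic mixtures,
    per-topic word distributions, and the sorted vocabulary.
    alpha, beta, iters and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # count of word v in topic k
    topic_total = [0] * num_topics                      # total tokens in topic k
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized full conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + V * beta) for n in row]
        for k, row in enumerate(topic_word)
    ]
    return theta, phi, vocab


def top_words(phi, vocab, n=3):
    """Return the n highest-probability words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in phi
    ]


# Toy posts (made up for this sketch), already tokenized.
posts = [
    "stock market shares rise".split(),
    "market investors buy shares".split(),
    "team wins football match".split(),
    "football fans cheer team".split(),
]
theta, phi, vocab = lda_gibbs(posts, num_topics=2)
topics = top_words(phi, vocab, n=3)
```

Here theta holds one topic mixture per post and topics lists the most probable words per topic. Such high-probability words are one plausible source of candidate feature words for a domain filter; how the paper actually builds its featured word sequences and combines them with the knowledge graph is not stated in the abstract.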

Cite this article

MA You, YUE Kun, ZHANG Zi-chen, WANG Xiao-yi, GUO Jian-bin. Extraction of social media data based on the knowledge graph and LDA model[J]. Journal of East China Normal University (Natural Science), 2018, 2018(5): 183-194. DOI: 10.3969/j.issn.1000-5641.2018.05.016
