Computer Science

Video click-through rate prediction algorithm based on feature engineering

  • 匡俊 ,
  • 唐卫红 ,
  • 陈雷慧 ,
  • 陈辉 ,
  • 曾炜 ,
  • 董启民 ,
  • 高明
  • 1. 华东师范大学 数据科学与工程学院, 上海 200062;
    2. 上海市农业技术推广服务中心, 上海 201103;
    3. 深圳腾讯计算机系统有限公司, 北京 100080;
    4. 林西县职业技术教育中心, 内蒙古 林西 025250
KUANG Jun (匡俊), male, master's student; research interests: user behavior analysis and click-through rate prediction. E-mail: 15001830063@163.com.

Received: 2017-05-19

  Published online: 2018-05-29

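Funding

National Key Research and Development Program of China (2016YFB1000905); Key Program of the National Natural Science Foundation of China-Guangdong Joint Fund (U1401256); National Natural Science Foundation of China (61672234, 61502236, 61472321)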

Algorithm for video click-through rate prediction

  • KUANG Jun ,
  • TANG Wei-hong ,
  • CHEN Lei-hui ,
  • CHEN Hui ,
  • ZENG Wei ,
  • DONG Qi-min ,
  • GAO Ming
  • 1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China;
    2. Shanghai Agricultural Technology Extension and Service Center, Shanghai 201103, China;
    3. Shenzhen Tencent Computer System Co. Ltd., Beijing 100080, China;
    4. Vocational and Technical Education Center of Linxi County, Linxi Inner Mongolia 025250, China

Received date: 2017-05-19

  Online published: 2018-05-29

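Abstract

Click-through rate (CTR) prediction plays an important role in video recommendation systems. A video recommendation system can adjust the order in which videos are served according to the predicted click-through rates, thereby raising users' actual click-through rate. In CTR prediction, the data are massive and imbalanced, so prediction accuracy is generally low. To address these problems, this paper combines feature engineering with machine learning and effectively improves the performance of existing video CTR prediction algorithms. First, feature engineering is used to extract features from the raw data, and methods such as matrix factorization are used to generate cross features. Then, CTR prediction models are built on logistic regression, factorization machines, and gradient boosting decision tree-logistic regression, respectively. Experimental results show that the models based on factorization machines and on gradient boosting decision tree-logistic regression achieve higher prediction accuracy than the model based on logistic regression, and that crossing user features with video features improves the accuracy of CTR prediction.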
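For reference, the factorization machine named in the abstract is usually written in its second-order form as below. This is the standard textbook formulation and not necessarily the exact variant used in the paper; the symbols ($w_0$, $w_i$, $\mathbf{v}_i$, $n$, $k$) are notation assumed here, and the sigmoid link to a click probability is likewise an assumption.

```latex
% Second-order factorization machine over an n-dimensional feature vector x,
% with one k-dimensional latent vector v_i per feature (assumed notation).
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j,
\qquad \mathbf{v}_i \in \mathbb{R}^{k}

% The pairwise term can be evaluated in O(kn) instead of O(kn^2):
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2} x_i^{2} \right]

% Assumed link from the raw score to a click probability:
\hat{p}(\text{click} \mid \mathbf{x}) = \sigma\bigl(\hat{y}(\mathbf{x})\bigr)
  = \frac{1}{1 + e^{-\hat{y}(\mathbf{x})}}
```

The pairwise term is what lets the model learn interactions between user and video features without enumerating every cross by hand.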

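Cite this article

KUANG Jun, TANG Wei-hong, CHEN Lei-hui, CHEN Hui, ZENG Wei, DONG Qi-min, GAO Ming. Video click-through rate prediction algorithm based on feature engineering[J]. Journal of East China Normal University (Natural Science), 2018, 2018(3): 77-87. DOI: 10.3969/j.issn.1000-5641.2018.03.009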

Abstract

Click-through rate prediction plays an important role in video recommendation systems. A video recommendation system can suggest videos to users based on the predicted click-through rates, which makes users more likely to click the videos the platform recommends. However, given the volume and imbalance of data in some applications, the accuracy of click-through rate prediction may be very low. To improve performance, this paper proposes an integrated approach that combines feature engineering with machine learning techniques. In the first stage, the algorithm uses feature engineering to extract user, video, and combinational features from the original dataset. In the second stage, it predicts the click-through rate with three supervised models: logistic regression, a factorization machine, and a gradient boosting decision tree combined with logistic regression. The experimental results show that the factorization machine model and the gradient boosting decision tree combined with logistic regression model both achieve higher prediction accuracy than the logistic regression model. Moreover, crossing user features with video features improves the accuracy of click-through rate prediction.
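To make the last claim concrete, here is a minimal sketch of why crossing user and video features can help a linear model. It is not the paper's pipeline: the field names (`age`, `genre`), the four-row toy log, and the helper names are all hypothetical, and the crosses are built by plain Cartesian pairing rather than by matrix factorization.

```python
import math
from itertools import product


def cross_features(user, video):
    """Return sparse one-hot feature names: raw fields plus user x video crosses."""
    feats = [f"u:{k}={v}" for k, v in user.items()]
    feats += [f"v:{k}={v}" for k, v in video.items()]
    feats += [
        f"x:{uk}={uv}&{vk}={vv}"
        for (uk, uv), (vk, vv) in product(user.items(), video.items())
    ]
    return feats


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, feats):
    """Logistic regression score over active (value 1) one-hot features."""
    return sigmoid(sum(weights.get(f, 0.0) for f in feats))


def train_lr(samples, epochs=200, lr=0.5):
    """Plain SGD on log loss; every active feature has value 1."""
    weights = {}
    for _ in range(epochs):
        for feats, label in samples:
            err = predict(weights, feats) - label
            for f in feats:
                weights[f] = weights.get(f, 0.0) - lr * err
    return weights


# Hypothetical toy log: the click depends only on the user/video combination,
# so neither the user field nor the video field is predictive on its own.
toy_log = [
    ({"age": "young"}, {"genre": "anime"}, 1),
    ({"age": "young"}, {"genre": "news"}, 0),
    ({"age": "old"}, {"genre": "anime"}, 0),
    ({"age": "old"}, {"genre": "news"}, 1),
]
samples = [(cross_features(u, v), y) for u, v, y in toy_log]
weights = train_lr(samples)

for user, video, label in toy_log:
    p = predict(weights, cross_features(user, video))
    print(user, video, "label:", label, "predicted CTR:", round(p, 3))
```

On this XOR-like toy data a logistic regression restricted to the raw `u:` and `v:` features cannot separate clicks from non-clicks, while the added `x:` crosses make the problem linearly separable. Real click logs are far larger and noisier, so the size of the gain there is an empirical question.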
