




Survey on scene text detection based on deep learning

  • YU Ruo-nan,
  • HUANG Ding-jiang,
  • DONG Qi-wen
  • School of Data Science and Engineering, East China Normal University, Shanghai 200062, China

Received date: 2018-06-27

  Online published: 2018-09-26




YU Ruo-nan, HUANG Ding-jiang, DONG Qi-wen. Survey on scene text detection based on deep learning[J]. Journal of East China Normal University (Natural Science), 2018, 2018(5): 1-16. DOI: 10.3969/j.issn.1000-5641.2018.05.001


With improvements in computer hardware performance, deep-learning-based object detection and image segmentation algorithms have overcome the bottlenecks that traditional algorithms face in big-data-driven applications and have become the mainstream approaches in computer vision. In this context, scene text detection algorithms have made major breakthroughs in recent years. This survey has three objectives: to introduce the progress of scene text detection over the past five years, to compare and analyze the advantages and limitations of advanced algorithms, and to summarize the relevant benchmark datasets and evaluation methods in the field.


