[1] MOHAMMADI S H, KAIN A. An overview of voice conversion systems[J]. Speech Communication, 2017, 88:65-82.
[2] GONZALVO X, TAZARI S, CHAN C A, et al. Recent advances in Google real-time HMM-driven unit selection synthesizer[C]//Interspeech 2016. 2016:2238-2242.
[3] ZEN H, AGIOMYRGIANNAKIS Y, EGBERTS N, et al. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices[C]//Interspeech 2016. 2016:2273-2277.
[4] TAYLOR P. Text-to-Speech Synthesis[M]. Cambridge:Cambridge University Press, 2009.
[5] WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron:Towards end-to-end speech synthesis[J]. arXiv preprint arXiv:1703.10135, 2017.
[6] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018:4779-4783.
[7] OORD A, DIELEMAN S, ZEN H, et al. WaveNet:A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[8] OORD A, LI Y, BABUSCHKIN I, et al. Parallel WaveNet:Fast high-fidelity speech synthesis[J]. arXiv preprint arXiv:1711.10433, 2017.
[9] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice:Real-time neural text-to-speech[J]. arXiv preprint arXiv:1702.07825, 2017.
[10] ARIK S, DIAMOS G, GIBIANSKY A, et al. Deep Voice 2:Multi-speaker neural text-to-speech[J]. arXiv preprint arXiv:1705.08947, 2017.
[11] PING W, PENG K, CHEN J. ClariNet:Parallel wave generation in end-to-end text-to-speech[J]. arXiv preprint arXiv:1807.07281, 2018.
[12] PRENGER R, VALLE R, CATANZARO B. WaveGlow:A flow-based generative network for speech synthesis[J]. arXiv preprint arXiv:1811.00002, 2018.
[13] OORD A, KALCHBRENNER N, KAVUKCUOGLU K. Pixel recurrent neural networks[J]. arXiv preprint arXiv:1601.06759, 2016.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//31st Annual Conference on Neural Information Processing Systems. NIPS, 2017:5998-6008.
[15] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//28th Annual Conference on Neural Information Processing Systems. NIPS, 2014:3104-3112.
[16] FREEMAN P, VILLEGAS E, KAMALU J. Storytime:End-to-end neural networks for audiobooks[R/OL].[2018-08-28]. http://web.stanford.edu/class/cs224s/reports/Pierce Freeman.pdf.
[17] GRIFFIN D, LIM J. Signal estimation from modified short-time Fourier transform[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2):236-243.
[18] WANG D, ZHANG X W. THCHS-30:A free Chinese speech corpus[J]. arXiv preprint arXiv:1512.01882, 2015.
[19] CHUNG Y A, WANG Y, HSU W N, et al. Semi-supervised training for improving data efficiency in end-to-end speech synthesis[J]. arXiv preprint arXiv:1808.10128, 2018.
[20] KUBICHEK R. Mel-cepstral distance measure for objective speech quality assessment[C]//IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. IEEE, 1993:125-128.