Tacotron 2, a disruptively designed end-to-end speech synthesis system, currently supports only English. This paper improves Tacotron 2 in several respects and presents a Chinese speech synthesis scheme comprising: a pre-processing module that converts Chinese characters into phonetic symbols, addressing the facts that Chinese characters do not encode pronunciation and exhibit tone sandhi and polyphony; a pre-trained decoder that achieves good sound quality despite the shortage of existing Chinese training corpora; a strategy that weights the cross-entropy loss and replaces the linear transformation with a multi-layer perceptron for stop-token prediction, which effectively alleviates premature pauses in synthesized Chinese speech; and a multi-head attention mechanism that further improves synthesis quality. Experimental comparisons of mel spectrograms and the mel-cepstral distance (MCD) confirm the effectiveness of the scheme: it adapts Tacotron 2 well to the requirements of Chinese speech synthesis.
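The pre-processing step maps each Chinese character to a tone-numbered phonetic (pinyin) token before it reaches the Tacotron 2 encoder. The sketch below illustrates the idea with a hypothetical toy lexicon; a real system would use a full grapheme-to-phoneme dictionary (e.g. the pypinyin library) plus rules for tone sandhi and polyphone disambiguation, neither of which is shown here.

```python
# Minimal sketch of Chinese-to-phonetic pre-processing.
# TOY_LEXICON is a hypothetical stand-in for a full pronunciation
# dictionary; tone sandhi and polyphone handling are omitted.

TOY_LEXICON = {
    "你": "ni3", "好": "hao3", "中": "zhong1", "文": "wen2",
}

def to_phonetic(text: str) -> list:
    """Convert a Chinese string to a flat list of pinyin tokens."""
    return [TOY_LEXICON.get(ch, "<unk>") for ch in text]

print(to_phonetic("你好"))  # ['ni3', 'hao3']
```

Feeding tone-numbered pinyin rather than raw characters gives the encoder a small, phonetically meaningful symbol set, which is the core motivation of this module.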
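The stop-token changes can be sketched as two pieces: a small multi-layer perceptron that maps the decoder state to a scalar stop logit (in place of Tacotron 2's single linear layer), and a binary cross-entropy loss that up-weights the rare positive "stop" frames. All shapes and the weight value below are assumptions for illustration, not the paper's actual hyperparameters.

```python
import numpy as np

def mlp_stop_logit(h, W1, b1, W2, b2):
    """Two-layer perceptron: decoder state -> scalar stop logit
    (replacing a single linear transformation)."""
    return np.tanh(h @ W1 + b1) @ W2 + b2

def weighted_bce(logit, target, pos_weight=1.0):
    """Binary cross-entropy with a higher weight on positive (stop)
    frames; pos_weight > 1 counters premature stops."""
    p = 1.0 / (1.0 + np.exp(-logit))                # sigmoid
    w = np.where(target == 1.0, pos_weight, 1.0)    # per-frame weight
    return float(np.mean(-w * (target * np.log(p + 1e-9)
                               + (1 - target) * np.log(1 - p + 1e-9))))

# Hypothetical demo values: a stop frame that the model is unsure about.
loss_weighted = weighted_bce(np.array([0.0]), np.array([1.0]), pos_weight=5.0)
loss_plain = weighted_bce(np.array([0.0]), np.array([1.0]), pos_weight=1.0)
```

Because stop frames are vastly outnumbered by non-stop frames in each utterance, the weighting prevents the predictor from firing too early, which is the "sudden pause" failure the abstract describes.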
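The objective evaluation uses the mel-cepstral distance (MCD) between reference and synthesized speech. A minimal sketch of the standard MCD computation over time-aligned mel-cepstrum sequences is shown below; the alignment step (typically dynamic time warping) is assumed to have been done already.

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """MCD in dB between two aligned mel-cepstrum sequences of shape
    (frames, coefficients), excluding the 0th (energy) coefficient."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

A lower MCD indicates that the synthesized spectrum is closer to the reference, which is how the comparison in the abstract quantifies the improvement.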