Mandeel, Ali Raheem and Al-Radhi, Mohammed Salah and Csapó, Tamás Gábor (2023) Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis. MULTIMEDIA TOOLS AND APPLICATIONS: AN INTERNATIONAL JOURNAL, 82. pp. 15635-15649. ISSN 1380-7501
Text: Mandeel-et-al-mtap-2022.pdf - Published Version. Available under License Creative Commons Attribution.
Abstract
This paper presents an investigation of speaker adaptation using a continuous vocoder for parametric text-to-speech (TTS) synthesis. For applications that demand low computational complexity, conventional vocoder-based statistical parametric speech synthesis can be preferable. Recent neural vocoders, while capable of remarkable naturalness, still fall short of the requirements for real-time synthesis. We investigate our earlier continuous vocoder, in which the excitation is characterized by two one-dimensional parameters: Maximum Voiced Frequency and continuous fundamental frequency (F0). We show that an average voice can be trained for deep neural network-based TTS using data from nine English speakers. We performed speaker adaptation experiments for each target speaker with 400 utterances (approximately 14 minutes). Using recurrent neural network topologies, we obtained a clear improvement in the quality and naturalness of synthesized speech compared to our previous work. According to the objective measures (Mel-Cepstral Distortion and F0 correlation), the quality of speaker adaptation using Continuous Vocoder-based DNN-TTS is slightly better than that of the WORLD Vocoder-based baseline. The subjective MUSHRA-like listening test also showed that, with Gated Recurrent Unit and Long Short-Term Memory networks, our speaker adaptation technique is almost as natural as the WORLD vocoder. The proposed vocoder, being capable of real-time synthesis, can be used in applications that need fast synthesis.
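The two objective measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, the inputs are assumed to be already time-aligned frame sequences, and dropping the 0th (energy) cepstral coefficient is a common convention assumed here, not something the abstract states.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean Mel-Cepstral Distortion in dB.

    Both inputs are (frames x coefficients) arrays that are assumed to be
    time-aligned already; no alignment (e.g. DTW) is performed here.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must have the same shape")
    # Assumption: skip coefficient 0, which mostly carries frame energy.
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_correlation(ref_f0, syn_f0):
    """Pearson correlation between two equally long F0 contours."""
    ref = np.asarray(ref_f0, dtype=float)
    syn = np.asarray(syn_f0, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must have the same shape")
    return float(np.corrcoef(ref, syn)[0, 1])
```

Lower distortion and higher correlation indicate that the synthesized features are closer to the reference. Because a continuous F0 contour has a value in every frame, no voiced/unvoiced masking is applied in this sketch.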
Item Type: | Article |
---|---|
Additional Information: | Funding Agency and Grant Number: APH-ALARM project - European Commission [2019-2.1.2-NEMZ-2020-00012]; National Research, Development and Innovation Office of Hungary; European Union [RRF-2.3.1-21-2022-00004]; Ministry of Innovation and Technology; National Research, Development and Innovation Office; Bolyai Janos Research Fellowship of the Hungarian Academy of Sciences; New National Excellence Program of the Ministry for Innovation and Technology [uNKP-21-5, uNKP-21-5-BME-352]. Funding text: The research was partially sponsored by the APH-ALARM project (contract 2019-2.1.2-NEMZ-2020-00012), funded by the European Commission and the National Research, Development and Innovation Office of Hungary, and supported by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory. The research reported in this publication, carried out by the Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, and IdomSoft Ltd., was supported by the Ministry of Innovation and Technology and the National Research, Development and Innovation Office within the framework of the National Laboratory of Infocommunication and Information Technology. Tamas Gabor Csapo's research was supported by the Bolyai Janos Research Fellowship of the Hungarian Academy of Sciences and by the uNKP-21-5 (identifier: uNKP-21-5-BME-352) New National Excellence Program of the Ministry for Innovation and Technology from the source of the National Research, Development and Innovation Fund. The Titan X GPU used was donated by NVIDIA Corporation. We would like to thank the subjects for participating in the listening test. |
Uncontrolled Keywords: | Speech synthesis, RNN, TTS, Continuous vocoder |
Subjects: | P Language and Literature / nyelvészet és irodalom > P0 Philology. Linguistics / filológia, nyelvészet; Q Science / természettudomány > QA Mathematics / matematika > QA75 Electronic computers. Computer science / számítástechnika, számítógéptudomány; T Technology / alkalmazott, műszaki tudományok > T2 Technology (General) / műszaki tudományok általában |
SWORD Depositor: | MTMT SWORD |
Depositing User: | MTMT SWORD |
Date Deposited: | 19 Sep 2023 08:08 |
Last Modified: | 19 Sep 2023 08:08 |
URI: | http://real.mtak.hu/id/eprint/173943 |