REAL

Audiovisual speech synthesis based on tongue ultrasound

Csapó, Tamás Gábor (2022) Audiovizuális beszédszintézis nyelvultrahang alapon [Audiovisual speech synthesis based on tongue ultrasound]. BESZÉDTUDOMÁNY / SPEECH SCIENCE, 3 (1). pp. 273-291. ISSN 2732-3773


Abstract

In this study, we present our initial results in audiovisual text-to-speech synthesis (AV-TTS), a subfield of the more general areas of speech synthesis and computer facial animation. The goal of visible speech synthesis is typically to generate face motion or articulatory information (e.g., lip or tongue movement, or velum position). We conduct experiments in text-to-articulation prediction, using ultrasound tongue images as targets. We extend a traditional deep neural network-based text-to-speech synthesis (DNN-TTS) framework to predict ultrasound tongue images, from which the continuous tongue motion can be reconstructed in synchrony with the synthesized speech. The final output is speech and an ultrasound tongue video in ‘wedge’ orientation. We use data from eight English speakers (roughly 200 sentences each) from the UltraSuite-TaL dataset, train several types of deep neural networks, and show that, given the limited training material, simple DNNs are the most suitable for predicting sequential articulatory data. Objective experiments and visualized predictions show that the proposed solution is feasible and that the generated ultrasound videos are mostly close to natural tongue movement, although sometimes oversmoothed. A specific application of audiovisual speech synthesis and text-to-articulation prediction is computer-assisted pronunciation training/computer-aided language learning, which can benefit second-language learners. With such an AV-TTS system, one can enter arbitrary input text, hear the synthesized speech and, in synchrony with it, see (in 2D or 3D) how to move the tongue to produce the target speech sounds. This visual feedback can be helpful for pronunciation training in L2 learning, especially when the target language contains speech sounds that are difficult to articulate (e.g., sounds significantly different from those of the speaker’s mother tongue).
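
As a rough illustration of the text-to-articulation idea in the abstract, the sketch below maps one input feature vector per frame to a small image with a feedforward network and stacks the frames into a video. It is not the author's implementation: the network is untrained, the per-frame feedforward design is an assumption, and all names and sizes (`N_FEATURES`, `N_HIDDEN`, `HEIGHT`, `WIDTH`, `predict_frame`) are hypothetical. The mean squared error helper stands in for an objective measure; the abstract does not say which measure was used.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return small random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Apply one dense layer (weights, biases) to the input vector x."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    if activation is not None:
        out = [activation(v) for v in out]
    return out


def predict_frame(features, layers, height, width):
    """Map one frame's input features to a height x width image (list of rows)."""
    hidden = features
    for layer in layers[:-1]:
        hidden = dense(hidden, layer, math.tanh)
    pixels = dense(hidden, layers[-1])  # linear output layer
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def mean_squared_error(pred, target):
    """Pixel-wise mean squared error between two images of equal size."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


# Hypothetical sizes, chosen only to keep the sketch small.
N_FEATURES, N_HIDDEN, HEIGHT, WIDTH = 16, 32, 8, 8

rng = random.Random(0)
layers = [
    init_layer(N_FEATURES, N_HIDDEN, rng),
    init_layer(N_HIDDEN, N_HIDDEN, rng),
    init_layer(N_HIDDEN, HEIGHT * WIDTH, rng),
]

# Placeholder input: one random feature vector per frame (assumed data).
utterance = [[rng.random() for _ in range(N_FEATURES)] for _ in range(5)]

# The predicted "video" is the sequence of per-frame images.
video = [predict_frame(frame, layers, HEIGHT, WIDTH) for frame in utterance]
```

In a trained system the weights would be fitted on paired text and ultrasound recordings, and the predicted frames would be played back in synchrony with the synthesized audio.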

Item Type: Article
Uncontrolled Keywords: AV-TTS, deep neural networks, DNN, speech technology
Subjects: P Language and Literature > P0 Philology. Linguistics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
T Technology > T2 Technology (General)
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 19 Sep 2023 07:49
Last Modified: 19 Sep 2023 07:49
URI: http://real.mtak.hu/id/eprint/173939
