REAL

Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping

Csapó, Tamás Gábor and Gosztolya, Gábor and Tóth, László and Honarmandi Shandiz, Amin and Markó, Alexandra (2022) Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping. SENSORS, 22 (22). No-8601. ISSN 1424-8220

Csapo-et-al-sensors-2022.pdf - Published Version
Available under License Creative Commons Attribution.


Abstract

Within speech processing, articulatory-to-acoustic mapping (AAM) methods can apply ultrasound tongue imaging (UTI) as an input. (Micro)convex transducers are mostly used, which provide a wedge-shaped visual image. However, this process is optimized for visual inspection by the human eye, and the signal is often post-processed by the equipment. With newer ultrasound equipment, it is now possible to gain access to the raw scanline data (i.e., ultrasound echo return) without any internal post-processing. In this study, we compared the raw scanline representation with the wedge-shaped processed UTI as the input for a residual network applied for AAM, and we also investigated the optimal size of the input image. We found no significant differences between the performance attained using the raw data and the wedge-shaped image extrapolated from it. We found the optimal pixel size to be 64 × 43 in the case of the raw scanline input, and 64 × 64 when transformed to a wedge. Therefore, it is not necessary to use the full original 64 × 842 pixel raw scanline image; a smaller image is sufficient. This allows for the building of smaller networks, and will be beneficial for the development of session- and speaker-independent methods for practical applications. AAM systems have the target application of a “silent speech interface”, which could be helpful for the communication of the speaking-impaired, in military applications, or in extremely noisy conditions.
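The two input representations compared in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the 90-degree field of view, the inner-radius fraction, and the linear and nearest-neighbour interpolation are all assumptions chosen for illustration. Only the array sizes (64 × 842 raw, 64 × 43 reduced, 64 × 64 wedge) come from the abstract.

```python
import numpy as np

def resize_scanlines(raw, new_depth):
    """Downsample every scanline along the depth axis by linear interpolation."""
    depth = raw.shape[1]
    src = np.arange(depth)
    dst = np.linspace(0, depth - 1, new_depth)
    return np.stack([np.interp(dst, src, line) for line in raw])

def scanlines_to_wedge(raw, out_size=64, fov_deg=90.0, r_min_frac=0.2):
    """Map raw scanlines (lines x depth) onto a square wedge-shaped image.

    Geometry is hypothetical: the scanlines fan out from a virtual apex over
    fov_deg degrees, and echoes start at r_min_frac of the maximum radius.
    """
    n_lines, depth = raw.shape
    half = np.deg2rad(fov_deg) / 2.0
    # Cartesian grid covering the bounding box of the wedge (radius normalized to 1).
    xs = np.linspace(-np.sin(half), np.sin(half), out_size)
    ys = np.linspace(r_min_frac * np.cos(half), 1.0, out_size)
    x, y = np.meshgrid(xs, ys)
    r = np.hypot(x, y)
    theta = np.arctan2(x, y)  # angle measured from the central scanline
    inside = (r >= r_min_frac) & (r <= 1.0) & (np.abs(theta) <= half)
    # Nearest scanline and nearest echo sample for every output pixel.
    line_idx = np.rint((theta + half) / (2 * half) * (n_lines - 1)).astype(int)
    depth_idx = np.rint((r - r_min_frac) / (1.0 - r_min_frac) * (depth - 1)).astype(int)
    line_idx = np.clip(line_idx, 0, n_lines - 1)
    depth_idx = np.clip(depth_idx, 0, depth - 1)
    wedge = np.zeros((out_size, out_size), dtype=raw.dtype)
    wedge[inside] = raw[line_idx[inside], depth_idx[inside]]
    return wedge

# Synthetic stand-in for one raw frame: 64 scanlines x 842 echo samples.
rng = np.random.default_rng(0)
raw = rng.random((64, 842)) + 1.0   # strictly positive, so empty pixels stay 0
small = resize_scanlines(raw, 43)   # reduced raw input, 64 x 43
wedge = scanlines_to_wedge(raw)     # wedge-shaped input, 64 x 64
```

Pixels outside the fan are left at zero, which is why the wedge image carries no more information than the scanline array it is derived from.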

Item Type: Article
Additional Information: Funding Agency and Grant Number: European Commission [20192.1.2-NEMZ-2020-00012]; National Research, Development and Innovation Office of Hungary [FK 142163]; Bolyai János Research Fellowship of the Hungarian Academy of Sciences; New National Excellence Program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund [UNKP-22-5-BME-316]; Hungarian Ministry of Innovation and Technology NRDI Office [TKP2021-NVA-09]; Artificial Intelligence National Laboratory [RRF-2.3.1-21-2022-00004]
Funding text: T.G. Csapó's research was partly supported by the APH-ALARM project (contract 20192.1.2-NEMZ-2020-00012) funded by the European Commission and the National Research, Development and Innovation Office of Hungary (FK 142163 grant), by the Bolyai János Research Fellowship of the Hungarian Academy of Sciences, and by the UNKP-22-5-BME-316 New National Excellence Program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund. The work of G. Gosztolya and L. Tóth was also supported by the Hungarian Ministry of Innovation and Technology NRDI Office (grant TKP2021-NVA-09) and by the Artificial Intelligence National Laboratory (RRF-2.3.1-21-2022-00004).
Uncontrolled Keywords: speech processing; ultrasound imaging; deep learning
Subjects: P Language and Literature / nyelvészet és irodalom > P0 Philology. Linguistics / filológia, nyelvészet
Q Science / természettudomány > QA Mathematics / matematika > QA75 Electronic computers. Computer science / számítástechnika, számítógéptudomány
T Technology / alkalmazott, műszaki tudományok > T2 Technology (General) / műszaki tudományok általában
SWORD Depositor: MTMT SWORD
Depositing User: MTMT SWORD
Date Deposited: 19 Sep 2023 08:13
Last Modified: 19 Sep 2023 08:13
URI: http://real.mtak.hu/id/eprint/173944
