TY - JOUR
T1 - Articulation-to-speech synthesis using articulatory flesh point sensors' orientation information
AU - Cao, Beiming
AU - Kim, Myungjong
AU - Wang, Jun R.
AU - Van Santen, Jan
AU - Mau, Ted
AU - Wang, Jun
N1 - Funding Information:
This work was supported by the National Institutes of Health (NIH) under award number R03DC013990 and by the American Speech-Language-Hearing Foundation through a New Century Scholar Research Grant. We thank all volunteer listeners (judges) in the subjective evaluation.
Publisher Copyright:
© 2018 International Speech Communication Association. All rights reserved.
PY - 2018
Y1 - 2018
AB - Articulation-to-speech (ATS) synthesis generates an audio waveform directly from articulatory information. Current work in ATS has used only articulatory movement information (spatial coordinates); the orientation information of articulatory flesh points has rarely been used, although some devices (e.g., electromagnetic articulography) provide it. Previous work indicated that orientation information carries significant information for speech production. In this paper, we explored the performance of applying orientation information of flesh points on the articulators (i.e., tongue, lips, and jaw) in ATS. Experiments using the articulators' movement information with and without orientation information were conducted using standard deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). Both objective and subjective evaluations indicated that adding orientation information of flesh points on the articulators, in addition to movement information, generated higher-quality speech output than using movement information alone.
KW - Articulation-to-speech synthesis
KW - Deep neural network
KW - Orientation information
UR - http://www.scopus.com/inward/record.url?scp=85054962302&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054962302&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-2484
DO - 10.21437/Interspeech.2018-2484
M3 - Conference article
AN - SCOPUS:85054962302
SN - 2308-457X
VL - 2018-September
SP - 3152
EP - 3156
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018
Y2 - 2 September 2018 through 6 September 2018
ER -