TY - JOUR
T1 - Multiview representation learning via deep CCA for silent speech recognition
AU - Kim, Myungjong
AU - Cao, Beiming
AU - Mau, Ted
AU - Wang, Jun
N1 - Funding Information:
This work was supported by the National Institutes of Health under award R03DC013990 and by the American Speech-Language-Hearing Foundation through a New Century Scholar Research Grant. We thank Joanna Brown, Betsy Ruiz, Janis Deane, Laura Toles, Amy Hamilton, Se-in Kim, Kristin Teplansky, Katie Purdum, and the volunteering participants.
Publisher Copyright:
Copyright © 2017 ISCA.
PY - 2017
Y1 - 2017
N2 - Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally carry less information than acoustic features for speech recognition, and therefore the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has recently been used successfully in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of multiview representation learning via deep CCA over the CCA-based multiview approach as well as over baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.
AB - Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally carry less information than acoustic features for speech recognition, and therefore the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has recently been used successfully in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of multiview representation learning via deep CCA over the CCA-based multiview approach as well as over baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.
KW - Articulatory movements
KW - Deep canonical correlation analysis
KW - Multiview representation learning
KW - Silent speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85039159046&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039159046&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2017-952
DO - 10.21437/Interspeech.2017-952
M3 - Conference article
AN - SCOPUS:85039159046
SN - 2308-457X
VL - 2017-August
SP - 2769
EP - 2773
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
Y2 - 20 August 2017 through 24 August 2017
ER -