Multiview representation learning via deep CCA for silent speech recognition

Myungjong Kim; Beiming Cao; Ted Mau; Jun Wang

doi:10.21437/Interspeech.2017-952

Research output: Contribution to journal › Conference article › peer-review

7 Scopus citations

Abstract

Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally have less information than acoustic features for speech recognition, and therefore, the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has been recently successfully used in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of the multiview representation learning via deep CCA over the CCA-based multiview approach as well as baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.
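
As a rough illustration of the multiview idea described in the abstract, the sketch below fits plain linear CCA on two synthetic views and then projects only the first view at test time, mirroring the setting where acoustic data are available during training but not during recognition. It is not the authors' implementation: the function name `fit_linear_cca`, the ridge term `reg`, the feature dimensions, and the synthetic data are all assumptions made for this example. Deep CCA, as used in the paper, would replace the linear projections with deep neural networks.

```python
import numpy as np


def fit_linear_cca(X, Y, k, reg=1e-6):
    """Fit linear CCA between view X (n, dx) and view Y (n, dy).

    Returns both view means, both projection matrices, and the top-k
    canonical correlations. `reg` is a small ridge term (an assumption
    of this sketch) that keeps the covariance matrices invertible.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my

    # Regularized within-view covariances and the cross-view covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    A = Sxx_is @ U[:, :k]      # projection for view X
    B = Syy_is @ Vt.T[:, :k]   # projection for view Y
    return mx, my, A, B, s[:k]


# Synthetic stand-ins: X plays the articulatory view, Y the acoustic view.
# Both are noisy linear functions of a shared 2-D latent signal z.
rng = np.random.default_rng(0)
n_train, n_test = 500, 50
n_all = n_train + n_test
z = rng.standard_normal((n_all, 2))
X = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((n_all, 6))
Y = z @ rng.standard_normal((2, 13)) + 0.1 * rng.standard_normal((n_all, 13))

# Training uses both views.
mx, my, A, B, corrs = fit_linear_cca(X[:n_train], Y[:n_train], k=2)

# At test time only the first view is needed to compute the representation.
Z_test = (X[n_train:] - mx) @ A
print(corrs, Z_test.shape)
```

In a recognition pipeline, features such as `Z_test` would be passed to the recognizer in place of, or alongside, the raw articulatory features; which of these the paper does is not stated in the abstract.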

Original language	English (US)
Pages (from-to)	2769-2773
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2017-August
DOIs	https://doi.org/10.21437/Interspeech.2017-952
State	Published - 2017
Event	18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration	Aug 20 2017 → Aug 24 2017

Keywords

Articulatory movements
Deep canonical correlation analysis
Multiview representation learning
Silent speech recognition

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Access to Document

10.21437/Interspeech.2017-952

Cite this

@article{c0e84da23cd743a09282009db950049d,

title = "Multiview representation learning via deep CCA for silent speech recognition",

abstract = "Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally have less information than acoustic features for speech recognition, and therefore, the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has been recently successfully used in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of the multiview representation learning via deep CCA over the CCA-based multiview approach as well as baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.",

keywords = "Articulatory movements, Deep canonical correlation analysis, Multiview representation learning, Silent speech recognition",

author = "Myungjong Kim and Beiming Cao and Ted Mau and Jun Wang",

note = "Publisher Copyright: Copyright {\textcopyright} 2017 ISCA.; 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 ; Conference date: 20-08-2017 Through 24-08-2017",

year = "2017",

doi = "10.21437/Interspeech.2017-952",

language = "English (US)",

volume = "2017-August",

pages = "2769--2773",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Multiview representation learning via deep CCA for silent speech recognition

AU - Kim, Myungjong

AU - Cao, Beiming

AU - Mau, Ted

AU - Wang, Jun

PY - 2017

Y1 - 2017

N2 - Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally have less information than acoustic features for speech recognition, and therefore, the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has been recently successfully used in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of the multiview representation learning via deep CCA over the CCA-based multiview approach as well as baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.

AB - Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally have less information than acoustic features for speech recognition, and therefore, the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has been recently successfully used in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of the multiview representation learning via deep CCA over the CCA-based multiview approach as well as baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.

KW - Articulatory movements

KW - Deep canonical correlation analysis

KW - Multiview representation learning

KW - Silent speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85039159046&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039159046&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2017-952

DO - 10.21437/Interspeech.2017-952

M3 - Conference article

AN - SCOPUS:85039159046

SN - 2308-457X

VL - 2017-August

SP - 2769

EP - 2773

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017

Y2 - 20 August 2017 through 24 August 2017

ER -
