Speaker-Independent Silent Speech Recognition from Flesh-Point Articulatory Movements Using an LSTM Neural Network

Myungjong Kim; Beiming Cao; Ted Mau; Jun Wang

doi:10.1109/TASLP.2017.2758999

Research output: Contribution to journal › Article › peer-review

79 Scopus citations

Abstract

Silent speech recognition (SSR) converts nonaudio information such as articulatory movements into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models. The high degree of variability in articulatory patterns across different speakers has been a barrier for developing effective speaker-independent SSR approaches. Speaker-independent SSR approaches, however, are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on tongue and lips with articulatory normalization methods that reduce the interspeaker variation. To minimize the across-speaker physiological differences of the articulators, we propose Procrustes matching-based articulatory normalization by removing locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression and i-vector. In this paper, we adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as an articulatory model to effectively model the articulatory movements with long-range articulatory history. A silent speech dataset with flesh-point articulatory movements was collected using an electromagnetic articulograph from 12 healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on healthy as well as laryngectomy speakers. In addition, BLSTM outperformed the standard deep neural network. The best performance was obtained by the BLSTM with all the three normalization approaches combined.
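The Procrustes matching step named in the abstract removes locational, rotational, and scaling differences between speakers' articulatory shapes. The following is a minimal sketch of generic orthogonal Procrustes alignment in NumPy, not the authors' implementation: the function name `procrustes_align`, the use of a reference shape, and the 2-D point layout (one row per flesh point) are all assumptions made for illustration.

```python
import numpy as np


def procrustes_align(shape, reference):
    """Align `shape` to `reference` (both n_points x 2 arrays).

    Hypothetical helper: a generic Procrustes alignment, assumed here
    for illustration and not taken from the paper.
    """
    def center_and_scale(x):
        x = np.asarray(x, dtype=float)
        x = x - x.mean(axis=0)        # remove locational difference
        return x / np.linalg.norm(x)  # remove scaling difference (unit Frobenius norm)

    s = center_and_scale(shape)
    r = center_and_scale(reference)

    # Orthogonal Procrustes: find R minimizing ||s @ R - r||_F via SVD.
    u, _, vt = np.linalg.svd(s.T @ r)
    rot = u @ vt
    # Keep a proper rotation (det = +1) so the shape is never mirrored.
    if np.linalg.det(rot) < 0:
        u[:, -1] *= -1
        rot = u @ vt

    return s @ rot  # remove rotational difference
```

Under these assumptions, a shape that differs from the reference only by translation, uniform scaling, and rotation maps onto the centered, unit-norm reference. Real articulatory data would also differ in shape, so the alignment would only reduce, not eliminate, the interspeaker variation.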

Original language	English (US)
Pages (from-to)	2323-2336
Number of pages	14
Journal	IEEE/ACM Transactions on Audio Speech and Language Processing
Volume	25
Issue number	12
DOIs	https://doi.org/10.1109/TASLP.2017.2758999
State	Published - Dec 2017

Keywords

Articulatory normalization
Procrustes matching
long short-term memory (LSTM)
silent speech recognition

ASJC Scopus subject areas

Computer Science (miscellaneous)
Acoustics and Ultrasonics
Computational Mathematics
Electrical and Electronic Engineering

Access to Document

10.1109/TASLP.2017.2758999

Cite this

@article{f9380278692d49fe866fcd6270660fdf,

title = "Speaker-Independent Silent Speech Recognition from Flesh-Point Articulatory Movements Using an LSTM Neural Network",

abstract = "Silent speech recognition (SSR) converts nonaudio information such as articulatory movements into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models. The high degree of variability in articulatory patterns across different speakers has been a barrier for developing effective speaker-independent SSR approaches. Speaker-independent SSR approaches, however, are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on tongue and lips with articulatory normalization methods that reduce the interspeaker variation. To minimize the across-speaker physiological differences of the articulators, we propose Procrustes matching-based articulatory normalization by removing locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression and i-vector. In this paper, we adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as an articulatory model to effectively model the articulatory movements with long-range articulatory history. A silent speech dataset with flesh-point articulatory movements was collected using an electromagnetic articulograph from 12 healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on healthy as well as laryngectomy speakers. In addition, BLSTM outperformed the standard deep neural network. The best performance was obtained by the BLSTM with all the three normalization approaches combined.",

keywords = "Articulatory normalization, Procrustes matching, long short-term memory (LSTM), silent speech recognition",

author = "Myungjong Kim and Beiming Cao and Ted Mau and Jun Wang",

note = "Funding Information: Manuscript received December 15, 2016; revised May 22, 2017 and August 7, 2017; accepted September 15, 2017. Date of current version November 27, 2017. This work was supported in parts by the National Institutes of Health (No. R03DC013990) and the American Speech-Language-Hearing Foundation through a New Century Scholar Grant. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Tanja Schultz. (Corresponding author: Myungjong Kim.) M. Kim, B. Cao, and J. Wang are with the Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080 USA (e-mail: myungjong.kim@utdallas.edu; beiming.cao@utdallas.edu; wangjun@utdallas.edu). Publisher Copyright: {\textcopyright} 2014 IEEE.",

year = "2017",

month = dec,

doi = "10.1109/TASLP.2017.2758999",

language = "English (US)",

volume = "25",

pages = "2323--2336",

journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",

issn = "2329-9290",

publisher = "IEEE Advancing Technology for Humanity",

number = "12",

}

TY - JOUR

T1 - Speaker-Independent Silent Speech Recognition from Flesh-Point Articulatory Movements Using an LSTM Neural Network

AU - Kim, Myungjong

AU - Cao, Beiming

AU - Mau, Ted

AU - Wang, Jun

N1 - Funding Information: Manuscript received December 15, 2016; revised May 22, 2017 and August 7, 2017; accepted September 15, 2017. Date of current version November 27, 2017. This work was supported in parts by the National Institutes of Health (No. R03DC013990) and the American Speech-Language-Hearing Foundation through a New Century Scholar Grant. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Tanja Schultz. (Corresponding author: Myungjong Kim.) M. Kim, B. Cao, and J. Wang are with the Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080 USA (e-mail: myungjong.kim@utdallas.edu; beiming.cao@utdallas.edu; wangjun@utdallas.edu). Publisher Copyright: © 2014 IEEE.

PY - 2017/12

Y1 - 2017/12

AB - Silent speech recognition (SSR) converts nonaudio information such as articulatory movements into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models. The high degree of variability in articulatory patterns across different speakers has been a barrier for developing effective speaker-independent SSR approaches. Speaker-independent SSR approaches, however, are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on tongue and lips with articulatory normalization methods that reduce the interspeaker variation. To minimize the across-speaker physiological differences of the articulators, we propose Procrustes matching-based articulatory normalization by removing locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression and i-vector. In this paper, we adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as an articulatory model to effectively model the articulatory movements with long-range articulatory history. A silent speech dataset with flesh-point articulatory movements was collected using an electromagnetic articulograph from 12 healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on healthy as well as laryngectomy speakers. In addition, BLSTM outperformed the standard deep neural network. The best performance was obtained by the BLSTM with all the three normalization approaches combined.

KW - Articulatory normalization

KW - Procrustes matching

KW - long short-term memory (LSTM)

KW - silent speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85040443398&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040443398&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2017.2758999

DO - 10.1109/TASLP.2017.2758999

M3 - Article

C2 - 30271809

AN - SCOPUS:85040443398

SN - 2329-9290

VL - 25

SP - 2323

EP - 2336

JO - IEEE/ACM Transactions on Audio Speech and Language Processing

JF - IEEE/ACM Transactions on Audio Speech and Language Processing

IS - 12

ER -
