Speaker-Aware Multi-Task Learning for Automatic Speech Recognition

Overfitting is a common problem in automatic speech recognition and is especially harmful when the amount of training data is limited. To address this problem, this article investigates acoustic modeling through Multi-Task Learning with two speaker-related auxiliary tasks. Multi-Task Learning is a regularization method that aims to improve a network's generalization ability by training a single model to solve several different but related tasks. In this article, two auxiliary tasks are examined jointly. On the one hand, we consider speaker classification as an auxiliary task, training the acoustic model to recognize the speaker or to find the closest one in the training set. On the other hand, the acoustic model is also trained to extract i-vectors from the standard acoustic features. I-vectors are successfully used in the speaker identification community to characterize a speaker and their acoustic environment. The core idea behind these auxiliary tasks is to give the network additional inter-speaker awareness and thereby reduce overfitting. We investigate this Multi-Task Learning setup on the TIMIT database, with acoustic modeling performed by a Recurrent Neural Network with Long Short-Term Memory cells.
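The joint objective described in the abstract can be illustrated with a minimal sketch of a multi-task loss for a single frame. The abstract does not state which loss functions or task weights were used, so everything here is an assumption made for illustration: cross-entropy for the main phone-state task and for speaker classification, mean squared error for i-vector regression, and the hypothetical weights `speaker_weight` and `ivector_weight`. It is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target_index])


def mean_squared_error(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multitask_loss(phone_logits, phone_target,
                   speaker_logits, speaker_target,
                   ivector_pred, ivector_target,
                   speaker_weight=0.1, ivector_weight=0.1):
    """Weighted sum of the main loss and two auxiliary losses.

    The weights are hypothetical; in practice they would be tuned
    on held-out data.
    """
    main = cross_entropy(phone_logits, phone_target)        # main ASR task
    spk = cross_entropy(speaker_logits, speaker_target)     # auxiliary: speaker classification
    ivec = mean_squared_error(ivector_pred, ivector_target)  # auxiliary: i-vector regression
    total = main + speaker_weight * spk + ivector_weight * ivec
    return total, main, spk, ivec


# Toy frame: 4 phone states, 2 training speakers, 2-dimensional i-vector.
total, main, spk, ivec = multitask_loss(
    [0.0, 0.0, 0.0, 0.0], 2,
    [0.0, 0.0], 1,
    [1.0, 2.0], [1.0, 4.0],
)
```

In a multi-task network the three output layers would share the same hidden layers, so the gradients of the auxiliary terms act as a regularizer on the shared representation. The auxiliary outputs are typically discarded at recognition time.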

Disciplines :

Electrical & electronics engineering
Library & information sciences

Author, co-author :

Pironkov, Gueorgui ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Dupont, Stéphane ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Dutoit, Thierry ; Université de Mons > Faculté Polytechnique > Service Information, Signal et Intelligence artificielle

Language :

English

Title :

Speaker-Aware Multi-Task Learning for Automatic Speech Recognition

Publication date :

04 December 2016

Event name :

International Conference on Pattern Recognition

Event place :

Cancun, Mexico

Event date :

2016

Research unit :

F105 - Information, Signal et Intelligence artificielle

Research institute :

R300 - Institut de Recherche en Technologies de l'Information et Sciences de l'Informatique
R450 - Institut NUMEDIART pour les Technologies des Arts Numériques

Available on ORBi UMONS :

since 16 January 2017
