Subspace Gaussian mixture models for automatic speech recognition
Abstract
In most state-of-the-art speech recognition systems, Gaussian mixture models (GMMs)
are used to model the output densities of the emitting states in hidden Markov models
(HMMs). In a conventional system, the parameters of each GMM are estimated
directly and independently given the alignment. This results in a large number of
model parameters to be estimated, and consequently a large amount of training data
is required to fit the model. In addition, the different sources of acoustic variability that
affect the accuracy of a recogniser, such as pronunciation variation, accent, speaker
characteristics and environmental noise, are only weakly modelled and factorised by adaptation
techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori
(MAP) adaptation and vocal tract length normalisation (VTLN). In this thesis,
we will discuss an alternative acoustic modelling approach, the subspace Gaussian
mixture model (SGMM), which is expected to address both of these issues more effectively. In an
SGMM, the model parameters are derived from low-dimensional model and speaker
subspaces that can capture phonetic and speaker correlations. Given these subspaces,
only a small number of state-dependent parameters are required to derive the corresponding
GMMs. Hence, the total number of model parameters can be reduced, which
allows acoustic modelling with a limited amount of training data. In addition, the
SGMM-based acoustic model factorises the phonetic and speaker factors, and within
this framework other sources of acoustic variability may also be explored.
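As a sketch of this parameterisation, following the commonly used SGMM formulation (the notation here is illustrative and may differ from that in the body of the thesis), the output density of state j can be written as

\[
p(\mathbf{x}_t \mid j) = \sum_{i=1}^{I} w_{ji}\,\mathcal{N}\big(\mathbf{x}_t;\ \boldsymbol{\mu}_{ji},\ \boldsymbol{\Sigma}_i\big),
\qquad
\boldsymbol{\mu}_{ji} = \mathbf{M}_i \mathbf{v}_j + \mathbf{N}_i \mathbf{v}^{(s)},
\qquad
w_{ji} = \frac{\exp(\mathbf{w}_i^{\top}\mathbf{v}_j)}{\sum_{i'=1}^{I}\exp(\mathbf{w}_{i'}^{\top}\mathbf{v}_j)},
\]

where \(\mathbf{v}_j\) is a low-dimensional state vector, \(\mathbf{v}^{(s)}\) a speaker vector, and the projections \(\mathbf{M}_i\), \(\mathbf{N}_i\), the weight vectors \(\mathbf{w}_i\) and the full covariances \(\boldsymbol{\Sigma}_i\) are shared globally across states. Only the state vectors (and per-speaker vectors) are state- or speaker-specific, which is what keeps the number of state-dependent parameters small.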
In this thesis, we propose regularised model estimation for SGMMs, which avoids
overtraining when the training data is sparse. We will also take advantage of
the structure of SGMMs to explore cross-lingual acoustic modelling for low-resource
speech recognition. Here, the model subspace is estimated from out-of-domain data and
ported to the target-language system. In this case, only the state-dependent parameters
need to be estimated, which reduces the amount of training data required.
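To illustrate the idea (the specific penalty shown is an assumption for illustration, since the abstract does not state its form), the regularised update of a state vector \(\mathbf{v}_j\) can be viewed as maximising a penalised auxiliary function, for example

\[
\hat{\mathbf{v}}_j = \arg\max_{\mathbf{v}_j}\Big(\mathbf{v}_j^{\top}\mathbf{g}_j - \tfrac{1}{2}\,\mathbf{v}_j^{\top}\mathbf{H}_j\mathbf{v}_j - \lambda\,\lVert\mathbf{v}_j\rVert_1\Big),
\]

where \(\mathbf{g}_j\) and \(\mathbf{H}_j\) are statistics accumulated for state j and \(\lambda\) controls the strength of the penalty. In the cross-lingual setting, the globally shared parameters \(\mathbf{M}_i\), \(\mathbf{w}_i\) and \(\boldsymbol{\Sigma}_i\) are taken from the out-of-domain system, so only these comparatively few state-dependent parameters need to be estimated from the target-language data.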
To improve the robustness of SGMMs against environmental noise, we propose to apply
the joint uncertainty decoding (JUD) technique, which is shown to be both efficient and effective.
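For reference, a standard form of the JUD-compensated likelihood (following the usual joint uncertainty decoding formulation; the symbols here are illustrative) applies, for each regression class r, a feature transform together with a covariance bias:

\[
p(\mathbf{y}_t \mid j, i) \approx |\mathbf{A}_r|\;\mathcal{N}\big(\mathbf{A}_r\mathbf{y}_t + \mathbf{b}_r;\ \boldsymbol{\mu}_{ji},\ \boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_{b,r}\big),
\]

where \(\mathbf{A}_r\), \(\mathbf{b}_r\) and the uncertainty bias \(\boldsymbol{\Sigma}_{b,r}\) are derived from the joint distribution of clean and noisy speech for class r. Because the covariances \(\boldsymbol{\Sigma}_i\) are shared across states in an SGMM, the compensation need only be applied once per Gaussian index i and regression class r rather than per state, which is one reason the approach can be computationally efficient.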
We will report experimental results on the Wall Street Journal (WSJ) database
and the GlobalPhone corpus to evaluate the regularisation and cross-lingual modelling of
SGMMs. Noise compensation using JUD for SGMM acoustic models is evaluated on
the Aurora 4 database.