Exploring Discourse-level Features for Audiobook-based Speech Synthesis
Ribeiro 2013 MSc.pdf (1.149Mb)
Date
2013
Item status
Restricted Access
Author
Ribeiro, Manuel Sam
Abstract
Audiobooks are a powerful source of rich information for speech synthesis. Recent
work has focused on this type of data to improve synthetic speech along two
essential dimensions: naturalness and expressiveness. In audiobooks, sentences
are not spoken in isolation, as they are in traditional speech synthesis databases,
which allows us to explore discourse-level effects in synthetic speech. Furthermore, audiobook
readers often change their voices to impersonate certain characters or to
convey particular emotions related to the text, essentially making their speech
more expressive.
We should be aware, however, that audiobooks are found data. They were not
specifically designed for speech synthesis, and may lack the coverage that carefully
designed databases would have. Also, especially with freely available materials,
audiobook recordings may not have been made under the best conditions and
readers may not be professional voice talents.
Regardless, recent work has shown that audiobook data can be a valuable
asset. This work conducts an exploratory analysis of audiobook data, focusing
on discourse-level phenomena, with the intent of improving the naturalness and
expressiveness of synthetic speech. Several topics in these two dimensions are
explored, treating the written paragraph as a unit of discourse.
In terms of naturalness, we begin by exploring acoustic effects within the paragraph
and at its boundaries, focusing on intonational (F0-based) and durational
cues in speech production. We continue with the prediction of pause duration
from the large audiobook corpus.
As for expressiveness, we explore Cluster Adaptive Training (CAT) interpolation
weight vectors. We analyze their distributions and propose several text-based
features that might help explain their variability. We then build univariate and
multivariate regression trees in order to predict CAT weight vectors from the
suggested textual features.
Given the exploratory nature of this work, some analyses and models are more
successful than others, and some results inevitably lead to new hypotheses. We
conclude with suggestions for future work on each of the topics explored.