Exploring Discourse-level Features for Audiobook-based Speech Synthesis

Ribeiro, Manuel Sam

Date

2013

Item status

Restricted Access

Abstract

Audiobooks are a rich source of data for speech synthesis. Recent work has focused on this type of data to improve synthetic speech along two essential dimensions: naturalness and expressiveness. In audiobooks, sentences are not spoken in isolation, as they are in traditional speech synthesis databases, which allows us to explore discourse-level effects in synthetic speech. Furthermore, audiobook readers often change their voices to impersonate certain characters or to convey particular emotions related to the text, making their speech more expressive. We should be aware, however, that audiobooks are found data. They were not specifically designed for speech synthesis and may lack the coverage that carefully designed databases would have. Also, especially with freely available materials, audiobook recordings may not have been made in the best conditions, and readers may not be professional voice talents. Regardless, recent work has shown that audiobook data can be a valuable asset. This work conducts an exploratory analysis of audiobook data, focusing on discourse-level phenomena, with the intent of improving the naturalness and expressiveness of synthetic speech. Several topics in these two dimensions are explored, taking the written paragraph as the unit of discourse. In terms of naturalness, we begin by exploring acoustic effects within the paragraph and at its boundaries, focusing on intonational (F0-based) and durational cues in speech production. We continue with the prediction of pause duration from a large audiobook corpus. As for expressiveness, we explore Cluster Adaptive Training (CAT) interpolation weight vectors. We analyze their distributions and propose several text-based features that might help explain their variability. We then build univariate and multivariate regression trees to predict CAT weight vectors from the suggested textual features.
Given the exploratory nature of this work, some analyses and models are more successful than others, and some results inevitably lead to new hypotheses. We conclude with suggestions for future work for each of the observed topics.

URI

http://hdl.handle.net/1842/8616

Collections

Linguistics and English Language Masters thesis collection