Altering speech synthesis prosody through real time natural gestural control
Date: 2013
Author: Abelman, David
Abstract
A significant amount of research has been, and continues to be, undertaken into generating
expressive prosody in speech synthesis. Separately, recent developments in
HMM-based synthesis (specifically pHTS, developed at the University of Mons) provide
a platform for reactive speech synthesis, able to respond in real time to its surroundings or to
user interaction.
Considering both of these elements, this project explores whether it is possible to
generate superior prosody in a speech synthesis system, using natural gestural controls,
in real time. Building on a previous piece of work undertaken at the University of Edinburgh,
a system is constructed in which a user may apply a variety of prosodic effects
in real time through natural gestures, recognised by a Microsoft Kinect sensor. Gestures
are detected, and the corresponding prosodic adjustments applied, through a series of hand-crafted
rules based on data gathered in preliminary experiments; machine learning
techniques are also considered within this project and recommended for future iterations
of the work.
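To make the rule-based approach concrete, the sketch below shows what a hand-crafted mapping from gesture features to prosodic adjustments might look like. The gesture labels, feature thresholds, and scaling factors here are illustrative assumptions, not the values used in the thesis.

```python
# Hypothetical sketch of a rule-based gesture-to-prosody pipeline.
# All thresholds, gesture names, and prosodic parameters are assumed
# for illustration; the thesis's actual rules may differ.

def classify_gesture(hand_height, hand_speed):
    """Map simple hand features (e.g. from a Kinect skeleton frame)
    to a gesture label via fixed thresholds."""
    if hand_speed > 1.5:       # fast movement -> treat as an emphasis gesture
        return "emphasis"
    if hand_height > 0.3:      # hand raised above a reference point -> raise pitch
        return "raise_pitch"
    return "neutral"

def prosody_adjustment(gesture):
    """Translate a gesture label into multiplicative adjustments to
    F0 and phone durations, applied to the synthesiser's parameter stream."""
    rules = {
        "emphasis":    {"f0_scale": 1.20, "duration_scale": 1.30},
        "raise_pitch": {"f0_scale": 1.15, "duration_scale": 1.00},
        "neutral":     {"f0_scale": 1.00, "duration_scale": 1.00},
    }
    return rules[gesture]
```

In a reactive synthesiser such as pHTS, adjustments of this kind would be applied to the upcoming frames of the parameter stream, which is what allows the effect to land on the word currently being spoken.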
Two sets of formal experiments are implemented, both of which suggest that, with
further development, the system may work successfully in a real-world
environment. Firstly, user tests show that subjects can learn to control the device successfully,
adding prosodic effects to the intended words in the majority of cases with
practice. Results are likely to improve further once buffering issues are resolved. Secondly,
listening tests show that the prosodic effects currently implemented significantly
increase perceived naturalness, and in some cases can alter the semantic perception
of a sentence in the intended way.
Alongside this paper, a demonstration video of the project may be found on the accompanying
CD, or online at http://tinyurl.com/msc-synthesis. The reader is advised
to view this demonstration to understand how the system functions and
sounds in action.