Text-based Editing of Talking-head Video

Fried, Ohad; Tewari, Ayush; Zollhöfer, Michael; Finkelstein, Adam; Shechtman, Eli; Goldman, Dan B.; Genova, Kyle; Jin, Zeyu; Theobalt, Christian; Agrawala, Maneesh

Released

Paper

MPS-Authors

/persons/resource/persons206546

Tewari, Ayush
Computer Graphics, MPI for Informatics, Max Planck Society;

/persons/resource/persons45610

Theobalt, Christian
Computer Graphics, MPI for Informatics, Max Planck Society;

External Resource

There are no locators available

Fulltext (public)

arXiv:1906.01524.pdf
(Preprint), 11MB

Supplementary Material (public)

There is no public supplementary material available

Citation

Fried, O., Tewari, A., Zollhöfer, M., Finkelstein, A., Shechtman, E., Goldman, D. B., Genova, K., Jin, Z., Theobalt, C., & Agrawala, M. (2019). Text-based Editing of Talking-head Video. Retrieved from http://arxiv.org/abs/1906.01524.

Cite as: https://hdl.handle.net/21.11116/0000-0003-FE15-8

Abstract

Editing talking-head video to change the speech content or to remove filler
words is challenging. We propose a novel method to edit talking-head video
based on its transcript to produce a realistic output video in which the
dialogue of the speaker has been modified, while maintaining a seamless
audio-visual flow (i.e. no jump cuts). Our method automatically annotates an
input talking-head video with phonemes, visemes, 3D face pose and geometry,
reflectance, expression and scene illumination per frame. To edit a video, the
user has to only edit the transcript, and an optimization strategy then chooses
segments of the input corpus as base material. The annotated parameters
corresponding to the selected segments are seamlessly stitched together and
used to produce an intermediate video representation in which the lower half of
the face is rendered with a parametric face model. Finally, a recurrent video
generation network transforms this representation to a photorealistic video
that matches the edited transcript. We demonstrate a large variety of edits,
such as the addition, removal, and alteration of words, as well as convincing
language translation and full sentence synthesis.