Unsupervised alignment of natural language with video

URL to cite or link to: http://hdl.handle.net/1802/30614

Naim_rochester_0188E_11098.pdf   2.02 MB
PDF of thesis.
Thesis (Ph. D.)--University of Rochester. Department of Computer Science, 2016.
Today we encounter large amounts of video data, often accompanied by text descriptions (e.g., cooking videos and recipes, videos of wetlab experiments and protocols, movies and scripts). Extracting meaningful information from these multimodal sequences requires aligning the video frames with the corresponding sentences in the text. Previous methods for connecting language and videos relied on manual annotations, which are often tedious and expensive to collect. In this thesis, we focus on automatically aligning sentences with the corresponding video frames without any direct human supervision.

We first propose two hierarchical generative alignment models, which jointly align each sentence with the corresponding video frames, and each noun in a sentence with the corresponding object in the video frames. Next, we propose several latent-variable discriminative alignment models, which incorporate rich features involving verbs and video actions, and outperform the generative models. Our alignment algorithms are primarily applied to align biological wetlab videos with text instructions. Furthermore, we extend our alignment models to automatically align movie scenes with associated scripts and to learn word-level translations between language pairs for which bilingual training data is unavailable.

Thesis: By exploiting the temporal ordering constraints between video and associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.
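
The temporal ordering constraint in the thesis statement can be illustrated with a small dynamic program. The sketch below is not the thesis's generative or discriminative model; it only assumes that a hypothetical score matrix `score[i][j]` (compatibility of sentence i with frame j) is already available, and finds the best assignment of frames to sentences in which sentence indices never decrease over time and every sentence covers at least one frame. The function name `monotonic_align` and the scores are illustrative assumptions.

```python
from typing import List


def monotonic_align(score: List[List[float]]) -> List[int]:
    """Return, for each frame, the index of the sentence it is aligned to.

    score[i][j] is a hypothetical compatibility between sentence i and
    frame j. The alignment is monotonic: it starts at sentence 0, ends at
    the last sentence, and moves forward by at most one sentence per frame.
    """
    n_sent = len(score)
    n_frames = len(score[0])
    if n_frames < n_sent:
        raise ValueError("need at least one frame per sentence")

    neg_inf = float("-inf")
    # best[i][j]: best total score of a monotonic path ending at (i, j).
    best = [[neg_inf] * n_frames for _ in range(n_sent)]
    # back[i][j]: sentence index chosen for frame j - 1 on that path.
    back = [[0] * n_frames for _ in range(n_sent)]
    best[0][0] = score[0][0]

    for j in range(1, n_frames):
        for i in range(min(j, n_sent - 1) + 1):
            stay = best[i][j - 1]
            advance = best[i - 1][j - 1] if i > 0 else neg_inf
            if advance > stay:
                best[i][j] = advance + score[i][j]
                back[i][j] = i - 1
            else:
                best[i][j] = stay + score[i][j]
                back[i][j] = i

    # Trace back from the last sentence at the last frame.
    path = [0] * n_frames
    i = n_sent - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        i = back[i][j]
    return path


if __name__ == "__main__":
    toy_scores = [
        [1.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ]
    print(monotonic_align(toy_scores))
```

In an unsupervised setting the scores themselves would have to be learned rather than given; this sketch only shows how the ordering constraint narrows the space of possible alignments.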
Contributor(s):
Iftekhar Naim - Author

Daniel J. Gildea - Thesis Advisor

Primary Item Type:
Thesis
Identifiers:
Local Call No. AS38.661
Language:
English
Subject Keywords:
Alignment; Computer vision; Language and vision; Natural language processing; Video processing
Sponsor - Description:
Office of Naval Research (ONR) - N00014-11-10417
Intel - ISTC-PC Award
National Science Foundation (NSF) - IIS-1446996; IIS-1449278
First presented to the public:
3/11/2016
Originally created:
2015
Original Publication Date:
2015
Previously Published By:
University of Rochester
Place Of Publication:
Rochester, N.Y.
Citation:
Extents:
Illustrations - illustrations (some color)
Number of Pages - xvi, 127 pages
License Grantor / Date Granted:
John Dickson / 2016-03-25 15:25:49.702
Date Deposited:
2016-03-25 15:25:49.702
Submitter:
John Dickson

Copyright © This item is protected by copyright, with all rights reserved.

All Versions

Name:
Unsupervised alignment of natural language with video
Version:
1
Created Date:
2016-03-25 15:25:49.702