Unsupervised alignment of natural language with video

Naim, Iftekhar

URL to cite or link to: http://hdl.handle.net/1802/30614

Naim_rochester_0188E_11098.pdf 2.02 MB

PDF of thesis.

Description

Thesis (Ph. D.)--University of Rochester. Department of Computer Science, 2016.

Abstract

Today we encounter large amounts of video data, often accompanied by text descriptions (e.g., cooking videos and recipes, videos of wetlab experiments and protocols, movies and scripts). Extracting meaningful information from these multimodal sequences requires aligning the video frames with the corresponding sentences in the text. Previous methods for connecting language and videos relied on manual annotations, which are often tedious and expensive to collect. In this thesis, we focus on automatically aligning sentences with the corresponding video frames without any direct human supervision.

We first propose two hierarchical generative alignment models, which jointly align each sentence with the corresponding video frames, and each noun in a sentence with the corresponding object in the video frames. Next, we propose several latent-variable discriminative alignment models, which incorporate rich features involving verbs and video actions, and outperform the generative models. We primarily apply our alignment algorithms to aligning biological wetlab videos with text instructions. Furthermore, we extend our alignment models to automatically align movie scenes with associated scripts and to learn word-level translations between language pairs for which bilingual training data is unavailable.

Thesis: By exploiting the temporal ordering constraints between video and associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.

Contributor(s):

Iftekhar Naim - Author

Daniel J. Gildea - Thesis Advisor

Primary Item Type:

Thesis

Identifiers:

Local Call No. AS38.661

Language:

English

Subject Keywords:

Alignment; Computer vision; Language and vision; Natural language processing; Video processing

Sponsor - Description:

Office of Naval Research (ONR) - N00014-11-10417

Intel - ISTC-PC Award

National Science Foundation (NSF) - IIS-1446996; IIS-1449278

First presented to the public:

3/11/2016

Originally created:

2015

Original Publication Date:

2015

Previously Published By:

University of Rochester

Place Of Publication:

Rochester, N.Y.

Extents:

Illustrations - illustrations (some color)

Number of Pages - xvi, 127 pages

License Grantor / Date Granted:

John Dickson / 2016-03-25 15:25:49.702

Date Deposited

2016-03-25 15:25:49.702

Submitter:

John Dickson

All Versions

Name	Version	Created Date
Unsupervised alignment of natural language with video	1	2016-03-25 15:25:49.702
