Title:
Building agents that can see, talk, and act

Author(s)
Das, Abhishek
Advisor(s)
Batra, Dhruv
Abstract
A long-term goal in AI is to build general-purpose intelligent agents that simultaneously possess the ability to perceive the rich visual environment around us (through vision, audition, or other sensors), reason and infer from perception in an interpretable and actionable manner, communicate this understanding to humans and other agents (e.g., hold a natural language dialog grounded in the environment), and act on this understanding in physical worlds (e.g., aid humans by executing commands in an embodied environment). To make progress towards this grand goal, we must explore new multimodal AI tasks, move from datasets to physical environments, and build new kinds of models. In this dissertation, we combine insights from different areas of AI -- computer vision, language understanding, reinforcement learning -- and present steps to connect the underlying domains of vision and language to actions towards such general-purpose agents. In Part 1, we develop agents that can see and talk -- capable of holding free-form conversations about images -- and reinforcement learning-based algorithms to train these visual dialog agents via self-play. In Part 2, we extend our focus to agents that can see, talk, and act -- embodied agents that can actively perceive and navigate in partially-observable simulated environments, to accomplish tasks such as question-answering. In Part 3, we devise techniques for training populations of agents that can communicate with each other, to coordinate, strategize, and utilize their combined sensory experiences and act in the physical world. These agents learn both what messages to send and who to communicate with, solely from downstream reward without any communication supervision. Finally, in Part 4, we use question-answering as a task-agnostic probe to ask a self-supervised embodied agent what it knows about its physical world, and use it to quantify differences in the visual representations agents develop when trained with different auxiliary objectives.
Date Issued
2020-04-25
Resource Type
Text
Resource Subtype
Dissertation