Advances in scene understanding: object detection, reconstruction, layouts, and inference

Henderson, Paul Matthew

Date

01/07/2019

Abstract

The goal of scene understanding is to capture the full content of an image in a human-interpretable representation. This must describe the different objects present, including their attributes such as class, shape, and pose, as well as the relations between objects. Moreover, the representation should be globally consistent across the entire image. In this thesis, we consider four sub-tasks within scene understanding, and make contributions to each.

When describing the content of an image, it is natural to start by detecting all the objects that are present, that is, localising and classifying them. Our first contribution is to show how to train a neural-network-based object class detector end-to-end in a principled fashion, using the evaluation metric as the training loss, and using the same pipeline at both training and test time. This is simpler and more elegant than the traditional approach of using a surrogate loss, yet we show it achieves comparable performance.

Once the location and class of an object are known, we can estimate its shape and pose in 3D space. Our second contribution is a new approach to these tasks, which supports training purely from 2D images, without 3D supervision, multiple views, or annotations such as pose or keypoints. Moreover, this model is generative, and so allows sampling new object shapes a priori.

To produce a globally consistent description of a scene, it is important to reason over all objects simultaneously, rather than considering each individually. Our third contribution is a probabilistic generative model over complete indoor scene layouts. It models complex arrangements in 3D space, including high-order spatial relations among furniture and other objects.

One common approach to generating predictions that are consistent over all objects in a scene, or pixels in an image, is to formulate and solve a discrete energy minimisation problem. The energy is defined as a sum over factors, and the factor structure greatly affects which minimisation algorithms work well. Our fourth contribution is a method that automatically selects a suitable algorithm to solve a given energy minimisation problem. To do so, it learns to predict the best algorithm based on characteristics of the problem instance.
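
To make the phrase "a sum over factors" concrete, the following is a minimal illustrative sketch and is not taken from the thesis. It assumes a hypothetical toy problem with three binary variables, two unary factors, and two pairwise factors; all names and costs are invented for illustration. Exhaustive search stands in for a real minimisation algorithm and is only feasible for very small problems.

```python
from itertools import product

NUM_VARS = 3      # hypothetical toy problem: three variables
NUM_LABELS = 2    # each variable takes a label in {0, 1}

# Each factor is (variables it touches, cost function over their labels).
# All costs below are made up for illustration.
factors = [
    ((0,), lambda a: 0 if a == 0 else 2),        # unary: variable 0 prefers label 0
    ((2,), lambda c: 0 if c == 1 else 2),        # unary: variable 2 prefers label 1
    ((0, 1), lambda a, b: 0 if a == b else 1),   # pairwise: weak agreement between 0 and 1
    ((1, 2), lambda b, c: 0 if b == c else 2),   # pairwise: stronger agreement between 1 and 2
]

def energy(labelling):
    """Total energy: the sum of every factor evaluated on its own variables."""
    return sum(cost(*(labelling[v] for v in scope)) for scope, cost in factors)

# Brute-force minimisation: enumerate all NUM_LABELS ** NUM_VARS labellings.
best = min(product(range(NUM_LABELS), repeat=NUM_VARS), key=energy)
```

In this toy instance the factor graph is a chain, a structure that exact methods such as dynamic programming handle well; problems with loops or higher-order factors generally call for other solvers, which is the kind of dependence on factor structure the abstract refers to.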

URI

http://hdl.handle.net/1842/35600

Collections

Informatics thesis and dissertation collection