Open access
Author
Date: 2021
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Autonomous driving has been one of the most challenging and exciting research topics of recent years, inspiring research and development in both academia and industry. The presented work contributes to this active area of research by means of environment perception with a focus on visual cues. The task is to provide the relevant information about the surrounding scene to the decision-making modules of the intelligent vehicle. Typically, several different sensors such as camera, Light Detection and Ranging (LIDAR), and radar sensors are fused to that end, which enables addressing tasks like object recognition and tracking, detection of drivable space, and many more. This thesis argues that these tasks require access to information close to the raw sensor signal in order to deliver accurate and robust scene interpretations. Accessing this raw data separately in each task, however, induces high computational costs in terms of data transfer and processing, in particular in the case of high-resolution visual data. A design that processes this data in each task, furthermore, prevents tracing errors to their root causes and makes assessing the performance of the vehicle’s perception abilities difficult and data intensive. This holds particularly true in the case of DNN-based processing.
To overcome this problem, Stixels are introduced, which constitute a mid-level representation. This mid-level approach effectively combines the advantages of both worlds: first, a compactness close to that of traditional high-level approaches such as object lists, and second, an accurate and complete scene representation comparable to the raw data itself. To that end, a Stixel semi-CRF is developed that harmonizes several attempts to formalize and simplify the Stixel model and combines complementary semantic and depth information at the pixel level in a sound probabilistic formulation. Furthermore, several generalizations and extensions are introduced which allow for the accurate representation of arbitrarily complex scenarios. The combination of the complementary cues from the semantic and depth domains in the joint optimization, which leverages the strengths of the individual inputs, even allows the representation to surpass the accuracy of the raw inputs.
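To make this mid-level notion concrete, the following is a minimal Python sketch of what a Stixel-style scene representation could look like as a data structure. All field names and types are illustrative assumptions for exposition, not the thesis's actual model.

```python
from dataclasses import dataclass

# Illustrative sketch of a Stixel-style mid-level representation.
# Each Stixel is a thin vertical segment of an image column that
# carries fused depth and semantic information. Field names are
# assumptions, not the thesis's actual definitions.

@dataclass
class Stixel:
    column: int          # image column index (in Stixel-width units)
    v_top: int           # top image row of the segment
    v_bottom: int        # bottom image row of the segment
    depth: float         # representative distance in meters
    semantic_class: int  # e.g. road, vehicle, pedestrian, ...
    confidence: float    # posterior from the joint optimization

def scene_as_stixels(columns):
    """A scene is a list of per-column Stixel stacks: far more compact
    than a pixel-level map, far richer than a bare object list."""
    return [stixel for col in columns for stixel in col]
```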
The pixel-level semantic input used in the Stixel representation is obtained using semantic segmentation, a very active area of research in computer vision. Methods in this area, however, have focused solely on the color information delivered by the camera. In this work, it is proposed to apply the same principle, i.e., combining geometric and appearance cues at a low level of processing, to this task as well. To that end, a novel CNN architecture is proposed which brings both inputs together within the network. This turns out to be superior to the standard approaches, which either concatenate the inputs directly or combine the results of two separate networks at the very end.
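The contrast between these fusion strategies can be sketched in a few lines of PyTorch. The hypothetical network below, with assumed layer sizes that are not the thesis's architecture, merges the feature maps of two modality-specific encoders inside the network, rather than concatenating the raw inputs (early fusion) or combining two final predictions (late fusion).

```python
import torch
import torch.nn as nn

class MidLevelFusionNet(nn.Module):
    """Sketch of fusing appearance (RGB) and geometry (depth) inside
    the network. All layer choices are illustrative assumptions."""

    def __init__(self, num_classes=19):
        super().__init__()
        # Separate shallow encoders, one per modality.
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # Mid-level fusion: concatenate feature maps, not raw inputs.
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_classes, 1))

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        return self.fuse(f)  # per-pixel class scores
```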
In addition to the semantic segmentation input, the Stixel representation builds on dense depth maps, which we obtain using stereo vision, naturally requiring a pair of cameras. As an alternative for depth estimation, this dissertation proposes to combine a single camera with a LIDAR sensor, which delivers sparse but extremely accurate depth estimates. In order to generate dense depth maps from the sparse measurements, a depth completion method is introduced which uses the color image as guidance to distribute the depth information throughout the image. This task is addressed by defining an energy function which considers all measurements within the image in order to infer planar elements, and which incorporates smoothness assumptions in a local neighborhood of every pixel. This neighborhood turns out to be a crucial part of the optimization, which is why a novel semantics- and edge-aware geodesic distance is introduced that leverages the available appearance information and is thus able to respect object borders. In doing so, the proposed method is able to reconstruct scenes accurately, including even fine structures such as poles or side mirrors with no or only a few associated LIDAR measurements. Furthermore, an efficient inference algorithm is introduced which allows for real-time depth estimation even on high-resolution images. Thorough experiments with all proposed methods on real-world benchmarks demonstrate the benefits of the novel approaches and examine their most important design choices, parameters, and properties.
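As an illustration of the guidance idea, the following Python sketch computes an image-guided geodesic distance via Dijkstra's algorithm on the pixel grid. The edge-cost weighting and 4-connectivity are assumptions for exposition, not the thesis's exact formulation, which is additionally semantics-aware.

```python
import heapq
import numpy as np

def geodesic_distances(image, seed, beta=10.0):
    """Dijkstra over the 4-connected pixel grid, where the cost of a
    step grows with the color change along it. Distances therefore
    stay small inside homogeneous regions and grow sharply across
    object borders. The weighting 'beta' is an assumed parameter."""
    img = image.astype(np.float32)
    h, w = img.shape[:2]
    dist = np.full((h, w), np.inf)
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[y, x]:
            continue  # stale queue entry
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # spatial step plus appearance change along the step
                step = 1.0 + beta * np.linalg.norm(img[ny, nx] - img[y, x])
                if d + step < dist[ny, nx]:
                    dist[ny, nx] = d + step
                    heapq.heappush(heap, (d + step, (ny, nx)))
    return dist
```

A sparse LIDAR depth value could then be propagated preferentially to pixels at small geodesic distance from its projection, which keeps thin structures such as poles from being smeared into the background.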
Finally, this thesis demonstrates the usage of the proposed Stixel representation in DNN-based processing at the fusion level. To that end, Stixels are treated as a generalized LIDAR point cloud, which enables applying an adapted variant of PointNet (C. R. Qi, Su, Mo, et al. 2017) to the data in order to segment object instances. A method is developed to create Stixel-level ground truth from pixel-level instance segmentation annotations, which is used to train the proposed model. Despite the intended compression and quantization effects of the Stixels, the experiments reveal promising results, opening the door for future research on Stixel-level fusion.
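As a rough illustration of this fusion-level usage, the sketch below packs Stixels (using the illustrative fields from the earlier sketch) into a generalized point cloud array that a PointNet-style network could consume; the feature layout is an assumption, and the adapted PointNet itself is only referenced, not reimplemented here.

```python
import numpy as np

def stixels_to_point_cloud(stixels):
    """Pack each Stixel as one 'point' carrying geometric and semantic
    features, analogous to a LIDAR point with extra channels. The
    exact feature layout is an illustrative assumption."""
    return np.array([
        [s.column, s.v_top, s.v_bottom, s.depth, s.semantic_class]
        for s in stixels
    ], dtype=np.float32)  # shape: (num_stixels, 5)

# A PointNet-style network (C. R. Qi, Su, Mo, et al. 2017) could then
# consume this (N, 5) array to predict an instance label per Stixel,
# just as it would per LIDAR point.
```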
Permanent link: https://doi.org/10.3929/ethz-b-000503399
Publication status: published
External links: Search print copy at ETH Library
Contributors
Examiner: Pollefeys, Marc
Examiner: Franke, Uwe
Examiner: Van Gool, Luc
Examiner: Yu, Fisher
Publisher: ETH Zurich
Subject: Computer vision + scene understanding (artificial intelligence); Autonomous Driving Assistance Systems; stereo vision; MACHINE LEARNING (ARTIFICIAL INTELLIGENCE)
Organisational unit: 03766 - Pollefeys, Marc / Pollefeys, Marc