Open access
Author
Date: 2021
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Autonomous driving has been one of the most challenging and exciting research topics of recent years, inspiring research and development in both academia and industry. The presented work contributes to this active area of research by means of environment perception with a focus on visual cues. The task is to provide the relevant information about the surrounding scene to the decision-making modules of the intelligent vehicle. Typically, several different sensors such as camera, Light Detection and Ranging (LIDAR), and radar sensors are fused to that end, which enables addressing tasks like object recognition and tracking, detection of drivable space, and many more. This thesis argues that these tasks require access to information close to the raw sensor signal in order to deliver accurate and robust scene interpretations. Accessing this raw data separately in each task, however, induces high computational costs in terms of data transfer and processing, in particular in the case of high-resolution visual data. A design that processes this data in each task, furthermore, prevents tracing errors to their root causes and makes assessing the performance of the vehicle’s perception abilities difficult and data intensive. This holds particularly true in the case of DNN-based processing.
To overcome this problem, Stixels are introduced, which constitute a mid-level representation. This mid-level approach effectively combines the advantages of both worlds: first, a compactness close to that of traditional high-level approaches such as object lists, and second, an accurate and complete scene representation comparable to the raw data itself. To that end, a Stixel semi-CRF is developed that harmonizes several attempts to formalize and simplify the Stixel model and combines complementary semantic and depth information at the pixel level in a sound probabilistic formulation. Furthermore, several generalizations and extensions are introduced which allow for the accurate representation of arbitrarily complex scenarios. The combination of the complementary cues from the semantic and depth domains in the joint optimization, which leverages the strengths of the individual inputs, even allows the representation to surpass the accuracy of the raw inputs.
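To make this mid-level notion concrete, the following is a minimal Python sketch of what a Stixel-style scene representation could look like as a data structure. All field names and types are illustrative assumptions for exposition, not the thesis's actual model.

```python
from dataclasses import dataclass

# Illustrative sketch of a Stixel-style mid-level representation.
# Each Stixel is a thin vertical segment of an image column that
# carries fused depth and semantic information. Field names are
# assumptions, not the thesis's actual definitions.

@dataclass
class Stixel:
    column: int          # image column index (in Stixel-width units)
    v_top: int           # top image row of the segment
    v_bottom: int        # bottom image row of the segment
    depth: float         # representative distance in meters
    semantic_class: int  # e.g. road, vehicle, pedestrian, ...
    confidence: float    # posterior from the joint optimization

def scene_as_stixels(columns):
    """A scene is a list of per-column Stixel stacks: far more compact
    than a pixel-level map, far richer than a bare object list."""
    return [stixel for col in columns for stixel in col]
```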
The pixel-level semantic input used in the Stixel representation is obtained using semantic segmentation, a very active area of research in computer vision. Methods in this area, however, have focused solely on the color information delivered by the camera. In this work, it is proposed to apply the same principle, i.e., combining geometric and appearance cues at a low level of processing, to this task as well. To that end, a novel CNN architecture is proposed which brings both inputs together within the network. This turns out to be superior to the standard approaches, which either concatenate the inputs directly or combine the results of two separate networks at the very end.
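The contrast between these fusion strategies can be sketched in a few lines of PyTorch. The hypothetical network below, with assumed layer sizes that are not the thesis's architecture, merges the feature maps of two modality-specific encoders inside the network, rather than concatenating the raw inputs (early fusion) or combining two final predictions (late fusion).

```python
import torch
import torch.nn as nn

class MidLevelFusionNet(nn.Module):
    """Sketch of fusing appearance (RGB) and geometry (depth) inside
    the network. All layer choices are illustrative assumptions."""

    def __init__(self, num_classes=19):
        super().__init__()
        # Separate shallow encoders, one per modality.
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # Mid-level fusion: concatenate feature maps, not raw inputs.
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_classes, 1))

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        return self.fuse(f)  # per-pixel class scores
```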
In addition to the semantic segmentation input, the Stixel representation builds on dense depth maps, which we obtain using stereo vision, naturally requiring a pair of cameras. As an alternative for depth estimation, this dissertation proposes to combine a single camera with a LIDAR sensor, which delivers sparse but extremely accurate depth estimates. In order to generate dense depth maps from the sparse measurements, a depth completion method is introduced which uses the color image as guidance to distribute the depth information throughout the image. This task is addressed by defining an energy function which considers all measurements within the image in order to infer planar elements, and which incorporates smoothness assumptions in a local neighborhood of every pixel. This neighborhood turns out to be a crucial part of the optimization, which is why a novel semantics- and edge-aware geodesic distance is introduced that leverages the available appearance information and is thus able to respect object borders. In doing so, the proposed method is able to reconstruct scenes accurately, including even fine structures such as poles or side mirrors with no or only a few associated LIDAR measurements. Furthermore, an efficient inference algorithm is introduced which allows for real-time depth estimation even on high-resolution images. Thorough experiments with all proposed methods on real-world benchmarks demonstrate the benefits of the novel approaches and examine their most important design choices, parameters, and properties.
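As an illustration of the guidance idea, the following Python sketch computes an image-guided geodesic distance via Dijkstra's algorithm on the pixel grid. The edge-cost weighting and 4-connectivity are assumptions for exposition, not the thesis's exact formulation, which is additionally semantics-aware.

```python
import heapq
import numpy as np

def geodesic_distances(image, seed, beta=10.0):
    """Dijkstra over the 4-connected pixel grid, where the cost of a
    step grows with the color change along it. Distances therefore
    stay small inside homogeneous regions and grow sharply across
    object borders. The weighting 'beta' is an assumed parameter."""
    img = image.astype(np.float32)
    h, w = img.shape[:2]
    dist = np.full((h, w), np.inf)
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[y, x]:
            continue  # stale queue entry
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # spatial step plus appearance change along the step
                step = 1.0 + beta * np.linalg.norm(img[ny, nx] - img[y, x])
                if d + step < dist[ny, nx]:
                    dist[ny, nx] = d + step
                    heapq.heappush(heap, (d + step, (ny, nx)))
    return dist
```

A sparse LIDAR depth value could then be propagated preferentially to pixels at small geodesic distance from its projection, which keeps thin structures such as poles from being smeared into the background.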
Finally, this thesis demonstrates the usage of the proposed Stixel representation in DNN-based processing at the fusion level. To that end, Stixels are treated as a generalized LIDAR point cloud, which enables applying an adapted variant of PointNet (C. R. Qi, Su, Mo, et al. 2017) to the data in order to segment object instances. A method is developed to create Stixel-level ground truth from pixel-level instance segmentation annotations, which is used to train the proposed model. Despite the intended compression and quantization effects of the Stixels, the experiments reveal promising results, opening the door for future research on Stixel-level fusion.
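As a rough illustration of this fusion-level usage, the sketch below packs Stixels (using the illustrative fields from the earlier sketch) into a generalized point cloud array that a PointNet-style network could consume; the feature layout is an assumption, and the adapted PointNet itself is only referenced, not reimplemented here.

```python
import numpy as np

def stixels_to_point_cloud(stixels):
    """Pack each Stixel as one 'point' carrying geometric and semantic
    features, analogous to a LIDAR point with extra channels. The
    exact feature layout is an illustrative assumption."""
    return np.array([
        [s.column, s.v_top, s.v_bottom, s.depth, s.semantic_class]
        for s in stixels
    ], dtype=np.float32)  # shape: (num_stixels, 5)

# A PointNet-style network (C. R. Qi, Su, Mo, et al. 2017) could then
# consume this (N, 5) array to predict an instance label per Stixel,
# just as it would per LIDAR point.
```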
Permanent link: https://doi.org/10.3929/ethz-b-000503399
Publication status: published
External links: Search print copy at ETH Library
Contributors
Examiner: Pollefeys, Marc
Examiner: Franke, Uwe
Examiner: Van Gool, Luc
Examiner: Yu, Fisher
Publisher: ETH Zurich
Subject: Computer vision + scene understanding (artificial intelligence); Autonomous Driving Assistance Systems; stereo vision; MACHINE LEARNING (ARTIFICIAL INTELLIGENCE)
Organisational unit: 03766 - Pollefeys, Marc / Pollefeys, Marc