Towards Semantic Scene Understanding for Robotic Applications in Unstructured Environments
Open access
Author
Date
2023
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Robotics technology has lately been experiencing a continuous surge in adoption across various sectors of modern society, offering cost-efficient, precise and often safer alternatives to human labor in a broad spectrum of tasks, ranging from inspection and exploration to manufacturing and logistics. Despite the advances in the field, ensuring safe and reliable operation in unstructured real-world environments remains a significant challenge, as their complexity far exceeds that of controlled, factory- or laboratory-like settings. To autonomously navigate and plan interactions in such scenarios, robots must develop an understanding of their surroundings that extends beyond the basic distinction between free and occupied space still employed on board most robotic systems today. Ideally, they should be capable of building an internal representation of the scene that not only reconstructs the surface geometry with high fidelity, but also provides awareness of semantically significant entities, such as objects, allowing for more sophisticated reasoning and decision-making.
While significant progress has been made in metric-semantic robotic perception, the current state of the art mostly relies on RGB-D sensing and generally lacks solutions that can realistically be deployed for robot-scene interaction outside traditional indoor environments containing man-made objects. This thesis addresses the need for more adaptable and flexible methods, exploring their extension to different sensing modalities, such as the combination of vision with LiDAR, and enabling their applicability to particularly challenging environments or conditions for which no specific training data is directly available.
Taking construction sites as a prime example of highly unstructured and dynamic scenarios where robotic autonomy can have a drastic impact, a complete perception stack is initially developed for a large-scale mobile manipulator tasked with building structures from raw materials found on-site, such as natural stones. The system uses the robot's onboard LiDAR scanners to incrementally build a map of its surroundings, which is automatically segmented by geometric methods into object-like instances, ultimately enabling their manipulation. While the object-oriented map representation proves especially suitable for managing scene changes caused by the manipulation of objects, thereby keeping the map consistently aligned with the state of the world at any given time, the purely geometric segmentation approach limits the applicability of the system to settings where the objects of interest are scattered on roughly flat ground.
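The exact pipeline is detailed in the thesis's constituent publications rather than on this page; purely as an illustration of the geometric segmentation idea summarized above (removing the dominant ground plane, then clustering the remaining points into object-like instances), the following is a minimal sketch using Open3D. Function names and thresholds are illustrative assumptions, not the system's actual parameters.

import numpy as np
import open3d as o3d

def segment_object_instances(cloud: o3d.geometry.PointCloud):
    # Fit the dominant ground plane with RANSAC and discard its inliers,
    # mirroring the assumption of objects scattered on roughly flat ground.
    _, ground_idx = cloud.segment_plane(distance_threshold=0.03,
                                        ransac_n=3,
                                        num_iterations=1000)
    remaining = cloud.select_by_index(ground_idx, invert=True)
    # Group what is left into object-like instances via density-based
    # (Euclidean) clustering; eps and min_points are illustrative values.
    labels = np.asarray(remaining.cluster_dbscan(eps=0.05, min_points=30))
    if labels.size == 0:
        return []
    return [remaining.select_by_index(np.where(labels == k)[0])
            for k in range(labels.max() + 1)]

Each returned cluster can then serve as a candidate object for manipulation, and individual clusters can be updated or removed from the map as the scene changes.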
Aiming to handle arbitrarily complex settings, the thesis then explores ways to incorporate semantically meaningful segmentation masks, extracted through vision-based deep learning techniques, into the reconstructed 3D maps. This results in an efficient strategy to propagate labels from multiple localized images to a 3D representation of the scene, where local geometric context is leveraged to account for multi-view prediction inconsistencies and to refine 3D segmentation boundaries. The approach is first formulated for the offline setting with a fixed number of semantic categories, and then extended to work within an online mapping framework, enabling instance-level semantic scene segmentation. Experimental evaluations on various public datasets, as well as real-world tests on a construction site, demonstrate the efficacy of the proposed methods across different types of scenes and sensor modalities (i.e. RGB-D and RGB-LiDAR).
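As a rough illustration of the multi-view label propagation described above, the sketch below projects the points of a 3D map into a set of localized, semantically segmented images and fuses the per-pixel predictions by majority vote. The geometric regularization and boundary refinement the thesis adds on top, as well as any occlusion handling, are deliberately omitted, and all names and interfaces are assumptions for illustration.

import numpy as np

def fuse_semantic_labels(points, frames, num_classes):
    # points: (N, 3) world-frame map points.
    # frames: iterable of (label_img, K, T_cw), with label_img an (H, W)
    #         integer segmentation mask, K the 3x3 pinhole intrinsics and
    #         T_cw the 4x4 world-to-camera transform.
    votes = np.zeros((points.shape[0], num_classes), dtype=np.int64)
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    for label_img, K, T_cw in frames:
        cam = (T_cw @ pts_h.T).T[:, :3]
        front = np.where(cam[:, 2] > 0.1)[0]   # only points ahead of the camera
        uv = (K @ cam[front].T).T
        u = (uv[:, 0] / uv[:, 2]).astype(int)  # perspective projection
        v = (uv[:, 1] / uv[:, 2]).astype(int)
        h, w = label_img.shape
        ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        idx = front[ok]
        # One vote per observation for the class predicted at that pixel.
        votes[idx, label_img[v[ok], u[ok]]] += 1
    return votes.argmax(axis=1)                # majority label per map point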
With the aforementioned approaches relying heavily on learning-based models to extract semantic information from images, the final part of this thesis addresses learning from unlabeled data as a means of boosting their robustness and adaptability in environments where annotated data may be prohibitively expensive to collect. In particular, an Unsupervised Domain Adaptation (UDA) method is proposed that leverages the attention-based architecture of recent Transformer models to enhance the learning of domain-robust features. The approach is shown to achieve state-of-the-art performance on both synthetic-to-real and clear-to-adverse-weather UDA benchmarks from the literature, outperforming methods based on similar principles while being substantially more memory-efficient.
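The thesis's specific attention-based adaptation mechanism is not reproduced here; for context, the sketch below shows the generic self-training recipe that such UDA methods for semantic segmentation commonly build on, namely an EMA teacher generating confidence-thresholded pseudo-labels on the unlabeled target domain. All hyperparameters and the training interface are illustrative assumptions.

import torch
import torch.nn.functional as F

def uda_step(student, teacher, optimizer, src_img, src_lbl, tgt_img,
             conf_thresh=0.968, ema_decay=0.999):
    # Supervised cross-entropy on labeled source-domain images.
    src_loss = F.cross_entropy(student(src_img), src_lbl)
    # Pseudo-labels on the unlabeled target domain from the EMA teacher.
    with torch.no_grad():
        tgt_prob = torch.softmax(teacher(tgt_img), dim=1)
        conf, pseudo = tgt_prob.max(dim=1)
        pseudo[conf < conf_thresh] = 255       # mask low-confidence pixels
    tgt_loss = F.cross_entropy(student(tgt_img), pseudo, ignore_index=255)
    loss = src_loss + tgt_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # EMA update keeps the teacher a slowly-moving ensemble of the student.
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(ema_decay).add_(sp, alpha=1 - ema_decay)
    return loss.item()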
Delving into topics such as metric-semantic mapping and domain-adaptive, image-based semantic scene understanding, the approaches and systems presented in this thesis contribute towards enabling advanced robotic perception in a wide variety of real-world scenarios. Notably, the developed methods prove suitable for demanding settings outside conventional indoor, man-made environments, where not only less predictable object shapes and scene layouts but also variable illumination or weather conditions pose significant challenges. The result is a set of more robust and versatile frameworks, able to succeed where state-of-the-art systems often struggle, paving the way towards a broader adoption of autonomous robots in the wild.
Permanent link
https://doi.org/10.3929/ethz-b-000646462
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Siegwart, Roland
Examiner: Milford, Michael
Examiner: Tombari, Federico
Examiner: Chli, Margarita
Publisher
ETH Zurich
Subject
Semantic Scene Understanding; 3D Reconstruction; Computer Vision; Robotics
Organisational unit
02284 - NFS Digitale Fabrikation / NCCR Digital Fabrication
03737 - Siegwart, Roland Y. / Siegwart, Roland Y.
Funding
-- - NCCR Digital Fabrication (SNF)
Related publications and datasets
Has part: https://doi.org/10.3929/ethz-b-000457938
Has part: https://doi.org/10.3929/ethz-b-000484229
Has part: https://doi.org/10.3929/ethz-b-000527844
Has part: https://doi.org/10.3929/ethz-b-000642496