ADVANCED METHODOLOGIES FOR VISUAL OBJECT TRACKING

E. DI NARDO
2022

Abstract

Visual object tracking is a continuously developing and highly competitive field in computer vision and machine learning. It is based on the idea of following an arbitrary object, unknown until it is first designated, throughout its movements without losing it. Keeping a specific, otherwise hard-to-follow object under observation is of fundamental importance in very sensitive contexts such as security, video surveillance in particular, and autonomously guided or unmanned aerial vehicles (UAVs), as well as in contexts where automation is a great help, such as video production. As in many other fields, deep learning methodologies have become part of visual tracking, bringing with them the state of the art of many related tasks, including object detection, super-resolution, generative adversarial networks, and image classification. Most of these methodologies cooperate within a single architecture dedicated to tracking, to the point that the computational resources required become very demanding, since multiple deep models must coexist and collaborate. Over the years, visual object tracking has evolved from correlation models based on frequency-domain transformations to global, highly efficient correlation operators compatible with gradient-descent learning. It evolved again with models that correlate elements with similar features, extracted by a module able to recognize object characteristics because it was pre-trained for image classification on large datasets. More recently, we have witnessed the advent of Transformer models and their explosion in NLP and computer vision, and these models were rapidly incorporated into visual tracking methodologies as well. This thesis aims to investigate methodologies that can enrich the current state of the art by building and exploring architectures not yet described in the literature, as well as by improving those already developed. We started our research with the idea of exploiting the reconstruction power of Generative Adversarial Networks, recasting the tracking problem from the point of view of subject segmentation. Going deeper into the topic, the work continued with models based on Siamese networks, which correlate the element to be tracked with the search region itself. However, since these models have already been widely studied, we moved on to more complex networks: remaining in the context of Siamese architectures, we progressed from conventional convolutional networks to novel techniques in computer vision such as Transformers. Transformers have become the center of interest of a large part of the scientific community in every field, and we therefore applied them in a setting that is new with respect to the state of the art. As a result, we obtained a method that stands comparison with all current techniques, achieving scores that place the implemented methodology in high positions on the benchmark leaderboards of the various datasets and that allowed it to participate in VOT2022, the reference challenge for the visual tracking community and the goal of every tracking algorithm.
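The abstract above refers to Siamese trackers that correlate the tracked element with the search region. The snippet below is a minimal, self-contained PyTorch sketch of that general idea only, not the architecture developed in the thesis: a shared backbone embeds the template and the search region, and the template features are used as a convolution kernel over the search features, so the peak of the resulting response map indicates the object's location. All layer sizes, names, and shapes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MinimalSiameseTracker(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny shared convolutional backbone; in practice this would be a
        # network pre-trained on large-scale image classification.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, template, search):
        z = self.backbone(template)   # (B, C, Hz, Wz) template features
        x = self.backbone(search)     # (B, C, Hx, Wx) search-region features
        # Cross-correlate each template with its own search region by using
        # the template features as per-sample kernels (grouped convolution).
        b, c, hz, wz = z.shape
        x = x.view(1, b * c, *x.shape[-2:])
        response = F.conv2d(x, z.view(b * c, 1, hz, wz), groups=b * c)
        # Sum over channels to obtain a single similarity map per sample.
        response = response.view(b, c, *response.shape[-2:]).sum(dim=1, keepdim=True)
        return response               # (B, 1, Hr, Wr) response map

if __name__ == "__main__":
    tracker = MinimalSiameseTracker()
    template = torch.randn(2, 3, 64, 64)    # exemplar crops of the target
    search = torch.randn(2, 3, 128, 128)    # larger search regions
    print(tracker(template, search).shape)  # peak location ~ object position

Transformer-based trackers replace or augment this fixed correlation step with attention between template and search features, which is the direction the thesis pursues.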
In addition, we investigated what tracking algorithms are actually observing through eXplainable Artificial Intelligence (XAI) methodologies, and how Transformers and the attention mechanism play a very important role in this respect. During the research activity, learning models different from those based on traditional crisp logic were also developed. The aim was to find a common ground between fuzzy logic and the flexibility of deep learning, and to study how this combination can be used to explain the relationships in the data as the complexity of the neural network increases. Fuzzy logic was applied to Transformers to build a Fuzzy Transformer, in which the attention component is more easily explained, given the success of fuzzy models in the field of explainability. The XAI work allowed us to verify that our idea of using the internal components of the attention mechanism produces a direct link between attention and the elements the tracker actually acts on in order to produce the desired output. Moreover, the experiments conducted on the fuzzy components in this first phase seem to validate our idea that a specific, highly interpretable component can not only produce results comparable to the state of the art but also make those results easier to understand.
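As a loose illustration of combining fuzzy reasoning with attention (and explicitly not the thesis's Fuzzy Transformer), the sketch below derives attention weights from Gaussian fuzzy membership degrees of query-key distances instead of raw dot-product softmax scores. The membership map lies in [0, 1] and can be read directly as a graded similarity, which is one way such a component can be easier to interpret. Function names, the choice of Gaussian memberships, and the sigma parameter are all assumptions made for this example.

import torch


def fuzzy_membership_attention(q, k, v, sigma=1.0, eps=1e-8):
    """q, k, v: (B, N, D). Returns attended values and the membership map."""
    # Pairwise squared Euclidean distances between queries and keys: (B, N, N).
    dist2 = torch.cdist(q, k, p=2) ** 2
    # Gaussian membership degree in [0, 1]; sigma controls the fuzziness.
    membership = torch.exp(-dist2 / (2 * sigma ** 2))
    # Normalize memberships so each query's weights sum to 1.
    weights = membership / (membership.sum(dim=-1, keepdim=True) + eps)
    return weights @ v, membership


if __name__ == "__main__":
    q = k = v = torch.randn(1, 5, 16)
    out, mu = fuzzy_membership_attention(q, k, v)
    print(out.shape, mu.shape)  # (1, 5, 16) (1, 5, 5)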
18 Jul 2022
Settore INF/01 - Informatica (Computer Science)
deep learning; xai; transformers; visual object tracking; fuzzy
Tutor: BOLDI, PAOLO
Coordinatore: BOLDI, PAOLO
Doctoral Thesis
ADVANCED METHODOLOGIES FOR VISUAL OBJECT TRACKING / E. Di Nardo ; tutor: P. Boldi ; co-tutor: A. Ciaramella ; coordinatore: P. Boldi. Dipartimento di Informatica Giovanni Degli Antoni, 2022 Jul 18. 34. ciclo, Anno Accademico 2021.
Files in this record:
phd_unimi_R12445.pdf (Publisher's version/PDF, Adobe PDF, 17.57 MB, Open Access from 02/09/2022)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/931766