The empirical moment matrix and its application in computer vision

Title:
The empirical moment matrix and its application in computer vision
Creator:
Gou, Mengran (Author)
Contributor:
Camps, Octavia I. (Advisor)
Dy, Jennifer G. (Committee member)
Radke, Richard J. (Committee member)
Sznaier, Mario (Committee member)
Language:
English
Publisher:
Boston, Massachusetts : Northeastern University, May 2018
Date Awarded:
May 2018
Date Accepted:
April 2018
Type of resource:
Text
Genre:
Dissertations
Format:
electronic
Digital origin:
born digital
Abstract/Description:
Embedding local properties of an image, for instance its color intensities or the magnitude and orientation of its gradients, to create a representative feature is a critical component in many computer vision tasks, such as detection, classification, segmentation and tracking. A feature that is representative yet invariant to nuisance factors supports the subsequent modules in the processing pipeline and leads to better performance on the task at hand. Statistical moments have often been used to build such descriptors since they provide a quantitative measure of the shape of the underlying distribution of the data. Examples include the covariance matrix feature, bilinear pooling encoding and Gaussian descriptors. However, until now, these features have been limited to moments of up to second order, i.e. the mean and variance of the data, and hence can be poor descriptors when the underlying distribution is non-Gaussian. This dissertation examines this problem in depth and identifies possible solutions. In particular, we propose feature descriptors based on the empirical moment matrix, which gathers high-order moments and embeds them into the manifold of symmetric positive definite (SPD) matrices. The effectiveness of the proposed approach is illustrated in the context of two computer vision problems: person re-identification (re-ID) and fine-grained classification.
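
As a rough illustration of the idea in the paragraph above, the following Python sketch builds an empirical moment matrix from a set of feature vectors. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names `monomial_exponents` and `empirical_moment_matrix` are hypothetical, and the matrix is assumed to be the sample average of v(x) v(x)^T, where v(x) stacks all monomials of x up to a chosen order. The result is symmetric positive semidefinite by construction and is typically positive definite when enough generic samples are available.

```python
import itertools
import numpy as np

def monomial_exponents(dim, order):
    """Exponent tuples of all monomials in `dim` variables with total degree <= order."""
    return [e for e in itertools.product(range(order + 1), repeat=dim)
            if sum(e) <= order]

def empirical_moment_matrix(X, order=2):
    """Average of v(x) v(x)^T over the rows x of X (hypothetical helper).

    X has shape (n_samples, dim); v(x) stacks every monomial of x with
    total degree <= order, so the matrix holds moments up to degree 2 * order.
    """
    X = np.asarray(X, dtype=float)
    exps = np.array(monomial_exponents(X.shape[1], order))   # (m, dim)
    # V[i, j] = prod_k X[i, k] ** exps[j, k], i.e. the j-th monomial of sample i
    V = np.prod(X[:, None, :] ** exps[None, :, :], axis=2)   # (n_samples, m)
    return V.T @ V / X.shape[0]

# Example: 200 three-dimensional local features (e.g. per-pixel color values)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
M = empirical_moment_matrix(X, order=2)
```

With `order=1` the matrix contains only the mean and the second-order moments. Larger orders add the higher-order moments that the abstract argues are needed for non-Gaussian data, at the cost of a matrix whose side length grows combinatorially with the feature dimension.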

Person re-ID is the problem of matching images of a pedestrian across cameras with no overlapping fields of view. It is one of the key tasks in surveillance video processing. Yet, due to the extremely large intra-class variances across different cameras (e.g., poses, illumination, viewpoints), the performance of state-of-the-art person re-ID algorithms is still far from ideal. In this thesis, we propose a novel descriptor, based on the on-manifold mean of a moment matrix (moM) and horizontal mean pooling, which can be used to approximate complex, non-Gaussian distributions of the pixel features within a mid-sized local patch. To narrow the gap between academic research and real-world applications, two large-scale public re-ID datasets are introduced and a systematic benchmark evaluation is established on both new datasets. Extensive experiments on five widely used public re-ID datasets and two newly collected datasets demonstrate that incorporating the proposed moM feature improves re-ID performance.

Unlike general object recognition tasks, fine-grained classification usually tries to distinguish objects at the sub-category level, such as different makes of cars or different species of birds. The main challenge of this task is the relatively small inter-class and relatively large intra-class variations. The most successful approaches to this problem use deep convolutional neural networks (CNNs), where the top convolutional layers perform a local representation extraction step and the bottom fully connected layers perform an encoding step. In the case of fine-grained classification, bilinear pooling and Gaussian embedding have been shown to be the best encoding options, but at the price of an enormous feature dimensionality. Approximate compact pooling methods have been explored to address this weakness. Additionally, recent results have shown that significant performance gains can be achieved by using matrix normalization to regularize the unstable higher-order information. However, combining compact pooling with matrix normalization has not been explored until now. In this thesis, we unify the bilinear pooling layer and the global Gaussian embedding layer through the empirical moment matrix in a novel deep architecture, the moment embedding network (MoNet). In addition, we propose a novel sub-matrix square-root layer, which can be used to normalize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pooling methods. Our experiments on three widely used fine-grained classification datasets illustrate that the proposed MoNet architecture can achieve similar or better performance than state-of-the-art architectures. Furthermore, when combined with compact pooling techniques, it obtains comparable performance using encoded features with only 4% of the dimensions.
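
The relationship described above between bilinear pooling, Gaussian embedding and the moment matrix can be sketched numerically. The Python example below is illustrative only and makes several assumptions: it uses the first-order case, where each feature x is mapped to [1, x], it treats bilinear pooling as the averaged outer product of the features, and it applies a generic eigendecomposition-based matrix square root as the normalization. The helper names `first_order_moment_matrix` and `spd_sqrt` are hypothetical, and the code does not reproduce the sub-matrix square-root layer or the MoNet architecture.

```python
import numpy as np

def first_order_moment_matrix(X):
    """Moment matrix of the homogeneous features [1, x], averaged over rows of X."""
    X = np.asarray(X, dtype=float)
    V = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a constant 1
    return V.T @ V / X.shape[0]

def spd_sqrt(M):
    """Square root of a symmetric positive semidefinite matrix via eigendecomposition."""
    w, U = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)                       # guard against tiny negative values
    return (U * np.sqrt(w)) @ U.T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # stand-in for local CNN features
M = first_order_moment_matrix(X)

mu = X.mean(axis=0)                                 # first-order moments
B = X.T @ X / X.shape[0]                            # averaged bilinear pooling
Sigma = B - np.outer(mu, mu)                        # covariance of a Gaussian fit

S = spd_sqrt(M)                                     # normalized embedding
```

In this sketch the top-left entry of `M` is 1, the remainder of its first column is `mu`, and its lower-right block is `B`. The single matrix therefore carries both the bilinear pooling term and the mean and covariance used by a Gaussian embedding.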
Subjects and keywords:
computer vision
empirical moment
feature encoding
feature extraction
fine-grained classification
person re-identification
DOI:
https://doi.org/10.17760/D20291232
Permanent Link:
http://hdl.handle.net/2047/D20291232
Use and reproduction:
In Copyright: This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the right-holder(s). (http://rightsstatements.org/vocab/InC/1.0/)
Copyright restrictions may apply.
