One-shot learning for long-tail visual relation detection

Wang, W; Wang, M; Wang, S; Long, G; Yao, L; Qi, G; Chen, Y

Publication Type: Conference Proceeding
Citation: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, pp. 12225-12232
Issue Date: 2020-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, W
dc.contributor.author	Wang, M
dc.contributor.author	Wang, S
dc.contributor.author	Long, G https://orcid.org/0000-0003-3740-9515
dc.contributor.author	Yao, L
dc.contributor.author	Qi, G
dc.contributor.author	Chen, Y
dc.date.accessioned	2021-06-20T19:30:25Z
dc.date.available	2021-06-20T19:30:25Z
dc.date.issued	2020-01-01
dc.identifier.citation	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, pp. 12225-12232
dc.identifier.isbn	9781577358350
dc.identifier.uri	http://hdl.handle.net/10453/149646
dc.description.abstract	The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene, and how they relate to each other, in <object-predicate-object> form; for example, <person-lean on-wall>. This ability is vital for image captioning, visual question answering, and many other applications. However, visual relationships have long-tailed distributions and, thus, the limited availability of training samples is hampering the practicability of conventional detection approaches. With this in mind, we designed a novel model for visual relation detection that works in one-shot settings. The embeddings of objects and predicates are extracted through a network that includes a feature-level attention mechanism. Attention alleviates some of the problems with feature sparsity, and the resulting representations capture more discriminative latent features. The core of our model is a dual graph neural network that passes and aggregates the context information of predicates and objects in an episodic training scheme to improve recognition of the one-shot predicates and then generate the triplets. To the best of our knowledge, we are the first to center on the viability of one-shot learning for visual relation detection. Extensive experiments on two newly-constructed datasets show that our model significantly improved the performance of two tasks PredCls and SGCls from 2.8% to 12.2% compared with state-of-the-art baselines.
dc.language	en
dc.relation.ispartof	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
dc.rights	info:eu-repo/semantics/openAccess
dc.title	One-shot learning for long-tail visual relation detection
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2021-06-20T19:30:22Z
pubs.publication-status	Published

Abstract:

The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene, and how they relate to each other, in <object-predicate-object> form; for example, <person-lean on-wall>. This ability is vital for image captioning, visual question answering, and many other applications. However, visual relationships have long-tailed distributions and, thus, the limited availability of training samples is hampering the practicability of conventional detection approaches. With this in mind, we designed a novel model for visual relation detection that works in one-shot settings. The embeddings of objects and predicates are extracted through a network that includes a feature-level attention mechanism. Attention alleviates some of the problems with feature sparsity, and the resulting representations capture more discriminative latent features. The core of our model is a dual graph neural network that passes and aggregates the context information of predicates and objects in an episodic training scheme to improve recognition of the one-shot predicates and then generate the triplets. To the best of our knowledge, we are the first to center on the viability of one-shot learning for visual relation detection. Extensive experiments on two newly constructed datasets show that our model significantly improved the performance of two tasks, PredCls and SGCls, from 2.8% to 12.2% compared with state-of-the-art baselines.
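
Illustrative note: the abstract refers to an episodic training scheme in a one-shot setting. The sketch below shows one common way such an episode could be sampled, with exactly one labelled support example per predicate class plus a few query examples. It is not taken from the paper: the function name `sample_one_shot_episode`, the parameters `n_way` and `n_query`, and the toy data are all assumptions made for illustration, and the authors' actual sampling procedure may differ.

```python
import random
from collections import defaultdict


def sample_one_shot_episode(samples, n_way, n_query, rng):
    """Sample one N-way, one-shot episode (hypothetical helper, not the paper's code).

    samples: list of (item, predicate_label) pairs.
    Returns (support, query), each a list of (item, predicate_label) pairs.
    """
    # Group items by their predicate label.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    # A class is usable only if it has 1 support item plus n_query query items.
    eligible = sorted(
        label for label, items in by_label.items() if len(items) >= 1 + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough predicate classes with sufficient examples")

    support, query = [], []
    for label in rng.sample(eligible, n_way):
        picked = rng.sample(by_label[label], 1 + n_query)
        # One-shot: exactly one labelled example per class goes to the support set.
        support.append((picked[0], label))
        # The remaining examples are held out as queries for this episode.
        query.extend((item, label) for item in picked[1:])
    return support, query


# Toy data (assumed): four predicate classes with four examples each.
toy = [
    (f"img{i}_{pred}", pred)
    for pred in ("lean on", "ride", "hold", "wear")
    for i in range(4)
]
support, query = sample_one_shot_episode(
    toy, n_way=3, n_query=2, rng=random.Random(0)
)
```

In a setup like this, a model would be trained over many such episodes, classifying each query predicate using only the single support example available for its class.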

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/149646