A multi-modal transformer approach for football event classification
Video understanding has been enhanced by the use of multi-modal networks. However, recent multi-modal video analysis models have limited applicability to sports videos due to their specialised nature. This paper proposes a novel attention-based multi-modal neural network for sports event classification, featuring a multi-stage fusion training strategy. The proposed network integrates three modalities - an image sequence modality, an audio modality and a newly proposed sports formation modality - to improve sports video classification performance. Empirical results show that the proposed model outperforms the state-of-the-art transformer-based video method by 4.43% in top-1 accuracy on the SoccerNet-v2 dataset.
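The abstract describes fusing three per-modality embeddings with attention before classification. As a minimal illustrative sketch only (the paper's actual architecture, layer sizes, and multi-stage training strategy are not specified here; all function names, weights, and dimensions below are hypothetical), scaled dot-product attention over one embedding per modality could look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(modalities, w_q, w_k, w_v):
    """Fuse per-modality embeddings with scaled dot-product attention.

    modalities: (num_modalities, d) array, one embedding per modality
    (e.g. image sequence, audio, formation). Weights are hypothetical.
    """
    q = modalities @ w_q
    k = modalities @ w_k
    v = modalities @ w_v
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d))  # cross-modality attention map
    fused = scores @ v                      # (num_modalities, d)
    return fused.mean(axis=0)               # pooled joint representation

rng = np.random.default_rng(0)
d = 16
# Stand-in embeddings for the three modalities named in the abstract.
image_seq, audio, formation = rng.standard_normal((3, d))
tokens = np.stack([image_seq, audio, formation])
w_q, w_k, w_v = rng.standard_normal((3, d, d))
fused = attention_fusion(tokens, w_q, w_k, w_v)
print(fused.shape)  # (16,)
```

The pooled vector would then feed a classification head over event classes; the paper's multi-stage fusion training presumably trains such fusion components in stages, but the sketch above shows only the attention-fusion step.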
Funding
- China Scholarship Council
- Loughborough University
- JADE: Joint Academic Data science Endeavour - 2
- Engineering and Physical Sciences Research Council
History
School
- Science
Department
- Computer Science
Published in
2023 IEEE International Conference on Image Processing (ICIP)
Pages
2220 - 2224
Source
2023 IEEE International Conference on Image Processing (ICIP 2023)
Publisher
IEEE
Version
- AM (Accepted Manuscript)
Rights holder
© IEEE
Publisher statement
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Acceptance date
2023-06-21
Publication date
2023-09-11
Copyright date
2023
ISBN
9781728198354
Publisher version
Language
- en