Cross-task weakly supervised learning from instructional videos

2019-01-01
Zhukov, Dimitri
Alayrac, Jean-Baptiste
Cinbiş, Ramazan Gökberk
Fouhey, David
Laptev, Ivan
Sivic, Josef
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision, in the form of instructional narrations and an ordered list of steps, instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: "pour egg" should be trained jointly with other tasks involving "pour" and "egg". We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from the narration and the list of steps. Existing datasets do not permit a systematic study of sharing, so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level, and that our component model can parse previously unseen tasks by virtue of its compositionality.
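The compositional idea in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that a step's score on a video frame is the sum of linear scores of its components (the words in the step name); the abstract does not specify the exact form of the component model, and the names `step_score`, `component_weights`, and the toy weights and feature vector are all hypothetical.

```python
# Illustrative sketch only: an additive linear component model is an
# assumption here, not something stated in the abstract.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def step_score(step, component_weights, feature):
    """Score a frame feature for a step by summing its component scores.

    A step such as "pour egg" is split into the components "pour" and
    "egg"; each component has its own weight vector, which is shared by
    every step that contains that component.
    """
    return sum(dot(component_weights[c], feature) for c in step.split())

# Hypothetical toy weights, one vector per component.
component_weights = {
    "pour": [1.0, 0.0],
    "egg": [0.0, 2.0],
    "milk": [0.5, 0.5],
}

# Hypothetical two-dimensional frame feature.
frame_feature = [3.0, 4.0]

# A step assumed to be seen during training.
seen = step_score("pour egg", component_weights, frame_feature)

# A step never seen as a whole can still be scored, because both of
# its components already have weights.
unseen = step_score("pour milk", component_weights, frame_feature)
```

Because "pour" has a single weight vector, any update to it during training would affect every step that contains "pour", which is the kind of sharing across tasks the abstract describes.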

Citation Formats
D. Zhukov, J.-B. Alayrac, R. G. Cinbiş, D. Fouhey, I. Laptev, and J. Sivic, “Cross-task weakly supervised learning from instructional videos,” 2019, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/36538.