Recognition of human interactions with vehicles using 3-D models and dynamic context

Lee, Jong Taek, 1983-

Recognition of human interactions with vehicles using 3-D models and dynamic context

Access full-text files

LEE-DISSERTATION.pdf (26.85 MB)

Date

2012-05

Authors

Lee, Jong Taek, 1983-

Abstract

This dissertation describes two distinctive methods for human-vehicle interaction recognition: one for ground level videos and the other for aerial videos. For ground level videos, this dissertation presents a novel methodology which is able to estimate a detailed status of a scene involving multiple humans and vehicles. The system tracks their configuration even when they are performing complex interactions with severe occlusion such as when four persons are exiting a car together. The motivation is to identify the 3-D states of vehicles (e.g. status of doors), their relations with persons, which is necessary to analyze complex human-vehicle interactions (e.g. breaking into or stealing a vehicle), and the motion of humans and car doors to detect atomic human-vehicle interactions. A probabilistic algorithm has been designed to track humans and analyze their dynamic relationships with vehicles using a dynamic context. We have focused on two ideas. One is that many simple events can be detected based on a low-level analysis, and these detected events must contextually meet with human/vehicle status tracking results. The other is that the motion clue interferes with states in the current and future frames, and analyzing the motion is critical to detect such simple events. Our approach updates the probability of a person (or a vehicle) having a particular state based on these basic observed events. The probabilistic inference is made for the tracking process to match event-based evidence and motion-based evidence. For aerial videos, the object resolution is low, the visual cues are vague, and the detection and tracking of objects is less reliable as a consequence. Any method that requires accurate tracking of objects or the exact matching of event definition are better avoided. To address these issues, we present a temporal logic based approach which does not require training from event examples. At the low-level, we employ dynamic programming to perform fast model fitting between the tracked vehicle and the rendered 3-D vehicle models. At the semantic-level, given the localized event region of interest (ROI), we verify the time series of human-vehicle relationships with the pre-specified event definitions in a piecewise fashion. With special interest in recognizing a person getting into and out of a vehicle, we have tested our method on a subset of the VIRAT Aerial Video dataset and achieved superior results.