We consider scenarios in which we wish to perform joint scene
understanding, object tracking, activity recognition, and other tasks
when multiple people wear body-worn cameras while a static
third-person camera also captures the scene. To do
this, we need to establish person-level
correspondences across first- and third-person videos, which is challenging because
the camera wearer is not visible in his or her own egocentric video,
preventing the use of direct feature matching. In this paper, we
propose a new semi-Siamese Convolutional Neural Network architecture
to address this novel challenge. We formulate the problem as learning
a joint embedding space for first- and third-person videos that considers both
spatial- and motion-domain cues. A new triplet loss function is
designed to minimize the distance between correct first- and third-person matches
while maximizing the distance between incorrect ones. This end-to-end
approach performs significantly better than several baselines, in part
because it learns the first- and third-person features jointly with
the distance measure itself, optimizing both for the matching task.
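
To make the training objective concrete, below is a minimal NumPy sketch of a standard triplet hinge loss of the kind described above. The 128-dimensional embeddings, the margin of 0.5, and the function names are illustrative assumptions rather than the paper's exact formulation; the released implementation is in Caffe, and the embeddings here stand in for the outputs of the semi-Siamese first- and third-person branches.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project embeddings onto the unit sphere so distances are comparable."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss (illustrative sketch).

    anchor:   embedding of a first-person (egocentric) video clip
    positive: embedding of the matching third-person track
    negative: embedding of a non-matching third-person track
    The loss pushes d(anchor, positive) below d(anchor, negative)
    by at least `margin`.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)  # squared distance to correct match
    d_neg = np.sum((a - n) ** 2, axis=-1)  # squared distance to incorrect match
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))

# Toy usage with random 128-D embeddings for a batch of 4 triplets.
rng = np.random.default_rng(0)
a, p, n = (rng.standard_normal((4, 128)) for _ in range(3))
print(triplet_loss(a, p, n))
```

Note that the anchor comes from one domain (first-person) while both the positive and the negative come from the other (third-person); this cross-domain structure is why a shared embedding space, rather than direct feature matching, is needed.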
- GitHub Code Repository: This repository contains the Caffe implementation.
- 1st-3rd dataset: This dataset contains the images used for first- to third-person matching.
Sponsors: Nvidia, IU Pervasive Technology Institute, IU Vice Provost for Research.