Identifying First-person Camera Wearers in Third-person Videos

Chenyou Fan, David Crandall and Michael S. Ryoo

 

Abstract

We consider scenarios in which we wish to perform joint scene
understanding, object tracking, activity recognition, and other tasks
when multiple people are wearing body-worn cameras while a
third-person static camera also captures the scene.  To do
this, we need to establish person-level
correspondences across first- and third-person videos, which is challenging because
the camera wearer is not visible from his/her own egocentric video,
preventing the use of direct feature matching. In this paper, we
propose a new semi-Siamese Convolutional Neural Network architecture
to address this novel challenge. We formulate the problem as learning
a joint embedding space for first- and third-person videos that considers both
spatial- and motion-domain cues.  A new triplet loss function is
designed to minimize the distance between correct first- and third-person matches
while maximizing the distance between incorrect ones. This end-to-end
approach performs significantly better than several baselines, in part
because it learns first- and third-person features optimized for
matching jointly with the distance measure itself.
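
To make the matching objective concrete, below is a minimal NumPy sketch of a hinge-style triplet loss over embedding vectors. The function name, the margin value, and the choice of squared Euclidean distance are illustrative assumptions, not necessarily the exact formulation used in the paper; see the Caffe implementation linked below for the precise details.

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # `anchor`: embedding of a first-person video clip.
        # `positive`: embedding of the correctly matched third-person track.
        # `negative`: embedding of an incorrect third-person track.
        # The loss is zero once the correct match is closer than the
        # incorrect one by at least `margin` (margin value is illustrative).
        d_pos = np.sum((anchor - positive) ** 2)  # distance to correct match
        d_neg = np.sum((anchor - negative) ** 2)  # distance to incorrect match
        return max(0.0, d_pos - d_neg + margin)

    # Toy usage with random 128-dimensional embeddings:
    rng = np.random.default_rng(0)
    a, p, n = (rng.normal(size=128) for _ in range(3))
    print(triplet_loss(a, p, n))

Minimizing this loss over many sampled triplets pulls correct first- and third-person pairs together in the joint embedding space while pushing incorrect pairs apart.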

 

Downloads

  • GitHub Code Repository: the Caffe implementation of our approach.
  • 1st-3rd dataset: the images used for first- and third-person matching.

Acknowledgements

National Science Foundation, Google, NVIDIA, IU Pervasive Technology Institute, IU Vice Provost for Research