Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions

Sven Bambach, Stefan Lee, David Crandall, Chen Yu

Abstract

We present a CNN-based technique for detecting, identifying, and segmenting hands in egocentric video that includes multiple people interacting with each other. To illustrate one specific application, we show that hand segments alone can be used for accurate activity recognition.

Hands appear very often in egocentric video, and their appearance and pose give important cues about what people are doing and what they are paying attention to. But existing work in hand detection has made strong assumptions that work well in only simple scenarios, such as with limited interaction with other people or in lab settings. We develop methods to locate and distinguish between hands in egocentric video using strong appearance models with Convolutional Neural Networks, and introduce a simple candidate region generation approach that outperforms existing techniques at a fraction of the computational cost. We show how these high-quality bounding boxes can be used to create accurate pixelwise hand regions, and as an application, we investigate the extent to which hand segmentation alone can distinguish between different activities.  We evaluate these techniques on a new dataset of 48 first-person videos (along with pixel-level ground truth for over 15,000 hand instances) of people interacting in realistic environments.
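
As one concrete illustration of the last step, the sketch below shows one way to turn a detected hand box into a pixelwise mask by running GrabCut inside the box. This is a minimal Python/OpenCV sketch under assumed inputs (a BGR frame and a box in (x, y, w, h) form), not the exact segmentation procedure from the paper.

import numpy as np
import cv2

def segment_hand(frame_bgr, box, iters=5):
    """Run GrabCut initialized with a detection box; return a binary hand mask."""
    mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), dtype=np.float64)  # internal GrabCut state
    fgd_model = np.zeros((1, 65), dtype=np.float64)
    cv2.grabCut(frame_bgr, mask, box, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    # pixels labeled as (probable) foreground become the hand mask
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))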

Dataset

Visualizations of our dataset and ground truth annotations. Left: Ground truth hand segmentation masks superimposed on sample frames from the dataset, where colors indicate the different hand types. Right: A random subset of cropped hands according to ground truth segmentations (resized to square aspect ratios for ease of visualization).

The EgoHands dataset contains 48 Google Glass videos of complex, first-person interactions between two people. Its main purpose is to enable better, data-driven approaches to understanding hands in first-person computer vision. The dataset offers:

  • high quality, pixel-level segmentations of hands
  • the possibility to semantically distinguish between the observer’s hands and someone else’s hands, as well as left and right hands
  • virtually unconstrained hand poses as actors freely engage in a set of joint activities
  • lots of data, with 15,053 ground-truth labeled hands

We provide the entire EgoHands dataset online at vision.soic.indiana.edu/egohands!
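
If you prefer to work with the annotations in Python rather than MATLAB, the sketch below shows one way to rasterize the ground-truth hand polygons into binary masks. The file name polygons.mat, the four field names, and the 1280x720 frame size are assumptions about the annotation format; please check the dataset documentation for the definitive layout.

import numpy as np
import scipy.io as sio
from skimage.draw import polygon

FRAME_H, FRAME_W = 720, 1280  # frame size of the Google Glass videos (assumed)

def hand_masks(polygons_mat_path, frame_idx):
    """Return one binary mask per hand type for a single annotated frame."""
    ann = sio.loadmat(polygons_mat_path, squeeze_me=True)['polygons']
    frame_ann = ann[frame_idx]
    masks = {}
    for hand_type in ('myleft', 'myright', 'yourleft', 'yourright'):
        verts = np.atleast_2d(frame_ann[hand_type])   # Nx2 polygon vertices (x, y)
        mask = np.zeros((FRAME_H, FRAME_W), dtype=bool)
        if verts.size >= 6:                           # need at least 3 vertices
            rr, cc = polygon(verts[:, 1], verts[:, 0], mask.shape)
            mask[rr, cc] = True
        masks[hand_type] = mask
    return masks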

Caffe Models

New! We provide our trained Caffe models for egocentric hand classification/detection online. Both models take input via Caffe's window data layer. The first network classifies windows as hand vs. background, while the second network classifies windows into background and four semantic hand types (the observer's own left/right hand and the other person's left/right hand). Both networks are trained on the “main split” training data from our dataset.
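
Note that the window data layer is only needed for training; at test time the networks can be run on cropped proposal windows through a standard deploy prototxt. Below is a minimal pycaffe sketch of hand/background classification. The file names, the output blob name 'prob', and the class ordering (background = 0) are assumptions, and mean subtraction is omitted for brevity.

import caffe

caffe.set_mode_cpu()
net = caffe.Net('hand_deploy.prototxt', 'hand_classifier.caffemodel', caffe.TEST)

# standard Caffe preprocessing: HWC -> CHW, RGB -> BGR, [0, 1] -> [0, 255]
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)

img = caffe.io.load_image('frame.jpg')     # RGB float image in [0, 1]
x, y, w, h = 100, 150, 120, 120            # hypothetical proposal window
crop = img[y:y + h, x:x + w]

net.blobs['data'].data[...] = transformer.preprocess('data', crop)
probs = net.forward()['prob'][0]
print('P(hand) = %.3f' % probs[1])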

Window Proposal Code

New! We now provide MATLAB code for the window proposal method discussed in Section 4.1 of the paper. If you also download the dataset from our dataset website, the provided code learns the sampling and skin-color parameters from the training videos and demonstrates how to apply the proposal method to unseen frames from the test videos. Simply download the file below and unzip it into the same directory as the dataset.
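
For readers who want the gist without MATLAB, the sketch below illustrates the general idea in Python: fit a density to the geometry of ground-truth boxes, sample candidate windows from it, and rank them by how much skin-colored area they cover. It is not a port of the provided code; the box parameterization, the skin-probability input, and the sample counts are illustrative assumptions.

import numpy as np
from scipy.stats import gaussian_kde

def fit_box_density(train_boxes):
    """train_boxes: (N, 4) array of [cx, cy, w, h] from training annotations."""
    return gaussian_kde(train_boxes.T)

def propose_windows(box_kde, skin_prob, n_samples=2500, n_keep=500):
    """skin_prob: (H, W) per-pixel skin probability for the current frame."""
    H, W = skin_prob.shape
    samples = box_kde.resample(n_samples).T       # (n_samples, 4) candidate boxes
    integral = skin_prob.cumsum(0).cumsum(1)      # integral image for fast box sums
    boxes = np.zeros((n_samples, 4), dtype=int)
    scores = np.zeros(n_samples)
    for i, (cx, cy, w, h) in enumerate(samples):
        x0, y0 = int(max(cx - w / 2, 0)), int(max(cy - h / 2, 0))
        x1, y1 = int(min(cx + w / 2, W - 1)), int(min(cy + h / 2, H - 1))
        if x1 <= x0 or y1 <= y0:
            continue                              # degenerate sample, score stays 0
        boxes[i] = (x0, y0, x1, y1)
        skin_sum = (integral[y1, x1] - integral[y0, x1]
                    - integral[y1, x0] + integral[y0, x0])
        scores[i] = skin_sum / ((x1 - x0) * (y1 - y0))  # mean skin prob. inside box
    keep = np.argsort(-scores)[:n_keep]           # keep the highest-scoring windows
    return boxes[keep], scores[keep]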

Papers and presentations

BibTeX entries:

@inproceedings{egohands2015iccv,
    title = {Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions},
    author = {Sven Bambach and Stefan Lee and David Crandall and Chen Yu},
    booktitle = {IEEE International Conference on Computer Vision (ICCV)},
    year = {2015}
}

Acknowledgements

This project was supported by the National Science Foundation, the National Institutes of Health, Google, Nvidia, the Lilly Endowment, and the IU Data to Insight Center.
The IU Computer Vision Lab's projects and activities have been funded, in part, by grants and contracts from the Air Force Office of Scientific Research (AFOSR), the Defense Threat Reduction Agency (DTRA), Dzyne Technologies, EgoVid, Inc., ETRI, Facebook, Google, Grant Thornton LLP, IARPA, the Indiana Innovation Institute (IN3), the IU Data to Insight Center, the IU Office of the Vice Provost for Research through an Emerging Areas of Research grant, the IU Social Sciences Research Commons, the Lilly Endowment, NASA, National Science Foundation (IIS-1253549, CNS-1834899, CNS-1408730, BCS-1842817, CNS-1744748, IIS-1257141, IIS-1852294), NVidia, ObjectVideo, Office of Naval Research (ONR), Pixm, Inc., and the U.S. Navy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.