From Coarse Attention to Fine-Grained Gaze: A Two-stage 3D Fully Convolutional Network for Predicting Eye Gaze in First Person Video

Zehua Zhang, Sven Bambach, David Crandall, Chen Yu


While gaze prediction for viewers of static scenes has been well studied, a much more challenging problem is to predict where people will look within their own fields of view as they go about their everyday lives. This problem is difficult because where a person looks depends not just on the visual surroundings, but also on the task they have in mind, their internal state, their past gaze patterns and actions, and non-visual information (e.g., sounds) that might attract their attention. We use first-person (egocentric) video captured from head-mounted cameras as an approximation of people's field of view, and propose a two-stage 3D fully convolutional network (T3F) to predict eye gaze for each frame of a video sequence. The model predicts a coarse attention region in the first stage, then combines this attention with spatial and temporal features to predict a precise gaze point in the second stage. We evaluate on a public dataset, showing that T3F outperforms baseline methods, as well as on a new, challenging dataset in which parents and toddlers freely interact with toys and with each other without being prompted to perform any specific task.


Figure 1: The architecture of our proposed two-stage 3D FCN model for eye gaze prediction. The numbers below each module indicate its output dimensions, where N is the length of the input video sequence.

Figure 2: Example results of our model. Frames in the first row are from the OST dataset; frames in the remaining rows are from our ATT dataset. The large red cross marks the ground-truth gaze point, the small green cross marks the predicted gaze point, and the blue bounding box marks the predicted coarse attention cell.

 

Here is a short video visualizing the results. The bounding box visualization is omitted here; the large blue cross marks our predicted eye gaze, while the large red cross still marks the ground truth.

The video below shows an interesting sequence in which the ground truth is unreliable due to calibration error of the eye tracker; nevertheless, our model predicts reasonable gaze points.
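
The gaze point and attention cell visualized above can be decoded from the two network outputs described below: the fine-grained heatmap yields a gaze point and the coarse attention map yields a grid cell. The following is a minimal sketch of such post-processing, assuming simple argmax decoding; the authors' exact procedure may differ.

import numpy as np

def decode_predictions(fine_heatmap, coarse_map, frame_h, frame_w):
    """Turn per-frame network outputs into a gaze point and an attention cell box.

    fine_heatmap: (frame_h, frame_w) fine-grained gaze heatmap for one frame.
    coarse_map:   (grid_h, grid_w) coarse attention scores, one per grid cell.
    Returns the gaze point (x, y) and the attention cell box (x0, y0, x1, y1).
    """
    # Predicted gaze point: location of the heatmap maximum.
    gy, gx = np.unravel_index(np.argmax(fine_heatmap), fine_heatmap.shape)

    # Predicted attention cell: highest-scoring coarse cell, mapped to pixels.
    grid_h, grid_w = coarse_map.shape
    cy, cx = np.unravel_index(np.argmax(coarse_map), coarse_map.shape)
    cell_h, cell_w = frame_h / grid_h, frame_w / grid_w
    box = (cx * cell_w, cy * cell_h, (cx + 1) * cell_w, (cy + 1) * cell_h)
    return (int(gx), int(gy)), box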

Detailed architecture of each module of our model:

1. Feature Extractor:

# Keras imports and placeholder definitions needed to make the snippets below
# runnable. The values of tempConvSize, regularizer, and the input size are
# illustrative assumptions, not necessarily the settings used in the paper.
from keras.layers import (Input, Conv3D, Conv3DTranspose, MaxPooling3D,
                          BatchNormalization, Activation, Concatenate)
from keras.regularizers import l2

tempConvSize = 3        # temporal extent of each 3D convolution kernel (assumed)
regularizer = 5e-4      # L2 weight-decay coefficient (assumed)
N, H, W = 16, 224, 224  # sequence length and frame size (illustrative)
currentFrames = Input(shape=(N, H, W, 3), name='current_frames')

# 1st layer group
x = Conv3D(64, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv1')(currentFrames)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2),
                 padding='valid', name='current_pool1')(x)
# 2nd layer group
x = Conv3D(128, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv2')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2),
                 padding='valid', name='current_pool2')(x)
# 3rd layer group
x = Conv3D(256, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv3a')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(256, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv3b')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2),
                 padding='valid', name='current_pool3')(x)
# 4th layer group
x = Conv3D(512, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv4a')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv4b')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2),
                 padding='valid', name='current_pool4')(x)
# 5th layer group
x = Conv3D(512, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv5a')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (tempConvSize, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv5b')(x)
x = BatchNormalization()(x)
exFeat = Activation('relu')(x)
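
All pooling layers above use stride (1, 2, 2), so the extractor preserves the temporal length N while reducing the spatial resolution by a factor of 16. Assuming the illustrative input defined at the top of the first snippet, this can be verified directly:

from keras.models import Model

# With the illustrative 224x224 input, exFeat is N x 14 x 14 x 512: the four
# (1, 2, 2) max-pooling layers shrink height and width by 16x while leaving
# the temporal dimension untouched.
extractor = Model(inputs=currentFrames, outputs=exFeat)
print(extractor.output_shape)  # (None, 16, 14, 14, 512)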

2. Coarse Attention Predictor:

x = Conv3D(512, (3, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv6a')(exFeat)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (1, 1, 1), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv6b')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(1, (1, 1, 1), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv6c')(x)
x = BatchNormalization()(x)
cap = Activation('linear')(x)
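
Because cap is produced at the extractor's reduced resolution, each of its spatial locations scores one coarse grid cell per frame (the blue cell in Figure 2). As an illustration of how a coarse-stage training target could be built from a ground-truth gaze point, one can mark the cell containing the gaze; this is only a sketch, and the paper's exact supervision scheme may differ.

import numpy as np

def gaze_to_coarse_target(gaze_x, gaze_y, frame_w, frame_h, grid_w, grid_h):
    """One-hot coarse grid target for the cell containing the gaze point.

    Illustrative sketch only; the coarse stage may be supervised differently
    (e.g., with a smoothed target instead of a hard one-hot map).
    """
    cx = min(int(gaze_x / frame_w * grid_w), grid_w - 1)
    cy = min(int(gaze_y / frame_h * grid_h), grid_h - 1)
    target = np.zeros((grid_h, grid_w), dtype=np.float32)
    target[cy, cx] = 1.0
    return target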

3. Upsampling and Processing Section:

x = Concatenate()([exFeat, cap])
x = Conv3DTranspose(512, (1, 4, 4), strides=(1, 4, 4), padding='valid', name='current_deconv_1')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (3, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv7a')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (3, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv7b')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3DTranspose(512, (1, 4, 4), strides=(1, 4, 4), padding='valid', name='current_deconv_2')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (3, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv8a')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(512, (3, 3, 3), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv8b')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)

4. Fine-grained Gaze Predictor:

x = Conv3D(512, (1, 1, 1), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv11a')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv3D(1, (1, 1, 1), kernel_regularizer=l2(regularizer),
           padding='same', name='current_conv11b')(x)
x = BatchNormalization()(x)
fgp = Activation('linear')(x)
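
With the four snippets above evaluated in order, the whole network can be wrapped into a single Keras model exposing both outputs. This is only a sketch of the assembly (losses, optimizer, and training details are omitted); note that the two (1, 4, 4) transposed convolutions undo the extractor's 16x spatial pooling, so fgp has the same spatial resolution as the input frames.

from keras.models import Model

# Two-stage model: coarse attention map (cap) and fine-grained gaze heatmap (fgp).
t3f = Model(inputs=currentFrames, outputs=[cap, fgp], name='T3F')
t3f.summary()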

Our Adults, Toddlers and Toys (ATT) dataset

Papers and presentations

BibTeX entries:

@inproceedings{attention2019neurips,
    title = {A Self Validation Network for Object-Level Human Attention Estimation},
    author = {Zehua Zhang and Chen Yu and David Crandall},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year = {2019}
}

@inproceedings{gaze2018bmvc,
    title = {From Coarse Attention to Fine-Grained Gaze: A Two-stage 3D Fully Convolutional Network for Predicting Eye Gaze in First Person Video},
    author = {Zehua Zhang and Sven Bambach and David Crandall and Chen Yu},
    booktitle = {British Machine Vision Conference (BMVC)},
    year = {2018}
}

Downloads

  • Poster
  • Slides
  • Full code will be available once we finish cleaning it
  • ATT Dataset: Due to child privacy and human-subjects ethics concerns, we are not able to publish it on the Internet. We are working to make part of it available in the future. If you are interested in conducting research with this dataset, please contact the authors at zehzhang@indiana.edu.

Acknowledgements

This work was supported by the National Science Foundation (CAREER IIS-1253549), the National Institutes of Health (R01 HD074601, R21 EY017843), and the IU Office of the Vice Provost for Research, the College of Arts and Sciences, and the School of Informatics, Computing, and Engineering through the Emerging Areas of Research Project “Learning: Brains, Machines, and Children.” We would like to thank Drew Abney, Esther Chen, Steven Elmlinger, Seth Foster, Laura Sloane, Catalina Suarez, Charlene Tay, and Yayun Zhang for helping with the collection of the first-person toy play dataset, and Shujon Naha and Satoshi Tsutsui for helpful discussions.

The IU Computer Vision Lab's projects and activities have been funded, in part, by grants and contracts from the Air Force Office of Scientific Research (AFOSR), the Defense Threat Reduction Agency (DTRA), Dzyne Technologies, EgoVid, Inc., ETRI, Facebook, Google, Grant Thornton LLP, IARPA, the Indiana Innovation Institute (IN3), the IU Data to Insight Center, the IU Office of the Vice Provost for Research through an Emerging Areas of Research grant, the IU Social Sciences Research Commons, the Lilly Endowment, NASA, National Science Foundation (IIS-1253549, CNS-1834899, CNS-1408730, BCS-1842817, CNS-1744748, IIS-1257141, IIS-1852294), NVidia, ObjectVideo, Office of Naval Research (ONR), Pixm, Inc., and the U.S. Navy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.