Deepdiary: Automatically Captioning Lifelogging Image Streams

Chenyou Fan and David Crandall

Lifelogging cameras capture everyday life from a first-person perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy.
We evaluate our techniques with quantitative and qualitative results, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections. An expanded version of this paper is available here.
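The abstract above mentions using temporal consistency to make summaries of image streams more compact and less noisy. As a toy illustration (an assumption for exposition, not the paper's exact algorithm), one simple form of temporal consistency is to merge consecutive photos whose captions agree into a single summary entry:

```python
# Minimal sketch: compact a lifelogging caption stream by collapsing runs of
# identical consecutive captions into (caption, photo count) summary entries.
# The example captions below are illustrative, not from the paper's dataset.
from itertools import groupby

def summarize(captions):
    """Collapse each run of identical consecutive captions into one entry."""
    return [(cap, len(list(run))) for cap, run in groupby(captions)]

stream = ["walking outside", "walking outside", "drinking coffee",
          "drinking coffee", "drinking coffee", "walking outside"]
print(summarize(stream))
# → [('walking outside', 2), ('drinking coffee', 3), ('walking outside', 1)]
```

Six photos become three summary entries, while the order of activities is preserved.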

Figure 1: Sample captions generated by our captioning technique with diversity regularization.


Figure 2: LSTM model for generating captions.
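The LSTM model in Figure 2 generates a caption one word at a time: the decoder state is seeded from an image feature, and at each step the cell consumes the previous word and emits a distribution over the vocabulary. The sketch below is a toy NumPy illustration of this loop, not the paper's trained Caffe model; the vocabulary, layer sizes, random weights, and the simple repeat-penalty used to stand in for diversity-aware decoding are all illustrative assumptions.

```python
import numpy as np

# Toy LSTM caption decoder: seed state from an image feature, then decode
# greedily, penalizing words already used so the sentence stays diverse.
rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "a", "person", "holding", "coffee", "laptop"]
V, H = len(VOCAB), 16

# Randomly initialized weights stand in for a trained model.
Wx = rng.normal(0, 0.1, (4 * H, V))   # input-to-gates
Wh = rng.normal(0, 0.1, (4 * H, H))   # hidden-to-gates
b  = np.zeros(4 * H)
Wo = rng.normal(0, 0.1, (V, H))       # hidden-to-vocab logits

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM cell update from one-hot word x and states (h, c)."""
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def generate(img_feat, max_len=10, diversity_penalty=2.0):
    """Greedy decoding; logits of already-emitted words are penalized."""
    h, c = np.tanh(img_feat), np.zeros(H)   # seed hidden state from image
    word, used, caption = VOCAB.index("<start>"), set(), []
    for _ in range(max_len):
        x = np.zeros(V); x[word] = 1.0      # one-hot previous word
        h, c = lstm_step(x, h, c)
        logits = Wo @ h
        for w in used:                      # discourage repeated words
            logits[w] -= diversity_penalty
        word = int(np.argmax(logits))
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
        used.add(word)
    return " ".join(caption)

print(generate(rng.normal(0, 1, H)))
```

With trained weights the same loop yields grammatical sentences; here the point is only the decoding structure, where the repeat-penalty is a crude stand-in for the diversity-promoting decoding discussed in the paper.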

Papers and presentations

BibTeX entries:

@article{fan2018deepdiary,
    author = {Chenyou Fan and Zehua Zhang and David Crandall},
    title = {Deepdiary: Lifelogging image captioning and summarization},
    journal = {Journal of Visual Communication and Image Representation},
    volume = {55},
    pages = {40--55},
    month = {August},
    year = {2018}
}

@inproceedings{fan2016deepdiary,
    author = {Chenyou Fan and David Crandall},
    title = {{DeepDiary:} Automatically Captioning Lifelogging Image Streams},
    booktitle = {European Conference on Computer Vision International Workshop on Egocentric Perception, Interaction, and Computing (EPIC)},
    year = {2016}
}

@techreport{fan2016deepdiaryarxiv,
    author = {Chenyou Fan and David Crandall},
    title = {{DeepDiary:} Automatic caption generation for lifelogging image streams},
    institution = {arXiv 1606.07839},
    year = {2016}
}


  • Poster.
  • GitHub Code Repository. This repository is a Caffe implementation of image captioning on lifelogging data. Please see our paper and the repository README for details on how to use this package to generate interesting and diverse sentences for your own photos.
  • Lifelogging dataset. This dataset contains the VGG image features and human-written captions we collected during this project. Our GitHub site explains in detail how to use the data files to train a model on these labels.
  • AMT dataset. A subset of our dataset containing the photos we published on Amazon Mechanical Turk for public labeling.


The IU Computer Vision Lab's projects and activities have been funded, in part, by grants and contracts from the Air Force Office of Scientific Research (AFOSR), the Defense Threat Reduction Agency (DTRA), Dzyne Technologies, EgoVid, Inc., ETRI, Facebook, Google, Grant Thornton LLP, IARPA, the Indiana Innovation Institute (IN3), the IU Data to Insight Center, the IU Office of the Vice Provost for Research through an Emerging Areas of Research grant, the IU Social Sciences Research Commons, the Lilly Endowment, NASA, National Science Foundation (IIS-1253549, CNS-1834899, CNS-1408730, BCS-1842817, CNS-1744748, IIS-1257141, IIS-1852294), NVidia, ObjectVideo, Office of Naval Research (ONR), Pixm, Inc., and the U.S. Navy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.