Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation

Authors: Satoshi Tsutsui and David Crandall.


Recent work in computer vision has yielded impressive results in automatically describing images with natural language. Most of these systems generate captions in a single language, requiring multiple language-specific models to build a multilingual captioning system. We propose a very simple technique to build a single unified model across languages, using artificial tokens to control the language, making the captioning system more compact. We evaluate our approach on generating English and Japanese captions, and show that a typical neural captioning architecture is capable of learning a single model that can switch between two different languages.
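To make the idea concrete, here is a minimal sketch of how captions from two languages might be prepared for one unified model. It is an illustration under stated assumptions, not the released code: the token names `<en>`, `<jp>`, and `<eos>`, the helper functions, and the choice to place the artificial token at the start of each caption are all hypothetical.

```python
from typing import Dict, List

# Hypothetical artificial language tokens; the names are illustrative only.
LANG_TOKENS = {"en": "<en>", "jp": "<jp>"}
END_TOKEN = "<eos>"  # assumed end-of-caption marker


def tag_caption(words: List[str], lang: str) -> List[str]:
    """Prefix a tokenized caption with its language token and append the end marker."""
    return [LANG_TOKENS[lang]] + words + [END_TOKEN]


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Build one shared vocabulary covering both languages and the artificial tokens."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


# Toy corpus of pre-tokenized captions (made-up examples, not from any dataset).
corpus = [
    (["a", "dog", "runs"], "en"),
    (["犬", "が", "走る"], "jp"),
]

tagged = [tag_caption(words, lang) for words, lang in corpus]
vocab = build_vocab(tagged)


def decoder_start_id(lang: str) -> int:
    """Id of the token fed to the decoder first, which selects the output language."""
    return vocab[LANG_TOKENS[lang]]
```

In this sketch, a single decoder is trained on the mixed, tagged captions. At generation time, feeding `decoder_start_id("en")` or `decoder_start_id("jp")` as the first input would be the only switch between languages.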


– GitHub code:
– arXiv paper:
– Citation: If you use this work in your research or find it useful, please consider citing:

author = {Satoshi Tsutsui and David Crandall},
title = {{Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation}},
booktitle = {CVPR Language and Vision Workshop},
year = {2017}


Randomly selected samples of automatically generated image captions. "En only" captions are from the model trained only on English, "Jp only" captions are from the model trained only on Japanese, and "Unified" captions are from the single model trained on both languages.

The IU Computer Vision Lab's projects and activities have been funded, in part, by grants and contracts from the Air Force Office of Scientific Research (AFOSR), the Defense Threat Reduction Agency (DTRA), Dzyne Technologies, EgoVid, Inc., ETRI, Facebook, Google, Grant Thornton LLP, IARPA, the Indiana Innovation Institute (IN3), the IU Data to Insight Center, the IU Office of the Vice Provost for Research through an Emerging Areas of Research grant, the IU Social Sciences Research Commons, the Lilly Endowment, NASA, National Science Foundation (IIS-1253549, CNS-1834899, CNS-1408730, BCS-1842817, CNS-1744748, IIS-1257141, IIS-1852294), NVidia, ObjectVideo, Office of Naval Research (ONR), Pixm, Inc., and the U.S. Navy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.