AutoCaption: Automatic Caption Generation for Personal Photos

To appear at WACV 2014

Krishnan Ramnath     Simon Baker     Lucy Vanderwende     Motaz El-Saban     Sudipta Sinha     Anitha Kannan     Noran Hassan     Michel Galley

Microsoft Research

Yi Yang                   Deva Ramanan

Dept. of Computer Science, UC Irvine

Alessandro Bergamo                   Lorenzo Torresani

Dept. of Computer Science, Dartmouth College


AutoCaption is a system that helps smartphone users generate captions for their photos. It uploads each photo to a cloud service, where a number of parallel modules recognize a variety of entities and relations. The outputs of the modules are combined to generate a large set of candidate captions, which are returned to the phone. The phone client includes a user interface that lets users select their favorite caption and then reorder, add, or delete words to obtain the grammatical style they prefer. Users can also choose among the multiple candidates returned by the recognition modules.
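
As a rough illustration of how module outputs might be combined into a large set of candidate captions, the following minimal sketch fills a few sentence templates with every combination of recognized entities. The slot names, example entities, and templates are all hypothetical; this page does not describe how AutoCaption actually generates its text.

```python
from itertools import product

# Hypothetical ranked outputs of the recognition modules, keyed by slot name.
module_outputs = {
    "person": ["Alice", "Bob"],
    "activity": ["hiking", "walking"],
    "place": ["Mount Rainier"],
}

# Hypothetical sentence templates; each one uses every slot.
TEMPLATES = [
    "{person} {activity} at {place}",
    "{activity} at {place} with {person}",
]


def candidate_captions(outputs, templates):
    """Fill every template with every combination of slot values."""
    slots = sorted(outputs)  # fixed slot order keeps the result deterministic
    captions = []
    for values in product(*(outputs[slot] for slot in slots)):
        filling = dict(zip(slots, values))
        for template in templates:
            captions.append(template.format(**filling))
    return captions


captions = candidate_captions(module_outputs, TEMPLATES)
```

With two people, two activities, one place, and two templates, this yields eight candidates, which suggests why the client needs an interface for selecting among them.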

System architecture. (Left) The smartphone client captures the photo and uploads it along with associated metadata to the cloud service. (Middle) The cloud service runs a number of processing modules in parallel. The outputs of the processing modules are passed through a fusion step and then to the text generator. The generated captions are then personalized. (Right) The client receives the captions and allows the user to pick and edit them before sharing.
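
The cloud-side flow in the caption above (parallel modules, fusion, text generation, personalization) can be sketched as follows. This is only an illustrative outline under stated assumptions: the module functions, their return format, the score-based fusion rule, the single caption template, and the "replace the user's name with Me" personalization are all hypothetical stand-ins, not the components of the actual system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical recognition modules. Each returns (slot, value, score) triples.


def detect_faces(photo, metadata):
    return [("person", "Alice", 0.9)]


def recognize_landmark(photo, metadata):
    return [("place", "Space Needle", 0.8), ("place", "Seattle", 0.6)]


def recognize_scene(photo, metadata):
    return [("place", "Seattle", 0.7)]


MODULES = [detect_faces, recognize_landmark, recognize_scene]


def run_modules(photo, metadata):
    """Run all modules in parallel and collect their hypotheses."""
    with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
        futures = [pool.submit(module, photo, metadata) for module in MODULES]
        return [hyp for future in futures for hyp in future.result()]


def fuse(hypotheses):
    """Assumed fusion rule: keep the best score per value, rank values per slot."""
    best = {}
    for slot, value, score in hypotheses:
        key = (slot, value)
        best[key] = max(score, best.get(key, 0.0))
    fused = {}
    for (slot, value), _ in sorted(best.items(), key=lambda kv: -kv[1]):
        fused.setdefault(slot, []).append(value)
    return fused


def generate(fused):
    """Stub text generator using one hypothetical template."""
    return [
        f"{person} at {place}"
        for person in fused.get("person", [])
        for place in fused.get("place", [])
    ]


def personalize(captions, user_name):
    """Assumed personalization: refer to the photo's owner as 'Me'."""
    return [caption.replace(user_name, "Me") for caption in captions]


def caption_photo(photo, metadata):
    fused = fuse(run_modules(photo, metadata))
    return personalize(generate(fused), metadata["user"])
```

Calling `caption_photo(b"", {"user": "Alice"})` with these stub modules returns `["Me at Space Needle", "Me at Seattle"]`; the client would then present these candidates for selection and editing.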



End-to-End System | Caption Generation | Caption Selection | Text Reordering | Multiple Candidates