Otter is a large vision-language model (VLM) trained on the MIMIC-IT dataset, a collection of 2.8 million multimodal instruction-response pairs derived from images and videos. Each pair is accompanied by in-context examples that form a conversational context, which strengthens VLM performance in perception, reasoning, and planning. Trained on this data, Otter demonstrates strong multimodal perception and aligns well with user intentions. MIMIC-IT is not only diverse and creative but also multilingual, covering eight languages. Applications span a range of vision-language tasks, from general scene understanding to comprehension support for augmented reality headsets, and the dataset is expected to advance research on multimodal in-context instruction tuning and vision-language models.
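To make the idea of an instruction-response pair with in-context information concrete, here is a minimal sketch of how such a record might be modeled. The class and field names (`InstructionPair`, `in_context`, `media`) are assumptions for illustration, not the actual MIMIC-IT file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InContextExample:
    # One prior instruction-response turn that supplies conversational context.
    instruction: str
    response: str

@dataclass
class InstructionPair:
    # Hypothetical schema for a MIMIC-IT-style record; field names are assumed.
    media: List[str]          # paths or IDs of the source images / video frames
    instruction: str          # the user instruction
    response: str             # the target response
    language: str = "en"      # MIMIC-IT covers eight languages
    in_context: List[InContextExample] = field(default_factory=list)

# Build one pair whose in-context example forms a short conversational context.
pair = InstructionPair(
    media=["frame_001.jpg"],
    instruction="What is the person in the video doing?",
    response="They are assembling a bookshelf.",
    in_context=[
        InContextExample(
            instruction="Describe the scene.",
            response="A living room with tools scattered on the floor.",
        )
    ],
)
```

A pair like this lets a model condition its answer on both the media and the preceding turns, which is what the in-context tuning described above exploits.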