Multimodal model

A multimodal model, or multimodal foundation model (MFM), is a machine learning model that can process and combine information from multiple data types, such as text, images, audio, and video. Rather than working with a single input type, multimodal models learn how different modalities relate to one another, giving them a richer, more complete view of a situation.

This differs from unimodal models, which are built to handle a single modality: a traditional image model analyzes only pictures, while a text model works only with words.

Most multimodal models follow a similar structure. They use specialized encoders to process each type of data (for example, one for text and another for images), then combine those representations through a fusion mechanism. A decoder uses this combined understanding to generate an output, such as text, an image, audio, or another response that reflects all the inputs together.
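The toy PyTorch sketch below illustrates this encoder-fusion-decoder pattern. The class name, dimensions, and layers are illustrative assumptions only, not any particular published architecture: a token sequence and a vector of pre-extracted image features are encoded separately, fused by concatenation, and decoded into output logits.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Toy encoder-fusion-decoder multimodal model (illustrative only)."""

    def __init__(self, vocab_size=1000, image_dim=2048, hidden_dim=256):
        super().__init__()
        # One encoder per modality.
        self.text_encoder = nn.Embedding(vocab_size, hidden_dim)   # token ids -> embeddings
        self.image_encoder = nn.Linear(image_dim, hidden_dim)      # image features -> hidden
        # Fusion: concatenate the two modality representations and project.
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)
        # Decoder: map the fused representation to output logits (e.g. answer tokens).
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, image_features):
        text_repr = self.text_encoder(token_ids).mean(dim=1)       # (batch, hidden)
        image_repr = self.image_encoder(image_features)            # (batch, hidden)
        fused = torch.relu(self.fusion(torch.cat([text_repr, image_repr], dim=-1)))
        return self.decoder(fused)                                 # (batch, vocab)


# Usage with random inputs: a batch of 2 "questions" and 2 "images".
model = ToyMultimodalModel()
tokens = torch.randint(0, 1000, (2, 12))   # 12 token ids per example
images = torch.randn(2, 2048)              # pre-extracted image feature vectors
logits = model(tokens, images)
print(logits.shape)                        # torch.Size([2, 1000])
```

Real systems replace these toy encoders with transformer backbones and use far more elaborate fusion mechanisms, but the flow of information is the same.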

Multimodal models can power applications such as visual question answering, image and video captioning, cross-modal search, and generative AI systems that move between text, images, audio, and video. Examples include CLIP, DALL·E, Gemini, GPT-4o, and Claude, which support more natural and flexible ways for people to interact with AI.
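As a small illustration of cross-modal search, the sketch below uses the openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers library to score how well a set of candidate captions matches an image; the file path "photo.jpg" and the captions are placeholder assumptions, not data from this article.

```python
# Illustrative sketch of cross-modal matching with CLIP via Hugging Face transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
captions = ["a dog playing in the park", "a plate of pasta", "a city skyline at night"]

# Encode both modalities and compute image-text similarity scores.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # one probability per caption

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```

The same similarity scores can rank a whole image collection against a text query, which is the basis of cross-modal retrieval.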
