Contrastive Language–Image Pretraining (CLIP)
Contrastive Language–Image Pretraining (CLIP) is a multimodal model developed by OpenAI that connects visual and textual information rather than treating vision and language as separate problems. It uses two neural-network encoders, one for images and one for text, trained jointly to map both kinds of data into a shared representation space. In that space, matching image–text pairs sit close together, while unrelated pairs sit far apart.
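The "close together / far apart" notion is usually measured with cosine similarity between embedding vectors. A minimal numpy sketch (the 4-dimensional toy vectors below are illustrative stand-ins for real encoder outputs, not actual CLIP embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for image- and text-encoder outputs.
image_emb = np.array([0.9, 0.1, 0.0, 0.4])  # hypothetical image embedding
match_txt = np.array([0.8, 0.2, 0.1, 0.5])  # caption that matches the image
other_txt = np.array([0.0, 0.9, 0.8, 0.1])  # unrelated caption

# A matching pair scores higher than an unrelated one.
assert cosine_similarity(image_emb, match_txt) > cosine_similarity(image_emb, other_txt)
```

Real CLIP embeddings are L2-normalized, so the dot product alone already gives the cosine similarity.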
During training, the model learns to identify which text in a batch best matches a given image (and which image best matches a given piece of text): a contrastive objective. This setup lets CLIP learn visual concepts from natural-language supervision rather than from a fixed set of class labels.
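That contrastive objective has the shape of a symmetric cross-entropy over a batch's similarity matrix, where the correct pairings lie on the diagonal. A minimal numpy sketch of this style of loss (a simplification, not OpenAI's training code; the temperature value is illustrative):

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits, in the
    shape of CLIP's contrastive objective. Rows of each input are
    per-example embeddings; image i is the correct match for text i."""
    # L2-normalize so dot products are cosine similarities.
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l):
        # Stabilize, then take log-softmax; correct class is the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairings raise it, which is exactly the pressure that pulls matching embeddings together.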
A key strength of CLIP is zero-shot learning: after training, it can classify or retrieve images using plain-text prompts, even for categories it was never explicitly trained on. For example, CLIP can decide whether an image matches the phrase “a photo of a red bicycle” without needing a dedicated bicycle classifier.
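Zero-shot classification amounts to embedding one prompt per candidate label and picking the prompt closest to the image embedding. A sketch with toy 2-dimensional vectors (the embeddings and labels here are made up for illustration; a real system would get them from CLIP's encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, labels):
    """Return the label whose text-prompt embedding is most similar
    to the image embedding (cosine similarity after normalization)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    scores = txt @ img  # one cosine similarity per candidate label
    return labels[int(np.argmax(scores))]

# Toy example: prompts like "a photo of a dog" / "a photo of a cat".
labels = ["dog", "cat"]
prompt_embs = np.array([[0.9, 0.1],   # hypothetical embedding of the dog prompt
                        [0.1, 0.9]])  # hypothetical embedding of the cat prompt
image_emb = np.array([0.8, 0.2])      # hypothetical embedding of a dog photo

assert zero_shot_classify(image_emb, prompt_embs, labels) == "dog"
```

Prompt wording matters in practice; templates such as “a photo of a {label}” typically work better than the bare label.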
CLIP can be applied to tasks like zero-shot image classification, text-to-image retrieval, and content moderation. However, it may struggle with fine-grained distinctions (for example, telling apart similar car models or flower species) and with images far outside its training distribution.
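Text-to-image retrieval follows the same pattern in the other direction: rank all image embeddings by similarity to a text query. A minimal sketch (again with made-up toy embeddings standing in for encoder outputs):

```python
import numpy as np

def retrieve_top_k(text_emb, image_embs, k=2):
    """Return indices of the k images most similar to the text query,
    best match first."""
    q = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q  # one cosine similarity per image
    return np.argsort(scores)[::-1][:k].tolist()

# Toy gallery of three image embeddings; index 1 best matches the query.
gallery = np.array([[0.1, 0.9],
                    [0.9, 0.2],
                    [0.5, 0.5]])
query = np.array([1.0, 0.1])

assert retrieve_top_k(query, gallery, k=1) == [1]
```

In production, the gallery embeddings are precomputed once and the per-query cost is a single matrix–vector product, which is what makes CLIP-style retrieval scale to large image collections.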