What is an encoder?

An encoder is a neural network component that converts raw input—text, image, audio—into a compact numerical representation that captures the meaning and relationships within that input.

In the context of transformer-based models, the encoder processes an entire input sequence at once rather than word by word. As it does, a mechanism called self-attention calculates how strongly each token in the sequence relates to every other token, producing what are called contextualized embeddings, which are representations that shift depending on surrounding words. For example, the word “bank” in “river bank” gets a different representation than “bank” in “deposit at the bank”—the encoder captures both.

Encoders are one half of the encoder-decoder architecture, where the encoder's output is passed to a decoder that generates its own output sequence. This pairing is common in tasks like machine translation and speech recognition.

Encoder-only models like BERT drop the decoder entirely and use the encoder's representations directly for tasks like named entity recognition and semantic search, which require understanding rather than generation.