Mixture of experts (MoE) is a machine learning architecture that divides a model into separate sub-networks, each called an “expert,” with each one specializing in handling a particular type of input.
Rather than running every part of the network for every input, a routing mechanism called a gating network selects only the most relevant experts for each input. This selective activation is what makes MoE models computationally efficient despite having a large number of parameters: only a fraction of the total parameters is used on any given forward pass.
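The routing idea can be sketched in a few lines. The example below is a minimal, hypothetical illustration (not a real library API): each "expert" is reduced to a single linear layer, the gating network is one weight matrix that scores every expert, and only the top-k scorers actually run, with their outputs combined by softmax weights over the selected scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 4, 8, 2

# Each "expert" is a sub-network; here just one linear layer for brevity.
expert_weights = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((n_experts, d))  # gating network: one score per expert

def moe_forward(x):
    logits = gate_w @ x                    # score every expert for this input
    top_k = np.argsort(logits)[-k:]        # keep only the k best-scoring experts
    w = np.exp(logits[top_k] - logits[top_k].max())
    w /= w.sum()                           # softmax over the selected scores
    # Only the chosen experts run; the other n_experts - k are skipped entirely.
    return sum(wi * (expert_weights[i] @ x) for wi, i in zip(w, top_k))

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)  # (4,)
```

Note that compute per token scales with k, not with n_experts, which is why an MoE model can hold far more parameters than a dense model of the same inference cost.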
The MoE architecture is widely used in large language models and has helped models such as Mixtral (and, reportedly, some GPT-series models) achieve strong performance at lower computational cost than equivalent dense models.