Mixture of Experts

Scale parameters without scaling compute.

MoE models use a learned router to send each token to a small subset of expert sub-networks.
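As a rough illustration of the routing step, here is a minimal pure-Python sketch of top-k routing. The names `top_k_route` and `moe_layer`, the toy experts, and the choice to apply softmax over only the selected logits are assumptions made for this example; real models compute the router logits from the token with a learned linear layer and differ in how they normalize the weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(token)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: each one scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# In a real model these logits would come from a learned router, not a literal.
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
output = moe_layer([1.0, 1.0], experts, logits, k=2)
```

Only two of the four toy experts are evaluated for this token, which is the whole point of the design: the cost of a forward pass tracks `k`, not the number of experts.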

Total parameter count grows with the number of experts, while per-token compute depends only on the few experts that are active.
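The arithmetic behind this can be sketched in a few lines. The function name `moe_ffn_params` and the numbers below are hypothetical, chosen only to make the ratio easy to see; the sketch counts expert parameters alone and ignores attention, embeddings, and the router itself, which are shared across all tokens.

```python
def moe_ffn_params(num_experts, active_experts, params_per_expert):
    """Return (total, active) expert parameter counts for one MoE layer."""
    total = num_experts * params_per_expert
    active = active_experts * params_per_expert
    return total, active

# Hypothetical configuration: 8 experts, 2 active per token, 1M parameters each.
total, active = moe_ffn_params(num_experts=8, active_experts=2,
                               params_per_expert=1_000_000)
ratio = active / total  # fraction of expert parameters used per token
```

Under these assumed numbers, doubling `num_experts` doubles `total` but leaves `active` unchanged, so stored capacity and per-token cost scale independently.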

Mixtral and DeepSeek-V3 use MoE designs, and GPT-4 is rumored to as well.