Architectures
Mixture of Experts
Scale parameters without scaling compute.
MoE models route each token to a subset of expert sub-networks.
Total parameter count grows with the number of experts, while per-token compute stays modest because only the selected experts run.
Mixtral and DeepSeek V3 use MoE designs, and GPT-4 is rumored to as well.
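The routing idea can be illustrated with a small sketch in plain Python. This is a toy under stated assumptions, not any particular model's implementation: the names `top_k_route`, `make_linear_expert`, and `moe_layer` are hypothetical, the experts are random linear maps, the sizes (8 experts, 2 active per token, dimension 4) are illustrative, and renormalizing the router scores over the selected experts is just one common choice.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts; return (expert_index, weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize over the chosen experts only (an assumed, common choice).
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def make_linear_expert(rng, dim):
    """Hypothetical expert: a random dim x dim linear map."""
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def moe_layer(x, router, experts, k):
    """Run one token vector through only its k selected experts."""
    # Router score per expert: dot product of that expert's router row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    routes = top_k_route(logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routes]


NUM_EXPERTS, TOP_K, DIM = 8, 2, 4  # illustrative sizes, not from any real model
rng = random.Random(0)
router = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_linear_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

output, used = moe_layer(token, router, experts, TOP_K)
print("experts used:", used)
print("fraction of expert parameters active:", TOP_K / NUM_EXPERTS)
```

In this toy setup, adding more experts increases the total parameter count, but each token still pays for only `TOP_K` expert evaluations plus the router scores.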