📚 Theory · Intermediate

Mixture of Experts (MoE)

Key Points

  • A Mixture of Experts (MoE) routes each input to a small subset of specialized models called experts, enabling conditional computation.
  • A gating (router) network computes probabilities over experts and selects the top-k experts per input, reducing the compute used per example.
  • The core equation is y(x) = ∑_{e=1}^{E} p(e|x) · f_e(x), where p(e|x) comes from a softmax or top-k softmax over router logits.
  • Sparse MoE activates only k experts out of E, often giving O(k) expert compute per token instead of O(E).
  • Practical systems enforce per-expert capacity to avoid overload and add a load-balancing auxiliary loss during training.
  • MoE layers are commonly used in large language models to scale parameters substantially while keeping inference cost manageable.
  • Routing mistakes (e.g., imbalanced load, dropped tokens) can degrade quality, so balancing and capacity tuning are critical.
  • C++ implementations center on efficient routing, batching by expert, and combining outputs with router weights.

Prerequisites

  • Linear algebra (vectors, matrices, matrix multiplication) — Experts and routers are linear or multi-layer linear transformations, and combining outputs is a weighted sum.
  • Softmax and probability distributions — Routing uses softmax to convert logits into probabilities over experts.
  • Neural network basics (MLP, activation functions) — Experts are often small feed-forward networks and require understanding layers and activations.
  • Big-O notation and computational complexity — MoE’s advantage is conditional computation; analyzing k vs. E requires complexity reasoning.
  • Batch processing and indexing — Efficient MoE needs routing, batching by expert, and reassembling outputs correctly.
  • Regularization and auxiliary losses — Load balancing relies on auxiliary loss terms to encourage even expert usage.
  • Numerical stability — Softmax requires the log-sum-exp trick to avoid overflow; probabilities must be renormalized safely.
  • Parallel and systems considerations — Real MoE deployments rely on efficient dispatch, communication, and memory management.

Detailed Explanation


01 Overview

A Mixture of Experts (MoE) is a model architecture that divides a complex task among multiple specialized sub-networks called experts. Instead of applying every expert to every input, a small router (gating network) decides which experts should process each input, enabling conditional computation. This design allows the total number of parameters to grow large—since many experts exist—while keeping the computation per input relatively constant, by activating only a few experts per example. In neural networks, experts are often simple feed-forward networks (MLPs) and the router is a linear layer followed by a softmax that produces a distribution over experts. The model’s output is a weighted combination of the selected experts’ outputs, where the weights come from the router probabilities. Modern variants use top-k routing (e.g., k=1 or 2) and enforce per-expert capacity limits to prevent overload. Training often includes an auxiliary loss to encourage even usage of experts across a batch. MoE layers have been instrumental in scaling large language models and other systems where parameter count and capacity matter, but computation budgets must stay bounded. The key trade-offs include routing quality, load balance, memory bandwidth, and implementation complexity.
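
Before the full implementations later in this section, here is a minimal numeric sketch of the core combination; the probabilities and expert outputs below are made-up values purely for illustration:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Hypothetical router probabilities p(e|x) over E = 3 experts for one input
    double p[3] = {0.7, 0.2, 0.1};
    // Hypothetical scalar outputs f_e(x) of the three experts for that input
    double f[3] = {1.5, -0.4, 2.0};

    // Core MoE combination: y(x) = sum_e p(e|x) * f_e(x)
    double y = 0.0;
    for (int e = 0; e < 3; ++e) y += p[e] * f[e];
    printf("y(x) = %.2f\n", y); // 0.7*1.5 + 0.2*(-0.4) + 0.1*2.0 = 1.17
    return 0;
}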

02 Intuition & Analogies

Imagine a hospital triage desk. Every incoming patient (input) is briefly assessed by a triage nurse (router) who decides which specialist(s) to consult—cardiologist, neurologist, or dermatologist (experts). Not every patient needs all specialists; most only see one or two. This saves time and resources compared to sending each patient to every doctor. Similarly, in a Mixture of Experts, the router quickly decides which expert networks are most relevant for an input, and only those experts run. The final diagnosis (model output) is then a careful combination of the selected experts’ opinions, often weighted by how confident the router was in each selection. If every patient went to every doctor, the hospital would be overwhelmed (dense mixture). The MoE design avoids this by activating only a few experts (sparse mixture), keeping the workload manageable while still having a huge staff of specialists available when needed. However, if too many patients are routed to the same doctor (load imbalance), queues build up and some patients may be turned away (capacity overflow). To prevent that, the system can add gentle incentives for the triage nurse to spread patients more evenly (auxiliary loss), and also set a maximum number of patients per specialist per hour (capacity). The benefits are large-scale expertise without proportional increases in per-patient cost.

03 Formal Definition

Let there be E experts, each represented by a function f_e : R^{d_in} → R^{d_out}. A gating network computes routing scores g(x) ∈ R^E for an input x ∈ R^{d_in}. The probabilistic mixture uses p(e|x) = softmax(g(x))_e. The output is the expectation under this discrete distribution: y(x) = ∑_{e=1}^{E} p(e|x) · f_e(x). In sparse MoE, we restrict computation to a subset S_k(x) of size k (top-k routing): p′(e|x) = p(e|x) · 1[e ∈ S_k(x)] / ∑_{j ∈ S_k(x)} p(j|x), and y(x) = ∑_{e ∈ S_k(x)} p′(e|x) · f_e(x). Practical implementations operate on batches X = {x_i}_{i=1}^{B}. For capacity management, each expert e has a capacity c = ⌈α · B / E⌉ with capacity factor α ≥ 1; tokens exceeding capacity may be rerouted or dropped. To improve load balance, an auxiliary loss encourages the empirical fraction of tokens assigned to each expert to match the router’s mean probabilities. Training optimizes the primary task loss plus balancing terms; inference typically uses the sparse top-k combination without auxiliary losses.
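
To make the top-k renormalization and capacity formulas concrete, here is a small C++ sketch; the probability vector, batch size B, and capacity factor α are arbitrary illustrative values:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Hypothetical router distribution over E = 4 experts for one token
    vector<double> p = {0.50, 0.30, 0.15, 0.05};
    int k = 2;

    // Top-k selection: indices of the k largest probabilities
    vector<int> idx(p.size());
    iota(idx.begin(), idx.end(), 0);
    partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                 [&](int a, int b) { return p[a] > p[b]; });

    // Renormalize within the selected set S_k(x)
    double z = 0.0;
    for (int i = 0; i < k; ++i) z += p[idx[i]];
    for (int i = 0; i < k; ++i)
        printf("expert %d: p=%.2f -> p'=%.4f\n", idx[i], p[idx[i]], p[idx[i]] / z);
    // Here p' = {0.625, 0.375}: 0.50/0.80 and 0.30/0.80

    // Capacity for a hypothetical batch: B = 1024, alpha = 1.25
    int B = 1024, E = 4;
    double alpha = 1.25;
    int c = (int)ceil(alpha * B / E); // c = 320 tokens per expert
    printf("capacity c = %d\n", c);
    return 0;
}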

04 When to Use

Use MoE layers when you want to scale parameter count and representational capacity without linearly increasing computation per input. This is common in large language models, translation systems, and multimodal models where diverse input patterns benefit from specialization. MoE is valuable when compute budgets are fixed (e.g., latency or cost constraints) but more parameters could improve quality. It is particularly useful in settings with heterogeneous data distributions, where different experts can specialize on different regimes or features. Consider MoE when the batch size is large enough to keep experts utilized and when your system can handle the added engineering for routing, batching by expert, and combining outputs. Avoid or delay MoE if you cannot afford additional memory/communication overhead, you have very small batches (leading to underutilization), or your deployment environment cannot tolerate routing variability. Also consider model maintenance: MoE may complicate profiling, debugging, and reproducibility because behavior depends on routing decisions, capacity limits, and potential token dropping under load.

⚠️ Common Mistakes

• Ignoring load balance: Without an auxiliary loss or noise in router logits, a few experts can dominate, causing capacity overflow and quality loss. Add balancing losses and monitor per-expert traffic.
• No capacity limits: Letting any number of tokens hit the same expert can explode latency and memory. Always set a capacity c = ⌈α · B / E⌉ and choose α via profiling.
• Dense compute by accident: Implementations that compute all experts and then mask most outputs defeat the purpose. In sparse MoE, compute only the selected experts’ outputs.
• Unstable routing: Router logits with large magnitude can cause near-deterministic routing early in training. Use temperature, noise, or regularization to keep routing exploratory (see the sketch after this list).
• Incorrect combination weights: After selecting top-k experts, failing to renormalize probabilities leads to biased outputs. Renormalize within the chosen set.
• Dropping too many tokens: If capacities are too small, many tokens get dropped or re-routed poorly. Track the drop rate and adjust α or E.
• Overfitting experts: Experts that see too few varied examples may overspecialize. Use shared initialization, expert dropout, or data mixing.
• Ignoring system costs: Routing and reordering overhead (CPU/GPU, memory traffic) can dominate if not optimized; batch by expert and fuse operations where possible.
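
The "unstable routing" point above suggests temperature and noise; the following is a minimal sketch of that idea. The logits, temperature values, and noise scale are illustrative, not tuned settings:

#include <bits/stdc++.h>
using namespace std;

// Exploratory routing sketch: scale logits by a temperature and add
// Gaussian noise before the softmax. Higher temperature flattens the
// distribution; noise breaks ties so more experts get explored early on.
vector<double> routing_probs(vector<double> logits, double temperature,
                             double noise_std, mt19937 &rng) {
    normal_distribution<double> noise(0.0, noise_std);
    for (double &l : logits) l = l / temperature + noise(rng);

    // Numerically stable softmax (subtract the max before exponentiating)
    double m = *max_element(logits.begin(), logits.end());
    double sum = 0.0;
    vector<double> p(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) { p[i] = exp(logits[i] - m); sum += p[i]; }
    for (double &v : p) v /= sum;
    return p;
}

int main() {
    mt19937 rng(7);
    vector<double> logits = {2.0, 1.5, 0.1, -0.3}; // hypothetical router logits
    for (double T : {0.5, 1.0, 2.0}) {
        auto p = routing_probs(logits, T, 0.1, rng);
        printf("T=%.1f:", T);
        for (double v : p) printf(" %.3f", v);
        printf("\n");
    }
    return 0;
}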

Key Formulas

MoE Output (Dense)

y(x) = ∑_{e=1}^{E} p(e|x) · f_e(x)

Explanation: The model output is a weighted sum of all expert outputs, where weights are the router probabilities. This is the original mixture-of-experts formulation.

Softmax Gating

p(e|x) = exp(g_e(x)) / ∑_{j=1}^{E} exp(g_j(x))

Explanation: The router converts logits g(x) into a probability distribution over experts. Higher logits yield higher probabilities after normalization.

Top-k Renormalization

p′(e|x) = p(e|x) · 1[e ∈ S_k(x)] / ∑_{j ∈ S_k(x)} p(j|x)

Explanation: After selecting the top-k experts S_k(x), probabilities are renormalized to sum to 1 over only those experts. This preserves the probabilistic interpretation.

Expert Capacity

c = ⌈α · B / E⌉

Explanation: Each expert can process at most c tokens per batch. The capacity factor α controls how much slack is allowed relative to perfect balance. For example, B = 1024 tokens, E = 64 experts, and α = 1.25 give c = 20.

Load Balancing Loss (Switch-style)

L_aux = E · ∑_{e=1}^{E} f̂_e · p̂_e

Explanation: f̂_e is the fraction of tokens assigned to expert e, and p̂_e is the mean router probability for expert e. Minimizing this encourages usage to match probabilities.

Dense Compute Cost

FLOPs_dense ≈ B · (F_gate + E · F_exp)

Explanation: For batch size B, dense MoE evaluates all E experts per token. F_gate is the gating cost per token; F_exp is one expert’s cost per token.

Sparse Compute Cost

FLOPs_sparse ≈ B · (F_gate + k · F_exp)

Explanation: Top-k routing evaluates only k experts per token, greatly reducing compute when k ≪ E. The gating cost remains the same.

Expert Load

N_e = ∑_{i=1}^{B} 1[e ∈ TopK(x_i)]

Explanation: The number of tokens assigned to expert e in a batch. Monitoring N_e helps detect imbalance and set capacity.

Router Entropy

H(x) = −∑_{e=1}^{E} p(e|x) · log p(e|x)

Explanation: Entropy measures certainty of routing. High entropy indicates uncertain routing; entropy regularization can stabilize training.
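
A minimal sketch of computing router entropy for a given distribution; the two example distributions below are made up to contrast a confident router with an uncertain one:

#include <bits/stdc++.h>
using namespace std;

// Entropy of a routing distribution; terms with p = 0 contribute nothing.
double router_entropy(const vector<double> &p) {
    double h = 0.0;
    for (double v : p)
        if (v > 0.0) h -= v * log(v);
    return h;
}

int main() {
    vector<double> confident = {0.97, 0.01, 0.01, 0.01}; // near-deterministic routing
    vector<double> uniform   = {0.25, 0.25, 0.25, 0.25}; // maximally uncertain routing
    printf("confident: %.4f nats\n", router_entropy(confident)); // low entropy
    printf("uniform:   %.4f nats\n", router_entropy(uniform));   // log(4) ≈ 1.386
    return 0;
}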

Asymptotic Time (Sparse)

O(B · (d_in · E + k · F_exp))

Explanation: With a linear router, computing logits costs O(B · d_in · E). Expert compute scales with k per token. This expresses overall time ignoring constants.

Complexity Analysis

Let B be the batch size, E the number of experts, k the number of selected experts per token (k ≪ E in sparse MoE), d_in the input dimension, h the expert hidden width, and d_out the output dimension. The router is typically a linear layer: computing logits costs O(B · d_in · E). For a dense mixture, each token evaluates all experts. If one expert is a 2-layer MLP with cost F_exp = O(d_in · h + h · d_out), then the expert compute is O(B · E · F_exp). Total dense cost is O(B · d_in · E + B · E · F_exp) = O(B · E · (d_in + F_exp)), with space O(E · (d_in · h + h · d_out)) for parameters plus O(B · d_out) for activations.

In sparse top-k MoE, only k experts run per token. The router cost remains O(B · d_in · E), but expert compute becomes O(B · k · F_exp). The total is O(B · d_in · E + B · k · F_exp). When E is large and k is small, expert compute shrinks by roughly a factor of E/k compared to dense.

However, practical systems incur overheads: building per-expert queues, reordering tensors, and combining outputs. If dispatching is implemented with efficient indexing and batch packing, overhead can be near O(B + E), but in naive CPU code it may appear as O(B · k + E) per step. Space complexity adds buffers for routing probabilities O(B · E), dispatch indices O(B · k), and per-expert batches up to O(B) total, plus expert parameters as in dense. Capacity constraints limit per-expert batch sizes to c = ⌈α · B / E⌉, which bounds peak memory and can cap worst-case latency per expert.
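
A back-of-the-envelope sketch of the dense vs. sparse cost formulas above; the layer sizes are illustrative, not taken from any particular model:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Illustrative sizes (made up for this estimate)
    long long B = 1024, E = 64, k = 2;
    long long d_in = 512, h = 2048, d_out = 512;

    long long F_gate = d_in * E;             // router logits per token
    long long F_exp  = d_in * h + h * d_out; // one 2-layer MLP expert per token

    long long dense  = B * (F_gate + E * F_exp); // evaluate all E experts
    long long sparse = B * (F_gate + k * F_exp); // evaluate only k experts

    printf("dense  ~ %lld MACs\n", dense);
    printf("sparse ~ %lld MACs\n", sparse);
    printf("expert-compute reduction ~ E/k = %lldx\n", E / k);
    return 0;
}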

Code Examples

Dense Mixture of Experts (softmax over all experts)
#include <bits/stdc++.h>
using namespace std;

// Simple utilities for vectors and matrices
using Vec = vector<double>;
using Mat = vector<Vec>; // row-major: Mat[r][c]

static std::mt19937 rng(42);

double randn() {
    static std::normal_distribution<double> dist(0.0, 0.02);
    return dist(rng);
}

Vec softmax(const Vec &logits) {
    double m = *max_element(logits.begin(), logits.end());
    double sum = 0.0;
    Vec probs(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = exp(logits[i] - m);
        sum += probs[i];
    }
    for (double &p : probs) p /= (sum + 1e-12);
    return probs;
}

struct Linear {
    int in, out;
    Mat W; // out x in
    Vec b; // out
    Linear(int in_, int out_) : in(in_), out(out_), W(out_, Vec(in_)), b(out_) {
        for (int r = 0; r < out; ++r) {
            for (int c = 0; c < in; ++c) W[r][c] = randn();
            b[r] = 0.0;
        }
    }
    Vec forward(const Vec &x) const {
        Vec y(out, 0.0);
        for (int r = 0; r < out; ++r) {
            double s = b[r];
            for (int c = 0; c < in; ++c) s += W[r][c] * x[c];
            y[r] = s;
        }
        return y;
    }
};

Vec relu(const Vec &x) {
    Vec y = x;
    for (double &v : y) v = max(0.0, v);
    return y;
}

// A simple MLP expert: Linear -> ReLU -> Linear
struct ExpertMLP {
    Linear l1;
    Linear l2;
    ExpertMLP(int d_in, int h, int d_out) : l1(d_in, h), l2(h, d_out) {}
    Vec forward(const Vec &x) const {
        return l2.forward(relu(l1.forward(x)));
    }
};

// Dense MoE: computes all experts then weights by router softmax
struct DenseMoE {
    int E; // number of experts
    int d_in, d_out, h;
    Linear router; // d_in -> E (router logits)
    vector<ExpertMLP> experts;
    DenseMoE(int d_in_, int d_out_, int h_, int E_)
        : E(E_), d_in(d_in_), d_out(d_out_), h(h_), router(d_in_, E_) {
        experts.reserve(E);
        for (int e = 0; e < E; ++e) experts.emplace_back(d_in, h, d_out);
    }

    // Forward for a single input x
    Vec forward_one(const Vec &x) const {
        // 1) Router probabilities over experts
        Vec logits = router.forward(x);
        Vec p = softmax(logits);
        // 2) Compute every expert and combine
        Vec y(d_out, 0.0);
        for (int e = 0; e < E; ++e) {
            Vec ye = experts[e].forward(x); // expensive when E is large
            for (int j = 0; j < d_out; ++j) y[j] += p[e] * ye[j];
        }
        return y;
    }

    // Batch forward (list of inputs)
    vector<Vec> forward_batch(const vector<Vec> &X) const {
        vector<Vec> Y; Y.reserve(X.size());
        for (const auto &x : X) Y.push_back(forward_one(x));
        return Y;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int B = 4;     // batch size
    int d_in = 8;  // input dimension
    int h = 16;    // expert hidden width
    int d_out = 6; // output dimension
    int E = 5;     // number of experts (dense computes all 5)

    DenseMoE moe(d_in, d_out, h, E);

    // Create a small batch of random inputs
    vector<Vec> X(B, Vec(d_in));
    std::uniform_real_distribution<double> ud(-1.0, 1.0);
    for (int i = 0; i < B; ++i) for (int j = 0; j < d_in; ++j) X[i][j] = ud(rng);

    auto Y = moe.forward_batch(X);

    // Print results
    cout << fixed << setprecision(4);
    for (int i = 0; i < B; ++i) {
        cout << "y[" << i << "]: ";
        for (double v : Y[i]) cout << v << ' ';
        cout << '\n';
    }

    return 0;
}

This program builds a dense Mixture of Experts: a router produces softmax probabilities over E experts, every expert MLP is evaluated, and outputs are combined as a weighted sum. It demonstrates the core equation y(x) = sum_e p(e|x) f_e(x) and serves as a correctness baseline. However, it computes all experts for every input, which is computationally expensive when E is large.

Time: O(B · (d_in · E + E · (d_in · h + h · d_out)))
Space: O(E · (d_in · h + h · d_out) + B · d_out)

Sparse Top-2 MoE with Capacity and Load-Balancing Metric
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<Vec>;

static std::mt19937 rng2(123);

double randn2() {
    static std::normal_distribution<double> dist(0.0, 0.02);
    return dist(rng2);
}

Vec softmax(const Vec &logits) {
    double m = *max_element(logits.begin(), logits.end());
    double sum = 0.0;
    Vec probs(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = exp(logits[i] - m);
        sum += probs[i];
    }
    for (double &p : probs) p /= (sum + 1e-12);
    return probs;
}

struct Linear {
    int in, out;
    vector<Vec> W; // out x in
    Vec b; // out
    Linear(int in_, int out_) : in(in_), out(out_), W(out_, Vec(in_)), b(out_) {
        for (int r = 0; r < out; ++r) {
            for (int c = 0; c < in; ++c) W[r][c] = randn2();
            b[r] = 0.0;
        }
    }
    Vec forward(const Vec &x) const {
        Vec y(out, 0.0);
        for (int r = 0; r < out; ++r) {
            double s = b[r];
            for (int c = 0; c < in; ++c) s += W[r][c] * x[c];
            y[r] = s;
        }
        return y;
    }
};

Vec relu(const Vec &x) {
    Vec y = x;
    for (double &v : y) v = max(0.0, v);
    return y;
}

struct ExpertMLP {
    Linear l1, l2;
    ExpertMLP(int d_in, int h, int d_out) : l1(d_in, h), l2(h, d_out) {}
    Vec forward(const Vec &x) const {
        return l2.forward(relu(l1.forward(x)));
    }
};

struct SparseTop2MoE {
    int E, d_in, d_out, h;
    Linear router;             // d_in -> E logits
    vector<ExpertMLP> experts; // E experts
    double capacity_factor;    // alpha

    SparseTop2MoE(int d_in_, int d_out_, int h_, int E_, double alpha)
        : E(E_), d_in(d_in_), d_out(d_out_), h(h_), router(d_in_, E_), capacity_factor(alpha) {
        experts.reserve(E);
        for (int e = 0; e < E; ++e) experts.emplace_back(d_in, h, d_out);
    }

    // Find top-2 indices and values from a probability vector p
    tuple<int,int,double,double> top2(const Vec &p) const {
        int a = -1, b = -1; double va = -1.0, vb = -1.0;
        for (int i = 0; i < (int)p.size(); ++i) {
            double v = p[i];
            if (v > va) { vb = va; b = a; va = v; a = i; }
            else if (v > vb) { vb = v; b = i; }
        }
        if (b == -1) { b = a; vb = 0.0; } // fallback if E==1
        return {a, b, va, vb};
    }

    struct Assignment { int expert; int token_idx; double weight; };

    // Forward on a batch with top-2 routing, per-expert capacity, and load-balance metric
    vector<Vec> forward_batch(const vector<Vec> &X, double &aux_loss_out, double &drop_rate_out) const {
        int B = (int)X.size();
        int capacity = (int)ceil(capacity_factor * (double)B / (double)E);

        // Router probabilities for each token
        vector<Vec> P(B, Vec(E));
        for (int i = 0; i < B; ++i) P[i] = softmax(router.forward(X[i]));

        // Load-balance stats: mean probs per expert
        Vec mean_p(E, 0.0);
        for (int i = 0; i < B; ++i) for (int e = 0; e < E; ++e) mean_p[e] += P[i][e];
        for (int e = 0; e < E; ++e) mean_p[e] /= max(1, B);

        // Build per-expert queues (token indices + weights) with capacity
        vector<vector<Assignment>> queues(E);
        vector<int> counts(E, 0);
        int dropped = 0;

        for (int i = 0; i < B; ++i) {
            auto [e1, e2, w1, w2] = top2(P[i]);
            double s = w1 + w2 + 1e-12; // renormalize within top-2
            w1 /= s; w2 /= s;

            // Assign to e1 (primary)
            if (counts[e1] < capacity) {
                queues[e1].push_back({e1, i, w1});
                counts[e1]++;
            } else {
                dropped++;
                w1 = 0.0; // dropped contribution
            }
            // Assign to e2 (secondary)
            if (counts[e2] < capacity) {
                queues[e2].push_back({e2, i, w2});
                counts[e2]++;
            } else {
                dropped++;
                w2 = 0.0;
            }
        }

        // Compute expert outputs only for assigned tokens
        vector<Vec> Y(B, Vec(d_out, 0.0));
        for (int e = 0; e < E; ++e) {
            for (const auto &asgn : queues[e]) {
                const Vec &x = X[asgn.token_idx];
                Vec ye = experts[e].forward(x);
                // Combine with weight
                for (int j = 0; j < d_out; ++j) Y[asgn.token_idx][j] += asgn.weight * ye[j];
            }
        }

        // Compute load balancing auxiliary loss: E * sum_e f_e * mean_p_e
        Vec frac(E, 0.0);
        for (int e = 0; e < E; ++e) frac[e] = (double)counts[e] / max(1, B); // assigned tokens / B
        double aux = 0.0;
        for (int e = 0; e < E; ++e) aux += frac[e] * mean_p[e];
        aux_loss_out = E * aux;

        drop_rate_out = (double)dropped / max(1, 2*B); // up to 2 assignments per token

        return Y;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int B = 6; int d_in = 8; int h = 16; int d_out = 6; int E = 4; double alpha = 1.25;
    SparseTop2MoE moe(d_in, d_out, h, E, alpha);

    // Create random inputs
    vector<Vec> X(B, Vec(d_in));
    std::uniform_real_distribution<double> ud(-1.0, 1.0);
    for (int i = 0; i < B; ++i) for (int j = 0; j < d_in; ++j) X[i][j] = ud(rng2);

    double aux = 0.0, drop_rate = 0.0;
    auto Y = moe.forward_batch(X, aux, drop_rate);

    cout << fixed << setprecision(4);
    for (int i = 0; i < B; ++i) {
        cout << "y[" << i << "]: ";
        for (double v : Y[i]) cout << v << ' ';
        cout << '\n';
    }
    cout << "Auxiliary load-balance loss (Switch-style): " << aux << "\n";
    cout << "Drop rate (due to capacity): " << drop_rate << "\n";

    return 0;
}

This program implements sparse top-2 routing with per-expert capacity. It computes router probabilities, selects top-2 experts per token, renormalizes within the chosen set, builds per-expert queues up to capacity, evaluates only the needed experts, and combines outputs using the router weights. It also reports a Switch-style load-balancing auxiliary loss and a simple drop rate metric. Compared to dense MoE, this code avoids evaluating all experts for every token, illustrating conditional computation.

Time: O(B · d_in · E + B · k · (d_in · h + h · d_out) + E · capacity · d_out), with k = 2
Space: O(E · (d_in · h + h · d_out) + B · E for router probabilities + B · k for dispatch queues)
#mixture-of-experts #moe #gating-network #router #top-k-routing #conditional-computation #sparse-experts #load-balancing-loss #capacity-factor #switch-transformer #expert-specialization #dispatch-combine #entropy-regularization #neural-networks #transformer-moe