📚 Theory · Intermediate

Mixture of Experts (MoE)

Key Points

  • A Mixture of Experts (MoE) routes each input to a small subset of specialized models called experts, enabling conditional computation.
  • A gating (router) network computes probabilities over experts and selects the top-k experts per input, reducing the compute used per example.
  • The core equation is y(x) = ∑_{e=1}^{E} p(e|x) · f_e(x), where p(e|x) comes from a softmax or top-k softmax over router logits.
  • Sparse MoE activates only k experts out of E, often giving O(k) expert compute per token instead of O(E).
  • Practical systems enforce per-expert capacity to avoid overload and add a load-balancing auxiliary loss during training.
  • MoE layers are commonly used in large language models to scale parameters substantially while keeping inference cost manageable.
  • Routing mistakes (e.g., imbalanced load, dropped tokens) can degrade quality, so balancing and capacity tuning are critical.
  • C++ implementations center on efficient routing, batching by expert, and combining outputs with router weights.

Prerequisites

  • Linear algebra (vectors, matrices, matrix multiplication) — Experts and routers are linear or multi-layer linear transformations, and combining outputs is a weighted sum.
  • Softmax and probability distributions — Routing uses softmax to convert logits into probabilities over experts.
  • Neural network basics (MLP, activation functions) — Experts are often small feed-forward networks and require understanding layers and activations.
  • Big-O notation and computational complexity — MoE’s advantage is conditional computation; analyzing k vs. E requires complexity reasoning.
  • Batch processing and indexing — Efficient MoE needs routing, batching by expert, and reassembling outputs correctly.
  • Regularization and auxiliary losses — Load balancing relies on auxiliary loss terms to encourage even expert usage.
  • Numerical stability — Softmax requires the log-sum-exp trick to avoid overflow; probabilities must be renormalized safely.
  • Parallel and systems considerations — Real MoE deployments rely on efficient dispatch, communication, and memory management.

Detailed Explanation


01 Overview

A Mixture of Experts (MoE) is a model architecture that divides a complex task among multiple specialized sub-networks called experts. Instead of applying every expert to every input, a small router (gating network) decides which experts should process each input, enabling conditional computation. This design allows the total number of parameters to grow large—since many experts exist—while keeping the computation per input relatively constant, by activating only a few experts per example. In neural networks, experts are often simple feed-forward networks (MLPs) and the router is a linear layer followed by a softmax that produces a distribution over experts. The model’s output is a weighted combination of the selected experts’ outputs, where the weights come from the router probabilities. Modern variants use top-k routing (e.g., k=1 or 2) and enforce per-expert capacity limits to prevent overload. Training often includes an auxiliary loss to encourage even usage of experts across a batch. MoE layers have been instrumental in scaling large language models and other systems where parameter count and capacity matter, but computation budgets must stay bounded. The key trade-offs include routing quality, load balance, memory bandwidth, and implementation complexity.
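
Before the full implementations later in this section, here is a minimal numeric sketch of the core combination; the probabilities and expert outputs below are made-up values purely for illustration:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Hypothetical router probabilities p(e|x) over E = 3 experts for one input
    double p[3] = {0.7, 0.2, 0.1};
    // Hypothetical scalar outputs f_e(x) of the three experts for that input
    double f[3] = {1.5, -0.4, 2.0};

    // Core MoE combination: y(x) = sum_e p(e|x) * f_e(x)
    double y = 0.0;
    for (int e = 0; e < 3; ++e) y += p[e] * f[e];
    printf("y(x) = %.2f\n", y); // 0.7*1.5 + 0.2*(-0.4) + 0.1*2.0 = 1.17
    return 0;
}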

02 Intuition & Analogies

Imagine a hospital triage desk. Every incoming patient (input) is briefly assessed by a triage nurse (router) who decides which specialist(s) to consult—cardiologist, neurologist, or dermatologist (experts). Not every patient needs all specialists; most only see one or two. This saves time and resources compared to sending each patient to every doctor. Similarly, in a Mixture of Experts, the router quickly decides which expert networks are most relevant for an input, and only those experts run. The final diagnosis (model output) is then a careful combination of the selected experts’ opinions, often weighted by how confident the router was in each selection. If every patient went to every doctor, the hospital would be overwhelmed (dense mixture). The MoE design avoids this by activating only a few experts (sparse mixture), keeping the workload manageable while still having a huge staff of specialists available when needed. However, if too many patients are routed to the same doctor (load imbalance), queues build up and some patients may be turned away (capacity overflow). To prevent that, the system can add gentle incentives for the triage nurse to spread patients more evenly (auxiliary loss), and also set a maximum number of patients per specialist per hour (capacity). The benefits are large-scale expertise without proportional increases in per-patient cost.

03 Formal Definition

Let there be E experts, each represented by a function f_e : R^{d_in} → R^{d_out}. A gating network computes routing scores g(x) ∈ R^E for an input x ∈ R^{d_in}. The probabilistic mixture uses p(e|x) = softmax(g(x))_e. The output is the expectation under this discrete distribution: y(x) = ∑_{e=1}^{E} p(e|x) · f_e(x). In sparse MoE, we restrict computation to a subset S_k(x) of size k (top-k routing): p′(e|x) = p(e|x) · 1[e ∈ S_k(x)] / ∑_{j ∈ S_k(x)} p(j|x), and y(x) = ∑_{e ∈ S_k(x)} p′(e|x) · f_e(x). Practical implementations operate on batches X = {x_i}_{i=1}^{B}. For capacity management, each expert e has a capacity c = ⌈α · B / E⌉ with capacity factor α ≥ 1; tokens exceeding capacity may be rerouted or dropped. To improve load balance, an auxiliary loss encourages the empirical fraction of tokens assigned to each expert to match the router’s mean probabilities. Training optimizes the primary task loss plus balancing terms; inference typically uses the sparse top-k combination without auxiliary losses.
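
To make the top-k renormalization and capacity formulas concrete, here is a small C++ sketch; the probability vector, batch size B, and capacity factor α are arbitrary illustrative values:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Hypothetical router distribution over E = 4 experts for one token
    vector<double> p = {0.50, 0.30, 0.15, 0.05};
    int k = 2;

    // Top-k selection: indices of the k largest probabilities
    vector<int> idx(p.size());
    iota(idx.begin(), idx.end(), 0);
    partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                 [&](int a, int b) { return p[a] > p[b]; });

    // Renormalize within the selected set S_k(x)
    double z = 0.0;
    for (int i = 0; i < k; ++i) z += p[idx[i]];
    for (int i = 0; i < k; ++i)
        printf("expert %d: p=%.2f -> p'=%.4f\n", idx[i], p[idx[i]], p[idx[i]] / z);
    // Here p' = {0.625, 0.375}: 0.50/0.80 and 0.30/0.80

    // Capacity for a hypothetical batch: B = 1024, alpha = 1.25
    int B = 1024, E = 4;
    double alpha = 1.25;
    int c = (int)ceil(alpha * B / E); // c = 320 tokens per expert
    printf("capacity c = %d\n", c);
    return 0;
}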

04 When to Use

Use MoE layers when you want to scale parameter count and representational capacity without linearly increasing computation per input. This is common in large language models, translation systems, and multimodal models where diverse input patterns benefit from specialization. MoE is valuable when compute budgets are fixed (e.g., latency or cost constraints) but more parameters could improve quality. It is particularly useful in settings with heterogeneous data distributions, where different experts can specialize on different regimes or features. Consider MoE when the batch size is large enough to keep experts utilized and when your system can handle the added engineering for routing, batching by expert, and combining outputs. Avoid or delay MoE if you cannot afford additional memory/communication overhead, you have very small batches (leading to underutilization), or your deployment environment cannot tolerate routing variability. Also consider model maintenance: MoE may complicate profiling, debugging, and reproducibility because behavior depends on routing decisions, capacity limits, and potential token dropping under load.

⚠️ Common Mistakes

• Ignoring load balance: Without an auxiliary loss or noise in router logits, a few experts can dominate, causing capacity overflow and quality loss. Add balancing losses and monitor per-expert traffic.
• No capacity limits: Letting any number of tokens hit the same expert can explode latency and memory. Always set a capacity c = ⌈α · B / E⌉ and choose α via profiling.
• Dense compute by accident: Implementations that compute all experts and then mask most outputs defeat the purpose. In sparse MoE, compute only the selected experts’ outputs.
• Unstable routing: Router logits with large magnitude can cause near-deterministic routing early in training. Use temperature, noise, or regularization to keep routing exploratory (see the sketch after this list).
• Incorrect combination weights: After selecting top-k experts, failing to renormalize probabilities leads to biased outputs. Renormalize within the chosen set.
• Dropping too many tokens: If capacities are too small, many tokens get dropped or re-routed poorly. Track the drop rate and adjust α or E.
• Overfitting experts: Experts that see too few varied examples may overspecialize. Use shared initialization, expert dropout, or data mixing.
• Ignoring system costs: Routing and reordering overhead (CPU/GPU, memory traffic) can dominate if not optimized; batch by expert and fuse operations where possible.
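
The "unstable routing" point above suggests temperature and noise; the following is a minimal sketch of that idea. The logits, temperature values, and noise scale are illustrative, not tuned settings:

#include <bits/stdc++.h>
using namespace std;

// Exploratory routing sketch: scale logits by a temperature and add
// Gaussian noise before the softmax. Higher temperature flattens the
// distribution; noise breaks ties so more experts get explored early on.
vector<double> routing_probs(vector<double> logits, double temperature,
                             double noise_std, mt19937 &rng) {
    normal_distribution<double> noise(0.0, noise_std);
    for (double &l : logits) l = l / temperature + noise(rng);

    // Numerically stable softmax (subtract the max before exponentiating)
    double m = *max_element(logits.begin(), logits.end());
    double sum = 0.0;
    vector<double> p(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) { p[i] = exp(logits[i] - m); sum += p[i]; }
    for (double &v : p) v /= sum;
    return p;
}

int main() {
    mt19937 rng(7);
    vector<double> logits = {2.0, 1.5, 0.1, -0.3}; // hypothetical router logits
    for (double T : {0.5, 1.0, 2.0}) {
        auto p = routing_probs(logits, T, 0.1, rng);
        printf("T=%.1f:", T);
        for (double v : p) printf(" %.3f", v);
        printf("\n");
    }
    return 0;
}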

Key Formulas

MoE Output (Dense)

y(x) = ∑_{e=1}^{E} p(e|x) · f_e(x)

Explanation: The model output is a weighted sum of all expert outputs, where weights are the router probabilities. This is the original mixture-of-experts formulation.

Softmax Gating

p(e|x) = exp(g_e(x)) / ∑_{j=1}^{E} exp(g_j(x))

Explanation: The router converts logits g(x) into a probability distribution over experts. Higher logits yield higher probabilities after normalization.

Top-k Renormalization

p′(e|x) = p(e|x) · 1[e ∈ S_k(x)] / ∑_{j ∈ S_k(x)} p(j|x)

Explanation: After selecting the top-k experts S_k(x), probabilities are renormalized to sum to 1 over only those experts. This preserves the probabilistic interpretation.

Expert Capacity

c = ⌈α · B / E⌉

Explanation: Each expert can process at most c tokens per batch. The capacity factor α controls how much slack is allowed relative to perfect balance. For example, B = 1024 tokens, E = 64 experts, and α = 1.25 give c = 20.

Load Balancing Loss (Switch-style)

L_aux = E · ∑_{e=1}^{E} f̂_e · p̂_e

Explanation: f̂_e is the fraction of tokens assigned to expert e, and p̂_e is the mean router probability for expert e. Minimizing this encourages usage to match probabilities.

Dense Compute Cost

FLOPs_dense ≈ B · (F_gate + E · F_exp)

Explanation: For batch size B, dense MoE evaluates all E experts per token. F_gate is the gating cost per token; F_exp is one expert’s cost per token.

Sparse Compute Cost

FLOPs_sparse ≈ B · (F_gate + k · F_exp)

Explanation: Top-k routing evaluates only k experts per token, greatly reducing compute when k ≪ E. The gating cost remains the same.

Expert Load

N_e = ∑_{i=1}^{B} 1[e ∈ TopK(x_i)]

Explanation: The number of tokens assigned to expert e in a batch. Monitoring N_e helps detect imbalance and set capacity.

Router Entropy

H(x) = −∑_{e=1}^{E} p(e|x) · log p(e|x)

Explanation: Entropy measures certainty of routing. High entropy indicates uncertain routing; entropy regularization can stabilize training.
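
A minimal sketch of computing router entropy for a given distribution; the two example distributions below are made up to contrast a confident router with an uncertain one:

#include <bits/stdc++.h>
using namespace std;

// Entropy of a routing distribution; terms with p = 0 contribute nothing.
double router_entropy(const vector<double> &p) {
    double h = 0.0;
    for (double v : p)
        if (v > 0.0) h -= v * log(v);
    return h;
}

int main() {
    vector<double> confident = {0.97, 0.01, 0.01, 0.01}; // near-deterministic routing
    vector<double> uniform   = {0.25, 0.25, 0.25, 0.25}; // maximally uncertain routing
    printf("confident: %.4f nats\n", router_entropy(confident)); // low entropy
    printf("uniform:   %.4f nats\n", router_entropy(uniform));   // log(4) ≈ 1.386
    return 0;
}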

Asymptotic Time (Sparse)

O(B · (d_in · E + k · F_exp))

Explanation: With a linear router, computing logits costs O(B · d_in · E). Expert compute scales with k per token. This expresses overall time ignoring constants.

Complexity Analysis

Let B be the batch size, E the number of experts, k the number of selected experts per token (k ≪ E in sparse MoE), d_in the input dimension, h the expert hidden width, and d_out the output dimension. The router is typically a linear layer: computing logits costs O(B · d_in · E). For a dense mixture, each token evaluates all experts. If one expert is a 2-layer MLP with cost F_exp = O(d_in · h + h · d_out), then the expert compute is O(B · E · F_exp). Total dense cost is O(B · d_in · E + B · E · F_exp) = O(B · E · (d_in + F_exp)), with space O(E · (d_in · h + h · d_out)) for parameters plus O(B · d_out) for activations.

In sparse top-k MoE, only k experts run per token. The router cost remains O(B · d_in · E), but expert compute becomes O(B · k · F_exp). The total is O(B · d_in · E + B · k · F_exp). When E is large and k is small, expert compute shrinks by roughly a factor of E/k compared to dense.

However, practical systems incur overheads: building per-expert queues, reordering tensors, and combining outputs. If dispatching is implemented with efficient indexing and batch packing, overhead can be near O(B + E), but in naive CPU code it may appear as O(B · k + E) per step. Space complexity adds buffers for routing probabilities O(B · E), dispatch indices O(B · k), and per-expert batches up to O(B) total, plus expert parameters as in dense. Capacity constraints limit per-expert batch sizes to c = ⌈α · B / E⌉, which bounds peak memory and can cap worst-case latency per expert.
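
A back-of-the-envelope sketch of the dense vs. sparse cost formulas above; the layer sizes are illustrative, not taken from any particular model:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Illustrative sizes (made up for this estimate)
    long long B = 1024, E = 64, k = 2;
    long long d_in = 512, h = 2048, d_out = 512;

    long long F_gate = d_in * E;             // router logits per token
    long long F_exp  = d_in * h + h * d_out; // one 2-layer MLP expert per token

    long long dense  = B * (F_gate + E * F_exp); // evaluate all E experts
    long long sparse = B * (F_gate + k * F_exp); // evaluate only k experts

    printf("dense  ~ %lld MACs\n", dense);
    printf("sparse ~ %lld MACs\n", sparse);
    printf("expert-compute reduction ~ E/k = %lldx\n", E / k);
    return 0;
}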

Code Examples

Dense Mixture of Experts (softmax over all experts)
#include <bits/stdc++.h>
using namespace std;

// Simple utilities for vectors and matrices
using Vec = vector<double>;
using Mat = vector<Vec>; // row-major: Mat[r][c]

static std::mt19937 rng(42);

double randn() {
    static std::normal_distribution<double> dist(0.0, 0.02);
    return dist(rng);
}

Vec softmax(const Vec &logits) {
    double m = *max_element(logits.begin(), logits.end());
    double sum = 0.0;
    Vec probs(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = exp(logits[i] - m);
        sum += probs[i];
    }
    for (double &p : probs) p /= (sum + 1e-12);
    return probs;
}

struct Linear {
    int in, out;
    Mat W; // out x in
    Vec b; // out
    Linear(int in_, int out_) : in(in_), out(out_), W(out_, Vec(in_)), b(out_) {
        for (int r = 0; r < out; ++r) {
            for (int c = 0; c < in; ++c) W[r][c] = randn();
            b[r] = 0.0;
        }
    }
    Vec forward(const Vec &x) const {
        Vec y(out, 0.0);
        for (int r = 0; r < out; ++r) {
            double s = b[r];
            for (int c = 0; c < in; ++c) s += W[r][c] * x[c];
            y[r] = s;
        }
        return y;
    }
};

Vec relu(const Vec &x) {
    Vec y = x;
    for (double &v : y) v = max(0.0, v);
    return y;
}

// A simple MLP expert: Linear -> ReLU -> Linear
struct ExpertMLP {
    Linear l1;
    Linear l2;
    ExpertMLP(int d_in, int h, int d_out) : l1(d_in, h), l2(h, d_out) {}
    Vec forward(const Vec &x) const {
        return l2.forward(relu(l1.forward(x)));
    }
};

// Dense MoE: computes all experts then weights by router softmax
struct DenseMoE {
    int E; // number of experts
    int d_in, d_out, h;
    Linear router; // d_in -> E (router logits)
    vector<ExpertMLP> experts;
    DenseMoE(int d_in_, int d_out_, int h_, int E_)
        : E(E_), d_in(d_in_), d_out(d_out_), h(h_), router(d_in_, E_) {
        experts.reserve(E);
        for (int e = 0; e < E; ++e) experts.emplace_back(d_in, h, d_out);
    }

    // Forward for a single input x
    Vec forward_one(const Vec &x) const {
        // 1) Router probabilities over experts
        Vec logits = router.forward(x);
        Vec p = softmax(logits);
        // 2) Compute every expert and combine
        Vec y(d_out, 0.0);
        for (int e = 0; e < E; ++e) {
            Vec ye = experts[e].forward(x); // expensive when E is large
            for (int j = 0; j < d_out; ++j) y[j] += p[e] * ye[j];
        }
        return y;
    }

    // Batch forward (list of inputs)
    vector<Vec> forward_batch(const vector<Vec> &X) const {
        vector<Vec> Y; Y.reserve(X.size());
        for (const auto &x : X) Y.push_back(forward_one(x));
        return Y;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int B = 4;     // batch size
    int d_in = 8;  // input dimension
    int h = 16;    // expert hidden width
    int d_out = 6; // output dimension
    int E = 5;     // number of experts (dense computes all 5)

    DenseMoE moe(d_in, d_out, h, E);

    // Create a small batch of random inputs
    vector<Vec> X(B, Vec(d_in));
    std::uniform_real_distribution<double> ud(-1.0, 1.0);
    for (int i = 0; i < B; ++i) for (int j = 0; j < d_in; ++j) X[i][j] = ud(rng);

    auto Y = moe.forward_batch(X);

    // Print results
    cout << fixed << setprecision(4);
    for (int i = 0; i < B; ++i) {
        cout << "y[" << i << "]: ";
        for (double v : Y[i]) cout << v << ' ';
        cout << '\n';
    }

    return 0;
}

This program builds a dense Mixture of Experts: a router produces softmax probabilities over E experts, every expert MLP is evaluated, and outputs are combined as a weighted sum. It demonstrates the core equation y(x) = sum_e p(e|x) f_e(x) and serves as a correctness baseline. However, it computes all experts for every input, which is computationally expensive when E is large.

Time: O(B · (d_in · E + E · (d_in · h + h · d_out)))
Space: O(E · (d_in · h + h · d_out) + B · d_out)

Sparse Top-2 MoE with Capacity and Load-Balancing Metric
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<Vec>;

static std::mt19937 rng2(123);

double randn2() {
    static std::normal_distribution<double> dist(0.0, 0.02);
    return dist(rng2);
}

Vec softmax(const Vec &logits) {
    double m = *max_element(logits.begin(), logits.end());
    double sum = 0.0;
    Vec probs(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = exp(logits[i] - m);
        sum += probs[i];
    }
    for (double &p : probs) p /= (sum + 1e-12);
    return probs;
}

struct Linear {
    int in, out;
    vector<Vec> W; // out x in
    Vec b; // out
    Linear(int in_, int out_) : in(in_), out(out_), W(out_, Vec(in_)), b(out_) {
        for (int r = 0; r < out; ++r) {
            for (int c = 0; c < in; ++c) W[r][c] = randn2();
            b[r] = 0.0;
        }
    }
    Vec forward(const Vec &x) const {
        Vec y(out, 0.0);
        for (int r = 0; r < out; ++r) {
            double s = b[r];
            for (int c = 0; c < in; ++c) s += W[r][c] * x[c];
            y[r] = s;
        }
        return y;
    }
};

Vec relu(const Vec &x) {
    Vec y = x;
    for (double &v : y) v = max(0.0, v);
    return y;
}

struct ExpertMLP {
    Linear l1, l2;
    ExpertMLP(int d_in, int h, int d_out) : l1(d_in, h), l2(h, d_out) {}
    Vec forward(const Vec &x) const {
        return l2.forward(relu(l1.forward(x)));
    }
};

struct SparseTop2MoE {
    int E, d_in, d_out, h;
    Linear router;             // d_in -> E logits
    vector<ExpertMLP> experts; // E experts
    double capacity_factor;    // alpha

    SparseTop2MoE(int d_in_, int d_out_, int h_, int E_, double alpha)
        : E(E_), d_in(d_in_), d_out(d_out_), h(h_), router(d_in_, E_), capacity_factor(alpha) {
        experts.reserve(E);
        for (int e = 0; e < E; ++e) experts.emplace_back(d_in, h, d_out);
    }

    // Find top-2 indices and values from a probability vector p
    tuple<int,int,double,double> top2(const Vec &p) const {
        int a = -1, b = -1; double va = -1.0, vb = -1.0;
        for (int i = 0; i < (int)p.size(); ++i) {
            double v = p[i];
            if (v > va) { vb = va; b = a; va = v; a = i; }
            else if (v > vb) { vb = v; b = i; }
        }
        if (b == -1) { b = a; vb = 0.0; } // fallback if E==1
        return {a, b, va, vb};
    }

    struct Assignment { int expert; int token_idx; double weight; };

    // Forward on a batch with top-2 routing, per-expert capacity, and load-balance metric
    vector<Vec> forward_batch(const vector<Vec> &X, double &aux_loss_out, double &drop_rate_out) const {
        int B = (int)X.size();
        int capacity = (int)ceil(capacity_factor * (double)B / (double)E);

        // Router probabilities for each token
        vector<Vec> P(B, Vec(E));
        for (int i = 0; i < B; ++i) P[i] = softmax(router.forward(X[i]));

        // Load-balance stats: mean probs per expert
        Vec mean_p(E, 0.0);
        for (int i = 0; i < B; ++i) for (int e = 0; e < E; ++e) mean_p[e] += P[i][e];
        for (int e = 0; e < E; ++e) mean_p[e] /= max(1, B);

        // Build per-expert queues (token indices + weights) with capacity
        vector<vector<Assignment>> queues(E);
        vector<int> counts(E, 0);
        int dropped = 0;

        for (int i = 0; i < B; ++i) {
            auto [e1, e2, w1, w2] = top2(P[i]);
            double s = w1 + w2 + 1e-12; // renormalize within top-2
            w1 /= s; w2 /= s;

            // Assign to e1 (primary)
            if (counts[e1] < capacity) {
                queues[e1].push_back({e1, i, w1});
                counts[e1]++;
            } else {
                dropped++;
                w1 = 0.0; // dropped contribution
            }
            // Assign to e2 (secondary)
            if (counts[e2] < capacity) {
                queues[e2].push_back({e2, i, w2});
                counts[e2]++;
            } else {
                dropped++;
                w2 = 0.0;
            }
        }

        // Compute expert outputs only for assigned tokens
        vector<Vec> Y(B, Vec(d_out, 0.0));
        for (int e = 0; e < E; ++e) {
            for (const auto &asgn : queues[e]) {
                const Vec &x = X[asgn.token_idx];
                Vec ye = experts[e].forward(x);
                // Combine with weight
                for (int j = 0; j < d_out; ++j) Y[asgn.token_idx][j] += asgn.weight * ye[j];
            }
        }

        // Compute load balancing auxiliary loss: E * sum_e f_e * mean_p_e
        Vec frac(E, 0.0);
        for (int e = 0; e < E; ++e) frac[e] = (double)counts[e] / max(1, B); // assigned tokens / B
        double aux = 0.0;
        for (int e = 0; e < E; ++e) aux += frac[e] * mean_p[e];
        aux_loss_out = E * aux;

        drop_rate_out = (double)dropped / max(1, 2*B); // up to 2 assignments per token

        return Y;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int B = 6; int d_in = 8; int h = 16; int d_out = 6; int E = 4; double alpha = 1.25;
    SparseTop2MoE moe(d_in, d_out, h, E, alpha);

    // Create random inputs
    vector<Vec> X(B, Vec(d_in));
    std::uniform_real_distribution<double> ud(-1.0, 1.0);
    for (int i = 0; i < B; ++i) for (int j = 0; j < d_in; ++j) X[i][j] = ud(rng2);

    double aux = 0.0, drop_rate = 0.0;
    auto Y = moe.forward_batch(X, aux, drop_rate);

    cout << fixed << setprecision(4);
    for (int i = 0; i < B; ++i) {
        cout << "y[" << i << "]: ";
        for (double v : Y[i]) cout << v << ' ';
        cout << '\n';
    }
    cout << "Auxiliary load-balance loss (Switch-style): " << aux << "\n";
    cout << "Drop rate (due to capacity): " << drop_rate << "\n";

    return 0;
}

This program implements sparse top-2 routing with per-expert capacity. It computes router probabilities, selects top-2 experts per token, renormalizes within the chosen set, builds per-expert queues up to capacity, evaluates only the needed experts, and combines outputs using the router weights. It also reports a Switch-style load-balancing auxiliary loss and a simple drop rate metric. Compared to dense MoE, this code avoids evaluating all experts for every token, illustrating conditional computation.

Time: O(B · d_in · E + B · k · (d_in · h + h · d_out) + E · capacity · d_out), with k = 2
Space: O(E · (d_in · h + h · d_out) + B · E for router probabilities + B · k for dispatch queues)
#mixture-of-experts #moe #gating-network #router #top-k-routing #conditional-computation #sparse-experts #load-balancing-loss #capacity-factor #switch-transformer #expert-specialization #dispatch-combine #entropy-regularization #neural-networks #transformer-moe