Minimum Description Length (MDL) picks the model that compresses the data best by minimizing L(M) + L(D|M): the bits needed to describe the model plus the bits needed to describe the data given the model.
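A minimal two-part-code sketch with made-up numbers: comparing a parameter-free fair-coin model against a fitted Bernoulli model on 100 flips, charging the fitted model the standard (1/2)·log₂ n bits for its parameter.

```python
import math

def nll_bits(data, p):
    # L(D|M): negative log-likelihood in bits under a Bernoulli(p) model.
    return sum(-math.log2(p if x else 1 - p) for x in data)

data = [1] * 75 + [0] * 25  # 100 coin flips, 75 heads (toy data)

# Model A: fair coin, nothing to encode -> L(M) = 0 bits.
cost_fair = 0 + nll_bits(data, 0.5)

# Model B: fitted bias p_hat, parameter encoded at (1/2) log2(n) bits.
p_hat = sum(data) / len(data)
cost_biased = 0.5 * math.log2(len(data)) + nll_bits(data, p_hat)

best = "biased" if cost_biased < cost_fair else "fair"
```

Here the fitted model wins: its shorter data code (about 81 bits versus 100) more than pays for the few bits spent describing the parameter.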
Rényi entropy generalizes Shannon entropy by measuring uncertainty with a tunable order parameter α that shifts emphasis between common and rare outcomes.
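A small sketch of the formula H_α(p) = (1/(1−α))·log₂ Σᵢ pᵢ^α on a toy distribution, with the Shannon case (α → 1) handled as its limit:

```python
import math

def renyi_entropy(p, alpha):
    # H_alpha(p) = 1/(1 - alpha) * log2(sum_i p_i^alpha).
    # alpha = 1 is the Shannon limit, computed directly.
    if abs(alpha - 1.0) < 1e-12:
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
h0 = renyi_entropy(p, 0.0)  # Hartley entropy: log2(#outcomes) = 2 bits
h1 = renyi_entropy(p, 1.0)  # Shannon entropy: 1.75 bits
h2 = renyi_entropy(p, 2.0)  # collision entropy, weights common outcomes more
```

Larger α discounts rare outcomes, so the entropies are ordered h0 ≥ h1 ≥ h2 for any distribution.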
An f-divergence measures how different two probability distributions P and Q are by averaging a convex function f of the density ratio p(x)/q(x) under Q.
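A direct sketch of the definition D_f(P‖Q) = Σₓ q(x)·f(p(x)/q(x)); choosing f(t) = t·log t recovers KL divergence, and f(t) = ½|t − 1| recovers total variation (toy distributions, discrete case only):

```python
import math

def f_divergence(p, q, f):
    # D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), for discrete p, q.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

kl_f = lambda t: t * math.log(t) if t > 0 else 0.0  # KL divergence (nats)
tv_f = lambda t: 0.5 * abs(t - 1)                   # total variation

p = [0.4, 0.6]
q = [0.5, 0.5]
kl = f_divergence(p, q, kl_f)
tv = f_divergence(p, q, tv_f)
```

Convexity of f with f(1) = 0 is what guarantees D_f ≥ 0 with equality exactly when P = Q.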
A copula is a function that glues together marginal distributions to form a multivariate joint distribution while isolating dependence from the margins.
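A sketch of Sklar's theorem using the Clayton copula with arbitrarily chosen margins (an exponential and a logistic CDF, both picked only for illustration):

```python
import math

def clayton_copula(u, v, theta=2.0):
    # Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta);
    # theta > 0 induces lower-tail dependence.
    return (u ** -theta + v ** -theta - 1) ** (-1 / theta)

# Margins (arbitrary choices): exponential(1) and standard logistic CDFs.
F = lambda x: 1 - math.exp(-x)
G = lambda y: 1 / (1 + math.exp(-y))

# Sklar's theorem: H(x, y) = C(F(x), G(y)) is a valid joint CDF whose
# margins are exactly F and G, whatever copula C is plugged in.
H = lambda x, y: clayton_copula(F(x), G(y))

joint = H(1.0, 0.0)
```

Swapping in a different copula changes only the dependence structure; the margins F and G are untouched, which is the "isolating dependence" point in the definition.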
The Weak Law of Large Numbers (WLLN) says that the sample average of independent, identically distributed (i.i.d.) random variables with finite mean gets close to the true mean with high probability as the sample size grows.
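A quick simulation of the WLLN with Uniform(0, 1) draws (true mean 0.5): the fraction of runs whose sample mean lands within a fixed tolerance of the mean climbs toward 1 as n grows.

```python
import random

random.seed(0)  # fixed seed so the demo is reproducible

def sample_mean(n):
    # Mean of n i.i.d. Uniform(0, 1) draws; the true mean is 0.5.
    return sum(random.random() for _ in range(n)) / n

def hit_rate(n, runs=200, eps=0.02):
    # Empirical P(|X_bar_n - 0.5| < eps); the WLLN says this -> 1.
    return sum(abs(sample_mean(n) - 0.5) < eps for _ in range(runs)) / runs

small_n = hit_rate(10)
large_n = hit_rate(10_000)
```

With n = 10 the sample mean still misses the ±0.02 band most of the time; with n = 10,000 essentially every run lands inside it.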
Mixed precision training stores and computes tensors in low precision (FP16/BF16) for speed and memory savings while keeping a master copy of weights in FP32 for accurate updates.
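A crude stdlib-only simulation of why the FP32 master copy matters (real frameworks use hardware FP16/BF16; here low precision is faked by rounding the mantissa to 10 bits, FP16's mantissa width):

```python
import math

def to_low_precision(x, mantissa_bits=10):
    # Crude stand-in for FP16: keep only mantissa_bits bits of mantissa.
    if x == 0.0:
        return 0.0
    exp = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (mantissa_bits - exp)
    return round(x * scale) / scale

lr, grad = 1e-4, 0.5
update = lr * grad  # 5e-5, far below FP16's spacing near 1.0 (~1e-3)

# Naive low-precision accumulation: the tiny update is rounded away.
w_low = 1.0
for _ in range(100):
    w_low = to_low_precision(w_low - to_low_precision(update))

# Master-weight scheme: accumulate updates in full precision, and cast
# down only the working copy used for the (simulated) forward pass.
w_master = 1.0
for _ in range(100):
    working_copy = to_low_precision(w_master)  # what compute would see
    w_master -= update                          # precise accumulation
```

After 100 steps the naive low-precision weight has not moved at all, while the master copy has absorbed the full 100 × 5e-5 of training signal.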
Data parallelism splits the training data across workers that each hold a replica of the model, compute gradients on their own shard in parallel, and average the gradients so every replica applies the same update.
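A toy sketch of the scheme (sequential here; real systems run the workers concurrently and average with an all-reduce): two workers fit a shared scalar model y = w·x by least squares on their own shards.

```python
def local_grad(w, shard):
    # Each worker's gradient of mean((w*x - y)^2) over its own shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(x, 3.0 * x) for x in range(1, 9)]  # toy data, true w = 3
shards = [data[0:4], data[4:8]]             # two equal-sized workers

w = 0.0
for _ in range(200):
    grads = [local_grad(w, s) for s in shards]  # computed in parallel
    g = sum(grads) / len(grads)                 # "all-reduce": average
    w -= 0.01 * g                               # identical update everywhere
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so the parallel run traces the same trajectory as single-worker training.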
Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.
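A minimal scalar sketch of the Lion update rule as published (sign of a β₁-blend of momentum and gradient for the step; β₂ for the momentum update; the toy quadratic and hyperparameters are illustrative choices):

```python
def lion_step(w, g, m, lr=0.1, beta1=0.9, beta2=0.99, wd=0.0):
    # Lion: the step direction is the SIGN of an interpolation of the
    # momentum and the current gradient, so every step has size lr.
    sign = lambda x: (x > 0) - (x < 0)
    update = sign(beta1 * m + (1 - beta1) * g)
    w = w - lr * (update + wd * w)       # decoupled weight decay
    m = beta2 * m + (1 - beta2) * g      # momentum tracks the gradient
    return w, m

# Toy problem: minimize f(w) = (w - 2)^2, gradient 2*(w - 2).
w, m = 0.0, 0.0
for _ in range(300):
    w, m = lion_step(w, g=2 * (w - 2.0), m=m)
```

Because only the sign survives, Lion's updates have uniform magnitude lr regardless of gradient scale, which is why it ends up oscillating in a small band around the minimum rather than converging exactly.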
Sharpness-Aware Minimization (SAM) trains models to perform well even when their weights are slightly perturbed, seeking flatter minima that generalize better.
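A sketch of SAM's two-step update on a toy quadratic (the loss, learning rate, and radius ρ are illustrative): first climb to the worst point within a ball of radius ρ, then descend using the gradient taken there.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    # Step 1: ascend to w + eps, with eps the gradient rescaled to norm rho
    # (the first-order approximation of the worst nearby perturbation).
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: update the REAL weights with the gradient at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy loss f(w) = w0^2 + 5*w1^2, gradient (2*w0, 10*w1).
grad_fn = lambda w: [2 * w[0], 10 * w[1]]

w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w, grad_fn)
```

The extra gradient evaluation at the perturbed point is what makes SAM roughly twice the cost of plain SGD per step.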
The Moore–Penrose pseudoinverse generalizes matrix inversion to rectangular or singular matrices and is denoted A⁺.
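A sketch for the easy case: when A is tall with full column rank, A⁺ = (AᵀA)⁻¹Aᵀ, and for two columns the 2×2 inverse can be written out by hand (libraries instead use the SVD, which also covers rank-deficient matrices).

```python
def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def pinv_tall(A):
    # A+ = (A^T A)^{-1} A^T, valid when A has full column rank (2 cols here).
    At = transpose(A)
    (a, b), (c, d) = matmul(At, A)       # 2x2 Gram matrix A^T A
    det = a * d - b * c
    G_inv = [[d / det, -b / det], [-c / det, a / det]]
    return matmul(G_inv, At)

A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
A_pinv = pinv_tall(A)      # 2x3: a left inverse of the 3x2 matrix A

I2 = matmul(A_pinv, A)     # should be the 2x2 identity
```

For a tall full-rank A, A⁺b is exactly the least-squares solution of Ax = b, which is the main reason the pseudoinverse shows up in practice.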
A sparse matrix stores only its nonzero entries, saving huge amounts of memory when most entries are zero.
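A minimal dictionary-of-keys (DOK) sketch of the idea: a 1000×1000 matrix with three nonzeros stores three dictionary entries instead of a million floats, and a matrix-vector product touches only the stored entries.

```python
class SparseMatrix:
    # Dictionary-of-keys storage: {(row, col): nonzero value}.
    def __init__(self, shape):
        self.shape = shape
        self.data = {}

    def __setitem__(self, ij, v):
        if v != 0:
            self.data[ij] = v
        else:
            self.data.pop(ij, None)  # writing 0 deletes the entry

    def matvec(self, x):
        # y = A @ x in O(nnz) time, never visiting the implicit zeros.
        y = [0.0] * self.shape[0]
        for (i, j), v in self.data.items():
            y[i] += v * x[j]
        return y

A = SparseMatrix((1000, 1000))
A[0, 0] = 2.0
A[0, 999] = 1.0
A[500, 3] = -4.0

y = A.matvec([1.0] * 1000)
```

Production libraries layer faster formats (CSR, CSC) on the same principle, trading easy updates for cache-friendly row or column traversal.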
The Kronecker product A ⊗ B expands a small matrix into a larger block matrix by multiplying every entry of A with the whole matrix B.
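A short sketch of the block structure via its index formula: entry (i·p + k, j·q + l) of A ⊗ B equals A[i][j]·B[k][l], where B is p × q.

```python
def kron(A, B):
    # (A ⊗ B)[i*p + k][j*q + l] = A[i][j] * B[k][l], with B of size p x q:
    # each entry a_ij of A is replaced by the whole block a_ij * B.
    p, q = len(B), len(B[0])
    m, n = len(A), len(A[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(n * q)] for i in range(m * p)]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
K = kron(A, B)  # 4x4 block matrix: [[1*B, 2*B], [3*B, 4*B]]
```

So an m×n times p×q product yields an mp×nq matrix; the identity (A ⊗ B)(C ⊗ D) = AC ⊗ BD is what makes it useful for building big structured operators from small factors.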