
Dynamic Time Warping

Key Points

  • Dynamic Time Warping (DTW) aligns two time series that may vary in speed to find the minimum-cost correspondence between their elements.
  • It uses dynamic programming to build a cumulative cost matrix and backtracks to recover the optimal warping path.
  • Basic DTW runs in O(nm) time and O(nm) space, but window constraints like the Sakoe–Chiba band can reduce time to O(nr) and space to O(m).
  • DTW is robust to local stretching/compression in time, making it suitable for speech, gestures, sensor signals, and other temporal data.
  • The local cost is typically Euclidean or absolute distance; the global DTW distance is the sum of local costs along the optimal path.
  • Normalization (e.g., dividing by path length) and z-normalization of series are important for fair comparisons across different scales and lengths.
  • Variants include multivariate DTW, step-pattern adjustments, and pruning/lower bounds such as LB_Keogh for faster search.
  • Common pitfalls include ignoring normalization, using too wide/too narrow a window, and mis-implementing boundary and backtracking rules.

Prerequisites

  • →Dynamic Programming — DTW is solved by filling a DP table using a recurrence and optional backtracking.
  • →Time Series Basics — Understanding sequences, sampling, and temporal variation is essential to motivate warping.
  • →Distance Metrics — DTW relies on a local cost; knowing L1/L2 distances helps choose appropriate measures.
  • →Matrix Indexing and Boundaries — Correctly handling base cases and indices prevents off-by-one and boundary errors.
  • →Big-O Complexity — To judge feasibility and parameter choices (e.g., window radius), complexity analysis is needed.
  • →Numerical Stability and Normalization — Z-normalization and handling of NaNs/outliers affect DTW’s reliability.
  • →Backtracking in DP — Recovering an optimal warping path requires understanding predecessor choices.
  • →Vector Norms and Multivariate Data — For multivariate DTW, local costs are computed with norms across dimensions.

Detailed Explanation


01 Overview

Dynamic Time Warping (DTW) is an algorithm that measures similarity between two temporal sequences that may be out of phase or vary in speed. Unlike Euclidean distance, which compares points at the same timestamps, DTW flexibly ā€œwarpsā€ the time axis to align similar patterns that occur at different speeds or with local delays. DTW constructs a grid where one series runs along the rows and the other along the columns. Each cell holds the cumulative cost of the best alignment up to that pair of positions, computed via dynamic programming. The final distance is the minimal cumulative cost at the bottom-right cell, and the optimal alignment path can be found by backtracking through the grid. DTW is widely used in speech recognition, handwriting/gesture matching, seismology, ECG analysis, and general time-series retrieval. While the classic algorithm has quadratic time and space complexity, practical systems often employ constraints (e.g., Sakoe–Chiba band), lower bounds (e.g., LB_Keogh), and streaming/rolling-memory techniques to scale to longer sequences. DTW can be extended to multivariate data by defining the local cost between vectors, commonly via Euclidean distance.

02 Intuition & Analogies

Imagine two people singing the same song, but one sings a bit faster in the chorus and slower in the verses. If you compare their audio signals point by point in time, they won’t match well, even though they’re singing the same melody. DTW is like a smart karaoke judge: it allows the time axis to stretch and compress locally so that corresponding notes line up, judging similarity by how much bending is needed.

Another analogy is matching two paths on a map walked at different speeds. If both hikers traverse the same landmarks but pause at different spots, a naive comparison at equal time intervals says they’re apart; DTW, instead, aligns them by the sequence of landmarks, permitting one hiker’s timeline to stretch so the matching pairs are synchronized.

Visually, picture a grid with one sequence across the top and the other down the side. You want to trace a path from the top-left to bottom-right that stays monotonic (never goes backward in either sequence) and can move right, down, or diagonally. Each step pays a local mismatch cost between the two elements you’re aligning. The best path is the one with the smallest total cost. A diagonal step means both sequences advance together; a horizontal step means the top sequence waits while the bottom advances (time stretch), and a vertical step means the opposite. The more detours you take away from the diagonal, the more warping (and cost) is needed.

03 Formal Definition

Given sequences X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_m), and a local cost function d(x_i, y_j) ≄ 0 (e.g., |x_i āˆ’ y_j| or ‖x_i āˆ’ y_j‖₂), define a cumulative cost matrix D ∈ ℝ^((n+1)Ɨ(m+1)) with D(0,0) = 0 and D(i,0) = D(0,j) = āˆž for i > 0 and j > 0. For i ≄ 1, j ≄ 1, compute D(i,j) = d(x_i, y_j) + min{ D(iāˆ’1,j), D(i,jāˆ’1), D(iāˆ’1,jāˆ’1) }. The DTW distance is DTW(X,Y) = D(n,m). A warping path W is a sequence of index pairs W = (w_1, w_2, …, w_L), where w_k = (i_k, j_k), subject to: (1) boundary conditions: w_1 = (1,1), w_L = (n,m); (2) monotonicity: i_(k+1) ≄ i_k, j_(k+1) ≄ j_k; and (3) step size: w_(k+1) āˆ’ w_k ∈ {(1,0), (0,1), (1,1)}. The optimal warping path W* achieves the minimum Ī£_(k=1..L) d(x_(i_k), y_(j_k)) and corresponds to backtracking from (n,m) to (1,1) via the predecessor that achieved the minimum at each D(i,j). Constraints such as a Sakoe–Chiba band enforce |i āˆ’ j| ≤ r to limit warping and reduce computation. Extensions to multivariate series define d using vector norms.
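To make the recurrence concrete, here is a hand-worked toy case (an illustrative example, not from any dataset): X = (1, 2, 3), Y = (2, 3), with local cost d(x_i, y_j) = |x_i āˆ’ y_j|.

```
D(1,1) = |1āˆ’2| + D(0,0)         = 1 + 0 = 1
D(1,2) = |1āˆ’3| + min(āˆž, 1, āˆž)   = 2 + 1 = 3
D(2,1) = |2āˆ’2| + min(1, āˆž, āˆž)   = 0 + 1 = 1
D(2,2) = |2āˆ’3| + min(3, 1, 1)   = 1 + 1 = 2
D(3,1) = |3āˆ’2| + min(1, āˆž, āˆž)   = 1 + 1 = 2
D(3,2) = |3āˆ’3| + min(2, 2, 1)   = 0 + 1 = 1
```

So DTW(X, Y) = D(3,2) = 1. Backtracking from (3,2) picks the diagonal predecessor (2,1), then (1,1), giving the warping path (1,1) → (2,1) → (3,2): x_1 aligns with y_1, x_2 also aligns with y_1 (a local time stretch in Y), and x_3 aligns with y_2. A point-by-point Euclidean comparison is not even defined here, since the lengths differ.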

04 When to Use

Use DTW when aligning or comparing time series that may be locally stretched, compressed, or shifted in time: speech templates vs. utterances with varied speaking rate; gesture paths recorded at different speeds; ECG beats where heart rate changes; sensor traces in activity recognition; or shape outlines sampled inconsistently. DTW is also valuable in nearest-neighbor classification of time series, motif discovery, and subsequence search (with adaptation for subsequence DTW).

Prefer DTW over Euclidean distance when the timing of features differs but their order is preserved. If global warping is modest, apply constraints like Sakoe–Chiba band to speed up without sacrificing accuracy. For very large datasets, combine DTW with lower bounds (LB_Keogh) to prune candidates before computing full DTW. For multivariate signals (e.g., 3D accelerometer), use a vector norm as the local cost. Normalize series (z-normalization) when amplitude scale differs, and consider normalizing DTW by path length to compare across sequences of different lengths.

āš ļøCommon Mistakes

  • Ignoring normalization: Comparing raw series with different scales can dominate the local cost. Apply z-normalization (subtract mean, divide by standard deviation) before DTW when appropriate.
  • Misinterpreting distance: The raw DTW cost grows with path length; comparing costs across pairs of different lengths can be misleading. Consider dividing by path length or using an average local cost.
  • Over/under-constraining: A too-narrow window (small r) may forbid the true alignment, yielding āˆž or inflated costs; a too-wide window increases runtime and may overfit noise. Tune r based on domain knowledge.
  • Boundary/initialization errors: Forgetting D(i,0)=D(0,j)=āˆž or mishandling indices causes incorrect paths. Carefully set base cases and use 1-based DP indexing over 0-based arrays.
  • Using expensive local costs: Taking square roots at every cell adds overhead; in practice, squared differences are used directly as the local cost, which is cheaper and equally valid as a dissimilarity (though the optimal path under squared costs can differ from the path under true Euclidean costs).
  • Not handling missing values: NaNs in input must be filtered, imputed, or assigned a large penalty; otherwise, comparisons propagate NaNs.
  • Expecting invariance to amplitude/offset: DTW handles time warping, not vertical shifts or scaling unless you pre-normalize or embed such invariance into the local cost.

Key Formulas

DTW Recurrence

D(i,j) = d(x_i, y_j) + min{ D(iāˆ’1,j), D(i,jāˆ’1), D(iāˆ’1,jāˆ’1) }

Explanation: The best cost to align prefixes ending at i and j equals the local mismatch plus the cheapest predecessor step. This is the core dynamic programming relation used to fill the cumulative cost matrix.

DTW Distance

DTW(X,Y)=D(n,m)

Explanation: After filling the cumulative cost matrix, the DTW distance is the minimal total cost to align the full sequences. It is read from the bottom-right cell of D.

Path Constraints

w_1 = (1,1),  w_L = (n,m),  w_(k+1) āˆ’ w_k ∈ {(1,0), (0,1), (1,1)}

Explanation: A valid warping path starts at the first elements, ends at the last, and moves monotonically using right, down, or diagonal steps. These constraints ensure order is preserved and no elements are skipped backwards.

Sakoe–Chiba Window

|i āˆ’ j| ≤ r

Explanation: Only cells within r of the diagonal are considered, restricting excessive warping. This reduces computation and often improves robustness by avoiding unrealistic alignments.

Local Cost (Univariate/Multivariate)

d(x_i, y_j) = |x_i āˆ’ y_j|   (univariate)    or    d(x_i, y_j) = √( Ī£_(k=1..p) (x_i^(k) āˆ’ y_j^(k))² )   (multivariate)

Explanation: The local cost measures dissimilarity at a pair of points. For vectors (multivariate signals), Euclidean distance across p dimensions is commonly used; for scalars, absolute difference is typical.

Path-Length Normalization

DTW_avg(X,Y) = (1/L) Ī£_(k=1..L) d(x_(i_k), y_(j_k))

Explanation: Dividing the total DTW cost by the warping path length L yields an average mismatch per aligned pair. This helps compare distances across sequences with different lengths or warping extents.

Time and Space Complexity

T(n,m) = O(nm),   S(n,m) = O(nm)

Explanation: The standard DTW fills an n-by-m matrix, giving quadratic time and space. Memory can be reduced to linear by keeping only adjacent rows or columns when path reconstruction isn’t required.

Complexity with Window

T_band(n,m,r) = O(nr),   S_band(n,m) = O(m)

Explanation: With a window of radius r, only O(r) cells per row are computed, reducing time to O(nr). Using rolling arrays reduces space to O(m).

LB_Keogh Lower Bound

LB_Keogh(X,Y) = Ī£_(i=1..n) { (x_i āˆ’ U_i)²  if x_i > U_i;   (L_i āˆ’ x_i)²  if x_i < L_i;   0  otherwise }

Explanation: Using upper (U) and lower (L) envelopes of Y within a window around each i, this sum gives a fast-to-compute lower bound on DTW. If the bound already exceeds the current best distance, we can prune Y without full DTW.

Complexity Analysis

Let n and m be the lengths of the two sequences. The vanilla DTW algorithm computes a dynamic programming table with (n+1)Ɨ(m+1) entries. Each entry requires O(1) work: compute the local cost and take a minimum of up to three predecessors. Therefore, the time complexity is O(nm). The space complexity is also O(nm) if we store the full matrix for later backtracking to reconstruct the optimal path. When only the distance is needed (no path), space can be reduced to O(m) by keeping only the current and previous rows (or O(n) if iterating over columns). This works because the recurrence for D(i,j) depends only on D(iāˆ’1,j), D(i,jāˆ’1), and D(iāˆ’1,jāˆ’1).

With global constraints like the Sakoe–Chiba band of radius r, we compute only cells with |iāˆ’j| ≤ r. In that case, each row processes O(r) cells, reducing time to O(nr) while maintaining O(m) space with rolling arrays. Choosing r ≪ m yields a significant speedup, especially when sequences are roughly aligned near the diagonal.

In nearest-neighbor search across a database of size N, naive evaluation is O(N n m). Practical systems use lower bounds (e.g., LB_Keogh) to prune most candidates, early abandoning when partial costs exceed the best-so-far, and multi-resolution or downsampling strategies. For multivariate DTW with dimensionality p, the local cost evaluation adds an O(p) factor, leading to O(pnm) time. Note that square roots in the Euclidean local cost are commonly dropped in practice: squared distances are cheaper and serve equally well as a dissimilarity, although the optimal path they induce is not guaranteed to match the path under true Euclidean costs. Overall, DTW trades flexibility in time alignment for quadratic worst-case complexity, which is often mitigated by constraints, pruning, and careful implementation choices.

Code Examples

Basic DTW distance with warping path reconstruction (univariate)
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
#include <utility>

// Compute DTW distance and recover the optimal warping path.
// Uses full O(n*m) memory to enable backtracking.
// Local cost: absolute difference |x - y|.

typedef std::pair<int,int> Cell; // (i,j) indices in 0-based original series

struct DTWResult {
    double distance;        // DTW distance (total cost)
    std::vector<Cell> path; // Optimal warping path as pairs of 0-based indices
};

DTWResult dtw_with_path(const std::vector<double>& X, const std::vector<double>& Y) {
    const int n = (int)X.size();
    const int m = (int)Y.size();
    const double INF = std::numeric_limits<double>::infinity();

    // D is (n+1) x (m+1); D[0][0]=0; first row/col are +inf to enforce boundaries
    std::vector<std::vector<double>> D(n+1, std::vector<double>(m+1, INF));
    D[0][0] = 0.0;

    // Fill DP table
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            double cost = std::fabs(X[i-1] - Y[j-1]); // local cost
            double prev_min = std::min({ D[i-1][j], D[i][j-1], D[i-1][j-1] });
            D[i][j] = cost + prev_min;
        }
    }

    // Backtrack from (n,m) to (1,1). Record the current cell before stepping
    // to its predecessor so the last aligned pair (n,m) is included. The +inf
    // boundaries guarantee the walk never leaves i >= 1, j >= 1 until (1,1).
    std::vector<Cell> path;
    int i = n, j = m;
    while (i >= 1 && j >= 1) {
        // Map DP indices (1..n, 1..m) to 0-based series indices
        path.emplace_back(i-1, j-1);
        if (i == 1 && j == 1) break;
        double a = D[i-1][j];   // up
        double b = D[i][j-1];   // left
        double c = D[i-1][j-1]; // diagonal
        if (c <= a && c <= b) { --i; --j; }
        else if (a <= b)      { --i; }
        else                  { --j; }
    }

    // The path was built backwards; reverse it
    std::reverse(path.begin(), path.end());

    return DTWResult{ D[n][m], path };
}

int main() {
    // Example sequences
    std::vector<double> X = {1, 2, 3, 3, 2, 0};
    std::vector<double> Y = {0, 1, 1, 2, 3, 2, 1, 0};

    DTWResult res = dtw_with_path(X, Y);

    std::cout << "DTW distance: " << res.distance << "\n";
    std::cout << "Warping path (pairs of indices i,j):\n";
    for (const auto& cell : res.path) {
        std::cout << "(" << cell.first << ", " << cell.second << ") ";
    }
    std::cout << "\n";

    return 0;
}

This program computes the classic DTW distance between two univariate sequences using a full dynamic programming matrix. It sets up boundary conditions with +āˆž on the top row and left column (except D[0][0]=0), fills the matrix using the DTW recurrence, and then backtracks from (n,m) to (1,1) to retrieve the optimal warping path. The local cost is the absolute difference between aligned points. The output includes both the final distance and the sequence of aligned index pairs.

Time: O(nm) Ā· Space: O(nm)
Space-efficient DTW with Sakoe–Chiba band (univariate)
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>

// Compute DTW distance using a Sakoe–Chiba band of radius r.
// Uses rolling arrays to achieve O(m) space. No path reconstruction.

double dtw_band_distance(const std::vector<double>& X, const std::vector<double>& Y, int r) {
    const int n = (int)X.size();
    const int m = (int)Y.size();
    const double INF = std::numeric_limits<double>::infinity();

    if (n == 0 && m == 0) return 0.0;
    if (n == 0 || m == 0) return INF; // cannot align empty with non-empty under classic DTW

    // Ensure r is at least the length difference, otherwise (n,m) may be unreachable.
    r = std::max(r, std::abs(n - m));

    std::vector<double> prev(m + 1, INF), curr(m + 1, INF);
    prev[0] = 0.0; // D(0,0)=0; prev[j>0]=INF

    for (int i = 1; i <= n; ++i) {
        std::fill(curr.begin(), curr.end(), INF);
        int j_start = std::max(1, i - r);
        int j_end   = std::min(m, i + r);

        for (int j = j_start; j <= j_end; ++j) {
            double cost = std::fabs(X[i-1] - Y[j-1]);
            // curr[j-1] is D(i, j-1); prev[j] is D(i-1, j); prev[j-1] is D(i-1, j-1)
            double best_prev = std::min({ prev[j], curr[j-1], prev[j-1] });
            curr[j] = cost + best_prev;
        }
        std::swap(prev, curr);
    }

    return prev[m]; // D(n,m)
}

int main() {
    std::vector<double> X = {1, 3, 4, 9, 8, 2, 1, 5, 7};
    std::vector<double> Y = {1, 4, 7, 7, 3, 2, 3, 6};

    int r = 2; // window radius
    double dist = dtw_band_distance(X, Y, r);

    if (std::isinf(dist)) {
        std::cout << "Alignment not possible within the given band." << "\n";
    } else {
        std::cout << "Band-constrained DTW distance (r=" << r << "): " << dist << "\n";
    }

    return 0;
}

This code computes DTW using a Sakoe–Chiba window to limit warping to cells with |iāˆ’j| ≤ r. It maintains only the current and previous rows, reducing memory to O(m). If r is too small to reach (n,m), the code increases r to at least |nāˆ’m| to keep the end cell reachable. This approach is faster and more memory-efficient when the true alignment stays near the diagonal.

Time: O(nr) Ā· Space: O(m)
Multivariate DTW with squared Euclidean local cost
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
#include <stdexcept>

// Multivariate DTW where each time step is a vector<double> of dimension p.
// Uses squared Euclidean distance as local cost (avoids sqrt for speed).

static inline double sq_euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const size_t p = a.size();
    for (size_t k = 0; k < p; ++k) {
        double d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}

double dtw_multivariate_distance(const std::vector<std::vector<double>>& X,
                                 const std::vector<std::vector<double>>& Y) {
    const int n = (int)X.size();
    const int m = (int)Y.size();
    const double INF = std::numeric_limits<double>::infinity();

    if (n == 0 && m == 0) return 0.0;
    if (n == 0 || m == 0) return INF;

    // Check consistent dimensionality
    if (X[0].size() != Y[0].size()) {
        throw std::invalid_argument("X and Y must have the same dimensionality per time step");
    }

    std::vector<std::vector<double>> D(n+1, std::vector<double>(m+1, INF));
    D[0][0] = 0.0;

    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            double cost = sq_euclidean(X[i-1], Y[j-1]);
            double prev_min = std::min({ D[i-1][j], D[i][j-1], D[i-1][j-1] });
            D[i][j] = cost + prev_min;
        }
    }

    return D[n][m];
}

int main() {
    // Two 3D accelerometer sequences (toy example)
    std::vector<std::vector<double>> X = {
        {0.1, 0.2, 0.0}, {0.2, 0.1, 0.1}, {0.5, 0.4, 0.2}, {0.9, 0.7, 0.3}
    };
    std::vector<std::vector<double>> Y = {
        {0.0, 0.1, 0.0}, {0.3, 0.2, 0.1}, {0.6, 0.5, 0.3}
    };

    try {
        double dist = dtw_multivariate_distance(X, Y);
        std::cout << "Multivariate DTW (squared cost) distance: " << dist << "\n";
    } catch (const std::exception& ex) {
        std::cerr << "Error: " << ex.what() << "\n";
    }

    return 0;
}

This example extends DTW to multivariate time series by defining the local cost as the squared Euclidean distance between vectors at each time step. The dynamic programming is identical to the univariate case, but each cell’s local cost now aggregates differences across dimensions. Using squared distances avoids square roots and improves efficiency; the result remains a valid dissimilarity, though the optimal path under squared costs can differ from the path under true Euclidean costs.

Time: O(pnm) Ā· Space: O(nm)