Short-Time Fourier Transform (STFT)
Key Points
- •The Short-Time Fourier Transform (STFT) breaks a signal into small overlapping windows and computes a Fourier transform on each window to reveal how frequencies evolve over time.
- •You control time–frequency trade-offs with the window length: longer windows give better frequency resolution but blur timing, and shorter windows do the opposite.
- •A window function (like the Hann window) reduces spectral leakage by tapering the edges of each frame before the FFT.
- •The hop size determines how much adjacent frames overlap and thus affects redundancy, reconstruction quality, and computational cost.
- •A spectrogram is the magnitude (often log-scaled) of the STFT across time and frequency bins.
- •The inverse STFT uses overlap-add with proper normalization (often with window-squared sums) to reconstruct the original signal.
- •Efficient STFT implementations rely on the FFT, giving roughly O(M K log K) time where M is the number of frames and K is the FFT size.
- •Using consistent windowing, hop size, and normalization avoids common reconstruction errors such as gain loss, gain boost, or audible artifacts.
Prerequisites
- →Discrete Fourier Transform (DFT) and FFT — STFT uses the FFT per frame to compute spectra; understanding DFT bins, magnitude, and phase is essential.
- →Complex numbers and Euler's formula — STFT coefficients are complex and rely on e^{j\theta} to represent sinusoids.
- →Sampling theory — Relates discrete indices to physical frequencies and avoids aliasing by choosing K and Fs appropriately.
- →Windowing and spectral leakage — Explains why and how windows reduce leakage and how they affect resolution.
- →Convolution and overlap-add — Inverse STFT relies on overlap-add concepts and linearity to reconstruct signals.
Detailed Explanation
Tap terms for definitions01Overview
The Short-Time Fourier Transform (STFT) is a technique to analyze how the frequency content of a signal changes over time. Instead of applying a single Fourier transform to the entire signal—which would only reveal which frequencies are present overall—the STFT slides a window across the signal, taking short snapshots (frames). For each frame, it multiplies the samples by a window function to reduce edge effects, then computes a Fourier transform. The result is a set of complex spectra indexed by time (frame index) and frequency bin, producing a time–frequency representation known as a spectrogram when magnitudes are visualized. The STFT is fundamental in audio, speech, and non-stationary signal processing because most real-world signals change over time. By tuning parameters such as window length, hop size (step between frames), FFT size (often a power of two), and window type (e.g., Hann, Hamming), one can balance frequency resolution, time resolution, and computational cost. Proper synthesis (inverse STFT) with overlap-add and normalization can reconstruct the original signal from its STFT, enabling transformative applications like noise reduction, pitch shifting, and time–frequency masking.
02Intuition & Analogies
Imagine listening to a song and pausing it every tiny fraction of a second to ask: “What notes are sounding right now?” If you looked at the entire song at once with a regular Fourier transform, you’d only learn which notes appear somewhere in the whole track, not when. The STFT is like using a small, sliding listening window that you place over the track step by step. At each position, you analyze just that slice to find its current mix of notes. Stitching all those analyses together gives a moving picture—like a musical score over time—called a spectrogram. The window length is the size of your listening snapshot: a long snapshot lets you tell apart very close pitches (good frequency resolution) but blurs quick events (poor time resolution). A short snapshot makes fast changes crisp (good time resolution) but lumps together nearby pitches (poor frequency resolution). The window function is like gently fading the volume at the edges of each snapshot; this avoids sharp cuts that would create spurious frequencies (spectral leakage). The hop size is how far you move the window each time—small hops mean more overlap and smoother time tracking but more computation. To get back the original sound, you reverse the process: convert each small spectrum back to time samples, fade them with the same window, and overlap-add them. If your fades and overlaps are balanced, the slices recombine seamlessly, just like layering translucent photos to recreate the full scene.
03Formal Definition
04When to Use
Use STFT whenever you need to understand or manipulate how frequency content evolves over time in non-stationary signals. Typical applications include speech analysis (formant tracking, voice activity detection), music information retrieval (onset detection, pitch estimation, beat tracking), audio effects (time–stretching, pitch shifting, spectral morphing), noise reduction and speech enhancement (spectral subtraction, masks), bio-signals (EEG/ECG event localization), and mechanical diagnostics (time–frequency fault signatures). Choose STFT over a single global Fourier transform when stationarity does not hold over the entire signal. Prefer longer windows when discerning closely spaced frequencies matters (e.g., harmonic analysis in steady notes) and shorter windows when timing precision is critical (e.g., transients and percussive onsets). Use smaller hop sizes for smoother spectrograms and better reconstruction at the cost of more computation, and larger hop sizes for efficiency when coarse time resolution suffices. If very different time scales coexist (e.g., chirps and transients), consider multi-resolution methods (wavelets) or multi-window STFTs, but the classic STFT is often the first, most practical tool.
⚠️Common Mistakes
• Using a rectangular window implicitly (i.e., no taper) and then being surprised by spectral leakage—use tapered windows like Hann or Hamming unless you need the narrowest main lobe and accept sidelobes. • Mismatch between FFT size K and window length N without proper zero-padding or indexing, causing frame misalignment or aliasing; ensure K \ge N and pad the windowed frame to length K. • Incorrect overlap-add normalization in inverse STFT: forgetting to divide by the accumulated window-squared sum leads to amplitude gain or loss and audible pulsation; always track a denominator array of \sum w^2. • Choosing hop size H too large for the chosen window, breaking perfect reconstruction (e.g., Hann with H = N/2 is common; H = N can fail). • Ignoring phase: manipulating magnitudes only and then reusing original or zero phases can cause artifacts; when possible, preserve or properly estimate phase (e.g., Griffin–Lim for phase reconstruction from magnitudes). • Not handling boundaries: failing to pad the signal so that frames at the end are truncated can produce clicks or zeroed tails; use zero-padding or reflect-padding as appropriate. • Misinterpreting frequency bins: forgetting that for real-valued signals, bins above K/2 are conjugate-symmetric; often only K/2+1 unique bins are needed. • Using linear magnitude for visualization can hide quiet details; log-power spectrograms (in dB) are more interpretable.
Key Formulas
Discrete STFT
Explanation: This sums a windowed slice of the signal multiplied by complex sinusoids to get the spectrum at frame m and bin k. K is the FFT size, typically a power of two and at least N.
Inverse STFT (overlap-add)
Explanation: Each frame is inverted back to time domain, re-windowed, and overlap-added. Dividing by the accumulated window-squared sum removes the effect of windowing to recover the original amplitude.
Hann window
Explanation: The Hann window smoothly tapers frame edges to reduce leakage. It supports perfect reconstruction with hop sizes like H = N/2 under standard overlap-add normalization.
Frequency bin mapping
Explanation: This maps discrete bin index k to physical frequency in Hertz given sampling rate . Adjacent bins are spaced by /K Hz.
Number of frames
Explanation: For a signal of length L, window length N, and hop H, this estimates how many frames are computed. Zero-padding can adjust the effective M near boundaries.
Time complexity
Explanation: Each frame performs an FFT of size K, giving K log K per frame. Multiplying by M frames yields the total time complexity.
Time–frequency uncertainty
Explanation: This inequality states a lower bound on the product of time and frequency spreads. It formalizes the trade-off between time and frequency resolution.
Log-magnitude spectrogram
Explanation: Converts linear magnitudes to decibels for better dynamic range visualization. A small epsilon avoids log of zero.
Complexity Analysis
Code Examples
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 // Iterative in-place radix-2 Cooley-Tukey FFT 5 // If invert=false: computes FFT; if invert=true: computes inverse FFT (scaled by 1) 6 // After inverse, we divide by n externally to get true inverse DFT. 7 void fft(vector<complex<double>>& a, bool invert) { 8 int n = (int)a.size(); 9 static vector<int> rev; 10 static vector<complex<double>> roots{0, 1}; 11 12 if ((int)rev.size() != n) { 13 int k = __builtin_ctz(n); 14 rev.assign(n, 0); 15 for (int i = 0; i < n; i++) { 16 rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (k - 1)); 17 } 18 } 19 if ((int)roots.size() < n) { 20 int k = __builtin_ctz(roots.size()); 21 roots.resize(n); 22 while ((1 << k) < n) { 23 double ang = 2 * M_PI / (1 << (k + 1)); 24 for (int i = 1 << (k - 1); i < (1 << k); i++) { 25 roots[2 * i] = roots[i]; 26 double angle = ang * (2 * i + 1 - (1 << k)); 27 roots[2 * i + 1] = complex<double>(cos(angle), sin(angle)); 28 } 29 k++; 30 } 31 } 32 33 for (int i = 0; i < n; i++) if (i < rev[i]) swap(a[i], a[rev[i]]); 34 35 for (int len = 1; len < n; len <<= 1) { 36 for (int i = 0; i < n; i += 2 * len) { 37 for (int j = 0; j < len; j++) { 38 complex<double> u = a[i + j]; 39 complex<double> v = a[i + j + len] * roots[len + j]; 40 a[i + j] = u + v; 41 a[i + j + len] = u - v; 42 } 43 } 44 } 45 if (invert) { 46 // Conjugate, run forward FFT, then conjugate and scale outside (common trick) 47 for (auto &x : a) x = conj(x); 48 fft(a, false); 49 for (auto &x : a) x = conj(x) / (double)n; 50 } 51 } 52 53 vector<double> hann_window(int N) { 54 vector<double> w(N); 55 if (N == 1) { w[0] = 1.0; return w; } 56 for (int n = 0; n < N; ++n) { 57 w[n] = 0.5 - 0.5 * cos(2.0 * M_PI * n / (N - 1)); 58 } 59 return w; 60 } 61 62 // Compute magnitude spectrogram |X[m,k]| for k=0..K/2 63 vector<vector<double>> stft_magnitude(const vector<double>& x, int N, int H, int K) { 64 int L = (int)x.size(); 65 if ((K & (K - 1)) != 0) throw runtime_error("K must be power of two"); 66 if (N > K) throw runtime_error("Window length N must be <= K"); 67 68 vector<double> w = hann_window(N); 69 int M = (L <= N) ? 1 : (int)ceil((double)(L - N) / H) + 1; 70 vector<vector<double>> S; // M x (K/2+1) 71 S.reserve(M); 72 73 vector<complex<double>> frame(K); 74 for (int m = 0; m < M; ++m) { 75 int start = m * H; 76 // Build windowed frame with zero-padding to K 77 for (int n = 0; n < K; ++n) { 78 double sample = 0.0; 79 if (n < N) { 80 int idx = start + n; 81 if (0 <= idx && idx < L) sample = x[idx]; 82 sample *= w[n]; 83 } 84 frame[n] = complex<double>(sample, 0.0); 85 } 86 // Forward FFT 87 fft(frame, false); 88 // Magnitude for unique bins 0..K/2 89 vector<double> mag(K / 2 + 1); 90 for (int k = 0; k <= K / 2; ++k) mag[k] = abs(frame[k]); 91 S.push_back(move(mag)); 92 } 93 return S; 94 } 95 96 int main() { 97 // Example: analyze a 1-second 440 Hz + 880 Hz tone at Fs=8000 Hz 98 int Fs = 8000; 99 int L = Fs * 1; // 1 second 100 vector<double> x(L); 101 for (int n = 0; n < L; ++n) { 102 x[n] = 0.7 * sin(2 * M_PI * 440 * n / Fs) + 0.3 * sin(2 * M_PI * 880 * n / Fs); 103 } 104 105 int N = 1024; // window length 106 int H = 512; // hop size (50% overlap) 107 int K = 1024; // FFT size 108 109 auto S = stft_magnitude(x, N, H, K); 110 111 // Print a small portion of the spectrogram: first 5 frames, first 10 bins 112 cout << fixed << setprecision(3); 113 for (int m = 0; m < min((int)S.size(), 5); ++m) { 114 cout << "Frame " << m << ": "; 115 for (int k = 0; k < min((int)S[m].size(), 10); ++k) { 116 cout << S[m][k] << (k + 1 < 10 ? ", " : "\n"); 117 } 118 } 119 return 0; 120 } 121
This program computes an STFT magnitude spectrogram using a Hann window and an in-place radix-2 FFT. It pads each windowed frame to size K, runs the FFT, and keeps only the unique bins (0..K/2) because the input is real. The printed values give a glimpse of the spectrogram; in practice, you would visualize S as an image, often on a log scale.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 void fft(vector<complex<double>>& a, bool invert) { 5 int n = (int)a.size(); 6 static vector<int> rev; 7 static vector<complex<double>> roots{0, 1}; 8 if ((int)rev.size() != n) { 9 int k = __builtin_ctz(n); 10 rev.assign(n, 0); 11 for (int i = 0; i < n; i++) rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (k - 1)); 12 } 13 if ((int)roots.size() < n) { 14 int k = __builtin_ctz(roots.size()); 15 roots.resize(n); 16 while ((1 << k) < n) { 17 double ang = 2 * M_PI / (1 << (k + 1)); 18 for (int i = 1 << (k - 1); i < (1 << k); i++) { 19 roots[2 * i] = roots[i]; 20 double angle = ang * (2 * i + 1 - (1 << k)); 21 roots[2 * i + 1] = complex<double>(cos(angle), sin(angle)); 22 } 23 k++; 24 } 25 } 26 for (int i = 0; i < n; i++) if (i < rev[i]) swap(a[i], a[rev[i]]); 27 for (int len = 1; len < n; len <<= 1) { 28 for (int i = 0; i < n; i += 2 * len) { 29 for (int j = 0; j < len; j++) { 30 complex<double> u = a[i + j]; 31 complex<double> v = a[i + j + len] * roots[len + j]; 32 a[i + j] = u + v; 33 a[i + j + len] = u - v; 34 } 35 } 36 } 37 if (invert) { 38 for (auto &x : a) x = conj(x); 39 fft(a, false); 40 for (auto &x : a) x = conj(x) / (double)n; 41 } 42 } 43 44 vector<double> hann_window(int N) { 45 vector<double> w(N); 46 if (N == 1) { w[0] = 1.0; return w; } 47 for (int n = 0; n < N; ++n) w[n] = 0.5 - 0.5 * cos(2.0 * M_PI * n / (N - 1)); 48 return w; 49 } 50 51 // Build a test STFT by forward transform, then reconstruct via ISTFT 52 int main() { 53 int Fs = 16000; 54 int L = Fs * 1; // 1 second 55 vector<double> x(L); 56 for (int n = 0; n < L; ++n) { 57 // A chirp from 400 Hz to 2000 Hz 58 double t = (double)n / Fs; 59 double f = 400.0 + 1600.0 * t; 60 x[n] = 0.8 * sin(2 * M_PI * f * t); 61 } 62 63 int N = 1024, H = 256, K = 1024; 64 if ((K & (K - 1)) != 0) throw runtime_error("K must be power of two"); 65 vector<double> w = hann_window(N); 66 67 // Forward STFT: store full complex spectrum per frame 68 int M = (L <= N) ? 1 : (int)ceil((double)(L - N) / H) + 1; 69 vector<vector<complex<double>>> X(M, vector<complex<double>>(K)); 70 vector<complex<double>> buf(K); 71 for (int m = 0; m < M; ++m) { 72 int start = m * H; 73 for (int n = 0; n < K; ++n) { 74 double s = 0.0; 75 if (n < N) { 76 int idx = start + n; 77 if (0 <= idx && idx < L) s = x[idx]; 78 s *= w[n]; 79 } 80 buf[n] = complex<double>(s, 0.0); 81 } 82 fft(buf, false); 83 X[m] = buf; // store complex spectrum 84 } 85 86 // Inverse STFT via overlap-add with window-squared normalization 87 vector<double> y(L + K, 0.0); // room for tail 88 vector<double> weight(L + K, 0.0); // accumulate w^2 89 90 for (int m = 0; m < M; ++m) { 91 // IFFT of frame m 92 buf = X[m]; 93 fft(buf, true); // now buf[n] is time-domain frame of length K 94 for (int n = 0; n < N; ++n) { 95 int idx = m * H + n; 96 double val = buf[n].real(); 97 double win = w[n]; 98 y[idx] += val * win; // re-window on synthesis 99 weight[idx] += win * win; // accumulate window-squared 100 } 101 } 102 103 // Normalize to undo windowing and overlap 104 for (size_t i = 0; i < y.size(); ++i) { 105 if (weight[i] > 1e-12) y[i] /= weight[i]; 106 } 107 108 // Trim to original length 109 y.resize(L); 110 111 // Report reconstruction error (RMS) 112 double err2 = 0.0, x2 = 0.0; 113 for (int i = 0; i < L; ++i) { 114 err2 += (y[i] - x[i]) * (y[i] - x[i]); 115 x2 += x[i] * x[i]; 116 } 117 double rms = sqrt(err2 / L); 118 double snr = 10 * log10(x2 / max(1e-20, err2)); 119 cout << fixed << setprecision(6); 120 cout << "RMS error: " << rms << "\nSNR (dB): " << snr << "\n"; 121 return 0; 122 } 123
This program builds an STFT (complex spectra per frame) and reconstructs the signal using inverse FFT per frame, windowing, and overlap-add. It accumulates the window-squared denominator to normalize the result, ensuring proper gain. The final RMS error and SNR indicate near-perfect reconstruction when parameters are consistent.
1 #include <bits/stdc++.h> 2 using namespace std; 3 4 void fft(vector<complex<double>>& a, bool invert) { 5 int n = (int)a.size(); 6 static vector<int> rev; 7 static vector<complex<double>> roots{0, 1}; 8 if ((int)rev.size() != n) { 9 int k = __builtin_ctz(n); 10 rev.assign(n, 0); 11 for (int i = 0; i < n; i++) rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (k - 1)); 12 } 13 if ((int)roots.size() < n) { 14 int k = __builtin_ctz(roots.size()); 15 roots.resize(n); 16 while ((1 << k) < n) { 17 double ang = 2 * M_PI / (1 << (k + 1)); 18 for (int i = 1 << (k - 1); i < (1 << k); i++) { 19 roots[2 * i] = roots[i]; 20 double angle = ang * (2 * i + 1 - (1 << k)); 21 roots[2 * i + 1] = complex<double>(cos(angle), sin(angle)); 22 } 23 k++; 24 } 25 } 26 for (int i = 0; i < n; i++) if (i < rev[i]) swap(a[i], a[rev[i]]); 27 for (int len = 1; len < n; len <<= 1) { 28 for (int i = 0; i < n; i += 2 * len) { 29 for (int j = 0; j < len; j++) { 30 complex<double> u = a[i + j]; 31 complex<double> v = a[i + j + len] * roots[len + j]; 32 a[i + j] = u + v; 33 a[i + j + len] = u - v; 34 } 35 } 36 } 37 if (invert) { 38 for (auto &x : a) x = conj(x); 39 fft(a, false); 40 for (auto &x : a) x = conj(x) / (double)n; 41 } 42 } 43 44 vector<double> hann_window(int N) { 45 vector<double> w(N); 46 if (N == 1) { w[0] = 1.0; return w; } 47 for (int n = 0; n < N; ++n) w[n] = 0.5 - 0.5 * cos(2.0 * M_PI * n / (N - 1)); 48 return w; 49 } 50 51 int main() { 52 int N = 512, H = 128, K = 512; 53 vector<double> w = hann_window(N); 54 55 // Simulated input stream: a 1 kHz tone at Fs=8000 Hz 56 int Fs = 8000, L = Fs / 2; // 0.5 s 57 vector<double> input(L); 58 for (int n = 0; n < L; ++n) input[n] = sin(2 * M_PI * 1000 * n / Fs); 59 60 // Ring buffer for latest N samples 61 vector<double> ring(N, 0.0); 62 int head = 0; // next position to write 63 int samples_until_frame = H; // trigger STFT every H samples 64 65 vector<complex<double>> frame(K); 66 67 for (int n = 0; n < L; ++n) { 68 // Push sample into ring buffer 69 ring[head] = input[n]; 70 head = (head + 1) % N; 71 samples_until_frame--; 72 73 if (samples_until_frame == 0) { 74 // Build frame in correct time order, apply window, zero-pad (K=N here) 75 for (int i = 0; i < N; ++i) { 76 int idx = (head + i) % N; // oldest-to-newest 77 double s = ring[idx] * w[i]; 78 frame[i] = complex<double>(s, 0.0); 79 } 80 for (int i = N; i < K; ++i) frame[i] = 0.0; 81 // FFT 82 fft(frame, false); 83 // Example: print energy in 1 kHz bin (nearest bin index) 84 int k = int(round(1000.0 * K / Fs)); 85 double mag = abs(frame[k]); 86 cout << fixed << setprecision(3) << "Frame energy near 1 kHz: " << mag << "\n"; 87 samples_until_frame = H; // reset for next frame 88 } 89 } 90 return 0; 91 } 92
This example simulates real-time STFT by maintaining a ring buffer of the most recent N samples and triggering an FFT every H samples. It applies a Hann window and computes the FFT without storing the full spectrogram, then prints the magnitude near a target frequency, which is typical in streaming analyzers.