📚 Theory · Intermediate

Temporal Convolutions

Key Points

  • Temporal (causal) convolution computes each output at time t using only the current and past inputs, ensuring no future information leakage.
  • It is equivalent to applying a finite impulse response (FIR) filter with nonnegative delays to a discrete-time sequence.
  • Dilated temporal convolutions expand the receptive field without increasing parameters by skipping evenly spaced past samples.
  • Padding choices (left-only for causal 'same', none for 'valid') control output length and boundary behavior.
  • Naive causal convolution runs in O(nk) time for sequence length n and kernel size k, while FFT-based convolution runs in O(N log N).
  • Deep stacks of causal convolutions create large receptive fields useful for sequence modeling and forecasting.
  • Frameworks often implement cross-correlation; flipping or interpreting kernel indices carefully preserves causality.
  • Avoid circular convolution and right padding to keep strict causality and prevent future leakage.

Prerequisites

  • →Arrays and Indexing — Temporal convolution is implemented by sliding and indexing into arrays representing time series.
  • →Big-O Notation — Understanding trade-offs between O(nk) direct methods and O(N log N) FFT-based methods is essential.
  • →Discrete-Time Signals — Temporal convolution assumes discrete-time sampling and standard boundary conventions.
  • →Fourier Transform and FFT — FFT-based convolution relies on the convolution theorem and efficient DFT computation.
  • →Linear Algebra Basics — Viewing convolution as a Toeplitz matrix-vector product clarifies properties and proofs.

Detailed Explanation


01Overview

Temporal convolution is a way to process sequential data—like audio, sensor readings, or time series—by sliding a small weighted window (the kernel) over time and forming weighted sums of recent values. A causal temporal convolution enforces a strict rule: when producing the output at time t, it only uses inputs from times t, t−1, t−2, ... and never from the future (t+1, t+2, ...). This is crucial in forecasting and real-time systems where future data is unknown and using it would leak information. In signal processing terms, a causal temporal convolution is exactly a finite impulse response (FIR) filter with nonnegative delays. In deep learning, temporal convolutional networks (TCNs) stack many such layers—often with dilation (skipping samples) to increase the effective receptive field—so the model can capture long-range dependencies without resorting to recurrent connections.

By choosing padding and alignment carefully, we can match output length to input length while preserving causality. For example, 'causal same' padding zero-fills only on the left, so early outputs rely on fewer past values and remain causal. Implementation-wise, a direct sliding-window loop is simple and costs O(nk), while FFT-based convolution accelerates long filters to O(N log N) but must be set up to perform linear (not circular) convolution. Correct indexing—especially the kernel orientation and the handling of dilation and stride—is essential to avoid subtle bugs and future leakage.

02Intuition & Analogies

Imagine you're watching a live sports game and trying to predict the next play. At any moment t, you can only use what you've already seen (the plays up to time t). You can't peek at what will happen next. A causal temporal convolution behaves the same way: it forms a summary of recent history using a small recipe (the kernel) that says how much to weigh the current frame and the previous frames.

Think of the kernel as a short to-do list for combining the last few moments. For a simple moving average of the last 3 seconds, your list might be [1/3, 1/3, 1/3]. To predict at time t, you multiply the current value by 1/3, the previous value by 1/3, and the one before that by 1/3, then add them up. That’s a causal convolution with no future leakage because the list never asks for a value from time t+1.

Dilated convolution is like scanning your memory at coarser granularity: instead of looking at every second, you might look at the current second, two seconds ago, four seconds ago, and so on. This quickly extends your reach into the past without making your list longer. Stacking such layers is like keeping multiple notepads: one for very recent details, another for medium-term patterns, and another for long-term trends. The resulting system can react instantly (uses current input) and still respect the time arrow (uses only past), which is exactly what we need in real-time control, streaming analytics, and honest forecasting.

03Formal Definition

Let x[t] be a discrete-time sequence indexed by integers t, and let w[i], i = 0, 1, ..., k−1, be a finite kernel supported on nonnegative indices only. The causal discrete convolution y[t] of x with w is defined by y[t] = ∑_{i=0}^{k−1} w[i] x[t−i], where we adopt the zero-padding convention x[t] = 0 for t < 0. This is an FIR filter with impulse response h[i] = w[i] for i ≥ 0 and h[i] = 0 for i < 0, guaranteeing causality. A dilated causal convolution with dilation d ∈ ℕ generalizes this to y[t] = ∑_{i=0}^{k−1} w[i] x[t−d·i], so that the effective receptive field spans R = 1 + (k−1)d past steps. For multi-layer stacks with layer-dependent kernel sizes k_l, dilations d_l, and strides s_l, the receptive field is R = 1 + ∑_{l=1}^{L} (k_l − 1) d_l ∏_{j=1}^{l−1} s_j, under standard zero-padding conventions at the boundaries. In matrix form, causal convolution can be written as y = T(x) w, where T(x) is a (Toeplitz) matrix whose t-th row contains [x[t], x[t−1], ..., x[t−k+1]] with out-of-range entries set to zero. Computing by direct summation costs O(nk) for n outputs; via FFT-based linear convolution it costs O(N log N) for N at least n + k − 1 (rounded up to a convenient FFT size).

04When to Use

Use causal temporal convolutions whenever strict time ordering matters and you must avoid future leakage:

  • Time-series forecasting: predict tomorrow’s stock price, weather, or demand using only current/past observations.
  • Real-time audio and signal processing: denoising, equalization, or onset detection where latency and causality are critical.
  • Control and robotics: filtering sensor streams to estimate state without peeking into the future.
  • Event stream analytics: online anomaly detection over telemetry logs.
  • NLP and sequence modeling: character/word-level modeling with Temporal Convolutional Networks as an alternative to RNNs/Transformers, especially when low latency and parallelizable training are desirable.

Favor dilated convolutions when you need a large receptive field but want to keep parameter count and per-step compute modest. Choose naive O(nk) implementations for small kernels (e.g., k ≤ 64) and short sequences, and FFT-based convolution for long kernels or long sequences where O(N log N) wins. In deep models, ensure the framework's padding is left-only (causal 'same') and confirm whether the operation is cross-correlation or convolution to keep the indexing consistent.

⚠️Common Mistakes

  • Future leakage through padding or indexing: Symmetric 'same' padding (split across left and right) makes outputs depend on future samples, and on future zeros near the right boundary. For causal 'same', pad only on the left by k−1.
  • Confusing convolution with cross-correlation: Many deep learning libraries implement cross-correlation (no kernel flip). If your math assumes convolution, either reverse the kernel or consistently adopt the cross-correlation definition; otherwise, learned filters shift or invert.
  • Circular instead of linear convolution: FFT without adequate zero-padding computes circular convolution, which wraps future samples into the past, breaking causality. Always pad to at least n + k − 1 and take the correct output slice.
  • Off-by-one in receptive field: Forgetting that a k-length kernel covers exactly k positions yields wrong receptive field and dilation calculations. Remember R = 1 + (k−1)d for single-layer, stride-1.
  • Misinterpreting dilation: Dilation spaces the taps; it does not change the number of parameters. Ensure indices use t − d·i, not t − i/d or similar mistakes.
  • Ignoring boundary effects: Early outputs have fewer valid past samples; if you don’t zero-pad (valid mode), output is shorter. Be explicit about 'valid', 'same (causal)', or 'full' conventions.
  • Numerical scaling in FFT: Forgetting to divide by N after inverse FFT or mixing normalization conventions distorts amplitudes.
  • Performance pitfalls: For small k, FFT overhead dominates; for large k, naive O(nk) is too slow. Choose the method based on n and k; consider block methods (overlap-add/save) for streaming.

Key Formulas

Causal Discrete Convolution (FIR)

y[t] = ∑_{i=0}^{k−1} w[i] x[t−i]

Explanation: Each output uses the current and past k−1 inputs weighted by the kernel. This guarantees no future indices appear, so the operation is causal.

Dilated Causal Convolution

y[t] = ∑_{i=0}^{k−1} w[i] x[t−d·i]

Explanation: Dilation d spaces the taps by d steps, enlarging the receptive field while keeping the same number of parameters. Use this to capture longer histories efficiently.

Single-Layer Receptive Field

R = 1 + (k−1)d

Explanation: The number of past steps that can influence y[t] grows linearly with kernel size and dilation. This tells you how far into the past a single layer can see.

Stacked Layers Receptive Field

R = 1 + ∑_{l=1}^{L} (k_l − 1) d_l ∏_{j=1}^{l−1} s_j

Explanation: For L layers with kernel sizes k_l, dilations d_l, and strides s_l, each layer adds (k_l − 1) d_l steps of reach, scaled by the product of the strides of the preceding layers.

Linear Convolution Length

n_out = n_in + k − 1

Explanation: The full linear convolution of sequences of lengths n_in and k has length n_in + k − 1. Use this when sizing FFTs and slicing outputs.

Causal 'Same' Padding

p_left = k − 1, p_right = 0

Explanation: To keep output the same length as input without future leakage, pad k−1 zeros only on the left. Right padding would allow looking into the future.

Convolution Theorem (DFT)

DFT(x∗h)=DFT(x)⊙DFT(h)

Explanation: The DFT of the linear convolution equals the elementwise product of the DFTs. This enables O(N log N) convolution using FFTs when properly zero-padded.

Toeplitz Matrix Form

y = T(x) w, where T_{t,i}(x) = x[t−i] if t−i ≥ 0, and 0 otherwise

Explanation: Causal convolution can be written as a matrix-vector product with a Toeplitz matrix built from shifted copies of x. This view helps in proofs and optimization.

Time Complexity

T_direct(n, k) = O(nk), T_FFT(N) = O(N log N)

Explanation: Direct summation costs O(nk) for n outputs and kernel size k. FFT-based methods cost O(N log N) where N is the padded size (≈ n + k − 1).

Complexity Analysis

For a sequence of length n and kernel size k, the naive causal convolution computes each output y[t] by summing up to k products, giving O(nk) time and O(1) extra space beyond inputs and outputs. With dilation d, you still perform k multiplies per output, so the time remains O(nk). This is highly efficient for small k (e.g., k ≤ 64) because of good cache locality and low constant factors.

FFT-based convolution computes a linear convolution by zero-padding both sequences to length N ≥ n + k − 1 (often the next power of two), taking two FFTs (O(N log N)), multiplying pointwise (O(N)), and taking an inverse FFT (O(N log N)). The overall time is O(N log N), with O(N) extra memory for the frequency-domain buffers. This approach pays off when k (or n) is large, because O(N log N) can be much smaller than O(nk). However, the constant factors are higher, and floating-point roundoff introduces small numerical errors; careful normalization after the inverse FFT is required.

In streaming scenarios, a direct per-sample update maintains a circular buffer of the last k inputs and computes an O(k) dot product upon each new sample, for O(k) time and O(k) space per step. To reduce per-sample cost for very long k, block FFT methods (overlap-add or overlap-save) achieve amortized O(log k) per output with block-dependent constant factors, while preserving causality by ensuring linear (not circular) convolution over each block.

Memory-wise, the naive and streaming FIR forms store O(k) weights and a small buffer. FFT methods store O(N) complex arrays; N can be significantly larger than n or k due to padding to convenient FFT sizes.

Code Examples

Naive causal 1D temporal convolution (with optional dilation)
#include <bits/stdc++.h>
using namespace std;

// Compute causal convolution: y[t] = sum_{i=0}^{k-1} w[i] * x[t - d*i]
// Zero-padding on the left: x[t] = 0 for t < 0
vector<double> causal_convolve_naive(const vector<double>& x, const vector<double>& w, int dilation = 1) {
    int n = (int)x.size();
    int k = (int)w.size();
    vector<double> y(n, 0.0);
    for (int t = 0; t < n; ++t) {
        // Accumulate contributions from current and past samples (no future indices)
        for (int i = 0; i < k; ++i) {
            int xi = t - i * dilation; // past index only
            if (xi < 0) break;         // beyond left boundary (due to zero-padding)
            y[t] += w[i] * x[xi];
        }
    }
    return y;
}

int main() {
    // Example: x is a short time series; w is a 3-tap causal kernel
    vector<double> x {1, 2, 3, 4, 5};
    vector<double> w {0.5, 0.3, 0.2}; // w[0]*x[t] + w[1]*x[t-1] + w[2]*x[t-2]

    vector<double> y = causal_convolve_naive(x, w, 1);

    // Print result (causal 'same' length)
    cout.setf(std::ios::fixed); cout << setprecision(3);
    for (double v : y) cout << v << ' ';
    cout << endl;

    // Dilated example: d=2 taps x[t], x[t-2], x[t-4]
    vector<double> y_dil = causal_convolve_naive(x, w, 2);
    for (double v : y_dil) cout << v << ' ';
    cout << endl;
    return 0;
}

This program implements causal temporal convolution using a direct double loop. For each time t, it forms a weighted sum of the current and past samples, never accessing x[t + r]. The optional dilation parameter spaces taps by d steps, expanding the receptive field without adding parameters. Left-only zero-padding ensures outputs are defined for early times without future leakage.

Time: O(nk) · Space: O(n) output plus O(1) extra

FFT-based linear convolution for causal 'same' output (no circular wraparound)
#include <bits/stdc++.h>
using namespace std;

using cd = complex<double>;
const double PI = acos(-1.0);

void fft(vector<cd>& a, bool invert) {
    int n = (int)a.size();
    // Bit-reversal permutation
    for (int i = 1, j = 0; i < n; ++i) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    for (int len = 2; len <= n; len <<= 1) {
        double ang = 2 * PI / len * (invert ? -1 : 1);
        cd wlen(cos(ang), sin(ang));
        for (int i = 0; i < n; i += len) {
            cd w(1);
            for (int j = 0; j < len / 2; ++j) {
                cd u = a[i + j];
                cd v = a[i + j + len / 2] * w;
                a[i + j] = u + v;
                a[i + j + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert) {
        for (int i = 0; i < n; ++i) a[i] /= n;
    }
}

// Linear convolution y_full = x (*) h (length n+m-1) via FFT
vector<double> linear_convolve_fft(const vector<double>& x, const vector<double>& h) {
    int n = (int)x.size();
    int m = (int)h.size();
    int N = 1;
    while (N < n + m - 1) N <<= 1; // pad to avoid circular wrap

    vector<cd> A(N), B(N);
    for (int i = 0; i < n; ++i) A[i] = cd(x[i], 0.0);
    for (int i = 0; i < m; ++i) B[i] = cd(h[i], 0.0); // h is causal, indexed from delay 0

    fft(A, false);
    fft(B, false);
    for (int i = 0; i < N; ++i) A[i] *= B[i];
    fft(A, true);

    vector<double> y(n + m - 1);
    for (int i = 0; i < (int)y.size(); ++i) y[i] = A[i].real();
    return y;
}

// Causal 'same' output: take the first n samples of the linear convolution
vector<double> causal_same_fft(const vector<double>& x, const vector<double>& w) {
    vector<double> y_full = linear_convolve_fft(x, w);
    int n = (int)x.size();
    vector<double> y(n, 0.0);
    for (int t = 0; t < n; ++t) y[t] = y_full[t];
    return y;
}

int main() {
    // Example sequences
    vector<double> x {1, 2, 3, 4, 5, 6, 7};
    vector<double> w {0.5, 0.3, 0.2, -0.1, 0.05}; // causal taps (delays 0..4)

    vector<double> y = causal_same_fft(x, w);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    for (double v : y) cout << v << ' ';
    cout << endl;
    return 0;
}

This program computes linear convolution via FFTs with sufficient zero-padding to avoid circular wraparound. Because the kernel is causal (nonnegative delays), the causal 'same' output equals the first n samples of the full linear convolution. This preserves strict causality while achieving O(N log N) complexity for long sequences/kernels.

Time: O(N log N) with N ≈ n + k − 1 (padded to a power of two) · Space: O(N)

Streaming causal FIR with a ring buffer (online, O(k) per sample)
#include <bits/stdc++.h>
using namespace std;

struct CausalFIR {
    vector<double> w;   // taps: w[0] for current sample, w[i] for i-step past
    vector<double> buf; // circular buffer of last k samples
    int k;
    int head;           // index to write the next sample
    CausalFIR(const vector<double>& taps) : w(taps), k((int)taps.size()), head(0) {
        buf.assign(k, 0.0);
    }
    // Process one new sample x_t and return y_t = sum_i w[i] * x[t - i]
    double process(double x_t) {
        buf[head] = x_t; // write current sample
        double acc = 0.0;
        // Accumulate from current and past samples; no future indices used
        for (int i = 0; i < k; ++i) {
            int idx = head - i;
            if (idx < 0) idx += k; // wrap in ring buffer
            acc += w[i] * buf[idx];
        }
        head = head + 1;
        if (head == k) head = 0;
        return acc;
    }
};

int main() {
    vector<double> taps {0.6, 0.3, 0.1};
    CausalFIR fir(taps);

    vector<double> stream {1, 0, 0, 0, 0, 2, 0, 0, 0};
    cout.setf(std::ios::fixed); cout << setprecision(3);
    for (double s : stream) {
        double y = fir.process(s);
        cout << y << ' ';
    }
    cout << endl;
    return 0;
}

This streaming FIR filter maintains a circular buffer of the last k samples. Each new sample produces an output using only the current and past values, ensuring zero future leakage. This is suitable for real-time processing and online inference, with predictable O(k) per-sample cost and O(k) memory.

Time: O(k) per sample · Space: O(k)
#temporal convolution #causal convolution #fir filter #dilated convolution #temporal convolutional network #time series #no future leakage #fft convolution #overlap-add #receptive field #cross-correlation #toeplitz #online filtering #signal processing #sequence modeling