ThinkMachine
← writing
Deep Learning · Theory

What is a Convolution Anyway? (Part 1)

From dot products to sliding filters, let us build convolutions from first principles.

I want you to imagine taking dot product of two vectors x,wRdx,w \in \mathbb{R}^d with d=5d = 5

s=xw=[x1x2x3x4x5][w1w2w3w4w5]=i=15xiwis = x^\top w = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \end{bmatrix} = \sum_{i=1}^{5} x_i w_i

And uninterestingly, the output is going to be a scalar ss. But this scalar is quite interesting in itself because it tells us how aligned xx and ww are.

Now suppose we take another vector wRkw' \in \mathbb{R}^k with k=3k=3. How do you take the dot product xwx\top w'? You can’t, because the dimensionality of the two vectors doesn’t match!

xw=[x1x2x3x4x5][w1w2w3]    Undefinedx^\top w' = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{bmatrix} \begin{bmatrix} w'_1 \\ w'_2 \\ w'_3 \end{bmatrix} \implies \text{Undefined}

But suppose if we could, how would we approach it?

One way would be to slide the kk sized vector ww' over the kk sized components living inside the dd sized vector xx, in the stride/increments of 1 and taking the dot product at each step.

Slide 1:

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} & x_4 & x_5 & ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & \end{array} s1=x1w1+x2w2+x3w3\rightarrow s_1 = x_1 w'_1 + x_2 w'_2 + x_3 w'_3

Slide 2:

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & x_1 & \mathbf{x_2} & \mathbf{x_3} & \mathbf{x_4} & x_5 & ] \\ w' = & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & \end{array} s2=x2w1+x3w2+x4w3\rightarrow s_2 = x_2 w'_1 + x_3 w'_2 + x_4 w'_3

Slide 3:

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & x_1 & x_2 & \mathbf{x_3} & \mathbf{x_4} & \mathbf{x_5} & ] \\ w' = & & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] \end{array} s3=x3w1+x4w2+x5w3\rightarrow s_3 = x_3 w'_1 + x_4 w'_2 + x_5 w'_3

Because we took multiple sequential dot products, the output is no longer a single scalar ss, but a new vector holding all the component-wise dot products/alignment scores:

a=[s1s2s3]a = \begin{bmatrix} s_1 & s_2 & s_3 \end{bmatrix}

This operation, is actually something called a cross-correlation, denoted by \ast, as in xwx\ast w which is a mathematical operation used in digital signal processing. It may be interesting to know that this operation is also called a convolution in machine learning, for very good reasons we’ll look at later on.

Now, time for a challenge

Suppose we are asked to estimate the dimensionality of aa in advance, given only two pieces of information:

  • dd - The dimensionality of the vector xx
  • kk - The dimensionality of the vector ww'

How would we approach it?

We can think of the vector xx (length 5) as a runway for ww' (length 3) to slide on. Since we are calculating dot products at every step increment, we know that the number of elements in aa depends on the increments ww' could take. How? If ww' and xx were to be of the same size, ww' would remain locked up in place, resulting in a single scalar as the only output.

x=[x1x2x3]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3}& ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & \end{array} s=x1w1+x2w2+x3w3\rightarrow s = x_1 w'_1 + x_2 w'_2 + x_3 w'_3

To estimate the number of empty steps ww can slide over before hitting a wall, we simply subtract dkd-k. For the above example, 5-3=2, thus 2 is the number of empty slots ww' can slide over in xx before hitting the wall.

x=[x1x2x3x4x52 empty slots]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} & {\overbrace{x_4 \quad x_5}^{\text{2 empty slots}}} & ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3}&&] & & \end{array}

But remember, before we take our first slide into an empty slot and compute the dot product, we already calculated our very first dot product at the starting slot!

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} & x_4 & x_5 & ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & \end{array} s1=x1w1+x2w2+x3w3\rightarrow s_1 = x_1 w'_1 + x_2 w'_2 + x_3 w'_3

Thus the total number of unique positions ww can occupy in xx is actually (5-3)+1 = 3. This number gives us the size of aa.

So our general formula becomes

Dimension of a=(dk)+1\text{Dimension of } a = (d - k) + 1 Dimension of a=(53)+1=3\text{Dimension of } a = (5 - 3) + 1 = 3

And looking back at our matrix a=[s1s2s3]a = \begin{bmatrix} s_1 & s_2 & s_3 \end{bmatrix}, this perfectly matches our result! And thus we may say that aR(dk+1)a \in \mathbb{R}^{(d-k+1)}

The Cliffhanger: Stride

Up until now, we’ve assumed ww' slides over by 1 component at a time. We call this step size the Stride or Increment (ss). So far, s=1s=1.

But what if we want to speed up this process and skip/leap over some components? Let’s see what happens if we increase our stride to s=2s=2.

To visualise this clearly, let’s slightly expand our runway. Let d=7d=7 and k=3k=3, and our stride =2= 2.

Slide 1 (Start):

x=[x1x2x3x4x5x6x7]w=[w1w2w3]\begin{array}{cccccccccc} x = & [ & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} & x_4 & x_5 & x_6 & x_7 & ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & & & \end{array} s1=x1w1+x2w2+x3w3\rightarrow s_1 = x_1 w'_1 + x_2 w'_2 + x_3 w'_3

Slide 2 (Take a step of 2):

x=[x1x2x3x4x5x6x7]w=[w1w2w3]\begin{array}{cccccccccc} x = & [ & x_1 & x_2 & \mathbf{x_3} & \mathbf{x_4} & \mathbf{x_5} & x_6 & x_7 & ] \\ w' = & & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & \end{array} s2=x3w1+x4w2+x5w3\rightarrow s_2 = x_3 w'_1 + x_4 w'_2 + x_5 w'_3

ww seems to have jumped over x2x^2.

Slide 3 (Take another step of 2):

x=[x1x2x3x4x5x6x7]w=[w1w2w3]\begin{array}{cccccccccc} x = & [ & x_1 & x_2 & x_3 & x_4 & \mathbf{x_5} & \mathbf{x_6} & \mathbf{x_7} & ] \\ w' = & & & & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] \end{array} s3=x5w1+x6w2+x7w3\rightarrow s_3 = x_5 w'_1 + x_6 w'_2 + x_7 w'_3

ww seems to have jumped again, but over x6x^6 this time.

With d=7d=7, k=3k=3, and s=2s=2, we ended up with exactly 3 dot products.

Now, let us use our same old formula to estimate the number of dot products aa would contain.

(dk)+1(73)+1=5\rightarrow (d - k) + 1 \Rightarrow (7 - 3) + 1 = 5

Our formula seems to fail for strides >1> 1, as the above result would be only be true if ww never hopped over x2x_2 and x6x_6, which we know isn’t possible when stride =2= 2.

Let’s go back to our slot runway analogy. The total number of empty slots ww' has available to slide into is still (dk)(d - k). For our new example, that runway is (73)=4(7 - 3) = 4 slots long.

However, because our stride is s=2s=2, we are eating up those empty slots 2 at a time! To find out how many steps we can take into the runway, we divide the empty slots by our stride:

Steps into empty slots=dks=732=2 steps\text{Steps into empty slots} = \frac{d - k}{s} = \frac{7 - 3}{2} = 2 \text{ steps}

And just like before, we must add 1 to account for our very first calculation at the starting block before we took any steps.

Dimension of a=dks+1\text{Dimension of } a = \frac{d - k}{s} + 1 Dimension of a=732+1=3\text{Dimension of } a = \frac{7 - 3}{2} + 1 = 3

Which perfectly matches our 3 slides!

There is one small caveat though. What if d=8d=8? Our empty slot runway would be 83=58 - 3 = 5. If we take steps of 2, we get 5/2=2.55 / 2 = 2.5 steps.

Since ww' cannot take half a step and hang off the edge of xx, we simply discard the decimal. In mathematics, this is called the “floor” function, denoted by \lfloor \cdot \rfloor, which rounds down to the nearest whole number.

So, our finalised, bulletproof formula for a 1D convolution is:

aRdks+1a \in \mathbb{R}^{\lfloor \frac{d-k}{s} \rfloor + 1}

Linking this to the machine learning perspective

Though technically, what we just did is more formally called a cross-correlation. But in the machine learning jargon, this operation is called a “convolution”. To understand why, we need to rephrase a few terms here and there and see what’s actually going on.

In deep learning, our long vector xx is called the Input Signal. This “signal” could represent almost anything: a row of pixels in an image, a slice of an audio Mel-spectrogram, or raw time-series data. The defining characteristic of this input signal is that it contains local structure. This means adjacent elements in xx are not just random numbers; they are highly correlated and form meaningful features when grouped together. For example, a bunch of large numbers appearing in the middle of a sequence of small numbers in a time series data (e.g. Stock prices) could indicate that the stock price went up briefly before going down.

Because xx has this local structure, our sliding vector ww', which is called a Filter or Kernel, in standard DSP and ML, acts as a pattern detector. By sliding ww' over xx and taking the dot product at every step, we are essentially scanning the input signal to see exactly where our target pattern occurs!

Because xx represents real-world data, like a picture or an audio clip, every single component holds precious, valuable information. So ideally, we would want to slide the filter across every single element in xx.

Let’s go back to our standard setup: d=5d=5, k=3k=3, and a stride of s=1s=1.

Slide 1:

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} & x_4 & x_5 & ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & \end{array}

Slide 2:

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & x_1 & \mathbf{x_2} & \mathbf{x_3} & \mathbf{x_4} & x_5 & ] \\ w' = & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & \end{array}

Slide 3:

x=[x1x2x3x4x5]w=[w1w2w3]\begin{array}{cccccccc} x = & [ & x_1 & x_2 & \mathbf{x_3} & \mathbf{x_4} & \mathbf{x_5} & ] \\ w' = & & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & \end{array}

Our kernel ww' is centered on w2\mathbf{w'_2}. This means that wherever element in xx that w2\mathbf{w'_2} lands upon, a full dot product is computed. Notice that the only elements w2\mathbf{w'_2} can touch are: x2,x3x_2, x_3 and x4x_4. It is obvious that only a partial dot product computation is happening for the edges. The elements on the edge (x1,x5x_1, x_5) only get interacted with exactly once (during Slide 1). But x3x_3 gets interacted with thrice: in Slide 1, Slide 2, and Slide 3!

Because the filter just briefly touches the edges before moving on, our operation is heavily biased toward the center of the signal. If xx is a row of pixels in an image, we are literally washing away the precious data living on the borders.

Extending the runway

Common sense suggest that if we artificially extend the size of the runway (xx), we’ll allow w2\mathbf{w'_2} to touch on more data-points. We do this by applying something called ‘zero padding’ (pp), which is literally putting pp zeroes (blank data-points) around the edges. This way, we allow our filter ww' to start scanning before it hits the actual data, and finish scanning after it leaves.

If we choose a padding of p=1p = 1, we add exactly 1 zero to the left side, and 1 zero to the right side of our vector.

Let’s see what happens to our sliding dot product now. Our runway is longer, so ww' can start earlier!

Slide 1 (Entering the signal):

xpadded=[0x1x2x3x4x50]w=[w1w2w3]\begin{array}{cccccccccc} x_{\text{padded}} = & [ & \mathbf{0} & \mathbf{x_1} & \mathbf{x_2} & x_3 & x_4 & x_5 & 0 & ] \\ w' = & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & & & \end{array} s1=(0w1)+(x1w2)+(x2w3)\rightarrow s_1 = (0 \cdot w'_1) + (x_1 \cdot w'_2) + (x_2 \cdot w'_3)

00 multiplies with w1w'_1 to give 0. This has no net effect on the result except that the data-point that used to live on the leftmost edge in xx is now interacted with twice (instead of once) during the dot product computation.

Slide 2 (Moving inward):

xpadded=[0x1x2x3x4x50]w=[w1w2w3]\begin{array}{cccccccccc} x_{\text{padded}} = & [ & 0 & \mathbf{x_1} & \mathbf{x_2} & \mathbf{x_3} & x_4 & x_5 & 0 & ] \\ w' = & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] & & & \end{array} s2=(x1w1)+(x2w2)+(x3w3)\rightarrow s_2 = (x_1 \cdot w'_1) + (x_2 \cdot w'_2) + (x_3 \cdot w'_3)

(… skipping a few slides in the middle …)

Slide 5 (Exiting the signal):

xpadded=[0x1x2x3x4x50]w=[w1w2w3]\begin{array}{cccccccccc} x_{\text{padded}} = & [ & 0 & x_1 & x_2 & x_3 & \mathbf{x_4} & \mathbf{x_5} & \mathbf{0} & ] \\ w' = & & & & & [ & \mathbf{w'_1} & \mathbf{w'_2} & \mathbf{w'_3} & ] \end{array} s5=(x4w1)+(x5w2)+(0w3)\rightarrow s_5 = (x_4 \cdot w'_1) + (x_5 \cdot w'_2) + (0 \cdot w'_3)

By adding padding, x1x_1 and x5x_5 are no longer just an afterthought; they get to sit right in the middle of our pattern detector in Slide 1 and 5 respectively! Furthermore, our output vector aa now has 5 elements instead of 3, meaning we perfectly preserved the spatial dimension of our original input signal.

The master formula

How does this update our formula? It is incredibly simple.

Padding doesn’t change the logic of our strides or our dot products; it literally just makes our starting runway longer. Because we add pp zeros to the left AND pp zeros to the right, the new dimensionality of our runway is no longer dd. It is now d+2pd + 2p.

To complete our master equation, all we have to do is take our old runway (dd) and replace it with our new, longer runway (d+2pd + 2p):

Dimension of a=Runwayks+1\text{Dimension of } a = \lfloor \frac{\text{Runway} - k}{s} \rfloor + 1 Dimension of a=(d+2p)ks+1\text{Dimension of } a = \lfloor \frac{(d + 2p) - k}{s} \rfloor + 1

Let’s plug in the numbers from our padded example (d=5d=5, k=3k=3, p=1p=1, s=1s=1) to make sure it works:

Dimension of a=(5+2(1))31+1\text{Dimension of } a = \lfloor \frac{(5 + 2(1)) - 3}{1} \rfloor + 1 Dimension of a=731+1\text{Dimension of } a = \lfloor \frac{7 - 3}{1} \rfloor + 1 Dimension of a=4+1=5\text{Dimension of } a = 4 + 1 = 5

And just like that, we have derived the universal master equation for the output dimension of a 1D convolution:

Dimension of a=(d+2p)ks+1\text{Dimension of } a = \lfloor \frac{(d + 2p) - k}{s} \rfloor + 1

This is our final bulletproof formula for computing the spatial dimensions of the output of a convolution operation xwx \ast w'.

aR(d+2p)ks+1a \in \mathbb{R}^{\lfloor \frac{(d+2p)-k}{s} \rfloor + 1}

And fascinatingly, this formula extends to arbitrary dimensions. Let us look at how:

Extending the idea to images

Up until now, we’ve dealt with a 1D signal. But if we want to apply this to images, we need to upgrade our vectors into matrices.

Let’s translate our 1D variables into 2D terminology:

  • The Input (xXx \rightarrow X): Instead of a 1D runway of length dd, our input is now a 2D grid with a Height (HH) and Width (WW). So, XRH×WX \in \mathbb{R}^{H \times W}.
  • The Filter (wKw' \rightarrow K): Instead of a 1D vector of length kk, our filter (often called a Kernel, KK) is a 2D grid. Usually, filters are square, so we denote its size as F×FF \times F. So, KRF×FK \in \mathbb{R}^{F \times F}. Note that we are using a capital FF instead of a small kk.

In 1D, our filter slid from left to right. In 2D, the logic is exactly the same, just in two directions! The filter starts at the top-left corner, slides left-to-right across the width, and when it hits the edge, it moves down a step and starts the next row.

Let’s look at the mathematical operation. In 1D, we took a dot product. In 2D, we compute the element-wise product between the F×FF \times F filter and the F×FF \times F patch of the image it’s currently resting on, and then sum all those values up to get a single scalar.

Suppose we have a 3×33 \times 3 image patch and a 3×33 \times 3 filter KK:

Patch=[x11x12x13x21x22x23x31x32x33],K=[w11w12w13w21w22w23w31w32w33]\text{Patch} = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}, \quad K = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix}

The single scalar output ss for this specific placement of the filter is:

s=(x11w11)+(x12w12)++(x33w33)=i=1Fj=1Fxijwijs = (x_{11}w_{11}) + (x_{12}w_{12}) + \dots + (x_{33}w_{33}) = \sum_{i=1}^{F} \sum_{j=1}^{F} x_{ij} w_{ij}

The 2D master formula

Because sliding horizontally across the width is completely independent of sliding vertically down the height, we don’t need to invent a new formula! We just apply our bulletproof 1D formula twice: once for the Width dimension, and once for the Height dimension.

Let’s say we have an input image of size H×WH \times W, a filter of size F×FF \times F (such that F<H,WF < H,W) a padding of PP (which adds zeroes to the top, bottom, left, and right), and a stride of SS.

To find the Output Height (HoutH_{out}), we apply the formula to the vertical dimension:

Hout=HF+2PS+1H_{out} = \lfloor \frac{H - F + 2P}{S} \rfloor + 1

To find the Output Width (WoutW_{out}), we apply the formula to the horizontal dimension:

Wout=WF+2PS+1W_{out} = \lfloor \frac{W - F + 2P}{S} \rfloor + 1

Because our filter spits out a scalar at every valid placement on the 2D grid, our final output AA (often called a Feature Map) is a 2D matrix of size Hout×WoutH_{out} \times W_{out}.

ARHout×WoutA \in \mathbb{R}^{H_{out} \times W_{out}}

Again, extending the same logic to n-dimensions is trivial, we just compute the same formula n-times.