What is a Convolution Anyway? (Part 1)
From dot products to sliding filters, let us build convolutions from first principles.
I want you to imagine taking dot product of two vectors with
And uninterestingly, the output is going to be a scalar . But this scalar is quite interesting in itself because it tells us how aligned and are.
Now suppose we take another vector with . How do you take the dot product ? You can’t, because the dimensionality of the two vectors doesn’t match!
But suppose if we could, how would we approach it?
One way would be to slide the sized vector over the sized components living inside the sized vector , in the stride/increments of 1 and taking the dot product at each step.
Slide 1:
Slide 2:
Slide 3:
Because we took multiple sequential dot products, the output is no longer a single scalar , but a new vector holding all the component-wise dot products/alignment scores:
This operation, is actually something called a cross-correlation, denoted by , as in which is a mathematical operation used in digital signal processing. It may be interesting to know that this operation is also called a convolution in machine learning, for very good reasons we’ll look at later on.
Now, time for a challenge
Suppose we are asked to estimate the dimensionality of in advance, given only two pieces of information:
- - The dimensionality of the vector
- - The dimensionality of the vector
How would we approach it?
We can think of the vector (length 5) as a runway for (length 3) to slide on. Since we are calculating dot products at every step increment, we know that the number of elements in depends on the increments could take. How? If and were to be of the same size, would remain locked up in place, resulting in a single scalar as the only output.
To estimate the number of empty steps can slide over before hitting a wall, we simply subtract . For the above example, 5-3=2, thus 2 is the number of empty slots can slide over in before hitting the wall.
But remember, before we take our first slide into an empty slot and compute the dot product, we already calculated our very first dot product at the starting slot!
Thus the total number of unique positions can occupy in is actually (5-3)+1 = 3. This number gives us the size of .
So our general formula becomes
And looking back at our matrix , this perfectly matches our result! And thus we may say that
The Cliffhanger: Stride
Up until now, we’ve assumed slides over by 1 component at a time. We call this step size the Stride or Increment (). So far, .
But what if we want to speed up this process and skip/leap over some components? Let’s see what happens if we increase our stride to .
To visualise this clearly, let’s slightly expand our runway. Let and , and our stride .
Slide 1 (Start):
Slide 2 (Take a step of 2):
seems to have jumped over .
Slide 3 (Take another step of 2):
seems to have jumped again, but over this time.
With , , and , we ended up with exactly 3 dot products.
Now, let us use our same old formula to estimate the number of dot products would contain.
Our formula seems to fail for strides , as the above result would be only be true if never hopped over and , which we know isn’t possible when stride .
Let’s go back to our slot runway analogy. The total number of empty slots has available to slide into is still . For our new example, that runway is slots long.
However, because our stride is , we are eating up those empty slots 2 at a time! To find out how many steps we can take into the runway, we divide the empty slots by our stride:
And just like before, we must add 1 to account for our very first calculation at the starting block before we took any steps.
Which perfectly matches our 3 slides!
There is one small caveat though. What if ? Our empty slot runway would be . If we take steps of 2, we get steps.
Since cannot take half a step and hang off the edge of , we simply discard the decimal. In mathematics, this is called the “floor” function, denoted by , which rounds down to the nearest whole number.
So, our finalised, bulletproof formula for a 1D convolution is:
Linking this to the machine learning perspective
Though technically, what we just did is more formally called a cross-correlation. But in the machine learning jargon, this operation is called a “convolution”. To understand why, we need to rephrase a few terms here and there and see what’s actually going on.
In deep learning, our long vector is called the Input Signal. This “signal” could represent almost anything: a row of pixels in an image, a slice of an audio Mel-spectrogram, or raw time-series data. The defining characteristic of this input signal is that it contains local structure. This means adjacent elements in are not just random numbers; they are highly correlated and form meaningful features when grouped together. For example, a bunch of large numbers appearing in the middle of a sequence of small numbers in a time series data (e.g. Stock prices) could indicate that the stock price went up briefly before going down.
Because has this local structure, our sliding vector , which is called a Filter or Kernel, in standard DSP and ML, acts as a pattern detector. By sliding over and taking the dot product at every step, we are essentially scanning the input signal to see exactly where our target pattern occurs!
Because represents real-world data, like a picture or an audio clip, every single component holds precious, valuable information. So ideally, we would want to slide the filter across every single element in .
Let’s go back to our standard setup: , , and a stride of .
Slide 1:
Slide 2:
Slide 3:
Our kernel is centered on . This means that wherever element in that lands upon, a full dot product is computed. Notice that the only elements can touch are: and . It is obvious that only a partial dot product computation is happening for the edges. The elements on the edge () only get interacted with exactly once (during Slide 1). But gets interacted with thrice: in Slide 1, Slide 2, and Slide 3!
Because the filter just briefly touches the edges before moving on, our operation is heavily biased toward the center of the signal. If is a row of pixels in an image, we are literally washing away the precious data living on the borders.
Extending the runway
Common sense suggest that if we artificially extend the size of the runway (), we’ll allow to touch on more data-points. We do this by applying something called ‘zero padding’ (), which is literally putting zeroes (blank data-points) around the edges. This way, we allow our filter to start scanning before it hits the actual data, and finish scanning after it leaves.
If we choose a padding of , we add exactly 1 zero to the left side, and 1 zero to the right side of our vector.
Let’s see what happens to our sliding dot product now. Our runway is longer, so can start earlier!
Slide 1 (Entering the signal):
multiplies with to give 0. This has no net effect on the result except that the data-point that used to live on the leftmost edge in is now interacted with twice (instead of once) during the dot product computation.
Slide 2 (Moving inward):
(… skipping a few slides in the middle …)
Slide 5 (Exiting the signal):
By adding padding, and are no longer just an afterthought; they get to sit right in the middle of our pattern detector in Slide 1 and 5 respectively! Furthermore, our output vector now has 5 elements instead of 3, meaning we perfectly preserved the spatial dimension of our original input signal.
The master formula
How does this update our formula? It is incredibly simple.
Padding doesn’t change the logic of our strides or our dot products; it literally just makes our starting runway longer. Because we add zeros to the left AND zeros to the right, the new dimensionality of our runway is no longer . It is now .
To complete our master equation, all we have to do is take our old runway () and replace it with our new, longer runway ():
Let’s plug in the numbers from our padded example (, , , ) to make sure it works:
And just like that, we have derived the universal master equation for the output dimension of a 1D convolution:
This is our final bulletproof formula for computing the spatial dimensions of the output of a convolution operation .
And fascinatingly, this formula extends to arbitrary dimensions. Let us look at how:
Extending the idea to images
Up until now, we’ve dealt with a 1D signal. But if we want to apply this to images, we need to upgrade our vectors into matrices.
Let’s translate our 1D variables into 2D terminology:
- The Input (): Instead of a 1D runway of length , our input is now a 2D grid with a Height () and Width (). So, .
- The Filter (): Instead of a 1D vector of length , our filter (often called a Kernel, ) is a 2D grid. Usually, filters are square, so we denote its size as . So, . Note that we are using a capital instead of a small .
In 1D, our filter slid from left to right. In 2D, the logic is exactly the same, just in two directions! The filter starts at the top-left corner, slides left-to-right across the width, and when it hits the edge, it moves down a step and starts the next row.
Let’s look at the mathematical operation. In 1D, we took a dot product. In 2D, we compute the element-wise product between the filter and the patch of the image it’s currently resting on, and then sum all those values up to get a single scalar.
Suppose we have a image patch and a filter :
The single scalar output for this specific placement of the filter is:
The 2D master formula
Because sliding horizontally across the width is completely independent of sliding vertically down the height, we don’t need to invent a new formula! We just apply our bulletproof 1D formula twice: once for the Width dimension, and once for the Height dimension.
Let’s say we have an input image of size , a filter of size (such that ) a padding of (which adds zeroes to the top, bottom, left, and right), and a stride of .
To find the Output Height (), we apply the formula to the vertical dimension:
To find the Output Width (), we apply the formula to the horizontal dimension:
Because our filter spits out a scalar at every valid placement on the 2D grid, our final output (often called a Feature Map) is a 2D matrix of size .
Again, extending the same logic to n-dimensions is trivial, we just compute the same formula n-times.