
CNNs: Local Receptive Fields, Weight Sharing, and Feature Hierarchies

Slide a shared kernel over local windows to turn raw grid signals into edges, textures, shapes, and task features.

Mechanism Lab

Animation: how a kernel turns local windows into feature maps

The animation shows a shared convolution kernel sliding over an image grid, multiplying local windows by weights, then passing through ReLU, pooling, and a task readout.

Step 1 of 5 (Patch): a convolution reads a local window R_{i,j} rather than connecting to the whole image.


01 / Intuition

Core Intuition

The core inductive biases of CNNs are locality and translation equivariance: nearby pixels, table neighborhoods, or time windows often define local patterns, and the same pattern can appear at any position.

The same kernel is shared across spatial positions, so the model does not learn a different detector for every location.

Early convolutions often detect edges and local textures; deeper convolutions compose them into more abstract shapes, structures, or local economic signals.

Pooling and strided convolution lower resolution and enlarge the receptive field of later layers, but they also discard fine detail; modern networks offset aggressive downsampling with residual paths and normalization.

02 / Math

Deriving a CNN layer from discrete convolution

01 / Local window

For 2D input X, each output location looks only at a k_h by k_w local window rather than the whole image.

R_{i,j} = X[i:i+k_h, j:j+k_w]
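The window definition can be checked directly in NumPy. The sketch below is illustrative only: the 4x5 toy input and the 2x3 window size are assumptions, and `sliding_window_view` (available in NumPy 1.20 and later) is just one convenient way to enumerate every R_{i,j} at once.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(20).reshape(4, 5)   # toy 4x5 single-channel input (assumed)
k_h, k_w = 2, 3                   # assumed window size

# windows[i, j] is the k_h x k_w patch whose top-left corner is (i, j).
windows = sliding_window_view(X, (k_h, k_w))

i, j = 1, 2
R_ij = X[i:i + k_h, j:j + k_w]    # the slice from the definition above

print(windows.shape)                         # (3, 3, 2, 3)
print(np.array_equal(windows[i, j], R_ij))   # True
```

The first two axes of `windows` index output locations, so a 4x5 input with a 2x3 window yields 3x3 locations.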

02 / Weight sharing

The same weights W are reused at every spatial position. With C_in input channels and C_out output channels, parameter count is independent of image size.

#params = k_h k_w C_in C_out + C_out
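To make the size independence concrete, the following sketch compares the formula above with a fully connected layer that maps the whole input grid to an output grid of the same spatial size. The helper names `conv_params` and `dense_params`, and the 3-to-16-channel layer, are hypothetical choices for illustration.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Shared kernel weights plus one bias per output channel."""
    return k_h * k_w * c_in * c_out + c_out

def dense_params(h, w, c_in, c_out):
    """Fully connected map from an h x w x c_in grid to an h x w x c_out grid."""
    n_in = h * w * c_in
    n_out = h * w * c_out
    return n_in * n_out + n_out

conv = conv_params(3, 3, 3, 16)
print(conv)  # 448, regardless of image size

for size in (32, 224):
    # The dense count grows with the fourth power of the side length.
    print(size, conv, dense_params(size, size, 3, 16))
```

The convolution count stays at 448 for both image sizes, while the dense count already exceeds 50 million at 32x32.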

03 / Convolution output

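Output channel k at position (i, j) is the weighted sum of the local window with kernel k, plus a bias b_k; the sum runs over window offsets (u, v) and input channels c. Because the kernel is not flipped, this is strictly a cross-correlation, which deep learning libraries conventionally call convolution.

Y[i,j,k] = sum_{u,v,c} W[u,v,c,k] X[i+u,j+v,c] + b_k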
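The output formula translates almost symbol for symbol into an `einsum` over sliding windows. This is a minimal sketch, not a reference implementation: the function name `conv2d_einsum`, the random test shapes, and the channels-last layout are assumptions chosen to match the index order in the formula.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_einsum(x, W, b):
    # x: [H, W, C_in], W: [k_h, k_w, C_in, C_out], b: [C_out]
    k_h, k_w = W.shape[:2]
    # windows: [H_out, W_out, C_in, k_h, k_w]; window axes are appended last.
    windows = sliding_window_view(x, (k_h, k_w), axis=(0, 1))
    # Sum over window offsets (u, v) and input channels c for every (i, j, k).
    return np.einsum("ijcuv,uvck->ijk", windows, W) + b

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 7, 3))
W = rng.normal(size=(3, 3, 3, 4))
b = rng.normal(size=4)

Y = conv2d_einsum(x, W, b)

# Direct evaluation of the formula at one output location and channel.
i, j, k = 2, 1, 3
direct = sum(
    W[u, v, c, k] * x[i + u, j + v, c]
    for u in range(3) for v in range(3) for c in range(3)
) + b[k]

print(Y.shape)                          # (4, 5, 4)
print(np.isclose(Y[i, j, k], direct))   # True
```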

04 / Translation equivariance

Ignoring boundary effects, translating the input and then convolving is equivalent to convolving first and translating the output.

Conv(T_delta X) = T_delta Conv(X)
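The identity can be verified numerically once boundary effects are removed. The sketch below assumes wrap-around (circular) boundaries so that a shift is an exact permutation of pixels; `conv2d_circular` is a hypothetical helper written for this check, not part of any library.

```python
import numpy as np

def conv2d_circular(x, kernel):
    # Single-channel correlation with wrap-around boundaries, so shifts are exact.
    k_h, k_w = kernel.shape
    out = np.zeros_like(x, dtype=float)
    for u in range(k_h):
        for v in range(k_w):
            # Rolling by (-u, -v) aligns x[i+u, j+v] with output position (i, j).
            out += kernel[u, v] * np.roll(x, shift=(-u, -v), axis=(0, 1))
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
delta = (2, 3)   # T_delta: shift rows by 2 and columns by 3

lhs = conv2d_circular(np.roll(x, delta, axis=(0, 1)), kernel)   # Conv(T_delta X)
rhs = np.roll(conv2d_circular(x, kernel), delta, axis=(0, 1))   # T_delta Conv(X)
print(np.allclose(lhs, rhs))  # True
```

With zero padding or "valid" windows the two sides agree only away from the image border, which is the boundary caveat in the statement above.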

05 / Nonlinearity and pooling

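A nonlinearity phi such as ReLU lets stacked layers represent nonlinear combinations of local detectors; without it, consecutive convolutions collapse into a single linear map. Pooling or stride then compresses local activations into a coarser representation.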

Z = pool(phi(Y))
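A small worked instance of Z = pool(phi(Y)) shows what is kept and what is lost. The 4x4 values are made up for illustration, and the reshape trick assumes non-overlapping 2x2 max pooling on a map whose sides are divisible by 2.

```python
import numpy as np

Y = np.array([[ 1., -2.,  3.,  0.],
              [-1.,  4., -5.,  2.],
              [ 0., -3., -1., -2.],
              [ 6., -7., -4., -8.]])

phi_Y = np.maximum(Y, 0.0)   # ReLU zeroes every negative response

# Split rows and columns into (block index, offset within block),
# then take the max over the two offset axes.
Z = phi_Y.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(Z)
# [[4. 3.]
#  [6. 0.]]
```

The bottom-right block contained only negative responses, so after ReLU and pooling it reports 0, and the four distinct input values there are no longer recoverable.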

06 / Effective receptive field

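Stacking L layers of 3x3 stride-1 convolutions grows the theoretical receptive field by 2 pixels per layer along each axis, starting from a single pixel. The effective receptive field, meaning the region that actually dominates an output value, is usually smaller than this theoretical bound.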

RF_L = 1 + 2L
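The closed form is a special case of a more general recurrence for the theoretical receptive field along one axis. The sketch below uses a commonly cited recurrence that is not stated in the text above, so treat it as an assumption; `receptive_field` and the `(kernel_size, stride)` layer encoding are hypothetical names.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1   # jump = spacing, in input pixels, between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump   # each layer adds (k - 1) steps of the current spacing
        jump *= s              # stride widens the spacing seen by later layers
    return rf

for L in (1, 2, 5):
    print(L, receptive_field([(3, 1)] * L))   # 3, 5, 11, matching 1 + 2L

# A stride-2 layer makes the next 3x3 layer widen the field by 4 instead of 2.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))   # 9
```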

03 / Code

NumPy demo: 2D multi-channel convolution from scratch

This framework-free example explicitly shows local windows, shared weights, ReLU, and max pooling.

import numpy as np

def conv2d_valid(x, kernels, bias):
    # x: [height, width, in_channels]
    # kernels: [kernel_h, kernel_w, in_channels, out_channels]
    # bias: [out_channels]
    h, w, c = x.shape
    kh, kw, kc, out_channels = kernels.shape
    assert c == kc
    # "Valid" mode: no padding, stride 1, so each axis shrinks by kernel size - 1.
    out = np.zeros((h - kh + 1, w - kw + 1, out_channels))

    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]
            for k in range(out_channels):
                out[i, j, k] = np.sum(patch * kernels[..., k]) + bias[k]
    return out

def max_pool2d(x, size=2, stride=2):
    h, w, channels = x.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1, channels))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size, :]
            out[i, j, :] = patch.max(axis=(0, 1))
    return out

rng = np.random.default_rng(5)
image = rng.normal(size=(8, 8, 1))

# A vertical-edge detector and a horizontal-edge detector.
kernels = np.zeros((3, 3, 1, 2))
kernels[:, :, 0, 0] = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
kernels[:, :, 0, 1] = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
bias = np.zeros(2)

features = conv2d_valid(image, kernels, bias)
activated = np.maximum(features, 0.0)
pooled = max_pool2d(activated, size=2, stride=2)

print("feature map:", features.shape)
print("pooled map:", pooled.shape)
print("strongest vertical edge:", activated[..., 0].max())
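Because the demo image is random noise, the edge detectors have no clean edge to respond to. As a hedged sanity check, the sketch below applies the same two kernels to a synthetic step image, which is an assumption added here for illustration; `correlate_valid` is a hypothetical single-channel helper that mirrors the inner loop of `conv2d_valid`.

```python
import numpy as np

# Hypothetical test image: left half dark (0), right half bright (1).
step = np.zeros((6, 6))
step[:, 3:] = 1.0

vertical = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
horizontal = vertical.T   # same values as the horizontal-edge kernel in the demo

def correlate_valid(x, kernel):
    # Single-channel "valid" sliding-window sum, no padding, stride 1.
    kh, kw = kernel.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

print(correlate_valid(step, vertical)[0])                  # [0. 3. 3. 0.]
print(np.abs(correlate_valid(step, horizontal)).max())     # 0.0
```

The vertical-edge kernel fires only in the windows that straddle the dark-to-bright boundary, and the horizontal-edge kernel stays silent because every row of the image is identical.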

04 / Case

Case: turning paper figures, satellite images, and gridded economic data into local features

  • In research workflows, CNNs are not only for cat and dog photos. They can process paper-figure screenshots, night-light satellite imagery, land-use grids, microscopy images, traffic heat maps, or local time-frequency images.
  • For predicting economic activity from night-light imagery, early convolutions can detect brightness boundaries, road textures, and urban patches; deeper features combine local patterns into regional activity intensity.
  • For paper-figure understanding, convolution layers can identify axes, points, error bars, and table borders before later models connect those visual signals to OCR, table parsing, or Transformer representations.
  • But CNN-discovered visual patterns are not causal explanations. Policy evaluation still needs treatment/control definition, timing, identification assumptions, and robustness checks.

05 / Risks

Common Pitfalls

  • Implementing convolution as a dense layer and losing the benefits of weight sharing and locality.
  • Downsampling too early, which can erase small objects, fine lines, or table borders.
  • Assuming CNNs are automatically rotation-, scale-, or illumination-invariant; these usually come from augmentation, architecture choices, or the training distribution.
  • Ignoring how padding, stride, and dilation change output size and receptive field.
  • Treating feature heat maps as causal explanations. They show visual regions the model used, not the true mechanism.
  • Ignoring spatial correlation and leakage in economic grid data, such as neighboring regions appearing in both train and test splits.
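For the padding, stride, and dilation pitfall, a small calculator helps catch shape surprises before training. This sketch uses the output-size formula commonly documented for convolution layers in deep learning libraries, which is not given in the text above; `conv_output_size` is a hypothetical helper name, and it handles one axis at a time.

```python
def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Output length along one axis for input length n and kernel size k."""
    effective_k = dilation * (k - 1) + 1   # span covered by a dilated kernel
    return (n + 2 * padding - effective_k) // stride + 1

print(conv_output_size(8, 3))                        # 6, as in the 8x8 demo above
print(conv_output_size(8, 3, padding=1))             # 8, "same" size for a 3x3 kernel
print(conv_output_size(8, 3, stride=2, padding=1))   # 4
print(conv_output_size(8, 3, dilation=2))            # 4
```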
