MLP Topic
UAT: Why MLPs Can Approximate Continuous Functions
The phrase “with enough expansion terms, we can fit almost anything” points to a shared structure: approximate a complex function by adding many simple functions. Fourier series use fixed sine and cosine bases; MLPs use learnable nonlinear basis functions. UAT is the expressiveness theorem for the latter.
One approximation template
f(x) ~= sum_{j=1}^N alpha_j phi_j(x)
Fourier: phi_j in {sin(kx), cos(kx)}
MLP: phi_j(x) = sigma(w_j^T x + b_j)01 / Intuition
A more precise version of the intuition
“Fit anything” is not an unconditional statement. A precise version names a function space, an error norm, a domain, and regularity conditions. A rich enough family is dense in that space, meaning the distance between a target f and a model g can be made smaller than any epsilon.
For every f in a function space F and every epsilon > 0,
there exists g in model class G such that
distance(f, g) < epsilon.Fourier series and UAT are both instances of this template. The difference is that Fourier bases are usually fixed, while MLP hidden units learn their location, direction, scale, and output weight.
02 / Theorem
A standard statement of UAT
Let K be a compact subset of R^d, and let C(K) be the space of continuous functions on K with the sup norm. For suitable activations, such as continuous non-polynomial activations, or bounded continuous nonconstant sigmoidal activations in Cybenko’s original theorem, single-hidden-layer networks are dense in C(K).
Given f in C(K) and epsilon > 0,
there exist N, alpha_j, w_j, b_j such that
sup_{x in K} | f(x) - sum_{j=1}^N alpha_j sigma(w_j^T x + b_j) | < epsilon.The key word is existence. UAT says a wide enough MLP is expressive enough. It does not say training will find the parameters or that finite-sample generalization will be good.
03 / Proof
Cybenko-style proof: density by measure separation
The proof outline is functional-analytic. If MLPs were not dense, a nonzero measure would annihilate every hidden unit. But sigmoidal units approximate half-space indicators, forcing the measure of every half-space to be zero. That implies the measure itself is zero, a contradiction.
Assume the network span is not dense
Let H be the set of all finite sums sum alpha_j sigma(w_j^T x + b_j). If H is not dense in C(K), some continuous function cannot be approximated arbitrarily well by H.
Hahn-Banach plus Riesz representation
By Hahn-Banach separation, there is a nonzero continuous linear functional L with L(h)=0 for every h in H. The Riesz representation theorem writes L as integration against a finite signed Borel measure mu.
L(h) = int_K h(x) d mu(x)
Therefore for every w and b:
int_K sigma(w^T x + b) d mu(x) = 0.Scale the sigmoid to get half-spaces
A sigmoidal activation approaches 1 as t goes to +infinity and 0 as t goes to -infinity. Replace sigma(w^T x+b) by sigma(lambda(w^T x+b)) and let lambda go to infinity. Away from the boundary hyperplane, it converges pointwise to a half-space indicator.
sigma(lambda(w^T x + b)) -> 1{w^T x + b > 0}
So, for almost every boundary shift b:
mu({x in K: w^T x + b > 0}) = 0.Zero on all half-spaces implies zero measure
For any direction w, push mu forward to the line: nu_w(B)=mu({x: w^T x in B}). Since all half-lines have zero mass, nu_w=0. Therefore the Fourier transform of mu vanishes at every frequency. By uniqueness of finite measures, mu=0.
hat_mu(xi) = int_K exp(i xi^T x) d mu(x)
= int_R exp(i t) d nu_xi(t)
= 0 for every xi.
Thus mu = 0, contradicting that L was nonzero.04 / Constructive Intuition
One-dimensional ReLU: approximate by piecewise-linear splines
For ReLU, the most concrete derivation starts in one dimension. The unit (x-t)_+ is a hinge that changes slope at t. Adding many hinges places many breakpoints, representing any continuous piecewise-linear function.
p(x) = beta_0 + beta_1 x + sum_{i=1}^{m-1} c_i (x - t_i)_+
If slopes on intervals are s_1, s_2, ..., s_m,
then c_i = s_{i+1} - s_i.A continuous function on a closed interval is uniformly continuous. With a fine enough grid, the polygonal interpolation p satisfies sup_x |f(x)-p(x)| < epsilon. Since p is a finite linear combination of ReLU hinges, the constructive intuition is direct: more hidden units mean more breakpoints.
05 / Fourier Comparison
Fourier series: projection onto a fixed orthogonal basis
Fourier series use a fixed orthogonal basis, sin(kx) and cos(kx), on a periodic interval. For square-integrable periodic functions, the partial sum is a projection onto a finite-frequency subspace.
f(x) ~ a_0/2 + sum_{k=1}^N [a_k cos(kx) + b_k sin(kx)]
a_k = (1/pi) int_{-pi}^{pi} f(x) cos(kx) dx
b_k = (1/pi) int_{-pi}^{pi} f(x) sin(kx) dxOrthogonality gives a clean error decomposition. With complex coefficients c_k=(1/2pi) int f(x)e^{-ikx}dx, Parseval decomposes energy by frequency; truncation error is the energy left in the discarded high frequencies.
||f - S_N f||_2^2 = 2pi * sum_{|k| > N} |c_k|^206 / Key Limits
Neither UAT nor Fourier means “free fitting of everything”
Existence
UAT guarantees that parameters exist, not that training will find them.
Function space
Fourier results depend on periodicity, L2 assumptions, continuity, or smoothness.
Generalization
Small training error does not imply good out-of-sample behavior; validation and regularization still matter.
The precise one-sentence summary is: Fourier expansions and UAT both show that complex functions can be approximated by sums of simple functions; Fourier uses fixed orthogonal frequencies, while MLPs use learnable nonlinear units. The approximation idea is shared, but the assumptions, norms, computation, and failure modes differ.
References
- Cybenko (1989), Approximation by Superpositions of a Sigmoidal Functionhttps://doi.org/10.1007/BF02551274
- Hornik, Stinchcombe, and White (1989), Multilayer Feedforward Networks are Universal Approximatorshttps://doi.org/10.1016/0893-6080(89)90020-8
- Leshno et al. (1993), Multilayer Feedforward Networks with a Nonpolynomial Activation Function can Approximate Any Functionhttps://doi.org/10.1016/S0893-6080(05)80131-5
- Stein and Shakarchi, Fourier Analysis: An Introductionhttps://press.princeton.edu/books/hardcover/9780691113845/fourier-analysis