Machine Learning

ML Basics: ERM, Validation, and Regularization

Reduce the scikit-learn workflow to two ideas: empirical risk minimization and control of out-of-sample error.

Mechanism Lab

Animation: bias-variance as complexity changes

The animation shows training error falling as model complexity grows, validation error tracing a U-shape, and regularization shifting the complexity at which validation error is lowest.

Data

Split the data into disjoint training and validation sets.

D = D_train union D_valid
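
A minimal sketch of this split, assuming scikit-learn's train_test_split and a synthetic stand-in dataset (the array shapes, the 80/20 ratio, and the variable names are illustrative choices, not part of the lesson):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 rows, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# Hold out 20% of the rows as D_valid; stratify keeps the class balance in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(X_train.shape, X_valid.shape)
```

Fixing random_state makes the split reproducible, so later model comparisons see the same validation rows.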

01 / Intuition

Core Intuition

Training selects a prediction function from a function class.

Training error is not the target; out-of-sample risk is.

Validation and cross-validation reveal when a model has memorized noise; regularization makes that memorization less likely.
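
The gap between training error and out-of-sample error can be made concrete with a small experiment. The sketch below is an illustration under assumed conditions: a hypothetical sine-plus-noise target, polynomial degree as the complexity knob, and squared error as the loss. It fits each degree by least squares and records both errors.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(30)
x_valid, y_valid = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, valid_err = [], []
for degree in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # least-squares fit
    train_err.append(mse(coeffs, x_train, y_train))
    valid_err.append(mse(coeffs, x_valid, y_valid))

for degree, (tr, va) in enumerate(zip(train_err, valid_err), start=1):
    print(f"degree={degree}  train={tr:.3f}  valid={va:.3f}")
```

Because each polynomial class contains the previous one, training error cannot increase with degree. Validation error carries no such guarantee, which is why it, not training error, is used to choose the degree.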

02 / Math

Empirical risk minimization

01 / Risk

The ideal target is expected loss under the unknown data distribution.

R(f) = E[L(Y, f(X))]

02 / Empirical risk

A sample average approximates that expectation.

R_hat(f) = (1/n) sum_i L(y_i, f(x_i))
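
As a numerical check of this approximation, the sketch below uses a setup where the true risk is known in closed form. The setup is assumed for illustration: X uniform on [0, 1], Y = X + Gaussian noise with standard deviation 0.5, the fixed predictor f(x) = x, and squared loss, so that R(f) = 0.25.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.5  # noise standard deviation (assumed for this illustration)

def f(x):
    # A fixed prediction function, chosen before seeing any data.
    return x

def empirical_risk(n):
    # Draw n samples and average the squared loss L(y, f(x)) = (y - f(x))^2.
    x = rng.uniform(0.0, 1.0, size=n)
    y = x + rng.normal(scale=SIGMA, size=n)
    return float(np.mean((y - f(x)) ** 2))

true_risk = SIGMA ** 2  # E[(Y - X)^2] equals the noise variance
risk_small = empirical_risk(20)
risk_large = empirical_risk(100_000)

print(f"true risk          = {true_risk:.4f}")
print(f"R_hat, n = 20      = {risk_small:.4f}")
print(f"R_hat, n = 100000  = {risk_large:.4f}")
```

The approximation is unbiased only for an f fixed in advance. Once f is chosen by minimizing R_hat on the same sample, R_hat tends to underestimate R(f), which is the reason for holding out validation data.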

03 / Regularization

Complexity penalties trade a little bias for lower variance.

min_f R_hat(f) + lambda * Omega(f)
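
One concrete instance of this objective is ridge regression, where f(x) = w·x, the loss is squared error, and Omega(f) = ||w||^2. The sketch below assumes that instance and a synthetic dataset (both are illustrative choices). It solves the penalized problem in closed form for several values of lambda.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: 40 samples, 10 features, noisy linear target.
n, d = 40, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=1.0, size=n)

def ridge_fit(X, y, lam):
    # Minimizer of (1/n) * ||X w - y||^2 + lam * ||w||^2.
    # Setting the gradient to zero gives (X^T X + n * lam * I) w = X^T y.
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
norms, train_mse = [], []
for lam in lambdas:
    w = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    train_mse.append(float(np.mean((X @ w - y) ** 2)))
    print(f"lambda={lam:<5}  ||w||={norms[-1]:.3f}  train MSE={train_mse[-1]:.3f}")
```

Larger lambda shrinks the weight norm and raises training error. Whether that trade lowers out-of-sample error has to be checked on validation data.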

03 / Code

sklearn baseline

Build an interpretable baseline before using a complex model.

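from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (feature matrix) and y (binary labels) are assumed to be loaded already.

# Scaling lives inside the pipeline, so it is refit on each training fold
# and no validation-fold statistics leak into training.
model = make_pipeline(
    StandardScaler(),
    # C is the inverse regularization strength: smaller C, stronger L2 penalty.
    LogisticRegression(C=1.0, penalty="l2", max_iter=1000),
)

# 5-fold cross-validated ROC AUC: mean and spread across folds.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())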
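
The baseline fixes C=1.0. In scikit-learn, C is the inverse of the regularization strength, so it plays the role of 1/lambda and can itself be chosen by cross-validation. The following is a sketch of one way to do that with GridSearchCV; the synthetic dataset from make_classification and the grid of C values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the C parameter is addressed as "logisticregression__C".
c_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    model,
    param_grid={"logisticregression__C": c_grid},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The best cross-validated score is itself optimistic, because it was picked as a maximum over the grid; a final estimate should come from data not used in the search.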

04 / Case

Case: paper acceptance risk prediction

  • Features include author experience, topic, abstract length, and prior citations.
  • Start with logistic regression as an interpretable baseline.
  • Then test whether forests or neural models really improve validation performance.
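
The comparison in the last step can be sketched as follows. No real paper-acceptance data is used here: make_classification stands in for the feature table, and the model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for features such as author experience or abstract length.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Same folds and same metric for every candidate, so the scores are comparable.
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="roc_auc")
    results[name] = (float(scores.mean()), float(scores.std()))
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

A more complex model earns its place only if its mean score exceeds the baseline by more than the fold-to-fold spread.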

05 / Risks

Common Pitfalls

Tuning on the test set.
Reporting only training accuracy.
Skipping simple baselines.
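
One workflow that avoids all three pitfalls is sketched below, assuming a three-way split on synthetic data (the 60/20/20 ratio, the candidate values of C, and the helper names fit and auc are illustrative). Tuning uses only the validation set, the test set is scored once at the end, and a trivial baseline is reported next to the model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

def fit(C):
    # Hypothetical helper: scaled logistic regression trained on the training set only.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    return pipe.fit(X_train, y_train)

def auc(model, X_part, y_part):
    return float(roc_auc_score(y_part, model.predict_proba(X_part)[:, 1]))

# Tune on the validation set, never on the test set.
c_candidates = [0.01, 0.1, 1.0, 10.0]
valid_auc = {C: auc(fit(C), X_valid, y_valid) for C in c_candidates}
best_C = max(valid_auc, key=valid_auc.get)

final_model = fit(best_C)
baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)

# Report training, validation, and test scores side by side, plus the baseline.
train_auc = auc(final_model, X_train, y_train)
test_auc = auc(final_model, X_test, y_test)  # scored once, after tuning is done
baseline_auc = auc(baseline, X_test, y_test)
print(f"best C={best_C}  train={train_auc:.3f}  "
      f"valid={valid_auc[best_C]:.3f}  test={test_auc:.3f}  baseline={baseline_auc:.3f}")
```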
