
Bayesian Optimization for Hyperparameter Tuning: A Practical Guide

10 min read
machine-learning · optimization · statistics

Grid search is dead. Random search is better but still wasteful. Bayesian optimization finds better hyperparameters in fewer evaluations by building a probabilistic model of the objective function.

The Core Idea

Instead of evaluating $f(\mathbf{x})$ at predetermined points, we:

  1. Build a surrogate model (typically a Gaussian process) of $f$
  2. Use an acquisition function to decide where to evaluate next
  3. Update the surrogate with the new observation
  4. Repeat until budget exhausted
$$\mathbf{x}_{n+1} = \arg\max_{\mathbf{x}} \alpha(\mathbf{x}; \mathcal{D}_n)$$

where $\alpha$ is the acquisition function and $\mathcal{D}_n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ are our observations.
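The loop above can be sketched in a few lines. This is a minimal sketch, assuming scikit-learn's `GaussianProcessRegressor` as the surrogate; the toy 1-D objective, the candidate grid, and the simple upper-confidence-bound acquisition are illustrative stand-ins, not a production setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Toy 1-D objective standing in for an expensive evaluation; maximum at x = 0.3.
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))           # small initial design
y = objective(X).ravel()
candidates = np.linspace(0, 1, 200).reshape(-1, 1)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)                              # 1. refit the surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                    # 2. a simple acquisition (UCB)
    x_next = candidates[np.argmax(ucb)]
    y_next = objective(x_next)                # 3. evaluate the true objective
    X = np.vstack([X, x_next.reshape(1, -1)]) # 4. add the observation, repeat
    y = np.append(y, y_next)

best_x = X[np.argmax(y)]
```

In practice the acquisition is optimized with restarts over a continuous domain rather than a fixed grid, but the structure of the loop is the same.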

Gaussian Process Priors

A GP defines a distribution over functions:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$$

The posterior after observing $\mathcal{D}_n$ gives us both a mean prediction and uncertainty:

$$\mu_n(\mathbf{x}) = \mathbf{k}^T (\mathbf{K} + \sigma^2\mathbf{I})^{-1} \mathbf{y}$$
$$\sigma_n^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T (\mathbf{K} + \sigma^2\mathbf{I})^{-1} \mathbf{k}$$

The uncertainty estimate is what makes Bayesian optimization sample-efficient: high-uncertainty regions are unexplored, and the acquisition function balances exploitation (promising predicted values) against exploration (high uncertainty).
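The exploration signal is easy to see directly. The following sketch (toy data, scikit-learn's `GaussianProcessRegressor` with a fixed RBF kernel assumed) compares the posterior standard deviation at an observed point with one in the gap between observations:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[0.1], [0.4], [0.9]])
y_train = np.sin(6 * X_train).ravel()

# optimizer=None keeps the illustrative length scale fixed instead of fitting it.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6, optimizer=None)
gp.fit(X_train, y_train)

# Posterior mean and standard deviation at an observed vs. an unobserved point.
mu_near, sd_near = gp.predict(np.array([[0.4]]), return_std=True)
mu_far, sd_far = gp.predict(np.array([[0.65]]), return_std=True)
# sd is near zero at an observed point and large in the gap between observations,
# which is exactly what the acquisition function exploits.
```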

Expected Improvement

The most common acquisition function, Expected Improvement, measures the expected amount by which we will improve on the best value observed so far, $f^*$ (written here for maximization; negate the objective when minimizing a validation loss):

$$\text{EI}(\mathbf{x}) = \mathbb{E}\left[\max(f(\mathbf{x}) - f^*, 0)\right]$$

This has a closed-form solution under the GP posterior, making it cheap to optimize.

import numpy as np
from scipy.stats import norm

def expected_improvement(X, gp_model, best_y):
    """Closed-form EI under a GP posterior (maximization convention)."""
    mu, sigma = gp_model.predict(X, return_std=True)
    # Standardized improvement; the small constant guards against sigma == 0.
    z = (mu - best_y) / (sigma + 1e-8)
    ei = (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)
    # EI is exactly zero where the posterior has no uncertainty.
    ei[sigma < 1e-8] = 0.0
    return ei
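A minimal usage sketch follows, assuming a scikit-learn GP fitted on a few toy observations; `expected_improvement` is restated here so the snippet is self-contained, and any model exposing `predict(X, return_std=True)` would work in its place:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X, gp_model, best_y):
    # As defined above (maximization convention).
    mu, sigma = gp_model.predict(X, return_std=True)
    z = (mu - best_y) / (sigma + 1e-8)
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy observations of an objective we want to maximize.
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = np.array([0.2, 0.9, 0.1])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# Score a candidate grid and pick the next evaluation point.
candidates = np.linspace(0, 1, 101).reshape(-1, 1)
ei = expected_improvement(candidates, gp, best_y=y_obs.max())
x_next = candidates[np.argmax(ei)]
```

Note that EI is zero at already-observed points (no uncertainty, no expected gain), so the argmax naturally lands somewhere new.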

When to Use What

| Method | Evaluations | Best For |
| --- | --- | --- |
| Grid Search | $O(k^d)$ | ≤3 hyperparameters |
| Random Search | Budget-limited | Initial exploration |
| Bayesian Opt | 10-200 | Expensive evaluations |
| Multi-fidelity | 100-1000 | Cheap approximations available |

For most deep learning tasks, I recommend starting with random search for the first 20 evaluations, then switching to Bayesian optimization with Expected Improvement for refinement.
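That hybrid recipe can be sketched end-to-end on a toy 1-D objective. The objective, seed, and budgets here are illustrative assumptions; scikit-learn supplies the GP surrogate:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive training run; smooth bump peaked at x = 0.7.
    return float(np.exp(-(x - 0.7) ** 2 / 0.05))

rng = np.random.default_rng(1)

# Phase 1: random search for broad initial coverage.
X = rng.uniform(0, 1, size=(20, 1))
y = np.array([objective(x[0]) for x in X])

# Phase 2: Bayesian optimization with Expected Improvement for refinement.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
candidates = np.linspace(0, 1, 500).reshape(-1, 1)
for _ in range(15):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    z = (mu - y.max()) / (sigma + 1e-8)
    ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, objective(float(x_next[0])))

best_x = float(X[np.argmax(y), 0])
```

The random phase gives the GP enough coverage to fit sensible hyperparameters; the EI phase then spends the remaining budget concentrating evaluations near the optimum.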