
Bayesian Optimization for Hyperparameter Tuning: A Practical Guide

10 min read
machine-learning · optimization · statistics

Grid search is dead. Random search is better but still wasteful. Bayesian optimization finds better hyperparameters in fewer evaluations by building a probabilistic model of the objective function.

The Core Idea

Instead of evaluating $f(\mathbf{x})$ at predetermined points, we:

  1. Build a surrogate model (typically a Gaussian process) of $f$
  2. Use an acquisition function to decide where to evaluate next
  3. Update the surrogate with the new observation
  4. Repeat until budget exhausted
$$\mathbf{x}_{n+1} = \arg\max_{\mathbf{x}} \alpha(\mathbf{x}; \mathcal{D}_n)$$

where $\alpha$ is the acquisition function and $\mathcal{D}_n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ are our observations.
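The loop above can be sketched in a few lines. This is a minimal sketch, assuming scikit-learn's `GaussianProcessRegressor` as the surrogate; the toy 1-D objective, the candidate grid, and the simple upper-confidence-bound acquisition are illustrative stand-ins, not a production setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Toy 1-D objective standing in for an expensive evaluation; maximum at x = 0.3.
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))           # small initial design
y = objective(X).ravel()
candidates = np.linspace(0, 1, 200).reshape(-1, 1)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)                              # 1. refit the surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                    # 2. a simple acquisition (UCB)
    x_next = candidates[np.argmax(ucb)]
    y_next = objective(x_next)                # 3. evaluate the true objective
    X = np.vstack([X, x_next.reshape(1, -1)]) # 4. add the observation, repeat
    y = np.append(y, y_next)

best_x = X[np.argmax(y)]
```

In practice the acquisition is optimized with restarts over a continuous domain rather than a fixed grid, but the structure of the loop is the same.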

Gaussian Process Priors

A GP defines a distribution over functions:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$$

The posterior after observing $\mathcal{D}_n$ gives us both a mean prediction and uncertainty:

$$\mu_n(\mathbf{x}) = \mathbf{k}^T (\mathbf{K} + \sigma^2\mathbf{I})^{-1} \mathbf{y}$$
$$\sigma_n^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T (\mathbf{K} + \sigma^2\mathbf{I})^{-1} \mathbf{k}$$

The uncertainty estimate is what makes Bayesian optimization sample-efficient: high-uncertainty regions are unexplored, and the acquisition function balances exploitation (promising predicted values) against exploration (high uncertainty).
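The exploration signal is easy to see directly. The following sketch (toy data, scikit-learn's `GaussianProcessRegressor` with a fixed RBF kernel assumed) compares the posterior standard deviation at an observed point with one in the gap between observations:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[0.1], [0.4], [0.9]])
y_train = np.sin(6 * X_train).ravel()

# optimizer=None keeps the illustrative length scale fixed instead of fitting it.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6, optimizer=None)
gp.fit(X_train, y_train)

# Posterior mean and standard deviation at an observed vs. an unobserved point.
mu_near, sd_near = gp.predict(np.array([[0.4]]), return_std=True)
mu_far, sd_far = gp.predict(np.array([[0.65]]), return_std=True)
# sd is near zero at an observed point and large in the gap between observations,
# which is exactly what the acquisition function exploits.
```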

Expected Improvement

The most common acquisition function, Expected Improvement, measures the expected amount by which we will improve on the best value observed so far, $f^*$ (written here for maximization; negate the objective when minimizing a validation loss):

$$\text{EI}(\mathbf{x}) = \mathbb{E}\left[\max(f(\mathbf{x}) - f^*, 0)\right]$$

This has a closed-form solution under the GP posterior, making it cheap to optimize.

import numpy as np
from scipy.stats import norm

def expected_improvement(X, gp_model, best_y):
    """Closed-form EI under a GP posterior (maximization convention)."""
    mu, sigma = gp_model.predict(X, return_std=True)
    # Standardized improvement; the small constant guards against sigma == 0.
    z = (mu - best_y) / (sigma + 1e-8)
    ei = (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)
    # EI is exactly zero where the posterior has no uncertainty.
    ei[sigma < 1e-8] = 0.0
    return ei
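A minimal usage sketch follows, assuming a scikit-learn GP fitted on a few toy observations; `expected_improvement` is restated here so the snippet is self-contained, and any model exposing `predict(X, return_std=True)` would work in its place:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X, gp_model, best_y):
    # As defined above (maximization convention).
    mu, sigma = gp_model.predict(X, return_std=True)
    z = (mu - best_y) / (sigma + 1e-8)
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy observations of an objective we want to maximize.
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = np.array([0.2, 0.9, 0.1])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# Score a candidate grid and pick the next evaluation point.
candidates = np.linspace(0, 1, 101).reshape(-1, 1)
ei = expected_improvement(candidates, gp, best_y=y_obs.max())
x_next = candidates[np.argmax(ei)]
```

Note that EI is zero at already-observed points (no uncertainty, no expected gain), so the argmax naturally lands somewhere new.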

When to Use What

| Method | Evaluations | Best For |
| --- | --- | --- |
| Grid Search | $O(k^d)$ | ≤3 hyperparameters |
| Random Search | Budget-limited | Initial exploration |
| Bayesian Opt | 10-200 | Expensive evaluations |
| Multi-fidelity | 100-1000 | Cheap approximations available |

For most deep learning tasks, I recommend starting with random search for the first 20 evaluations, then switching to Bayesian optimization with Expected Improvement for refinement.
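That hybrid recipe can be sketched end-to-end on a toy 1-D objective. The objective, seed, and budgets here are illustrative assumptions; scikit-learn supplies the GP surrogate:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive training run; smooth bump peaked at x = 0.7.
    return float(np.exp(-(x - 0.7) ** 2 / 0.05))

rng = np.random.default_rng(1)

# Phase 1: random search for broad initial coverage.
X = rng.uniform(0, 1, size=(20, 1))
y = np.array([objective(x[0]) for x in X])

# Phase 2: Bayesian optimization with Expected Improvement for refinement.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
candidates = np.linspace(0, 1, 500).reshape(-1, 1)
for _ in range(15):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    z = (mu - y.max()) / (sigma + 1e-8)
    ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, objective(float(x_next[0])))

best_x = float(X[np.argmax(y), 0])
```

The random phase gives the GP enough coverage to fit sensible hyperparameters; the EI phase then spends the remaining budget concentrating evaluations near the optimum.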