􀉂􀉂
􀈚􀈚􀈚􀈚􀈚􀈚􀈚􀈚􀈚
􀉂􀉂􀈘􀈚􀉂􀉃􀈚􀉂􀉂􀉃􀈚􀉃􀈚
MIT 6.7220/15.084 — Nonlinear Optimization Thu, Apr 25th 2024
Lecture 16
Bayesian optimization
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
Having seen first-order and second-order methods, today we briefly take a look at a completely
different approach to optimization: Bayesian optimization. This method is particularly useful when
the function to be optimized is expensive to evaluate, and we have no information about its gradient.
Bayesian optimization is a heuristic approach that is applicable to low-dimensional optimization
problems. Since it avoids using gradient information altogether, it is a popular approach for
hyperparameter tuning, architecture search, et cetera.
1 A Bayesian approach to optimization
In Bayesian optimization, we are still interested in finding the minimizer of a function. However,
unlike first- and second-order methods, we do not assume access to the gradient or Hessian of the
function. Rather, we assume only that given any point 𝑥, we can query 𝑓 at 𝑥 and obtain the value
𝑓(𝑥).
In the Bayesian approach, instead of finding the best 𝑥 given 𝑓, we try to figure out the best model
of 𝑓 given its evaluations at the previously observed points. By conditioning on the observed points,
we can then extract an expected function 𝑓 and a confidence interval, and decide what point to
query next based on this information. In other words, Bayesian optimization is a sophisticated way
of fitting the data points with a function, and then optimistically querying the point where the
function is expected to be the smallest.
Example 1.1. To build intuition, let us consider
the example depicted on the right. Five points
(the blue dots) have been computed for 𝑓; the
minimum value observed so far is 𝑓⋆ = −2. Out
of the prior over all functions, a posterior was
computed by conditioning on these points. The
blue solid curve represents the mean 𝜇(𝑥) of
the posterior, while the brown shaded area represents one standard deviation ±𝜎(𝑥) away from
the mean.
Below the plot, the expected improvement (EI) is shown; this is a heuristic that suggests where
to query next. Specifically,
$$\mathrm{EI}(x) := \mathbb{E}_{\tilde f}\left[\max\{0, f^\star - \tilde f(x)\}\right],$$
where at each 𝑥, the expectation is with respect to the conditional distribution of 𝑓(𝑥).
[Figure: five observed points (blue dots), the posterior mean 𝜇(𝑥) (solid blue curve) with the ±𝜎(𝑥) band (brown shading), and the expected improvement EI(𝑥) plotted underneath.]
Of course, many details are missing from this example: What was the prior over the functions? How
was the posterior computed? How was the expected improvement computed?
In the following, we will address these aspects in more detail.
2 Gaussian processes and regression
In order to keep the conditioning tractable, a very popular modeling choice is that the function 𝑓 is
sampled from a Gaussian process. A Gaussian process postulates that the function 𝑓 must be such
that for any finite set of points 𝑥1, 𝑥2, ..., 𝑥𝑡 ∈ ℝ𝑛, the vector (𝑓(𝑥1), 𝑓(𝑥2), ..., 𝑓(𝑥𝑡)) is distributed as
a multivariate Gaussian. This means that a Gaussian process is completely defined by the following
two quantities:
• the mean function 𝑚(𝑥), assigning to each 𝑥 the expected value 𝔼[𝑓(𝑥)]; and
• the covariance function K(𝑥, 𝑦), assigning to each pair of points 𝑥, 𝑦 the covariance between
𝑓(𝑥) and 𝑓(𝑦).
2.1 Kernel functions
The covariance function K(𝑥, 𝑦) is also called the kernel of the Gaussian process. Intuitively, it
measures how correlated the values of 𝑓(𝑥) and 𝑓(𝑦) are. It is reasonable that as 𝑥 and 𝑦 get farther
apart, the value of 𝑓(𝑥) becomes less informative about the value of 𝑓(𝑦), and vice versa; so the
covariance should decrease. In
practice, the most popular choice for the kernel is the radial basis function (RBF) kernel,¹ defined
as follows.
¹For example, this is the default in scikit-learn, a popular Python machine learning toolkit.
Definition 2.1. The radial basis function (RBF) kernel—also known as squared exponential
kernel or Gaussian kernel—is defined as
$$K(x, y) := k \cdot \exp\left\{-\frac{\|x - y\|_2^2}{2\sigma^2}\right\},$$
where 𝑘 > 0 and 𝜎 > 0 are parameters of the kernel.
The radial basis function kernel is a popular choice because it is smooth and has a simple form. It is
also a universal kernel, meaning that any continuous function can be approximated arbitrarily well
by a Gaussian process with an RBF kernel [MXZ06].
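As a concrete reference, the kernel is a one-liner in code. The following is a minimal NumPy sketch (the names rbf_kernel and kernel_matrix are ours, not from any particular library):

```python
import numpy as np

def rbf_kernel(x, y, k=1.0, sigma=1.0):
    """RBF kernel: K(x, y) = k * exp(-||x - y||_2^2 / (2 sigma^2))."""
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return k * np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_matrix(xs, k=1.0, sigma=1.0):
    """Gram matrix [K(x_i, x_j)] for a list of points xs."""
    return np.array([[rbf_kernel(a, b, k, sigma) for b in xs] for a in xs])
```

For comparison, scikit-learn expresses the same kernel as the product of a ConstantKernel (playing the role of our 𝑘) and an RBF kernel whose length_scale plays the role of our 𝜎.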
2.2 The effect of the RBF kernel parameters
While we have not yet discussed how to carry out the computations necessary for conditioning a
Gaussian process on observed points, we can already anticipate how the parameters of the RBF
kernel affect the regression. Specifically,
• The parameter 𝜎 controls how fast the correlation between neighboring points decays. If 𝜎
is large, the function will be very smooth, because values remain correlated over long
distances; if 𝜎 is small, the function will be very wiggly, reverting quickly to the prior mean.
• The parameter 𝑘 controls the amplitude of the function: the larger 𝑘, the more variance in
the function values, uniformly throughout the domain. This is easy to appreciate by
considering that the variance of the Gaussian process at any point 𝑥 is K(𝑥, 𝑥) = 𝑘. Both
effects are illustrated in the sketch after this list.
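To see both effects directly, one can draw sample paths from the prior: the values on a finite grid are jointly Gaussian with covariance given by the kernel, so multiplying a Cholesky factor of the Gram matrix by standard Gaussian noise produces a draw. A minimal sketch of ours, with the grid and parameter pairs chosen to match the four panels of Example 2.1 below:

```python
import numpy as np

def rbf_gram(xs, k=1.0, sigma=1.0):
    # Gram matrix [K(x_i, x_j)] of the RBF kernel on a 1-D grid xs.
    d = xs[:, None] - xs[None, :]
    return k * np.exp(-d ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
xs = np.linspace(-3.0, 6.0, 200)
for k, sigma in [(1.0, 0.25), (1.0, 2.0), (0.5, 1.0), (2.0, 1.0)]:
    cov = rbf_gram(xs, k, sigma) + 1e-6 * np.eye(xs.size)  # jitter for numerical stability
    path = np.linalg.cholesky(cov) @ rng.standard_normal(xs.size)
    # Large sigma -> slowly varying path; large k -> larger amplitude.
    print(f"k={k}, sigma={sigma}: sample path range = {path.max() - path.min():.2f}")
```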
Example 2.1. In the examples below, we confirm the effect of varying the RBF kernel parameters 𝑘 and 𝜎 when conditioning on 𝑓(−2) = −1, 𝑓(1) = −2, and 𝑓(5) = 1, under the
common assumption that 𝑚(𝑥) = 0 for all 𝑥.
[Figure: four Gaussian process regressions conditioned on these three points, for the parameter choices (𝑘 = 1, 𝜎 = 1/4), (𝑘 = 1, 𝜎 = 2), (𝑘 = 1/2, 𝜎 = 1), and (𝑘 = 2, 𝜎 = 1).]
In all plots, the blue curve represents the mean of the conditioned Gaussian process, while the
brown shaded area represents one standard deviation.
2.3 Gaussian process conditioning
Given a set of points 𝑥1, 𝑥2, ..., 𝑥𝑡 and their corresponding function values, the Gaussian process is
conditioned on these points to extract the posterior over the function. An important result in the
theory of Gaussian processes—and the reason why these stochastic processes are often considered—
is that it is possible to characterize the conditional distribution of the value of 𝑓(𝑥) for any 𝑥 not in
the set of observed points, in closed form.
Theorem 2.1. Let 𝑓 be drawn from a Gaussian process, and 𝑓(𝑥1), 𝑓(𝑥2), ..., 𝑓(𝑥𝑡) be the observed function values at distinct points 𝑥1, 𝑥2, ..., 𝑥𝑡. The posterior distribution of the function
value at any point 𝑥 is a Gaussian distribution with mean and variance given by

$$\mu(x) := m(x) + \begin{pmatrix} K(x, x_1) \\ \vdots \\ K(x, x_t) \end{pmatrix}^{\!\top} \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_t) \\ \vdots & \ddots & \vdots \\ K(x_t, x_1) & \cdots & K(x_t, x_t) \end{pmatrix}^{-1} \begin{pmatrix} f(x_1) - m(x_1) \\ \vdots \\ f(x_t) - m(x_t) \end{pmatrix},$$

and

$$\sigma^2(x) = K(x, x) - \begin{pmatrix} K(x, x_1) \\ \vdots \\ K(x, x_t) \end{pmatrix}^{\!\top} \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_t) \\ \vdots & \ddots & \vdots \\ K(x_t, x_1) & \cdots & K(x_t, x_t) \end{pmatrix}^{-1} \begin{pmatrix} K(x, x_1) \\ \vdots \\ K(x, x_t) \end{pmatrix}.$$
The formulas above are known results regarding conditioning of multivariate Gaussians. While we
will not prove them here, we can easily verify that they indeed interpolate the observed data
correctly. Specifically, we have the following.
Remark 2.1. For any observed datapoint 𝑥𝑖, one has 𝜇(𝑥𝑖) = 𝑓(𝑥𝑖) and 𝜎²(𝑥𝑖) = 0. Furthermore,
𝜎²(𝑥) ≥ 0 for all 𝑥 ∈ ℝ𝑛.
Proof. To simplify notation, let

$$K(x, X) := \begin{pmatrix} K(x, x_1) \\ \vdots \\ K(x, x_t) \end{pmatrix}, \qquad K(X, X) := \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_t) \\ \vdots & \ddots & \vdots \\ K(x_t, x_1) & \cdots & K(x_t, x_t) \end{pmatrix}.$$

Letting 𝑒𝑖 ∈ ℝ𝑡 denote the 𝑖-th indicator vector (that is, the vector containing a 1 in the 𝑖-th
position and 0 elsewhere), it is easy to check that

$$K(x_i, X) = K(X, X)\, e_i.$$
Hence, plugging the previous relationship into the definition of 𝜇(𝑥𝑖), we can write

$$\begin{aligned} \mu(x_i) &= m(x_i) + K(x_i, X)^{\top} K(X, X)^{-1} \begin{pmatrix} f(x_1) - m(x_1) \\ \vdots \\ f(x_t) - m(x_t) \end{pmatrix} \\ &= m(x_i) + e_i^{\top} K(X, X)^{\top} K(X, X)^{-1} \begin{pmatrix} f(x_1) - m(x_1) \\ \vdots \\ f(x_t) - m(x_t) \end{pmatrix} \\ &= f(x_i), \end{aligned}$$

where the last step uses the symmetry of K(X, X), so that $K(X, X)^{\top} K(X, X)^{-1}$ is the identity.
Similarly, plugging the previous relationship into the definition of 𝜎²(𝑥𝑖), we have

$$\begin{aligned} \sigma^2(x_i) &= K(x_i, x_i) - K(x_i, X)^{\top} K(X, X)^{-1} K(x_i, X) \\ &= K(x_i, x_i) - e_i^{\top} K(X, X)^{\top} K(X, X)^{-1} K(X, X)\, e_i \\ &= K(x_i, x_i) - K(x_i, x_i) = 0. \end{aligned}$$
Finally, the nonnegativity of 𝜎²(𝑥) follows from the observation that

$$\begin{aligned} \sigma^2(x) &= K(x, x) - K(x, X)^{\top} K(X, X)^{-1} K(x, X) \\ &= \begin{pmatrix} 1 \\ -K(X, X)^{-1} K(x, X) \end{pmatrix}^{\!\top} \begin{pmatrix} K(x, x) & K(x, X)^{\top} \\ K(x, X) & K(X, X) \end{pmatrix} \begin{pmatrix} 1 \\ -K(X, X)^{-1} K(x, X) \end{pmatrix}. \end{aligned}$$
Since the middle square matrix is a covariance matrix, it is positive semidefinite, and therefore
any quadratic form involving it is nonnegative.
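The formulas of Theorem 2.1 translate directly into a few lines of NumPy. Here is a minimal sketch (the function name gp_posterior is ours); production implementations typically add a small jitter to the diagonal of K(X, X) and solve via a Cholesky factorization for numerical stability:

```python
import numpy as np

def gp_posterior(x, xs, fs, kernel, m=lambda z: 0.0):
    """Posterior mean and variance at x given observations (xs, fs), per Theorem 2.1."""
    K_XX = np.array([[kernel(a, b) for b in xs] for a in xs])  # K(X, X)
    k_xX = np.array([kernel(x, a) for a in xs])                # K(x, X)
    resid = np.array([f - m(a) for a, f in zip(xs, fs)])       # f(X) - m(X)
    mu = m(x) + k_xX @ np.linalg.solve(K_XX, resid)
    var = kernel(x, x) - k_xX @ np.linalg.solve(K_XX, k_xX)
    return mu, max(var, 0.0)  # clip tiny negative values due to round-off

# Example: condition on the three points of Example 2.1 (RBF kernel, k = 1, sigma = 1).
kern = lambda a, b: np.exp(-(a - b) ** 2 / 2.0)
mu, var = gp_posterior(0.0, [-2.0, 1.0, 5.0], [-1.0, -2.0, 1.0], kern)
```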
3 Acquisition functions
Finally, we need a heuristic to decide what point 𝑥 ∈ ℝ𝑛 to query next. This is where the acquisition
function comes in: a heuristic that suggests where to query next based on the posterior distribution
of the function. The most popular acquisition function is the expected improvement (EI), defined
as follows.
Definition 3.1. The expected improvement (EI) at a point 𝑥 is defined as

$$\mathrm{EI}(x) := \mathbb{E}_{\tilde f}\left[\max\{0, f^\star - \tilde f(x)\}\right],$$

where 𝑓⋆ is the minimum value observed so far, and the expectation is taken with respect to ̃𝑓
sampled from the conditional distribution of 𝑓. In particular, for a Gaussian process, the conditional distribution of 𝑓(𝑥) is the Gaussian with mean 𝜇(𝑥) and variance 𝜎²(𝑥), which means that

$$\begin{aligned} \mathrm{EI}(x) &= \frac{1}{\sigma(x)\sqrt{2\pi}} \int_{-\infty}^{+\infty} \max\{0, f^\star - v\} \, \exp\left\{-\frac{1}{2}\left(\frac{v - \mu(x)}{\sigma(x)}\right)^{\!2}\right\} \mathrm{d}v \\ &= \frac{1}{\sigma(x)\sqrt{2\pi}} \int_{-\infty}^{f^\star} (f^\star - v) \, \exp\left\{-\frac{1}{2}\left(\frac{v - \mu(x)}{\sigma(x)}\right)^{\!2}\right\} \mathrm{d}v. \end{aligned}$$

This can be computed in closed form starting from the PDF and CDF of the Gaussian distribution. [▷ Try to work out the details!]
The next point to query is then the one that maximizes the expected improvement.
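For reference (and as a spoiler for the exercise above), working out the integral yields the standard closed form: writing 𝑧 = (𝑓⋆ − 𝜇(𝑥))/𝜎(𝑥), one finds EI(𝑥) = (𝑓⋆ − 𝜇(𝑥)) Φ(𝑧) + 𝜎(𝑥) 𝜑(𝑧), where Φ and 𝜑 are the standard Gaussian CDF and PDF. A minimal sketch of ours using SciPy:

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, f_star):
    """Closed-form EI for minimization, assuming f~(x) ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(0.0, f_star - mu)  # no posterior uncertainty (e.g., at observed points)
    z = (f_star - mu) / sigma
    return (f_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```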
Alternatively, one could use other acquisition functions, such as the probability of improvement (PI),

$$\mathrm{PI}(x) := \mathbb{P}_{\tilde f}\left[f^\star - \tilde f(x) \ge 0\right] = \mathbb{E}_{\tilde f}\left[\mathbb{1}\{f^\star - \tilde f(x) \ge 0\}\right] = \frac{1}{\sigma(x)\sqrt{2\pi}} \int_{-\infty}^{f^\star} \exp\left\{-\frac{1}{2}\left(\frac{v - \mu(x)}{\sigma(x)}\right)^{\!2}\right\} \mathrm{d}v,$$

or select the point that minimizes the lower confidence bound (LCB),

$$\mathrm{LCB}(x) := \mu(x) - \beta\,\sigma(x), \quad \text{where } \beta \ge 0.$$
The choice of acquisition function is a crucial aspect of Bayesian optimization, and it is often
problem-dependent.
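For completeness, under the same Gaussian posterior both alternatives are one-liners. The sketch below is ours, and 𝛽 = 2 is purely an illustrative default:

```python
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_star):
    # PI(x) = P[f~(x) <= f*] for f~(x) ~ N(mu, sigma^2), sigma > 0.
    return norm.cdf((f_star - mu) / sigma)

def lower_confidence_bound(mu, sigma, beta=2.0):
    # The next query is the point minimizing mu(x) - beta * sigma(x).
    return mu - beta * sigma
```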
4 Simulation
Example 4.1. We provide an extended example with the function 𝑓(𝑥) = (1/3) 𝑥 cos(𝑥), which is
superimposed as a dotted curve in the simulation of the Bayesian optimization process below. The
acquisition function was set to the expected improvement, and the Gaussian process regression
uses the RBF kernel with 𝑘 = 1 and 𝜎 = 1. The last-queried point is shown in red.
[Figure: ten snapshots of the simulation at 𝑡 = 1, 2, ..., 10. Each panel shows the posterior mean, the ±𝜎(𝑥) band, the expected improvement EI(𝑥), and the last-queried point in red.]
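Putting the pieces together, a self-contained sketch of the loop behind a simulation like this one might look as follows. The initial query point and the grid-based maximization of EI are simplifications of ours, not details taken from the figure:

```python
import numpy as np
from scipy.stats import norm

f = lambda x: x * np.cos(x) / 3.0                # target from Example 4.1
kern = lambda a, b: np.exp(-(a - b) ** 2 / 2.0)  # RBF kernel, k = 1, sigma = 1

def posterior(x, xs, fs):
    # Posterior mean and standard deviation per Theorem 2.1 (zero prior mean).
    K = np.array([[kern(a, b) for b in xs] for a in xs]) + 1e-8 * np.eye(len(xs))
    kx = np.array([kern(x, a) for a in xs])
    mu = kx @ np.linalg.solve(K, np.array(fs))
    var = max(kern(x, x) - kx @ np.linalg.solve(K, kx), 0.0)
    return mu, np.sqrt(var)

def ei(x, xs, fs):
    # Closed-form expected improvement (for minimization).
    mu, sd = posterior(x, xs, fs)
    if sd == 0.0:
        return 0.0
    z = (min(fs) - mu) / sd
    return (min(fs) - mu) * norm.cdf(z) + sd * norm.pdf(z)

grid = np.linspace(-3.0, 6.0, 400)
xs, fs = [0.0], [f(0.0)]                         # arbitrary initial query
for t in range(1, 11):
    x_next = max(grid, key=lambda x: ei(x, xs, fs))  # maximize EI on the grid
    xs.append(x_next)
    fs.append(f(x_next))
    print(f"t = {t}: queried x = {x_next:+.3f}, best f so far = {min(fs):+.4f}")
```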
Bibliography
[MXZ06] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal Kernels,” Journal of Machine Learning
Research, vol. 7, no. 12, 2006.