
􀄹􀄹􀄹􀄹􀄹􀄹􀄹􀄹􀄹􀊘
􀊘􀊘􀊘􀊘􀊘􀊚􀊚􀊘􀊘􀊚􀋂􀊚􀋂􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀋂􀋂􀋂􀋂􀊚􀋂􀊚􀋂􀊘􀋂􀊘􀋂􀊚􀋂􀊘􀋂􀋂􀊚􀋂􀊘􀋂
􀋂􀊘􀋂􀊚􀋂􀊚􀋂􀊘􀋂􀋂􀋂􀋂􀊚􀋂􀋂􀋂􀋂􀋂􀊚
MIT 6.7220/15.084 — Nonlinear Optimization Tue, Apr 2nd 2024
Lecture 10
Stochastic gradient descent
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
As we have seen in the past few lectures, gradient descent and its family of algorithms (including
accelerated gradient descent, projected gradient descent and mirror descent) are first-order methods
that can compute approximate minima of differentiable functions. The cost of each iteration of these
algorithms is dominated (linearly) by the time it takes to evaluate the gradient of the objective
function at each point generated by the algorithm.
1 Preliminary words on stochastic methods
In some cases, including in most machine learning applications, the objective function contains a
huge number of terms (usually, a summation over training data points), rendering the gradient computation prohibitively expensive. Today, we will investigate an approach to sidestep this difficulty
in practice. The idea is simple, and it is a general principle of algorithm design broadly:
If exact computation is expensive, replace it with a cheaper estimate.
In particular, as is often the case, we will investigate replacing the computation of an exact gradient with the computation of a cheap, unbiased random estimator of it. This approach, based on randomness (a.k.a. stochasticity), leads to the family of algorithms that go under the name of stochastic gradient descent.
As one might expect, such a stochastic approach is not free. We will be trading an exact quantity (the
exact gradient of the objective function) for a cheap estimate that is subject to variance, subjecting
the resulting algorithm to random (erratic) steps rather than clean descent steps. This tradeoff is
nonetheless extremely favorable for multiple reasons:
• As a matter of pragmatism, stochastic gradient descent algorithms can perform a significantly
larger number of steps in the same time that it takes (exact) gradient descent to perform even
a single step. In certain cases, the exact algorithm might not be able to perform a single step in
the allotted computation budget. By this metric, the choice between an exact and a stochastic
method reduces to the choice between an algorithm that is not able to start, and one that—
as erratic as it may be—at least makes some progress.
• As a subtler comment, the use of stochasticity, as opposed to using a powerful exact method,
has been observed empirically to lead to better solutions in machine learning settings, even
when the exact algorithm can be run fast enough. In machine learning, minimizing the objective function on the training data correlates with, but is conceptually different from, finding a good model for the task at hand. In other words, the role of optimization in machine learning is to
(i) produce models that interpolate well the training dataset, but also (ii) do not overfit, that
is, generalize well to unseen but in-distribution (“similar”) examples. Using stochastic gradient
descent has been linked with a reduction in overfitting and increased success on this second
goal, partly due to the presence of noise, which enables the algorithm to escape local minima
and saddle points.
2 Empirical risk minimization (ERM) problems in machine learning
Many problems in machine learning can be abstracted as follows:
• We are given a dataset of 𝑘 examples (𝑧1, 𝑦1), ..., (𝑧𝑘, 𝑦𝑘), where 𝑧𝑖 ∈ ℝ𝑛 is a feature vector and
𝑦𝑖 ∈ ℝ is a label. For example:
we might be interested in building a classifier for handwritten digits, where each 𝑧𝑖 is
a vector containing 28 × 28 grayscale pixel intensities, and each 𝑦𝑖 is the digit itself (a
multinomial classification problem); or
we might be interested in predicting the price of a house, where each 𝑧𝑖 is a vector containing features of the house (e.g., number of bedrooms, square footage, etc.), and each
𝑦𝑖 is the price of the house (a regression problem); or
we might be interested in predicting the next word in a sentence, where each 𝑧𝑖 is a vector
containing the previous words in the sentence, and each 𝑦𝑖 is the next word (a multinomial
classification problem); or
we might be interested in predicting the probability of a user clicking on an ad, where
each 𝑧𝑖 is a vector containing features of the user and the ad, and each 𝑦𝑖 is the probability
of clicking (a binary—that is, two-class—classification problem); et cetera.
• We have selected a model class of functions 𝑔𝜃 : ℝ𝑛 → ℝ𝑚 that we believe can capture the
relationship between the features and the labels. These functions are parameterized by a vector
𝜃 ∈ ℝ𝑑. For example, in the case of neural networks, 𝜃 would be the set of layer weights and
biases.
The output dimension 𝑚 of the model is typically 𝑚 = 1 in the case of regression and
binary classification problems (indicating the probability of the, say, positive class).
In the case of multinomial classification, 𝑚 is typically the number of classes and indicates
the (unnormalized) log-probability the model assigns to the event that 𝑧𝑖 belongs to each
possible class.
• We are interested in finding a choice of parameters 𝜃 that “fits the data well”. This is typically
done by minimizing the empirical risk (or empirical loss) of the model, defined as
𝐽emp(𝜃) ≔ (1/𝑘) ∑_{𝑖=1}^{𝑘} ℓ(𝑔𝜃(𝑧𝑖), 𝑦𝑖),
where ℓ : ℝ𝑚 × ℝ → ℝ is a loss function that measures the discrepancy between the model’s
prediction 𝑔𝜃(𝑧𝑖) and the true label 𝑦𝑖. For example, in regression problems, a reasonable choice
might be the squared loss ℓ(𝑎, 𝑏) = (𝑎 − 𝑏)2. For classification, in which the model output
𝑔𝜃(𝑧𝑖) = (𝑔1,𝜃(𝑧𝑖), ..., 𝑔𝑚,𝜃(𝑧𝑖)) ∈ ℝ𝑚 predicts the unnormalized log-probability of each class
and 𝑦𝑖 ∈ {1, ..., 𝑚}, the log loss (also known as cross entropy loss or softmax loss) is often used:
ℓ(𝑔𝜃(𝑧𝑖), 𝑦𝑖) = − ∑_{𝑗=1}^{𝑚} 𝟙[𝑦𝑖 = 𝑗] ⋅ log( exp{𝑔𝑗,𝜃(𝑧𝑖)} / ∑_{𝑗′=1}^{𝑚} exp{𝑔𝑗′,𝜃(𝑧𝑖)} ).
In many cases, an explicit regularization term is added on top of 𝐽emp to reduce overfitting.
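To make the objective concrete, here is a minimal NumPy sketch of 𝐽emp (an added illustration, not code from these notes; all names are mine, and the stabilized log loss is one standard way, among many, to implement the formula above).

```python
import numpy as np

def squared_loss(pred, y):
    # l(a, b) = (a - b)^2, for regression (m = 1).
    return (pred - y) ** 2

def log_loss(logits, y):
    # Cross-entropy on unnormalized log-probabilities (logits in R^m);
    # y is the zero-indexed class label. Subtracting the max before
    # exponentiating is the usual numerical-stability trick.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[y]

def empirical_risk(theta, model, loss, Z, Y):
    # J_emp(theta) = (1/k) * sum_i loss(g_theta(z_i), y_i).
    k = len(Y)
    return sum(loss(model(theta, Z[i]), Y[i]) for i in range(k)) / k

# Example usage with a linear model g_theta(z) = <theta, z> (regression):
# empirical_risk(theta, lambda th, z: th @ z, squared_loss, Z, Y)
```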
3 Stochastic gradient descent
The empirical risk minimization objective 𝑓 = 𝐽emp defined above is a sum of 𝑘 terms, one for each
example in the dataset. When the dataset is large, evaluating the gradient of 𝐽emp at each iteration
of gradient descent can be computationally expensive. In this case, we can replace at each iteration
the exact gradient ∇𝑓 with a cheap, unbiased estimator ∇̃𝑓 of it. This means that, denoting with 𝔼𝑡 the expectation conditioned on all past random choices (that is, all randomization used at times 1, ..., 𝑡 − 1), the estimator ∇̃𝑓(𝑥𝑡) satisfies 𝔼𝑡[∇̃𝑓(𝑥𝑡)] = ∇𝑓(𝑥𝑡). This leads to the stochastic gradient descent (SGD) algorithm,

𝑥𝑡+1 ≔ 𝑥𝑡 − 𝜂 ∇̃𝑓(𝑥𝑡), where 𝔼𝑡[∇̃𝑓(𝑥𝑡)] = ∇𝑓(𝑥𝑡). (1)
3.1 Typical unbiased estimator: mini-batches
Different instantiations of SGD differ in the choice of the estimator ∇̃𝑓(𝑥𝑡). For empirical risk minimization problems, the most common choice is to replace the gradient of 𝑓 = 𝐽emp with the average gradient of a uniformly random subset of training data (typically sampled without replacement), called a mini-batch. The resulting instantiation of SGD is therefore known as mini-batch SGD.
Computing the random mini-batch is usually easy to do by using a random shuffle of the dataset,
and selecting the first 𝑏 examples as the mini-batch. The size of the mini-batch 𝑏 is a hyperparameter
of the algorithm. A popular choice is 𝑏 = 32 or 𝑏 = 64.
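A minimal sketch of this shuffle-and-slice scheme follows (an added illustration; grad_example, which returns the gradient of a single loss term, is an assumed user-supplied helper, not something defined in these notes).

```python
import numpy as np

def minibatch_sgd(theta0, grad_example, Z, Y, eta=0.01, b=32, epochs=10, seed=0):
    # grad_example(theta, z, y): gradient of l(g_theta(z), y) with respect
    # to theta (assumed supplied by the user).
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    k = len(Y)
    for _ in range(epochs):
        perm = rng.permutation(k)           # random shuffle of the dataset
        for start in range(0, k, b):
            batch = perm[start:start + b]   # next b examples form the mini-batch
            # Average gradient over the mini-batch: an unbiased estimate of
            # the gradient of J_emp when the batch is uniformly random.
            g = np.mean([grad_example(theta, Z[i], Y[i]) for i in batch], axis=0)
            theta -= eta * g                # the SGD step (1)
    return theta
```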
4 Analysis of stochastic gradient descent
As we have already seen several times, the analysis of variants of the gradient descent algorithm
passes through generalizing the two key descent lemmas: the gradient descent lemma and the
(Euclidean) mirror descent lemma. In this case, we will need to modify the lemmas to account for
the fact that the gradient of the objective has been replaced with an unbiased estimator in (1).
4.1 Stochastic gradient descent lemma
Recall that an 𝐿-smooth function 𝑓 : ℝ𝑛 → ℝ satisfies the quadratic upper bound
𝑓(𝑦) ≤ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ + (𝐿/2) ‖𝑥 − 𝑦‖₂²   ∀𝑥, 𝑦 ∈ ℝ𝑛.
Exactly as we did in Lecture 7, we apply the previous bound for the specific choice 𝑦 = 𝑥𝑡+1, 𝑥 =
𝑥𝑡, and obtain
𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥𝑡) − 𝜂⟨∇𝑓(𝑥𝑡), ∇̃𝑓(𝑥𝑡)⟩ + (𝐿/2) 𝜂² ‖∇̃𝑓(𝑥𝑡)‖₂².
Taking the (conditional) expectation on both sides and using the unbiasedness 𝔼𝑡[∇̃𝑓(𝑥𝑡)] = ∇𝑓(𝑥𝑡),
we therefore obtain the following stochastic generalization of the gradient descent lemma.
Theorem 4.1 (Stochastic gradient descent lemma). Let 𝑓 : ℝ𝑛 → ℝ be 𝐿-smooth and 𝜂 > 0 be an arbitrary stepsize. Two consecutive iterates (𝑥𝑡, 𝑥𝑡+1) produced by the stochastic gradient descent algorithm (1) satisfy

𝔼𝑡[𝑓(𝑥𝑡+1)] ≤ 𝑓(𝑥𝑡) − 𝜂 ‖∇𝑓(𝑥𝑡)‖₂² + (𝐿/2) 𝜂² 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²].
The quantity 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²], which controls the quadratic term in the previous bound, is often called the variance of the stochastic gradient ∇̃𝑓.
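As a quick numerical illustration (an added example, not from the notes): the mini-batch estimator of Section 3.1 is unbiased for any batch size 𝑏, while its second moment, the bound 𝐺 appearing below, shrinks as 𝑏 grows.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 1000, 5
grads = rng.standard_normal((k, n))   # per-example gradients at a fixed point x_t
full_grad = grads.mean(axis=0)        # the exact gradient is their average

for b in [1, 10, 100]:
    est = np.array([grads[rng.choice(k, size=b, replace=False)].mean(axis=0)
                    for _ in range(20000)])
    bias = np.linalg.norm(est.mean(axis=0) - full_grad)        # ~ 0: unbiased
    second_moment = (np.linalg.norm(est, axis=1) ** 2).mean()  # decreases in b
    print(f"b={b}: bias~{bias:.3f}, E||grad_est||^2~{second_moment:.3f}")
```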
4.2 Convergence in gradient norm
Just like the gradient descent lemma for exact gradient descent, the stochastic gradient descent
lemma guarantees descent in function value, in expectation, when 𝜂 > 0 is sufficiently small. In the
stochastic case, perhaps unsurprisingly, this threshold value of 𝜂 depends inversely proportionally
on the variance of the unbiased estimator. Specifically, suppose that 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²] ≤ 𝐺 at all times 𝑡. Then, the stochastic gradient descent lemma can be rearranged as

‖∇𝑓(𝑥𝑡)‖₂² ≤ (1/𝜂) 𝔼𝑡[𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1)] + (𝐿/2) 𝜂𝐺.
Summing over all times 𝑡 = 0, 1, ..., 𝑇 − 1, we therefore obtain

∑_{𝑡=0}^{𝑇−1} ‖∇𝑓(𝑥𝑡)‖₂² ≤ (1/𝜂) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1)] + (𝐿/2) 𝜂𝐺𝑇.
Taking expectations on both sides, the conditional expectations 𝔼𝑡 decay into regular expectations 𝔼, and we can therefore exploit the telescoping nature of the sum 𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1), obtaining

∑_{𝑡=0}^{𝑇−1} 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≤ (1/𝜂) (𝑓(𝑥0) − 𝔼[𝑓(𝑥𝑇)]) + (𝐿/2) 𝜂𝐺𝑇.
Assuming the function is lower bounded by the value 𝑓⋆, we therefore have

(1/𝑇) ∑_{𝑡=0}^{𝑇−1} 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≤ (𝑓(𝑥0) − 𝑓⋆)/(𝜂𝑇) + (𝐿/2) 𝜂𝐺.
Picking 𝜂 ≈ 1/√𝑇 then reveals that within 𝑇 iterations, at least one iterate satisfies 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≈ 1/√𝑇, and we recover convergence in gradient norm no matter the (bounded) variance of the estimator.
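To spell out the balancing step behind the choice 𝜂 ≈ 1/√𝑇 (a standard computation, added here for convenience): the right-hand side (𝑓(𝑥0) − 𝑓⋆)/(𝜂𝑇) + (𝐿/2) 𝜂𝐺 is minimized over 𝜂 > 0 when the two terms are equal, that is, at 𝜂 = √(2(𝑓(𝑥0) − 𝑓⋆)/(𝐿𝐺𝑇)), at which point

min_{𝑡 < 𝑇} 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≤ (1/𝑇) ∑_{𝑡=0}^{𝑇−1} 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≤ √(2𝐿𝐺(𝑓(𝑥0) − 𝑓⋆)/𝑇) = 𝑂(1/√𝑇).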
4.3 Stochastic Euclidean mirror descent lemma
When the objective function is further assumed to be convex, then we have seen in Lecture 7 that the
Euclidean mirror descent lemma applies, and guarantees descent in distance to the set of minimizers.
To generalize the Euclidean mirror descent lemma, we operate as in Lecture 7, starting from the
linear lower bound property
𝑓(𝑦) ≥ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ ∀𝑥, 𝑦 ∈ ℝ𝑛,
that holds for any convex function. Applying the bound for the specific choice 𝑥 = 𝑥𝑡, and rearranging terms, we find
𝑓(𝑥𝑡) ≤ 𝑓(𝑦) − ⟨∇𝑓(𝑥𝑡), 𝑦 − 𝑥𝑡⟩   ∀𝑦 ∈ ℝ𝑛
      = 𝑓(𝑦) − ⟨𝔼𝑡[∇̃𝑓(𝑥𝑡)], 𝑦 − 𝑥𝑡⟩   ∀𝑦 ∈ ℝ𝑛
      = 𝑓(𝑦) − 𝔼𝑡[(1/𝜂) ⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩]   ∀𝑦 ∈ ℝ𝑛,
where the first equality follows from the unbiasedness of the gradient estimator ∇̃𝑓, and the second
from the definition of the SGD algorithm (1).
Using the three-point equality we discussed in Lecture 7,
−(1/𝜂) ⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩ = (1/(2𝜂)) (‖𝑥𝑡 − 𝑦‖₂² − ‖𝑥𝑡+1 − 𝑦‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂²),
we obtain the stochastic version of the Euclidean mirror descent lemma.
Theorem 4.2 (Stochastic Euclidean mirror descent lemma). Let 𝑓 : ℝ𝑛 → ℝ be convex. Then, for any stepsize 𝜂 > 0 and 𝑦 ∈ ℝ𝑛, two consecutive iterates (𝑥𝑡, 𝑥𝑡+1) produced by the SGD algorithm (1) satisfy

𝑓(𝑥𝑡) ≤ 𝑓(𝑦) + (1/(2𝜂)) 𝔼𝑡[‖𝑥𝑡 − 𝑦‖₂² − ‖𝑥𝑡+1 − 𝑦‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂²].
In particular, by setting 𝑦 = 𝑥⋆, where 𝑥⋆ is a minimizer of 𝑓, we can extract an expected decrease in the distance ‖𝑥𝑡 − 𝑥⋆‖₂² provided 𝜂 is chosen small enough, which is again totally analogous to the exact case seen in Lecture 7.
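For completeness (this expansion is not spelled out in the notes), the three-point equality used above can be checked directly: expanding the squared norms gives ‖𝑥𝑡 − 𝑦‖₂² − ‖𝑥𝑡+1 − 𝑦‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂² = 2⟨𝑥𝑡 − 𝑦, 𝑥𝑡 − 𝑥𝑡+1⟩ = −2⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩, and dividing both sides by 2𝜂 yields the identity.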
4.4 Convergence in convex function value
With the stochastic generalization of the Euclidean mirror descent lemma, we can obtain a convergence rate (in terms of function value) by using a telescopic argument, similar to what we did in Lecture 7. However, we will not be able to recover the 1/𝑇 rate of convergence for SGD.
Let 𝑥⋆ be a minimizer of the 𝐿-smooth and convex function 𝑓, and assume 𝔼𝑡[‖∇̃𝑓(𝑥)‖₂²] ≤ 𝐺 at all 𝑥. By summing the bound given in the stochastic Euclidean mirror descent lemma (Theorem 4.2) for all 𝑡 = 0, ..., 𝑇 − 1, we obtain

∑_{𝑡=0}^{𝑇−1} 𝑓(𝑥𝑡) ≤ 𝑇𝑓(𝑥⋆) + (1/(2𝜂)) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂²]
  = 𝑇𝑓(𝑥⋆) + (1/(2𝜂)) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + 𝜂² ‖∇̃𝑓(𝑥𝑡)‖₂²]   (from (1))
  ≤ 𝑇𝑓(𝑥⋆) + (1/(2𝜂)) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + 𝜂²𝐺].
Taking expectations on both sides, the conditional expectations 𝔼𝑡 decay into regular expectations
𝔼, and we can therefore exploit the telescoping nature of the sum ‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂²:

∑_{𝑡=0}^{𝑇−1} 𝔼[𝑓(𝑥𝑡)] ≤ 𝑇𝑓(𝑥⋆) + (1/(2𝜂)) 𝔼[∑_{𝑡=0}^{𝑇−1} (‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + 𝜂²𝐺)]
  ≤ 𝑇𝑓(𝑥⋆) + (1/(2𝜂)) ‖𝑥0 − 𝑥⋆‖₂² + (𝜂/2) 𝐺𝑇.
Moving the term 𝑇𝑓(𝑥⋆) to the left-hand side and dividing by 𝑇, we find

(1/𝑇) ∑_{𝑡=0}^{𝑇−1} 𝔼[𝑓(𝑥𝑡) − 𝑓(𝑥⋆)] ≤ (1/(2𝜂𝑇)) ‖𝑥0 − 𝑥⋆‖₂² + (𝜂/2) 𝐺.
Hence, we have the following:
Theorem 4.3. Consider the SGD algorithm (1) run on a convex function 𝑓 with minimizer 𝑥⋆, and assume 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²] ≤ 𝐺 at all times 𝑡. By picking 𝜂 = 1/√(𝐺𝑇), at least one of the iterates 𝑥𝑡, 𝑡 ∈ {0, ..., 𝑇 − 1}, satisfies

𝔼[𝑓(𝑥𝑡) − 𝑓(𝑥⋆)] ≤ (√𝐺/2) (1 + ‖𝑥0 − 𝑥⋆‖₂²) ⋅ (1/√𝑇).

By convexity, the same bound holds also for 𝔼[𝑓(𝑥̄𝑇) − 𝑓(𝑥⋆)], where 𝑥̄𝑇 is the average of the iterates 𝑥0, ..., 𝑥𝑇−1.
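As a sanity check of the 1/√𝑇 rate (an illustrative experiment of mine, not part of the notes), the sketch below runs SGD on the convex quadratic 𝑓(𝑥) = ½‖𝑥‖₂² with a Gaussian-noise gradient oracle and prints the suboptimality of the averaged iterate; quadrupling 𝑇 should roughly halve it.

```python
import numpy as np

def sgd_avg_suboptimality(T, dim=10, sigma=1.0, seed=0):
    # f(x) = 0.5 * ||x||^2, minimizer x* = 0, f(x*) = 0.
    # Gradient oracle: grad(x) = x + noise, unbiased; near the optimum its
    # second moment is roughly G ~ dim * sigma^2 (a crude bound, assumed here).
    rng = np.random.default_rng(seed)
    G = dim * sigma**2
    eta = 1.0 / np.sqrt(G * T)   # stepsize from Theorem 4.3
    x = np.ones(dim)             # x_0
    avg = np.zeros(dim)
    for _ in range(T):
        avg += x / T
        grad_est = x + sigma * rng.standard_normal(dim)  # E_t[grad_est] = grad f(x)
        x = x - eta * grad_est
    return 0.5 * float(avg @ avg)  # f(average iterate) - f(x*)

for T in [100, 400, 1600, 6400]:
    print(T, sgd_avg_suboptimality(T))
```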
Remark 4.1. It is important to realize that the above bound is of order 1/√𝑇 rather than 1/𝑇 as
in the (non-stochastic) gradient descent case. However, in establishing this result, we have not
used the gradient descent lemma. So, the above result does not require 𝐿-smoothness. As a
consequence, the previous result shows that non-stochastic gradient descent, applied to convex
functions that are not 𝐿-smooth, guarantees 1/√𝑇 convergence in function value provided that the
stepsize is chosen well.
5 Further readings
Several variations of the SGD algorithm and its analysis are known. For one, it is known that when
the objective function is strongly convex, the SGD algorithm can be shown to converge at a rate of
1/𝑇 in function value, provided the stepsize is chosen well.
Variance reduction techniques, which aim to reduce the variance of the gradient estimator, have
been proposed to accelerate the convergence of SGD. These techniques include the SVRG (stochastic variance reduced gradient) algorithm [JZ13], SAG (stochastic average gradient) [SLB17], and SAGA [DBL14].
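To give a flavor of how variance reduction works, here is a minimal sketch of the SVRG update rule from [JZ13] (names and the snapshot schedule are illustrative; see the paper for the exact variants and their guarantees).

```python
import numpy as np

def svrg(theta0, grad_example, k, eta, epochs, inner_steps, seed=0):
    # grad_example(theta, i): gradient of the i-th loss term (assumed supplied
    # by the user); the full gradient is the average over i = 0, ..., k-1.
    rng = np.random.default_rng(seed)
    snapshot = theta0.copy()
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = np.mean([grad_example(snapshot, i) for i in range(k)], axis=0)
        theta = snapshot.copy()
        for _ in range(inner_steps):
            i = rng.integers(k)
            # Control-variate estimator: still unbiased, but its variance
            # vanishes as theta and snapshot both approach the minimizer.
            g = grad_example(theta, i) - grad_example(snapshot, i) + full_grad
            theta -= eta * g
        snapshot = theta  # one common choice for the next snapshot
    return snapshot
```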
Bibliography
[JZ13] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Neural Information Processing Systems (NeurIPS), 2013.
[SLB17] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, vol. 162, pp. 83–112, 2017.
[DBL14] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” in Neural Information Processing Systems (NeurIPS), 2014.