MIT 6.7220/15.084 — Nonlinear Optimization (Spring ‘25) Tue, Apr 15th 2025
Lecture 15
Stochastic gradient descent
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)★
As we have seen in the past few lectures, gradient descent and its family of algorithms
(including accelerated gradient descent, projected gradient descent and mirror descent) are
first-order methods that can compute approximate minima of differentiable functions. The
cost of each iteration of these algorithms is dominated (linearly) by the time it takes to
evaluate the gradient of the objective function at each point generated by the algorithm.
L15.1 Preliminary words on stochastic methods
In some cases, including in most machine learning applications, the objective function
contains a huge number of terms (usually, a summation over training data points), rendering
the gradient computation prohibitively expensive. Today, we will investigate an approach
to sidestep this difficulty in practice. The idea is simple, and it is a general principle of
algorithm design broadly:
If exact computation is expensive, replace it with a cheaper estimate.
In particular, as is often the case, we will investigate replacing the computation of an exact
gradient with the computation of a cheap, unbiased random estimator of it. This approach,
based on randomness (a.k.a. stochasticity), leads to the family of algorithms that go under
the name of stochastic gradient descent.
As one might expect, such a stochastic approach is not free. We will be trading an exact
quantity (the exact gradient of the objective function) for a cheap estimate that is subject
to variance, subjecting the resulting algorithm to random (erratic) steps rather than clean
descent steps. This tradeoff is nonetheless extremely favorable for multiple reasons:
• As a matter of pragmatism, stochastic gradient descent algorithms can perform a
significantly larger number of steps in the same time that it takes (exact) gradient
descent to perform even a single step. In certain cases, the exact algorithm might not
be able to perform a single step in the allotted computation budget. By this metric,
the choice between an exact and a stochastic method reduces to the choice between
an algorithm that is not able to start, and one that—as erratic as it may be—at least
makes some progress.
• As a subtler comment, the use of stochasticity, as opposed to using a powerful exact
method, has been observed empirically to lead to better solutions in machine learning
settings, even when the exact algorithm can be run fast enough. In machine learning,
minimizing the objective function on the training data correlates with, but is conceptually
different from, finding a good model for the task at hand. In other words, the role of
optimization in machine learning is to (i) produce models that interpolate the training
dataset well, but also (ii) do not overfit, that is, generalize well to unseen but in-
distribution (“similar”) examples. Using stochastic gradient descent has been linked
with a reduction in overfitting and increased success on this second goal, partly due
to the presence of noise, which enables the algorithm to escape local minima and
saddle points.
L15.2 Empirical risk minimization (ERM) problems in machine
learning
Many problems in machine learning can be abstracted as follows:
• We are given a dataset of 𝑘 examples (𝑧1, 𝑦1), ..., (𝑧𝑘, 𝑦𝑘), where 𝑧𝑖 ∈ ℝ𝑛 is a feature
vector and 𝑦𝑖 ∈ ℝ is a label. For example:
‣ we might be interested in building a classifier for handwritten digits, where each
𝑧𝑖 is a vector containing 28 × 28 grayscale pixel intensities, and each 𝑦𝑖 is the
digit itself (a multinomial classification problem); or
‣ we might be interested in predicting the price of a house, where each 𝑧𝑖 is a vector
containing features of the house (e.g., number of bedrooms, square footage, etc.),
and each 𝑦𝑖 is the price of the house (a regression problem); or
‣ we might be interested in predicting the next word in a sentence, where each 𝑧𝑖
is a vector containing the previous words in the sentence, and each 𝑦𝑖 is the next
word (a multinomial classification problem); or
‣ we might be interested in predicting the probability of a user clicking on an ad,
where each 𝑧𝑖 is a vector containing features of the user and the ad, and each 𝑦𝑖
indicates whether the user clicked (a binary—that is, two-class—classification problem);
et cetera.
• We have selected a model class of functions 𝑔𝜃 : ℝ𝑛 → ℝ𝑚 that we believe can capture
the relationship between the features and the labels. These functions are parameterized
by a vector 𝜃 ∈ ℝ𝑑. For example, in the case of neural networks, 𝜃 would be the set
of layer weights and biases.
‣ The output dimension 𝑚 of the model is typically 𝑚 = 1 in the case of regression
and binary classification problems (in the latter case indicating the probability of,
say, the positive class).
‣ In the case of multinomial classification, 𝑚 is typically the number of classes, and
the output indicates the (unnormalized) log-probabilities that the model assigns to
the event that 𝑧𝑖 belongs to each possible class.
• We are interested in finding a choice of parameters 𝜃 that “fits the data well”. This
is typically done by minimizing the empirical risk (or empirical loss) of the model,
defined as
𝐽emp(𝜃) ≔ (1/𝑘) ∑_{𝑖=1}^{𝑘} ℓ(𝑔𝜃(𝑧𝑖), 𝑦𝑖),
where ℓ : ℝ𝑚 × ℝ → ℝ is a loss function that measures the discrepancy between the
model’s prediction 𝑔𝜃(𝑧𝑖) and the true label 𝑦𝑖. For example, in regression problems,
a reasonable choice might be the squared loss ℓ(𝑎, 𝑏) = (𝑎 − 𝑏)2. For classification,
in which the model output 𝑔𝜃(𝑧𝑖) = (𝑔1,𝜃(𝑧𝑖), ..., 𝑔𝑚,𝜃(𝑧𝑖)) ∈ ℝ𝑚 predicts the unnor-
malized log-probability of each class and 𝑦𝑖 ∈ {1, ..., 𝑚}, the log loss (also known as
cross entropy loss or softmax loss) is often used:
ℓ(𝑔𝜃(𝑧𝑖), 𝑦𝑖) = − ∑_{𝑗=1}^{𝑚} 1[𝑦𝑖 = 𝑗] ⋅ log( exp{𝑔𝑗,𝜃(𝑧𝑖)} / ∑_{𝑗′=1}^{𝑚} exp{𝑔𝑗′,𝜃(𝑧𝑖)} ).
In many cases, an explicit regularization term is added on top of 𝐽emp to reduce overfitting.
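To make the ERM setup concrete, the following short numpy sketch (our own illustrative example: the synthetic data, the linear model 𝑔𝜃(𝑧) = ⟨𝜃, 𝑧⟩, and all variable names are assumptions, not part of the notes) evaluates 𝐽emp with the squared loss.

```python
import numpy as np

# Hypothetical setup: k examples with n features, a linear model, squared loss.
rng = np.random.default_rng(0)
k, n = 1000, 20
Z = rng.normal(size=(k, n))                      # feature vectors z_1, ..., z_k
theta_true = rng.normal(size=n)
y = Z @ theta_true + 0.1 * rng.normal(size=k)    # noisy labels y_1, ..., y_k

def empirical_risk(theta):
    """J_emp(theta) = (1/k) * sum_i l(g_theta(z_i), y_i) with the squared loss."""
    predictions = Z @ theta                      # g_theta(z_i) = <theta, z_i>
    return np.mean((predictions - y) ** 2)

print(empirical_risk(np.zeros(n)))               # risk of the all-zeros parameter vector
print(empirical_risk(theta_true))                # close to the noise level 0.1**2
```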
L15.3 Stochastic gradient descent
The empirical risk minimization objective 𝑓 = 𝐽emp defined above is a sum of 𝑘 terms,
one for each example in the dataset. When the dataset is large, evaluating the gradient of
𝐽emp at each iteration of gradient descent can be computationally expensive. In this case,
we can, at each iteration, replace the exact gradient ∇𝑓 with a cheap, unbiased estimator ∇̃𝑓
of it. This means that, denoting by 𝔼𝑡 the expectation conditioned on all past random
choices (that is, all randomization used at times 1, ..., 𝑡 − 1), the estimator ∇̃𝑓(𝑥𝑡) satisfies
𝔼𝑡[∇̃𝑓(𝑥𝑡)] = ∇𝑓(𝑥𝑡). This leads to the stochastic gradient descent (SGD) algorithm,
𝑥𝑡+1 ≔ 𝑥𝑡 − 𝜂 ∇̃𝑓(𝑥𝑡),  where  𝔼𝑡[∇̃𝑓(𝑥𝑡)] = ∇𝑓(𝑥𝑡). (1)
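In code, the update (1) differs from exact gradient descent only in that the gradient call is replaced by a call to an unbiased estimator. A minimal skeleton (illustrative; the function names and the interface of `grad_estimator` are our own assumptions):

```python
import numpy as np

def sgd(grad_estimator, x0, eta, num_steps, rng):
    """Iterate x_{t+1} = x_t - eta * tilde_grad_f(x_t), as in (1).

    `grad_estimator(x, rng)` is assumed to return an unbiased estimate of grad f(x).
    """
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - eta * grad_estimator(x, rng)
        iterates.append(x.copy())
    return iterates
```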
L15.3.1 Typical unbiased estimator: mini-batches
Different instantiations of SGD differ in the choice of the estimator ∇̃𝑓(𝑥𝑡). For empirical
risk minimization problems, the most common choice is to replace the gradient of 𝑓 =
𝐽emp with the average gradient over a uniformly random subset of the training data (typically
sampled without replacement), called a mini-batch. The resulting instantiation of SGD is
therefore known as mini-batch SGD.
Computing the random mini-batch is usually easy to do by using a random shuffle of the
dataset, and selecting the first 𝑏 examples as the mini-batch. The size of the mini-batch 𝑏
is a hyperparameter of the algorithm. A popular choice is 𝑏 = 32 or 𝑏 = 64.
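Concretely, for the least-squares ERM objective of Section L15.2, a mini-batch gradient estimator can be sketched as follows (again our own illustrative code, with hypothetical names and synthetic data); the Monte Carlo check at the end confirms that averaging many mini-batch gradients recovers the full gradient, i.e., unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, b = 1000, 20, 32                               # dataset size, dimension, batch size
Z = rng.normal(size=(k, n))
y = Z @ rng.normal(size=n) + 0.1 * rng.normal(size=k)

def full_gradient(theta):
    # Gradient of J_emp(theta) = (1/k) * sum_i (<theta, z_i> - y_i)^2.
    return 2.0 / k * Z.T @ (Z @ theta - y)

def minibatch_gradient(theta, batch_size, rng):
    # Average gradient over a uniformly random mini-batch, sampled without replacement.
    idx = rng.choice(k, size=batch_size, replace=False)
    return 2.0 / batch_size * Z[idx].T @ (Z[idx] @ theta - y[idx])

theta = rng.normal(size=n)
avg = np.mean([minibatch_gradient(theta, b, rng) for _ in range(20000)], axis=0)
print(np.linalg.norm(avg - full_gradient(theta)))    # small: only Monte Carlo error remains
```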
L15.4 Analysis of stochastic gradient descent
As we have already seen several times, the analysis of variants of the gradient descent
algorithm passes through generalizing the two key descent lemmas: the gradient descent
lemma and the (Euclidean) mirror descent lemma. In this case, we will need to modify the
lemmas to account for the fact that the gradient of the objective has been replaced with
an unbiased estimator in (1).
L15.4.1 Stochastic gradient descent lemma
Recall that an 𝐿-smooth function 𝑓 : ℝ𝑛 → ℝ satisfies the quadratic upper bound
𝑓(𝑦) ≤ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ + (𝐿/2) ‖𝑥 − 𝑦‖₂²   ∀𝑥, 𝑦 ∈ ℝ𝑛.
Exactly as we did in Lecture 12, we apply the previous bound for the specific choice 𝑦 =
𝑥𝑡+1, 𝑥 = 𝑥𝑡, and obtain
𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥𝑡) − 𝜂 ⟨∇𝑓(𝑥𝑡), ∇̃𝑓(𝑥𝑡)⟩ + (𝐿/2) 𝜂² ‖∇̃𝑓(𝑥𝑡)‖₂².
Taking the (conditional) expectation on both sides and using the unbiasedness 𝔼𝑡[∇̃𝑓(𝑥𝑡)] =
∇𝑓(𝑥𝑡), we therefore obtain the following stochastic generalization of the gradient descent
lemma.
Theorem L15.1 (Stochastic gradient descent lemma). Let 𝑓 : ℝ𝑛 → ℝ be 𝐿-smooth and
𝜂 > 0 be an arbitrary stepsize. Two consecutive iterates (𝑥𝑡, 𝑥𝑡+1) produced by the
stochastic gradient descent algorithm (1) satisfy
𝔼𝑡[𝑓(𝑥𝑡+1)] ≤ 𝑓(𝑥𝑡) − 𝜂 ‖∇𝑓(𝑥𝑡)‖₂² + (𝐿/2) 𝜂² 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²].
The quantity 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²], which controls the quadratic term in the previous bound,
is often called the variance of the stochastic gradient ∇̃𝑓.
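For intuition, the bound in Theorem L15.1 is easy to probe numerically. The toy setup below is our own (not from the notes): 𝑓(𝑥) = ½‖𝑥‖₂², which is 1-smooth, with Gaussian noise added to the exact gradient. For this particular quadratic the smoothness inequality is tight, so the two Monte Carlo estimates should agree up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, sigma = 10, 0.1, 0.5
x_t = rng.normal(size=n)

def f(x):
    return 0.5 * x @ x                       # L-smooth with L = 1, grad f(x) = x

def grad_est(x):
    return x + sigma * rng.normal(size=n)    # unbiased: E[grad_est(x)] = grad f(x)

lhs_samples, sq_samples = [], []
for _ in range(200_000):
    g = grad_est(x_t)
    lhs_samples.append(f(x_t - eta * g))     # f(x_{t+1})
    sq_samples.append(g @ g)                 # ||tilde_grad f(x_t)||_2^2

lhs = np.mean(lhs_samples)                                             # ~ E_t[f(x_{t+1})]
rhs = f(x_t) - eta * (x_t @ x_t) + 0.5 * eta**2 * np.mean(sq_samples)  # bound of Theorem L15.1
print(lhs, rhs)                              # lhs <= rhs (equality here, up to noise)
```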
L15.4.2 Convergence in gradient norm
Just like the gradient descent lemma for exact gradient descent, the stochastic gradient
descent lemma guarantees descent in function value, in expectation, when 𝜂 > 0 is suffi-
ciently small. In the stochastic case, perhaps unsurprisingly, this threshold value of 𝜂
depends inversely proportionally on the variance of the unbiased estimator. Specifically,
suppose that 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²] ≤ 𝐺 at all times 𝑡. Then, the stochastic gradient descent
lemma can be rearranged as
‖∇𝑓(𝑥𝑡)‖₂² ≤ (1/𝜂) 𝔼𝑡[𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1)] + (𝐿/2) 𝜂𝐺.
Summing over all times 𝑡 = 0, 1, ..., 𝑇 − 1, we therefore obtain
∑_{𝑡=0}^{𝑇−1} ‖∇𝑓(𝑥𝑡)‖₂² ≤ (1/𝜂) (∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1)]) + (𝐿/2) 𝜂𝐺𝑇.
Taking expectations on both sides, the conditional expectations 𝔼𝑡 decay into regular
expectations 𝔼, and we can therefore exploit the telescoping nature of the sum 𝑓(𝑥𝑡) −
𝑓(𝑥𝑡+1), obtaining
∑_{𝑡=0}^{𝑇−1} 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≤ (1/𝜂) (𝑓(𝑥0) − 𝔼[𝑓(𝑥𝑇)]) + (𝐿/2) 𝜂𝐺𝑇.
Assuming the function is lower bounded by the value 𝑓⋆, we therefore have
(1/𝑇) ∑_{𝑡=0}^{𝑇−1} 𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≤ (𝑓(𝑥0) − 𝑓⋆)/(𝜂𝑇) + (𝐿/2) 𝜂𝐺.
Picking 𝜂 ≈ 1/√𝑇 then reveals that within 𝑇 iterations, at least one iterate satisfies
𝔼[‖∇𝑓(𝑥𝑡)‖₂²] ≈ 1/√𝑇, and we recover convergence in gradient norm no matter the (bounded)
variance of the estimator.
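To see concretely where the 1/√𝑇 rate comes from, one can balance the two terms of the last bound over the constant stepsize 𝜂 (a standard side computation, not carried out explicitly in the notes):

```latex
\[
  h(\eta) := \frac{f(x_0) - f^\star}{\eta T} + \frac{L}{2}\,\eta G,
  \qquad
  h'(\eta) = 0
  \;\Longrightarrow\;
  \eta^\star = \sqrt{\frac{2\bigl(f(x_0) - f^\star\bigr)}{L G T}},
\]
\[
  \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\bigl[\|\nabla f(x_t)\|_2^2\bigr]
  \;\le\; h(\eta^\star)
  \;=\; \sqrt{\frac{2 L G \bigl(f(x_0) - f^\star\bigr)}{T}}.
\]
```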
L15.4.3 Stochastic Euclidean mirror descent lemma
When the objective function is further assumed to be convex, we have seen in Lecture
12 that the Euclidean mirror descent lemma applies and guarantees descent in distance to
the set of minimizers.
To generalize the Euclidean mirror descent lemma, we operate as in Lecture 12, starting
from the linear lower bound property
𝑓(𝑦) ≥ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ ∀𝑥, 𝑦 ∈ ℝ𝑛,
that holds for any convex function. Applying the bound for the specific choice 𝑥 = 𝑥𝑡, and
rearranging terms, we find
𝑓(𝑥𝑡) ≤ 𝑓(𝑦) − ⟨∇𝑓(𝑥𝑡), 𝑦 − 𝑥𝑡⟩   ∀𝑦 ∈ ℝ𝑛
       = 𝑓(𝑦) − ⟨𝔼𝑡[∇̃𝑓(𝑥𝑡)], 𝑦 − 𝑥𝑡⟩   ∀𝑦 ∈ ℝ𝑛
       = 𝑓(𝑦) − 𝔼𝑡[(1/𝜂) ⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩]   ∀𝑦 ∈ ℝ𝑛,
where the first equality follows from the unbiasedness of the gradient estimator ∇̃𝑓, and
the second from the definition of the SGD algorithm (1).
Using the three-point equality we discussed in Lecture 12,
−(1/𝜂) ⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩ = (1/(2𝜂)) (‖𝑥𝑡 − 𝑦‖₂² − ‖𝑥𝑡+1 − 𝑦‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂²),
we obtain the stochastic version of the Euclidean mirror descent lemma.
Theorem L15.2 (Stochastic Euclidean mirror descent lemma). Let 𝑓 : ℝ𝑛 → ℝ be
convex. Then, for any stepsize 𝜂 > 0 and 𝑦 ∈ ℝ𝑛, two consecutive iterates (𝑥𝑡, 𝑥𝑡+1)
produced by the SGD algorithm (1) satisfy
𝑓(𝑥𝑡) ≤ 𝑓(𝑦) + (1/(2𝜂)) 𝔼𝑡[‖𝑥𝑡 − 𝑦‖₂² − ‖𝑥𝑡+1 − 𝑦‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂²].
In particular, by setting 𝑦 = 𝑥⋆, where 𝑥⋆ is a minimizer of 𝑓, we can extract an expected
decrease in the distance ‖𝑥𝑡 − 𝑥⋆‖₂² provided 𝜂 is chosen small enough, which is again totally
analogous to the exact case seen in Lecture 12.
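The three-point equality used above is a purely algebraic identity about squared Euclidean norms, so it can be sanity-checked on arbitrary vectors; the short snippet below is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 5, 0.3
x_t, x_next, y = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

lhs = -(1.0 / eta) * (x_t - x_next) @ (y - x_t)
rhs = (1.0 / (2 * eta)) * (
    (x_t - y) @ (x_t - y) - (x_next - y) @ (x_next - y) + (x_t - x_next) @ (x_t - x_next)
)
print(np.isclose(lhs, rhs))    # True: the identity holds for arbitrary x_t, x_{t+1}, y
```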
L15.4.4 Convergence in convex function value
With the stochastic generalization of the Euclidean mirror descent lemma, we can obtain
a convergence rate (in terms of function value) by using a telescoping argument, similar to
what we did in Lecture 12. However, we will not be able to recover the 1/𝑇 rate of convergence
for SGD.
Let 𝑥⋆ be a minimizer of the convex function 𝑓, and assume 𝔼𝑡[‖∇̃𝑓(𝑥)‖₂²] ≤ 𝐺
at all 𝑥. By summing the bound given in the stochastic Euclidean mirror descent lemma
(Theorem L15.2) for all 𝑡 = 0, ..., 𝑇 − 1, we obtain
∑_{𝑡=0}^{𝑇−1} 𝑓(𝑥𝑡) ≤ 𝑇 𝑓(𝑥⋆) + (1/(2𝜂)) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + ‖𝑥𝑡 − 𝑥𝑡+1‖₂²]
             = 𝑇 𝑓(𝑥⋆) + (1/(2𝜂)) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + 𝜂² ‖∇̃𝑓(𝑥𝑡)‖₂²]   (from (1))
             ≤ 𝑇 𝑓(𝑥⋆) + (1/(2𝜂)) ∑_{𝑡=0}^{𝑇−1} 𝔼𝑡[‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + 𝜂²𝐺].
Taking expectations on both sides, the conditional expectations 𝔼𝑡 decay into regular
expectations 𝔼, and we can therefore exploit the telescoping nature of the sum
‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂²:
∑_{𝑡=0}^{𝑇−1} 𝔼[𝑓(𝑥𝑡)] ≤ 𝑇 𝑓(𝑥⋆) + (1/(2𝜂)) 𝔼[∑_{𝑡=0}^{𝑇−1} (‖𝑥𝑡 − 𝑥⋆‖₂² − ‖𝑥𝑡+1 − 𝑥⋆‖₂² + 𝜂²𝐺)]
                     ≤ 𝑇 𝑓(𝑥⋆) + (1/(2𝜂)) ‖𝑥0 − 𝑥⋆‖₂² + (𝜂/2) 𝐺𝑇.
Moving the term 𝑇 𝑓(𝑥⋆) to the left-hand side and dividing by 𝑇 , we find
(1/𝑇) ∑_{𝑡=0}^{𝑇−1} 𝔼[𝑓(𝑥𝑡) − 𝑓(𝑥⋆)] ≤ (1/(2𝜂𝑇)) ‖𝑥0 − 𝑥⋆‖₂² + (𝜂/2) 𝐺.
Hence, we have the following:
Theorem L15.3. Consider the SGD algorithm (1) run on a convex function 𝑓 whose
gradient estimator satisfies 𝔼𝑡[‖∇̃𝑓(𝑥𝑡)‖₂²] ≤ 𝐺 at all times 𝑡. Picking 𝜂 = 1/√(𝐺𝑇),
the bound above immediately implies that at least one of the iterates 𝑥𝑡, 𝑡 ∈ {0, ..., 𝑇 −
1}, satisfies
𝔼[𝑓(𝑥𝑡) − 𝑓(𝑥⋆)] ≤ (√𝐺/2) (1 + ‖𝑥0 − 𝑥⋆‖₂²) ⋅ (1/√𝑇).
By convexity, the same bound holds also for 𝔼[𝑓(𝑥̄𝑇) − 𝑓(𝑥⋆)], where 𝑥̄𝑇 is the average
of the iterates 𝑥0, ..., 𝑥𝑇−1.
Remark L15.1. It is important to realize that the above bound is of order 1/√𝑇 rather
than 1/𝑇 as in the (non-stochastic) gradient descent case. However, in establishing this
result, we have not used the gradient descent lemma. So, the above result does not
require 𝐿-smoothness. As a consequence, the previous result shows that non-stochastic
gradient descent, applied to convex functions that are not 𝐿-smooth but whose gradient
norms remain bounded, guarantees 1/√𝑇 convergence in function value provided that the
stepsize is chosen well.
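To close, here is a small end-to-end experiment (entirely our own setup: the synthetic data, batch size, and the heuristic stepsize 𝜂 = 1/√𝑇, standing in for 1/√(𝐺𝑇) since 𝐺 is rarely known exactly) that runs mini-batch SGD on a convex least-squares ERM problem and compares the suboptimality of the last iterate with that of the averaged iterate appearing in Theorem L15.3.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, b, T = 2000, 30, 32, 5000
Z = rng.normal(size=(k, n))
y = Z @ rng.normal(size=n) + 0.1 * rng.normal(size=k)

def f(theta):                                           # J_emp with squared loss (convex)
    return np.mean((Z @ theta - y) ** 2)

theta_star, *_ = np.linalg.lstsq(Z, y, rcond=None)      # exact minimizer, for reference

eta = 1.0 / np.sqrt(T)                                  # heuristic stand-in for 1/sqrt(G*T)
theta, theta_sum = np.zeros(n), np.zeros(n)
for _ in range(T):
    theta_sum += theta                                  # accumulate x_0, ..., x_{T-1}
    idx = rng.choice(k, size=b, replace=False)
    grad = 2.0 / b * Z[idx].T @ (Z[idx] @ theta - y[idx])   # unbiased mini-batch gradient
    theta = theta - eta * grad

theta_avg = theta_sum / T
print("last iterate   :", f(theta) - f(theta_star))
print("average iterate:", f(theta_avg) - f(theta_star))
```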
L15.5 Further readings
Several variations of the SGD algorithm and its analysis are known. For one, it is known
that when the objective function is strongly convex, the SGD algorithm can be shown to
converge at a rate of 1/𝑇 in function value, provided the stepsize is chosen well.
Variance reduction techniques, which aim to reduce the variance of the gradient estimator,
have been proposed to accelerate the convergence of SGD. These techniques include the
SVRG (stochastic variance reduced gradient) algorithm [JZ13], SAG (stochastic average
gradient) [SLB17], and SAGA [DBL14].
[JZ13] Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using
predictive variance reduction. Neural Information Processing Systems (NeurIPS).
[SLB17] Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the
stochastic average gradient. Mathematical Programming, 162, 83–112.
[DBL14] Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental
gradient method with support for non-strongly convex composite objectives.
Neural Information Processing Systems (NeurIPS).
Changelog
• Apr 17, 2025: Clarified 𝐿-smoothness.
★These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.