


MIT 6.7220/15.084 — Nonlinear Optimization (Spring ‘25) Thu, Apr 17th 2025
Lecture 16
Distributed optimization and ADMM
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
In this last lecture on first-order methods, we will touch on an important aspect of
optimization in machine learning: distributed optimization.
L16.1 Setting
Consider again an empirical risk minimization problem, where we aim to minimize the
average loss over a large training set. When the training set is too large to fit on a single
machine or privacy concerns require different parts of the data to be stored on different
machines, it is common to distribute the optimization problem across multiple machines.
In this setting, we assume that we have 𝑚 machines, each with access to a local dataset and a local function \(f_j : \mathbb{R}^n \to \mathbb{R}\). The goal is to minimize the average of the local functions,
\[
f(x) = \frac{1}{m} \sum_{j=1}^{m} f_j(x),
\]
where the functions \(f_j\) are differentiable.
In our distributed optimization model, each machine can
communicate with a central orchestrator, which can send
messages to all machines and collect messages from all
machines. Furthermore, each machine is powerful enough to
solve most optimization problems on their local dataset.
[Figure: the communication model. An orchestrator holding the current iterate \(x_t\) exchanges messages with machines \(j = 1, 2, 3\), each maintaining a local iterate \(x^j_t\) and a local multiplier \(\alpha^j_t\).]
Notation: We will denote machine indices as superscripts (these are not powers!) and time
indices as subscripts.
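To make the setting concrete, here is a minimal toy instance in Python. The least-squares form of the local losses, the data, and the helper names (f_local, f_global) are hypothetical choices made purely for illustration; nothing in the lecture depends on them.

    # A minimal toy instance (hypothetical data): each of the m machines holds a
    # local least-squares loss f_j(x) = ||A_j x - b_j||^2, and the global objective
    # f is the average of the local losses.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 3, 5                                            # machines, dimension
    A = [rng.standard_normal((10, n)) for _ in range(m)]   # local design matrices
    b = [rng.standard_normal(10) for _ in range(m)]        # local targets

    def f_local(j, x):
        """Local loss f_j held by machine j."""
        r = A[j] @ x - b[j]
        return r @ r

    def f_global(x):
        """Global objective: the average of the local losses."""
        return np.mean([f_local(j, x) for j in range(m)])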
L16.1.1 First attempt: independent optimization
One might be tempted to distribute the optimization problem by simply letting each machine compute the optimal solution on its local dataset, and then trying to combine the solutions in some way. However, without further assumptions about how the different datasets relate to each other, this approach has no hope in general. In the extreme, each local solution might carry virtually zero information about the global optimum; for example, consider a multi-class classification problem where each machine only has access to training data from a single class. Then, a classifier that deterministically outputs that machine's class would be an optimal local solution and yet carry no meaning globally.
L16.1.2 Second attempt: distributed gradient computation
A more sophisticated approach is to compute the gradient of the loss function at each
machine and then combine the gradients at a central location. This approach would work,
but has two major drawbacks:
1. The total communication cost is mainly determined by the convergence rate of
gradient descent. In particular, when the smoothness or the Lipschitzness of 𝑓 is not
very good, then the convergence rate of gradient descent is not very good. We will
discuss today a solution that does not depend on the smoothness of 𝑓.
2. Gradient descent does not leverage the computation power of each machine very well.
At each iteration, each machine simply computes a gradient. A more cost-effective
approach would try to directly use the computational power of each machine to find
some sort of optimal solution of the problem (while avoiding the naive approach of
independent optimization discussed in Section L16.1.1).
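For concreteness, here is a minimal sketch of this baseline on the same hypothetical least-squares losses as above (regenerated so the snippet runs on its own); the step size is an arbitrary illustrative choice, not a recommendation.

    # Distributed gradient computation: at each round every machine sends only its
    # local gradient, and the orchestrator averages the gradients and takes one step.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 3, 5
    A = [rng.standard_normal((10, n)) for _ in range(m)]
    b = [rng.standard_normal(10) for _ in range(m)]

    def grad_local(j, x):
        # gradient of the local loss ||A_j x - b_j||^2
        return 2.0 * A[j].T @ (A[j] @ x - b[j])

    x = np.zeros(n)
    step = 1e-2                   # illustrative, untuned step size
    for t in range(500):          # each iteration is one round of communication
        g = np.mean([grad_local(j, x) for j in range(m)], axis=0)
        x -= step * g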
L16.2 The ADMM algorithm
The ADMM algorithm is a distributed optimization algorithm that combines the best of both worlds: it uses the computational power of each machine to find an optimal solution, while also ensuring that the global solution is meaningful. The algorithm can be thought of as a two-step process parameterized by a learning rate 𝜂 > 0 and a penalty parameter 𝜆 > 0, as follows.
[Figure: Phase I communication. The orchestrator sends the current iterate \(x_t\) to each machine \(j\), which holds \(x^j_t\) and \(\alpha^j_t\).]
Phase I: Local update. The orchestrator sends the current \(x_t\) to each machine. Each machine computes their updated multiplier
\[
\alpha^j_t = \alpha^j_{t-1} - \eta\,(x_t - x^j_t)
\]
and optimizes on their local training set
\[
x^j_{t+1} = \arg\min_{x^j \in \mathbb{R}^n} \left\{ f_j(x^j) + \lambda \|x^j - x_t\|_2^2 + \langle \alpha^j_t,\, x^j - x_t \rangle \right\}.
\]
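To see what Phase I amounts to on one machine, here is a sketch for the hypothetical least-squares losses used in the earlier snippets; for that choice of \(f_j\), the local argmin has a closed form obtained by setting the gradient of the regularized objective to zero. The function name and signature are illustrative only.

    import numpy as np

    def phase_one(Aj, bj, x_t, x_loc_prev, alpha_prev, eta, lam):
        """One machine's Phase I update, assuming f_j(x) = ||A_j x - b_j||^2."""
        # Multiplier step: alpha_t^j = alpha_{t-1}^j - eta * (x_t - x_t^j).
        alpha = alpha_prev - eta * (x_t - x_loc_prev)
        # Local step: argmin of f_j(x) + lam * ||x - x_t||^2 + <alpha, x - x_t>.
        # Setting the gradient to zero gives the linear system
        #   (2 A_j^T A_j + 2 lam I) x = 2 A_j^T b_j + 2 lam x_t - alpha.
        H = 2.0 * Aj.T @ Aj + 2.0 * lam * np.eye(x_t.size)
        rhs = 2.0 * Aj.T @ bj + 2.0 * lam * x_t - alpha
        return np.linalg.solve(H, rhs), alpha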
[Figure: Phase II communication. Each machine reports its local solution \(x^j_{t+1}\) and multiplier \(\alpha^j_t\) back to the orchestrator.]
Phase II: Synchronization. Each machine reports their local solution \(x^j_{t+1}\) and their local multiplier \(\alpha^j_t\) to the orchestrator.
The orchestrator computes
\[
x_{t+1} = \frac{1}{m}\left(\sum_{j=1}^{m} x^j_{t+1}\right) + \frac{1}{2m\lambda}\left(\sum_{j=1}^{m} \alpha^j_t\right).
\]
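Putting the two phases together gives the following end-to-end sketch on the toy least-squares instance (data regenerated so the snippet runs standalone). The values \(\lambda = \sqrt{T}\) and \(\eta = 2\lambda\) mirror the parameter choice used in the analysis later in this lecture; they are one reasonable choice for illustration, not the only valid one.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, T = 3, 5, 200
    A = [rng.standard_normal((10, n)) for _ in range(m)]
    b = [rng.standard_normal(10) for _ in range(m)]

    lam = np.sqrt(T)          # penalty parameter
    eta = 2.0 * lam           # learning rate

    x = np.zeros(n)                           # orchestrator iterate x_t
    x_loc = [np.zeros(n) for _ in range(m)]   # local iterates x_t^j
    alpha = [np.zeros(n) for _ in range(m)]   # local multipliers alpha_t^j

    for t in range(T):
        # Phase I (in parallel on each machine)
        for j in range(m):
            alpha[j] = alpha[j] - eta * (x - x_loc[j])
            H = 2.0 * A[j].T @ A[j] + 2.0 * lam * np.eye(n)
            rhs = 2.0 * A[j].T @ b[j] + 2.0 * lam * x - alpha[j]
            x_loc[j] = np.linalg.solve(H, rhs)
        # Phase II (on the orchestrator)
        x = np.mean(x_loc, axis=0) + sum(alpha) / (2.0 * m * lam)

    print("consensus gap:", max(np.linalg.norm(xj - x) for xj in x_loc))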
L16.2.1 ADMM as Lagrangian relaxation
At this point, one might wonder how the update rules of ADMM were picked. The answer
is pretty ingenious. ADMM is based on the fundamental idea of rewriting the problem
\[
v \coloneqq \min_{x} \; \frac{1}{m} \sum_{j=1}^{m} f_j(x)
\]
as the equivalent problem
\[
\min_{x,\, x^j} \;\; \frac{1}{m} \sum_{j=1}^{m} \left( f_j(x^j) + \lambda \|x^j - x\|_2^2 \right) \quad \text{s.t.} \quad x^j = x \;\; \text{for all } j. \tag{1}
\]
The next step is to relax the equality constraints \(x^j = x\) by introducing the Lagrange relaxation of the problem, in which the constraints are replaced by a penalization term in the objective function. The following key result shows that the maximum value of the Lagrange relaxation is the same as the value of the original problem when all 𝑓𝑗 are convex.
Theorem L16.1. Assuming all 𝑓𝑗 are convex, the optimal value of the original problem is equal to the maximum of the Lagrange function over all multipliers \(\alpha^j\), that is,
\[
v = \max_{\alpha^1, \dots, \alpha^m} \; \min_{x,\, x^j} \; \left\{ \mathcal{L}(x, x^j; \alpha^j) \coloneqq \frac{1}{m} \sum_{j=1}^{m} \left( f_j(x^j) + \lambda \|x^j - x\|_2^2 + \langle \alpha^j,\, x^j - x \rangle \right) \right\}.
\]
In light of this result, the ADMM algorithm can be seen as a way to maximize the Lagrange function over the multipliers \(\alpha^j\) by performing alternating optimization steps on different groups of variables:
• Assuming fixed \(x_t\) and \(x^j_t\), the multipliers \(\alpha^j_{t-1}\) are updated by gradient ascent, performing a step in the direction of the gradient of the objective function:
\[
\alpha^j_t = \alpha^j_{t-1} + \eta\,(x^j_t - x_t) = \alpha^j_{t-1} - \eta\,(x_t - x^j_t).
\]
• Assuming now fixed \(\alpha^j_t\) and \(x_t\), each machine now solves the optimization problem
\[
x^j_{t+1} = \arg\min_{x^j \in \mathbb{R}^n} \left\{ f_j(x^j) + \lambda \|x^j - x_t\|_2^2 + \langle \alpha^j_t,\, x^j - x_t \rangle \right\}.
\]
• Assuming now fixed \(\alpha^j_t\) and \(x^j_{t+1}\), the orchestrator computes the next iterate \(x_{t+1}\) by solving the optimization problem
\[
x_{t+1} = \arg\min_{x \in \mathbb{R}^n} \left\{ \frac{1}{m} \sum_{j=1}^{m} \left( f_j(x^j_{t+1}) + \lambda \|x^j_{t+1} - x\|_2^2 + \langle \alpha^j_t,\, x^j_{t+1} - x \rangle \right) \right\}
= \arg\min_{x \in \mathbb{R}^n} \left\{ \frac{1}{m} \sum_{j=1}^{m} \left( \lambda \|x^j_{t+1} - x\|_2^2 - \langle \alpha^j_t,\, x \rangle \right) \right\}.
\]
By setting the gradient with respect to \(x\) to 0 and solving for \(x\), we obtain
\[
\frac{1}{m} \sum_{j=1}^{m} \left( 2\lambda (x - x^j_{t+1}) - \alpha^j_t \right) = 0.
\]
So, the optimal \(x_{t+1}\) is written in closed form as
\[
x_{t+1} = \frac{1}{m} \sum_{j=1}^{m} x^j_{t+1} + \frac{1}{2m\lambda} \sum_{j=1}^{m} \alpha^j_t.
\]
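A quick numerical check (random placeholder values, unrelated to any dataset) confirms that this closed form indeed zeroes the gradient of the synchronization objective, and hence is its unique minimizer by strong convexity.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, lam = 5, 4, 0.9
    x_loc = rng.standard_normal((m, n))     # the fixed local solutions x_{t+1}^j
    alpha = rng.standard_normal((m, n))     # the fixed multipliers alpha_t^j

    x_next = x_loc.mean(axis=0) + alpha.sum(axis=0) / (2 * m * lam)
    grad = np.mean(2 * lam * (x_next - x_loc) - alpha, axis=0)
    print(np.allclose(grad, 0))             # True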
L16.3 Analysis (Optional)
The analysis of the correctness of ADMM can be broken down into several steps. Perhaps unsurprisingly at this point, at its core the proof revolves around an appropriate generalization of the Euclidean mirror descent lemma.
L16.3.1 Part I: Orchestrating machine
Theorem L16.2. If the initial multipliers \(\alpha^j_0\) are chosen such that \(\sum_{j=1}^{m} \alpha^j_0 = 0\), then, at all times \(t\),
\[
\sum_{j=1}^{m} \alpha^j_t = 0, \qquad \text{and} \qquad x_{t+1} = \frac{1}{m} \sum_{j=1}^{m} x^j_{t+1}.
\]
Proof. By definition,
\[
\alpha^j_{t+1} = \alpha^j_t - \eta\,(x_{t+1} - x^j_{t+1}).
\]
So, averaging over \(j = 1, \dots, m\), we can write
\[
\frac{1}{m} \sum_{j=1}^{m} \alpha^j_{t+1} = \frac{1}{m} \sum_{j=1}^{m} \alpha^j_t - \eta \left( x_{t+1} - \frac{1}{m} \sum_{j=1}^{m} x^j_{t+1} \right). \tag{2}
\]
Using the definition of \(x_{t+1}\), we also have
\[
x_{t+1} = \left( \frac{1}{m} \sum_{j=1}^{m} x^j_{t+1} \right) + \frac{1}{2m\lambda} \sum_{j=1}^{m} \alpha^j_t
\quad\Longrightarrow\quad
x_{t+1} - \frac{1}{m} \sum_{j=1}^{m} x^j_{t+1} = \frac{1}{2m\lambda} \sum_{j=1}^{m} \alpha^j_t. \tag{3}
\]
Substituting the last expression into (2), we get
\[
\frac{1}{m} \sum_{j=1}^{m} \alpha^j_{t+1} = \frac{1}{m} \sum_{j=1}^{m} \alpha^j_t - \frac{\eta}{2m\lambda} \sum_{j=1}^{m} \alpha^j_t = \frac{1}{m} \left( 1 - \frac{\eta}{2\lambda} \right) \sum_{j=1}^{m} \alpha^j_t.
\]
This shows, by induction, that \(\sum_{j=1}^{m} \alpha^j_t = 0\) at all \(t\) as long as the assumption holds for \(t = 0\).
Plugging the expression \(\sum_{j=1}^{m} \alpha^j_t = 0\) into (3) then yields
\[
x_{t+1} = \frac{1}{m} \sum_{j=1}^{m} x^j_{t+1}
\]
at all times, which is the second part of the claim.
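The two invariants of Theorem L16.2 are easy to check numerically: starting from arbitrary multipliers summing to zero and arbitrary local iterates (random placeholder values below), one synchronization step followed by one multiplier update keeps the multipliers summing to zero, and the synchronized point coincides with the plain average of the local iterates.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, eta, lam = 4, 3, 0.7, 1.3
    alpha = rng.standard_normal((m, n))
    alpha -= alpha.mean(axis=0)                   # enforce sum_j alpha_t^j = 0
    x_loc = rng.standard_normal((m, n))           # the x_{t+1}^j reported by machines

    x_next = x_loc.mean(axis=0) + alpha.sum(axis=0) / (2 * m * lam)
    alpha_next = alpha - eta * (x_next - x_loc)   # alpha_{t+1}^j = alpha_t^j - eta (x_{t+1} - x_{t+1}^j)

    print(np.allclose(alpha_next.sum(axis=0), 0))   # True
    print(np.allclose(x_next, x_loc.mean(axis=0)))  # True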
L16.3.2 Part II: Worker machines
Theorem L16.3 (Euclidean mirror descent lemma for ADMM). Let each \(f_j : \mathbb{R}^n \to \mathbb{R}\) be convex, and \(\eta = 2\lambda\). Then, for any \(y \in \mathbb{R}^n\) we have
\[
\frac{1}{m} \sum_{j=1}^{m} f_j(x^j_{t+1}) \le f(y) + \frac{\eta}{2} \left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 - \|x_t - x_{t+1}\|_2^2 \right)
+ \frac{1}{m} \sum_{j=1}^{m} \frac{1}{2\eta} \left( \|\alpha^j_t\|_2^2 - \|\alpha^j_{t+1}\|_2^2 - \|\alpha^j_{t+1} - \alpha^j_t\|_2^2 \right).
\]
Proof. Recall that each machine computes the update
\[
x^j_{t+1} = \arg\min_{x^j \in \mathbb{R}^n} \left\{ f_j(x^j) + \lambda \|x^j - x_t\|_2^2 + \langle \alpha^j_t,\, x^j - x_t \rangle \right\}.
\]
As the domain is open, the first-order necessary optimality conditions imply that
\[
\nabla f_j(x^j_{t+1}) + 2\lambda (x^j_{t+1} - x_t) + \alpha^j_t = 0. \tag{4}
\]
Since \(f_j\) is convex, by the linear lower bound property of convex functions we can write
\[
f_j(y) \ge f_j(x^j_{t+1}) + \langle \nabla f_j(x^j_{t+1}),\, y - x^j_{t+1} \rangle
= f_j(x^j_{t+1}) + \langle -2\lambda (x^j_{t+1} - x_t) - \alpha^j_t,\; y - x^j_{t+1} \rangle,
\]
where the equality uses (4). Rearranging the terms and using the hypothesis \(\eta = 2\lambda\), we get
\[
f_j(x^j_{t+1}) \le f_j(y) + \eta \langle x^j_{t+1} - x_t,\, y - x^j_{t+1} \rangle + \langle \alpha^j_t,\, y - x^j_{t+1} \rangle. \tag{5}
\]
At this point, we can use the update rule \(\alpha^j_{t+1} = \alpha^j_t - \eta\,(x_{t+1} - x^j_{t+1})\) to write
\[
x^j_{t+1} = x_{t+1} + \frac{\alpha^j_{t+1} - \alpha^j_t}{\eta}.
\]
Substituting the above expression into the first inner product of (5) then yields
\[
f_j(x^j_{t+1}) \le f_j(y) + \eta \langle x_{t+1} - x_t,\, y - x^j_{t+1} \rangle + \langle \alpha^j_{t+1},\, y - x^j_{t+1} \rangle.
\]
(Note the change in time index from \(\alpha^j_t\) to \(\alpha^j_{t+1}\) in the last term.) Substituting the expression for \(x^j_{t+1}\) into the last inner product, we finally obtain
\[
f_j(x^j_{t+1}) \le f_j(y) + \eta \langle x_{t+1} - x_t,\, y - x^j_{t+1} \rangle + \langle \alpha^j_{t+1},\, y - x_{t+1} \rangle - \frac{1}{\eta} \langle \alpha^j_{t+1},\, \alpha^j_{t+1} - \alpha^j_t \rangle. \tag{6}
\]
From Theorem L16.2, we know that
\[
\frac{1}{m} \sum_{j=1}^{m} \langle \alpha^j_{t+1},\, y - x_{t+1} \rangle = 0, \qquad \text{and} \qquad \frac{1}{m} \sum_{j=1}^{m} \langle x_{t+1} - x_t,\, y - x^j_{t+1} \rangle = \langle x_{t+1} - x_t,\, y - x_{t+1} \rangle
\]
for all \(t\). Furthermore, we know that \(f(y) = \frac{1}{m} \sum_{j=1}^{m} f_j(y)\). Hence, averaging (6) across all machines \(j = 1, \dots, m\), we obtain
\[
\frac{1}{m} \sum_{j=1}^{m} f_j(x^j_{t+1}) \le f(y) + \eta \langle x_{t+1} - x_t,\, y - x_{t+1} \rangle - \frac{1}{m\eta} \sum_{j=1}^{m} \langle \alpha^j_{t+1},\, \alpha^j_{t+1} - \alpha^j_t \rangle.
\]
Substituting the three-point equalities
\[
\langle x_{t+1} - x_t,\, y - x_{t+1} \rangle = \frac{1}{2} \left( \|x_t - y\|_2^2 - \|x_{t+1} - y\|_2^2 - \|x_t - x_{t+1}\|_2^2 \right)
\]
and
\[
-\langle \alpha^j_{t+1},\, \alpha^j_{t+1} - \alpha^j_t \rangle = \frac{1}{2} \left( \|\alpha^j_t\|_2^2 - \|\alpha^j_{t+1}\|_2^2 - \|\alpha^j_{t+1} - \alpha^j_t\|_2^2 \right)
\]
yields the statement.
L16.3.3 Putting the pieces together: telescoping step
The sum on the right-hand side of Theorem L16.3 telescopes. In particular, since \(\alpha^j_0 = 0\) for all \(j\), we have
\[
\frac{1}{mT} \sum_{t=0}^{T-1} \sum_{j=1}^{m} f_j(x^j_{t+1}) \le f(y) + \frac{\eta}{2T} \|y - x_0\|_2^2 - \frac{1}{2\eta m T} \sum_{t=0}^{T-1} \sum_{j=1}^{m} \|\alpha^j_{t+1} - \alpha^j_t\|_2^2.
\]
Since by definition \(\alpha^j_{t+1} - \alpha^j_t = \eta\,(x^j_{t+1} - x_{t+1})\), the last term satisfies
\[
\frac{1}{2\eta m T} \sum_{t=0}^{T-1} \sum_{j=1}^{m} \|\alpha^j_{t+1} - \alpha^j_t\|_2^2 = \frac{\eta}{2mT} \sum_{t=0}^{T-1} \sum_{j=1}^{m} \|x^j_{t+1} - x_{t+1}\|_2^2,
\]
leading to
\[
\frac{1}{mT} \sum_{t=0}^{T-1} \sum_{j=1}^{m} f_j(x^j_{t+1}) \le f(y) + \frac{\eta}{2T} \|y - x_0\|_2^2 - \frac{\eta}{2mT} \sum_{t=0}^{T-1} \sum_{j=1}^{m} \|x^j_{t+1} - x_{t+1}\|_2^2.
\]
Rearranging the terms and applying \(y = x^\star\), where \(x^\star\) denotes a minimizer of \(f\), we therefore have
\[
\frac{1}{T} \sum_{t=0}^{T-1} \left[ \frac{1}{m} \sum_{j=1}^{m} \left( f_j(x^j_{t+1}) + \frac{\eta}{2} \|x^j_{t+1} - x_{t+1}\|_2^2 \right) \right] \le f(x^\star) + \frac{\eta}{2T} \|x^\star - x_0\|_2^2.
\]
Thus, we have the following guarantee.
Theorem L16.4. Let each \(f_j : \mathbb{R}^n \to \mathbb{R}\) be convex and nonnegative, \(\lambda = \sqrt{T}\) and \(\eta = 2\lambda\). Then, for any \(T > 0\), at least one of the iterates \(t \in \{0, \dots, T-1\}\) satisfies
\[
\frac{1}{m} \sum_{j=1}^{m} f_j(x^j_{t+1}) \le f(x^\star) + \frac{1}{\sqrt{T}} \|x^\star - x_0\|_2^2,
\qquad \text{and} \qquad
\frac{1}{m} \sum_{j=1}^{m} \|x^j_{t+1} - x_{t+1}\|_2^2 \le \frac{1}{\sqrt{T}} f(x^\star) + \frac{1}{T} \|x^\star - x_0\|_2^2.
\]
So, as time progresses, the iterates \(x^j_{t+1}\) concentrate around the average \(x_{t+1}\), and the average of the function values \(f_j(x^j_{t+1})\), which concentrates around \(f(x_{t+1})\), converges to the optimal value at a rate of \(1/\sqrt{T}\).
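This behavior can be observed numerically. The sketch below (same hypothetical least-squares data as in the earlier snippets) runs the ADMM loop with \(\lambda = \sqrt{T}\) and \(\eta = 2\lambda\), and reports the two quantities bounded by Theorem L16.4 at the best iterate, next to the optimal value computed directly by solving the stacked least-squares problem.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, T = 3, 5, 400
    A = [rng.standard_normal((10, n)) for _ in range(m)]
    b = [rng.standard_normal(10) for _ in range(m)]

    # Optimal value of f, computed directly by stacking the local problems.
    x_star, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
    f_star = np.mean([np.sum((A[j] @ x_star - b[j])**2) for j in range(m)])

    lam = np.sqrt(T)
    eta = 2.0 * lam
    x = np.zeros(n)
    x_loc = [np.zeros(n) for _ in range(m)]
    alpha = [np.zeros(n) for _ in range(m)]

    best = (np.inf, None, None)   # (combined score, avg local loss, avg consensus gap)
    for t in range(T):
        for j in range(m):                                          # Phase I
            alpha[j] = alpha[j] - eta * (x - x_loc[j])
            H = 2.0 * A[j].T @ A[j] + 2.0 * lam * np.eye(n)
            rhs = 2.0 * A[j].T @ b[j] + 2.0 * lam * x - alpha[j]
            x_loc[j] = np.linalg.solve(H, rhs)
        x = np.mean(x_loc, axis=0) + sum(alpha) / (2.0 * m * lam)   # Phase II
        avg_loss = np.mean([np.sum((A[j] @ x_loc[j] - b[j])**2) for j in range(m)])
        gap = np.mean([np.sum((xj - x)**2) for xj in x_loc])
        best = min(best, (avg_loss + eta / 2.0 * gap, avg_loss, gap))

    print(f"f* = {f_star:.4f}; best iterate: avg local loss = {best[1]:.4f}, "
          f"avg consensus gap = {best[2]:.2e}")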
L16.4 Further readings
The review article by Boyd, Parikh, Chu, Peleato, and Eckstein [Boy+11] is a modern and approachable introduction to ADMM and its variants and extensions.
[Boy+11] Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.
