
􀊚􀊘􀊘􀊘􀊘􀊘􀊘􀄲􀄲􀊚􀊚􀊚􀊚􀊚􀊚􀊚􀊚􀊚􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊚􀊘􀊚􀊘􀊘􀊘􀊘􀊚􀊘􀊚􀊘􀊘􀊚􀊘􀊚􀊘􀊘􀋂􀊘􀊘􀊘􀋂􀋂
􀊘􀋂􀋂􀊘􀊘􀋂􀋂􀊘􀊘􀊘􀊘
MIT 6.7220/15.084 — Nonlinear Optimization Thu, Apr 4th 2024
Lecture 11
Distributed optimization and ADMM
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
In this last lecture on first-order methods, we will touch on an important aspect of optimization in
machine learning: distributed optimization.
1 Setting
Consider again an empirical risk minimization problem, where we aim to minimize the average loss
over a large training set. When the training set is too large to fit on a single machine, or privacy
concerns require different parts of the data to be stored on different machines, it is common to
distribute the optimization problem across multiple machines.
In this setting, we assume that we have m machines, each with access to a local dataset and a local function f_j : ℝ^n → ℝ. The goal is to minimize the average of the local functions, that is,

  f(x) = (1/m) ∑_{j=1}^m f_j(x),

where the functions f_j are differentiable.
In our distributed optimization model, each machine can communicate with a central orchestrator, which can send messages to all machines and collect messages from all machines. Furthermore, each machine is powerful enough to solve most optimization problems on its local dataset.
[Figure: the orchestrator broadcasts the current iterate x_t to machines 1, ..., m; each machine j maintains a local iterate x^j_t and a local multiplier α^j_t.]
Notation: We will denote machine indices as superscripts (these are not powers!) and time indices as subscripts.
1.1 First attempt: independent optimization
One might be tempted to distribute the optimization problem by simply letting each machine compute the optimal solution on its local dataset, and then trying to combine the solutions in some way. However, without further assumptions about how the different datasets relate to each other, this approach has in general no hope of working. In the extreme, each local solution might carry virtually zero information about the global optimum; for example, consider a multi-class classification problem where each machine only has access to training data from a single class. Then, a classifier that deterministically outputs that machine's class would be an optimal local solution and yet carry no meaning globally.
1.2 Second attempt: distributed gradient computation
A more sophisticated approach is to compute the gradient of the loss function at each machine and then combine the gradients at a central location (a minimal sketch of this scheme is given at the end of this subsection). This approach would work, but it has two major drawbacks:
1. The total communication cost is mainly determined by the convergence rate of gradient descent. In particular, when the smoothness or Lipschitz constants of f are poor, gradient descent needs many iterations, and therefore many communication rounds. We will discuss today a solution that does not depend on the smoothness of f.
2. Gradient descent does not leverage the computational power of each machine very well. At each iteration, each machine simply computes a gradient. A more cost-effective approach would try to directly use the computational power of each machine to find some sort of optimal solution of the problem (while avoiding the naive approach of independent optimization discussed in Section 1.1).
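To make the comparison concrete, here is a minimal sketch (not from the notes) of the gradient-aggregation scheme of this subsection. The machines are simulated in a single process, and the quadratic local losses are purely illustrative: the orchestrator averages the local gradients and takes a plain gradient step on f.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3  # dimension and number of (simulated) machines

# Illustrative local losses f_j(x) = 0.5 * ||A_j x - b_j||^2, one per machine.
A = [rng.normal(size=(10, n)) for _ in range(m)]
b = [rng.normal(size=10) for _ in range(m)]

def local_gradient(j, x):
    """Gradient of f_j at x (computed locally on machine j)."""
    return A[j].T @ (A[j] @ x - b[j])

x = np.zeros(n)
step = 0.01  # step size; must be tuned to the smoothness of f
for t in range(500):
    # One communication round per iteration: each machine reports its local
    # gradient, and the orchestrator averages them and takes a gradient step
    # on f = (1/m) * sum_j f_j.
    g = sum(local_gradient(j, x) for j in range(m)) / m
    x = x - step * g
```

Every iteration costs a full round of communication, so the total communication grows with the number of gradient steps needed, which is exactly drawback 1 above.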
2 The ADMM algorithm
The ADMM algorithm is a distributed optimization algorithm that combines the best of both worlds:
it uses the computational power of each machine to find an optimal solution, while also ensuring
that the global solution is meaningful.
The algorithm can be thought of as a two-step process parameterized by a learning rate η > 0 and a penalty parameter λ > 0, as follows.
Phase I: Local update.
The orchestrator sends the current x_t to each machine. Each machine j computes its updated multiplier

  α^j_t = α^j_{t−1} − η (x_t − x^j_t)

and optimizes on its local training set

  x^j_{t+1} = arg min_{x^j ∈ ℝ^n} { f_j(x^j) + λ ‖x^j − x_t‖₂² + ⟨α^j_t, x^j − x_t⟩ }.
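For intuition, the local step in Phase I is a regularized problem that each machine can solve with whatever solver it prefers. Below is a minimal sketch, assuming a generic differentiable f_j and using scipy's general-purpose minimizer; all names are illustrative and not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

def local_update(f_j, x_t, alpha_prev, x_j_prev, eta, lam):
    """Phase I on machine j: multiplier step, then the local argmin."""
    # Multiplier update: alpha^j_t = alpha^j_{t-1} - eta * (x_t - x^j_t).
    alpha_j = alpha_prev - eta * (x_t - x_j_prev)

    def objective(z):
        # f_j(z) + lam * ||z - x_t||_2^2 + <alpha^j_t, z - x_t>
        return f_j(z) + lam * np.sum((z - x_t) ** 2) + alpha_j @ (z - x_t)

    # Warm-start the local solve at the orchestrator's current iterate.
    result = minimize(objective, x0=x_t)
    return result.x, alpha_j

# Example with an illustrative quadratic local loss.
f1 = lambda z: 0.5 * np.sum((z - np.array([1.0, -2.0])) ** 2)
x_new, alpha_new = local_update(f1, np.zeros(2), np.zeros(2), np.zeros(2), eta=1.0, lam=0.5)
```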
Phase II: Synchronization.
Each machine reports its local solution x^j_{t+1} and its local multiplier α^j_t to the orchestrator. The orchestrator computes

  x_{t+1} = (1/m) ∑_{j=1}^m x^j_{t+1} + (1/(2mλ)) ∑_{j=1}^m α^j_t.
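Putting the two phases together, here is a minimal single-process simulation of the loop; it is only a sketch, not the notes' implementation, and the quadratic local losses are illustrative. For quadratic f_j the Phase I argmin has a closed form, which keeps the code short; the assert checks numerically the invariant established later in Theorem 3.1 (the multipliers keep summing to zero as long as they start at zero).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 5, 3, 200
lam = 1.0
eta = 2 * lam  # eta = 2*lambda, as required by the analysis in Section 3

# Illustrative local losses f_j(x) = 0.5 * ||A_j x - b_j||^2.
A = [rng.normal(size=(10, n)) for _ in range(m)]
b = [rng.normal(size=10) for _ in range(m)]

x = np.zeros(n)                          # orchestrator iterate x_t
x_loc = [np.zeros(n) for _ in range(m)]  # local iterates x^j_t
alpha = [np.zeros(n) for _ in range(m)]  # local multipliers alpha^j_t

for t in range(T):
    # Phase I: each machine updates its multiplier and solves its local problem.
    for j in range(m):
        alpha[j] = alpha[j] - eta * (x - x_loc[j])
        # Closed-form argmin of f_j(z) + lam*||z - x||^2 + <alpha_j, z - x>:
        # (A_j^T A_j + 2*lam*I) z = A_j^T b_j + 2*lam*x - alpha_j.
        lhs = A[j].T @ A[j] + 2 * lam * np.eye(n)
        rhs = A[j].T @ b[j] + 2 * lam * x - alpha[j]
        x_loc[j] = np.linalg.solve(lhs, rhs)

    # Phase II: the orchestrator averages local solutions and multipliers.
    x = np.mean(x_loc, axis=0) + sum(alpha) / (2 * m * lam)

    # Invariant of Theorem 3.1: the multipliers sum to zero at all times.
    assert np.allclose(sum(alpha), 0.0, atol=1e-8)
```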
2.1 ADMM as Lagrangian relaxation
At this point, one might wonder how the update rules of ADMM were picked. The answer is pretty
ingenious. ADMM is based on the fundamental idea of rewriting the problem
  v ≔ min_x (1/m) ∑_{j=1}^m f_j(x)

as the equivalent problem

  min_{x, x^1, ..., x^m}  (1/m) ∑_{j=1}^m ( f_j(x^j) + λ ‖x^j − x‖₂² )
  s.t.  x^j = x,  j = 1, ..., m.      (1)

(The two problems are equivalent because any feasible point of (1) has x^j = x for all j, so the penalty terms vanish and the objective reduces to the original one.)
The next step is to relax the equality constraints x^j = x by introducing the Lagrangian relaxation of the problem, in which the constraints are replaced by a penalization term in the objective function. In particular, the following key result shows that the maximum value of the Lagrangian relaxation is the same as the value of the original problem when all f_j are convex.
Theorem 2.1. Assuming all f_j are convex, the optimal value of the original problem is equal to the maximum of the Lagrangian function over all multipliers α^j, that is,

  v = max_{α^1, ..., α^m} min_{x, x^j} { ℒ(x, x^j; α^j) ≔ (1/m) ∑_{j=1}^m ( f_j(x^j) + λ ‖x^j − x‖₂² + ⟨α^j, x^j − x⟩ ) }.

(One direction always holds: for any choice of multipliers, restricting the inner minimization to points with x^j = x shows that the inner minimum is at most v, so the maximum over the multipliers is at most v; convexity is what guarantees equality.)
In light of this result, the ADMM algorithm can be seen as a way to maximize the Lagrangian function over the multipliers α^j by performing alternating optimization steps on different groups of variables:
• Assuming fixed x_{t−1} and x^j_{t−1}, the multipliers α^j_{t−1} are updated by gradient ascent, performing a step in the direction of the gradient of the objective function:

  α^j_t = α^j_{t−1} + η (x^j_{t−1} − x_{t−1}) = α^j_{t−1} − η (x_{t−1} − x^j_{t−1}).

• Assuming now fixed α^j_t and x_t, each machine now solves the optimization problem

  x^j_{t+1} = arg min_{x^j ∈ ℝ^n} { f_j(x^j) + λ ‖x^j − x_t‖₂² + ⟨α^j_t, x^j − x_t⟩ }.

• Assuming now fixed α^j_t and x^j_{t+1}, the orchestrator computes the next iterate x_{t+1} by solving the optimization problem

  x_{t+1} = arg min_{x ∈ ℝ^n} { (1/m) ∑_{j=1}^m ( f_j(x^j_{t+1}) + λ ‖x^j_{t+1} − x‖₂² + ⟨α^j_t, x^j_{t+1} − x⟩ ) }
          = arg min_{x ∈ ℝ^n} { (1/m) ∑_{j=1}^m ( λ ‖x^j_{t+1} − x‖₂² − ⟨α^j_t, x⟩ ) }.

  By setting the gradient with respect to x to 0 and solving for x, we obtain

    (1/m) ∑_{j=1}^m ( 2λ (x − x^j_{t+1}) − α^j_t ) = 0.

  So, the optimal x_{t+1} is written in closed form as

    x_{t+1} = (1/m) ∑_{j=1}^m x^j_{t+1} + (1/(2mλ)) ∑_{j=1}^m α^j_t.
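As a quick sanity check of this last step, one can verify symbolically (here with sympy, in an assumed two-machine scalar instance; the variable names are illustrative) that solving the stationarity condition for x reproduces the Phase II update:

```python
import sympy as sp

lam = sp.symbols("lam", positive=True)
x = sp.symbols("x")
x1, x2, a1, a2 = sp.symbols("x1 x2 a1 a2")  # x^j_{t+1} and alpha^j_t for m = 2
m = 2

# Stationarity: (1/m) * sum_j ( 2*lam*(x - x^j_{t+1}) - alpha^j_t ) = 0.
stationarity = sp.Rational(1, m) * ((2 * lam * (x - x1) - a1) + (2 * lam * (x - x2) - a2))
solution = sp.solve(sp.Eq(stationarity, 0), x)[0]

# Closed form: mean of the x^j_{t+1} plus the sum of the alpha^j_t over 2*m*lam.
closed_form = (x1 + x2) / m + (a1 + a2) / (2 * m * lam)
assert sp.simplify(solution - closed_form) == 0
```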
3 Analysis
The analysis of the correctness of ADMM can be broken down into several steps.
3.1 Part I: Orchestrating machine
Theorem 3.1. If the initial multipliers α^j_0 are chosen such that ∑_{j=1}^m α^j_0 = 0, then, at all times t,

  ∑_{j=1}^m α^j_t = 0,   and   x_{t+1} = (1/m) ∑_{j=1}^m x^j_{t+1}.
Proof. By definition,

  α^j_{t+1} = α^j_t − η (x_{t+1} − x^j_{t+1}).

So, averaging over j = 1, ..., m, we can write

  (1/m) ∑_{j=1}^m α^j_{t+1} = (1/m) ∑_{j=1}^m α^j_t − η ( x_{t+1} − (1/m) ∑_{j=1}^m x^j_{t+1} ).      (2)

Using the definition of x_{t+1}, we also have

  x_{t+1} = ( (1/m) ∑_{j=1}^m x^j_{t+1} ) + (1/(2mλ)) ∑_{j=1}^m α^j_t,   that is,   x_{t+1} − (1/m) ∑_{j=1}^m x^j_{t+1} = (1/(2mλ)) ∑_{j=1}^m α^j_t.      (3)

Substituting the last expression into (2), we get

  (1/m) ∑_{j=1}^m α^j_{t+1} = (1/m) ∑_{j=1}^m α^j_t − (η/(2mλ)) ∑_{j=1}^m α^j_t = (1/m) (1 − η/(2λ)) ∑_{j=1}^m α^j_t.

This shows, by induction, that ∑_{j=1}^m α^j_t = 0 at all t as long as the assumption holds for t = 0. Plugging the expression ∑_{j=1}^m α^j_t = 0 into (3) then yields

  x_{t+1} = (1/m) ∑_{j=1}^m x^j_{t+1}

at all times, which is the second part of the claim.
3.2 Part II: Worker machines
Theorem 3.2 (Euclidean mirror descent lemma for ADMM). Let each f_j : ℝ^n → ℝ be convex, and let η = 2λ. Then, for any y ∈ ℝ^n we have

  (1/m) ∑_{j=1}^m f_j(x^j_{t+1}) ≤ f(y) + (η/2) ( ‖y − x_t‖₂² − ‖y − x_{t+1}‖₂² − ‖x_t − x_{t+1}‖₂² )
                                        + (1/m) ∑_{j=1}^m (1/(2η)) ( ‖α^j_t‖₂² − ‖α^j_{t+1}‖₂² − ‖α^j_{t+1} − α^j_t‖₂² ).
Proof. Recall that each machine computes the update

  x^j_{t+1} = arg min_{x^j ∈ ℝ^n} { f_j(x^j) + λ ‖x^j − x_t‖₂² + ⟨α^j_t, x^j − x_t⟩ }.

As the domain is open, the first-order necessary optimality conditions imply that

  ∇f_j(x^j_{t+1}) + 2λ (x^j_{t+1} − x_t) + α^j_t = 0.      (4)

Since f_j is convex, by the linear lower bound property of convex functions we can write

  f_j(y) ≥ f_j(x^j_{t+1}) + ⟨∇f_j(x^j_{t+1}), y − x^j_{t+1}⟩
         = f_j(x^j_{t+1}) + ⟨−2λ (x^j_{t+1} − x_t) − α^j_t, y − x^j_{t+1}⟩.      (from (4))

Rearranging the terms and using the hypothesis η = 2λ, we get

  f_j(x^j_{t+1}) ≤ f_j(y) + η ⟨x^j_{t+1} − x_t, y − x^j_{t+1}⟩ + ⟨α^j_t, y − x^j_{t+1}⟩.      (5)

At this point, we can use the update rule α^j_{t+1} = α^j_t − η (x_{t+1} − x^j_{t+1}) to write

  x^j_{t+1} = x_{t+1} + (α^j_{t+1} − α^j_t) / η.

Substituting the above expression into the first inner product of (5) then yields

  f_j(x^j_{t+1}) ≤ f_j(y) + η ⟨x_{t+1} − x_t, y − x^j_{t+1}⟩ + ⟨α^j_{t+1}, y − x^j_{t+1}⟩.

(Note the change in time index from α^j_t to α^j_{t+1} in the last term.) Substituting the expression for x^j_{t+1} into the last inner product, we finally obtain

  f_j(x^j_{t+1}) ≤ f_j(y) + η ⟨x_{t+1} − x_t, y − x^j_{t+1}⟩ + ⟨α^j_{t+1}, y − x_{t+1}⟩ − (1/η) ⟨α^j_{t+1}, α^j_{t+1} − α^j_t⟩.      (6)

From Theorem 3.1, we know that

  (1/m) ∑_{j=1}^m ⟨α^j_{t+1}, y − x_{t+1}⟩ = 0,   and   (1/m) ∑_{j=1}^m ⟨x_{t+1} − x_t, y − x^j_{t+1}⟩ = ⟨x_{t+1} − x_t, y − x_{t+1}⟩

for all t. Furthermore, we know that f(y) = (1/m) ∑_{j=1}^m f_j(y). Hence, averaging (6) across all machines j = 1, ..., m, we obtain

  (1/m) ∑_{j=1}^m f_j(x^j_{t+1}) ≤ f(y) + η ⟨x_{t+1} − x_t, y − x_{t+1}⟩ − (1/(mη)) ∑_{j=1}^m ⟨α^j_{t+1}, α^j_{t+1} − α^j_t⟩.

Substituting the three-point equalities

  ⟨x_{t+1} − x_t, y − x_{t+1}⟩ = (1/2) ( ‖x_t − y‖₂² − ‖x_{t+1} − y‖₂² − ‖x_t − x_{t+1}‖₂² )

and

  −⟨α^j_{t+1}, α^j_{t+1} − α^j_t⟩ = (1/2) ( ‖α^j_t‖₂² − ‖α^j_{t+1}‖₂² − ‖α^j_{t+1} − α^j_t‖₂² )

yields the statement.
3.3 Putting the pieces together: telescoping step
The sum on the right-hand side of Theorem 3.2 telescopes. In particular, since α^j_0 = 0 for all j, we have

  (1/(mT)) ∑_{t=0}^{T−1} ∑_{j=1}^m f_j(x^j_{t+1}) ≤ f(y) + (η/(2T)) ‖y − x_0‖₂² − (1/(2ηmT)) ∑_{t=0}^{T−1} ∑_{j=1}^m ‖α^j_{t+1} − α^j_t‖₂².
Since by definition α^j_{t+1} − α^j_t = η (x^j_{t+1} − x_{t+1}), the last term satisfies

  (1/(2ηmT)) ∑_{t=0}^{T−1} ∑_{j=1}^m ‖α^j_{t+1} − α^j_t‖₂² = (η/(2mT)) ∑_{t=0}^{T−1} ∑_{j=1}^m ‖x^j_{t+1} − x_{t+1}‖₂²,
leading to

  (1/(mT)) ∑_{t=0}^{T−1} ∑_{j=1}^m f_j(x^j_{t+1}) ≤ f(y) + (η/(2T)) ‖y − x_0‖₂² − (η/(2mT)) ∑_{t=0}^{T−1} ∑_{j=1}^m ‖x^j_{t+1} − x_{t+1}‖₂².
Rearranging the terms and applying y = x⋆, we therefore have

  (1/T) ∑_{t=0}^{T−1} (1/m) ∑_{j=1}^m ( f_j(x^j_{t+1}) + (η/2) ‖x^j_{t+1} − x_{t+1}‖₂² ) ≤ f(x⋆) + (η/(2T)) ‖x⋆ − x_0‖₂².
Thus, we have the following guarantee.
Theorem 3.3. Let each f_j : ℝ^n → ℝ be convex and nonnegative, and let λ = √T and η = 2λ. Then, for any T > 0, at least one of the iterates t ∈ {0, ..., T − 1} satisfies

  (1/m) ∑_{j=1}^m f_j(x^j_{t+1}) ≤ f(x⋆) + (1/√T) ‖x⋆ − x_0‖₂²,

and

  (1/m) ∑_{j=1}^m ‖x^j_{t+1} − x_{t+1}‖₂² ≤ (1/√T) f(x⋆) + (1/T) ‖x⋆ − x_0‖₂².
So, as time progresses, the iterates x^j_{t+1} concentrate around the average x_{t+1}, and the average of the function values f_j(x^j_{t+1}), which concentrates around f(x_{t+1}), converges to the optimal value at a rate of 1/√T.
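To see this behavior numerically, one can rerun the earlier sketch with the parameters of Theorem 3.3, λ = √T and η = 2λ, and track both the average local loss and the disagreement (1/m) ∑_j ‖x^j_{t+1} − x_{t+1}‖₂². The snippet below is again only a sketch with illustrative quadratic local losses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 5, 3, 400
lam = np.sqrt(T)   # lambda = sqrt(T) and eta = 2*lambda, as in Theorem 3.3
eta = 2 * lam

A = [rng.normal(size=(10, n)) for _ in range(m)]
b = [rng.normal(size=10) for _ in range(m)]
f = lambda j, z: 0.5 * np.sum((A[j] @ z - b[j]) ** 2)  # nonnegative local losses

x = np.zeros(n)
x_loc = [np.zeros(n) for _ in range(m)]
alpha = [np.zeros(n) for _ in range(m)]

avg_loss, disagreement = [], []
for t in range(T):
    for j in range(m):
        alpha[j] = alpha[j] - eta * (x - x_loc[j])
        lhs = A[j].T @ A[j] + 2 * lam * np.eye(n)
        rhs = A[j].T @ b[j] + 2 * lam * x - alpha[j]
        x_loc[j] = np.linalg.solve(lhs, rhs)
    x = np.mean(x_loc, axis=0) + sum(alpha) / (2 * m * lam)
    avg_loss.append(np.mean([f(j, x_loc[j]) for j in range(m)]))
    disagreement.append(np.mean([np.sum((x_loc[j] - x) ** 2) for j in range(m)]))

# Rough check: both quantities become small along the run (Theorem 3.3 guarantees
# that some single iterate makes both small at once).
print("smallest average local loss:", min(avg_loss))
print("smallest disagreement:", min(disagreement))
```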
4 Further readings
The review article by Boyd et al. [Boy+11] is a modern and approachable introduction to ADMM and its variants and extensions.
[Boy+11] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, and others, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.