Unconstrained Minima and Small Gradients

1. Introduction

Differentiability plays a central role in finding (unconstrained) local minima or maxima of functions in practice. Indeed, let $f: E \to\, \mathbb{R}$ be a differentiable real-valued function defined on the Euclidean space $E$ (if this sounds too vague, feel free to assume $E =\,\mathbb{R}^n$ with the usual scalar product). If $x \in E$ is a local minimizer of $f$, Fermat’s theorem guarantees that $\nabla f(x) = 0$. This post is concerned with the following generalization of Fermat’s theorem:
Question 1 (Existence of small gradients).

If $f: E \to\, \mathbb{R}$ is (Gâteaux) differentiable, lower semicontinuous and lower bounded, can we conclude that it has points at which it is “arbitrarily flat”?

More formally, is it true that $\forall\ \epsilon > 0$ there exists $x_\epsilon\in E$ such that $\|\nabla f(x_\epsilon) \| \le \epsilon$?

Clearly, if $f$ is not lower bounded, there is no hope for a positive answer: a trivial counterexample is the identity function $f(x) = x$ on $\mathbb{R}$, whose derivative is constantly equal to $1$ throughout the whole domain. Furthermore, the question is only interesting for functions without critical points (like $f(x) = \tanh x$); otherwise, Fermat’s theorem would trivially answer the question in the affirmative.
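To see concretely what “arbitrarily flat” means here, below is a minimal Python sketch (my own illustration, not part of the original argument) that evaluates the derivative of $f(x) = \tanh x$: it is nonzero everywhere, yet drops below any fixed $\epsilon$ once $x$ is large enough.

```python
import numpy as np

# f(x) = tanh(x) has no critical points: f'(x) = 1 - tanh(x)^2 > 0 everywhere.
# Still, the derivative becomes arbitrarily small far from the origin.
def grad_tanh(x):
    return 1.0 - np.tanh(x) ** 2

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   f'(x) = {grad_tanh(x):.2e}")
# f'(0) = 1, while f'(10) ~ 8e-9: below any fixed eps for x large enough.
```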

2. Attainment

This section contains a quick review of a few essential results about the existence of unconstrained optima.
Theorem 1 (Weierstrass).
Let $f: D \to\ \mathbb{R}$ be a lower semicontinuous function defined over the nonempty closed and bounded domain $D \subseteq E$, where $E$ is a Euclidean space. Then $f$ admits a global minimizer $x^* \in D$, that is \[ f(x^*) \le f(x) \quad \forall\,x\in D. \]
Proof. Let $m := \inf f(D)$, where $m$ might a priori be $-\infty$, and let $\{x_i\} \subseteq D$ be a sequence such that $f(x_i) \to m$. Since $\{x_i\}$ is a bounded sequence, by Bolzano-Weierstrass there must exist a subsequence $\{x_{k_i}\}$ converging to some $x^*$, and $x^* \in D$ because $D$ is closed. Since every subsequence of a convergent sequence converges to the same limit, we have $\lim f(x_{k_i}) = m$. Using the fact that $f$ is lower semicontinuous, we get $f(x^*) \le \lim f(x_{k_i}) = m$; since $f(x^*)$ is a real number, $m$ must be finite, and by construction of $m$ this implies $f(x^*) = m$. $\square$

An immediate consequence of Weierstrass’ theorem is that if $f$ is lower semicontinuous and has a nonempty bounded sublevel set, then $f$ attains a global minimum on $E$:
Corollary 1.
Let $f: E \to\,\mathbb{R}$ be a lower semicontinuous function with a nonempty bounded sublevel set $S_\alpha := \{x \in E: f(x) \le \alpha\}$. Then $f$ has a global minimizer $x^* \in E$.
Proof. Note that $S_\alpha = f^{-1}((-\infty, \alpha])$ is closed (this is one of the equivalent definitions of lower semicontinuity). Since by hypothesis $S_\alpha$ is nonempty and bounded, by Weierstrass’ theorem above there exists $x^* \in S_\alpha$ such that $f(x^*) \le f(x)$ for all $x \in S_\alpha$. Since $f(x) \le \alpha < f(y)$ for all $x \in S_\alpha$ and $y \notin S_\alpha$, we conclude that $f(x^*) \le f(x)$ for all $x \in E$. $\square$

2.1. Coercive functions

It turns out that in practice the bounded-sublevel-set hypothesis of Corollary 1 can be replaced by a condition that is often much easier to check.
Definition 1 (Coercive function).
A function $f: D \to\, \mathbb{R}$ defined on a subset $D$ of the Euclidean space $E$ is called coercive if $f(x) \to \infty$ as $\|x\|\to\infty$.
A direct consequence of the definition above is that all sublevel sets of a coercive function are bounded: if some $S_\alpha$ were unbounded, it would contain points of arbitrarily large norm with $f(x) \le \alpha$, contradicting coercivity. Hence, the following is an easy corollary of Corollary 1.
Corollary 2.
Let $f: E \to\,\mathbb{R}$ be a lower semicontinuous coercive function. Then $f$ has a global minimizer $x^* \in E$.
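As a quick sanity check, here is a small Python sketch (the test function and the multi-start heuristic are my own illustrative choices, not from the post): the nonconvex but coercive function $f(x) = \|x\|^2 + 3\sin^2 x_1 + 3\sin^2 x_2$ must attain a global minimum by Corollary 2, and a multi-start local search (a heuristic, of course, not a proof) indeed lands on it.

```python
import numpy as np
from scipy.optimize import minimize

# Coercive but nonconvex: f(x) = ||x||^2 + 3*(sin(x_1)^2 + sin(x_2)^2).
# Since f(x) >= ||x||^2 -> infinity as ||x|| -> infinity, Corollary 2
# guarantees a global minimizer exists.
def f(x):
    return np.dot(x, x) + 3.0 * np.sum(np.sin(x) ** 2)

rng = np.random.default_rng(0)
best = min(
    (minimize(f, rng.uniform(-5.0, 5.0, size=2)) for _ in range(20)),
    key=lambda res: res.fun,
)
print(best.x, best.fun)  # the global minimum f(0) = 0, attained at the origin
```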

3. Existence of small gradients

When $f$ is Gâteaux differentiable, the following well-known necessary condition can be proven:
Theorem 2 (Fermat).
Let $f: D \to\,\mathbb{R}$ be Gâteaux differentiable on the open domain $D \subseteq E$, where $E$ is a Euclidean space, and let $x^* \in D$ be a local minimizer of $f$. Then $\nabla f(x^*) = 0$.
Proof. Since $x^*$ is a local minimizer and $D$ is open, for each $d \in E$ we have $f(x^* + td) \ge f(x^*)$ for all sufficiently small $t > 0$, and hence \[ f'(x^*; d) = \lim_{t \downarrow 0} \frac{f(x^* + td) - f(x^*)}{t} \ge 0. \] Since for a Gâteaux differentiable function it must be $f'(x^*; d) = -f'(x^*; -d)$, we conclude that $f'(x^*; d) = 0$ for all $d \in E$. Hence, $\nabla f(x^*) = 0$. $\square$

But what about cases where $f$ does not have any local minimizer (or maximizer)? Perhaps surprisingly, the following can be proven:
Theorem 3 (Approximate Fermat’s theorem).
Let $f: E \to\,\mathbb{R}$ be a Gâteaux differentiable, lower semicontinuous and lower bounded function. Then, for all $\epsilon > 0$ there exists $x_\epsilon \in E$ such that $\|\nabla f(x_\epsilon)\| \le \epsilon$.
Proof. For each $\epsilon > 0$, consider the function $f_\epsilon(x) = f(x) + \epsilon \|x\|$. Since $f$ is lower bounded, $f_\epsilon(x) \ge \inf f + \epsilon\|x\| \to \infty$ as $\|x\| \to \infty$, so $f_\epsilon$ is coercive; it is also lower semicontinuous, and by Corollary 2 it attains a global minimum at some $x^*_\epsilon$. At this point, it might seem tempting to consider the gradient of $f_\epsilon$ at $x^*_\epsilon$; however, this would require $f_\epsilon$ to be Gâteaux differentiable, which is generally not true (the norm is not differentiable at the origin). Instead, we go a different route, much akin to our proof of Fermat’s theorem. Assume $\nabla f(x^*_\epsilon) \neq 0$ (otherwise we are done). Since $x^*_\epsilon$ is a global minimizer of $f_\epsilon$, for all directions $d \in E$ and for all $t > 0$ we must have \[ \frac{1}{t}\left(f_\epsilon(x^*_\epsilon - td) - f_\epsilon(x^*_\epsilon)\right) \ge 0. \] Expanding the definition of $f_\epsilon$ we obtain \[ \frac{1}{t}\left(f(x^*_\epsilon - td) - f(x^*_\epsilon) + \epsilon (\|x^*_\epsilon - td\| - \|x^*_\epsilon\|)\right) \ge 0. \] Using the triangle inequality $\|x^*_\epsilon - td\| - \|x^*_\epsilon\| \le t\|d\|$ we find \[ \frac{1}{t}\left(f(x^*_\epsilon - td) - f(x^*_\epsilon)\right) + \epsilon \|d\| \ge 0. \] Since $f$ is Gâteaux differentiable, taking the limit as $t \downarrow 0$ we find \[ -\langle \nabla f(x^*_\epsilon), d \rangle + \epsilon \|d\| \ge 0. \] Substituting $d = \nabla f(x^*_\epsilon)$ and dividing by $\|\nabla f(x^*_\epsilon)\|$, the claim $\|\nabla f(x^*_\epsilon)\| \le \epsilon$ follows. $\square$

Notice in the proof above that as $\epsilon \downarrow 0$, $\min f_\epsilon \to \inf f$. Indeed, fix any $\delta > 0$, and let $\bar x$ be any point such that $f(\bar x) \le \delta + \inf f$ (such an $\bar x$ must exist by definition of infimum). Then, any value of $\epsilon$ small enough that $\epsilon \|\bar x\| \le \delta$ is such that \[ \begin{aligned} \min_{x\in E} f_\epsilon(x) &\le f_\epsilon(\bar x)\\ &= f(\bar x) + \epsilon \|\bar x\|\\ &\le \inf f + 2\delta. \end{aligned} \] Since also $\min f_\epsilon \ge \inf f$ (as $f_\epsilon \ge f$), the minimum of $f_\epsilon$ converges to the infimum of $f$, and therefore $f(x^*_\epsilon) \to \inf f$ as well. We have just proved the following.
Theorem 4.
Let $f: E \to\,\mathbb{R}$ be a Gâteaux differentiable, lower semicontinuous and lower bounded function. Then, there exists a sequence $\{x_i\}$ such that \[ f(x_i) \to \inf f, \quad \nabla f(x_i) \to 0. \]
In fact, Theorem 4 can be seen as a precursor of Ekeland’s variational principle.
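The construction in the proof is easy to experiment with numerically. Below is a minimal Python sketch (the test function and the grid-based minimization are my own illustrative choices): for $f(x) = \log(1 + e^{-x})$, which is differentiable, lower bounded by $0$ and has no critical point, we minimize the perturbed function $f_\epsilon(x) = f(x) + \epsilon|x|$ over a grid and observe that $|f'(x_\epsilon)| \approx \epsilon$ (matching the bound of Theorem 3 up to grid resolution) while $f(x_\epsilon) \to \inf f = 0$, as Theorem 4 predicts.

```python
import numpy as np

# f(x) = log(1 + exp(-x)): differentiable, lower bounded by 0, no critical
# points (f'(x) < 0 everywhere), and inf f = 0 is not attained.
f = lambda x: np.logaddexp(0.0, -x)          # numerically stable log(1 + e^{-x})
grad_f = lambda x: -1.0 / (1.0 + np.exp(x))  # f'(x) = -1 / (1 + e^x)

xs = np.linspace(-50.0, 50.0, 2_000_001)  # fine grid standing in for exact minimization
for eps in [1e-1, 1e-2, 1e-3]:
    x_eps = xs[np.argmin(f(xs) + eps * np.abs(xs))]  # minimizer of f_eps = f + eps*|.|
    print(f"eps = {eps:.0e}   |f'(x_eps)| = {abs(grad_f(x_eps)):.2e}   "
          f"f(x_eps) = {f(x_eps):.2e}")
# |f'(x_eps)| ~ eps (up to grid resolution) and f(x_eps) -> 0 = inf f.
```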

4. A cool application: Gordan’s lemma

Lemma 1 (Gordan).
Let $a_1, \ldots, a_m$ be points in $\mathbb{R}^n$. Then exactly one of the following is true:
  1. The origin is in the convex hull of $\{a_1, \ldots, a_m\}$, that is there exist $\lambda_1, \ldots, \lambda_m \ge 0$, $\lambda_1 + \cdots + \lambda_m = 1$ such that $\lambda_1 a_1 + \cdots + \lambda_m a_m = 0$.
  2. All points are strictly on the same side of a hyperplane passing through the origin, that is there exists $\mu \in \mathbb{R}^n$ such that $\langle \mu, a_i \rangle < 0$ for all $i = 1, \ldots, m$.
Theorem 4 can be used to prove Gordan’s lemma without resorting to a separation argument. Perhaps mysteriously, the proof relies on the following Fréchet differentiable (hence in particular Gâteaux differentiable and lower semicontinuous) function
\[h:\,\mathbb{R}^n \to \mathbb{R},\quad \mu \mapsto \log\left(\sum_{i=1}^m \exp \langle \mu, a_i \rangle\right).\]
The reason why this function is interesting is that its gradient provides a convex combination of the $a_i$’s:
\[\nabla h(\mu) = \sum_{i=1}^m \left( \frac{\exp \langle \mu, a_i\rangle}{\sum_{j=1}^m \exp \langle \mu, a_j \rangle}\right) a_i, \tag{1}\]
where the $\beta_i(\mu) := \frac{\exp \langle \mu, a_i\rangle}{\sum_{j=1}^m \exp \langle \mu, a_j \rangle}$ are nonnegative and sum to 1.

Proof of Gordan’s lemma. We break the proof into two parts.
  • Condition 1 true $\Rightarrow$ condition 2 false.
    For contradiction, assume that both conditions hold, and take a $\mu \in \mathbb{R}^n$ that satisfies condition 2. Then, \[ \begin{aligned} 0 &= \langle \mu, 0 \rangle\\ &= \langle \mu, \lambda_1 a_1 + \cdots + \lambda_m a_m \rangle\\ &= \lambda_1 \langle \mu, a_1 \rangle + \cdots + \lambda_m \langle \mu, a_m \rangle\\ &< 0, \end{aligned} \] where the last inequality holds since at least one of the $\lambda$’s must be strictly positive.
  • Condition 2 false $\Rightarrow$ condition 1 true.
    If condition 2 is false, then for each $\mu \in\mathbb{R}^n$ there exists an $i$ such that $\langle \mu, a_i \rangle \ge 0$, so that the corresponding summand $\exp \langle \mu, a_i \rangle \ge 1$. Hence, $h(\mu) \ge 0$ and $h$ is lower bounded. We can then use Theorem 3 (say, with $\epsilon = 1/k$) to extract a sequence $\{\mu_k\}$ such that $\nabla h(\mu_k) \to 0$. Using Equation 1, we can rewrite this as \[ \sum_{i=1}^m \beta_i(\mu_k) a_i \to 0. \] Since each weight vector $(\beta_1(\mu_k), \ldots, \beta_m(\mu_k))$ lives in the unit simplex, which is compact, by Bolzano-Weierstrass we can assume without loss of generality (passing to a subsequence) that $\beta_i(\mu_k) \to {\bar \beta}_i \ge 0$, with ${\bar\beta}_1+\cdots +{\bar\beta}_m=1$, and therefore \[ \sum_{i=1}^m {\bar\beta}_i a_i = 0. \] $\square$
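To close, the proof suggests an algorithm: run gradient descent on $h$ and watch what happens. Here is a minimal Python sketch (the function name `gordan`, the step size, and the sample points are my own illustrative choices). When the origin lies in the convex hull, the softmax weights $\beta_i(\mu_k)$ converge to a certificate for condition 1; when it does not, $h$ is unbounded below and the iterates $\mu_k$ drift toward a separating direction for condition 2.

```python
import numpy as np

# Gradient descent on h(mu) = log(sum_i exp(<mu, a_i>)). The softmax weights
# beta_i(mu) form the convex combination of Equation 1: grad h(mu) = sum_i beta_i(mu) a_i.
def gordan(A, steps=5000, lr=0.1):
    mu = np.zeros(A.shape[1])
    for _ in range(steps):
        z = A @ mu
        beta = np.exp(z - z.max())       # numerically stable softmax
        beta /= beta.sum()               # the beta_i(mu) of Equation 1
        mu -= lr * (beta @ A)            # step along -grad h(mu)
    return mu, beta

# Case 1: the origin is in the convex hull of {(1, 0), (-1, 1), (-1, -1)}.
A = np.array([[1.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]])
mu, beta = gordan(A)
print("combination:", beta, "residual:", np.linalg.norm(beta @ A))  # residual close to 0

# Case 2: all points lie in the open half-plane x > 0; mu separates them.
A = np.array([[1.0, 0.0], [2.0, 1.0], [1.0, -3.0]])
mu, _ = gordan(A)
print("separating mu:", mu, "products:", A @ mu)  # all strictly negative
```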