Definition (Gradient)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be sufficiently differentiable. The Gradient at $\mathbf{x} \in \mathbb{R}^n$ is defined as

$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}(\mathbf{x}), \ldots, \frac{\partial f}{\partial x_n}(\mathbf{x}) \right]^\top,$$

that is, one partial derivative $\frac{\partial f}{\partial x_i}(\mathbf{x})$ for each component $x_i$ of $\mathbf{x}$.
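
As a quick sanity check, here is a minimal sketch (the function $f(x, y) = x^2 + 3xy$ below is an illustrative assumption, not from these notes) comparing an analytic gradient against a central finite-difference approximation:

```python
import numpy as np

# Hypothetical example function f(x, y) = x^2 + 3xy (not from these notes).
def f(v):
    x, y = v
    return x**2 + 3 * x * y

# Analytic gradient: one partial derivative per component of the input.
def grad_f(v):
    x, y = v
    return np.array([2 * x + 3 * y, 3 * x])

# Central finite differences approximate each partial derivative numerically.
def numerical_grad(f, v, h=1e-6):
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

x = np.array([1.0, 2.0])
print(grad_f(x))             # [8. 3.]
print(numerical_grad(f, x))  # approximately [8. 3.]
```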

First-Order Taylor Expansion

Optimization algorithms use the following expansion to predict how the function value will change if we move a tiny amount in a certain direction (which is the step $\boldsymbol{\delta} \in \mathbb{R}^n$):

$$f(\mathbf{x} + \boldsymbol{\delta}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} + o(\|\boldsymbol{\delta}\|),$$

which is essentially linear approximation, but in $\mathbb{R}^n$.

Recall how we do linear approximation for some simple curve in $\mathbb{R}^2$:

$$f(x + \delta) \approx f(x) + f'(x)\,\delta.$$

The Taylor Expansion generalizes the tangent line into $n$ dimensions. The $o(\|\boldsymbol{\delta}\|)$ term is essentially the remainder of the rest of the terms of the Taylor Series. As $\|\boldsymbol{\delta}\| \to 0$, the remainder $o(\|\boldsymbol{\delta}\|) \to 0$ faster than $\|\boldsymbol{\delta}\|$ itself. In first-order optimization, we assume $\boldsymbol{\delta}$ is so small that $o(\|\boldsymbol{\delta}\|)$ is essentially negligible.
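
To see the remainder behave like $o(\|\boldsymbol{\delta}\|)$ numerically, here is a small sketch (the test function and point below are assumptions for illustration): the gap between $f(\mathbf{x} + \boldsymbol{\delta})$ and the linear prediction, divided by $\|\boldsymbol{\delta}\|$, shrinks as $\boldsymbol{\delta}$ shrinks.

```python
import numpy as np

# Hypothetical smooth test function and its analytic gradient (not from these notes).
def f(v):
    return np.sin(v[0]) + v[0] * v[1]**2

def grad_f(v):
    return np.array([np.cos(v[0]) + v[1]**2, 2 * v[0] * v[1]])

x = np.array([0.5, -1.0])
direction = np.array([1.0, 2.0]) / np.sqrt(5)  # fixed unit direction

for eps in [1e-1, 1e-2, 1e-3]:
    delta = eps * direction
    linear_pred = f(x) + grad_f(x) @ delta  # first-order Taylor prediction
    remainder = f(x + delta) - linear_pred  # this is the o(||delta||) term
    print(eps, remainder / np.linalg.norm(delta))  # ratio shrinks toward 0
```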

From the previous equation, by the Multivariate MVT¹ we get

$$f(\mathbf{x} + \boldsymbol{\delta}) = f(\mathbf{x}) + \nabla f(\mathbf{x} + c\boldsymbol{\delta})^\top \boldsymbol{\delta}$$

for some $c \in (0, 1)$. The last line needs some justification. Let

$$g(t) = f(\mathbf{x} + t\boldsymbol{\delta})$$

such that $g(0) = f(\mathbf{x})$ when $t = 0$ and $g(1) = f(\mathbf{x} + \boldsymbol{\delta})$ when $t = 1$. By applying the single-variable MVT on this parameterized curve,

$$g(1) - g(0) = g'(c)(1 - 0) = g'(c)$$

for some $c \in (0, 1)$. But when $g'(c)$ is calculated via the Chain Rule, it becomes the gradient of $f$ at that middle point:

$$g'(c) = \nabla f(\mathbf{x} + c\boldsymbol{\delta})^\top \boldsymbol{\delta},$$

and by substitution,

$$f(\mathbf{x} + \boldsymbol{\delta}) - f(\mathbf{x}) = g(1) - g(0) = \nabla f(\mathbf{x} + c\boldsymbol{\delta})^\top \boldsymbol{\delta},$$

which is our desired result.
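
As a concrete (if informal) check of the argument, this sketch picks a made-up $f$, $\mathbf{x}$, and $\boldsymbol{\delta}$, builds the slice $g(t) = f(\mathbf{x} + t\boldsymbol{\delta})$, and scans $t \in (0, 1)$ for a point $c$ where $g'(c)$ matches $g(1) - g(0)$:

```python
import numpy as np

# Hypothetical example function and its gradient (not from these notes).
def f(v):
    return v[0]**2 + np.exp(v[1])

def grad_f(v):
    return np.array([2 * v[0], np.exp(v[1])])

x = np.array([1.0, 0.0])
delta = np.array([0.5, 0.3])

def g(t):        # g(t) = f(x + t*delta), a one-dimensional slice of f
    return f(x + t * delta)

def g_prime(t):  # Chain Rule: g'(t) = grad f(x + t*delta) . delta
    return grad_f(x + t * delta) @ delta

target = g(1.0) - g(0.0)  # left-hand side of the MVT identity
ts = np.linspace(0.0, 1.0, 10001)
c = ts[np.argmin(np.abs([g_prime(t) - target for t in ts]))]
print(c, g_prime(c), target)  # g'(c) matches g(1) - g(0) for some c in (0, 1)
```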

Maximizing Function Value via Level Sets

Suppose we wanted to maximize $f$ via some tiny distance traveled. In particular, we want to pick a direction $\mathbf{v}$ (with $\|\mathbf{v}\| = 1$) for some magnitude $\epsilon > 0$ such that

$$f(\mathbf{x} + \epsilon\mathbf{v}) > f(\mathbf{x}).$$

Visually, we can see that the set of directions that make any progress on this function is “half” of the directions we can go. Imagine, in $\mathbb{R}^3$, a sphere of directions we can go from some arbitrary point; half the sphere consists of directions that would advance our objective function.

More precisely, via the First-Order Taylor Expansion, we get

$$f(\mathbf{x} + \epsilon\mathbf{v}) \approx f(\mathbf{x}) + \epsilon\,\nabla f(\mathbf{x})^\top \mathbf{v},$$

so the inequality above holds precisely when the second term $\epsilon\,\nabla f(\mathbf{x})^\top \mathbf{v}$ is positive. Conversely, if we wanted to perform Gradient Descent, we want to minimize $f$, and we want our second term to be negative.
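
Both observations are easy to see numerically. In this sketch (with a hypothetical quadratic objective, not from these notes), roughly half of random unit directions give a positive first-order term, and a small step along $+\nabla f$ raises $f$ while a step along $-\nabla f$ lowers it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical objective and its gradient (not from these notes).
def f(v):
    return v[0]**2 + 2 * v[1]**2

def grad_f(v):
    return np.array([2 * v[0], 4 * v[1]])

x = np.array([1.0, -1.0])
g = grad_f(x)

# Roughly half of all unit directions make the first-order term positive.
dirs = rng.normal(size=(10000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(np.mean(dirs @ g > 0))  # close to 0.5

# An ascent step (+gradient) increases f; a descent step (-gradient) decreases it.
eps = 1e-2
print(f(x), f(x + eps * g), f(x - eps * g))
```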

Naive Algorithm for Gradient Ascent/Descent

With the same picture, if we pick any direction $\mathbf{v}$ and also consider its inverse $-\mathbf{v}$, we have a chance to maximize $f$ by picking the better of the two options. Unfortunately, this has the same pitfalls as Gradient Descent. In particular, it has no “bigger picture” of the space it inhabits, so we can get stuck at a local minimum, skip the global minimum via too large a step size $\epsilon$, and so on.

This is also known as the Three-Point Algorithm. This kind of optimization is called zero-order optimization: we only evaluate things via $f$ itself. In other words, we take zero derivatives to determine a better direction to travel.
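
Here is a minimal sketch of this zero-order loop for minimization (the objective, step size, and iteration count are illustrative assumptions; keeping the max instead of the min would give the ascent version):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical objective to minimize (not from these notes); non-smooth at y = -1.
def f(v):
    return (v[0] - 3)**2 + np.abs(v[1] + 1)

def three_point_step(f, x, eps):
    """Zero-order step: compare f at x, x + eps*v, x - eps*v and keep the best."""
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    candidates = [x, x + eps * v, x - eps * v]
    return min(candidates, key=f)  # only function evaluations, no derivatives

x = np.array([0.0, 0.0])
for _ in range(2000):
    x = three_point_step(f, x, eps=0.05)
print(x, f(x))  # should end up near (3, -1), the minimizer
```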

This algorithm can perform better than Gradient Descent:

  1. If there are places where the gradient does not exist.
  2. If there are many local minima. Consider, in $\mathbb{R}^2$, a curve riddled with small wells on the way down to the global minimum; gradient descent can get stuck in these wells.

Second-Order Taylor Expansion

We can further expand the First-Order Taylor Expansion:

$$f(\mathbf{x} + \boldsymbol{\delta}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} + \frac{1}{2}\,\boldsymbol{\delta}^\top \nabla^2 f(\mathbf{x})\,\boldsymbol{\delta} + o(\|\boldsymbol{\delta}\|^2).$$

Note that the $\nabla^2 f(\mathbf{x})$ term is known as the Hessian: the $n \times n$ matrix of second partial derivatives, with entries $\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x})$.
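
As a sketch (with an assumed test function and step, not from these notes), the second-order prediction that includes the Hessian term tracks $f(\mathbf{x} + \boldsymbol{\delta})$ more closely than the first-order one:

```python
import numpy as np

# Hypothetical test function, gradient, and Hessian (not from these notes).
def f(v):
    x, y = v
    return x**2 * y + np.sin(y)

def grad_f(v):
    x, y = v
    return np.array([2 * x * y, x**2 + np.cos(y)])

def hess_f(v):
    x, y = v
    return np.array([[2 * y,       2 * x],
                     [2 * x, -np.sin(y)]])

x = np.array([1.0, 0.5])
delta = np.array([0.1, -0.05])

first_order = f(x) + grad_f(x) @ delta
second_order = first_order + 0.5 * delta @ hess_f(x) @ delta
print(f(x + delta), first_order, second_order)  # the second-order prediction is closer
```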

Footnotes

  1. See these notes for more information about the Generalized Mean Value Theorem for multivariate real-valued functions.