Definition (Gradient)

Let $f : \mathbb{R}^n \to \mathbb{R}$ be sufficiently differentiable. The Gradient at $\mathbf{x} \in \mathbb{R}^n$ is defined as

$$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}(\mathbf{x}), \ldots, \frac{\partial f}{\partial x_n}(\mathbf{x}) \right]^\top,$$

that is, one partial derivative $\frac{\partial f}{\partial x_i}(\mathbf{x})$ for each component $x_i$ of $\mathbf{x}$.
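
As a quick sanity check, here is a minimal sketch (the function $f(x, y) = x^2 + 3xy$ below is an illustrative assumption, not from these notes) comparing an analytic gradient against a central finite-difference approximation:

```python
import numpy as np

# Hypothetical example function f(x, y) = x^2 + 3xy (not from these notes).
def f(v):
    x, y = v
    return x**2 + 3 * x * y

# Analytic gradient: one partial derivative per component of the input.
def grad_f(v):
    x, y = v
    return np.array([2 * x + 3 * y, 3 * x])

# Central finite differences approximate each partial derivative numerically.
def numerical_grad(f, v, h=1e-6):
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

x = np.array([1.0, 2.0])
print(grad_f(x))             # [8. 3.]
print(numerical_grad(f, x))  # approximately [8. 3.]
```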

First-Order Taylor Expansion

Optimization algorithms use the following expansion to predict how the function value will change if we move a tiny amount in a certain direction (which is the step $\boldsymbol{\delta} \in \mathbb{R}^n$):

$$f(\mathbf{x} + \boldsymbol{\delta}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} + o(\|\boldsymbol{\delta}\|),$$

which is essentially linear approximation, but in $\mathbb{R}^n$.

Recall how we do linear approximation for some simple curve in $\mathbb{R}^2$:

$$f(x + \delta) \approx f(x) + f'(x)\,\delta.$$

The Taylor Expansion generalizes the tangent line into $n$ dimensions. The $o(\|\boldsymbol{\delta}\|)$ term is essentially the remainder of the rest of the terms of the Taylor Series. As $\|\boldsymbol{\delta}\| \to 0$, the remainder $o(\|\boldsymbol{\delta}\|) \to 0$ faster than $\|\boldsymbol{\delta}\|$ itself. In first-order optimization, we assume $\boldsymbol{\delta}$ is so small that $o(\|\boldsymbol{\delta}\|)$ is essentially negligible.
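
To see the remainder behave like $o(\|\boldsymbol{\delta}\|)$ numerically, here is a small sketch (the test function and point below are assumptions for illustration): the gap between $f(\mathbf{x} + \boldsymbol{\delta})$ and the linear prediction, divided by $\|\boldsymbol{\delta}\|$, shrinks as $\boldsymbol{\delta}$ shrinks.

```python
import numpy as np

# Hypothetical smooth test function and its analytic gradient (not from these notes).
def f(v):
    return np.sin(v[0]) + v[0] * v[1]**2

def grad_f(v):
    return np.array([np.cos(v[0]) + v[1]**2, 2 * v[0] * v[1]])

x = np.array([0.5, -1.0])
direction = np.array([1.0, 2.0]) / np.sqrt(5)  # fixed unit direction

for eps in [1e-1, 1e-2, 1e-3]:
    delta = eps * direction
    linear_pred = f(x) + grad_f(x) @ delta  # first-order Taylor prediction
    remainder = f(x + delta) - linear_pred  # this is the o(||delta||) term
    print(eps, remainder / np.linalg.norm(delta))  # ratio shrinks toward 0
```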

From the previous equation, by the Multivariate MVT¹ we get

$$f(\mathbf{x} + \boldsymbol{\delta}) = f(\mathbf{x}) + \nabla f(\mathbf{x} + c\boldsymbol{\delta})^\top \boldsymbol{\delta}$$

for some $c \in (0, 1)$. The last line needs some justification. Let

$$g(t) = f(\mathbf{x} + t\boldsymbol{\delta})$$

such that $g(0) = f(\mathbf{x})$ when $t = 0$ and $g(1) = f(\mathbf{x} + \boldsymbol{\delta})$ when $t = 1$. By applying the single-variable MVT on this parameterized curve,

$$g(1) - g(0) = g'(c)(1 - 0) = g'(c)$$

for some $c \in (0, 1)$. But when $g'(c)$ is calculated via the Chain Rule, it becomes the gradient of $f$ at that middle point:

$$g'(c) = \nabla f(\mathbf{x} + c\boldsymbol{\delta})^\top \boldsymbol{\delta},$$

and by substitution,

$$f(\mathbf{x} + \boldsymbol{\delta}) - f(\mathbf{x}) = g(1) - g(0) = \nabla f(\mathbf{x} + c\boldsymbol{\delta})^\top \boldsymbol{\delta},$$

which is our desired result.
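
As a concrete (if informal) check of the argument, this sketch picks a made-up $f$, $\mathbf{x}$, and $\boldsymbol{\delta}$, builds the slice $g(t) = f(\mathbf{x} + t\boldsymbol{\delta})$, and scans $t \in (0, 1)$ for a point $c$ where $g'(c)$ matches $g(1) - g(0)$:

```python
import numpy as np

# Hypothetical example function and its gradient (not from these notes).
def f(v):
    return v[0]**2 + np.exp(v[1])

def grad_f(v):
    return np.array([2 * v[0], np.exp(v[1])])

x = np.array([1.0, 0.0])
delta = np.array([0.5, 0.3])

def g(t):        # g(t) = f(x + t*delta), a one-dimensional slice of f
    return f(x + t * delta)

def g_prime(t):  # Chain Rule: g'(t) = grad f(x + t*delta) . delta
    return grad_f(x + t * delta) @ delta

target = g(1.0) - g(0.0)  # left-hand side of the MVT identity
ts = np.linspace(0.0, 1.0, 10001)
c = ts[np.argmin(np.abs([g_prime(t) - target for t in ts]))]
print(c, g_prime(c), target)  # g'(c) matches g(1) - g(0) for some c in (0, 1)
```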

Maximizing Function Value via Level Sets

Suppose we wanted to maximize $f$ via some tiny distance traveled. In particular, we want to pick a direction $\mathbf{v}$ (with $\|\mathbf{v}\| = 1$) for some magnitude $\epsilon > 0$ such that

$$f(\mathbf{x} + \epsilon\mathbf{v}) > f(\mathbf{x}).$$

Visually, we can see that the set of directions that make any progress on this function is “half” of the directions we can go. Imagine, in $\mathbb{R}^3$, a sphere of directions we can go from some arbitrary point; half the sphere consists of directions that would advance our objective function.

More precisely, via the First-Order Taylor Expansion, we get

$$f(\mathbf{x} + \epsilon\mathbf{v}) \approx f(\mathbf{x}) + \epsilon\,\nabla f(\mathbf{x})^\top \mathbf{v},$$

so the inequality above holds precisely when the second term $\epsilon\,\nabla f(\mathbf{x})^\top \mathbf{v}$ is positive. Conversely, if we wanted to perform Gradient Descent, we want to minimize $f$, and we want our second term to be negative.
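
Both observations are easy to see numerically. In this sketch (with a hypothetical quadratic objective, not from these notes), roughly half of random unit directions give a positive first-order term, and a small step along $+\nabla f$ raises $f$ while a step along $-\nabla f$ lowers it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical objective and its gradient (not from these notes).
def f(v):
    return v[0]**2 + 2 * v[1]**2

def grad_f(v):
    return np.array([2 * v[0], 4 * v[1]])

x = np.array([1.0, -1.0])
g = grad_f(x)

# Roughly half of all unit directions make the first-order term positive.
dirs = rng.normal(size=(10000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(np.mean(dirs @ g > 0))  # close to 0.5

# An ascent step (+gradient) increases f; a descent step (-gradient) decreases it.
eps = 1e-2
print(f(x), f(x + eps * g), f(x - eps * g))
```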

Naive Algorithm for Gradient Ascent/Descent

With the same picture, if we pick any direction $\mathbf{v}$ and also consider its inverse $-\mathbf{v}$, we have a chance to maximize $f$ by picking the better of the two options. Unfortunately, this has the same pitfalls as Gradient Descent. In particular, it has no “bigger picture” of the space it inhabits, so we can get stuck at a local minimum, skip the global minimum via too large a step size $\epsilon$, and so on.

This is also known as the Three-Point Algorithm. This kind of optimization is called zero-order optimization: we only evaluate things via $f$ itself. In other words, we take zero derivatives to determine a better direction to travel.
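
Here is a minimal sketch of this zero-order loop for minimization (the objective, step size, and iteration count are illustrative assumptions; keeping the max instead of the min would give the ascent version):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical objective to minimize (not from these notes); non-smooth at y = -1.
def f(v):
    return (v[0] - 3)**2 + np.abs(v[1] + 1)

def three_point_step(f, x, eps):
    """Zero-order step: compare f at x, x + eps*v, x - eps*v and keep the best."""
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    candidates = [x, x + eps * v, x - eps * v]
    return min(candidates, key=f)  # only function evaluations, no derivatives

x = np.array([0.0, 0.0])
for _ in range(2000):
    x = three_point_step(f, x, eps=0.05)
print(x, f(x))  # should end up near (3, -1), the minimizer
```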

This algorithm can perform better than Gradient Descent:

  1. If there are places where the gradient does not exist.
  2. If there are many local minima. Consider, in $\mathbb{R}^2$, a curve riddled with small wells on the way down to the global minimum; gradient descent can get stuck in these wells.

Second-Order Taylor Expansion

We can further expand the First-Order Taylor Expansion:

$$f(\mathbf{x} + \boldsymbol{\delta}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta} + \frac{1}{2}\,\boldsymbol{\delta}^\top \nabla^2 f(\mathbf{x})\,\boldsymbol{\delta} + o(\|\boldsymbol{\delta}\|^2).$$

Note that the $\nabla^2 f(\mathbf{x})$ term is known as the Hessian: the $n \times n$ matrix of second partial derivatives, with entries $\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x})$.
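
As a sketch (with an assumed test function and step, not from these notes), the second-order prediction that includes the Hessian term tracks $f(\mathbf{x} + \boldsymbol{\delta})$ more closely than the first-order one:

```python
import numpy as np

# Hypothetical test function, gradient, and Hessian (not from these notes).
def f(v):
    x, y = v
    return x**2 * y + np.sin(y)

def grad_f(v):
    x, y = v
    return np.array([2 * x * y, x**2 + np.cos(y)])

def hess_f(v):
    x, y = v
    return np.array([[2 * y,       2 * x],
                     [2 * x, -np.sin(y)]])

x = np.array([1.0, 0.5])
delta = np.array([0.1, -0.05])

first_order = f(x) + grad_f(x) @ delta
second_order = first_order + 0.5 * delta @ hess_f(x) @ delta
print(f(x + delta), first_order, second_order)  # the second-order prediction is closer
```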

Footnotes

  1. See these notes for more information about the Generalized Mean Value Theorem for multivariate real-valued functions.