Definition (Gradient)
Let $f : \mathbb{R}^n \to \mathbb{R}$ be sufficiently differentiable. The gradient of $f$ at $x \in \mathbb{R}^n$ is defined as
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right),$$
i.e. one partial derivative for each component $x_i$ of $x$.
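As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and a hypothetical test function `f`, neither of which is fixed by these notes) that approximates the gradient component-wise with central finite differences:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e_i = np.zeros_like(x, dtype=float)
        e_i[i] = h
        # One partial derivative per component of x.
        grad[i] = (f(x + e_i) - f(x - e_i)) / (2 * h)
    return grad

# Example: f(x, y) = x^2 + 3y, so the gradient is (2x, 3).
f = lambda x: x[0]**2 + 3 * x[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # ~[2.0, 3.0]
```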
First-Order Taylor Expansion
Optimization algorithms use the following expansion to predict how the function value will change if we move a tiny amount in a certain direction (which is $v \in \mathbb{R}^n$):
$$f(x + v) = f(x) + \nabla f(x)^\top v + o(\|v\|),$$
which is essentially linear approximation, but in $\mathbb{R}^n$.
Recall how we do linear approximation for some simple curve in $\mathbb{R}$:
$$f(x + h) \approx f(x) + f'(x)\,h.$$
The Taylor Expansion generalizes the tangent line into $n$ dimensions. The $o(\|v\|)$ term is essentially the remainder of the rest of the terms of the Taylor Series. As $\|v\| \to 0$, we have $o(\|v\|) \to 0$. In first-order optimization, we assume $\|v\|$ is so small that $o(\|v\|)$ is essentially negligible.
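To see the remainder vanish in practice, here is a minimal sketch (the smooth test function and its gradient are assumptions chosen for illustration) comparing the first-order prediction against the true function value for shrinking steps:

```python
import numpy as np

# A hypothetical smooth test function and its exact gradient.
f = lambda x: np.sin(x[0]) + x[1]**2
grad_f = lambda x: np.array([np.cos(x[0]), 2 * x[1]])

x = np.array([0.5, -1.0])
v = np.array([0.3, 0.2])                      # a fixed direction
for scale in [1.0, 0.1, 0.01]:
    step = scale * v
    linear = f(x) + grad_f(x) @ step          # first-order prediction
    error = abs(f(x + step) - linear)         # the o(||step||) remainder
    print(f"||step|| = {np.linalg.norm(step):.4f}, remainder = {error:.2e}")
# The remainder shrinks faster than ||step||, as the expansion promises.
```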
From the previous equation, by the Multivariate MVT we get
$$f(x + v) = f(x) + \nabla f(x + c\,v)^\top v$$
for some $c \in (0, 1)$. The last line needs some justification. Let
$$g(t) = f(x + t\,v),$$
such that $g(0) = f(x)$ and $g(1) = f(x + v)$. By applying the single-variable MVT on this parameterized curve,
$$g(1) - g(0) = g'(c)(1 - 0) = g'(c)$$
for some $c \in (0, 1)$. When $g'(t)$ is calculated via the Chain Rule, it becomes the gradient of $f$ at that intermediate point dotted with $v$:
$$g'(t) = \nabla f(x + t\,v)^\top v,$$
and by substitution,
$$f(x + v) - f(x) = \nabla f(x + c\,v)^\top v,$$
which is our desired result.
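A small numerical illustration of this identity, under the same assumptions as the sketch above (a hypothetical smooth `f` and its exact gradient): we scan $c \in (0, 1)$ and look for the point where $\nabla f(x + c\,v)^\top v$ matches the secant $f(x+v) - f(x)$.

```python
import numpy as np

# Hypothetical smooth test function and exact gradient (illustrative choice).
f = lambda x: np.sin(x[0]) + x[1]**2
grad_f = lambda x: np.array([np.cos(x[0]), 2 * x[1]])

x = np.array([0.5, -1.0])
v = np.array([0.3, 0.2])
secant = f(x + v) - f(x)                     # g(1) - g(0)

# Scan c in (0, 1) for where grad_f(x + c v) . v matches the secant.
cs = np.linspace(0.0, 1.0, 10001)
gaps = [abs(grad_f(x + c * v) @ v - secant) for c in cs]
c_star = cs[int(np.argmin(gaps))]
print(f"c = {c_star:.4f}, mismatch = {min(gaps):.2e}")
```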
Maximizing Function Value via Level Sets
Suppose we wanted to maximize $f$ via some tiny distance traveled. In particular, we want to pick a unit direction $u$ and some magnitude $\eta > 0$ such that
$$f(x + \eta\,u) > f(x).$$
Visually, the set of directions in which a small step makes any progress on the objective is "half" of all the directions we can go. Imagine, in $\mathbb{R}^3$, a sphere of directions we can go from some arbitrary point on a surface: half the sphere consists of directions that would advance our objective function.
More precisely, via the First-Order Taylor Expansion, we get
$$f(x + \eta\,u) \approx f(x) + \eta\,\nabla f(x)^\top u,$$
so $f$ increases whenever the second term is positive, i.e. $\nabla f(x)^\top u > 0$. Conversely, if we wanted to perform Gradient Descent, we want to minimize $f$, and we need the second term to be negative, i.e. $\nabla f(x)^\top u < 0$.
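A minimal sketch of this half-sphere picture, again assuming NumPy and a hypothetical test function: we sample random unit directions and check that the sign of $\nabla f(x)^\top u$ predicts whether a small step increases $f$.

```python
import numpy as np

f = lambda x: np.sin(x[0]) + x[1]**2          # hypothetical test function
grad_f = lambda x: np.array([np.cos(x[0]), 2 * x[1]])

rng = np.random.default_rng(0)
x, eta = np.array([0.5, -1.0]), 1e-3
agree = 0
for _ in range(1000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                    # random unit direction
    predicted_up = grad_f(x) @ u > 0          # first-order prediction
    actually_up = f(x + eta * u) > f(x)       # what really happens
    agree += (predicted_up == actually_up)
print(f"prediction matched reality in {agree}/1000 directions")
# Roughly half the sampled directions increase f; the sign of the
# dot product with the gradient identifies which half.
```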
Naive Algorithm for Gradient Ascent/Descent
With the same picture, if we pick any direction $u$ and also consider its inverse $-u$, we have a chance to increase $f$ by picking the better of the two options. Unfortunately, this has the same pitfalls as Gradient Descent. In particular, it has no "bigger picture" of the space it inhabits, so we can get stuck at a local minimum, skip past the global minimum if the step size $\eta$ is too large, and so on.
This is also known as the Three-Point Algorithm. This kind of optimization is called zero-order optimization: we only evaluate the function values $f(x)$, $f(x + \eta u)$, and $f(x - \eta u)$. In other words, we compute zero derivatives to determine a better direction to travel. (A sketch follows the list below.)
This algorithm can perform better than Gradient Descent:
- If there are places where the gradient does not exist.
- If there are many local minima. Consider, in $\mathbb{R}$, a curve with many shallow wells; gradient descent can get stuck in these wells.
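Here is a minimal sketch of the Three-Point Algorithm described above (the step size `eta`, the iteration count, and the random direction sampling are illustrative assumptions, not fixed by these notes): at each iteration we evaluate $f$ at the current point and at two opposite candidate steps, and keep whichever of the three is best.

```python
import numpy as np

def three_point_ascent(f, x0, eta=0.1, steps=1000, seed=0):
    """Zero-order ascent: compare f at x, x + eta*u, and x - eta*u."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)                      # random unit direction
        candidates = [x, x + eta * u, x - eta * u]  # the three points
        x = max(candidates, key=f)                  # keep the best of the three
    return x

# Example: maximize a hypothetical concave f(x, y) = -(x-1)^2 - (y+2)^2.
f = lambda p: -(p[0] - 1)**2 - (p[1] + 2)**2
print(three_point_ascent(f, [0.0, 0.0]))            # lands near (1, -2)
```

Note that only function evaluations are used, never derivatives, which is exactly what makes it zero-order.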
Second-Order Taylor Expansion
We can further expand the First-Order Taylor Expansion:
$$f(x + v) \approx f(x) + \nabla f(x)^\top v + \frac{1}{2}\, v^\top \nabla^2 f(x)\, v.$$
Note that the $\nabla^2 f(x)$ term, the $n \times n$ matrix of second partial derivatives, is known as the Hessian.
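As a final sketch (same NumPy and hypothetical-test-function assumptions as above), we can build the Hessian by finite-differencing the gradient and confirm that the second-order expansion tracks $f$ more closely than the first-order one:

```python
import numpy as np

f = lambda x: np.sin(x[0]) + x[1]**2 * x[0]   # hypothetical test function
grad_f = lambda x: np.array([np.cos(x[0]) + x[1]**2, 2 * x[0] * x[1]])

def numerical_hessian(grad, x, h=1e-5):
    """Approximate the Hessian by finite-differencing the gradient."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (grad(x + e) - grad(x - e)) / (2 * h)
    return H

x, v = np.array([0.5, -1.0]), np.array([0.2, 0.1])
H = numerical_hessian(grad_f, x)
first  = f(x) + grad_f(x) @ v                 # first-order prediction
second = first + 0.5 * v @ H @ v              # add the Hessian term
print(f"true: {f(x + v):.6f}  first-order: {first:.6f}  second-order: {second:.6f}")
```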