In calculus, we take derivatives of functions of a few variables. In the calculus of variations, we take derivatives of functions of functions (also called functionals). The main purpose of this is to formulate optimization problems over function spaces and derive their optimality conditions (KKT conditions).
These optimality conditions are often differential equations, called Euler-Lagrange Equations, and most physical equations arise as Euler-Lagrange equations of some optimization problem. That way, we can “design one optimal functional” instead of “modeling forces”.
Example 1
Let
$$M = \{\, y : [0,T] \to \mathbb{R} \mid y(0) = y_0,\ y(T) = y_T \,\}$$
be the manifold of all possible valid trajectories of a particle in one dimension, from $y_0$ to $y_T$ in time $T$. A single "point" on this manifold is an entire function $y(t)$. Because it takes infinitely many values (one for each $t$), we can think of $M$ as an infinite-dimensional manifold.
The problem is, what is the optimality condition for
$$\min_{y \in M} E(y)$$
If we fix some $y \in M$, the tangent space $T_y M$ represents all valid perturbations we can apply to $y$ without leaving the manifold. Call such a perturbation $\mathring{y}$. For the perturbed path $y(t) + \varepsilon \mathring{y}(t)$ to remain in $M$, it must satisfy the fixed boundary conditions. In particular,
$$T_y M = \{\, \mathring{y} : [0,T] \to \mathbb{R} \mid \mathring{y}(0) = 0,\ \mathring{y}(T) = 0 \,\}$$
Visually, the blue curve at the bottom represents one example of a perturbation $\mathring{y}$, and the purple curves represent some examples of $y$.
The optimality condition is
$$dE_y[\mathring{y}] = 0$$
for all $\mathring{y} \in T_y M$, just like the first derivative test for optimality in regular calculus. Geometrically, this means the covector $dE_y$ itself is zero. I.e. if we were at the bottom of some "energy bowl", no matter which direction $\mathring{y}$ we perturb, the energy will not change (to first order). Indeed, for the energy $E(y) = \int_{t=0}^{T} \left( \frac{1}{2} y'(t)^2 + \cos(y(t)) \right) dt$,
$$
\begin{aligned}
0 = dE_y[\mathring{y}] &= \left.\frac{d}{d\varepsilon}\right|_{\varepsilon=0} E(y + \varepsilon \mathring{y}) \\
&= \left.\frac{d}{d\varepsilon}\right|_{\varepsilon=0} \int_{t=0}^{T} \left( \frac{1}{2}\bigl(y'(t) + \varepsilon \mathring{y}'(t)\bigr)^2 + \cos\bigl(y(t) + \varepsilon \mathring{y}(t)\bigr) \right) dt \\
&= \int_{t=0}^{T} \bigl( y'\,\mathring{y}' - \sin(y)\,\mathring{y} \bigr)\, dt \\
&= -\int_{t=0}^{T} \bigl( y'' + \sin(y) \bigr)\,\mathring{y}\, dt \quad \text{(via integration by parts)}
\end{aligned}
$$

$$\implies 0 = y''(t) + \sin(y(t))$$
1. We expanded the definition of the directional derivative.
2. We applied the multivariate chain rule, since $\varepsilon$ appears in both $y$ and $y'$.
3. We took the partial derivative of $L$ with respect to $\varepsilon$. The use of $\langle \cdot, \cdot \rangle$ just denotes the dot product between the gradient and the perturbation; here, it is the inner product in $\mathbb{R}^m$.
4. We applied integration by parts to move the derivative from $\mathring{y}'$ to $\partial L / \partial y'$. The boundary term vanishes since $\mathring{y}(0) = \mathring{y}(T) = 0$.
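As a sanity check, the derivation above can be verified numerically. The following sketch (my own discretization — the grid size, test path, and perturbation are arbitrary illustrative choices, not from the text) compares $dE_y[\mathring{y}]$, computed by finite differences on a discretized energy, with the integrated-by-parts expression $-\int_0^T (y'' + \sin y)\,\mathring{y}\,dt$:

```python
import numpy as np

# Sketch: discretize E(y) = ∫ (1/2 y'^2 + cos y) dt on a uniform grid.
T, n = 1.0, 2001
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]

def trapezoid(g):
    # trapezoid rule on the uniform grid
    return dt * (g[0] / 2 + g[1:-1].sum() + g[-1] / 2)

def E(y):
    yp = np.gradient(y, dt)
    return trapezoid(0.5 * yp**2 + np.cos(y))

y = np.sin(np.pi * t)   # some path with y(0) = y(T) = 0
ring = t * (T - t)      # a perturbation vanishing at both endpoints

# Directional derivative dE_y[ẙ] via central finite differences in ε
eps = 1e-6
dE = (E(y + eps * ring) - E(y - eps * ring)) / (2 * eps)

# The same quantity via the integrated-by-parts form -∫ (y'' + sin y) ẙ dt
ypp = np.gradient(np.gradient(y, dt), dt)
rhs = -trapezoid((ypp + np.sin(y)) * ring)

print(dE, rhs)  # the two values should agree up to discretization error
```

The agreement of the two numbers is exactly the content of the integration-by-parts step: pairing the perturbation with $-(y'' + \sin y)$ computes the directional derivative of the energy.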
The Euler-Lagrange equations are foundational to physics. For detailed derivations of Newton’s laws, conservation of energy, and rotational dynamics from variational principles, see Lagrangian Mechanics.
Example 2 (Hanging Chain)
What is the shape of a hanging chain between two poles? It is the curve with the lowest potential energy for a fixed total length.
Let the shape be the function graph
$$y = f(x)$$
First, we want to describe the physical geometry of the curve. Microscopically, any small section of the chain forms a right triangle with legs $dx$ and $dy$. By the Pythagorean theorem, the length of that section is
$$ds = \sqrt{dx^2 + dy^2} = \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\, dx = \sqrt{1 + f'^2}\, dx$$
The total length is found by integrating $ds$ along the curve:
$$L(f) = \int_a^b ds = \int_a^b \sqrt{1 + f'^2}\, dx$$
The total gravitational potential energy (taking unit mass density and unit gravity, so each section contributes height times length) is
$$E(f) = \int_a^b f \sqrt{1 + f'^2}\, dx$$
We must minimize $E$. However, without any constraints, the chain would drop straight down (achieving arbitrarily negative potential energy). So we must constrain it to have a fixed length $\ell$. Thus, the constraint is $L(f) - \ell = 0$. The KKT condition for minimizing $E$ with constraint $L - \ell = 0$ says that there exists some $\lambda$ such that
$$dE_f[\mathring{f}] = \lambda\, dL_f[\mathring{f}] \quad \text{for all perturbations } \mathring{f}$$
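The two functionals can be sanity-checked numerically. A minimal sketch (my own discretization; $f(x) = \cosh(x)$ is chosen purely because $\sqrt{1 + f'^2} = \cosh(x)$, giving closed forms to compare against):

```python
import numpy as np

# Sketch: evaluate L(f) and E(f) numerically for f(x) = cosh(x) on [-1, 1],
# where sqrt(1 + f'^2) = sqrt(1 + sinh^2 x) = cosh x, so L = 2 sinh(1) exactly.
a, b, n = -1.0, 1.0, 20001
x = np.linspace(a, b, n)
dx = x[1] - x[0]
f, fp = np.cosh(x), np.sinh(x)   # f and its analytic derivative f'

def trapezoid(g):
    return dx * (g[0] / 2 + g[1:-1].sum() + g[-1] / 2)

length = trapezoid(np.sqrt(1 + fp**2))      # L(f) = ∫ sqrt(1 + f'^2) dx
energy = trapezoid(f * np.sqrt(1 + fp**2))  # E(f) = ∫ f sqrt(1 + f'^2) dx

print(length, 2 * np.sinh(1))   # numeric length vs closed form
```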
Example 3 (Spring Network)
Consider a graph $G = (V, E)$ where each vertex is assigned a mass $m_i$ and position $x_i$, and each edge $e$ is assigned a spring with spring constant $k_e$. The total kinetic and potential energies are
$$K = \sum_{i \in V} \frac{1}{2} m_i |\dot{x}_i|^2 \qquad U = \sum_{e=(i,j) \in E} \frac{1}{2} k_e |x_i - x_j|^2$$
The Euler-Lagrange equation for the Lagrangian $L = K - U$ gives
$$\frac{d}{dt} \frac{\partial L}{\partial \dot{x}_i} = \frac{\partial L}{\partial x_i} \implies m_i \ddot{x}_i = -\frac{\partial U}{\partial x_i}$$
We can find the derivative with respect to one vertex position:
$$\frac{\partial U}{\partial x_i} = \sum_{e = (i,j) \in E} k_e (x_i - x_j)$$
where the sum runs over the edges incident to vertex $i$.
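A minimal sketch of this per-vertex derivative of $U$ (the toy graph, positions, and spring constants below are made-up illustrative values), checked against a finite-difference derivative:

```python
import numpy as np

# Hypothetical toy graph: 3 vertices in 2D, springs on edges (0,1) and (1,2).
edges = [(0, 1), (1, 2)]
k = {(0, 1): 2.0, (1, 2): 3.0}
x = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

def U(x):
    # U = Σ_e (1/2) k_e |x_i - x_j|^2
    return sum(0.5 * k[e] * np.sum((x[e[0]] - x[e[1]])**2) for e in edges)

def dU_dxi(x, i):
    # ∂U/∂x_i = Σ_{edges touching i} k_e (x_i - x_j)
    g = np.zeros(2)
    for (a, b) in edges:
        if a == i:
            g += k[(a, b)] * (x[a] - x[b])
        elif b == i:
            g += k[(a, b)] * (x[b] - x[a])
    return g

# Check against a central finite-difference derivative of U at vertex i
i, eps = 1, 1e-6
fd = np.zeros(2)
for d in range(2):
    xp, xm = x.copy(), x.copy()
    xp[i, d] += eps
    xm[i, d] -= eps
    fd[d] = (U(xp) - U(xm)) / (2 * eps)

print(dU_dxi(x, i), fd)  # the two gradients should match
```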
One interesting thing to note is that this is similar to a feedforward neural network. We input raw vertex positions and they transform step by step into a single scalar output $U$:
1. The first transformation layer subtracts the position of the source vertex from the position of the destination vertex to get the edge vector.
2. The second transformation layer takes the magnitude of the edge vector to get the current physical length of the spring.
3. The third transformation layer computes the local energy of each edge via the formula $\frac{1}{2} k_e \ell_e^2$.
4. The sum of the individual spring energies gives us the total potential energy of the system.
Now, like in machine learning, we do a backward pass. To find the force on a specific vertex $i$, we must compute $\partial U / \partial x_i$. Because the energy was calculated through a chain of operations, we must use the multivariate chain rule to backpropagate the gradient through the computational graph.
On each edge $e$ of the graph, we can compute $\sigma_e = k_e (dx)_e$, which is the stress (stiffness times strain) on that edge. To get the actual force $f_i$ acting on each node, the chain rule requires us to sum up the stress contributions from all the edges that touch that specific node. Indeed,
$$f_i = \operatorname{div}(\sigma)_i$$
where div is the graph divergence operator. Thus, the force on a node is the aggregate of the stress from all edges touching that node.
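The forward and backward passes can be sketched on a toy 1D chain (all values, and the sign convention $f_i = -\partial U / \partial x_i$, are my own illustrative choices; sign conventions for the divergence vary):

```python
import numpy as np

# Hypothetical chain of 4 vertices in 1D with unit springs.
edges = [(0, 1), (1, 2), (2, 3)]
k = np.array([1.0, 1.0, 1.0])
x = np.array([0.0, 1.5, 2.0, 3.0])

# Forward pass: edge vectors -> local energies -> total energy U
dx = np.array([x[j] - x[i] for (i, j) in edges])  # "first layer": d x
U = np.sum(0.5 * k * dx**2)

# Backward pass: stress per edge, then accumulate into incident nodes
sigma = k * dx                  # σ_e = k_e (dx)_e
grad = np.zeros_like(x)         # ∂U/∂x_i, assembled edge by edge
for e, (i, j) in enumerate(edges):
    grad[j] += sigma[e]         # each edge contributes to both endpoints,
    grad[i] -= sigma[e]         # with opposite signs
force = -grad                   # f_i = -∂U/∂x_i

print(force)
```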
Remark
But of course, since this is just like machine learning, we can efficiently compute the forces on all nodes by matrix multiplications. Truly, we have
$$M\ddot{x} = -d^\top K\, d\, x$$
where the left side is global Newton's second law $F = ma$ and the right side is the restorative spring forces. Note that $K(dx)$ here is Hooke's law (forces on linear springs). The $d^\top$ is actually the adjoint linear map (recall that the transpose is the adjoint of a linear map with respect to the standard inner product), acting as the discrete divergence. The multiplication $d^\top(K\,dx)$ sums up all the converging spring tensions onto their shared nodes, calculating the net force on each node.
It’s particularly important to mention the Graph Laplacian shown in the diagram. The final system becomes
$$M\ddot{x} = -Lx$$
where $L$ encodes the connectivity of the graph and the stiffness of each edge.
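A minimal matrix-form sketch (a made-up 1D chain; `d` below is the signed edge-vertex incidence matrix acting as the discrete difference):

```python
import numpy as np

# Toy 1D chain: build d, K = diag(k_e), and the Laplacian L = dᵀ K d,
# so the net spring forces are -Lx = -dᵀ K d x.
edges = [(0, 1), (1, 2), (2, 3)]
k = np.array([1.0, 1.0, 1.0])
x = np.array([0.0, 1.5, 2.0, 3.0])

d = np.zeros((len(edges), len(x)))
for e, (i, j) in enumerate(edges):
    d[e, i], d[e, j] = -1.0, 1.0   # (d x)_e = x_j - x_i

K = np.diag(k)
L = d.T @ K @ d                    # weighted graph Laplacian
force = -L @ x                     # right-hand side of M ẍ = -L x

print(force)
```

Here `L` is symmetric (an adjoint composed with its transpose around a diagonal matrix), which is what makes the spring system's dynamics well-behaved.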