Definition (Linear Map)

Let $V, W$ be two vector spaces. A map $L : V \to W$ is linear if

$$L(a\mathbf{u} + b\mathbf{v}) = a\,L(\mathbf{u}) + b\,L(\mathbf{v}) \qquad \text{for all } \mathbf{u}, \mathbf{v} \in V \text{ and scalars } a, b.$$

Theorem (Set of Linear Maps is a Vector Space)

The set of all linear maps from $V$ to $W$ is itself a vector space, with addition and scalar multiplication of maps defined pointwise.

Definition (Covector)

A covector on a vector space $V$ is a scalar-valued linear function

$$\alpha : V \to \mathbb{R}.$$

A covector is a linear function that takes in a vector and outputs a scalar. Geometrically, it is best to visualize a covector as a set of equidistant, parallel hyperplanes: in 2D, these are parallel lines; in 3D, they are parallel planes. They are like “slabs”, and the hyperplanes are the covector’s level sets.

Definition (Dual Space)

The dual space

$$V^* := \mathcal{L}(V, \mathbb{R})$$

of a vector space $V$ is the vector space of all covectors on $V$. The elements of $V^*$ are called dual vectors or covectors. The dual pairing is denoted by

$$\langle \alpha, \mathbf{v} \rangle := \alpha(\mathbf{v}), \qquad \alpha \in V^*,\ \mathbf{v} \in V.$$

It is the act of “feeding” a vector into a covector to get a scalar value. Visually, evaluating the dual pairing means counting the number of times the vector’s arrow pierces through the hyperplanes of the covector. If the arrow runs perfectly parallel to the covector’s lines, it pierces none of them, and the pairing is $0$. If the arrow stretches across $k$ level-set lines, the pairing evaluates to $k$.

Theorem (Dimension of Dual Space)

The dimension of the dual space $V^*$ is equal to the dimension of the original vector space $V$. That is, if $V$ is a finite-dimensional vector space, then

$$\dim V^* = \dim V.$$

Theorem (Dual Basis)

Let $\{\mathbf{e}_1, \dots, \mathbf{e}_n\}$ be a basis for a finite-dimensional vector space $V$. Then there is a unique dual basis $\{\mathbf{e}^1, \dots, \mathbf{e}^n\}$ for $V^*$ so that

$$\langle \mathbf{e}^i, \mathbf{e}_j \rangle = \delta^i_j,$$

where $\delta^i_j$ is the Kronecker delta. If we write a vector

$$\mathbf{v} = v^i \mathbf{e}_i$$

using a basis (summation over repeated indices implied), then we would write a covector as

$$\alpha = \alpha_i \mathbf{e}^i.$$

The dual pairing simply becomes

$$\langle \alpha, \mathbf{v} \rangle = \alpha_i v^i.$$

As with the visualization of the dual pairing, given some basis vector $\mathbf{e}_i$, the corresponding dual vector $\mathbf{e}^i$ pairs with it to give $\langle \mathbf{e}^i, \mathbf{e}_i \rangle = 1$; that is, the arrow of $\mathbf{e}_i$ crosses the slab of $\mathbf{e}^i$ exactly once.

Conventional Notation

Vectors are conventionally column vectors; covectors are row vectors. The dual pairing is then automatic matrix multiplication:

$$\langle \alpha, \mathbf{v} \rangle = \begin{pmatrix} \alpha_1 & \cdots & \alpha_n \end{pmatrix} \begin{pmatrix} v^1 \\ \vdots \\ v^n \end{pmatrix} = \alpha_i v^i.$$
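As a quick numerical sketch (the use of JAX here and in the later sketches is my own choice, not part of the notes), the pairing $\alpha_i v^i$ is a plain row-times-column product of the component arrays:

```python
import jax.numpy as jnp

alpha = jnp.array([2.0, -1.0, 0.5])  # covector components (a row)
v = jnp.array([1.0, 4.0, 2.0])       # vector components (a column)

# Dual pairing <alpha, v> = alpha_i v^i: row times column
pairing = jnp.dot(alpha, v)          # 2*1 + (-1)*4 + 0.5*2 = -1.0
```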

Vector components are labeled with an upper index $v^i$ and covector components are labeled with a lower index $\alpha_i$. We can also use bra-ket notation, where the vector is written as a “ket” $|v\rangle$ and the covector is written as a “bra” $\langle \alpha|$. The dual pairing is then written as $\langle \alpha | v \rangle$. There is also Penrose graphical notation.

Definition (Adjoint Linear Map)

For each linear map $L : V \to W$, there is a linear map $L^* : W^* \to V^*$ called its adjoint, defined by

$$\langle L^* \beta, \mathbf{v} \rangle = \langle \beta, L \mathbf{v} \rangle$$

for all $\beta \in W^*$ and $\mathbf{v} \in V$. The adjoint map is also denoted as $L^\top$; in matrix form, it is the transpose.
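A minimal numerical check of this definition, assuming an illustrative matrix $L$ in place of an abstract linear map (in matrix form, the adjoint is the transpose):

```python
import jax.numpy as jnp

# An illustrative linear map L : R^3 -> R^2, stored as a matrix
L = jnp.array([[1.0, 2.0, 0.0],
               [0.0, -1.0, 3.0]])

v = jnp.array([1.0, 0.5, -2.0])  # a vector in V = R^3
beta = jnp.array([2.0, 1.0])     # a covector in W*

# Defining property of the adjoint: <L* beta, v> = <beta, L v>
lhs = jnp.dot(L.T @ beta, v)
rhs = jnp.dot(beta, L @ v)
assert jnp.allclose(lhs, rhs)
```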

Definition (Tangent Space)

The tangent space $T_pM$ at $p \in M$ is the vector space of perturbations $\mathbf{v}$ of the point $p$.

This is a family of vector spaces, one for each point. In type theory, tangent spaces form a dependent type, because the type $T_pM$ depends on the point $p$.

Definition (Tangent Bundle)

The tangent bundle $TM$ is the space of all perturbation data:

$$TM := \{ (p, \mathbf{v}) : p \in M,\ \mathbf{v} \in T_pM \}.$$

In type theory, the tangent bundle is a dependent sum type. This is helpful because we can write

$$TM = \sum_{p \in M} T_pM.$$

Definition (Cotangent Space)

The cotangent space $T_p^*M$ at $p \in M$ is the dual space of the tangent space: $T_p^*M := (T_pM)^*$.

Definition (Cotangent Bundle)

The cotangent bundle $T^*M$ is the space of all covectors:

$$T^*M := \{ (p, \alpha) : p \in M,\ \alpha \in T_p^*M \} = \sum_{p \in M} T_p^*M.$$

Definition (Vector Field, Covector Field)

A vector field is a function of the following type:

$$X : \prod_{p \in M} T_pM, \qquad X(p) \in T_pM.$$

We also call a vector field a section of the tangent bundle.

In type theory, this is a dependent product type, because the output type $T_pM$ depends on the input $p$.

Similarly, a covector field is a function of type

$$\alpha : \prod_{p \in M} T_p^*M.$$

We also call a covector field a 1-form, and denote the space of 1-forms by $\Omega^1(M)$.

Types in Differential Calculus

Let

$$f : M \to \mathbb{R}.$$

The differential of the function $f$ at a point $p$ is a linear map $df_p : T_pM \to \mathbb{R}$ that approximates the change in $f$ near $p$. It is a covector field

$$df : \prod_{p \in M} T_p^*M,$$

where at any specific point $p$, we have $df_p \in T_p^*M$.

Definition (Directional Derivative)

The perturbation at a point $p$ is a tangent vector $\mathbf{v} \in T_pM$.

The directional derivative of $f$ at $p$ in the direction of $\mathbf{v}$ is the dual pairing:

$$D_{\mathbf{v}} f(p) := \langle df_p, \mathbf{v} \rangle.$$

Geometrically, let $\gamma : (-\epsilon, \epsilon) \to M$ be any curve satisfying $\gamma(0) = p$ and $\gamma'(0) = \mathbf{v}$. Then we can compute the directional derivative as the rate of change of $f$ along the curve:

$$\langle df_p, \mathbf{v} \rangle = \left.\frac{d}{dt}\right|_{t=0} f(\gamma(t)).$$

This quantity depends linearly on $\mathbf{v}$, but generally nonlinearly on $p$. Each datum $(p, \mathbf{v}) \in TM$ admits a representing curve $\gamma$ that passes through $p$ with velocity $\mathbf{v}$ at $t = 0$.
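As a computational sketch, JAX’s `jax.jvp` evaluates exactly this pairing; the scalar field `f`, point `p`, and perturbation `v` below are illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):                         # an illustrative scalar field f : R^2 -> R
    return jnp.sin(x[0]) * x[1]

p = jnp.array([0.3, 2.0])         # base point p
v = jnp.array([1.0, -1.0])        # tangent vector (perturbation) v

# jax.jvp returns f(p) together with the dual pairing <df_p, v>
f_p, df_v = jax.jvp(f, (p,), (v,))
```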

How does this connect with partial derivatives?

Definition (Coordinate System)

A coordinate system is a family of functions

$$x^i : M \to \mathbb{R}, \qquad i = 1, \dots, n,$$

so that $\{dx^1_p, \dots, dx^n_p\}$ is a basis of $T_p^*M$ at every $p \in M$.

The partial derivatives $\partial f / \partial x^i$ are defined as the coefficients of the covector $df_p$ under this coordinate-induced covector basis:

$$df_p = \frac{\partial f}{\partial x^i}(p)\, dx^i_p.$$

Definition (Riemannian Manifold)

Let $M$ be a manifold (domain). A Riemannian manifold $(M, g)$ is a manifold together with an inner product structure $g_p : T_pM \times T_pM \to \mathbb{R}$ at every point $p \in M$.

In general, given a Riemannian manifold, we don’t expect to find a coordinate system whose coordinate vectors are orthonormal everywhere. The failure to find an everywhere-orthonormal coordinate system is the non-Euclidean-ness of the Riemannian manifold.

Definition (Gradient Vector)

Let $(M, g)$ be a Riemannian manifold. The gradient $\nabla f(p) \in T_pM$ of a function $f$ at $p$ is defined by

$$g_p(\nabla f(p), \mathbf{v}) = \langle df_p, \mathbf{v} \rangle \qquad \text{for all } \mathbf{v} \in T_pM.$$

The gradient does not require coordinates and has little to do with partial derivatives. In index notation, it is

$$(\nabla f)^i = g^{ij}\, \frac{\partial f}{\partial x^j},$$

where $g^{ij}$ are the components of the inverse metric. The correct way of writing gradient descent is

$$p_{k+1} = p_k - \eta\, g^{-1}\, df_{p_k} = p_k - \eta\, \nabla f(p_k),$$

where $\eta$ is the learning rate.
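A sketch of one such descent step, assuming an illustrative constant metric `g` on $\mathbb{R}^2$; with the Euclidean metric $g = I$ this reduces to ordinary gradient descent:

```python
import jax
import jax.numpy as jnp

def f(p):                             # illustrative objective
    return jnp.sum(p ** 2) + p[0] * p[1]

g = jnp.array([[2.0, 0.5],            # an assumed (constant) metric g_ij
               [0.5, 1.0]])

p = jnp.array([1.0, -2.0])
df = jax.grad(f)(p)                   # components of the covector df_p
grad_f = jnp.linalg.solve(g, df)      # raise the index: (grad f)^i = g^{ij} df_j
p_next = p - 0.1 * grad_f             # one descent step with eta = 0.1
```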

Chain Rules

The differential of a function $f : M \to \mathbb{R}$ at a point $p$ is a linear map $df_p : T_pM \to \mathbb{R}$. This concept generalizes to a map between manifolds $F : M \to N$.

Definition (Pushforward Operator)

The pushforward operator (also called the differential or Jacobian) of a map $F : M \to N$ is a map between tangent bundles:

$$dF : TM \to TN, \qquad dF_p : T_pM \to T_{F(p)}N.$$

Geometrically, it “pushes” a velocity vector at $p$ to a velocity vector at $F(p)$. In coordinates, this is matrix-vector multiplication by the Jacobian matrix.
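A sketch with an illustrative map $F : \mathbb{R}^2 \to \mathbb{R}^3$, checking that the pushforward computed by `jax.jvp` agrees with multiplying by the explicit Jacobian matrix:

```python
import jax
import jax.numpy as jnp

def F(x):                          # illustrative map F : R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

p = jnp.array([0.5, 1.5])
v = jnp.array([1.0, 0.0])

_, dF_v = jax.jvp(F, (p,), (v,))   # pushforward dF_p(v)

J = jax.jacfwd(F)(p)               # the 3x2 Jacobian matrix at p
assert jnp.allclose(dF_v, J @ v)   # same as matrix-vector multiplication
```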

Definition (Pullback Operator for Functions)

The pullback operator for functions takes a function $h : N \to \mathbb{R}$ on the target space and “pulls” it back to the source space:

$$F^* h := h \circ F.$$

If $h$ is a scalar field on $N$, then $F^* h$ is a scalar field on $M$.

Definition (Pullback Operator for Covectors)

The pullback operator for covectors is the adjoint of the pushforward. It pulls covectors (1-forms) on $N$ back to $M$:

$$F^* : T_{F(p)}^*N \to T_p^*M.$$

The defining property is the dual pairing: $\langle F^* \beta, \mathbf{v} \rangle = \langle \beta, dF_p\, \mathbf{v} \rangle$ for all $\beta \in T_{F(p)}^*N$ and $\mathbf{v} \in T_pM$.
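JAX’s `jax.vjp` computes this pullback. Below is a sketch checking the defining property, reusing the illustrative map $F$ from the pushforward sketch:

```python
import jax
import jax.numpy as jnp

def F(x):                                   # illustrative map F : R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

p = jnp.array([0.5, 1.5])
beta = jnp.array([1.0, -1.0, 2.0])          # a covector at F(p)

Fp, pullback = jax.vjp(F, p)                # pullback implements F^* at p
(pb_beta,) = pullback(beta)                 # F^* beta, a covector at p

# Defining property: <F^* beta, v> = <beta, dF_p v>
v = jnp.array([0.3, -0.7])
_, dF_v = jax.jvp(F, (p,), (v,))
assert jnp.allclose(jnp.dot(pb_beta, v), jnp.dot(beta, dF_v))
```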

Backpropagation and Machine Learning

In machine learning, we often have a composition of maps (layers) $f_1, f_2, \dots, f_k$ and a final scalar loss function $\mathcal{L}$.

Forward Pass (Pushforward)

The forward pass is the computation of the maps themselves: $y = f_k(\cdots f_2(f_1(x)))$. The sensitivity of the output to a perturbation $\mathbf{v}$ of the input is computed by the pushforward:

$$d(f_k \circ \cdots \circ f_1)_x\, \mathbf{v} = (df_k)\bigl(\cdots (df_2)\bigl((df_1)_x\, \mathbf{v}\bigr)\bigr).$$

This is “forward-mode” differentiation.

Backward Pass (Pullback)

Backpropagation is the pullback of the loss differential. We start with the differential of the loss function, $d\mathcal{L}_y$, which is a covector (a row vector of partial derivatives). We want to find how the loss changes with respect to the input $x$.

Using the property $(G \circ F)^* = F^* \circ G^*$, we pull the covector back through the layers:

$$d(\mathcal{L} \circ f_k \circ \cdots \circ f_1)_x = f_1^*\bigl(f_2^*\bigl(\cdots f_k^*(d\mathcal{L}_y)\bigr)\bigr).$$

Notice the order is reversed. In each step, we apply the adjoint of the local Jacobian (the transpose of the weight matrix in linear layers) to the “incoming” gradient covector.

Why use pullbacks? If the input is high-dimensional (e.g., an image) and the output is a single scalar (the loss), it is computationally much cheaper to pull back a single covector than to push forward a basis of the entire tangent space. This is why “reverse-mode” autodiff (backprop) is the standard for deep learning.
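A sketch of this cost asymmetry with a tiny illustrative two-layer network (the weights and shapes are made up): reverse mode needs a single backward sweep, while forward mode needs one `jvp` per basis vector of the input tangent space.

```python
import jax
import jax.numpy as jnp

W1 = jnp.arange(32.0).reshape(4, 8) * 0.01   # illustrative layer weights
W2 = jnp.arange(24.0).reshape(8, 3) * 0.01

def network_loss(x):                          # scalar loss of a 2-layer net
    h = jnp.tanh(x @ W1)
    y = jnp.tanh(h @ W2)
    return jnp.sum(y ** 2)

x = jnp.linspace(-1.0, 1.0, 4)

# Reverse mode: one vjp sweep pulls the loss covector back to the input
g = jax.grad(network_loss)(x)

# Forward mode: one jvp per basis vector of the input tangent space
cols = jnp.array([jax.jvp(network_loss, (x,), (e,))[1] for e in jnp.eye(4)])
assert jnp.allclose(g, cols)
```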

CNN Example

In a CNN, the trainable parameters consist of the weights within the convolutional kernels (filters) and their associated biases. Let the manifold $M$ represent the space of all possible parameter values. Each point $\theta \in M$ is a specific setting of the parameters at some training step. The tangent space $T_\theta M$ represents all possible perturbations to the parameters at that point. When an optimizer like SGD updates the parameters, it is essentially moving along a tangent vector to a new point by some update rule.

CNNs operate on feature maps (multi-dimensional arrays, tensors). Let a single convolutional layer be a map from an input tensor space to an output tensor space:

$$F : \mathbb{R}^{H \times W \times C_{\text{in}}} \to \mathbb{R}^{H' \times W' \times C_{\text{out}}},$$

where $H$ and $W$ are the height and width of the input feature map, and $C_{\text{in}}$ and $C_{\text{out}}$ are the number of input and output channels, respectively. The input tensor $x$ is a vector in this domain. The pushforward (or differential), denoted by $dF_x$ or $F_*$, represents the Jacobian of the convolutional operation. Geometrically, if we perturb the input tensor by a small amount $\delta x$, the pushforward tells us how the output tensor changes:

$$\delta y \approx dF_x(\delta x).$$

Because convolution is a linear operation, the pushforward is simply the convolution itself.
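A 1D sketch of this fact, with `jnp.convolve` standing in for the CNN’s 2D convolution (kernel and inputs illustrative): the pushforward applied to a perturbation is just the convolution of the perturbation.

```python
import jax
import jax.numpy as jnp

k = jnp.array([1.0, -2.0, 0.5])           # illustrative 1D stand-in for a CNN filter

def conv(x):
    return jnp.convolve(x, k, mode="valid")

x = jnp.arange(6.0)                       # input "feature map"
dx = jnp.ones(6)                          # perturbation of the input

# Convolution is linear in x, so its pushforward is the convolution itself
_, d_out = jax.jvp(conv, (x,), (dx,))
assert jnp.allclose(d_out, conv(dx))
```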

The network concludes with a scalar loss function

$$\mathcal{L} : \mathbb{R}^{H' \times W' \times C_{\text{out}}} \to \mathbb{R}.$$

The differential of the loss function, $d\mathcal{L}_y$, is a covector belonging to the cotangent space at the output. It is a linear functional waiting to be fed a perturbation vector to tell us how the loss changes. To compute the gradient with respect to the input $x$, we do not invert the Jacobian matrix. Instead, we use the pullback of the convolutional layer, $F^*$. The pullback is the adjoint of the pushforward:

$$\langle F^* \beta, \mathbf{v} \rangle = \langle \beta, dF_x\, \mathbf{v} \rangle.$$

In the specific context of a CNN, the adjoint of a convolution operation is mathematically equivalent to a transposed convolution (also known as a deconvolution). This is why the backpropagation step in a CNN involves a transposed convolution. During backpropagation, the loss covector is pulled back through the layers by applying a transposed convolution using the same kernel weights. This mathematically mirrors the equation

$$F^*\, d\mathcal{L}_y = (dF_x)^\top\, d\mathcal{L}_y.$$
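Continuing the 1D sketch: the covector pulled back by `jax.vjp` coincides with a “full”-mode convolution with the flipped kernel, which is the 1D analogue of a transposed convolution.

```python
import jax
import jax.numpy as jnp

k = jnp.array([1.0, -2.0, 0.5])           # same illustrative kernel as above

def conv(x):
    return jnp.convolve(x, k, mode="valid")

x = jnp.arange(6.0)
y, pullback = jax.vjp(conv, x)            # pullback = adjoint of the pushforward
beta = jnp.array([1.0, 0.0, -1.0, 2.0])   # loss covector on the output (length 4)

# The adjoint of 'valid' convolution is 'full' convolution with the flipped kernel
(pb_beta,) = pullback(beta)
transposed = jnp.convolve(beta, k[::-1], mode="full")
assert jnp.allclose(pb_beta, transposed)
```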

Eventually, this yields the gradient of the loss with respect to the input $x$, via the covector

$$d(\mathcal{L} \circ F)_x = F^*\, d\mathcal{L}_y.$$

Since we cannot directly add a covector to a point on the manifold, we need a tangent vector. This requires the Riemannian metric to convert the covector into a tangent vector:

$$\nabla(\mathcal{L} \circ F)(x) = g^{-1}\, d(\mathcal{L} \circ F)_x.$$

This vector represents the direction of steepest ascent in the tangent space.

The following is unrelated to the class, but this section was further elaborated via Gemini 3.1 pro.

To get back to the original parameter space, we need to apply the “exponential map” to move from the tangent space back to the manifold. The idea is to “start at $\theta$, face in the direction of the gradient, and walk the straightest possible path (geodesic)”. To update our weights, we scale our gradient vector by a learning rate $\eta$ and then apply the exponential map:

$$\theta_{k+1} = \exp_{\theta_k}\!\bigl(-\eta\, \nabla \mathcal{L}(\theta_k)\bigr),$$

and so on. In practice, this notation is never used because we always assume the parameter manifold is a perfectly flat Euclidean space $\mathbb{R}^n$. Its geometry is trivial, so the “straightest possible path” is just a straight line.

Theorem (Derivative and Pullback Commute)

Let $F : M \to N$ be a map between manifolds, and let $h : N \to \mathbb{R}$ be a function on the target manifold. Then the following holds:

$$d(F^* h) = F^*(dh).$$

Theorem (Pushforward Distributes over Composition)

Let $F : M \to N$ and $G : N \to P$ be maps between manifolds. Then the pushforward distributes over composition:

$$d(G \circ F)_p = dG_{F(p)} \circ dF_p.$$

This is the same as the chain rule for derivatives.

Theorem (Pullback Distributes over Composition and Reverses Order)

Let $F : M \to N$ and $G : N \to P$ be maps between manifolds. Then the pullback distributes over composition and reverses order:

$$(G \circ F)^* = F^* \circ G^*.$$

Because the pullback is the adjoint (transpose) of the pushforward, distributing it across a composition reverses the order of operations. This is why backpropagation goes in the reverse order of the forward pass.
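A sketch of the order reversal with two illustrative maps: pulling a covector back through $G \circ F$ in one shot agrees with pulling it back through $G$ first and then through $F$.

```python
import jax
import jax.numpy as jnp

def F(x):                                        # F : R^2 -> R^2
    return jnp.array([x[0] * x[1], x[0] + x[1]])

def G(y):                                        # G : R^2 -> R^3
    return jnp.array([jnp.sin(y[0]), y[0] * y[1], y[1] ** 2])

p = jnp.array([0.4, -1.2])
beta = jnp.array([1.0, 2.0, -0.5])               # covector on the final target P

_, pb_comp = jax.vjp(lambda x: G(F(x)), p)       # (G o F)^* in one shot

_, pb_G = jax.vjp(G, F(p))                       # G^* at F(p)
_, pb_F = jax.vjp(F, p)                          # F^* at p
(two_stage,) = pb_F(*pb_G(beta))                 # F^*(G^*(beta)): reversed order

assert jnp.allclose(pb_comp(beta)[0], two_stage)
```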