Definition (Linear Map)

Let $V, W$ be two vector spaces. A map $L : V \to W$ is linear if

$$L(a\mathbf{u} + b\mathbf{v}) = a\,L(\mathbf{u}) + b\,L(\mathbf{v}) \qquad \text{for all } \mathbf{u}, \mathbf{v} \in V \text{ and scalars } a, b.$$

Theorem (Set of Linear Maps is a Vector Space)

The set of all linear maps from $V$ to $W$ is itself a vector space, with addition and scalar multiplication of maps defined pointwise.

Definition (Covector)

A covector on a vector space $V$ is a scalar-valued linear function

$$\alpha : V \to \mathbb{R}.$$

A covector is a linear function that takes in a vector and outputs a scalar. Geometrically, it is best to visualize a covector as a set of equidistant, parallel hyperplanes: in 2D, these are parallel lines; in 3D, they are parallel planes. They are like “slabs”, and the hyperplanes are the covector’s level sets.

Definition (Dual Space)

The dual space

$$V^* := \mathcal{L}(V, \mathbb{R})$$

of a vector space $V$ is the vector space of all covectors on $V$. The elements of $V^*$ are called dual vectors or covectors. The dual pairing is denoted by

$$\langle \alpha, \mathbf{v} \rangle := \alpha(\mathbf{v}), \qquad \alpha \in V^*,\ \mathbf{v} \in V.$$

It is the act of “feeding” a vector into a covector to get a scalar value. Visually, evaluating the dual pairing means counting the number of times the vector’s arrow pierces through the hyperplanes of the covector. If the arrow runs perfectly parallel to the covector’s lines, it pierces none of them, and the pairing is $0$. If the arrow stretches across $k$ level-set lines, the pairing evaluates to $k$.

Theorem (Dimension of Dual Space)

The dimension of the dual space $V^*$ is equal to the dimension of the original vector space $V$. That is, if $V$ is a finite-dimensional vector space, then

$$\dim V^* = \dim V.$$

Theorem (Dual Basis)

Let $\{\mathbf{e}_1, \dots, \mathbf{e}_n\}$ be a basis for a finite-dimensional vector space $V$. Then there is a unique dual basis $\{\mathbf{e}^1, \dots, \mathbf{e}^n\}$ for $V^*$ so that

$$\langle \mathbf{e}^i, \mathbf{e}_j \rangle = \delta^i_j,$$

where $\delta^i_j$ is the Kronecker delta. If we write a vector

$$\mathbf{v} = v^i \mathbf{e}_i$$

using a basis (summation over repeated indices implied), then we would write a covector as

$$\alpha = \alpha_i \mathbf{e}^i.$$

The dual pairing simply becomes

$$\langle \alpha, \mathbf{v} \rangle = \alpha_i v^i.$$

As with the visualization of the dual pairing, given some basis vector $\mathbf{e}_i$, the corresponding dual vector $\mathbf{e}^i$ pairs with it to give $\langle \mathbf{e}^i, \mathbf{e}_i \rangle = 1$; that is, the arrow of $\mathbf{e}_i$ crosses the slab of $\mathbf{e}^i$ exactly once.

Conventional Notation

Vectors are conventionally column vectors; covectors are row vectors. The dual pairing is then automatic matrix multiplication:

$$\langle \alpha, \mathbf{v} \rangle = \begin{pmatrix} \alpha_1 & \cdots & \alpha_n \end{pmatrix} \begin{pmatrix} v^1 \\ \vdots \\ v^n \end{pmatrix} = \alpha_i v^i.$$
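As a quick numerical sketch (the use of JAX here and in the later sketches is my own choice, not part of the notes), the pairing $\alpha_i v^i$ is a plain row-times-column product of the component arrays:

```python
import jax.numpy as jnp

alpha = jnp.array([2.0, -1.0, 0.5])  # covector components (a row)
v = jnp.array([1.0, 4.0, 2.0])       # vector components (a column)

# Dual pairing <alpha, v> = alpha_i v^i: row times column
pairing = jnp.dot(alpha, v)          # 2*1 + (-1)*4 + 0.5*2 = -1.0
```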

Vector components are labeled with an upper index $v^i$ and covector components are labeled with a lower index $\alpha_i$. We can also use bra-ket notation, where the vector is written as a “ket” $|v\rangle$ and the covector is written as a “bra” $\langle \alpha|$. The dual pairing is then written as $\langle \alpha | v \rangle$. There is also Penrose graphical notation.

Definition (Adjoint Linear Map)

For each linear map $L : V \to W$, there is a linear map $L^* : W^* \to V^*$ called its adjoint, defined by

$$\langle L^* \beta, \mathbf{v} \rangle = \langle \beta, L \mathbf{v} \rangle$$

for all $\beta \in W^*$ and $\mathbf{v} \in V$. The adjoint map is also denoted as $L^\top$; in matrix form, it is the transpose.
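A minimal numerical check of this definition, assuming an illustrative matrix $L$ in place of an abstract linear map (in matrix form, the adjoint is the transpose):

```python
import jax.numpy as jnp

# An illustrative linear map L : R^3 -> R^2, stored as a matrix
L = jnp.array([[1.0, 2.0, 0.0],
               [0.0, -1.0, 3.0]])

v = jnp.array([1.0, 0.5, -2.0])  # a vector in V = R^3
beta = jnp.array([2.0, 1.0])     # a covector in W*

# Defining property of the adjoint: <L* beta, v> = <beta, L v>
lhs = jnp.dot(L.T @ beta, v)
rhs = jnp.dot(beta, L @ v)
assert jnp.allclose(lhs, rhs)
```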

Definition (Tangent Space)

The tangent space $T_pM$ at $p \in M$ is the vector space of perturbations $\mathbf{v}$ of the point $p$.

This is a family of vector spaces, one for each point. In type theory, tangent spaces form a dependent type, because the type $T_pM$ depends on the point $p$.

Definition (Tangent Bundle)

The tangent bundle $TM$ is the space of all perturbation data:

$$TM := \{ (p, \mathbf{v}) : p \in M,\ \mathbf{v} \in T_pM \}.$$

In type theory, the tangent bundle is a dependent sum type. This is helpful because we can write

$$TM = \sum_{p \in M} T_pM.$$

Definition (Cotangent Space)

The cotangent space $T_p^*M$ at $p \in M$ is the dual space of the tangent space: $T_p^*M := (T_pM)^*$.

Definition (Cotangent Bundle)

The cotangent bundle $T^*M$ is the space of all covectors:

$$T^*M := \{ (p, \alpha) : p \in M,\ \alpha \in T_p^*M \} = \sum_{p \in M} T_p^*M.$$

Definition (Vector Field, Covector Field)

A vector field is a function of the following type:

$$X : \prod_{p \in M} T_pM, \qquad X(p) \in T_pM.$$

We also call a vector field a section of the tangent bundle.

In type theory, this is a dependent product type, because the output type $T_pM$ depends on the input $p$.

Similarly, a covector field is a function of type

$$\alpha : \prod_{p \in M} T_p^*M.$$

We also call a covector field a 1-form, and denote the space of 1-forms by $\Omega^1(M)$.

Types in Differential Calculus

Let

$$f : M \to \mathbb{R}.$$

The differential of the function $f$ at a point $p$ is a linear map $df_p : T_pM \to \mathbb{R}$ that approximates the change in $f$ near $p$. It is a covector field

$$df : \prod_{p \in M} T_p^*M,$$

where at any specific point $p$, we have $df_p \in T_p^*M$.

Definition (Directional Derivative)

The perturbation at a point $p$ is a tangent vector $\mathbf{v} \in T_pM$.

The directional derivative of $f$ at $p$ in the direction of $\mathbf{v}$ is the dual pairing:

$$D_{\mathbf{v}} f(p) := \langle df_p, \mathbf{v} \rangle.$$

Geometrically, let $\gamma : (-\epsilon, \epsilon) \to M$ be any curve satisfying $\gamma(0) = p$ and $\gamma'(0) = \mathbf{v}$. Then we can compute the directional derivative as the rate of change of $f$ along the curve:

$$\langle df_p, \mathbf{v} \rangle = \left.\frac{d}{dt}\right|_{t=0} f(\gamma(t)).$$

This quantity depends linearly on $\mathbf{v}$, but generally nonlinearly on $p$. Each datum $(p, \mathbf{v}) \in TM$ admits a representing curve $\gamma$ that passes through $p$ with velocity $\mathbf{v}$ at $t = 0$.
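As a computational sketch, JAX’s `jax.jvp` evaluates exactly this pairing; the scalar field `f`, point `p`, and perturbation `v` below are illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):                         # an illustrative scalar field f : R^2 -> R
    return jnp.sin(x[0]) * x[1]

p = jnp.array([0.3, 2.0])         # base point p
v = jnp.array([1.0, -1.0])        # tangent vector (perturbation) v

# jax.jvp returns f(p) together with the dual pairing <df_p, v>
f_p, df_v = jax.jvp(f, (p,), (v,))
```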

How does this connect with partial derivatives?

Definition (Coordinate System)

A coordinate system is a family of functions

$$x^i : M \to \mathbb{R}, \qquad i = 1, \dots, n,$$

so that $\{dx^1_p, \dots, dx^n_p\}$ is a basis of $T_p^*M$ at every $p \in M$.

The partial derivatives $\partial f / \partial x^i$ are defined as the coefficients of the covector $df_p$ under this coordinate-induced covector basis:

$$df_p = \frac{\partial f}{\partial x^i}(p)\, dx^i_p.$$

Definition (Riemannian Manifold)

Let $M$ be a manifold (domain). A Riemannian manifold $(M, g)$ is a manifold together with an inner product structure $g_p : T_pM \times T_pM \to \mathbb{R}$ at every point $p \in M$.

In general, given a Riemannian manifold, we don’t expect to find a coordinate system whose coordinate vectors are orthonormal everywhere. The failure to find an everywhere-orthonormal coordinate system is the non-Euclidean-ness of the Riemannian manifold.

Definition (Gradient Vector)

Let $(M, g)$ be a Riemannian manifold. The gradient $\nabla f(p) \in T_pM$ of a function $f$ at $p$ is defined by

$$g_p(\nabla f(p), \mathbf{v}) = \langle df_p, \mathbf{v} \rangle \qquad \text{for all } \mathbf{v} \in T_pM.$$

The gradient does not require coordinates and has little to do with partial derivatives. In index notation, it is

$$(\nabla f)^i = g^{ij}\, \frac{\partial f}{\partial x^j},$$

where $g^{ij}$ are the components of the inverse metric. The correct way of writing gradient descent is

$$p_{k+1} = p_k - \eta\, g^{-1}\, df_{p_k} = p_k - \eta\, \nabla f(p_k),$$

where $\eta$ is the learning rate.
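A sketch of one such descent step, assuming an illustrative constant metric `g` on $\mathbb{R}^2$; with the Euclidean metric $g = I$ this reduces to ordinary gradient descent:

```python
import jax
import jax.numpy as jnp

def f(p):                             # illustrative objective
    return jnp.sum(p ** 2) + p[0] * p[1]

g = jnp.array([[2.0, 0.5],            # an assumed (constant) metric g_ij
               [0.5, 1.0]])

p = jnp.array([1.0, -2.0])
df = jax.grad(f)(p)                   # components of the covector df_p
grad_f = jnp.linalg.solve(g, df)      # raise the index: (grad f)^i = g^{ij} df_j
p_next = p - 0.1 * grad_f             # one descent step with eta = 0.1
```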

Chain Rules

The differential of a function $f : M \to \mathbb{R}$ at a point $p$ is a linear map $df_p : T_pM \to \mathbb{R}$. This concept generalizes to a map between manifolds $F : M \to N$.

Definition (Pushforward Operator)

The pushforward operator (also called the differential or Jacobian) of a map $F : M \to N$ is a map between tangent bundles:

$$dF : TM \to TN, \qquad dF_p : T_pM \to T_{F(p)}N.$$

Geometrically, it “pushes” a velocity vector at $p$ to a velocity vector at $F(p)$. In coordinates, this is matrix-vector multiplication by the Jacobian matrix.
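A sketch with an illustrative map $F : \mathbb{R}^2 \to \mathbb{R}^3$, checking that the pushforward computed by `jax.jvp` agrees with multiplying by the explicit Jacobian matrix:

```python
import jax
import jax.numpy as jnp

def F(x):                          # illustrative map F : R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

p = jnp.array([0.5, 1.5])
v = jnp.array([1.0, 0.0])

_, dF_v = jax.jvp(F, (p,), (v,))   # pushforward dF_p(v)

J = jax.jacfwd(F)(p)               # the 3x2 Jacobian matrix at p
assert jnp.allclose(dF_v, J @ v)   # same as matrix-vector multiplication
```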

Definition (Pullback Operator for Functions)

The pullback operator for functions takes a function $h : N \to \mathbb{R}$ on the target space and “pulls” it back to the source space:

$$F^* h := h \circ F.$$

If $h$ is a scalar field on $N$, then $F^* h$ is a scalar field on $M$.

Definition (Pullback Operator for Covectors)

The pullback operator for covectors is the adjoint of the pushforward. It pulls covectors (1-forms) on $N$ back to $M$:

$$F^* : T_{F(p)}^*N \to T_p^*M.$$

The defining property is the dual pairing: $\langle F^* \beta, \mathbf{v} \rangle = \langle \beta, dF_p\, \mathbf{v} \rangle$ for all $\beta \in T_{F(p)}^*N$ and $\mathbf{v} \in T_pM$.
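JAX’s `jax.vjp` computes this pullback. Below is a sketch checking the defining property, reusing the illustrative map $F$ from the pushforward sketch:

```python
import jax
import jax.numpy as jnp

def F(x):                                   # illustrative map F : R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

p = jnp.array([0.5, 1.5])
beta = jnp.array([1.0, -1.0, 2.0])          # a covector at F(p)

Fp, pullback = jax.vjp(F, p)                # pullback implements F^* at p
(pb_beta,) = pullback(beta)                 # F^* beta, a covector at p

# Defining property: <F^* beta, v> = <beta, dF_p v>
v = jnp.array([0.3, -0.7])
_, dF_v = jax.jvp(F, (p,), (v,))
assert jnp.allclose(jnp.dot(pb_beta, v), jnp.dot(beta, dF_v))
```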

Backpropagation and Machine Learning

In machine learning, we often have a composition of maps (layers) $f_1, f_2, \dots, f_k$ and a final scalar loss function $\mathcal{L}$.

Forward Pass (Pushforward)

The forward pass is the computation of the maps themselves: $y = f_k(\cdots f_2(f_1(x)))$. The sensitivity of the output to a perturbation $\mathbf{v}$ of the input is computed by the pushforward:

$$d(f_k \circ \cdots \circ f_1)_x\, \mathbf{v} = (df_k)\bigl(\cdots (df_2)\bigl((df_1)_x\, \mathbf{v}\bigr)\bigr).$$

This is “forward-mode” differentiation.

Backward Pass (Pullback)

Backpropagation is the pullback of the loss differential. We start with the differential of the loss function, $d\mathcal{L}_y$, which is a covector (a row vector of partial derivatives). We want to find how the loss changes with respect to the input $x$.

Using the property $(G \circ F)^* = F^* \circ G^*$, we pull the covector back through the layers:

$$d(\mathcal{L} \circ f_k \circ \cdots \circ f_1)_x = f_1^*\bigl(f_2^*\bigl(\cdots f_k^*(d\mathcal{L}_y)\bigr)\bigr).$$

Notice the order is reversed. In each step, we apply the adjoint of the local Jacobian (the transpose of the weight matrix in linear layers) to the “incoming” gradient covector.

Why use pullbacks? If the input is high-dimensional (e.g., an image) and the output is a single scalar (the loss), it is computationally much cheaper to pull back a single covector than to push forward a basis of the entire tangent space. This is why “reverse-mode” autodiff (backprop) is the standard for deep learning.
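A sketch of this cost asymmetry with a tiny illustrative two-layer network (the weights and shapes are made up): reverse mode needs a single backward sweep, while forward mode needs one `jvp` per basis vector of the input tangent space.

```python
import jax
import jax.numpy as jnp

W1 = jnp.arange(32.0).reshape(4, 8) * 0.01   # illustrative layer weights
W2 = jnp.arange(24.0).reshape(8, 3) * 0.01

def network_loss(x):                          # scalar loss of a 2-layer net
    h = jnp.tanh(x @ W1)
    y = jnp.tanh(h @ W2)
    return jnp.sum(y ** 2)

x = jnp.linspace(-1.0, 1.0, 4)

# Reverse mode: one vjp sweep pulls the loss covector back to the input
g = jax.grad(network_loss)(x)

# Forward mode: one jvp per basis vector of the input tangent space
cols = jnp.array([jax.jvp(network_loss, (x,), (e,))[1] for e in jnp.eye(4)])
assert jnp.allclose(g, cols)
```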

CNN Example

In a CNN, the trainable parameters consist of the weights within the convolutional kernels (filters) and their associated biases. Let the manifold $M$ represent the space of all possible parameter values. Each point $\theta \in M$ is a specific setting of the parameters at some training step. The tangent space $T_\theta M$ represents all possible perturbations to the parameters at that point. When an optimizer like SGD updates the parameters, it is essentially moving along a tangent vector to a new point by some update rule.

CNNs operate on feature maps (multi-dimensional arrays, tensors). Let a single convolutional layer be a map from an input tensor space to an output tensor space:

$$F : \mathbb{R}^{H \times W \times C_{\text{in}}} \to \mathbb{R}^{H' \times W' \times C_{\text{out}}},$$

where $H$ and $W$ are the height and width of the input feature map, and $C_{\text{in}}$ and $C_{\text{out}}$ are the number of input and output channels, respectively. The input tensor $x$ is a vector in this domain. The pushforward (or differential), denoted by $dF_x$ or $F_*$, represents the Jacobian of the convolutional operation. Geometrically, if we perturb the input tensor by a small amount $\delta x$, the pushforward tells us how the output tensor changes:

$$\delta y \approx dF_x(\delta x).$$

Because convolution is a linear operation, the pushforward is simply the convolution itself.
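A 1D sketch of this fact, with `jnp.convolve` standing in for the CNN’s 2D convolution (kernel and inputs illustrative): the pushforward applied to a perturbation is just the convolution of the perturbation.

```python
import jax
import jax.numpy as jnp

k = jnp.array([1.0, -2.0, 0.5])           # illustrative 1D stand-in for a CNN filter

def conv(x):
    return jnp.convolve(x, k, mode="valid")

x = jnp.arange(6.0)                       # input "feature map"
dx = jnp.ones(6)                          # perturbation of the input

# Convolution is linear in x, so its pushforward is the convolution itself
_, d_out = jax.jvp(conv, (x,), (dx,))
assert jnp.allclose(d_out, conv(dx))
```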

The network concludes with a scalar loss function

$$\mathcal{L} : \mathbb{R}^{H' \times W' \times C_{\text{out}}} \to \mathbb{R}.$$

The differential of the loss function, $d\mathcal{L}_y$, is a covector belonging to the cotangent space at the output. It is a linear functional waiting to be fed a perturbation vector to tell us how the loss changes. To compute the gradient with respect to the input $x$, we do not invert the Jacobian matrix. Instead, we use the pullback of the convolutional layer, $F^*$. The pullback is the adjoint of the pushforward:

$$\langle F^* \beta, \mathbf{v} \rangle = \langle \beta, dF_x\, \mathbf{v} \rangle.$$

In the specific context of a CNN, the adjoint of a convolution operation is mathematically equivalent to a transposed convolution (also known as a deconvolution). This is why the backpropagation step in a CNN involves a transposed convolution. During backpropagation, the loss covector is pulled back through the layers by applying a transposed convolution using the same kernel weights. This mathematically mirrors the equation

$$F^*\, d\mathcal{L}_y = (dF_x)^\top\, d\mathcal{L}_y.$$
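Continuing the 1D sketch: the covector pulled back by `jax.vjp` coincides with a “full”-mode convolution with the flipped kernel, which is the 1D analogue of a transposed convolution.

```python
import jax
import jax.numpy as jnp

k = jnp.array([1.0, -2.0, 0.5])           # same illustrative kernel as above

def conv(x):
    return jnp.convolve(x, k, mode="valid")

x = jnp.arange(6.0)
y, pullback = jax.vjp(conv, x)            # pullback = adjoint of the pushforward
beta = jnp.array([1.0, 0.0, -1.0, 2.0])   # loss covector on the output (length 4)

# The adjoint of 'valid' convolution is 'full' convolution with the flipped kernel
(pb_beta,) = pullback(beta)
transposed = jnp.convolve(beta, k[::-1], mode="full")
assert jnp.allclose(pb_beta, transposed)
```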

Eventually, this yields the gradient of the loss with respect to the input $x$, via the covector

$$d(\mathcal{L} \circ F)_x = F^*\, d\mathcal{L}_y.$$

Since we cannot directly add a covector to a point on the manifold, we need a tangent vector. This requires the Riemannian metric to convert the covector into a tangent vector:

$$\nabla(\mathcal{L} \circ F)(x) = g^{-1}\, d(\mathcal{L} \circ F)_x.$$

This vector represents the direction of steepest ascent in the tangent space.

The following is unrelated to the class, but this section was further elaborated via Gemini 3.1 pro.

To get back to the original parameter space, we need to apply the “exponential map” to move from the tangent space back to the manifold. The idea is to “start at $\theta$, face in the direction of the gradient, and walk the straightest possible path (geodesic)”. To update our weights, we scale our gradient vector by a learning rate $\eta$ and then apply the exponential map:

$$\theta_{k+1} = \exp_{\theta_k}\!\bigl(-\eta\, \nabla \mathcal{L}(\theta_k)\bigr),$$

and so on. In practice, this notation is never used because we always assume the parameter manifold is a perfectly flat Euclidean space $\mathbb{R}^n$. Its geometry is trivial, so the “straightest possible path” is just a straight line.

Theorem (Derivative and Pullback Commute)

Let $F : M \to N$ be a map between manifolds, and let $h : N \to \mathbb{R}$ be a function on the target manifold. Then the following holds:

$$d(F^* h) = F^*(dh).$$

Theorem (Pushforward Distributes over Composition)

Let $F : M \to N$ and $G : N \to P$ be maps between manifolds. Then the pushforward distributes over composition:

$$d(G \circ F)_p = dG_{F(p)} \circ dF_p.$$

This is the same as the chain rule for derivatives.

Theorem (Pullback Distributes over Composition and Reverses Order)

Let $F : M \to N$ and $G : N \to P$ be maps between manifolds. Then the pullback distributes over composition and reverses order:

$$(G \circ F)^* = F^* \circ G^*.$$

Because the pullback is the adjoint (transpose) of the pushforward, distributing it across a composition reverses the order of operations. This is why backpropagation goes in the reverse order of the forward pass.
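A sketch of the order reversal with two illustrative maps: pulling a covector back through $G \circ F$ in one shot agrees with pulling it back through $G$ first and then through $F$.

```python
import jax
import jax.numpy as jnp

def F(x):                                        # F : R^2 -> R^2
    return jnp.array([x[0] * x[1], x[0] + x[1]])

def G(y):                                        # G : R^2 -> R^3
    return jnp.array([jnp.sin(y[0]), y[0] * y[1], y[1] ** 2])

p = jnp.array([0.4, -1.2])
beta = jnp.array([1.0, 2.0, -0.5])               # covector on the final target P

_, pb_comp = jax.vjp(lambda x: G(F(x)), p)       # (G o F)^* in one shot

_, pb_G = jax.vjp(G, F(p))                       # G^* at F(p)
_, pb_F = jax.vjp(F, p)                          # F^* at p
(two_stage,) = pb_F(*pb_G(beta))                 # F^*(G^*(beta)): reversed order

assert jnp.allclose(pb_comp(beta)[0], two_stage)
```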