Example 1 (Stationary Distributions)
Consider an $N$-state Markov chain with transition matrix $P$, where $P_{ij} = \Pr(X_{n+1} = j \mid X_n = i)$ and each row of $P$ sums to $1$. Then
$$\pi P = \pi, \qquad \sum_i \pi_i = 1,$$
where $\pi$ is the stationary distribution.
Proof:
Let $\pi = (\pi_1, \dots, \pi_N)$ be a row vector with $\sum_i \pi_i = 1$. Note that if $X_n \sim \pi$, then
$$\Pr(X_{n+1} = j) = \sum_i \pi_i P_{ij} = (\pi P)_j,$$
which implies that $X_{n+1} \sim \pi P$, from Proposition (Stochastic Properties). Setting $\pi P = \pi$, we get that $\pi$ solves the linear system $\pi(P - I) = 0$ together with $\sum_i \pi_i = 1$, and the distribution is then preserved at every step.
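As a numerical sketch, the stationary distribution can be found by solving $\pi P = \pi$, $\sum_i \pi_i = 1$ as a linear system. The 2-state matrix below is hypothetical (chosen only for illustration):

```python
import numpy as np

# Hypothetical 2-state transition matrix (entries chosen for illustration).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# pi P = pi  <=>  (P - I)^T pi^T = 0; append the normalization sum(pi) = 1.
n = P.shape[0]
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)        # stationary distribution, here [5/6, 1/6]
print(pi @ P)    # equals pi: invariant under P
```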
Example 2 (Starting Distributions Don’t Matter)
Suppose we have states $1, \dots, N$ and transition matrix $P$, and let $\mu$ be a starting distribution with $X_0 \sim \mu$, with $\pi$ being defined as usual. Then the distribution of $X_n$ is
$$\mu P^n \to \pi \quad \text{as } n \to \infty,$$
where $\pi$ is the stationary distribution of $P$.
Set $\mu'$ to be any other starting distribution. Then $\mu' P^n \to \pi$ as well, such that the limit does not depend on the start.
Each row of $P^n$ converges to $\pi$, and $\mu P^n$ is a weighted combination of those rows. Because the sum of the starting distribution must equal $1$, it does not matter what distribution we start from. We will always return to the stationary vector $\pi$.
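A quick numerical check that the start does not matter: iterate $\mu P^n$ from two very different starting distributions and compare. The transition matrix is hypothetical, picked only for illustration:

```python
import numpy as np

# Hypothetical 2-state transition matrix (for illustration only).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Two very different starting distributions.
results = []
for mu in (np.array([1.0, 0.0]), np.array([0.25, 0.75])):
    dist = mu.copy()
    for _ in range(60):
        dist = dist @ P        # distribution of X_n is mu P^n
    results.append(dist)

print(results[0], results[1])  # both approach the same stationary pi
```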
Remark
- For any state $i$, the probability of the system being in state $i$ at time $n$ approaches a fixed value $\pi_i$ as $n$ goes to infinity. Essentially, after enough time, the specific starting state is irrelevant.
- Once the distribution reaches the long-term state $\pi$, it stays there. The distribution is invariant under the transition matrix $P$: $\pi P = \pi$. So we get that $\pi$ is a row (left) eigenvector of $P$ with eigenvalue $1$.
Theorem (Average Transition Behavior)
For a Markov chain,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} (P^m)_{ij} = \pi_j.$$
We are essentially asking:
If we simulated all possible Markov chains from state $i$, what fraction of them would be in state $j$ on average?
The idea is that even if $(P^m)_{ij}$ bounces around, the average of those probabilities settles to $\pi_j$.
Proof:
$(P^m)_{ij} \to \pi_j$ as $m \to \infty$, so the Cesàro averages $\frac{1}{n}\sum_{m=1}^{n} (P^m)_{ij}$ converge to the same limit. So, the mean fraction of the time spent in state $j$ is given by $\pi_j$.
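The Cesàro average can be checked numerically: average the first $n$ powers of $P$ and compare each row to the stationary distribution. The chain below is hypothetical, with stationary distribution $[1/3, 2/3]$:

```python
import numpy as np

# Hypothetical irreducible chain; its stationary distribution is [1/3, 2/3].
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])

# Cesaro average (1/n) * sum_{m=1}^{n} P^m.
n = 5000
Pm = np.eye(2)
avg = np.zeros_like(P)
for _ in range(n):
    Pm = Pm @ P
    avg += Pm
avg /= n

print(avg)   # every row is close to the stationary distribution
```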
Theorem (Time Average of the Indicator)
What if we ran every Markov chain, but every time we reached state $j$ at some time $m$, we counted it? What fraction of the time do we spend in state $j$? More formally, what is
$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \mathbb{1}\{X_m = j\},$$
where
$$\mathbb{1}\{X_m = j\} = \begin{cases} 1 & \text{if } X_m = j, \\ 0 & \text{otherwise?} \end{cases}$$
If the chain has stationary distribution $\pi$, then this value converges to $\pi_j$ for all $j$.
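This time average is easy to see empirically: run one long trajectory and count the fraction of steps spent in each state. The chain below is hypothetical (stationary distribution $[1/3, 2/3]$), and the step count is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])     # hypothetical chain with pi = [1/3, 2/3]

# One long trajectory; count the fraction of steps spent in each state.
n_steps = 100_000
x = 0
counts = np.zeros(2)
for _ in range(n_steps):
    x = rng.choice(2, p=P[x])  # X_m drawn from row X_{m-1} of P
    counts[x] += 1

print(counts / n_steps)        # close to [1/3, 2/3]
```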
Theorem (Law of Large Numbers for Markov Chains)
Instead of counting visits, we can count the "reward" of visiting specific states: for a reward function $f$ on the state space,
$$\frac{1}{n} \sum_{m=1}^{n} f(X_m) \to \sum_{j} f(j)\,\pi_j = \mathbb{E}_{\pi}[f(X)] \quad \text{as } n \to \infty.$$
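A sketch of the law of large numbers in action: average a per-state reward along one trajectory and compare it to $\mathbb{E}_\pi[f]$. Both the chain and the reward vector here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])     # hypothetical chain, pi = [1/3, 2/3]
f = np.array([10.0, -2.0])     # hypothetical per-state reward f(j)

# Time-average reward along one trajectory.
n_steps = 200_000
x, total = 0, 0.0
for _ in range(n_steps):
    x = rng.choice(2, p=P[x])
    total += f[x]

print(total / n_steps)         # close to sum_j f(j) pi_j = 10/3 - 4/3 = 2.0
```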
Remarks (Markov Chain Computation)
- We can view Markov chains as evolutions of probability vectors (PMFs): $\mu_{n+1} = \mu_n P$. But this is hard to compute for large state spaces (when the number of states $N$ is large).
- Markov chains are a random sequence of states $X_0, X_1, X_2, \dots$, which is often easier to simulate.
- Recall from Discrete Case that we can generate a discrete random variable $X$ where $\Pr(X = a_i) = p_i$ for $i = 1, \dots, k$ and $\sum_i p_i = 1$.
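The bullets above can be sketched as inverse-CDF sampling of a discrete random variable from a uniform draw; the PMF below and the function name are hypothetical, not from the original notes:

```python
import numpy as np

def sample_discrete(p, u):
    """Inverse-CDF sampling: smallest index i with p[0]+...+p[i] >= u."""
    idx = int(np.searchsorted(np.cumsum(p), u))
    return min(idx, len(p) - 1)   # guard against float round-off at u ~ 1

rng = np.random.default_rng(2)
p = [0.2, 0.5, 0.3]               # hypothetical PMF (p_1, ..., p_k)
draws = [sample_discrete(p, rng.random()) for _ in range(100_000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)                      # empirical frequencies close to p
```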
Algorithm (Generate/Simulate Markov Chains)
Input: Transition matrix $P$ and an initial distribution $\mu$.
Output: A Markov chain $X_0, X_1, \dots, X_N$ where $X_0 \sim \mu$.
- Generate $X_0 \sim \mu$.
- For $n = 1$ to $N$:
  - Generate $X_n$ from $P(X_{n-1}, \cdot)$, the PMF over states given by the $X_{n-1}$'th row of $P$.
- Repeat this $M$ times to get $M$ samples of $X_N$; for large $N$, these are approximately distributed as $\pi$.
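The algorithm above can be sketched as follows; the function name, transition matrix, and initial distribution are hypothetical placeholders:

```python
import numpy as np

def simulate_chain(P, mu, N, rng):
    """Simulate X_0, ..., X_N with X_0 ~ mu and X_n drawn from the
    X_{n-1}'th row of the transition matrix P."""
    P, mu = np.asarray(P), np.asarray(mu)
    x = rng.choice(len(mu), p=mu)           # X_0 ~ mu
    path = [int(x)]
    for _ in range(N):
        x = rng.choice(P.shape[1], p=P[x])  # X_n ~ P(X_{n-1}, .)
        path.append(int(x))
    return path

# Hypothetical inputs for illustration.
P = [[0.9, 0.1], [0.5, 0.5]]
mu = [0.5, 0.5]
path = simulate_chain(P, mu, 10, np.random.default_rng(3))
print(path)   # a length-11 list of states in {0, 1}
```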
The key idea is to generate samples from a distribution $\pi$ by simulating a Markov chain whose stationary distribution is $\pi$. Often $\pi$ is known only up to a normalizing constant. By Theorem (Average Transition Behavior), Theorem (Time Average of the Indicator), and Theorem (Law of Large Numbers for Markov Chains), this algorithm converges to $\pi$.
Remark (Discreteness)
This algorithm also works for discrete time and states. Let $X_1, \dots, X_n$ be discrete random variables, e.g. $X_i \in \{0, 1\}$. Then the target PMF of $X = (X_1, \dots, X_n)$ is
$$\pi(x) = \frac{f(x)}{Z},$$
where
$$Z = \sum_{x} f(x).$$
The stationary distribution $\pi$ has a special PMF form:
$$\pi(i) = \frac{f(i)}{Z}$$
for all states $i$, where the scores $f(i) \ge 0$ are known and $Z$ is an unknown normalization. In particular,
$$Z = \sum_{i} f(i).$$
You can think of $f(i)$ as the "score" or weight of each state. If $f(i) > f(j)$, then state $i$ is more likely than state $j$. $Z$ is how we normalize the PMF $\pi$.
In practice, $Z$ is hard to compute, as it is the sum of exponentially many terms (e.g. $2^n$ terms when $x \in \{0, 1\}^n$). $Z$ is called the partition function in statistical physics.
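For a state space small enough to enumerate, the normalization is just a sum; the scores below are hypothetical, chosen only to make $\pi(i) = f(i)/Z$ concrete:

```python
import numpy as np

# Hypothetical unnormalized scores f(i) over a small state space.
f = np.array([1.0, 4.0, 2.0, 3.0])

Z = f.sum()      # partition function: tractable only for small spaces
pi = f / Z       # normalized PMF pi(i) = f(i) / Z

print(Z, pi)     # Z = 10.0, pi = [0.1, 0.4, 0.2, 0.3]
```

For $n$ binary variables the same sum has $2^n$ terms, which is exactly why MCMC-style simulation is used instead of computing $Z$ directly.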