RL【3】：Bellman Optimality Equation

系列文章目录

文章目录

系列文章目录
前言
Definition of optimal policy
Bellman optimality equation
- Introduction
- Maximization on the right-hand side
- Contraction mapping theorem
- Solution
- Optimality
Analyzing optimal policies
总结

前言

本系列文章主要用于记录 B站赵世钰老师的【强化学习的数学原理】的学习笔记，关于赵老师课程的具体内容，可以移步：
B站视频：【【强化学习的数学原理】课程：从零开始到透彻理解（完结）】
GitHub 课程资料：Book-Mathematical-Foundation-of-Reinforcement-Learning

Definition of optimal policy

The state value could be used to evaluate if a policy is good or not: if

$vπ1(s)≥vπ2(s),∀s∈Sv_{\pi_1}(s) \geq v_{\pi_2}(s), \quad \forall s \in \mathcal{S}$

then $π1\pi_1$ is “better” than $π2\pi_2$ .

A policy $π∗\pi^*$ is optimal if

$vπ∗(s)≥vπ(s),∀s∈S,∀π.v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \; \forall \pi.$

Bellman optimality equation

Introduction

Bellman optimality equation (elementwise form):

$\max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a) r + \gamma \sum_{s'} p(s' \mid s,a) v(s') \right), \quad \forall s \in \mathcal{S}$

$\max_{\pi} \sum_a \pi(a \mid s) q(s,a), \quad s \in \mathcal{S}$

Remarks:

$\mid s,a), p(s' \mid s,a)$ are known.
$v (s), v (s^{'})$ are unknown and to be calculated.
$π(s)\pi(s)$ is known!

结合 Bellman 方程依赖于已知策略，解释为什么在 Bellman 最优性方程 里要取 $max⁡π\max_\pi$ ，以及它和最优策略 $π∗\pi^*$ 的关系

回顾：Bellman 方程（依赖已知策略）

对于一个固定的策略 $π\pi$ ，状态价值函数定义为：

$vπ(s)=∑aπ(a∣s)[∑rp(r∣s,a)r+γ∑s’p(s’∣s,a)vπ(s’)]v_\pi(s) = \sum_a \pi(a \mid s) \Bigg[ \sum_r p(r \mid s,a) r + \gamma \sum_{s’} p(s’ \mid s,a) v_\pi(s’) \Bigg]$

这里 $π(a∣s)\pi(a|s)$ 是 已知的动作分布，所以 Bellman 方程在这种情况下是 策略评估 (policy evaluation) 工具。
策略 = 对每个状态 $s$ ，给所有可能动作 $a$ 分配概率，即一个策略对应了一组“所有状态的动作概率分布”。
不同策略对应的就是 不同的动作概率分布集合。

策略 $π1\pi_1$ ：

在状态 $s$ ，可能给动作 $a_1$ 的概率高一些，给动作 $a_2$ 的概率低一些。

策略 $π2\pi_2$ ：

在相同状态 $s$ ，可能恰好相反，给 $a_1$ 的概率低，给 $a_2$ 的概率高。

从策略依赖到最优性

如果我们不想只评估一个具体策略，而是想找到 最优策略，那就需要考虑：对于每个状态 $s$ ，什么样的动作选择（或策略）能最大化长期回报？

于是，状态价值的定义变成：

$v∗(s)=max⁡πvπ(s),∀s∈Sv^*(s) = \max_\pi v_\pi(s), \quad \forall s \in \mathcal{S}$

这里的 $max⁡π\max_\pi$ 表示：在所有可能的策略中，找到一个能使价值函数最大的策略。

Bellman 最优性方程

把 $max⁡π\max_\pi$ 引入 Bellman 方程，得到：

$v(s)=max⁡a[∑rp(r∣s,a)r+γ∑s’p(s’∣s,a)v(s’)]v^(s) = \max_a \Bigg[ \sum_r p(r \mid s,a) r + \gamma \sum_{s’} p(s’ \mid s,a) v^(s’) \Bigg]$

关键点：

在普通 Bellman 方程里： $π(a∣s)\pi(a|s)$ 是已知的分布。
在 Bellman 最优性方程里：我们不固定策略，而是 直接在动作层面取最大化，等价于“选择最优动作”。
因此，最优价值函数 $v^*(s)$ 不再依赖于某个具体的 $π\pi$ ，而是内含了 策略优化的过程。

和最优策略 $π∗\pi^*$ 的关系

定义：

$π(s)=arg⁡max⁡a[∑rp(r∣s,a)r+γ∑s’p(s’∣s,a)v(s’)]\pi^(s) = \arg\max_a \Bigg[ \sum_r p(r \mid s,a) r + \gamma \sum_{s’} p(s’ \mid s,a) v^(s’) \Bigg]$

即最优策略就是在每个状态下选择能使未来回报最大的动作。

换句话说：

普通 Bellman 方程 = 已知策略的价值评估。
Bellman 最优性方程 = 在所有策略中选择最优的价值函数，从而定义了最优策略。

Bellman optimality equation (matrix-vector form):

$\max_\pi (r_\pi + \gamma P_\pi v)$

where the elements corresponding to $s$ or $s^{'}$ are

$[rπ]s≜∑aπ(a∣s)∑rp(r∣s,a)r,[r_\pi]_s \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s,a) r,$

$[Pπ]s,s′=p(s′∣s)≜∑aπ(a∣s)∑s′p(s′∣s,a)[P_\pi]{s,s'} = p(s' \mid s) \triangleq \sum_a \pi(a \mid s) \sum{s'} p(s' \mid s,a)$

Here $max⁡π\max_\pi$ is performed elementwise.

Maximization on the right-hand side

Solve the Bellman optimality equation

必须先考虑右边的式子，即先有某个最优策略 $π\pi$ ，然后才有最优的状态价值 $v (s)$

elementwise form

$\max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a) r + \gamma \sum_{s'} p(s' \mid s,a) v(s') \right), \quad \forall s \in \mathcal{S}$
- Fix $v^{'} (s)$ first and solve $π\pi$ :
  
  $\max_\pi \sum_a \pi(a \mid s) q(s,a)$
- Inspired by the above example, considering that $∑aπ(a∣s)=1\sum_a \pi(a \mid s) = 1$ , we have
  
  $max⁡π∑aπ(a∣s)q(s,a)=max⁡a∈A(s)q(s,a),\max_\pi \sum_a \pi(a \mid s) q(s,a) = \max_{a \in \mathcal{A}(s)} q(s,a),$
- where the optimality is achieved when
  
  $π(a∣s)={1,a=a∗0,a≠a∗\pi(a \mid s) = \begin{cases} 1, & a = a^* \\ 0, & a \neq a^* \end{cases}$
- where $a^* = \arg\max_a q(s,a)$ .
matrix-vector form

$\max_\pi (r_\pi + \gamma P_\pi v)$
- Let
  
  $\max_\pi (r_\pi + \gamma P_\pi v).$
- Then, the Bellman optimality equation becomes
  
  $v = f (v)$
- where
  
  $\max\pi \sum_a \pi(a \mid s) q(s,a), \quad s \in \mathcal{S}.$

Contraction mapping theorem

Fixed point:

$\in X$ is a fixed point of $\to X$ if $f (x) = x$

Contraction mapping (or contractive function):

$f$ is a contraction mapping if $∥f(x1)−f(x2)∥≤γ∥x1−x2∥\| f(x_1) - f(x_2) \| \leq \gamma \| x_1 - x_2 \|$ , where $γ∈(0,1)\gamma \in (0,1)$ .

$γ\gamma$ must be strictly less than $1$ so that many limits such as $γk→0\gamma^k \to 0$ as $\to \infty$ hold.
Here $∥⋅∥\| \cdot \|$ can be any vector norm.

Contraction mapping theorem

For any equation that has the form of $x = f (x)$ , if $f$ is a contraction mapping, then

Existence: There exists a fixed point $x^*$ satisfying $f(x^*) = x^*$ .
Uniqueness: The fixed point $x^*$ is unique.
Algorithm: Consider a sequence ${x_k\}$ where $x_{k+1} = f(x_k)$ , then $xk→x∗x_k \to x^*$ as $\to \infty$ . Moreover, the convergence rate is exponentially fast.

Solution

Let’s come back to the Bellman optimality equation:

$\max_\pi (r_\pi + \gamma P_\pi v)$

Contraction Property:

$f (v)$ is a contraction mapping satisfying

$∥f(v1)−f(v2)∥≤γ∥v1−v2∥\| f(v_1) - f(v_2) \| \leq \gamma \| v_1 - v_2 \|$
where $γ\gamma$ is the discount rate!

Existence, Uniqueness, and Algorithm:

For the BOE $\max_\pi (r_\pi + \gamma P_\pi v)$ , there always exists a solution $v^*$ and the solution is unique. The solution could be solved iteratively by

$vk+1=f(vk)=max⁡π(rπ+γPπvk)v_{k+1} = f(v_k) = \max_\pi (r_\pi + \gamma P_\pi v_k)$
This sequence ${v_k\}$ converges to $v^*$ exponentially fast given any initial guess $v_0$ . The convergence rate is determined by $γ\gamma$ .

Optimality

Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies

$v∗=max⁡π(rπ+γPπv∗)v^* = \max_\pi (r_\pi + \gamma P_\pi v^*)$

Suppose

$π∗=arg⁡max⁡π(rπ+γPπv∗)\pi^* = \arg\max_\pi (r_\pi + \gamma P_\pi v^*)$

Then

$v∗=rπ∗+γPπ∗v∗v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$

Therefore, $π∗\pi^*$ is a policy and $v∗=vπ∗v^* = v_{\pi^*}$ is the corresponding state value.

Policy Optimality

Suppose that $v^*$ is the unique solution to $\max_\pi (r_\pi + \gamma P_\pi v),$
and $vπv_\pi$ is the state value function satisfying $vπ=rπ+γPπvπv_\pi = r_\pi + \gamma P_\pi v_\pi$ for any given policy $π\pi$ , then

$v≥vπ,∀πv^ \geq v_\pi, \quad \forall \pi$

Greedy Optimal Policy

For any $\in \mathcal{S}$ , the deterministic greedy policy

$π(a∣s)={1,a=a(s)0,a≠a(s)\pi^(a \mid s) = \begin{cases} 1, & a = a^(s) \\ 0, & a \neq a^(s) \end{cases}$
is an optimal policy solving the BOE. Here,

$a^(s) = \arg\max_a q^(s,a),*$
where $q(s,a):=∑rp(r∣s,a)r+γ∑s′p(s′∣s,a)v∗(s′)q^(s,a) := \sum_r p(r \mid s,a) r + \gamma \sum_{s'} p(s' \mid s,a) v^*(s')$
Proof

$π(s)=arg⁡max⁡π∑aπ(a∣s)(∑rp(r∣s,a)r+γ∑s′p(s′∣s,a)v(s′))\pi^(s) = \arg\max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a) r + \gamma \sum_{s'} p(s' \mid s,a) v^(s') \right)$

对公式的概括

$v^*$ 的意义

$v^*$ 是一个“理想值表”，它告诉你：从每个状态出发，如果以后一直做出最优选择，能拿到的总回报是多少。

它满足一个自洽的关系式（Bellman 最优性方程）：

$v∗(s)=max⁡a[即时奖励+γ×未来价值]v^*(s) = \max_a \Big[\text{即时奖励} + \gamma \times \text{未来价值}\Big]$

$π∗\pi^*$ 的意义

$π∗\pi^*$ 是“最优策略”，它规定了：在每个状态下应该采取哪个动作，才能保证总回报不比任何其他策略差。
从公式上看， $π∗\pi^*$ 就是选择能让 $v^*$ 达到最大的那个动作（也就是“贪心选择”）。

Policy Optimality 定理

对于任意其他策略 $π\pi$ ，它的价值函数 $vπv_\pi$ 都不会超过 $v^*$ 。
$v^*$ 是所有策略里能实现的最高水平，它一定支配所有其他策略的价值表。

Greedy Optimal Policy 定理

只要你已经有了 $v^*$ ，那么直接在每个状态里“选那个让回报最大化的动作”就能得到 $π∗\pi^*$ 。
最优策略其实就是“贪心地”选动作，但前提是这个贪心是基于正确的 $v^*$ 。

更进一步的解释

为什么要先有 $v^*$ 才能得到 $π∗\pi^*$ ？
因为 $π∗\pi^*$ 的定义依赖于“未来回报”，而未来回报就是由 $v^*$ 描述的。
一旦知道了 $v^*$ ，最优策略就能“顺理成章”地通过贪心法则推出来。

为什么 $v^*$ 比任何 $vπv_\pi$ 都大？
$vπv_\pi$ 是“固定策略下”的表现。
$v^*$ 是在每一步都挑选最优动作的表现。
显然，如果你随时都能选最好的动作，你的表现不可能比其他任何固定策略差。

为什么贪心策略一定最优？
因为 Bellman 方程已经保证了：在 $v^*$ 下，每个状态的最优价值都等于“选择最优动作”得到的回报。
所以只要你在每个状态都执行这个“最优动作”，整个过程的价值函数自然等于 $v^*$ 。
也就是说：贪心 + 正确的价值表 = 全局最优策略。

Analyzing optimal policies

What factors determine the optimal policy?

It can be clearly seen from the BOE

$\max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a) r \;+\; \gamma \sum_{s'} p(s' \mid s,a) v(s') \right)$
that there are three factors:
- Reward design: $r$
- System model: $\mid s,a), \; p(r \mid s,a)$
- Discount rate: $γ\gamma$
In this equation, $\pi(a \mid s)$ are unknowns to be calculated.

Optimal Policy Invariance

Consider a Markov decision process with $v∗∈R∣S∣v^* \in \mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying

$v∗=max⁡π(rπ+γPπv∗).v^* = \max_\pi (r_\pi + \gamma P_\pi v^*).$
If every reward $r$ is changed by an affine transformation to $a r + b$ , where $\in \mathbb{R}$ and $\neq 0$ , then the corresponding optimal state value $v^{'}$ is also an affine transformation of $v^*$ :

$v^* + \frac{b}{1-\gamma} \mathbf{1},$
- where $γ∈(0,1)\gamma \in (0,1)$ is the discount rate and $1=[1,…,1]T\mathbf{1} = [1, \ldots, 1]^T$ .
Consequently, the optimal policies are invariant to the affine transformation of the reward signals.