🎯 REINFORCE Policy Gradient Derivation (Complete)
1. Objective Function
We want to maximize the expected return of the policy:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$$
where:
- $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$: the trajectory
- $R(\tau) = \sum_{t=0}^{T} r_t$: the total return of the trajectory
- $\pi_\theta(a_t \mid s_t)$: the policy; for a continuous action space this is a probability density value, and for a discrete action space it is a probability (e.g., a softmax output).
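For the discrete case, here is a minimal PyTorch sketch of such a softmax policy (the architecture and the names `PolicyNet`, `obs_dim`, `n_actions` are illustrative assumptions, not part of the derivation):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Minimal softmax policy for a discrete action space (illustrative sketch)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)                                  # unnormalized scores
        return torch.distributions.Categorical(logits=logits)  # pi_theta(. | s)

policy = PolicyNet(obs_dim=4, n_actions=2)
dist = policy(torch.zeros(4))   # action distribution for a dummy state
a = dist.sample()               # a_t ~ pi_theta(. | s_t)
logp = dist.log_prob(a)         # log pi_theta(a_t | s_t), used later in the gradient
```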
2. Trajectory Probability
The probability of a trajectory is:
$$P(\tau) = \rho(s_0) \cdot \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$$
where:
- $\rho(s_0)$: the initial state distribution
- $P(s_{t+1} \mid s_t, a_t)$: the state transition probability, which does not depend on $\theta$. Which action is taken is random and described by the policy; which state we land in after taking that action is also random and described by this transition probability.
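A tiny worked example of this factorization on a hypothetical two-state, two-action MDP (all probabilities below are made up purely for illustration):

```python
import math

rho = {0: 1.0, 1: 0.0}                    # initial state distribution rho(s0)
pi = {(0, 'a'): 0.7, (0, 'b'): 0.3,       # policy pi_theta(a | s)
      (1, 'a'): 0.4, (1, 'b'): 0.6}
P = {(0, 'a', 1): 0.9, (0, 'a', 0): 0.1,  # transitions P(s' | s, a)
     (1, 'b', 0): 0.5, (1, 'b', 1): 0.5}

# Trajectory tau = (s0=0, a0='a', s1=1, a1='b', s2=0), written as (s, a, s') steps.
tau = [(0, 'a', 1), (1, 'b', 0)]

prob = rho[0]
log_prob = math.log(rho[0])
for s, a, s_next in tau:
    prob *= pi[(s, a)] * P[(s, a, s_next)]
    log_prob += math.log(pi[(s, a)]) + math.log(P[(s, a, s_next)])

print(prob, math.exp(log_prob))  # identical: rho(s0) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t,a_t)
```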
3. Differentiating the Objective
We want to update the policy parameters $\theta$ by gradient ascent:
$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$$
Problem: how do we compute this gradient? Because the sampling distribution $\pi_\theta$ itself depends on $\theta$, the expectation cannot be differentiated directly.
The likelihood ratio trick resolves this; the derivation is as follows:
$$\nabla_\theta \mathbb{E}_{x \sim p_\theta(x)}[f(x)] = \nabla_\theta \int f(x)\, p_\theta(x)\, dx = \int f(x)\, \nabla_\theta p_\theta(x)\, dx$$
We do not differentiate $f(x)$ here because in reinforcement learning $f(x)$ is the reward, a scalar obtained by interacting with the environment, with no explicit dependence on $\theta$.
Using the chain rule (the log-derivative identity):
$$\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$$
Substituting this back:
$$= \int f(x)\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \mathbb{E}_{x \sim p_\theta(x)}\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]$$
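A quick numerical sanity check of this identity (a sketch; the choices $p_\theta = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$ are arbitrary illustrations). Here $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$, and the score is $\nabla_\theta \log p_\theta(x) = x - \theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # x ~ p_theta = N(theta, 1)

f = x ** 2           # f(x); E[f] = theta^2 + 1, so d/dtheta E[f] = 2 * theta
score = x - theta    # grad_theta log N(x; theta, 1)

grad_estimate = np.mean(f * score)  # Monte Carlo estimate of E[f(x) grad log p_theta(x)]
print(grad_estimate, 2 * theta)     # estimate is close to the exact value 3.0
```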
4. The Log-Probability Term
Note that:
$$\log P(\tau) = \log \rho(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$$
Since $\rho(s_0)$ and $P(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$:
$$\nabla_\theta \log P(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
5. The Policy Gradient Expression
Substituting into the likelihood ratio result (with $x = \tau$, $p_\theta = P(\tau)$, and $f(x) = R(\tau)$) gives the final gradient expression:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau) \right]$$
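In practice the expectation is replaced by a Monte Carlo estimate over sampled trajectories: the usual trick is to build a surrogate loss whose gradient is the negative of the expression above. A minimal sketch for one trajectory, assuming `policy` returns a `torch.distributions.Categorical` as in the earlier sketch (function and argument names are illustrative):

```python
import torch

def reinforce_loss(policy, states, actions, total_return):
    """Surrogate loss: its gradient is -sum_t grad log pi_theta(a_t|s_t) * R(tau)."""
    dist = policy(states)               # batched pi_theta(. | s_t)
    log_probs = dist.log_prob(actions)  # log pi_theta(a_t | s_t) for each step
    # Negative sign: minimizing this loss performs gradient ascent on J(theta).
    # total_return is a constant w.r.t. theta (it comes from the environment).
    return -(log_probs.sum() * total_return)
```

Calling `loss.backward()` on this quantity then accumulates the single-trajectory REINFORCE gradient estimate into the policy parameters.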
6. Replacing $R(\tau)$ with the Per-Step Discounted Return $G_t$
To attribute credit to each action more accurately (an action can only influence rewards that come after it), introduce:
$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$
The gradient is then rewritten as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]$$
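Computing $G_t$ for every step of an episode is a single backward pass over the reward list; a small sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```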
7. Adding a Baseline to Reduce Variance
Subtracting an action-independent baseline $b(s_t)$ leaves the gradient unbiased while reducing its variance:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \left( G_t - b(s_t) \right) \right]$$
A commonly used baseline is the state value function:
$$b(s_t) = V^\pi(s_t) \quad \Rightarrow \quad A_t = G_t - V(s_t)$$
which gives the advantage form:
$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$
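A sketch of how the advantage form typically appears in training code, assuming a hypothetical value network `value_net` that predicts $V(s_t)$ and is trained by regression toward $G_t$ (the combined loss and the `.detach()` placement are common practice, not dictated by the derivation):

```python
import torch

def reinforce_with_baseline_loss(policy, value_net, states, actions, returns):
    """Policy loss weighted by A_t = G_t - V(s_t), plus a value-regression loss."""
    dist = policy(states)
    log_probs = dist.log_prob(actions)      # log pi_theta(a_t | s_t)
    values = value_net(states).squeeze(-1)  # V(s_t)

    # Detach the baseline so only the score-function term drives the policy gradient.
    advantages = returns - values.detach()  # A_t = G_t - V(s_t)

    policy_loss = -(log_probs * advantages).sum()
    value_loss = torch.nn.functional.mse_loss(values, returns)  # fit b(s_t) to G_t
    return policy_loss + value_loss
```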
✅ Summary of Common Policy Gradient Forms
| Form | Expression |
| --- | --- |
| REINFORCE | $\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]$ |
| With baseline | $\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]$ |
| Advantage | $\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$ |
📌 Appendix: Gradient for a Continuous-Action Gaussian Policy
Assume the policy is a Gaussian with state-dependent mean $\mu_\theta(s)$ and fixed variance $\sigma^2$:
$$\pi_\theta(a \mid s) = \mathcal{N}\left(\mu_\theta(s), \sigma^2\right)$$
Then:
$$\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma^2} + \text{const}$$
and the gradient with respect to the policy parameters is:
$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2} \cdot \nabla_\theta \mu_\theta(s)$$
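This analytic formula can be checked against automatic differentiation; a brief sketch using `torch.distributions.Normal` (the numbers are arbitrary, and $\theta$ is taken to be the mean itself, so $\nabla_\theta \mu_\theta(s) = 1$):

```python
import torch

mu_param = torch.tensor(0.3, requires_grad=True)  # stands in for mu_theta(s)
sigma = 1.5
a = torch.tensor(1.0)                             # a sampled action

# Autograd gradient of log pi_theta(a|s) with respect to the mean parameter.
log_prob = torch.distributions.Normal(mu_param, sigma).log_prob(a)
log_prob.backward()

analytic = (a - mu_param.detach()) / sigma**2     # (a - mu)/sigma^2, with d mu/d theta = 1
print(mu_param.grad.item(), analytic.item())      # both ~0.3111
```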