Reinforcement Learning (Second Edition)


2024-07-14 09:53:29 | Source: compiled from the web

Expected return

$$
\begin{aligned}
G_t &= R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \\
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad 0 \le \gamma \le 1 \\
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\
&= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) \\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$
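The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ suggests computing all returns of a finished episode in one backward pass. The following is a minimal sketch, assuming a finite episode whose rewards are stored in a list with `rewards[t]` holding $R_{t+1}$; the function names and the sample rewards are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for an episode with rewards [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: nothing is collected after the terminal step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns


def direct_return(rewards, gamma, t=0):
    """Evaluate the defining sum: sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))


rewards = [1.0, 0.0, 2.0]  # illustrative values for R_1, R_2, R_3
gamma = 0.5
returns = discounted_returns(rewards, gamma)
print(returns)                        # [1.5, 1.0, 2.0]
print(direct_return(rewards, gamma))  # 1.5
```

Both routes give the same $G_0$; the backward pass costs one multiplication and one addition per step.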

Value functions

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right], \quad \forall s \in \mathcal{S} \\
v_\pi(s) &= \sum_a \pi(a \mid s)\, q_\pi(s,a) \\
q_\pi(s,a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right] \\
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]
\end{aligned}
$$
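The two identities linking $v_\pi$ and $q_\pi$ can be checked numerically on a toy problem. The sketch below assumes a hypothetical MDP with one non-terminal state `"s"` and a terminal state `"end"`, a uniform random policy, and $\gamma = 0.5$; for this setup $v_\pi(s) = 2/3$ follows by hand from $v = 0.5\,(1 + 0.5\,v) + 0.5 \cdot 0$. All names and numbers are illustrative assumptions.

```python
GAMMA = 0.5  # assumed discount factor for this toy example

# Hypothetical dynamics: P[(s, a)] lists (probability, next_state, reward).
P = {
    ("s", "stay"): [(1.0, "s", 1.0)],
    ("s", "exit"): [(1.0, "end", 0.0)],
}
# Assumed policy pi(a | s): uniform over the two actions.
PI = {"s": {"stay": 0.5, "exit": 0.5}}


def q_from_v(v, s, a):
    """q_pi(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v_pi(s'))."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])


def v_from_q(q, s):
    """v_pi(s) = sum_a pi(a | s) * q_pi(s, a)."""
    return sum(prob * q[(s, a)] for a, prob in PI[s].items())


v = {"s": 2.0 / 3.0, "end": 0.0}  # hand-derived v_pi; terminal value is 0
q = {("s", a): q_from_v(v, "s", a) for a in PI["s"]}
print(q)                 # q("s", "stay") is about 1.333, q("s", "exit") is 0.0
print(v_from_q(q, "s"))  # about 0.667, matching v["s"]
```

Going from $v_\pi$ to $q_\pi$ and back reproduces the starting value, which is exactly what the two identities assert when they are composed.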

Recursive relationships

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1} \\
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma\, \mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} = s' \right] \right] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right] \quad \text{(the Bellman equation for $v_\pi$)} \\
q_\pi(s,a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \\
&= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right] \quad \text{(the Bellman equation for $q_\pi$)}
\end{aligned}
$$
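Read as an assignment rather than an equality, the Bellman equation for $v_\pi$ becomes the update used in iterative policy evaluation. The sketch below assumes a hypothetical two-state continuing MDP with deterministic transitions, a uniform random policy, and $\gamma = 0.9$; solving the two Bellman equations by hand gives $v_\pi(A) = 7.25$ and $v_\pi(B) = 7.75$. The MDP, the tolerance, and all names are assumptions made for illustration.

```python
GAMMA = 0.9  # assumed discount factor

STATES = ["A", "B"]
ACTIONS = ["left", "right"]

# Hypothetical dynamics: P[(s, a)] lists (probability, next_state, reward).
P = {
    ("A", "left"): [(1.0, "A", 0.0)],
    ("A", "right"): [(1.0, "B", 1.0)],
    ("B", "left"): [(1.0, "A", 0.0)],
    ("B", "right"): [(1.0, "B", 2.0)],
}
# Assumed policy pi(a | s): uniform random in every state.
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        PI[s][a] * sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])
        for a in ACTIONS
    )


def policy_evaluation(tol=1e-10):
    """Sweep the backup over all states until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_value = bellman_backup(v, s)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value  # in-place update
        if delta < tol:
            return v


v_pi = policy_evaluation()
print(v_pi)  # close to {'A': 7.25, 'B': 7.75}
```

Because $\gamma < 1$ the backup is a contraction, so the sweeps converge to the unique fixed point, which is $v_\pi$.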

Optimal policy

$$
\begin{aligned}
v_*(s) &= \max_\pi v_\pi(s) \\
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s,a) = \max_a \mathbb{E}_{\pi_*}\left[ G_t \mid S_t = s, A_t = a \right] \\
&= \max_a \mathbb{E}_{\pi_*}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \\
&= \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \right] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right] \quad \text{(the Bellman optimality equation for $v_*$)} \\
q_*(s,a) &= \max_\pi q_\pi(s,a) \\
q_*(s,a) &= \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right] \\
&= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right] \quad \text{(the Bellman optimality equation for $q_*$)}
\end{aligned}
$$
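Using the Bellman optimality equation for $v_*$ as an update rule gives value iteration; $q_*$ and a greedy policy then follow from one more backup. The sketch below assumes a hypothetical two-state continuing MDP with $\gamma = 0.9$, in which always choosing `"right"` is optimal, so $v_*(B) = 2 / (1 - 0.9) = 20$ and $v_*(A) = 1 + 0.9 \cdot 20 = 19$. The MDP, the tolerance, and all names are illustrative assumptions.

```python
GAMMA = 0.9  # assumed discount factor

STATES = ["A", "B"]
ACTIONS = ["left", "right"]

# Hypothetical dynamics: P[(s, a)] lists (probability, next_state, reward).
P = {
    ("A", "left"): [(1.0, "A", 0.0)],
    ("A", "right"): [(1.0, "B", 1.0)],
    ("B", "left"): [(1.0, "A", 0.0)],
    ("B", "right"): [(1.0, "B", 2.0)],
}


def backup(v, s, a):
    """sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])


def value_iteration(tol=1e-10):
    """Apply v(s) <- max_a backup(v, s, a) until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_value = max(backup(v, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < tol:
            return v


v_star = value_iteration()
# q_*(s, a) follows from v_* with a single backup.
q_star = {(s, a): backup(v_star, s, a) for s in STATES for a in ACTIONS}
# A greedy policy with respect to q_* is optimal.
greedy = {s: max(ACTIONS, key=lambda a: q_star[(s, a)]) for s in STATES}
print(v_star)  # close to {'A': 19.0, 'B': 20.0}
print(greedy)  # {'A': 'right', 'B': 'right'}
```

In each state the largest $q_*(s, a)$ equals $v_*(s)$, which is the relation $v_*(s) = \max_a q_*(s, a)$ from the derivation.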

Worked solutions

For worked solutions, see the Chapter 3 answers for Reinforcement Learning (Second Edition).


