I wanted to study PPO and heard that this paper should come first, so I read through it. Since I am not yet used to reading reinforcement learning papers, it took quite a while. The paper is heavy on mathematical concepts, so I organized my notes as carefully as I could, and I am posting them in case they help others as well.
[https://arxiv.org/abs/1502.05477]
TRPO(Trust Region Policy Optimization)
- The core concept: update the policy only within a trust region, a region of policy space we can trust.
- Policy-based algorithms search for the policy directly while maximizing the expected reward.
Preliminaries
Expected return of the original policy
$$\eta(\pi) = \mathbb{E}_{s_{0}, a_{0}, \ldots}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right], \quad where \ s_{0} \sim p_{0}(s_{0}),\ a_{t} \sim \pi(a_{t} | s_{t}),\ s_{t+1} \sim P(s_{t+1}|s_{t}, a_{t})$$
- $s$ = state, $a$ = action, $\pi$ = policy, $P$ = state transition probability, $\mathbb{E}$ = expectation, $r$ = reward, $\gamma$ = discount factor
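For concreteness, here is a minimal sketch of the quantity inside the expectation, assuming a hypothetical list of rewards and a discount factor (both invented); $\eta(\pi)$ is then the average of this value over many trajectories sampled by following $\pi$.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r(s_t) for one sampled trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# eta(pi) is the expectation of this quantity over trajectories generated by pi,
# so a Monte Carlo estimate would average it over many rollouts.
rewards = [1.0, 0.0, 0.5, 1.0]   # hypothetical rewards r(s_0), r(s_1), ...
gamma = 0.99                     # discount factor
print(discounted_return(rewards, gamma))
```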
Expected return of the new policy
$$\eta(\widetilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \ldots \sim \widetilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t}, a_{t})\right]$$
- $A_{\pi}$ = advantage function
Definitions
State-action value function Q
$$Q_{\pi}(s_{t}, a_{t}) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[ \sum_{l=0}^{\infty} \gamma^{l}r(s_{t+l}) \right]$$
Value function V
$$V_{\pi}(s_{t}) = \mathbb{E}_{a_{t}, s_{t+1}, \ldots} \left[ \sum_{l=0}^{\infty} \gamma^{l}r(s_{t+l}) \right]$$
Advantage function A
$$A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s), \quad where \ a_{t} \sim \pi(a_{t}|s_{t}),\ s_{t+1} \sim P(s_{t+1} |s_{t}, a_{t}) \quad for \ t \geq 0$$
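As a sanity check on these definitions, the following sketch (an invented 2-state, 2-action toy MDP of my own, not from the paper) solves the Bellman equations exactly for a fixed policy and derives $Q$, $V$, and $A$ from them.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
gamma = 0.9
r = np.array([1.0, 0.0])                   # r(s): reward depends only on the state
P = np.array([                             # P[s, a, s'] = P(s' | s, a)
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.2, 0.8]],
])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi[s, a] = pi(a | s)

# V_pi solves V = r + gamma * P_pi V, with P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a).
P_pi = np.einsum("sa,sax->sx", pi, P)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r)

# Q_pi(s,a) = r(s) + gamma * sum_s' P(s'|s,a) V(s'),  and  A_pi = Q_pi - V_pi.
Q = r[:, None] + gamma * np.einsum("sax,x->sa", P, V)
A = Q - V[:, None]
print("V:", V, "\nQ:\n", Q, "\nA:\n", A)
```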
- The expectation $\mathbb{E}_{s_{0}, a_{0}, \ldots \sim \widetilde{\pi}}[\ldots]$ indicates that actions are sampled according to $a_{t} \sim \widetilde{\pi}(\cdot | s_{t})$. Here $\rho_{\pi}$ denotes the discounted visitation frequencies.
- Discounted visitation frequencies
$$\rho_{\pi}(s) = P(s_{0} = s)+ \gamma P(s_{1} = s) + \gamma^{2}P(s_{2} = s) + \cdots$$ - Rewriting the above as a single sum over time $t$:
$$\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} P(s_{t} = s)$$ - Determined by $s_{0} \sim p_{0}$ and the actions chosen according to $\pi$ (a closed-form computation for a toy MDP is sketched after this list).
- Using $\rho_{\pi}$, the sum over time in the identity for $\eta(\widetilde{\pi})$ is rewritten as a sum over states, with the expectation over actions expanded as $\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)$.
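Since $\rho_{\pi}$ shows up in everything that follows, here is a small sketch (my own toy example, numbers invented) of how it can be computed in closed form for a tabular MDP: the geometric series above is the solution of a linear system.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
gamma = 0.9
p0 = np.array([1.0, 0.0])                  # initial state distribution p_0
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.2, 0.8]],
])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
P_pi = np.einsum("sa,sax->sx", pi, P)

# rho_pi(s) = sum_t gamma^t P(s_t = s) is a geometric series, so the vector rho
# solves the linear system  rho = p0 + gamma * P_pi^T rho.
rho = np.linalg.solve(np.eye(2) - gamma * P_pi.T, p0)
print("rho_pi:", rho, " sums to 1/(1-gamma) =", 1 / (1 - gamma))
```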
Rearranging the identity above
$$\begin{matrix}
\eta(\widetilde{\pi}) & = & \eta(\pi)+ \mathbb{E}_{s_{0}, a_{0}, \ldots \sim \widetilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t}, a_{t}) \right] \quad (\text{expand the expectation and substitute } \rho_{\widetilde{\pi}}) \\
& = & \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_{t}=s | \widetilde{\pi}) \sum_{a}\widetilde{\pi}(a|s)\gamma^{t}A_{\pi}(s, a) \\
& = & \eta(\pi) + \sum_{s}\sum_{t=0}^{\infty} \gamma^{t} P(s_{t}=s | \widetilde{\pi}) \sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a) \\
& = & \eta(\pi) + \sum_{s}\rho_{\widetilde{\pi}}(s) \sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)
\end{matrix}$$
- Assume that in moving $\pi \rightarrow \widetilde{\pi}$, the expected advantage $\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)$ is nonnegative at every state $s$.
- This guarantees an improvement in performance.
- Terms that cannot contribute are treated as zero in the expression, which keeps the update stable.
- If there is at least one state-action pair with $A_{\pi}(s, a) > 0$, the policy can still be improved; if there is none, the policy is already optimal.
- However, the dependence of $\rho_{\widetilde{\pi}}(s)$ on $\widetilde{\pi}$ is complex, so a local approximation is used for the optimization.
Local approximation
- Use $\rho_{\pi}(s)$ instead of $\rho_{\widetilde{\pi}}(s)$ -> ignores the change in state visitation density caused by the policy change
$$L_{\pi}(\widetilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s,a)$$ - For a parameterized policy $\pi_{\theta}$ (with $\pi_{\theta}(a|s)$ differentiable in the parameter vector $\theta$), $L$ matches $\eta$ to first order:
$$ L_{\pi_{\theta_{0}}}(\pi_{\theta_{0}}) = \eta(\pi_{\theta_{0}}) $$
$$ \nabla_{\theta}L_{\pi_{\theta_{0}}}(\pi_{\theta})\big|_{\theta=\theta_{0}} = \nabla_{\theta}\eta(\pi_{\theta}) \big|_{\theta=\theta_{0}} $$
- A sufficiently small step $\pi_{\theta_{0}} \rightarrow \widetilde{\pi}$ that improves $L_{\pi_{\theta_{old}}}$ therefore also improves $\eta$ (a numerical illustration on a toy MDP follows below).
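To see what this local approximation buys, here is a small numerical sketch on an invented 2-state, 2-action MDP (my own toy example, not from the paper): it evaluates the true return $\eta(\widetilde{\pi})$ and the surrogate $L_{\pi}(\widetilde{\pi})$ for a nearby policy using exact policy evaluation. All numbers are made up.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers invented for illustration).
gamma = 0.9
p0 = np.array([1.0, 0.0])
r = np.array([1.0, 0.0])
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.2, 0.8]],
])

def evaluate(pi):
    """Exact A, rho, eta for a tabular policy pi[s, a]."""
    P_pi = np.einsum("sa,sax->sx", pi, P)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r)
    Q = r[:, None] + gamma * np.einsum("sax,x->sa", P, V)
    rho = np.linalg.solve(np.eye(2) - gamma * P_pi.T, p0)
    return Q - V[:, None], rho, p0 @ V

def L_surrogate(pi_old, pi_new):
    """L_{pi_old}(pi_new): uses rho_{pi_old} instead of rho_{pi_new}."""
    A, rho, eta = evaluate(pi_old)
    return eta + np.sum(rho[:, None] * pi_new * A)

pi_old = np.array([[0.6, 0.4], [0.3, 0.7]])
pi_new = np.array([[0.7, 0.3], [0.2, 0.8]])   # a nearby policy

print("true eta(pi_new)       :", evaluate(pi_new)[2])
print("local approx L(pi_new) :", L_surrogate(pi_old, pi_new))
# The two values are close for nearby policies and identical at pi_new == pi_old,
# matching the zeroth- and first-order agreement stated above.
```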
- However, this gives no guidance on how small the step has to be -> conservative policy iteration
Conservative Policy Iteration
- Provides an explicit lower bound on the improvement of $\eta$.
- $\pi_{old}$ = current policy, $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$ (a sketch of this mixture update follows after this subsection) $$\pi_{new}(a|s) = (1-\alpha)\pi_{old}(a|s) + \alpha \pi'(a|s)$$
- lower bound
$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new})-\frac{2 \epsilon \gamma}{(1-\gamma)^{2}}\alpha^{2}, \quad where\ \epsilon = max_{s}|\mathbb{E}_{a \sim \pi'(a|s)}[A_{\pi}(s, a)]|$$
- This bound applies only to mixture policies generated by the equation above.
- This policy class is unwieldy and restrictive in practice, so a more practical update scheme is desirable.
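A minimal sketch of the conservative policy iteration update, under my own assumptions (invented policy table and advantages, not from the paper): with $\rho$ and $A$ held fixed, the per-state greedy policy on the advantage is the maximizer $\pi'$ of $L_{\pi_{old}}$, so the sketch mixes toward it and also evaluates the penalty term of the lower bound.

```python
import numpy as np

# Hypothetical policy table pi[s, a] and advantages A_{pi_old}(s, a) (invented).
pi_old = np.array([[0.6, 0.4], [0.3, 0.7]])
A_old = np.array([[0.2, -0.1], [-0.3, 0.4]])
gamma, alpha = 0.9, 0.1

# With rho and A held fixed, argmax_{pi'} L_{pi_old}(pi') is the per-state
# greedy policy on the advantage.
pi_prime = np.eye(2)[A_old.argmax(axis=1)]

# Conservative policy iteration: mix a small amount of pi' into pi_old.
pi_new = (1 - alpha) * pi_old + alpha * pi_prime

# Penalty term of the lower bound: (2 * eps * gamma / (1-gamma)^2) * alpha^2,
# with eps = max_s |E_{a ~ pi'(.|s)}[A_{pi_old}(s, a)]|.
eps = np.max(np.abs(np.sum(pi_prime * A_old, axis=1)))
penalty = 2 * eps * gamma / (1 - gamma) ** 2 * alpha ** 2
print("pi_new:\n", pi_new, "\npenalty term:", penalty)
```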
Monotonic Improvement Guarantee for General Stochastic Policies
$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new})-\frac{2 \epsilon \gamma}{(1-\gamma)^{2}}\alpha^{2}, \quad where\ \epsilon = max_{s}|\mathbb{E}_{a \sim \pi'(a|s)}[A_{\pi}(s, a)]|$$
- Start from the lower bound of conservative policy iteration above.
- By replacing $\alpha$ with a distance measure between $\pi$ and $\widetilde{\pi}$ and replacing $\epsilon$ with an appropriate constant, the lower bound is generalized to general stochastic policies.
- The distance between $\pi$ and $\widetilde{\pi}$: for discrete distributions $p, q$, the total variation divergence
$$D_{TV}(p||q) = \frac{1}{2} \sum_{i}|p_{i} - q_{i}|$$
$$D_{TV}^{max}(\pi, \widetilde{\pi}) = max_{s} D_{TV}(\pi(\cdot|s)||\widetilde{\pi}(\cdot|s))$$
Theorem 1
$$\alpha = D_{TV}^{max}(\pi_{old}, \pi_{new})$$
$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon \gamma}{(1-\gamma)^{2}}\alpha^{2}, \quad \epsilon = max_{s, a} |A_{\pi}(s, a)|$$
- $D$ = Distance
- $TV$ = Total variation
- Here, the result of Kakade & Langford is extended using the fact that two distributions whose total variation divergence is $\alpha$ can be coupled so that they agree with probability $1-\alpha$.
Relationship between the TV divergence and the KL divergence
$$D_{TV}(p||q)^{2} \leq D_{KL}(p||q)$$
$$D_{KL}^{max}(\pi, \widetilde{\pi}) = max_{s}D_{KL}(\pi(\cdot|s) || \widetilde{\pi}(\cdot|s))$$ - Combining the above (a numerical check of the two divergences is sketched below):
$$\eta(\widetilde{\pi})\geq L_{\pi}(\widetilde{\pi}) - CD_{KL}^{max}(\pi, \widetilde{\pi}), \quad where \ C = \frac{4\epsilon \gamma}{(1-\gamma)^{2}} \ and \ \alpha^{2} \leq D_{KL}^{max}(\pi, \widetilde{\pi}) \cdots (9)$$
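A quick numerical check of the two divergences used here, with invented action distributions for two states; it also verifies $D_{TV}(p||q)^{2} \leq D_{KL}(p||q)$ state by state.

```python
import numpy as np

def tv(p, q):
    """Total variation divergence between discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions."""
    return np.sum(p * np.log(p / q))

# Hypothetical per-state action distributions pi(.|s) and pi~(.|s), 2 states.
pi_a = np.array([[0.6, 0.4], [0.3, 0.7]])
pi_b = np.array([[0.7, 0.3], [0.2, 0.8]])

tv_per_state = [tv(pi_a[s], pi_b[s]) for s in range(2)]
kl_per_state = [kl(pi_a[s], pi_b[s]) for s in range(2)]
print("D_TV^max:", max(tv_per_state), " D_KL^max:", max(kl_per_state))
# Check D_TV(p||q)^2 <= D_KL(p||q) state by state.
print([t ** 2 <= k for t, k in zip(tv_per_state, kl_per_state)])
```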
Algorithm 1
for i = 0, 1, 2, ... until convergence do
    compute all advantage values $A_{\pi_{i}}(s, a)$
    solve the constrained optimization problem
    $\pi_{i+1} = argmax_{\pi}[L_{\pi_{i}}(\pi) - CD_{KL}^{max}(\pi_{i}, \pi)]$
    where $C = 4\epsilon \gamma/(1-\gamma)^{2}$
    and $L_{\pi_{i}}(\pi) = \eta(\pi_{i}) + \sum_{s}\rho_{\pi_{i}}(s)\sum_{a}\pi(a|s)A_{\pi_{i}}(s,a)$
end for
- A policy iteration algorithm based on equation (9).
- Assumes the advantages $A_{\pi}$ are computed exactly.
$$\begin{matrix}
M_{i}(\pi) &=& L_{\pi_{i}}(\pi) - CD_{KL}^{max}(\pi_{i}, \pi) \\
\eta(\pi_{i+1}) &\geq& M_{i}(\pi_{i+1}) \\
\eta(\pi_{i}) &=& M_{i}(\pi_{i}) \\
\eta(\pi_{i+1}) - \eta(\pi_{i}) &\geq& M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})
\end{matrix}$$
- The algorithm therefore generates a monotonically improving sequence of policies.
- Maximizing $M_{i}$ -> the true objective $\eta$ does not decrease.
- This is a type of minorization-maximization (MM) algorithm (a toy numerical illustration follows below).
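Below is a minimal tabular sketch of the idea behind Algorithm 1, under my own simplifications (not the paper's implementation): advantages are computed exactly on an invented 2-state MDP, and the inner $\arg\max$ of the penalized surrogate $M_{i}$ is replaced by a crude grid search over mixture steps toward the greedy policy. It illustrates both the monotonic improvement and why the penalty coefficient $C$ makes the steps tiny.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers invented for illustration).
gamma = 0.9
p0 = np.array([1.0, 0.0])
r = np.array([1.0, 0.0])
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.2, 0.8]],
])

def evaluate(pi):
    """Exact advantages A, visitation rho, and return eta for pi[s, a]."""
    P_pi = np.einsum("sa,sax->sx", pi, P)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r)
    Q = r[:, None] + gamma * np.einsum("sax,x->sa", P, V)
    rho = np.linalg.solve(np.eye(2) - gamma * P_pi.T, p0)
    return Q - V[:, None], rho, p0 @ V

def kl_max(p, q):
    """max_s D_KL(p(.|s) || q(.|s)) for tabular policies (smoothed)."""
    p, q = p + 1e-12, q + 1e-12
    return max(np.sum(p[s] * np.log(p[s] / q[s])) for s in range(len(p)))

pi = np.array([[0.6, 0.4], [0.3, 0.7]])
for i in range(5):
    A, rho, eta = evaluate(pi)
    eps = np.max(np.abs(A))
    C = 4 * eps * gamma / (1 - gamma) ** 2
    greedy = np.eye(2)[A.argmax(axis=1)]        # per-state argmax of the advantage
    best_pi, best_M = pi, -np.inf
    # Crude stand-in for "solve the constrained optimization problem":
    # grid-search mixture steps toward the greedy policy, keep the best M_i.
    for alpha in np.linspace(0.0, 1.0, 101):
        cand = (1 - alpha) * pi + alpha * greedy
        L = eta + np.sum(rho[:, None] * cand * A)   # surrogate L_{pi_i}(cand)
        M = L - C * kl_max(pi, cand)
        if M > best_M:
            best_pi, best_M = cand, M
    pi = best_pi
    print(f"iter {i}: eta = {evaluate(pi)[2]:.6f}")
# eta should be non-decreasing across iterations (the MM guarantee), but with C
# this large the steps are tiny -- exactly the issue TRPO's constraint addresses.
```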
Trust Region Policy Optimization
- To allow larger updates, a constraint on the KL divergence is used instead of a penalty.
- Since we consider parameterized policies $\pi_{\theta}(a|s)$ with parameter vector $\theta$, $\theta$ is used in place of $\pi$:
$$\begin{matrix}
\eta(\theta) := \eta(\pi_{\theta}) \\
L_{\theta}(\widetilde{\theta}) := L_{\pi_{\theta}}(\pi_{\widetilde{\theta}}) \\
D_{KL}(\theta||\widetilde{\theta}) := D_{KL}(\pi_{\theta}||\pi_{\widetilde{\theta}})
\end{matrix}$$
$$\theta_{old} = previous \ policy \ parameters$$ - By the bound proved above,
$$\eta(\theta) \geq L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old}, \theta)$$
- Therefore, to increase $\eta(\theta)$, maximize the right-hand side:
$$maximize_{\theta}[L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old}, \theta)]$$
- A penalty method with $C = \frac{4\epsilon \gamma}{(1-\gamma)^{2}}$ has the problem that the resulting step sizes are very small.
- A new way to allow larger step sizes: impose a constraint on the KL divergence between the old and the new policy (a trust region)!
$$\begin{matrix}
maximize_{\theta} \ L_{\theta_{old}}(\theta) & (11) \\
subject \ to \ D_{KL}^{max}(\theta_{old}, \theta) \leq \delta &
\end{matrix}$$
- However, this constrains the KL divergence at every point of the state space, which is impractical: it imposes an enormous number of constraints.
- Instead, a heuristic approximation, the average KL divergence, is used:
$$\overline{D_{KL}^{\rho}}(\theta_{1}, \theta_{2}) := \mathbb{E}_{s \sim \rho}\left[D_{KL}(\pi_{\theta_{1}}(\cdot|s) \,||\, \pi_{\theta_{2}}(\cdot|s))\right]$$ - Putting this together:
$$maximize_{\theta} \ L_{\theta_{old}}(\theta) \quad subject \ to \ \overline{D_{KL}^{\rho_{\theta_{old}}}}(\theta_{old}, \theta) \leq \delta$$ - TRPO optimizes the policy parameters against an estimate of the expected return, subject to a constraint on how much the policy may change (a sketch of the average-KL constraint check follows below).
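A small sketch of the average-KL constraint check, assuming hypothetical tabular softmax policies and a batch of states sampled under the old policy (all numbers invented).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical tabular-softmax policies: theta[s, a] are logits (invented numbers).
theta_old = np.array([[0.5, 0.0], [0.0, 0.3]])
theta_new = np.array([[0.7, 0.0], [0.0, 0.1]])
delta = 0.01                                    # trust-region size

# States assumed to be sampled from the visitation distribution of the old policy.
sampled_states = np.array([0, 0, 1, 0, 1, 1, 0, 1])

p_old = softmax(theta_old)[sampled_states]      # pi_{theta_old}(.|s_n)
p_new = softmax(theta_new)[sampled_states]      # pi_{theta}(.|s_n)

# Average-KL heuristic: mean over sampled states of D_KL(pi_old(.|s) || pi_new(.|s)).
avg_kl = np.mean(np.sum(p_old * np.log(p_old / p_new), axis=1))
print("average KL:", avg_kl, " constraint satisfied:", bool(avg_kl <= delta))
```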
Sample-Based Estimation of the Objective and Constraint
- The objective and constraint functions are approximated using Monte Carlo simulation.
$$maximize_{\theta} \ L_{\theta_{old}}(\theta) \quad subject \ to \ \overline{D_{KL}^{\rho_{\theta_{old}}}}(\theta_{old}, \theta) \leq \delta$$
$$maximize_{\theta} \ \sum_{s} \rho_{\theta_{old}}(s)\sum_{a}\pi_{\theta}(a|s)A_{\theta_{old}}(s, a) \quad subject \ to \ \overline{D_{KL}^{\rho_{\theta_{old}}}}(\theta_{old}, \theta) \leq \delta$$ - The following substitutions are then made:
- Replace $\sum_{s}\rho_{\theta_{old}}(s)[\dots]$ with the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\dots]$.
- Replace $A_{\theta_{old}}$ with the Q-value function $Q_{\theta_{old}}$ (this only changes the objective by a constant).
- Replace the sum over actions with an importance sampling estimator,
- using $q$ to denote the sampling distribution.
- The contribution of a single state $s_{n}$ to the objective is rewritten with importance sampling, and the constraint becomes an expectation over visited states:
$$\sum_{a}\pi_{\theta}(a|s_{n})Q_{\theta_{old}}(s_{n}, a) = \mathbb{E}_{a \sim q}\left[\frac{\pi_{\theta}(a|s_{n})}{q(a|s_{n})}Q_{\theta_{old}}(s_{n}, a)\right], \quad subject \ to \ \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_{\theta}(\cdot|s))\right] \leq \delta$$
- Now replace the expectation $\mathbb{E}$ with a sample average and the Q-values with an empirical estimate (a sketch of the resulting estimator follows below).
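A minimal sketch of the resulting sample-based surrogate, assuming a hypothetical batch of (state, action, empirical Q, sampling probability) tuples and a tabular softmax policy; everything here is invented for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical batch (all values invented): states s_n, actions a_n drawn from the
# sampling distribution q, empirical returns Q_hat_n, and probabilities q(a_n|s_n).
states  = np.array([0, 1, 0, 1, 1, 0])
actions = np.array([1, 0, 0, 1, 1, 1])
Q_hat   = np.array([1.2, 0.3, 0.8, 0.5, 0.9, 1.1])
q_probs = np.array([0.4, 0.3, 0.6, 0.7, 0.7, 0.4])

theta = np.array([[0.6, 0.2], [0.1, 0.5]])   # current policy logits (tabular softmax)

# Sample average of the importance-weighted objective:
#   (1/N) * sum_n  pi_theta(a_n|s_n) / q(a_n|s_n) * Q_hat_n
pi_probs = softmax(theta)[states, actions]
surrogate = np.mean(pi_probs / q_probs * Q_hat)
print("sample-based surrogate:", surrogate)
```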
- Two sampling schemes are used:
- Single path: the scheme typically used in policy gradient estimation; based on sampling individual trajectories (sketched at the end of this section).
- Run policy simulation to generate a set of trajectories.
- Every state-action pair $(s_{n}, a_{n})$ is included in the objective.
- Vine: construct a rollout set and perform several actions from each state in the set; mostly used with policy iteration methods.
- Generate a set of 'trunk' trajectories, then from a subset of the states reached, generate 'branch' rollouts.
- At each of these states $s_{n}$, execute several actions ($a_{1}, a_{2}, \ldots$) and compute a rollout after each action.
- Common random numbers (CRN) are used here to reduce variance.
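For completeness, a sketch of single-path data collection on an invented toy MDP (my own example, not from the paper): whole trajectories are rolled out under the old policy and the discounted return from each timestep serves as the empirical Q estimate; the final comment notes how the vine scheme would differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (numbers invented for illustration).
gamma, horizon = 0.9, 30
r = np.array([1.0, 0.0])
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.2, 0.8]],
])
pi_old = np.array([[0.6, 0.4], [0.3, 0.7]])

def single_path(n_episodes):
    """Single path: roll out whole trajectories under pi_old and use the
    discounted return from each timestep as the empirical Q estimate."""
    batch = []
    for _ in range(n_episodes):
        s, traj = 0, []
        for _ in range(horizon):
            a = rng.choice(2, p=pi_old[s])
            traj.append((s, a, r[s]))
            s = rng.choice(2, p=P[s, a])
        for t, (s_t, a_t, _) in enumerate(traj):
            q_hat = sum(gamma ** k * rew for k, (_, _, rew) in enumerate(traj[t:]))
            batch.append((s_t, a_t, q_hat))
    return batch

data = single_path(n_episodes=10)
print(len(data), "state-action samples, first few:", data[:3])
# The vine scheme would instead pick a subset of visited states and branch several
# actions (each followed by a short rollout) from each, reusing common random numbers.
```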