[Reinforcement Learning] TRPO (Trust Region Policy Optimization) Paper Summary


I wanted to study PPO, but I heard this paper should be read first, so I gave it a read. I'm not used to reading RL papers yet, so it took quite a while. My grasp of the math is limited, so I tried to organize it carefully enough that I could follow every step, and I'm posting it in case it helps others as well.

[https://arxiv.org/abs/1502.05477]

TRPO(Trust Region Policy Optimization)

  • The key idea: update the policy only within a trust region (a region we can "trust").
  • Policy-based algorithms search for the policy directly while maximizing the expected reward.

 

Preliminaries

Expected return of the original policy

$$\eta(\pi) = \mathbb{E}_{s_{0}, a_{0}, \ldots}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\right], \quad where \ s_{0} \sim p_{0}(s_{0}),\ a_{t} \sim \pi(a_{t} | s_{t}),\ s_{t+1} \sim P(s_{t+1}|s_{t}, a_{t})$$

$s$ = state, $a$ = action, $\pi$ = policy, $P$ = state transition probability, $\mathbb{E}$ = expectation, $r$ = reward (here a function of the state at each time step), $\gamma$ = discount factor, $p_{0}$ = initial state distribution.
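For intuition, here is a minimal Monte Carlo sketch (my own illustration, not from the paper) of estimating $\eta(\pi)$ by averaging discounted returns; `env` and `policy` are hypothetical stand-ins for an environment with `reset()`/`step()` and a function that samples an action from $\pi(\cdot|s)$.

```python
import numpy as np

def estimate_eta(env, policy, gamma=0.99, n_episodes=100, max_steps=1000):
    """Average discounted return over sampled episodes: a sample estimate of eta(pi)."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                      # s_0 ~ p_0
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = policy(s)                    # a_t ~ pi(. | s_t)
            s, r, done = env.step(a)         # s_{t+1} ~ P(. | s_t, a_t), hypothetical API
            g += discount * r                # accumulate gamma^t * r(s_t)
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)
```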

Expected return of the new policy

$\eta(\widetilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \ldots \sim \widetilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t}, a_{t})\right]$

  • $A_{\pi} = Advantage function$

์œ ๋„

State-action value function Q

$Q_{\pi}(s_{t}, a_{t}) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[ \sum_{l=0}^{\infty} \gamma^{l}r(s_{t+l}) \right]$

Value function V

$V_{\pi}(s_{t}) = \mathbb{E}_{a_{t}, s_{t+1}, \ldots} \left[ \sum_{l=0}^{\infty} \gamma^{l}r(s_{t+l}) \right]$

Advantage function A

$A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s), \quad where \ a_{t} \sim \pi(a_{t}|s_{t}),\ s_{t+1} \sim P(s_{t+1} |s_{t}, a_{t}) \quad for \ t \geq 0$
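In the tabular case these three quantities relate directly; a small sketch with illustrative names (not from the paper):

```python
import numpy as np

def advantage(Q, pi):
    """A_pi(s,a) = Q_pi(s,a) - V_pi(s) for tabular Q (n_states, n_actions) and policy pi."""
    V = (pi * Q).sum(axis=1, keepdims=True)   # V_pi(s) = E_{a~pi}[Q_pi(s,a)]
    return Q - V

# Sanity check: the advantage averages to zero under the policy itself, i.e.
# (pi * advantage(Q, pi)).sum(axis=1) is numerically zero for every state.
```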

  • $\mathbb{E}_{s_{0}, a_{0}, \dots \sim \widetilde{\pi}} [\dots]$ indicates that actions are sampled as $a_{t} \sim \widetilde{\pi}(\cdot | s_{t})$. Here we define $\rho_{\pi}$ as the discounted visitation frequencies (a tabular sketch follows after this list).
    • Discounted visitation frequencies
      $ \rho_{\pi}(s) = P(s_{0} = s)+ \gamma P(s_{1} = s) + \gamma^{2}P(s_{2} = s) + \cdots $
    • Written as a sum over time for the new policy $\widetilde{\pi}$:
      $ \rho_{\widetilde{\pi}}(s) = \sum_{t=0}^{\infty} \gamma^{t} P(s_{t} = s \mid \widetilde{\pi}) $
    • $s_{0} \sim p_{0}$, and the later states depend on the policy that generates the actions.
    • The expectation over actions is rewritten as $\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)$, so the sum over time becomes a sum over states and actions.
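A minimal tabular sketch of the discounted visitation frequencies, assuming a known transition tensor `P`, a stochastic policy `pi`, and an initial distribution `rho0` (all hypothetical names, my own illustration):

```python
import numpy as np

def discounted_visitation(P, pi, rho0, gamma=0.99, horizon=1000):
    """rho_pi(s) = sum_t gamma^t P(s_t = s), truncated after `horizon` steps.

    P[s, a, s'] = P(s'|s, a), pi[s, a] = pi(a|s), rho0 = initial state distribution.
    """
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    rho = np.zeros_like(rho0)
    d_t = rho0.copy()                       # d_t[s] = P(s_t = s)
    for t in range(horizon):
        rho += (gamma ** t) * d_t
        d_t = d_t @ P_pi                    # propagate one step forward
    return rho                              # unnormalized; sums to roughly 1/(1-gamma)
```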
Rearranging

$$\begin{matrix}
\eta(\widetilde{\pi}) & = & \eta(\pi)+ \mathbb{E}_{s_{0}, a_{0}, \ldots \sim \widetilde{\pi}}\left[ \sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t}, a_{t}) \right] \quad (\text{rewrite the expectation as sums, then in terms of } \rho) \\
& = & \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_{t}=s \mid \widetilde{\pi}) \sum_{a}\widetilde{\pi}(a|s)\gamma^{t}A_{\pi}(s, a) \\
& = & \eta(\pi) + \sum_{s}\sum_{t=0}^{\infty} \gamma^{t} P(s_{t}=s \mid \widetilde{\pi}) \sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a) \\
& = & \eta(\pi) + \sum_{s}\rho_{\widetilde{\pi}}(s) \sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)
\end{matrix}$$

  • In moving from $\pi \rightarrow \widetilde{\pi}$, suppose the expected advantage $\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)$ is nonnegative at every state $s$.
    • Then the update is guaranteed to improve (or at least not worsen) performance.
    • States where it is zero contribute nothing, which keeps the update stable.
  • If there is some state-action pair with positive advantage, the policy can still improve; if there is none, the policy is already optimal.
  • Because $\rho_{\widetilde{\pi}}(s)$ depends on $\widetilde{\pi}$ in a complicated way, this is hard to optimize directly, so a local approximation is used instead (a small numeric check of the identity above is sketched below).
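As a sanity check, the identity $\eta(\widetilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\widetilde{\pi}}(s)\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s,a)$ can be verified numerically on a small random MDP with state-dependent rewards. A self-contained sketch (my own illustration, not paper code):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)   # P[s, a, s']
R = rng.random(nS)                                            # state reward r(s)
rho0 = np.ones(nS) / nS                                       # initial distribution

def random_policy():
    p = rng.random((nS, nA))
    return p / p.sum(-1, keepdims=True)

def eval_policy(pi):
    """Exact V, Q, and discounted visitation for a tabular policy."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R)          # V = R + gamma * P_pi V
    Q = R[:, None] + gamma * np.einsum('sat,t->sa', P, V)      # Q_pi(s, a)
    rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)   # rho_pi(s)
    return V, Q, rho

pi, pi_new = random_policy(), random_policy()
V, Q, _ = eval_policy(pi)
A = Q - V[:, None]                                            # A_pi(s, a)
V_new, _, rho_new = eval_policy(pi_new)

lhs = rho0 @ V_new                                            # eta(pi_new)
rhs = rho0 @ V + (rho_new * (pi_new * A).sum(-1)).sum()       # right-hand side of the identity
print(np.isclose(lhs, rhs))                                   # True
```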

Local approximation

  • Use $\rho_{\pi}(s)$ instead of $\rho_{\widetilde{\pi}}(s)$ -> ignore the change in state visitation density caused by the policy change
    $$L_{\pi}(\widetilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s,a)$$
  • ์—ฌ๊ธฐ์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐํ™”๊ฐ€ ๊ฐ€๋Šฅํ•œ ์ •์ฑ… $\pi_{\theta}$ ๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•  ๋•Œ($\pi_{\theta}(a|s)$๊ฐ€ $\theta$ ๋ฒกํ„ฐ์— ๋Œ€ํ•ด ๋ฏธ๋ถ„ ๊ฐ€๋Šฅ),
    $$ L_{\pi_{\theta_{0}}}(\pi_{\theta_{0}}) = \eta(\pi_{\theta_{0}}) $$
    $$ \triangledown_{\theta}L_{\pi_{\theta_{0}}}(\pi_{\theta})|{\theta=\theta{0}} = \triangledown_{\theta}\eta(\pi_{\theta}) |{\theta=\theta{0}} $$
    • $\pi_{\theta_{0}} \rightarrow \widetilde{\pi}$ ๋ผ๋Š” trialํ•œ ๋‹จ๊ณ„๊ฐ€ $L_{\pi_{\theta}} old$๋ฅผ ๊ฐœ์„  -> $\eta$ ๋„ ๊ฐœ์„ 
      • ๊ทธ๋Ÿฌ๋‚˜ '์–ผ๋งˆ๋‚˜ ์ž‘๊ฒŒ' ๋‹จ๊ณ„๋ฅผ ์„ค์ •ํ•ด์•ผํ•˜๋Š”์ง€ ํฌ๊ธฐ์— ๋Œ€ํ•ด์„œ๋Š” ๋ชจ๋ฆ„ -> Conservative policy iterationConservative Policy Iteration
  • Gives an explicit lower bound on the improvement of $\eta$.
  • $\pi_{old}$ = current policy, $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$ (a tabular sketch of this update follows after this list) $$\pi_{new}(a|s) = (1-\alpha)\pi_{old}(a|s) + \alpha \pi'(a|s)$$
  • lower bound

$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new})-\frac{2 \epsilon \gamma}{(1-\gamma)^{2}}\alpha^{2}, \quad where\ \epsilon = \max_{s}|\mathbb{E}_{a \sim \pi'(a|s)}[A_{\pi}(s, a)]|$$

  • This bound applies only to mixture policies of the form above.
  • Since this policy class is awkward and restrictive in practice, a more practical update scheme is desirable.
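A minimal tabular sketch of the mixture update above (illustrative code, not from the paper), assuming the advantages $A_{\pi_{old}}$ are already available; with the state distribution held fixed, the maximizer $\pi'$ of the local approximation $L_{\pi_{old}}$ is the advantage-greedy policy:

```python
import numpy as np

def conservative_update(pi_old, A, alpha=0.1):
    """pi_new = (1 - alpha) * pi_old + alpha * pi', where pi' is advantage-greedy."""
    pi_prime = np.zeros_like(pi_old)
    pi_prime[np.arange(A.shape[0]), A.argmax(axis=1)] = 1.0   # pi'(a|s): all mass on argmax_a A
    return (1 - alpha) * pi_old + alpha * pi_prime
```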

Monotonic Improvement Guarantee for General Stochastic Policies

$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new})-\frac{2 \epsilon \gamma}{(1-\gamma)^{2}}\alpha^{2}, \quad where\ \epsilon = \max_{s}|\mathbb{E}_{a \sim \pi'(a|s)}[A_{\pi}(s, a)]|$$

  • ๊ธฐ์กด์˜ lower bound์— conservative policy iteration ์ ์šฉ
  • $\alpha$ ๋ฅผ $\pi$ ์™€ $\widetilde{\pi}$ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋กœ ๋ณ€๊ฒฝํ•˜๊ณ  ์ ์ ˆํžˆ ์ƒ์ˆ˜๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋“ฑ ํ•ด์„œ ์œ„ lower bound๋ฅผ ์ผ๋ฐ˜์ ์ธ ํ™•๋ฅ ๋ก ์  ์ •์ฑ…์œผ๋กœ ์ผ๋ฐ˜ํ™”
    • $\alpha$ ๋ฅผ $\pi$ ์™€ $\widetilde{\pi}$ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ = ์ด์‚ฐํ™•๋ฅ  ๋ถ„ํฌ p, q์— ๋Œ€ํ•ด ์ „์ฒด ๋ณ€๋™ ๋ฐœ์‚ฐ (total variation divergence)
      $$D_{TV}(p||q) = \frac{1}{2} \sum_{i}|p_{i} - q_{i}|$$
      $$D_{TV}^{max}(\pi, \widetilde{\pi}) = max_{s} D_{TV}(\pi(\cdot|s)||\widetilde{\pi}(\cdot|s))$$Theorem 1$$\alpha = D_{TV}^{max}(\pi_{old}, \pi_{new})$$
      $$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon r}{(1-\gamma)^{2}}, \quad \epsilon = max_{s, a} |A_{\pi}(s, a)|$$
  • $D$ = Distance
  • $TV$ = Total variation
  • ์—ฌ๊ธฐ์—์„œ TV Divergence๊ฐ€ $\alpha$ ๋ณด๋‹ค ์ž‘์€ ๋‘ ๋ถ„ํฌ์˜ ํ™•๋ฅ  ๋ณ€์ˆ˜๋ฅผ ๊ฒฐํ•ฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ์‚ฌ์šฉํ•ด Kakade & Langford์˜ ๊ฒฐ๊ณผ ํ™•์žฅ -> ์ด๋Š” $1-\alpha$์™€ ๊ฐ™์ŒTV Divergence์™€ KL Divergence์˜ ๊ด€๊ณ„$$D_{TV}(p||q)^{2} \leq D_{KL}(p||q)$$
    $$D_{KL}^{max}(\pi, \widetilde{\pi}) = max_{s}D_{KL}(\pi(\cdot|s) || \widetilde{\pi}(\cdot|s))$$
  • ์œ„ ์‹์„ ์ •๋ฆฌํ•ด๋ณด๋ฉด
    $$\eta(\widetilde{\pi})\geq L_{\pi}(\widetilde{\pi}) - CD_{KL}^{max}(\pi, \widetilde{\pi}), \quad where \ C = \frac{4\epsilon r}{(1-\gamma)^{2}} \ and \ \alpha^{2} = D_{KL}^{max}(\pi, \widetilde{\pi}) \cdots (9)$$
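A quick numeric illustration (my own, not from the paper) of the $D_{TV}^{2} \leq D_{KL}$ relation for two arbitrary categorical distributions, with KL in nats and the $\frac{1}{2}\sum|p_i - q_i|$ definition of TV given above:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

d_tv = 0.5 * np.abs(p - q).sum()           # total variation divergence
d_kl = np.sum(p * np.log(p / q))           # KL divergence (nats)
print(d_tv ** 2, d_kl, d_tv ** 2 <= d_kl)  # the inequality holds
```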

Algorithm 1

for i = 0, 1, 2, ... until convergence do
    Compute all advantage values $A_{\pi_{i}}(s, a)$
    Solve the constrained optimization problem
        $\pi_{i+1} = \arg\max_{\pi}\left[L_{\pi_{i}}(\pi) - CD_{KL}^{max}(\pi_{i}, \pi)\right]$
        where $C = 4\epsilon\gamma/(1-\gamma)^{2}$
        and $L_{\pi_{i}}(\pi) = \eta(\pi_{i}) + \sum_{s}\rho_{\pi_{i}}(s)\sum_{a}\pi(a|s)A_{\pi_{i}}(s,a)$
end for

  • A policy iteration algorithm based on Equation (9).
  • It assumes the advantages $A_{\pi}$ are computed exactly.

$$\begin{matrix}
M_{i}(\pi) &=& L_{\pi_{i}}(\pi) - CD_{KL}^{max}(\pi_{i}, \pi) \\
\eta(\pi_{i+1}) &\geq& M_{i}(\pi_{i+1}) \\
\eta(\pi_{i}) &=& M_{i}(\pi_{i}) \\
\eta(\pi_{i+1}) - \eta(\pi_{i}) &\geq& M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})
\end{matrix}$$

  • ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์•ˆ์ •์ ์œผ๋กœ ๊ฐœ์„ ๋˜๋Š” policy sequence๋ฅผ ์ƒ์„ฑ
  • $M_{i}$๋ฅผ ์ตœ๋Œ€ํ™” -> true objective $\eta$๋Š” ๊ฐ์†Œํ•˜์ง€ ์•Š์Œ
  • Minorization-maximaization(MM) ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ผ์ข…
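To make one iteration of Algorithm 1 concrete, here is a crude tabular sketch (my own, not the paper's implementation). Instead of solving the inner $\arg\max$ exactly, it searches over mixtures between $\pi_{i}$ and the advantage-greedy policy and keeps the one with the largest penalized surrogate $M_{i}$; the constant $\eta(\pi_{i})$ term of $L_{\pi_{i}}$ is dropped since it does not affect the $\arg\max$.

```python
import numpy as np

def surrogate_step(pi_i, A, rho, C, alphas=np.linspace(0.0, 1.0, 101)):
    """One iteration: pick the mixture policy maximizing M_i = L - C * max-KL."""
    greedy = np.zeros_like(pi_i)
    greedy[np.arange(A.shape[0]), A.argmax(axis=1)] = 1.0
    best_pi, best_M = pi_i, -np.inf
    for a in alphas:
        pi = (1.0 - a) * pi_i + a * greedy
        L = (rho * (pi * A).sum(axis=1)).sum()                        # L_{pi_i}(pi) minus the constant term
        kl = (pi_i * np.log((pi_i + 1e-12) / (pi + 1e-12))).sum(axis=1)  # D_KL(pi_i(.|s) || pi(.|s))
        M = L - C * kl.max()                                          # penalized surrogate M_i
        if M > best_M:
            best_pi, best_M = pi, M
    return best_pi
```

The quantities `A` and `rho` could come from an exact tabular evaluation such as the `eval_policy` sketch earlier; the point is only that each step maximizes the minorizing surrogate rather than $\eta$ itself.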

Trust Region Policy Optimization

  • To allow larger updates, impose a constraint on the KL divergence instead of using it as a penalty.
  • Since we consider parameterized policies $\pi_{\theta}(a|s)$ with parameter vector $\theta$, write $\theta$ in place of $\pi$:
    $$\begin{matrix}
    \eta(\theta) := \eta(\pi_{\theta}) \\
    L_{\theta}(\widetilde{\theta}) := L_{\pi_{\theta}}(\pi_{\widetilde{\theta}}) \\
    D_{KL}(\theta \| \widetilde{\theta}) := D_{KL}(\pi_{\theta} \| \pi_{\widetilde{\theta}})
    \end{matrix}$$
    $$\theta_{old} = \text{the previous policy parameters that we want to improve}$$
  • From the bound proved earlier,
    $$\eta(\theta) \geq L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old}, \theta)$$
    • ๋”ฐ๋ผ์„œ $\eta(\theta)$ ์ฆ๊ฐ€๋ฅผ ์œ„ํ•ด์„œ ์šฐํ•ญ์„ ์ตœ๋Œ€ํ™”ํ•ด์•ผํ•จ
      $$maximize_{\theta}[L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old}, \theta)]$$
  • The penalty approach with coefficient $C = \frac{4\epsilon \gamma}{(1-\gamma)^{2}}$ produces step sizes that are much too small in practice.
    • A new way to take larger steps = put a constraint on the KL divergence between the old policy and the new policy!
      $$\begin{matrix}
      maximize_{\theta} \ L_{\theta_{old}}(\theta) & (11) \\
      subject \ to \ D_{KL}^{max}(\theta_{old}, \theta) \leq \delta
      \end{matrix}$$
  • However, this constrains the KL divergence at every point in the state space = realistically, far too many constraints to enforce.
  • Instead, use a heuristic approximation: the average KL divergence
    $$\overline{D}_{KL}^{\rho}(\theta_{1}, \theta_{2}) := \mathbb{E}_{s \sim \rho}\left[D_{KL}(\pi_{\theta_{1}}(\cdot|s) \| \pi_{\theta_{2}}(\cdot|s))\right]$$
  • Putting it together:
    $$maximize_{\theta} \ L_{\theta_{old}}(\theta) \quad subject \ to \ \overline{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$$
  • Solving this optimization problem over the policy parameters maximizes an estimate of the expected return, subject to a constraint on how much the policy is allowed to change (the two quantities involved are sketched below).
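A small sketch of the two quantities in this problem, the surrogate objective and the average KL, estimated on states sampled from $\rho_{\theta_{old}}$. Tabular probability arrays stand in for $\pi_{\theta}(\cdot|s)$, and all names are illustrative, not from the paper's code:

```python
import numpy as np

def surrogate_and_kl(pi_old, pi_new, A, states):
    """Sample estimates of the surrogate L (up to constants) and the average KL.

    pi_old, pi_new: tabular policies (n_states, n_actions); A: advantages under pi_old;
    states: indices of states sampled from rho_{theta_old}.
    """
    L = (pi_new[states] * A[states]).sum(axis=1).mean()
    kl = (pi_old[states] * np.log(pi_old[states] / pi_new[states])).sum(axis=1)
    return L, kl.mean()

# A candidate update is acceptable only if the average KL stays inside the trust region:
# L, kl_bar = surrogate_and_kl(pi_old, pi_new, A, states)
# ok = kl_bar <= delta      # delta: trust-region size, e.g. 0.01
```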

Sample-Based Estimation of the Objective and Constraint

  • Approximate the objective and the constraint with Monte Carlo simulation:
    $$maximize_{\theta} \ L_{\theta_{old}}(\theta) \quad subject \ to \ \overline{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$$
    $$maximize_{\theta} \ \sum_{s} \rho_{\theta_{old}}(s)\sum_{a}\pi_{\theta}(a|s)A_{\theta_{old}}(s, a) \quad subject \ to \ \overline{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$$
  • Step by step:
    1. Replace $\sum_{s}\rho_{\theta_{old}}(s)[\dots]$ with the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\dots]$
    2. Replace $A_{\theta_{old}}$ with the Q-value $Q_{\theta_{old}}$ (this only changes the objective by a constant)
    3. Replace the sum over actions with an importance sampling estimator
      • Use $q$ as the sampling distribution
      • The contribution of a single state $s_{n}$ to the loss function is
        $$\sum_{a}\pi_{\theta}(a|s_{n})A_{\theta_{old}}(s_{n}, a) = \mathbb{E}_{a \sim q}\left[\frac{\pi_{\theta}(a|s_{n})}{q(a|s_{n})}A_{\theta_{old}}(s_{n}, a)\right]$$
      • so the full sample-based problem becomes
        $$maximize_{\theta} \ \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}Q_{\theta_{old}}(s, a)\right] \quad subject \ to \ \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}(\pi_{\theta_{old}}(\cdot|s) \| \pi_{\theta}(\cdot|s))\right] \leq \delta$$
  • Now replace the expectations $\mathbb{E}$ with sample averages and the Q-values with empirical estimates.
  • There are two sampling schemes for doing this:

    1. Single path: the scheme typically used for policy gradient estimation, based on sampling individual trajectories (a minimal surrogate sketch follows after this list).
      • Run the policy to generate a set of trajectories.
      • Include every state-action pair $(s_{n}, a_{n})$ in the objective.
    2. Vine: construct a rollout set and perform several actions from each state in the set; natural in policy iteration methods.
      • Generate a set of "trunk" trajectories, then from a subset of the visited states generate "branch" rollouts.
      • At each of these states $s_{n}$, execute several actions ($a_{1}, a_{2}, \ldots$) and run a rollout after each action.
      • Use common random numbers (CRN) across the rollouts to reduce variance.
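Under the single path scheme, $q$ is simply the old policy, so the importance weight is $\pi_{\theta}(a_{n}|s_{n}) / \pi_{\theta_{old}}(a_{n}|s_{n})$ and both the surrogate and the KL constraint become sample averages over visited pairs. A minimal sketch with illustrative names (not the paper's code):

```python
import numpy as np

def single_path_surrogate(logp_new, logp_old, Q_hat):
    """Importance-sampled surrogate: mean of [pi_theta / pi_theta_old] * Q over samples."""
    ratio = np.exp(logp_new - logp_old)   # pi_theta(a_n|s_n) / pi_theta_old(a_n|s_n)
    return (ratio * Q_hat).mean()

# The constraint is handled the same way: average D_KL(pi_theta_old(.|s_n) || pi_theta(.|s_n))
# over the visited states s_n and require it to stay below delta.
```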