[강화학습] TRPO(Trust Region Policy Optimization) 논문 정리

728x90

PPO를 공부하려고 했는데 이 논문이 선행되어야한다는 이야기를 들어서 가볍게 논문을 읽어봤다. 아직 강화학습 논문 읽는 건 익숙하지 않아서 시간이 꽤 걸렸다. 수학적 개념이 적어서 최대한 꼼꼼히 이해할 수 있게 정리해봤는데, 다른 사람들에게도 도움이 되었으면 해서 포스팅한다.

[https://arxiv.org/abs/1502.05477]

TRPO(Trust Region Policy Optimization)

url: https://arxiv.org/abs/1502.05477
title: "Trust Region Policy Optimization"
description: "We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters."
host: arxiv.org
favicon: https://arxiv.org/static/browse/0.3.4/images/icons/favicon-32x32.png
image: https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png

Trust Region(믿을만한 구역)에서만 policy를 업데이트 한다는 컨셉
policy based algorithm은 보상 예측치를 최대화하는 과정에서 policy 를 직접 찾아감.

Preliminaries

원본 정책의 예측값

$$\eta(\pi) = \mathbb{E}{s{0}, a_{0}, \ldots}\left[\sum^{\infty}{t=0}\gamma^{t}r(s{t})\right], where \ s_{0} \sim p_{0}(s_{0}), a_{t} \sim \pi(a_{t} | s_{t} ), s_{t+1} \sim P(s_{t+1}|s_{}, a_{t})$$

$s = state$, $a = action $ , $\pi = policy $, $P = probability(?) $, $\mathbb{E} = Expression, Expectation$. 시간 단위의...무언가. , \r = reward\, \\gamma = epsilon(?)\

새 정책의 예측값

$\eta(\widetilde{\pi}) = \eta(\pi) + \mathbb{E}{s{0}, a_{0}, \ldots \widetilde{\pi})}\left[\sum^{\infty}{t=0}\gamma^{t}A{\pi}(s_{t}, a_{t})\right]$

$A_{\pi} = Advantage function$

유도

State-action value function Q

$Q_{\pi}(s_{t}, a_{t}) = \mathbb{E}{s{t+1}, a_{t+1}, \ldots}\left[ \sum {l=0}^{\infty} \gamma^{l}r(s{t+l}) \right] $

Value function V

$V_{\pi}(s_{t}) = \mathbb{E}{a{t}, s_{t+1}, \ldots} \left[ \sum_{l=0}^{\infty} \gamma^{l}r(s_{t+l}) \right]$

Advantage function A

$\begin{matrix}A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s), \quad where \ a_{t} \sim \pi(a_{t}|s_{t}), s_{t+1} \sim P(s_{t+1} |s_{t}, a_{t})\quad for \ t \geq 0 \end{matrix}$

$\mathbb{E}{s{0}, a_{0}, \dots \widetilde{\pi})} [\dots]$ 는 action이 $a_{t} \sim \widetilde{\pi}(\cdot | s_{t})$ 에 의해서 sampling 됨을 보여준다. 여기에서 $\rho_{\pi}$를 discounted visitation frequencies(할인된 방문 빈도)로 둔다.
- discounted visitation frequencies(할인된 방문 빈도)
  $ \rho_{\pi}(s) = P(s_{0} = s)+ \gamma P(s_{1} = s) + \gamma^{2}P(s_{2} = s) + \cdots $
- 위 식을 시간 $t$ 대신 상태 합계로 정리하기
  $ \rho_{\pi}(s) = \sum_{t=0}^{\infty} \sum_{s} P(s_{t} = s|\widetilde{\pi}) $
- $s_{0} \sim P_{0}$ 은 $\pi$ 에 따라서 결정이 됨.
- Advantage function $A_{\pi}(s_{t}, a_{t}) = \widetilde{\pi}(a|s)A_{\pi}(s, a)$ 으로 시간이 아닌 상태에 대한 식으로 변경

정리 1

$$\begin{matrix}
\eta(\widetilde{\pi}) & = & \eta(\pi)+ \mathbb{E}{s{0}, a_{0}, \ldots, \widetilde{\pi}}\left[ \sum {t=0}^{\infty}\gamma^{t}A{\pi}(s_{t}, a_{t}) \right] , \quad(\mathbb{E}{s{0}, a_{0}, \ldots, \widetilde{\pi}}[\ldots] 부분을\ \rho로 대체) \
& =& \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_{t}=s | \widetilde{\pi}) \sum_{a}\widetilde{\pi}(a|s)\gamma^{t}A_{\pi}(s, a) \
& = & \eta(\pi) + \sum_{s}\sum_{t=0}^{\infty} \gamma^{t} P(s_{t}=s | \widetilde{\pi}) \sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a) \ & = & \eta(\pi) + \sum_{s}\rho_{\widetilde{\pi}}(s) \sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)
\end{matrix}$$

$\pi \rightarrow \widetilde{\pi}$ 과정에서 모든 state s에 대해 advantage($\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s, a)$)가 양수값이라고 추정한다.
- 성능 향상에 도움이 됨
- 이외의 값들을 0으로 상수 처리해버림으로써 좀 더 안정적으로 변함
$Advantage \geq 0$ 인 state-action 쌍이 있으면 policy improve, 없으면 policy 최적화
$\rho_{\widetilde{\pi}}(s)$ 와 $\widetilde{\pi}$ 사이의 종속성이 복잡해서 최적화를 위해 local approximation 을 사용함.

Local approximation

$\rho_{\widetilde{\pi}}(s)$ 대신 $\rho_{\pi}(s)$ 사용 -> policy 변화에 따른 state 방문 밀도의 변화를 무시함
$$L_{\pi}(\widetilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\widetilde{\pi}(a|s)A_{\pi}(s,a)$$
여기에서 파라미터화가 가능한 정책 $\pi_{\theta}$ 가 있다고 할 때($\pi_{\theta}(a|s)$가 $\theta$ 벡터에 대해 미분 가능),
$$ L_{\pi_{\theta_{0}}}(\pi_{\theta_{0}}) = \eta(\pi_{\theta_{0}}) $$
$$ \triangledown_{\theta}L_{\pi_{\theta_{0}}}(\pi_{\theta})|{\theta=\theta{0}} = \triangledown_{\theta}\eta(\pi_{\theta}) |{\theta=\theta{0}} $$
- $\pi_{\theta_{0}} \rightarrow \widetilde{\pi}$ 라는 trial한 단계가 $L_{\pi_{\theta}} old$를 개선 -> $\eta$ 도 개선
  - 그러나 '얼마나 작게' 단계를 설정해야하는지 크기에 대해서는 모름 -> Conservative policy iterationConservative Policy Iteration
$\theta$ 개선에 대한 명시적인 lower bound (하한) 제시
$\pi_{old}$ = 현재 정책, $\pi' = argmax_{\pi'}, L_{\pi_{old}}(\pi')$ $$\pi_{new}(a|s) = (1-\alpha)\pi_{old}(a|s) + \alpha \pi'(a|s)$$
lower bound

$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new})-\frac{{2 \epsilon \gamma}}{(1-\gamma)^{2}}\alpha^{2}, \quad where\ \epsilon = max|\mathbb{E}{a \sim \pi'(a|s)}[A{\pi}(s, a)]|$$

위 방정식에만 적용됨
이 policy class는 실제로는 어렵고 제한적이므로 실용적인 업데이트 체계를 적용하는 것이 바람직함

Monotonic Improvement Guarantee for General Stochastic Policies

$$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new})-\frac{{2 \epsilon \gamma}}{(1-\gamma)^{2}}\alpha^{2}, \quad where\ \epsilon = max|\mathbb{E}{a \sim \pi'(a|s)}[A{\pi}(s, a)]|$$

기존의 lower bound에 conservative policy iteration 적용
$\alpha$ 를 $\pi$ 와 $\widetilde{\pi}$ 사이의 거리로 변경하고 적절히 상수로 대체하는 등 해서 위 lower bound를 일반적인 확률론적 정책으로 일반화
- $\alpha$ 를 $\pi$ 와 $\widetilde{\pi}$ 사이의 거리 = 이산확률 분포 p, q에 대해 전체 변동 발산 (total variation divergence)
  $$D_{TV}(p||q) = \frac{1}{2} \sum_{i}|p_{i} - q_{i}|$$
  $$D_{TV}^{max}(\pi, \widetilde{\pi}) = max_{s} D_{TV}(\pi(\cdot|s)||\widetilde{\pi}(\cdot|s))$$Theorem 1$$\alpha = D_{TV}^{max}(\pi_{old}, \pi_{new})$$
  $$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{4\epsilon r}{(1-\gamma)^{2}}, \quad \epsilon = max_{s, a} |A_{\pi}(s, a)|$$
$D$ = Distance
$TV$ = Total variation
여기에서 TV Divergence가 $\alpha$ 보다 작은 두 분포의 확률 변수를 결합할 수 있다는 사실을 사용해 Kakade & Langford의 결과 확장 -> 이는 $1-\alpha$와 같음TV Divergence와 KL Divergence의 관계$$D_{TV}(p||q)^{2} \leq D_{KL}(p||q)$$
$$D_{KL}^{max}(\pi, \widetilde{\pi}) = max_{s}D_{KL}(\pi(\cdot|s) || \widetilde{\pi}(\cdot|s))$$
위 식을 정리해보면
$$\eta(\widetilde{\pi})\geq L_{\pi}(\widetilde{\pi}) - CD_{KL}^{max}(\pi, \widetilde{\pi}), \quad where \ C = \frac{4\epsilon r}{(1-\gamma)^{2}} \ and \ \alpha^{2} = D_{KL}^{max}(\pi, \widetilde{\pi}) \cdots (9)$$

Algorithm 1

for i = 0, 1, 2, ... until convergence do
compute all advantage class $A_{\pi_{i}}(s, a)$
solve the constrained optimization problem
$\pi_{i+1} = argmax_{\pi}[L_{\pi_{i}} - CD_{KL}^{max}(\pi_{i}, \pi)]$
where $C = 4\epsilon r/(1-\gamma)^{2}$
and $L_{\pi_{i}}(\pi) = \eta(\pi_{i}) + \sum_{s}p_{\pi_{i}}(s)\sum_{a}\pi(a|s)A_{\pi_{i}}(s,a)$
end for

9번 방정식을 기반으로 한 정책 반복 알고리즘
$A_{\pi}$를 정확히 계산한다고 가정

$$\begin{matrix}
M_{i}(\pi) &=& L_{\pi_{i}}(\pi) - CD_{KL}^{max}(\pi_{i}, \pi) \
\eta(\pi_{i+1})&\geq& M_{i}(\pi_{i+1}) \
\eta(\pi_{i})&\geq& M_{i}(\pi_{i}) \
\eta(\pi_{i+1}) - \eta(\pi_{i}) &\geq& M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})
\end{matrix}$$

알고리즘이 안정적으로 개선되는 policy sequence를 생성
$M_{i}$를 최대화 -> true objective $\eta$는 감소하지 않음
Minorization-maximaization(MM) 알고리즘의 일종

Trust Region Policy Optimization

대규모 업데이트 허용을 위해 penalty가 아니라 KL Divergence에 대한 제약 조건을 거는 것
매개변수 vector $\theta$ 를 써서 $\pi_{\theta}(a|s)$ 를 고려하므로 $\pi$ 대신 $\theta$ 사용
$$\begin{matrix}
\eta(\theta) := \eta(\pi_{\theta}) \
L_{\theta}(\widetilde{\theta}) = L_{\pi_{\theta}}(\pi_{\widetilde{\theta}}) \
D_{KL}(\theta||\widetilde{\theta}) = D_{KL}(\pi_{\theta}||\pi_{\widetilde{\theta}})
\end{matrix}$$
$$\theta_{old} = above\ policy \ parameter$$
이전에 증명한 바에 따르면,
$$\eta(\theta) \geq L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old}, \theta)$$
- 따라서 $\eta(\theta)$ 증가를 위해서 우항을 최대화해야함
  $$maximize_{\theta}[L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old}, \theta)]$$
$C\left( \frac{4\epsilon r}{(1-\gamma)^{2}} \right)$ 를 활용한 penalty 전략은 step size가 너무 작은 문제가 있음.
- step size를 키울 새로운 방법 = old policy와 new policy 사이에서 KL Divergence에 제약을 건다!)
  $$\begin{matrix}
  maximize_{\theta} \ L_{\theta_{old}}(\theta) & (11) \
  subject to D_{KL}^{max}(\theta_{old}, \theta) \leq \delta
  \end{matrix}$$
다만 이 식은 모든 상태공간에서 KL Divergence가 제한됨 = 현실적으로 어마어마한 숫자의 발산을 제약하긴 힘듬
대신 KDL의 heuristic 근사치_를 사용
$$\overline{D\{KL}^{p}}(\theta_{1}, \theta_{2}) := \mathbb{E}{s \sim p}[D{KL}(\pi_{\theta_{1}}(\cdot|s)) || \pi_{\theta_{2}}(\cdot|s)]$$
정리하면
$$maximize_{\theta} \ L_{\theta_{old}}(\theta) \ subject \ to \ \overline{D_{KL}^{A_{\theta_{old}}}}(\theta_{old, \theta}) \leq \delta$$
정책 변경에 대한 제약 조건에 따라 총 reward 추정치 최적화용 정책 매개변수 문제 해결

Sample-Based Estimation of the Objective and Constraint

Monte-Carlo 시뮬레이션 함수로 목적 함수와 제약 함수 근사
$$maximize_{\theta} \ L_{\theta_{old}}(\theta) \ subject \ to \ \overline{D_{KL}^{A_{\theta_{old}}}}(\theta_{old, \theta}) \leq \delta$$
$$maximize_{\theta} \ \sum_{s} p_{\theta_{old}}(s)\sum_{a}\pi_{\theta}(a|s)A_{\theta_{old}}(s, a) \quad subject \ to \ \overline{D_{KL}^{p_{\theta_{old}}}}(\theta_{old}, \theta) \leq \delta$$
정리
1. $\sum_{s}p_{\theta_{old}}(s)[\dots]$ 부분은 예측에 따라서 $\frac{1}{1-\gamma}\mathbb{E}{s \sim p{\theta_{old}}}[\dots]$ 로 대체
2. $A_{\theta_{old}}$ 를 Q value function $Q_{\theta_{old}}$ 로 변경(상수화 ?)
3. Action 합계를 importance sampling 추정기로 대체
  - $q$ 를 샘플링 분포로 사용
  - loss function에 대한 단일 state $s_{n}$ 의 기여도 하락
    $$\sum_{a}\pi_{\theta}(a|s_{n})A_{\theta_{old}}(s_{n}, a) = \mathbb{E}{a \sim q}[\frac{\pi{\theta}(a|s)}{q(a|s)}Q_{\theta_{old}}(s, a)] \quad subject \ to \ \mathbb{E}{s \sim p{\theta_{old}}}[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_{\theta}(\cdot|s))] \leq \delta$$
이제 기댓값 $\mathbb{E}$ 를 표본 평균으로, Q-value를 경험적 추정치로 바꾸면 됨.
두 가지 방법이 존재함

1. Single path : 일반적으로 policy gradient 에 사용. trejectories 샘플링 기반
  - policy simulation을 수행하여 trajectories(궤도?) set을 생성
  - 모든 state-action 쌍 $(s_{n}, a_{n})$을 objective에 포함함
2. Vine : rollout 세트를 구성하고 세트의 각 state에서 작업을 수행. policy iteration method에서 사용
  - 'Trunk' 라는 trajectories set을 생성하고, 도착한 state의 subset에서부터 'branch'라는 rollout을 생성함
  - 이 state $s_{n}$ 각각에서 여러 개의 action($a_{1}, a_{2}$)를 실행시키고, action 이후에 rollout을 산출함.
  - 이때 common random numbers(CRN) 을 사용해서 분산을 줄임

728x90

저작자표시 비영리 변경금지 (새창열림)

'🐬 ML & Data > 📮 Reinforcement Learning' 카테고리의 다른 글

[MPC] 4. Optimal Control(2) - Taylor Series 적용, Algebraic Riccati Equation(ARE) 구하기 (1)	2024.03.08
[MPC] 4. Optimal Control(1) - LQR과 Taylor Series(테일러 급수) (1)	2024.03.06
[MPC] 3. 상태(state)와 출력(output) 예측해보기 (0)	2024.03.06
[MPC] 2. 상태 공간 방정식 유도 (0)	2024.03.06
[MPC] 1. Model Predictive Control Intro (0)	2024.03.06

TRPO(Trust Region Policy Optimization)

Preliminaries

원본 정책의 예측값

새 정책의 예측값

정리 1

Monotonic Improvement Guarantee for General Stochastic Policies

Algorithm 1

Trust Region Policy Optimization

Sample-Based Estimation of the Objective and Constraint

'🐬 ML & Data > 📮 Reinforcement Learning' 카테고리의 다른 글

티스토리툴바