728x90

아오 티스토리 LaTEX 진짜 못해먹겠네

선행 TRPO 논문 리뷰는 아래 링크 참고. 보고 오면 좀 더 알아보기 쉬울 것으로 예상됨(일단 저는 그랬습니다)

[강화학습] TRPO(Trust Region Policy Optimization) 논문 정리

PPO를 공부하려고 했는데 이 논문이 선행되어야한다는 이야기를 들어서 가볍게 논문을 읽어봤다. 아직 강화학습 논문 읽는 건 익숙하지 않아서 시간이 꽤 걸렸다. 수학적 개념이 적어서 최대한

dnai-deny.tistory.com

PPO(Proximal Policy Optimization)

On-policy, actor-critic 알고리즘
Value function 없이 확률적인 policy를 곧장 학습함
policy update를 포착하는 novel objective function을 사용해서 매 iteration에서 policy가 과도하게 크게 변하는 것을 방지학습 안정성을 높임
TRPO의 trust region policy optimization 장점을 가져오면서도 더 쉽게 구현이 가능하고, 일반적이고, smaple complexity가 나아진 모델
Q-Learning은 간단한 문제에 대해 이해력이 별로고, Vanilla policy gradient는 데이터 효율성과 건전성이 낮으며, TRPO는 복잡하고 dropout과 같은 노이즈 있거나 parameter sharing을 포함한 architecture에는 적합하지 않다.
policy performance에 대한 하한(lower bound, pessimistic estimate비관적 추정)을 형성하는 잘린 확률 비율을 사용함
policy 최적화를 위해서 데이터 샘플링-샘플링 데이터 최적화를 번갈아 수행

PPO is a policy-gradient type of method. The algorithm directly optimizes the expected reward by estimating the gradient of the policy from the trajectories taken by the agent. This is usually an on-line method, where only the most recent transitions taken by the agent is used to update the policy.

url: https://pytorch.org/tutorials/intermediate/reinforcement_ppo.html
title: "Reinforcement Learning (PPO) with TorchRL Tutorial — PyTorch Tutorials 2.4.0+cu121 documentation"
host: pytorch.org

1. Policy Gradient Method

policy gradient의 추정치 계산 -> stochastic gradient ascent 와 연결해서 사용
- common gradient estimator
  $$\hat{g} = \hat{\mathbb{E}{t}}[\bigtriangledown{\theta}\log \pi_{\theta}(a_{t}|s_{t})]$$
- $\pi_{\theta}$ 가 확률적 policy이고 $A_{t}$ 가 t 시점의 advantage function 결과
- $\hat{\mathbb{E}_{t}}[\dots]$ = sampling과 optimization을 번갈아하는 알고리즘의 유한 샘플 배치의 경험적 평균
Loss Function
$$L^{PG}(\theta) = \hat{\mathbb{E}{t}}[\log \pi{\theta}(a_{t}|s_{t})\hat{A_{t}}]$$
- 같은 trajectory를 여러 개 써서 최적화를 시키면 파괴적으로 큰 업데이트가 생길 수 있음.2. Trust Region Policy Optimization
위 문제를 해결하는 방법 중 하나. 신뢰 영역 내의 범위에서만 policy를 업데이트하도록 제한해서 너무 큰 업데이트가 발생하는 것을 방지함
$$maximize_{\theta} \ \hat{\mathbb{E}{t}}\left[ \frac{\pi{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A_{t}} \right] \ subject \ to \ \hat{\mathbb{E}{t}}[D{KL}(\pi_{\theta_{old}}(\cdot|s), \pi_{\theta}(\cdot|s))] \leq \delta$$
- $\theta_{old}$ = update 전 policy parameter
objective에 대한 linear approximation(선형 근사), 제약 조건에 대한 quadratic approximation(2차 근사)를 conjugate gradient algorithm을 써서 대략적으로 해결
제약 조건 대신 penalty를 쓰는 버전의 식은 아래와 같음
$$maximize_{\theta} \ \hat{\mathbb{E}{t}}\left[ \frac{\pi{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A_{t}} - \beta D_{KL}(\pi_{\theta_{old}}(\cdot|s), \pi_{\theta}(\cdot|s)) \right]$$
- 여기서 $\beta$ 는 TRPO 논문의 C인 것으로 추정됨
- 모든 상황에 동작하는 $\beta$ 를 찾기 어렵고, 학습 중 특성 변화가 있는 경우엔 더 어려움

3. Clipped Surrogate Objective

$$r_{t}(\theta) = \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}, \quad r_{t}(\theta_{old}) = 1$$

위와 같이 정의했다고 가정하자.
TRPO는 surrogate objective maximize
$$L_{\theta_{old}}(\theta) - CD_{KL}^{max}(\theta_{old, \theta}) \leq \eta(\theta)$$
위에 정의한 내용을 바탕으로 써보면
$$L^{CPI}(\theta) = \hat{\mathbb{E}{t}}\left[\frac{\pi{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A_{t}} \right] = \hat{\mathbb{E}{t}}[r{t}(\theta)\hat{A_{t}}]$$
- CPI = conservative policy iteration
여기에서 제약 없이 $L^{CPI}$ 를 maximize하면 정책 업데이트가 지나치게 커짐
- objective를 수정하고 $r_{t}(\theta)=1$에서 멀어지도록 정책에 change penalty를 적용하는 방법 고려

main objective

$$L^{CLIP}(\theta) = \hat{\mathbb{E}{t}}\left[min(r{t}(\theta)\hat{A_{t}}, clip(r_{t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A_{t}}) \right]$$

epsilon 이 hyperparameter고 0.2라면
1. $r_{t}(\theta)\hat{A_{t}}=L^{CPI}$
2. $clip(r_{t}(\theta), 1-\epsilon, 1+\epsilon)$ => surrogate objective 의 확률 비율을 자름( $[1-\epsilon, 1+\epsilon]$ 사이로)
1번과 2번 중에 최소값을 취함.
이게 final objective = lower bound of unclipped
이렇게 되면 probability ratio는 objective를 향상 시킬 때만 무시되고, 악화시킬 때는 포함함으로써 더 나쁜 방향으로 가지 못하도록 억제함

$$L^{CLIP}(\theta) = L^{CPI}(\theta), \ where \ \theta = \theta_{old} \ (r = 1)$$

Advantage가 음수인지 양수인지에 따라 $1-\epsilon$ 혹은 $1+\epsilon$ 에서 clip됨

위 그래프는 연속된 제어 문제에 대한 proximal policy optimization 알고리즘을 통해 얻은 policy update 방향에 끼워넣는 것에 따라 어떻게 objective가 달라지는지를 보여줌
$L^{CLIP}$ 은 $L^{CPI}$의 하한선이며, 정책 업데이트가 너무 많아지면 패널티가 발생함(linear interpolation factor 1 부분 주목)
업데이트된 policy는 초기보다 0.02의 KL Divergence를 가지고, 이 부분이 $L^{CLIP}$ 이 최대가 되는 지점

4. Adaptive KL Penalty Coefficient

clipped surrogate objective 대체 - 혹은 플러스 요소로 KL Divergence penalty를 주는 방법이 있음
각 policy update마다 KL Divergence의 일부 타겟인 $d_{targ}$ 가 목표를 달성할 수 있게 패널티 계수(Coefficient) 수정
- 실제로는 clipped version 보다 성능이 안 좋지만 중요한 baseline이라 다룸

minibatch SGD의 epoch 몇 개를 써서 KL-penalized objective 최적화
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}{t}}\left[\frac{\pi{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A_{t}} - \beta(D_{KL}[\pi_{\theta_{old}}(\cdot|s_{t}), \pi_{\theta}(\cdot|s_{t})]) \right]$$
$d = \hat{\mathbb{E}{t}}[D{KL}[\pi_{\theta_{old}}(\cdot|s_{t}), \pi_{\theta}(\cdot|s_{t})]]$
- $if \ d < d_{targ} / 1.5, \ \beta=\beta/2$
- $if \ d > d_{targ} \times 1.5, \ \beta=\beta \times 2$
- 여기서 1.5와 2는 경험적으로 선택되지만 그렇게 민감하지는 않음
- $\beta$ 초기값도 어차피 빠르게 조정되므로 크게 의미 없음

KL Divergence가 $d_{targ}$ 이랑 아주 다른 경우가 종종 있지만 드물고, $\beta$ 가 빨리 조정됨.

5. Algorithm

variance-reduced advantage-function estimator를 계산하는 대부분의 기술은 학습된 state-action function $V(s)$ 를 사용
신경망이 포함된 모델의 경우에는 policy surrogate와 value function erorr를 결합한 손실함수 필요
- 탐색 강화를 위해 entropy bonuys 사용objective maximized each iteration$$\begin{matrix}
  L^{CLIP+VF+S}(\theta) &= \ \hat{\mathbb{E}{t}}\left[L^{CLIP}{t}(\theta) - c_{1}L^{VF}{t}(\theta) + c{2}S\pi_{\theta}\right] \
  where & \ c_{1}, c_{2} = coefficent \
  & S = entropy \ bonus \
  & L_{t}^{VF} = squared \ error \ loss \ (V_{\theta}(s_{t}) - V_{t}^{targ})^{2}

\end{matrix}
$$

유명한 policy gradient 구현은 작은 timestamp T에 대해 policy를 실행하고 수집한 샘플을 업데이트에 사용
- T를 넘지 않는 Advantage Estimator
  $$\hat{{A_{t}}} = -V(s_{t}) + r_{t} + \gamma r_{t+1} + \cdots + \gamma^{T-t+1}r_{T-1} + \gamma^{T-t}V(s_{T})$$
- 이 choice를 일반화하면 일반화된 advantage estimation의 truncated(잘린) version 사용 가능
$\lambda = 1$ 일 때,
$$\hat{A_{t}} = \delta_{t} + (\gamma \lambda)\delta_{t+1} + \cdots + (\gamma \lambda)^{T-t+1}\delta {T-1}, \quad where \ \delta{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})$$
fixed-length trajectory segments PPO
- 각 iteration마다 actor이 T 길이의 데이터 수집
- NT 시간 만큼의 surrogate loss 계산
- minibatch SGD(or Adam)으로 K epoch 동안 최적화

Algorithm 1 PPO, Actor-Critic Style

for iteration=1, 2, ... do
for actor=1, 2, ..., N do
   Run policy $\pi_{\theta_{old}}$ in env for T timesteps
   Compute advantage estimates $\hat{A_{1}}, \dots, \hat{A_{T}}$
   end for
   Optimize surrogate $L$ wrt $\theta$, with K epochs and minibatch size M < NT
   $\theta_{old} \leftarrow \theta$
end for

728x90

저작자표시 비영리 변경금지 (새창열림)

[강화학습] PPO(Proximal Policy Optimization) 논문 정리

PPO(Proximal Policy Optimization)

1. Policy Gradient Method

3. Clipped Surrogate Objective

main objective

4. Adaptive KL Penalty Coefficient

5. Algorithm

Algorithm 1 PPO, Actor-Critic Style

티스토리툴바