[Reinforcement Learning] TRPO(Trust Region Policy Optimization) Paper Summary
·
🍬 ML & Data/📮 Reinforcement Learning
I set out to study PPO, but heard that this paper should be read first, so I gave it a light read. I am still not used to reading reinforcement learning papers, so it took quite a while. Since my math background is limited, I organized the material as carefully as I could so that it can be understood, and I am posting it in the hope that it helps others too. Paper: "Trust Region Policy Optimization" (TRPO), https://arxiv.org/abs/1502.05477. Abstract: "We describe an iterative procedure for optimizing policies, with guaranteed mono..
[MPC] 4. Optimal Control(2) - Applying the Taylor Series and Deriving the Algebraic Riccati Equation(ARE)
·
🍬 ML & Data/📮 Reinforcement Learning
Applying to LQR $$V^{*}(x(t), t) = \underset{u[t, t+\Delta t]}{\min} \{ \Delta t \cdot l[x(t + \alpha \Delta t), u(t + \alpha \Delta t), t + \alpha \Delta t] + V^{*}(x(t + \Delta t), t+\Delta t) \}$$ In this equation, let us expand the $V^{*}(x(t + \Delta t), t+\Delta t)$ term in x and t using the Taylor series above. Take $x = (x(t), t), v = \Delta t$. Rearranging gives the following. $$V^{*}(x + v) = V^{*}(x) + f'(x)v + f(x)v' + \frac 12 f''(x)v^{2}+ \frac1..
[MPC] 4. Optimal Control(1) - LQR and the Taylor Series
·
🍬 ML & Data/📮 Reinforcement Learning
Optimal control basics - LQR(Linear Quadratic Regulator) LQR is the foundation, so we start with it. system : $\dot x = f(x, u, t), x(t_{0}) = x_{0}$ cost function : $$V(x(t_{0}), u, t_{0}) = \int_{t_{0}}^{T} l[x(\tau), u(\tau), \tau]d\tau + m(x(T))$$ Find the input $u^{*}(t), t_{0}\le t \le T$ that minimizes the cost function above -> this is the goal of optimal control. By the principle of optimality, if a solution is optimal, the solution to each sub problem is also optimal. With $t_{0} < t < t_{1} < T$, add $t_{1}$..
[MPC] 3. Predicting the State and the Output
·
🍬 ML & Data/📮 Reinforcement Learning
Input / Output summary $N_p$ : number of future outputs to predict $N_c$ : number of future control inputs to predict For path tracking, $N_c$ control commands for tracking $N_p$ points... Control Input $\Delta u(k), \Delta u(k+1), \Delta u(k+2), \cdots, \Delta u(k + N_{c} - 1)$ Output $y(k), y(k+1), \cdots, y(k+N_{p})$ Since $y(k) = Cx(k)$, we can write $y(k+1) = Cx(k+1), y(k+2) = Cx(k+2), \cdots$ so it suffices to obtain the predicted states $x(k+1), x(k+2), \cdots, x(k+N_{p})$ Finding the state variables $..
[MPC] 2. Deriving the State-Space Equations
·
🍬 ML & Data/📮 Reinforcement Learning
Deriving the MPC state-space equations State-space equations + the LTI(Linear Time-Invariant system) case => Continuous-time state-space model State equation : $$\dot{x} = Ax + Bu$$ Output equation : $$y = Cx$$ MPC operates in a discrete setting => Discrete-time state-space model State equation : $$x(k+1) = A_{d}x(k) + B_{d}u(k)$$ Output equation : $$y(k) = C_{d}x(k)$$ The basic MPC model is the discrete-time augmented state-space model, which uses the change in the state variable $\Delta x$ instead of the state variable itself State equation $${x(k+1) - x(k) ..
[MPC] 1. Model Predictive Control Intro
·
🍬 ML & Data/📮 Reinforcement Learning
YouTube https://www.youtube.com/watch?v=zU9DxmNZ1ng&list=PLSAJDR2d_AUtkWiO_U-p-4VpnXGIorrO-&index=1 Blog https://sunggoo.tistory.com/65 I plan to write brief notes on what I studied from the materials above. There will be many derivations of equations, and after that I will add more while looking at papers or code implementations, depending on the purpose. The concept of MPC(Model Predictive Control) Device state change(dynamics) + surrounding environment factors => cost function Control engineering Targets non-linear / non-convex problems While studying it, I felt it has a bit of the flavor of reinforcement learning Flow Based on the state variables at k-1, k+1 ~ ..
[Reinforcement Learning] Dealing with Sparse Reward Environments - Learning in Sparse Reward Environments
·
🍬 ML & Data/📮 Reinforcement Learning
※ These are notes I put together while studying the content of the link below. Reinforcement Learning: Dealing with Sparse Reward Environments Reinforcement Learning (RL) is a method of machine learning in which an agent learns a strategy through interactions with its environment… medium.com 1. Sparse Reward Sparse Reward : the case where the agent receives a positive reward only when it gets close to the goal state Same as my current experiment environment setup Curiosity-Driven method So that the agent is also motivated by parts of the environment outside its interests Curric..
[Reinforcement Learning] DDPG(Deep Deterministic Policy Gradient)
·
🍬 ML & Data/📮 Reinforcement Learning
DDPG addresses DQN's curse of dimensionality(handling high-dimensional actions slows computation and requires a lot of memory space) with an off-policy actor-critic approach. Builds on the insights of the existing DQN approach: batch normalization, replay buffer, target Q network Actor-critic Has a parameterized actor function actor function : specifies the current policy by mapping a state to a specific action Trained with the policy gradient method Here J is the objective function The actor function maximizes the objective function by gradient ascent → the policy parameter at this point..
[Reinforcement Learning] Dueling Double Deep Q Learning(DDDQN / Dueling DQN / D3QN)
·
🍬 ML & Data/📮 Reinforcement Learning
Dueling Double DQN https://arxiv.org/pdf/1509.06461.pdf https://arxiv.org/pdf/1511.06581.pdf Double DQN DQN has the problem of overestimating rewards. The Q value tends to assume that the agent will receive a higher return than it actually does ⇒ because the Q learning update equation contains the maximum Q value for the next state The max operation over Q values maximizes the bias. Performance degrades when the environment's maximum true value is 0 but the maximum true value estimated by the agent is positive Two networks are used to solve this. Q Next : action selection → as the next action, the best ..
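To make the overestimation point above concrete, here is a small sketch comparing the standard DQN target with the Double DQN target as described in the first linked paper (arXiv:1509.06461), where one network selects the next action and the other evaluates it. The function names, argument names, and the Q-value lists are assumptions for illustration; they are not taken from the post, whose excerpt is cut off before it describes the two networks in full.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """Standard DQN target: the same values both select and evaluate (max)."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: select with one network, evaluate with the other."""
    if done:
        return reward
    # Action selection: argmax over the online network's next-state Q values.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Action evaluation: read that action's value from the target network.
    return reward + gamma * q_target_next[best_action]


# Hypothetical next-state Q values for three actions.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [0.5, 1.0, 3.0]   # action 2 is (assumed to be) overestimated

y_dqn = dqn_target(1.0, 0.5, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.5, q_online_next, q_target_next, done=False)
print(y_dqn)   # 2.5
print(y_ddqn)  # 1.5
```

In this toy case the plain max picks up the inflated value of action 2, while the decoupled target uses the value of the action the online network actually prefers.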
[Reinforcement Learning] From Creating a Custom Reinforcement Learning Environment with gym to Training Dueling DDQN
·
🍬 ML & Data/📮 Reinforcement Learning
I searched all over the internet: there seem to be sixty million examples of doing reinforcement learning with the game agents that gym provides, and exactly one example of training on a custom environment. I am writing this short post in the hope that it helps people who are just starting to study. 1. Looking at the structure of Gym's Env It does not have to be done this way(implementing everything from scratch is also an option), but in any case we will build on the environment structure of the gym library. !pip install gym The env structure of the gym library looks roughly like the following. You can see it directly in site-packages/gym/core.py. class Env(Generic[ObsType, ActType]):m.Generator] = None """ The ma..
[Reinforcement Learning] DQN(Deep Q-Network)
·
🍬 ML & Data/📮 Reinforcement Learning
[Model Review] Markov Decision Process & Q-Learning 1. Markov Decision Process(MDP) Reinforcement Learning from the Ground Up - Markov Decision Process Markov Process Defined by states S and a transition probability matrix P From one state to another dnai-deny.tistory.com Deep Reinforcement Learning In conventional Q Learning, the Q-Value for each State and Action is stored in a table As the state space and action space grow, the memory and exploration time needed to store the Q-Values increase ⇒ generating the Q-Table with deep learning: Q..
[Reinforcement Learning] Markov Decision Process & Q-Learning
·
🍬 ML & Data/📮 Reinforcement Learning
1. Markov Decision Process(MDP) Reinforcement Learning from the Ground Up - Markov Decision Process Markov Process Defined by states S and a transition probability matrix P Transitions occur from one state to another Each state transition has a probability S4 is the terminal state Markov property $$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t] $$ The path taken to reach a state does not affect the probability calculation. When the next state can be determined from the state at a single point in time, the state is said to be Markov. Counterexample) a photo of someone driving(a photo from a single moment cannot tell reverse/forward/speed etc. → the next state cannot be determined) ex) a chess game(a certain ..
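As a rough illustration of a Markov process defined by states S and a transition probability matrix P, the sketch below propagates a state distribution one step at a time. Each step uses only the current distribution, which is the Markov property in action. The two-state chain, with state 1 as an absorbing terminal state, and the names `step` and `P` are assumptions made up for this example, not the chain from the post.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical chain: state 0 stays or moves to state 1 with equal probability;
# state 1 is terminal (it only transitions to itself).
P = [
    [0.5, 0.5],
    [0.0, 1.0],
]

dist0 = [1.0, 0.0]          # start in state 0 with certainty
dist1 = step(dist0, P)      # depends only on dist0
dist2 = step(dist1, P)      # depends only on dist1, not on how we got there
print(dist1)  # [0.5, 0.5]
print(dist2)  # [0.25, 0.75]
```

Each row of `P` sums to 1, and the probability mass gradually collects in the terminal state.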