๐Ÿฌ ML & Data/๐Ÿ“ฎ Reinforcement Learning

[Reinforcement Learning] DQN (Deep Q-Network)

darly213 2023. 8. 1. 11:11
 

Background post: [Model Review] Markov Decision Process & Q-Learning — dnai-deny.tistory.com

 

Deep Reinforcement Learning

  • In vanilla Q-learning, the Q-value for every (state, action) pair is stored in a table.
    • As the state space and action space grow, the memory and exploration time needed to fill in the Q-values grow with them.
    ⇒ If the Q-function is instead approximated by a deep network (a nonlinear mapping from state to Q-values), there is no need to store or look up a Q-table at all! (see the sketch below)

 

Deep Q-Network(DQN)

CNN + Experience replay + Target network

  • Vanilla deep Q-learning
    1. Initialize the parameters, then repeat steps 2-5 at every time step.
    2. Select an action \( a_t \) with an ε-greedy policy.
      • ε-greedy = execute the action chosen by the policy with probability 1-ε, and a random action with probability ε.
    3. Execute \( a_t \) and observe the transition \( e_t = (s_t, a_t, r_t, s_{t+1}) \).
    4. Compute the target value \( y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) \).
    5. Update \( \theta \) in the direction that minimizes the loss \( (y_t - Q(s_t, a_t; \theta))^2 \).

→ DQN์—์„œ๋Š” transition sample์„ ์–ป๋Š” 3๋ฒˆ์—์„œ experience replay, ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ 4-5๋ฒˆ์—์„œ target network ์ ์šฉ

 

CNN Architecture

  • CNN์„ ๋„์ž…ํ•ด์„œ ์ธ๊ฐ„๊ณผ ๋น„์Šทํ•œ ํ˜•ํƒœ๋กœ Atari ๊ฒŒ์ž„(๋ฒฝ๋Œ๊นจ๊ธฐ)์˜ input์„ ์ฒ˜๋ฆฌํ•จ
    • ๋†’์€ ์ฐจ์›์˜ ์ด๋ฏธ์ง€ ์ธํ’‹ ์ฒ˜๋ฆฌ์— ํšจ๊ณผ์  → linear function approximator๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹๋‹ค
  • CNN ์ž…๋ ฅ์œผ๋กœ action ์ œ์™ธ, state๋งŒ ๋ฐ›๊ณ  ์ถœ๋ ฅ์œผ๋กœ action์— ํ•ด๋‹นํ•˜๋Š” ๋ณต์ˆ˜ ๊ฐœ์˜ Q-Value๋ฅผ ์–ป๋Š” ๊ตฌ์กฐ
    • Q-Value ์—…๋ฐ์ดํŠธ๋ฅผ ์œ„ํ•ด 4๋ฒˆ ๊ณผ์ • ๊ณ„์‚ฐ์„ ํ•  ๋•Œ, max๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ ์—ฌ๋Ÿฌ ๋ฒˆ action๋งˆ๋‹ค state๋ฅผ ํ†ต๊ณผํ•˜์ง€ ์•Š๊ณ  ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐ ๊ฐ€๋Šฅ

 

Experience Replay

  1. ๋งค ์Šคํ…๋งˆ๋‹ค ์ถ”์ถœ๋œ ์ƒ˜ํ”Œ(์•ก์…˜์„ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ) ๋ฅผ replay memory D์— ์ €์žฅ
  2. D์— ์ €์žฅ๋œ ์ƒ˜ํ”Œ์„ uniform ํ•˜๊ฒŒ ๋žœ๋ค ์ถ”์ถœํ•ด์„œ Q-update ํ•™์Šต์— ์‚ฌ์šฉ

  • ํ˜„์žฌ ์„ ํƒํ•œ action์„ ์ˆ˜ํ–‰ํ•ด์„œ ๊ฒฐ๊ณผ ๊ฐ’ + ์ƒ˜ํ”Œ์„ ์–ป์Œ(๋™์ผ)
  • ๋ฐ”๋กœ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ์ง€์—ฐ!

1) Sample correlation

  • ๋”ฅ๋Ÿฌ๋‹์—์„œ ํ•™์Šต ์ƒ˜ํ”Œ์ด ๋…๋ฆฝ์ ์œผ๋กœ ์ถ”์ถœ๋˜์—ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๊ณ  ๋ชจ๋ธ์„ ํ•™์Šตํ•จ.
  • ๊ทธ๋Ÿฌ๋‚˜ ๊ฐ•ํ™”ํ•™์Šต์—์„œ๋Š” ์—ฐ์†๋œ ์ƒ˜ํ”Œ ์‚ฌ์ด์— ์˜์กด์„ฑ์ด ์กด์žฌํ•จ → ์ด์ „ ์ƒ˜ํ”Œ์ด ๋‹ค์Œ ์ƒ˜ํ”Œ ์ƒ์„ฑ์— ์˜ํ–ฅ์„ ์คŒ.
  • ⇒ replay memory D์— ์ €์žฅํ–ˆ๋‹ค๊ฐ€ ๋žœ๋คํ•˜๊ฒŒ ์ถ”์ถœํ•ด์„œ ์‚ฌ์šฉํ•˜๋ฉด ์ด์ „์˜ ์ƒ˜ํ”Œ ๊ฒฐ๊ณผ๊ฐ€ ๋ฐ˜๋“œ์‹œ ๋‹ค์Œ ์ƒํƒœ์— ์˜ํ–ฅ์„ ์ฃผ์ง€๋Š” ์•Š๊ฒŒ ๋จ == ๋…๋ฆฝ์ถ”์ถœ๊ณผ ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ

2) Change in Data Distribution

  • ๋ชจ๋ธ์„ ์ •์ฑ… ํ•จ์ˆ˜์— ๋”ฐ๋ผ์„œ ํ•™์Šตํ•  ๊ฒฝ์šฐ Q-update๋กœ ์ •์ฑ…์ด ๋ฐ”๋€Œ๊ฒŒ ๋˜๋ฉด ์ดํ›„์— ์ƒ์„ฑ๋˜๋Š” data์˜ ๋ถ„ํฌ๋„ ๋ฐ”๋€” ์ˆ˜ ์žˆ๋‹ค.
    • ํ˜„์žฌ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ƒ˜ํ”Œ์— ์˜ํ–ฅ์„ ์ฃผ๋ฏ€๋กœ ํŽธํ–ฅ๋œ ์ƒ˜ํ”Œ๋งŒ ํ•™์Šต๋  ์ˆ˜ ์žˆ์Œ
    • ์ขŒ์ธก์œผ๋กœ ๊ฐ€๋Š” Q ๊ฐ’์ด ํฌ๋ฉด ๋‹ค์Œ์—๋„ ์™ผ์ชฝ์œผ๋กœ๋งŒ ๊ฐ€๋Š” feed back loop ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Œ
  • ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ํ•™์Šต ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ถˆ์•ˆ์ •์„ฑ์„ ์œ ๋ฐœํ•˜๊ณ  ์ด๋กœ ์ธํ•ด local minimum ์ˆ˜๋ ด / ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฐœ์‚ฐ ๋“ฑ์˜ ๋ฌธ์ œ ๋ฐœ์ƒ ๊ฐ€๋Šฅ
  • ⇒ replay memory์—์„œ ์ถ”์ถœ๋œ ๋ฐ์ดํ„ฐ๋Š” ์‹œ๊ฐ„ ์ƒ์˜ ์—ฐ๊ด€์„ฑ์ด ์ ๋‹ค. + Q-function์ด ์—ฌ๋Ÿฌ action์„ ๋™์‹œ์— ๊ณ ๋ คํ•ด์„œ ์—…๋ฐ์ดํŠธ(CNN)๋˜๋ฏ€๋กœ ์ •์ฑ… ํ•จ์ˆ˜์™€ training data์˜ ๋ถ„ํฌ๊ฐ€ ๊ธ‰๊ฒฉํžˆ ํŽธํ–ฅ๋˜์ง€ ์•Š๊ณ  smoothing๋จ
  • → ๊ธฐ์กด deep Q-learning์˜ ํ•™์Šต ๋ถˆ์•ˆ์ •์„ฑ ๋ฌธ์ œ ํ•ด๊ฒฐ

Experience Replay์˜ ์žฅ์ 

  • Higher data efficiency: a single sample can be reused for multiple model updates.
  • Reduced sample correlation: random sampling lowers the variance of the updates.
  • More stable training: the behavior distribution is averaged over many previous states, suppressing parameter oscillation and divergence.

 

Target Network

  1. Compute the target value \( y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) \) using the target network \( \theta^- \).
  2. Compute the action value \( Q(s_t, a_t; \theta) \) using the main Q-network \( \theta \).
  3. Update the main Q-network \( \theta \) so that the loss \( (y_t - Q(s_t, a_t; \theta))^2 \) is minimized.
  4. Every C steps, overwrite the target network \( \theta^- \) with the main Q-network \( \theta \).
  • The existing Q-network is duplicated, giving a two-network structure: the main Q-network and the target network.

  • Main Q-network: used to compute the action value Q from the state/action; its parameters are updated every step.
  • Target network: used to compute the target value that the update regresses toward; instead of being updated every step, its parameters are synchronized with the main Q-network every C steps.

์›€์ง์ด๋Š” target value ๋ฌธ์ œ๋กœ ์ธํ•œ ๋ถˆ์•ˆ์ •์„ฑ ๊ฐœ์„ 

3) ์›€์ง์ด๋Š” Target Value

  • ๊ธฐ์กด Q-Learning์—์„œ๋Š” ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ ๋ฐฉ์‹์œผ๋กœ gradient descent ๊ธฐ๋ฐ˜ ๋ฐฉ์‹ ์ฑ„ํƒ
  • ์—…๋ฐ์ดํŠธ ๊ธฐ์ค€์ ์ธ target value \( y_t \) ๋„ \( \theta \)๋กœ ํŒŒ๋ผ๋ฏธํ„ฐํ™” → action value์™€ target value๊ฐ€ ๋™์‹œ์— ์›€์ง์ž„
  • ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ \( \theta \)๊ฐ€ ์—…๋ฐ์ดํŠธ ๋˜์ง€ ์•Š๊ณ  ํ•™์Šต ๋ถˆ์•ˆ์ •์„ฑ ์œ ๋ฐœ

⇒ If the target value is held fixed with the target network for C steps, the model can be updated toward a stable target during those C steps; synchronizing the parameters every C steps then keeps the resulting bias small (see the sketch below).
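
A sketch of steps 1-4, reusing the hypothetical `DQN` module and replay batch format from the sketches above; the optimizer choice and hyperparameter values are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

q_net = DQN(n_actions=4)                    # main Q-network, theta
target_net = copy.deepcopy(q_net)           # target network, theta^-
target_net.eval()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def dqn_update(batch, step, gamma=0.99, sync_every=10_000):
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # Step 1: y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-)
        y = rewards + gamma * (1.0 - dones) * target_net(next_states).max(1).values
    # Step 2: action value Q(s_t, a_t; theta) from the main network.
    q_sa = q_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
    # Step 3: minimize the squared TD error with respect to theta only.
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Step 4: every C steps, copy theta into theta^-.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```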

Preprocessing

  • Square grayscale images are used to match the 2D convolutional layers.
  • The raw frames are downsampled.
  • 4 consecutive frames are stacked into a single Q-function input (a preprocessing sketch follows below).
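
A sketch of this preprocessing, assuming raw 210x160 RGB Atari frames; the exact crop/resize choices and library calls are illustrative rather than taken from the paper.

```python
import numpy as np
from collections import deque
from PIL import Image

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Convert one 210x160x3 RGB Atari frame to an 84x84 grayscale image."""
    gray = Image.fromarray(frame).convert("L")        # grayscale
    small = gray.resize((84, 84))                     # downsample to a square image
    return np.asarray(small, dtype=np.uint8)

class FrameStack:
    """Stack the 4 most recent frames into a single 4x84x84 Q-network input."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def push(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))
        while len(self.frames) < self.frames.maxlen:  # pad at the start of an episode
            self.frames.append(self.frames[-1])
        return np.stack(self.frames, axis=0)          # shape (4, 84, 84)
```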

 

Model Architecture

Model Evaluation

  • Maximum (total) reward per episode
  • Average maximum Q-value over a fixed set of held-out states (changes more smoothly than the reward curve; see the sketch below)

 

 

References

 

  • Playing Atari with Deep Reinforcement Learning — arxiv.org
  • [RL] Reinforcement Learning Algorithms: (1) DQN (Deep Q-Network) — ai-com.tistory.com
  • Playing Atari with Deep Reinforcement Learning · Pull Requests to Tomorrow — jamiekang.github.io
  • [Paper Review] Human-level Control through Deep Reinforcement Learning (DQN) — limepencil.tistory.com

 
