[Reinforcement Learning] DQN (Deep Q-Network)

2023. 8. 1. 11:11 · ML & Data / Reinforcement Learning
 

Related post: [Model Review] Markov Decision Process & Q-Learning (dnai-deny.tistory.com)

 

Deep Reinforcement Learning

  • Vanilla Q-Learning stores the Q-value for every state-action pair in a table.
    • As the state space and action space grow, the memory and exploration time needed to fill in that table become a problem.
    ⇒ If a deep network approximates the Q-function as a nonlinear mapping instead of building a Q-table, individual Q-values no longer need to be stored or looked up (see the sketch below)!
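A minimal sketch, not from the original post, contrasting the two approaches: a tabular Q-function must store one value per (state, action) pair, while a small network computes Q-values on demand. The state dimension, action count, and layer widths here are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

# Tabular Q-learning: one stored entry per (state, action) pair,
# so memory grows with |S| x |A|.
n_states, n_actions = 10_000, 4
q_table = np.zeros((n_states, n_actions))
q_value = q_table[42, 3]                      # explicit lookup

# Function approximation: Q(s, .; theta) is computed, not stored.
q_net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),              # 8-dim state features (illustrative)
    nn.Linear(64, n_actions),                 # one Q-value per action
)
state = torch.randn(1, 8)
q_values = q_net(state)                       # shape (1, n_actions); no table needed
```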

 

Deep Q-Network (DQN)

CNN + Experience replay + Target network

  • Vanilla deep Q-learning
    1. Initialize the parameters \( \theta \), then repeat steps 2-5 at every step.
    2. Select action \( a_t \) with an ε-greedy policy.
      • ε-greedy = execute the action chosen by the policy with probability 1-ε, and a random action with probability ε.
    3. Execute action \( a_t \) and obtain the transition \( e_t = (s_t, a_t, r_t, s_{t+1}) \).
    4. Compute the target value \( y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) \).
    5. Update \( \theta \) in the direction that minimizes the loss \( (y_t - Q(s_t, a_t; \theta))^2 \).

→ DQN applies experience replay at step 3, where the transition samples are collected, and the target network at steps 4-5, where the weights are updated. A rough sketch of the vanilla loop follows.
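The hedged sketch below walks through steps 2-5 above without replay or a target network yet. It assumes a Gym-style `env` and a small fully connected `q_net`; all names and hyperparameters are illustrative, not taken from the post.

```python
import random
import torch
import torch.nn as nn

gamma, epsilon = 0.99, 0.1                      # illustrative hyperparameters
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(env, state):
    # 2. epsilon-greedy action selection
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

    # 3. execute a_t and observe the transition (s_t, a_t, r_t, s_{t+1})
    next_state, reward, done, *_ = env.step(action)   # works for old/new Gym step APIs

    # 4. target value y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta)
    with torch.no_grad():
        next_q = q_net(torch.as_tensor(next_state, dtype=torch.float32)).max()
        y = reward + gamma * next_q * (0.0 if done else 1.0)

    # 5. gradient step on the squared TD error (y_t - Q(s_t, a_t; theta))^2
    q_sa = q_net(torch.as_tensor(state, dtype=torch.float32))[action]
    loss = (y - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return next_state, done
```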

 

CNN Architecture

  • CNN์„ ๋„์ž…ํ•ด์„œ ์ธ๊ฐ„๊ณผ ๋น„์Šทํ•œ ํ˜•ํƒœ๋กœ Atari ๊ฒŒ์ž„(๋ฒฝ๋Œ๊นจ๊ธฐ)์˜ input์„ ์ฒ˜๋ฆฌํ•จ
    • ๋†’์€ ์ฐจ์›์˜ ์ด๋ฏธ์ง€ ์ธํ’‹ ์ฒ˜๋ฆฌ์— ํšจ๊ณผ์  → linear function approximator๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹๋‹ค
  • CNN ์ž…๋ ฅ์œผ๋กœ action ์ œ์™ธ, state๋งŒ ๋ฐ›๊ณ  ์ถœ๋ ฅ์œผ๋กœ action์— ํ•ด๋‹นํ•˜๋Š” ๋ณต์ˆ˜ ๊ฐœ์˜ Q-Value๋ฅผ ์–ป๋Š” ๊ตฌ์กฐ
    • Q-Value ์—…๋ฐ์ดํŠธ๋ฅผ ์œ„ํ•ด 4๋ฒˆ ๊ณผ์ • ๊ณ„์‚ฐ์„ ํ•  ๋•Œ, max๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ ์—ฌ๋Ÿฌ ๋ฒˆ action๋งˆ๋‹ค state๋ฅผ ํ†ต๊ณผํ•˜์ง€ ์•Š๊ณ  ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐ ๊ฐ€๋Šฅ
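A hedged sketch of such a network: the input is only the state (a stack of 4 preprocessed 84x84 frames) and the output is one Q-value per action, so the max over actions needs a single forward pass. The layer sizes follow the commonly cited DQN configuration and are assumptions here, not copied from the post.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State in, one Q-value per action out."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),            # one output unit per action
        )

    def forward(self, stacked_frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(stacked_frames))

q_net = QNetwork(n_actions=4)
states = torch.zeros(32, 4, 84, 84)               # batch of stacked frames
q_values = q_net(states)                          # (32, 4): every action at once
max_q = q_values.max(dim=1).values                # max over actions in one pass
```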

 

Experience Replay

  1. At every step, store the collected sample (the result of executing the action) in the replay memory D.
  2. Sample uniformly at random from D and use the drawn samples for the Q-update.

  • The currently selected action is still executed to obtain the resulting value and sample (same as before),
  • but the sample is not used for the update right away; its use is delayed! (A minimal buffer sketch follows.)
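A minimal replay-memory sketch along these lines: transitions \( e_t = (s_t, a_t, r_t, s_{t+1}, done) \) are pushed every step and only drawn later, uniformly at random. The capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped

    def push(self, state, action, reward, next_state, done):
        # 1. store the transition instead of training on it immediately
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # 2. uniform random sampling breaks the temporal ordering of the samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory()
# every step:      memory.push(s, a, r, s_next, done)
# training later:  batch = memory.sample(32)
```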

1) Sample correlation

  • Deep learning models are trained under the assumption that the training samples are drawn independently.
  • In reinforcement learning, however, consecutive samples are dependent: the previous sample influences how the next sample is generated.
  • ⇒ Storing samples in the replay memory D and drawing them at random means a sample no longer necessarily determines the next training example == an approximation of independent sampling.

2) Change in the data distribution

  • ๋ชจ๋ธ์„ ์ •์ฑ… ํ•จ์ˆ˜์— ๋”ฐ๋ผ์„œ ํ•™์Šตํ•  ๊ฒฝ์šฐ Q-update๋กœ ์ •์ฑ…์ด ๋ฐ”๋€Œ๊ฒŒ ๋˜๋ฉด ์ดํ›„์— ์ƒ์„ฑ๋˜๋Š” data์˜ ๋ถ„ํฌ๋„ ๋ฐ”๋€” ์ˆ˜ ์žˆ๋‹ค.
    • ํ˜„์žฌ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ƒ˜ํ”Œ์— ์˜ํ–ฅ์„ ์ฃผ๋ฏ€๋กœ ํŽธํ–ฅ๋œ ์ƒ˜ํ”Œ๋งŒ ํ•™์Šต๋  ์ˆ˜ ์žˆ์Œ
    • ์ขŒ์ธก์œผ๋กœ ๊ฐ€๋Š” Q ๊ฐ’์ด ํฌ๋ฉด ๋‹ค์Œ์—๋„ ์™ผ์ชฝ์œผ๋กœ๋งŒ ๊ฐ€๋Š” feed back loop ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Œ
  • ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ํ•™์Šต ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ถˆ์•ˆ์ •์„ฑ์„ ์œ ๋ฐœํ•˜๊ณ  ์ด๋กœ ์ธํ•ด local minimum ์ˆ˜๋ ด / ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฐœ์‚ฐ ๋“ฑ์˜ ๋ฌธ์ œ ๋ฐœ์ƒ ๊ฐ€๋Šฅ
  • ⇒ replay memory์—์„œ ์ถ”์ถœ๋œ ๋ฐ์ดํ„ฐ๋Š” ์‹œ๊ฐ„ ์ƒ์˜ ์—ฐ๊ด€์„ฑ์ด ์ ๋‹ค. + Q-function์ด ์—ฌ๋Ÿฌ action์„ ๋™์‹œ์— ๊ณ ๋ คํ•ด์„œ ์—…๋ฐ์ดํŠธ(CNN)๋˜๋ฏ€๋กœ ์ •์ฑ… ํ•จ์ˆ˜์™€ training data์˜ ๋ถ„ํฌ๊ฐ€ ๊ธ‰๊ฒฉํžˆ ํŽธํ–ฅ๋˜์ง€ ์•Š๊ณ  smoothing๋จ
  • → ๊ธฐ์กด deep Q-learning์˜ ํ•™์Šต ๋ถˆ์•ˆ์ •์„ฑ ๋ฌธ์ œ ํ•ด๊ฒฐ

Advantages of Experience Replay

  • Higher data efficiency: a single sample can be reused for many model updates.
  • Lower sample correlation: uniform random sampling reduces the variance of the updates.
  • More stable training: the behavior distribution is averaged out, suppressing parameter oscillation and divergence.

 

Target Network

  1. Use the target network \( \theta^- \) to compute the target value \( y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) \).
  2. Use the main Q-network \( \theta \) to compute the action value \( Q(s_t, a_t; \theta) \).
  3. Update the main Q-network \( \theta \) so that the loss \( (y_t - Q(s_t, a_t; \theta))^2 \) is minimized.
  4. Every C steps, update the target network \( \theta^- \) with the main Q-network \( \theta \).
  • The original Q-network is duplicated, giving a two-network structure: a main Q-network and a target network.

  • Main Q-network: used to compute the action value Q from the state/action; its parameters are updated at every step.
  • Target network: used to compute the target value that serves as the reference for the parameter update; it is not updated every step, and its parameters are synchronized with the main Q-network only every C steps.

⇒ Mitigates the instability caused by the moving-target-value problem.

3) Moving target value

  • Vanilla (deep) Q-learning updates the model parameters with a gradient-descent-based method.
  • The target value \( y_t \), the reference point of the update, is also parameterized by \( \theta \) → the action value and the target value move at the same time.
  • \( \theta \) may therefore not be updated in the desired direction, causing training instability.

⇒ If the target value is held fixed via the target network for C steps, the model can be updated in a consistent direction during those C steps; synchronizing the parameters after every C steps then keeps the bias small (see the sketch below).
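A hedged sketch of this two-network update: the target \( y_t \) comes from the frozen copy \( \theta^- \), the gradient flows only through the main network \( \theta \), and the two are synchronized every C steps. `q_net` and `memory` are carried over from the sketches above; the hyperparameters are illustrative.

```python
import copy
import numpy as np
import torch
import torch.nn.functional as F

gamma, C = 0.99, 1_000                            # illustrative hyperparameters
target_net = copy.deepcopy(q_net)                 # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=2.5e-4)

def dqn_update(batch, step):
    s, a, r, s_next, done = zip(*batch)
    s = torch.as_tensor(np.stack(s), dtype=torch.float32)
    s_next = torch.as_tensor(np.stack(s_next), dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    # 1. target value y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-), fixed for C steps
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)

    # 2-3. action value Q(s_t, a_t; theta) and squared-error loss, updating theta only
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4. every C steps, synchronize theta^- with theta
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```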

Preprocessing

  • Square grayscale images are used to match the 2D convolutional layers.
  • Downsampling.
  • 4 consecutive frames are stacked into a single Q-function input (see the sketch below).
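A preprocessing sketch along the lines above: each frame is converted to grayscale, downsampled to a square (84x84 is assumed here as the commonly used size), and the 4 most recent frames are stacked into one network input. The use of OpenCV for resizing is an assumption.

```python
from collections import deque
import numpy as np
import cv2  # opencv-python, assumed available for resizing

def preprocess(frame: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                     # grayscale
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)   # square downsample
    return small.astype(np.float32) / 255.0

frame_stack = deque(maxlen=4)                      # keep the 4 most recent frames

def stacked_state(frame: np.ndarray) -> np.ndarray:
    frame_stack.append(preprocess(frame))
    while len(frame_stack) < 4:                    # pad at the start of an episode
        frame_stack.append(frame_stack[-1])
    return np.stack(frame_stack, axis=0)           # shape (4, 84, 84)
```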

 

Model Architecture

Model evaluation

  • Maximum reward per episode
  • Maximum average Q-value (a smoother metric; see the sketch below)

 

 

References

 

  • Playing Atari with Deep Reinforcement Learning (arxiv.org)
  • [RL] Reinforcement Learning Algorithms: (1) DQN (Deep Q-Network) (ai-com.tistory.com)
  • Playing Atari with Deep Reinforcement Learning · Pull Requests to Tomorrow (jamiekang.github.io)
  • [Paper Review] Human-level Control through Deep Reinforcement Learning (DQN) (limepencil.tistory.com)

 
