[Reinforcement Learning] Markov Decision Process & Q-Learning


1. Markov Decision Process (MDP)

Reference: ๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ๋ฐฐ์šฐ๋Š” ๊ฐ•ํ™”ํ•™์Šต - Markov Decision Process

Markov Process

  • ์ƒํƒœ S์™€ ์ „์ดํ™•๋ฅ ํ–‰๋ ฌ P๋กœ ์ •์˜๋จ
    • ํ•˜๋‚˜์˜ ์ƒํƒœ์—์„œ ๋‹ค๋ฅธ ์ƒํƒœ๋กœ ์ „์ด๊ฐ€ ์ผ์–ด๋‚จ
    • ์ƒํƒœ ์ „์ด์— ๊ฐ๊ฐ ํ™•๋ฅ  ์กด์žฌ
    • S4์˜ ๊ฒฝ์šฐ ์ข…๋ฃŒ์ƒํƒœ

Markov Property

$$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t] $$

  • ์ƒํƒœ๊ฐ€ ๋˜๊ธฐ๊นŒ์ง€์˜ ๊ณผ์ •์€ ํ™•๋ฅ  ๊ณ„์‚ฐ์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š์Œ.
  • ์–ด๋А ์‹œ์ ์˜ ์ƒํƒœ๋กœ ๋‹ค์Œ ์ƒํƒœ๋ฅผ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ์„ ๋•Œ ๋งˆ๋ฅด์ฝ”ํ”„ํ•œ ์ƒํƒœ๋ผ๊ณ  ํ•จ.๋ฐ˜๋ก€) ์šด์ „ํ•˜๋Š” ์‚ฌ์ง„(์–ด๋А ์‹œ์ ์˜ ์‚ฌ์ง„์œผ๋กœ๋Š” ํ›„์ง„/์ „์ง„/์†๋„ ๋“ฑ์„ ํŒŒ์•… ๋ถˆ๊ฐ€ → ๋‹ค์Œ ์ƒํƒœ ๊ฒฐ์ • ๋ถˆ๊ฐ€๋Šฅ)
  • ex) ์ฒด์Šค ๊ฒŒ์ž„(์–ด๋А ์‹œ์ ์˜ ์‚ฌ์ง„์œผ๋กœ ๋‹ค์Œ ์ˆ˜ ๊ฒฐ์ • ๊ฐ€๋Šฅ)

์–ด๋–ค ํ˜„์ƒ์„ ๋งˆ๋ฅด์ฝ”ํ”„ ํ”„๋กœ์„ธ์Šค๋กœ ๋ชจ๋ธ๋งํ•˜๋ ค๋ฉด ์ƒํƒœ๊ฐ€ ๋งˆ๋ฅด์ฝ”ํ”„ ํ•ด์•ผํ•˜๋ฉฐ, ๋‹จ์ผ ์ƒํƒœ ์ •๋ณด๋งŒ์œผ๋กœ๋„ ์ •๋ณด๊ฐ€ ์ถฉ๋ถ„ํ•˜๋„๋ก ์ƒํƒœ๋ฅผ ์ž˜ ๊ตฌ์„ฑํ•ด์•ผ ํ•จ

Markov Decision Process (MDP)

  • At each step the agent takes an action; the action changes the state, and a reward is given accordingly

$$ MDP = (S, A, P, R, \gamma) $$

  • S ⇒ the set of states
  • A ⇒ the set of actions
  • P ⇒ the transition probability matrix
    • the probability that the next state is s' when action a is chosen in the current state s
    $$ P^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a] $$
  • R ⇒ the reward function
    • the expected reward received when action a is chosen in the current state s
    $$ R^a_s = E[R_{t+1} \mid S_t = s, A_t = a] $$
  • $\gamma$ ⇒ the discount factor
    • $0 < \gamma < 1$
    • determines how much weight future rewards carry compared to immediate rewards
    • $\gamma$ is multiplied repeatedly onto rewards further in the future, shrinking their contribution

Policy Function

  • ๊ฐ ์ƒํƒœ์—์„œ ์–ด๋–ค ์•ก์…˜์„ ์„ ํƒํ• ์ง€ ๊ฒฐ์ •ํ•˜๋Š” ํ•จ์ˆ˜= ์ƒํƒœ s์—์„œ ์•ก์…˜ a๋ฅผ ์„ ํƒํ•  ํ™•๋ฅ 
    • s์—์„œ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋Š” ์•ก์…˜์ด a0, a1, a2๋ผ๊ณ  ํ•  ๋•Œ,
    $$ \pi(a_0|s) = 0.2, \pi(a_1|s) = 0.3, \pi(a_2|s) = 0.5 $$
    • ์•ก์…˜์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ€์—ฌ
  • $$ \pi(a|s) = P[A_t = a | s_t = s] $$
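A stochastic policy like this can be stored directly as a table of probabilities; the sketch below (state and action names are placeholders) draws an action according to $\pi(a|s)$.

```python
import random

# pi[s][a] = probability of choosing action a in state s (illustrative values)
pi = {"s": {"a0": 0.2, "a1": 0.3, "a2": 0.5}}

def select_action(policy, s):
    """Sample an action from the policy's distribution for state s."""
    actions = list(policy[s].keys())
    probs = list(policy[s].values())
    return random.choices(actions, weights=probs)[0]

print(select_action(pi, "s"))  # 'a2' about half the time
```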

์ƒํƒœ ๊ฐ€์น˜ ํ•จ์ˆ˜

  • The expected return obtained when, starting from the given state, the agent acts according to the policy function
  • i.e., the value of following the policy function from s until the end of the episode (computing the discounted return is sketched below)

$$ v_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] = E_\pi[G_t \mid S_t = s] $$
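The return $G_t$ inside the expectation is just the discounted sum of the rewards collected after time $t$. A small helper like the one below (reward values and $\gamma$ chosen for illustration) computes it; averaging such returns over many episodes that start from $s$ gives a Monte Carlo estimate of $v_\pi(s)$.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards received after leaving state s in one episode (illustrative values)
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```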

Action Value Function

  • ๊ฐ ์ƒํƒœ์—์„œ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋Š” ์•ก์…˜์„ ์ „๋ถ€ ํ‰๊ฐ€ํ•ด๋ณด๊ณ  ๊ฐ€์žฅ ๋ณด์ƒ์ด ํฐ ์•ก์…˜ ์„ ํƒ
  • input ⇒ state ์™€ action ⇒ action๊ณผ ์ƒํƒœ๋ฅผ ๊ฒฐํ•ฉํ•ด์„œ ํ‰๊ฐ€

$$ q_\pi(s, a) = E_\pi[G_t|S_t = s, A_t = a] $$

  • s์—์„œ a๋ฅผ ์„ ํƒํ•œ ๋’ค ๊ทธ ๋’ค๋กœ๋Š” ์ •์ฑ… ํ•จ์ˆ˜๋ฅผ ๋”ฐ๋ผ์„œ ์›€์ง์ผ ๋•Œ ์–ป๋Š” ๋ฆฌํ„ด์˜ ๊ธฐ๋Œ“๊ฐ’

 

2. Q-Learning

Model-Free Algorithms

  • ๊ธฐ์กด์˜ model based algorithm ๋“ค์€ ์–ด๋–ค ์ƒํƒœ์—์„œ ์–ด๋–ค ํ–‰๋™์„ ํ•œ๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ, ๋‹ค์Œ์˜ ์ƒํƒœ๊ฐ€ ๋  ํ™•๋ฅ 
  • ํ™˜๊ฒฝ์— ๋Œ€ํ•ด ์•Œ๊ณ , ํ–‰๋™์— ๋”ฐ๋ฅธ ํ™˜๊ฒฝ์˜ ๋ณ€ํ™”๋ฅผ ์•„๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜
    • ๊ณ„์† ํƒํ—˜ํ•ด์„œ Trail and error๋ฅผ ์–ป๊ณ  policy function์„ ์ ์ฐจ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉํ–ฅmodel free algorithm ⇒ agent๊ฐ€ ํ–‰๋™์„ ํ†ตํ•ด ์˜ˆ์ƒ๋˜๋Š” ๋ณด์ƒ์˜ ์ดํ•ฉ์„ ์ตœ๋Œ€๋กœ ํ•˜๋Š” policy function์„ ์ฐพ๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ

  • ํ™˜๊ฒฝ์— ๋Œ€ํ•ด ๋ชจ๋ฅด๊ณ , ํ™˜๊ฒฝ์ด ์•Œ๋ ค์ฃผ๋Š” ๋‹ค์Œ ์ƒํƒœ์™€ ๋‹ค์Œ ๋ณด์ƒ์„ ์ˆ˜๋™์ ์œผ๋กœ ํš๋“

Q-Learning

  • In a finite Markov decision process, learn the optimal policy that says which action to take in which situation

(Figure: Markov Decision Process)

Q-Value

  • ํ˜„์žฌ ์ƒํƒœ๋กœ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด ๋ชจ๋“  ์—ฐ์†์ ์ธ ๋‹จ๊ณ„๋ฅผ ๊ฑฐ์ณค์„ ๋•Œ ์ „์ฒด ๋ณด์ƒ์˜ ์˜ˆ์ธก๊ฐ’ ๊ทน๋Œ€ํ™”
  • Q ⇒ ํ˜„์žฌ ์ƒํƒœ์—์„œ ์ทจํ•œ ํ–‰๋™์˜ ๋ณด์ƒ(Quality)
  • ์•ก์…˜ ๊ฐ€์น˜ ํ•จ์ˆ˜(ํ–‰๋™ ๊ฐ€์น˜ ํ•จ์ˆ˜) ๊ธฐ๋ฐ˜์œผ๋กœ Q-Value ๊ณ„์‚ฐ
    • $\gamma$ ๋Š” ํ˜„์žฌ๋กœ๋ถ€ํ„ฐ t ์‹œ๊ฐ„ ํ›„์˜ ๋ณด์ƒ์— ๋Œ€ํ•œ ์˜ํ–ฅ๋ ฅ์„ ์–ผ๋งˆ๋‚˜ ์ค„์ผ์ง€, ์ฆ‰ ํ˜„์žฌ ์ƒํƒœ์— ๋Œ€ํ•œ ๋ณด์ƒ์˜ ์ค‘์š”๋„๋ฅผ ์–ด๋А์ •๋„๋กœ ์ฑ…์ •ํ• ์ง€๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๊ฐ’
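A quick numeric check of how repeated multiplication by $\gamma$ shrinks far-future rewards ($\gamma = 0.9$ is chosen only for illustration):

```python
gamma = 0.9  # illustrative discount factor

# Weight applied to a reward received k steps in the future
for k in (0, 1, 5, 10, 50):
    print(k, round(gamma ** k, 4))
# 0 1.0 / 1 0.9 / 5 0.5905 / 10 0.3487 / 50 0.0052
```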

Q-Learning Update

  • The Q function starts from arbitrary fixed values → at every time step the agent selects an action $a_t$, receives a reward $r_t$, transitions to the new state $s_{t+1}$, and updates the Q value
  • The core is a value-iteration-style update that takes a weighted sum of the old value and the newly learned information (a tabular sketch follows after the formula below)

$$ Q(s_{t},a_{t}) \leftarrow (1-\alpha)\cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \Big( \overbrace{\underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a} Q(s_{t+1},a)}_{\text{estimate of optimal future value}}}^{\text{learned value}} \Big) $$
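Putting the update rule into code, the sketch below is a minimal tabular Q-learning loop with epsilon-greedy exploration. The corridor environment, the hyperparameter values, and the episode count are all invented for illustration and are not part of the original post.

```python
import random
from collections import defaultdict

# Tiny toy environment invented for illustration: a 1-D corridor.
# States 0..4, start at 0; state 4 is terminal and gives reward +1,
# every other step costs -0.01. Actions: 0 = left, 1 = right.
class Corridor:
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else min(4, self.s + 1)
        done = self.s == 4
        reward = 1.0 if done else -0.01
        return self.s, reward, done

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
actions = [0, 1]
Q = defaultdict(float)                  # Q[(s, a)] starts at an arbitrary (zero) value
env = Corridor()

for episode in range(500):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s_next, r, done = env.step(a)
        # weighted sum of the old value and the learned value (the update rule above)
        target = r + gamma * max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next

# Greedy policy learned for the non-terminal states (expected: always go right)
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(4)})
```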
