[๊ฐ•ํ™”ํ•™์Šต] Dealing with Sparse Reward Environments - ํฌ๋ฐ•ํ•œ ๋ณด์ƒ ํ™˜๊ฒฝ์—์„œ ํ•™์Šตํ•˜๊ธฐ

2023. 10. 23. 09:28

โ€ป ์•„๋ž˜ ๋งํฌ์˜ ๋‚ด์šฉ์„ ๊ณต๋ถ€ํ•˜๋ฉฐ ํ•œ๊ตญ์–ด๋กœ ์ •๋ฆฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

 

Source: Reinforcement Learning: Dealing with Sparse Reward Environments (medium.com)

 

1. Sparse Reward

  • Sparse Reward: the agent receives a positive reward only when it gets close to the goal state
    • the same setting as my current experiments
  There are three main ways of dealing with it:
  1. Curiosity-driven method
    • give the agent motivation to explore parts of the environment outside its immediate interest
  2. Curriculum learning
    • build a curriculum so the agent can reach the goal through simpler intermediate steps
  3. Auxiliary tasks
    • auxiliary tasks do not directly solve the initial sparse reward problem, but they help improve the agent's performance

Sparse Reward Task

  • ํฌ์†Œ ๋ณด์ƒ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ํ˜•์‹ = ํ˜„์žฌ agent ์ƒํƒœ๋ฅผ s๋กœ, ๋ชฉํ‘œ ์ƒํƒœ๋ฅผ s_g๋ผ๊ณ  ํ•  ๋•Œ, s - s_g์˜ ๊ฐ’์ด ์ž„๊ณ„๊ฐ’๋ณด๋‹ค ์ž‘์œผ๋ฉด ํ•ด๋‹น ๋ชฉํ‘œ๋ฅผ ๋‹ฌ์„ฑํ•œ ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•˜๋Š” ๊ฒƒ.

  • ๋ณด์ƒ์„ ๋ฐ›๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ดˆ๊ธฐ์ƒํƒœ s_0๋ถ€ํ„ฐ ํ™˜๊ฒฝ ํƒ์ƒ‰์„ ์‹œ์ž‘ํ•ด์•ผํ•จ.
  • ์ผ์ข…์˜ local minimum in gradient descent ์— ๋น ์ง€์ง€ ์•Š๊ธฐ ์œ„ํ•ด ์„ ํƒํ•ด๋ณด์ง€ ์•Š์€ ํ–‰๋™๋„ ์„ ํƒํ•˜๋ฉฐ ํ™˜๊ฒฝ์„ ํƒ์ƒ‰ํ•ด๋‚˜๊ฐ€์•ผํ•˜๊ณ , ๋™์‹œ์— ๋ณด์ƒ์ด ๋งŽ์€ ๋ฐฉํ–ฅ์œผ๋กœ ์ •์ฑ…์„ ์—…๋ฐ์ดํŠธ๋„ ํ•ด์•ผํ•จ
  • ํ™˜๊ฒฝ ํƒ์ƒ‰๊ณผ ๋ณด์ƒ ์ด์šฉ(๊ฐœ๋ฐœ)์˜ trade-off ๋ฌธ์ œ๋ฅผ ${\epsilon}-greedy$ ๋ฐฉ๋ฒ•์„ ์จ์„œ action์„ ํ™•๋ฅ ์ด ๋†’์€ ๊ฒƒ๊ณผ ๋žœ๋คํ•œ ๊ฒƒ ์ค‘ ๊ณ ๋ฅด๋ฉด์„œ ํƒ์ƒ‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ํ•ด๊ฒฐํ•œ ์‚ฌ๋ก€๊ฐ€ ์žˆ์Œ.

Reward Shaping

  • ๊ธฐ๋ณธ ๋ณด์ƒ์„ ์ถ”๊ฐ€์ ์ธ ์ž‘์—…์„ ํ†ตํ•ด์„œ ๊ฐœ์„ ํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•จ. ๊ฐ€์žฅ ์ง๊ด€์ ์ธ ๋ฐฉ๋ฒ•
  • ์ถ”๊ฐ€ ๋ณด์ƒ์„ ํ†ตํ•ด์„œ ์ ์ ˆํ•˜๊ฒŒ ์‹ค์ œ ํฌ์†Œ ๋ณด์ƒ๊ณผ์˜ ๊ฐญ์„ ์ปค๋ฒ„ํ•˜๋Š” ๊ฒƒ
  • ๋นˆ๋Œ€ ์žก๋‹ค๊ฐ€ ์ดˆ๊ฐ€์‚ผ๊ฐ„ ํƒœ์šธ ์ˆ˜ ์žˆ์Œ -> ์ด๋Ÿฌํ•œ ๋ณด์ƒ ํ•จ์ˆ˜๋Š” ์ฃผ๋กœ ํ•ธ๋“œ๋ฉ”์ด๋“œ๊ณ  ์‚ฌ๋žŒ์˜ ์ „๋ฌธ์„ฑ์„ ํ•„์š”๋กœ ํ•จ.
    • ์ด๋Ÿฐ ๊ฒฝ์šฐ์— ์ •์ฑ… ํ•™์Šต ์ค‘์— human bias์ด ๋ฐ˜์˜๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์„ ์ˆ˜๋„ ์žˆ์Œ
    • ์‚ฌ๋žŒ์ด ์ฐพ์ง€ ๋ชปํ•œ ์ƒˆ๋กœ์šด ์ •์ฑ…์„ ์ฐพ๋Š” ๊ฒƒ์—๋„ ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Œ

2. Curiosity-Driven Method

  • The hypothesis behind curiosity-driven methods is that visiting states the agent has never experienced can compensate for the sparse reward, so such visits should be encouraged.
  • The real-world analogy is a baby that learns by exploring its environment out of curiosity.
    • It is first fascinated by its own body and, once that becomes familiar, shifts its attention to the other objects in its environment.
  • In the same way, we expect the agent to explore out of curiosity and to pick actions that take it toward its most unfamiliar (unusual) states.

Intrinsic curiosity-driven exploration by self-supervised prediction

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction.”

  • agent๊ฐ€ ์ƒˆ๋กœ์šด ์ƒํƒœ๋ฅผ ์ฐพ์•„๊ฐ€๋Š” ๋ฐฉ์‹์œผ๋กœ ํ™˜๊ฒฝ์„ ํƒ์ƒ‰ํ•˜๊ณ  ํ–‰๋™์˜ ๊ฒฐ๊ณผ ์˜ˆ์ธก์˜ ์˜ค๋ฅ˜๋ฅผ ์ค„์ด๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ–‰๋™์„ ์„ ํƒํ•˜๋„๋ก ํ•™์Šต
  • ๋‚ด์žฌ์  ํ˜ธ๊ธฐ์‹ฌ ๋ชจ๋“ˆ(Intrinsic curiosity module ICM)์„ ํ†ตํ•ด ํ˜ธ๊ธฐ์‹ฌ์„ ๊ตฌํ˜„
    • ๋‘ ๊ฐœ์˜ neural network๋ฅผ hidden layer์™€ ๊ฒฐํ•ฉํ•˜๋Š” ํ˜•ํƒœ
    • pixel observation ์ž„๋ฒ ๋”ฉ์„ ์œ„ํ•ด์„œ(?)

1. The Dynamics model

- ์„ ํƒํ•œ ํ–‰๋™ $a_t$ ์™€ ์ƒํƒœ $s_t$๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ค์Œ ์ƒํƒœ $s_{t+1}$์„ ์˜ˆ์ธกํ•จ
- ๋ชจ๋ธ์˜ ์˜ˆ์ธก๊ฐ’๊ณผ ์‹ค์ œ state์˜ ํŽธ์ฐจ๋ฅผ ์ƒˆ๋กœ์›€์œผ๋กœ ๊ฐ„์ฃผ
- agent๊ฐ€ ์˜ˆ์ธก์„ ์ญ‰ ์ตœ์ ํ™” ํ•˜๋ฉด์„œ ๋™์‹œ์— ์˜ˆ์ธก์ด ํ‹€๋ฆฐ ์ƒํƒœ๋“ค์„ ์ฐพ์•„๊ฐ€๊ฒŒ ๋˜๋ฉด agent๋Š” ์ƒˆ๋กœ์šด ์ƒํƒœ์— ๋ฐฉ๋ฌธํ•˜๋Š” action์„ ์ง€์†์ ์œผ๋กœ ์ทจํ•  ์ˆ˜ ์žˆ์Œ

2. The Inverse model

- ํ˜„์žฌ ์ƒํƒœ $s_t$์™€ ๋‹ค์Œ ์ƒํƒœ $s_{t+1}$์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ–‰๋™ $a_t$ ๋ฅผ ์˜ˆ์ธกํ•จ
- ICM์ด ์ƒ์‘ํ•˜๋Š” ํ–‰๋™์— ๋Œ€ํ•œ ์˜ˆ์ธก๊ณผ ๊ด€๋ จ์ด ์žˆ๋Š” observation ํŠน์„ฑ๋“ค๋งŒ ์ž„๋ฒ ๋“œ ํ•œ๋‹ค๋Š” ๊ฒƒ์— ๊ธฐ๋ฐ˜ํ•จ
- ํ–‰๋™์„ ์„ ํƒํ•˜๋Š”๋ฐ ๊ด€๋ จ์ด ์—†๋Š” input space์˜ ์ •๋ณด์— ๊ด€์‹ฌ์„ ๊ฐ–์ง€ ์•Š๊ฒŒ
-

state space์— ์ „์ฒด access ๊ถŒํ•œ์ด ์žˆ์œผ๋ฉด inverse model ์‚ฌ์šฉํ•  ํ•„์š” ์—†์Œ

  • agent๋Š” ๋งŽ์€ ๋ถ€๋ถ„์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ํ•œ ๋ฒˆ์— ์ตœ์ ํ™”ํ•˜๊ฒŒ ๋จ
    • $L_I$ : inverse dynamic model๋กœ ์˜ˆ์ธกํ•œ ํ–‰๋™๊ณผ ์‹ค์ œ ํ–‰๋™ $a_t$ ์‚ฌ์ด์˜ ์ฐจ์ด๋ฅผ ์ตœ์†Œํ™”
    • $L_F$ : dynamic mocdel์˜ ์˜ˆ์ธก์„ ๋ฐœ์ „์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋ชฉํ‘œ ํ•จ์ˆ˜์˜ ์ฐจ์ด๋ฅผ ์ค„์ž„
    • $R$ : ์˜ˆ์ƒ๋˜๋Š” ๋ˆ„์  ์™ธ๋ถ€ ๋ณด์ƒ

  • $0 <= \beta <= 1$ ์ผ ๋•Œ inverse model loss๋Š” forward model loss์˜ ๋ฐ˜๋Œ€
  • $\lambda > 0$ ๋Š” ์™ธ๋ถ€ ๋ณด์ƒ์ด ๋‚ด์žฌ ๋ณด์ƒ ์‹ ํ˜ธ์— ์™ธ๋ถ€ ์š”์†Œ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ๊ฐ€
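
Putting the pieces above together, the combined objective of the paper takes roughly the form

$$\min_{\theta_P,\,\theta_I,\,\theta_F}\;\Big[-\lambda\,\mathbb{E}_{\pi(s_t;\theta_P)}\big[\textstyle\sum_t r_t\big] \;+\; (1-\beta)\,L_I \;+\; \beta\,L_F\Big]$$

where $\theta_P$, $\theta_I$, $\theta_F$ are the parameters of the policy, the inverse model, and the forward model, respectively.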

Planning to Explore via self-supervised World Models

R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak, “Planning to explore via self-supervised world models.”

  • Curiosity can also be used with model-based agents.
  • The agent first explores the environment without any extrinsic reward, training itself in a self-supervised way to build a global world model.
  • Afterwards, the agent is given reward functions for various specific tasks so that it can adapt to situations it has not experienced.

Why Model-Based self-supervised curiosity exploration

  1. model-freeํ•œ intrinsic curiosity model ๊ฐ™์€ ๊ฒฝ์šฐ์—๋Š” ํŠน์ •ํ•œ task์— ์ ์‘ํ•˜๊ธฐ ์œ„ํ•œ ์ •์ฑ… ํƒ์ƒ‰์„ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์ด ํ•„์š”ํ•˜๋‹ค๊ณ  ์ฃผ์žฅํ•จ
  2. ๊ธฐ์กด curiosity ๋ฐฉ์‹์€ ์ตœ๊ทผ ๋ฐฉ๋ฌธํ•œ ์ƒํƒœ์˜ curiocity(ํŽธ์ฐจ?)๋ฅผ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ, ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ์ƒˆ๋กœ์šด ์ƒํƒœ๊ฐ€ ์•„๋‹ˆ๋ผ ์ด๋ฏธ ๋ฐฉ๋ฌธํ•œ ์ƒํƒœ๋ฅผ ์„ ํƒํ•˜๊ฒŒ ๋จ
  3. Inverse model์˜ ์‹ค์ œ ์ƒํƒœ์™€ ์˜ˆ์ธก ์ƒํƒœ์˜ ์ฐจ์ด๊ฐ€ ๋งŽ์€ ํ–‰๋™์„ ์ฐพ๋Š” ๋Œ€์‹  dynamic model์˜ ์•™์ƒ๋ธ”์„ ์‚ฌ์šฉํ•ด์„œ ๋‹ค์Œ ์ƒํƒœ ์˜ˆ์ธก์˜ ๋ถˆ์ผ์น˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ๋กœ ํ•จ.

How to implement

  • At each time step, the high-dimensional observation $o_t$ of the environment is first encoded into a feature $h_t$
  • $h_t$ is fed as input into the recurrent latent state $s_t$
  • The exploration policy returns actions that move the agent toward novel states it is currently least familiar with
  • In the first, exploration phase, the agent continuously collects data, trains the global world model, and then chooses its actions for further exploration of the environment
  • The exploration policy inside the world model is evaluated by using the disagreement of multiple dynamics models as a measure of state novelty
    • = Latent Disagreement
  • An ensemble of one-step prediction models is used; the ensemble's uncertainty is quantified as the variance over the models' one-step mean predictions.
  • Each one-step predictive model predicts the next feature state $h_{t+1}$
  • The variance (disagreement) of the predicted future feature states becomes the intrinsic reward (a minimal sketch follows below)
  • To decide on optimal actions, Plan2Explore uses the latent dynamics model of PlaNet or Dreamer
    • to efficiently learn a parametric policy inside the world model
  • The trained world model predicts future latent states from the data in the replay buffer

3. Curriculum Learning

  • agent์—๊ฒŒ ์˜๋ฏธ์žˆ๋Š” sequence๋ฅผ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ task๋“ค์„ ์ฃผ๊ณ , task๋“ค์€ agent๊ฐ€ ์ฒ˜์Œ ์ฃผ์–ด์ง„ task๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์„ ๋•Œ๊นŒ์ง€ ์‹œ๊ฐ„์— ๋”ฐ๋ผ ์ ์  ๋ณต์žกํ•ด์ง.Automatic Goal Generation for Reinforcement Learning

C. Florensa, D. Held, X. Geng, and P. Abbeel, “Automatic goal generation for reinforcement learning agents.”

  • For curriculum learning it is not enough to simply hand the agent tasks to solve; the tasks have to be provided in a meaningful order.
  • The agent starts with easy tasks and, as the training period progresses, has to solve increasingly difficult tasks until it can solve the original task.
  • GoalGAN can be used to generate this meaningful ordering (a minimal sketch follows below)
    • a model that generates goals the agent is currently able to solve

4. Auxiliary Tasks

M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - solving sparse reward tasks from scratch.”
M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks.”

  • ํ•™์Šตํ•˜๋Š” ๋™์•ˆ agent๊ฐ€ ๋ณด์กฐ(auxiliary) task๋ฅผ ํ†ตํ•ด์„œ ๋ณด์ƒ ํ™•์žฅ
  • "Learning by playing solving sparse reward tasks from scratch" ์˜ auxiliary task๋Š” curriculum์„ ํ™œ์šฉํ•œ main task์— ๊ธฐ๋ฐ˜ํ•˜๋Š” ๊ฑด ์•„๋‹˜
  • ๋Œ€์‹  task๊ฐ€ ๋ณด์กฐ ์ œ์–ด task์™€ ๋ณด์กฐ ๋ณด์ƒ ์˜ˆ์ธก task๋กœ ์ฐจ๋ณ„ํ™”๋จ

Auxiliary Control Tasks

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning.”

  1. Pixel Changes: starts from the idea that rapidly changing pixels are an indicator of important events. The agent tries to control the pixel changes by choosing the right actions (a minimal sketch follows below).
  2. Network Features: the agent tries to control the activations of hidden layers in its own value or policy network. Since the policy and value networks extract high-level features, being able to control those activations can be useful.
  • The auxiliary control and reward prediction tasks are combined in one agent that uses the A3C algorithm to optimize a shared objective function.
  • Because the NN layers are shared between the main and auxiliary tasks, the agent improves on all of them.

Case of Labyrinth environment

  • agent๊ฐ€ ๋ชฉํ‘œ์— ๋„์ฐฉํ–ˆ์„ ๋•Œ๋งŒ ๋ณด์ƒ์„ ์–ป๋Š” ํ™˜๊ฒฝ
  • agent๊ฐ€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์„ ๋•๊ธฐ ์œ„ํ•ด ์„ธ ๊ฐ€์ง€ ๋ณด์กฐ task๋ฅผ ์ œ์‹œํ•จ
  1. Pixel Control: Auxiliary ์ •์ฑ…์ด ์ž…๋ ฅ ์ด๋ฏธ์ง€์˜์—ฌ๋Ÿฌ ๋ถ€๋ถ„์—์„œ pixel์ด ์‹ฌํ•˜๊ฒŒ ๋ณ€ํ™”๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต
  2. Reward Prediction: replay buffer๋กœ๋ถ€ํ„ฐ 3๊ฐœ์˜ frame์„ ์ œ๊ณต๋ฐ›์•„์„œ network๊ฐ€ ๋ณธ ์  ์—†๋Š” ๋‹ค์Œ frame์˜ ๋ณด์ƒ์„ ์˜ˆ์ธกํ•จ.
    • ๋ณด์ƒ์ด ํฌ๋ฐ•ํ•˜๊ณ , ์ƒ˜ํ”Œ๋ง ์™œ๊ณก์ด ๋ฐœ์ƒํ•ด์„œ ๋ณด์ƒ์„ ๋ฐ›์€ frame์ด ๋” ๋Š˜์–ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์—
    • reward predictor๋Š” ๊ณ ์ฐจ์› input space๋ฅผ ์ €์ฐจ์› latent space๋กœ ๋ณ€ํ™˜ํ•˜๋Š” agent์˜ ํŠน์„ฑ ๋ ˆ์ด์–ด๋“ค์„ ํ˜•์„ฑํ•˜๋Š” ๊ฒƒ
  3. Vaule Fucntion Replay: agent๊ฐ€ A3C ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ on-policy value function๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์— ์ถ”๊ฐ€์ ์œผ๋กœ replay buffer์—์„œ sample์„ ํ•™์Šตํ•จ
    • value iteration์€ ๋‹ค์–‘ํ•œ frame ๊ธธ์ด์—์„œ ์‚ฌ์šฉ๋˜๊ณ , reward predictor์„ ํ†ตํ•ด์„œ ํ˜•์„ฑ๋œ ์ƒˆ๋กœ์šด feature๋“ค์„ ๋ฐœ๊ฒฌํ•ด์„œ ํ™œ์šฉํ•จ
  • ๊ฐ™์€ ๋ ˆ์ด์–ด๋“ค์„ ๊ณต์œ ํ•œ๋‹ค๊ณ  ํ•ด์„œ task๋“ค์ด ๋™์‹œ์— ๊ฐ™์€ ๋ฐ์ดํ„ฐ์—์„œ ํ•ด๊ฒฐ๋˜์ง€๋Š” ์•Š์Œ
  • ๋Œ€์‹  A3C agent๊ฐ€ ๋ฐฉ๋ฌธํ•œ observation๋“ค์„ ์ €์žฅํ•˜๋Š” replay buffer์„ ์ œ์•ˆํ–ˆ์Œ
  • UNREAL agent(UNsupervised REinforcement and Auxiliary Learning agent๊ฐ€ ๋‘ ๋ถ„๋ฆฌ๋œ DRL ๊ธฐ๋ฒ•์„ ๊ฒฐํ•ฉํ•จ
    • A3C๋กœ ํ•™์Šต๋œ ์ฒซ ๋ฒˆ์งธ ์ •์ฑ…์€ policy gradient method๋ฅผ ํ™œ์šฉํ•ด์„œ online์œผ๋กœ ์—…๋ฐ์ดํŠธ ๋จ.
      • ๊ณผ๊ฑฐ ์ƒํƒœ๋ฅผ encodingํ•  ์ˆ˜ ์žˆ๋Š” RNN ํ™œ์šฉ
    • ๋ฐ˜๋ฉด์— auxiliary task๋Š” replay buffer์— ์ €์žฅ๋˜๊ณ  ๋ช…์‹œ์ ์œผ๋กœ ์ƒ˜ํ”Œ๋ง๋œ ์ƒˆ๋กœ์šด ๊ฒฝํ—˜ ์‹œํ€€์Šค๋ฅผ ์‚ฌ์šฉํ•ด์„œ ํ•™์Šตํ•จ
    • ์ตœ๋Œ€ ํšจ์œจ์„ฑ ๋ณด์žฅ์„ ์œ„ํ•ด Q-learning์˜ off-policy๋กœ ํ•™์Šด๋˜๊ณ , ๊ฐ„๋‹จํ•œ feed-forward ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ํ•™์Šต๋จ.