๐Ÿฌ ML & Data/๐Ÿ“ฎ Reinforcement Learning

    [Reinforcement Learning] Dealing with Sparse Reward Environments - Learning in a Sparse Reward Environment

    ※ These are study notes on the article linked below: "Reinforcement Learning: Dealing with Sparse Reward Environments" (medium.com). 1. Sparse Reward. Sparse Reward: the case where the agent receives a positive reward only when it gets close to the goal state, which matches the current experimental setup. Curiosity-Driven method: motivates the agent even in parts of the environment outside its immediate interest. Curric..
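    As a rough illustration of the curiosity-driven idea (a minimal sketch with an assumed linear forward model, not necessarily the article's exact method), the agent can earn an intrinsic bonus proportional to how badly it predicts the next state, so novel transitions stay rewarding even when the extrinsic reward is sparse:

    ```python
    # Curiosity-style intrinsic reward: the prediction error of a learned
    # forward model serves as a bonus added to the (sparse) extrinsic reward.
    # The linear model and all sizes here are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    state_dim, action_dim, lr, beta = 4, 2, 1e-2, 0.1  # beta scales the bonus

    W = rng.normal(scale=0.1, size=(state_dim, state_dim + action_dim))

    def intrinsic_reward(s, a, s_next):
        """Forward-model prediction error = curiosity bonus."""
        global W
        x = np.concatenate([s, a])
        err = s_next - W @ x               # how surprising was this transition?
        W += lr * np.outer(err, x)         # online update of the forward model
        return beta * float(err @ err)     # big error -> novel state -> bonus

    # the agent would then learn from r_total = r_extrinsic + intrinsic_reward(...)
    ```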

    [Reinforcement Learning] DDPG (Deep Deterministic Policy Gradient)

    DDPG resolves DQN's curse-of-dimensionality problem (handling high-dimensional actions slows computation and demands a lot of memory space) with an off-policy actor-critic approach, building on the insights of the original DQN: batch normalization, a replay buffer, and a target Q network. Actor-critic: keeps a parameterized actor function. actor function: maps a state to a specific action, thereby specifying the current policy; it is trained by policy gradient. Here J is the objective function; the actor function maximizes the objective via gradient ascent → the policy parameter at that point..
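    A minimal sketch of the actor update described above, in PyTorch with toy network shapes of my own choosing: because the policy is deterministic, maximizing J = E[Q(s, mu(s))] by gradient ascent is the same as minimizing -Q(s, mu(s)) with respect to the actor parameters.

    ```python
    # DDPG-style actor update (sketch): gradient ascent on J = E[Q(s, mu(s))].
    import torch
    import torch.nn as nn

    state_dim, action_dim = 3, 1                      # assumed toy sizes
    actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_dim), nn.Tanh())   # mu(s)
    critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))                      # Q(s, a)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

    states = torch.randn(32, state_dim)               # batch from the replay buffer
    actions = actor(states)                           # a = mu(s)
    q_values = critic(torch.cat([states, actions], dim=1))

    actor_loss = -q_values.mean()                     # descend -J == ascend J
    actor_opt.zero_grad()
    actor_loss.backward()                             # gradient flows through Q into mu
    actor_opt.step()                                  # only actor params are stepped
    ```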

    [Reinforcement Learning] Dueling Double Deep Q-Learning (DDDQN / Dueling DQN / D3QN)

    Dueling Double DQN. https://arxiv.org/pdf/1509.06461.pdf https://arxiv.org/pdf/1511.06581.pdf Double DQN: DQN has a problem of overestimating rewards; the Q value tends to assume the agent will receive a higher return than it actually will ⇒ this is because the Q-learning update equation contains the maximum Q value of the next state, and the max operation over Q values maximizes the bias. Performance degrades when the environment's maximum true value is 0 but the maximum true value the agent estimates is positive. To solve this, two networks are used. Q Next: action selection → for the next action, the best ..
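    A sketch of the Double DQN target this excerpt describes (tensor shapes and the helper name are my assumptions): the online network selects the next action and the target network evaluates it, so one network no longer both picks and scores the maximum.

    ```python
    # Double DQN target: decouple action *selection* from action *evaluation*.
    import torch

    def double_dqn_target(q_online_next, q_target_next, rewards, dones, gamma=0.99):
        """q_*_next: [batch, n_actions]; rewards, dones: [batch] float tensors."""
        best = q_online_next.argmax(dim=1, keepdim=True)       # selection: online net
        next_q = q_target_next.gather(1, best).squeeze(1)      # evaluation: target net
        return rewards + gamma * (1.0 - dones) * next_q        # TD target

    # Plain DQN would use q_target_next.max(dim=1).values, selecting and
    # evaluating with the same network, which is exactly what inflates the estimate.
    ```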

    [Reinforcement Learning] From Creating a Custom RL Environment with gym to Training a Dueling DDQN

    I scoured the whole internet: there are sixty million examples of reinforcement learning that use the game agents gym provides, but exactly one example that builds a custom environment and trains on it. I am writing this short guide hoping it helps people who are just starting out. 1. A Look at Gym's Env Structure. It is not strictly required (you can also implement everything from scratch), but we will build on the environment structure of the gym library. !pip install gym. The env structure of the gym library looks roughly like the following; you can see it directly in site-packages/gym/core.py. class Env(Generic[ObsType, ActType]): """The ma..
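    For concreteness, a minimal custom Env subclass in the spirit of that structure (the toy dynamics are mine; the reset/step signatures follow the newer gym API, while older gym versions return a 4-tuple from step):

    ```python
    # A tiny custom environment: walk left/right on a line, reach +10 to finish.
    import gym
    import numpy as np
    from gym import spaces

    class LineWalkEnv(gym.Env):
        def __init__(self):
            self.action_space = spaces.Discrete(2)             # 0: left, 1: right
            self.observation_space = spaces.Box(low=-10.0, high=10.0,
                                                shape=(1,), dtype=np.float32)
            self.pos = 0.0

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.pos = 0.0
            return np.array([self.pos], dtype=np.float32), {}

        def step(self, action):
            self.pos += 1.0 if action == 1 else -1.0
            terminated = self.pos >= 10.0                      # goal reached
            reward = 1.0 if terminated else -0.01              # small step penalty
            obs = np.array([self.pos], dtype=np.float32)
            return obs, reward, terminated, False, {}          # truncated = False
    ```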

    [Reinforcement Learning] DQN (Deep Q-Network)

    [Model Review] Markov Decision Process & Q-Learning. 1. Markov Decision Process (MDP): see the linked post 바닥부터 배우는 강화학습 - Markov Decision Process (dnai-deny.tistory.com). Deep Reinforcement Learning: classic Q-learning stores the Q-value for each State and Action in table form; as the state space and action space grow, the memory and exploration time required to store the Q-values also grow ⇒ generate the Q-table with deep learning instead: the Q..
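    The table-to-network swap the excerpt describes fits in a few lines (the architecture and sizes are my assumptions): a single forward pass returns Q(s, a) for every action, so nothing has to be stored per state.

    ```python
    # DQN's core idea: a network Q(s, .) replaces the Q-table row lookup.
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))

        def forward(self, state):
            return self.net(state)        # [batch, n_actions]

    q = QNetwork(state_dim=4, n_actions=2)
    greedy = q(torch.randn(1, 4)).argmax(dim=1)   # replaces argmax over a table row
    ```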

    [Reinforcement Learning] Markov Decision Process & Q-Learning

    1. Markov Decision Process (MDP). Reference: 바닥부터 배우는 강화학습 - Markov Decision Process. Markov process: defined by a state set S and a transition probability matrix P; transitions occur from one state to another, and each state transition has its own probability; S4 is a terminal state. Markov property: $$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]$$ The path taken to reach a state does not affect the probability computation; a state is called Markov when the next state can be determined from the state at a single point in time. Counterexample: a photo taken while driving (a single snapshot cannot tell whether the car is reversing or moving forward, or how fast it is going → the next state cannot be determined). ex) a chess game (at any ..
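    To connect the MDP above to Q-learning, here is a minimal tabular update sketch (the five states with S4 terminal mirror the excerpt; everything else is a toy setup of my own):

    ```python
    # Tabular Q-learning on a 5-state chain; S4 is the terminal state.
    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.9

    def update(s, a, r, s_next, done):
        """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

    update(s=0, a=1, r=0.0, s_next=1, done=False)   # ordinary transition
    update(s=3, a=1, r=1.0, s_next=4, done=True)    # transition into terminal S4
    ```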