[Reinforcement Learning] From building a custom Gym environment to training a Dueling DDQN


์ธํ„ฐ๋„ท์„ ๋‹ค ๋’ค์ ธ๋ดค๋Š”๋ฐ ๊ฐ•ํ™”ํ•™์Šต์„ gym์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ฒŒ์ž„ agent ์‚ฌ์šฉํ•ด์„œ ํ•˜๋Š” ์˜ˆ์ œ๋Š” ์œก์ฒœ๋งŒ ๊ฐœ๊ณ  ์ปค์Šคํ…€ํ•ด์„œ ํ•™์Šต์„ ํ•˜๋Š” ์˜ˆ์ œ๋Š” ๋‹จ ํ•œ ๊ฐœ ์žˆ์—ˆ๋‹ค. ์ด์ œ ๋ง‰ ๊ณต๋ถ€๋ฅผ ์‹œ์ž‘ํ•˜๋Š” ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ๋„์›€์ด ๋˜์—ˆ์œผ๋ฉด ํ•˜๋Š” ๋งˆ์Œ์œผ๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ ์จ๋ณด๊ณ ์ž ํ•œ๋‹ค.

1. Gym์˜ Env ๊ตฌ์กฐ ์‚ดํŽด๋ณด๊ธฐ

You don't strictly have to do it this way (you could also implement everything from scratch), but we'll build on top of the environment structure of the gym library.

!pip install gym

gym ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ env ๊ตฌ์กฐ๋Š” ๋Œ€์ถฉ ์•„๋ž˜์™€ ๊ฐ™๋‹ค. site-packages/gym/core.py ์—์„œ ์ง์ ‘ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

class Env(Generic[ObsType, ActType]):
    """
    The main API methods that users of this class need to know are:

    - :meth:`step` - Takes a step in the environment using an action returning the next observation, reward,
      if the environment terminated and observation information.
    - :meth:`reset` - Resets the environment to an initial state, returning the initial observation and observation information.
    - :meth:`render` - Renders the environment observation with modes depending on the output
    - :meth:`close` - Closes the environment, important for rendering where pygame is imported

    And set the following attributes:

    - :attr:`action_space` - The Space object corresponding to valid actions
    - :attr:`observation_space` - The Space object corresponding to valid observations
    - :attr:`reward_range` - A tuple corresponding to the minimum and maximum possible rewards
    - :attr:`spec` - An environment spec that contains the information used to initialise the environment from `gym.make`
    - :attr:`metadata` - The metadata of the environment, i.e. render modes
    - :attr:`np_random` - The random number generator for the environment
    """

    def step(self, action: ActType) -> Tuple[ObsType, float, bool, bool, dict]:
        raise NotImplementedError

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        options: Optional[dict] = None,
    ) -> Tuple[ObsType, dict]:
        # Initialize the RNG if the seed is manually passed
        if seed is not None:
            self._np_random, seed = seeding.np_random(seed)

    def render(self) -> Optional[Union[RenderFrame, List[RenderFrame]]]:
        raise NotImplementedError

    def close(self):
        pass

    @property
    def unwrapped(self) -> "Env":
        return self

    def __str__(self):
        """Returns a string of the environment with the spec id if specified."""
        if self.spec is None:
            return f"<{type(self).__name__} instance>"
        else:
            return f"<{type(self).__name__}<{self.spec.id}>>"

    def __enter__(self):
        """Support with-statement for the environment."""
        return self

    def __exit__(self, *args):
        """Support with-statement for the environment."""
        self.close()
        # propagate exception
        return False
  • ๊ณต์‹ ๋ฌธ์„œ ์„ค๋ช…
 

Core - Gym Documentation

Previous Basic Usage

www.gymlibrary.dev

์ค‘๊ฐ„ ์ฃผ์„์„ ์ฃ„๋‹ค ์ง€์›Œ์„œ ์งง์•„๋ณด์ธ๋‹ค. ์•„๋ฌดํŠผ ์ค‘์š”ํ•œ ๊ฒƒ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

  • Methods
    • step is where, given an action, the change in state and the corresponding reward are decided. Using temperature as an example: if the current state is 20 degrees and the target is 30 degrees, this is where you hand out a reward depending on whether the state change after the action was positive or negative.
      • return → observation, reward, terminated, truncated, info (older gym versions return observation, reward, done, info)
    • reset initializes the attributes at the start of each episode (think of an episode as one game, one fixed time window, or whatever training cycle you use).
      • return → observation, info
  • Attributes
    • action_space is the set of actions the model can take. It can be a set of integers, or, for models like DDPG, a continuous space. Think of it as something close to a list of action commands such as "turn the light off", "turn it on", "dim it by half".
    • observation_space defines the format and range of the state you return. (A short sketch of these two space types follows after this list.)
    • state is the state of the environment.
    • reward is the reward returned when an action is taken in a state. Nobody hands you a reward function; you have to design one that fits your environment. There's one in the example below, so don't worry.
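To make the two space types above a bit more concrete, here is a minimal sketch (not part of the environment we'll build below) of how gym.spaces.Discrete and gym.spaces.Box behave; the values in the comments are just examples.

import numpy as np
from gym.spaces import Discrete, Box

action_space = Discrete(5)  # five discrete actions: 0, 1, 2, 3, 4
observation_space = Box(low=np.array([0]), high=np.array([100]), dtype=np.int8)

print(action_space.sample())       # a random valid action, e.g. 3
print(observation_space.sample())  # a random valid observation, e.g. [57]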

๊ทธ๋Ÿผ ๊ฑฐ๋‘์ ˆ๋ฏธํ•˜๊ณ  ๊ณง์žฅ ํ•œ ๋ฒˆ ์งœ๋ณด๋„๋ก ํ•˜์ž. 

* ์ด ์˜ˆ์ œ์˜ ๋ชฉํ‘œ๋Š” ๋žœ๋คํ•˜๊ฒŒ ์„ค์ •๋œ ์ˆซ์ž๋ฅผ ์›ํ•˜๋Š” ๋ฒ”์œ„ ์•ˆ์œผ๋กœ ๋“ค์–ด์˜ค๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

2. Writing a custom Gym environment

gym library์˜ Env ๋ฅผ ๊ฐ€์ ธ์™€์„œ ์ƒ์†๋ฐ›์„ ๊ฒƒ์ด๋‹ˆ ์šฐ์„  import ํ•œ๋‹ค.

import numpy as np  # needed for np.array / np.random below

from gym import Env
from gym.spaces import Box  # for the observation space

Inside __init__, I declare the action space, observation space, state, and episode length.

a. Setting the attributes

class ENV(Env):
    def __init__(self):
        self.action_space = [i for i in range(-2, 3)]
        self.observation_space = Box(low=np.array([0]), high=np.array([100]), dtype=np.int8)
        self.state = np.random.choice([-20, 0, 20, 40, 60])
        self.prev_state = self.state
        self.episode_length = 100

    def step(self, action):
        pass

    def reset(self):
        pass

ํ•˜๋‚˜์”ฉ ๋ณด์ž.

  • action space → a list of length 5 containing the integers from -2 to 2. You could also define it with gym.spaces.Discrete, but Dueling DQN, the example model we'll use here, only ever predicts non-negative action indices. So the action space is declared as a plain list, and the index into it is what gets learned as the action (see the short sketch after this list).
  • observation space → defined with gym.spaces.Box. The exact low and high values don't carry much meaning here.
  • state → since we don't have a real test environment, it is simply set to a random value.
  • episode_length → how many actions to run in one episode. When it reaches 0, just set done, one of the step function's return values, to True.
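To make that index-to-action mapping concrete, here is a tiny sketch (the variable names are made up for illustration): the network only ever outputs an index between 0 and 4, and the environment looks that index up in action_space to get the actual change applied to the state.

action_space = [i for i in range(-2, 3)]  # [-2, -1, 0, 1, 2]

predicted_index = 4                            # what the agent outputs (always 0..4)
actual_change = action_space[predicted_index]  # +2, the value added to the state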

b. Writing the reset function

Now write the reset function. It's just initialization, so it's easy.

def reset(self):
    self.state = np.random.choice([-20, 0, 20, 40, 60])
    self.episode_length = 100 
    return self.get_obs()

def get_obs(self):
    return np.array([self.state], dtype=int)

get_obs ๋ผ๋Š” ๋…€์„์ด ๋“ฑ์žฅํ–ˆ๋‹ค. ๋ณด์‹œ๋ฉด ์•„์‹œ๊ฒ ์ง€๋งŒ Dueling Double DQN์— ํฌํ•จ๋œ neural network์˜ input์œผ๋กœ ์‚ฌ์šฉ๋  state๋ฅผ ๋ฏธ๋ฆฌ numpy array์˜ (1, ) shape๋กœ ๋งŒ๋“ค์–ด์„œ ์‹ ๊ฒฝ๋ง ํ†ต๊ณผ ๊ณผ์ •์—์„œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธฐ์ง€ ์•Š๋„๋ก ํ•ด์ค€ ๊ฒƒ์ธ๋ฐ, ์ด๊ฑด ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์— ๋งž์ถฐ์„œ ์กฐ์ •ํ•ด์ฃผ์–ด์•ผํ•œ๋‹ค.

c. Writing the step function

Now for the slightly trickier step function.

def step(self, action):
    self.state += self.action_space[action]
    self.episode_length -= 1

    # Decide the reward
    if self.state >= 20 and self.state <= 25:
        reward = +100
    else:
        reward = -100

    prev_diff = min(abs(self.prev_state - 20), abs(self.prev_state - 25))
    curr_diff = min(abs(self.state - 20), abs(self.state - 25))

    if curr_diff <= prev_diff:
        if reward != 100: reward = reward + 50
        else: reward = 100
    if curr_diff > prev_diff: reward -= 50

    self.prev_state = self.state

    # Check whether the episode is over
    if self.episode_length <= 0:
        done = True
    else:
        done = False

    info = {}

    return self.get_obs(), reward, done, info

๋‹ค์†Œ ์กฐ์žกํ•ด๋ณด์ด๋Š” reward function์€ ์ด๋ ‡๊ฒŒ ์ƒ์„ฑํ•ด์ฃผ์—ˆ๋‹ค.

  1. If the state is inside the target range, reward = 100; otherwise reward = -100.
  2. If we got closer to the range than before, +50; if we moved away, -50.
    • Except when we're already inside the range, in which case the reward stays fixed at 100.

Then determine whether the episode is over based on the episode length, and return.

์œ ์˜ํ•ด์•ผํ•  ๊ฒƒ์€, ์ง€๊ธˆ ์ด ๋ณด์ƒํ•จ์ˆ˜๋Š” ์ž˜ ๋งŒ๋“  ๋ณด์ƒํ•จ์ˆ˜๊ฐ€ ์•„๋‹ˆ๋‹ค!!!!!!!!!!!!!! ๋” ๋งŽ์€ ์ƒํ™ฉ์„ ๊ณ ๋ คํ•ด์„œ ์ ์ ˆํ•œ ๋ณด์ƒํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด๋‚˜๊ฐ€๋Š” ๊ฒƒ์ด ์ข‹๋‹ค. ์šฐ๋ฆฌ ๋ชจ๋ธ์€ ๊ถ๊ทน์ ์œผ๋กœ 0๊ณผ 1๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์œผ๋ฏ€๋กœ ๊ธฐ๋ฉด ๊ธฐ๊ณ  ์•„๋‹ˆ๋ฉด ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์„ ์—ผ๋‘์— ๋‘์ž. ๋‹จ์ˆœํ•œ ๋…€์„์—๊ฒŒ ๋ณต์žกํ•œ ๊ฒƒ์„ ๊ฐ€๋ฅด์น˜๋ ค๋ฉด ์„ ์ƒ๋‹˜์ด ์—ฐ๊ตฌ๋ฅผ ์ข€ ํ•ด์•ผํ•œ๋‹ค. (ํ•„์ž๋„ ์•„์ง ์ž˜ ๋ชจ๋ฅธ๋‹คใ…Ž)

์ด์™ธ์—๋„ render ํ•จ์ˆ˜๋‚˜ close ํ•จ์ˆ˜ ๋“ฑ gym environment class์—์„œ ์ถ”๊ฐ€์ ์œผ๋กœ ์ž‘์„ฑํ•ด์ค„ ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜๋“ค์ด ์žˆ์ง€๋งŒ ๋‚˜๋Š” ์‹œ๊ฐํ™”๋ฅผ ํ•œ๋‹ค๋˜๊ฐ€ ํ•˜๋Š” ์š•์‹ฌ๊นŒ์ง€๋Š” ์—†์œผ๋ฏ€๋กœ ์ง€๊ธˆ์€ ์ƒ๋žตํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค. 

 

3. ๋ชจ๋ธ ์ ์šฉ

๊ทธ๋Ÿผ ์ด์ œ ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•ด๋ณด์ž. ์ฝ”๋“œ๋Š” ๋ฏธ๋ฆฌ ๋งŒ๋“ค์–ด๋‘” ์•„๋ž˜ repository ์ฝ”๋“œ๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค. ์•„๋ž˜ ๋ ˆํฌ์ง€ํ† ๋ฆฌ ํ•˜์œ„์˜ DDDQN ํด๋”์˜ ์ฝ”๋“œ๋ฅผ ๋ณด์ž.

 

GitHub - melli0505/Deep-RL (github.com)

The original code it is based on can be found here, under ReinforcementLearning/DeepQLearning.

 

GitHub - philtabor/Youtube-Code-Repository: Repository for most of the code from my YouTube channel (github.com)

DDDQN ๋ชจ๋ธ ๊ตฌ์กฐ

The DDDQN (Dueling Double Deep Q-Network) model mixes the strengths of the DDQN (Double Deep Q-Network) model and the Dueling DQN model. I won't explain the model in detail here; if I write posts on Dueling DQN / Double DQN later, I'll add a link. The model structure does matter to us, but not that much - just read the code carefully.

a. A closer look at DDDQN

env = ENV()
agent = Agent(env=env, lr=1e-3, gamma=0.99, n_actions=5, epsilon=1.0,
              batch_size=64, input_dims=[1])

We define env as the ENV we built above, then declare the DDDQN agent. Let's take a quick look at the agent.

class Agent():
    def __init__(self, input_dims, env, epsilon=1, lr=1e-3, gamma=0.99, n_actions=2, batch_size=64,
                 epsilon_dec=1e-3, eps_end=0.01, 
                 mem_size=100000, fc1_dims=128,
                 fc2_dims=128, replace=100):
                 
        self.env = env
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_dec = epsilon_dec
        self.eps_min = eps_end
        self.replace = replace
        self.batch_size = batch_size

        self.learn_step_counter = 1
        self.memory = ReplayBuffer(mem_size, input_dims)
        self.q_eval = DuelingDeepQNetwork(n_actions, fc1_dims, fc2_dims)
        self.q_next = DuelingDeepQNetwork(n_actions, fc1_dims, fc2_dims)
        self.q_eval.compile(optimizer=Adam(learning_rate=lr),
                            loss='mean_squared_error')
        self.q_next.compile(optimizer=Adam(learning_rate=lr),
                            loss='mean_squared_error')
        self.action_space = [i for i in range(n_actions)]

To describe the model very, very briefly: it uses two networks. One is updated immediately from each action-reward step, and the other is a delayed target network that is left untouched for a while and only updated periodically.
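The "delayed" part is usually implemented as a hard update: every fixed number of learning steps, the online network's weights are copied into the target network. Here is a small illustrative sketch with two stand-in Keras models; it mirrors the replace / learn_step_counter attributes of the Agent above but is not the repository's exact code.

import tensorflow as tf

# Two stand-in networks, just to show the hard target-update mechanic.
q_eval = tf.keras.Sequential([tf.keras.layers.Dense(5, input_shape=(1,))])  # online network
q_next = tf.keras.Sequential([tf.keras.layers.Dense(5, input_shape=(1,))])  # delayed target network

replace = 100          # how many learning steps to wait between syncs
learn_step_counter = 0

def maybe_sync_target():
    # Every `replace` learning steps, overwrite the target network with
    # the online network's current weights.
    if learn_step_counter % replace == 0:
        q_next.set_weights(q_eval.get_weights())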

Also, when learning from state-action-reward information, it doesn't use the transitions as they stream in. Instead they are stored in a kind of warehouse called replay memory, and training is done on batches drawn from it at random, which reduces the correlation between consecutive samples and generally improves performance.
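A replay memory is conceptually just a fixed-size buffer with uniform random sampling. The repository has its own ReplayBuffer class; the sketch below is only meant to illustrate the idea.

import numpy as np

class SimpleReplayBuffer:
    # Illustrative replay buffer: store transitions, sample them uniformly at random.
    def __init__(self, max_size, input_dims):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.states = np.zeros((max_size, *input_dims), dtype=np.float32)
        self.actions = np.zeros(max_size, dtype=np.int32)
        self.rewards = np.zeros(max_size, dtype=np.float32)
        self.next_states = np.zeros((max_size, *input_dims), dtype=np.float32)
        self.dones = np.zeros(max_size, dtype=np.bool_)

    def store(self, state, action, reward, next_state, done):
        idx = self.mem_cntr % self.mem_size  # overwrite the oldest entry when full
        self.states[idx] = state
        self.actions[idx] = action
        self.rewards[idx] = reward
        self.next_states[idx] = next_state
        self.dones[idx] = done
        self.mem_cntr += 1

    def sample(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)
        return (self.states[batch], self.actions[batch], self.rewards[batch],
                self.next_states[batch], self.dones[batch])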

On top of that, early in training, epsilon makes the agent take a number of random actions, so that it explores instead of committing too early to whatever the barely-trained network happens to prefer.

gamma์˜ ๊ฒฝ์šฐ์—๋Š” ๊ฐ’์ด 1์— ๊ฐ€๊นŒ์šธ ์ˆ˜๋ก ๋ฏธ๋ž˜์ง€ํ–ฅ์ ์œผ๋กœ, ํ˜„์žฌ์˜ ๊ฐ’์ด ๋ฏธ๋ž˜์— ๋ฏธ์น  ์˜ํ–ฅ์„ ๋” ๋งŽ์ด ๊ณ ๋ คํ•œ๋‹ค๋Š” ๋œป์ด๋‹ค. Gradient Vanishing ๋ฌธ์ œ์—์„œ ๋„คํŠธ์›Œํฌ๊ฐ€ ์ง„ํ–‰๋ ์ˆ˜๋ก ์•ž์„  ๊ฐ’์ด ์žŠํžˆ๋Š” ๊ฒƒ๊ณผ ๋น„์Šทํ•œ ๋А๋‚Œ์ด๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๊ฒ ๋Š”๋ฐ, gamma๊ฐ€ 1์ด๋ฉด ๋‚˜์ค‘์— ํ˜„์žฌ ๋„คํŠธ์›Œํฌ์˜ ์˜ˆ์ธก ๊ฒฐ๊ณผ๊ฐ€ ๋ฏธ๋ž˜๊นŒ์ง€ ์˜ํ–ฅ์„ ์˜จ์ „ํžˆ, ๋งŽ์ด ๋ฏธ์น˜๋Š” ๊ฒƒ์ด๋‹ค. 

n_action์€ action space์˜ ๊ธธ์ด(action ๊ฐœ์ˆ˜)์ด๋‹ค. self.action space๋Š” environment ๊ตฌ์„ฑํ•  ๋•Œ ์ด์•ผ๊ธฐํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ์ด๋ฒˆ ์˜ˆ์ œ์—์„œ ์‚ฌ์šฉํ•  ์˜ˆ์ œ ๋ชจ๋ธ์ธ Dueling DQN์€ action ์ถ”๋ก ์„ ์˜ค์ง ์–‘์ˆ˜๋กœ๋งŒ ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ action space์˜ index๋ฅผ environment step์˜ action์œผ๋กœ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด 0~n_action ์‚ฌ์ด๋กœ ์„ค์ •ํ•ด์ฃผ์—ˆ๋‹ค.

DDDQN์˜ ํ•™์Šต ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. ๋„คํŠธ์›Œํฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ action ๊ณ ๋ฅด๊ธฐ 
  2. env.step์œผ๋กœ state์™€ reward ๋ฐ›๊ธฐ
  3. ๋ฐ›์€ reward์™€ state, action ๋“ฑ์„ memory์— ์ €์žฅํ•˜๊ธฐ
  4. memory์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ๋ฝ‘์•„์„œ ํ•™์Šตํ•˜๊ณ  ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฐœ์„ ํ•˜๊ธฐ
  5. 1~4 ๋ฐ˜๋ณต
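In code, that loop roughly looks like the sketch below. The actual main.py in the repository adds checkpointing and logging on top of this, and method names such as store_transition and learn are my assumptions about the agent interface rather than verbatim repository code.

n_episodes = 500
scores = []

for episode in range(n_episodes):
    observation = env.reset()
    done, score = False, 0

    while not done:
        action = agent.choose_action(observation)                 # 1. pick an action from the network
        next_observation, reward, done, info = env.step(action)   # 2. get the new state and reward
        agent.store_transition(observation, action, reward,
                               next_observation, done)            # 3. store the transition in memory
        agent.learn()                                             # 4. sample from memory and update the networks
        observation = next_observation
        score += reward

    scores.append(score)
    print(f"episode {episode} | score {score:.2f}")               # 5. repeat for the next episode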

๊ทธ๋ฆฌ๊ณ  ํ•˜๋‚˜ ์œ ์˜ํ•ด์•ผํ•  ๊ฒƒ์ด ์žˆ๋‹ค๋ฉด Readme ํŒŒ์ผ์—๋„ ์ž‘์„ฑํ•ด๋‘” ๊ฒƒ์ฒ˜๋Ÿผ ํ•™์Šต ๊ณผ์ •์—์„œ reward function์„ ์ƒ๋‹นํžˆ ๋งŽ์ด ๋ฏฟ๊ณ !!! ์˜ํ–ฅ์„ ๋งŽ์ด ๋ฐ›์•„์„œ ์—…๋ฐ์ดํŠธ๋ฅผ ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ network target ๊ฐ’์— reward์˜ ๋ฐ˜์˜์œจ 10๋ฐฐ ์ด๋ฒคํŠธ๋ฅผ ์„ค์ •ํ–ˆ๋‹ค. ์›๋ž˜๋Œ€๋กœ๋ผ๋ฉด ์•„๋ž˜์˜ * 10์ด ์—†์–ด์•ผํ•œ๋‹ค.

# dddqn.py, line 59-62

for idx, terminal in enumerate(dones):
    q_target[idx, actions[idx]] = rewards[idx] * 10 + \
            self.gamma*q_next[idx, max_actions[idx]]*(1-int(dones[idx]))
self.q_eval.train_on_batch(states, q_target)

So if something seems to be going wrong, this is the first thing to suspect.

 

4. Running it

python DDDQN_custom/main.py

์ด์ œ ์‹คํ–‰์„ ํ•ด์ฃผ๊ณ , load checkpoint๋‚˜ training resume์— ๋‹ค False๋ฅผ ์ฒ˜๋ฆฌํ•ด์ฃผ๋ฉด ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šต์ด ์‹œ์ž‘๋œ๋‹ค. 

์ด๋Ÿฐ ์‹์œผ๋กœ ํ•™์Šต์ด ๋˜๋Š”๋ฐ, ๋งŒ์•ฝ์— ๋ญ”๊ฐ€...ํ•™์Šต์ด...์ด์ƒํ•˜๋‹คใ…ใ…? ํ•˜๋ฉด choose_action์„ ์•„๋ž˜์™€ ๊ฐ™์ด ์ˆ˜์ •ํ•ด์ฃผ์ž. ๋‹จ evaluate๋ฅผ ํ•˜๊ณ  ์‹ถ์„ ๋•Œ๋Š” ๋‹ค์‹œ ๋Œ๋ ค์ค˜์•ผํ•œ๋‹ค ใ…Ž.. ํ•ด๊ฒฐ๋˜๋ฉด ๊นƒํ—™์— ์—…๋ฐ์ดํŠธ ํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค.

    def choose_action(self, observation, evaluate=False):
        if np.random.random() < self.epsilon:# and evaluate is False:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])
            actions = self.q_eval.advantage(state)
            action = tf.math.argmax(actions, axis=1).numpy()[0]

        return action

 

With that done, looking at the results:

initial state:  [80]    - action: 1 | state: [79] | reward: -50
- action: 1 | state: [78] | reward: -50
- action: 1 | state: [77] | reward: -50
- action: 1 | state: [76] | reward: -50
- action: 1 | state: [75] | reward: -50
- action: 1 | state: [74] | reward: -50
- action: 1 | state: [73] | reward: -50
- action: 1 | state: [72] | reward: -50
- action: 1 | state: [71] | reward: -50
- action: 1 | state: [70] | reward: -50
- action: 1 | state: [69] | reward: -50
- action: 1 | state: [68] | reward: -50
- action: 1 | state: [67] | reward: -50
- action: 1 | state: [66] | reward: -50
- action: 0 | state: [64] | reward: -50
- action: 0 | state: [62] | reward: -50
- action: 0 | state: [60] | reward: -50
- action: 0 | state: [58] | reward: -50

...

- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
- action: 2 | state: [24] | reward: 100
| episode:  0   | score: 4900.00        | average score 4900.00
 - last state:  [24]    | reward:  100  | action:  2    | epsilon:  0.999

You can see it wraps up nicely.

๋!!!!!!!!!!!! ํ•˜๋‹ค๊ฐ€ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ๊ฑฐ๋‚˜ ์„ค๋ช… ์ค‘ ํ‹€๋ฆฐ ๋ถ€๋ถ„์ด ์žˆ๋‹ค๋ฉด ์–ผ๋งˆ๋“ ์ง€ ๋Œ“๊ธ€ ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค(_ _)
