[Reinforcement Learning] Dueling Double Deep Q Learning (DDDQN / Dueling DQN / D3QN)


Dueling Double DQN

https://arxiv.org/pdf/1509.06461.pdf

https://arxiv.org/pdf/1511.06581.pdf

Double DQN

  • DQN์—์„œ reward๋ฅผ ๊ณผ๋Œ€ ํ‰๊ฐ€ํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์Œ.
  • Q Value๊ฐ€ agent๊ฐ€ ์‹ค์ œ๋ณด๋‹ค ๋†’์€ ๋ฆฌํ„ด์„ ๋ฐ›์„ ๊ฒƒ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋Š” ๊ฒฝํ–ฅ
  • ⇒ Q learning update ๋ฐฉ์ •์‹์— ๋‹ค์Œ ์ƒํƒœ(state)์— ๋Œ€ํ•œ Q value ์ตœ๋Œ€๊ฐ’์ด ์กด์žฌํ•˜๊ธฐ ๋•Œ๋ฌธ
  • Q ๊ฐ’์— ๋Œ€ํ•œ max ์—ฐ์‚ฐ์€ ํŽธํ–ฅ์„ ์ตœ๋Œ€ํ™”ํ•œ๋‹ค.
    • ํ™˜๊ฒฝ์˜ ์ตœ๋Œ€ true value๊ฐ€ 0์ธ๋ฐ agent๊ฐ€ ์ถ”์ •ํ•˜๋Š” ์ตœ๋Œ€ true value๊ฐ€ ์–‘์ˆ˜์ธ ๊ฒฝ์šฐ์— ์„ฑ๋Šฅ ์ €ํ•˜

ํ•ด๊ฒฐ์„ ์œ„ํ•ด ๋‘ ๊ฐœ์˜ network ์‚ฌ์šฉ.

  • Q Eval (online network) : action selection → picks the best action for the next state
  • Q Next (target network) : action evaluation → evaluates how good that selected action actually is
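
A short sketch of the resulting Double DQN target, using made-up tensors instead of real networks (the names and numbers here are illustrative assumptions): the online network picks the argmax action for the next state, and the target network supplies the value used for bootstrapping.

import tensorflow as tf

# Dummy next-state Q values for a batch of 2 states and 3 actions (made-up numbers).
gamma = 0.99
rewards = tf.constant([1.0, 0.0])
dones = tf.constant([0.0, 1.0])                  # second transition is terminal
q_online_next = tf.constant([[1.0, 2.0, 0.5],
                             [0.1, 0.0, 0.3]])   # online network: selects
q_target_next = tf.constant([[0.8, 1.5, 0.4],
                             [0.2, 0.1, 0.25]])  # target network: evaluates

max_actions = tf.math.argmax(q_online_next, axis=1)                      # selection
eval_vals = tf.gather(q_target_next, max_actions, axis=1, batch_dims=1)  # evaluation
target = rewards + gamma * eval_vals * (1.0 - dones)
print(target.numpy())  # [1 + 0.99*1.5, 0.0] = [2.485 0.]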

Dueling DQN

๋‘ ๊ฐœ์˜ network ์‚ฌ์šฉ

  • V : Value function → how valuable the state itself is, based on the reward that can be expected from it
  • A : Advantage function → how valuable each action is in that state relative to the other actions
  • The Q function is reconstructed as V + A (in the implementation, V + (A - mean(A)); see the sketch below)
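
Why the implementation below uses V + (A - mean(A)) rather than a plain sum: V + A alone is not identifiable, since a constant can be moved between V and A without changing Q. A small numeric sketch of this (my own illustration of the mean-subtraction trick from the Dueling DQN paper):

import numpy as np

# Made-up numbers: shifting a constant between V and A leaves V + A unchanged,
# so the naive sum cannot pin down V and A separately.
V = 5.0
A = np.array([1.0, -1.0, 0.0])
print(V + A)                  # [6. 4. 5.]
print((V + 2.0) + (A - 2.0))  # [6. 4. 5.] -> same Q, different V/A split

# Subtracting the mean advantage fixes the split: the advantages are forced to
# average to zero, which is exactly what the network's call() method does below.
print(V + (A - A.mean()))     # [6. 4. 5.]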

1. Dueling DQN Network

import tensorflow as tf
import tensorflow.keras as keras


class DuelingDeepQNetwork(keras.Model):
    def __init__(self, n_actions, fc1_dims, fc2_dims):
        super(DuelingDeepQNetwork, self).__init__()
        self.dense1 = keras.layers.Dense(fc1_dims, activation='relu')
        self.dense2 = keras.layers.Dense(fc2_dims, activation='relu')
        self.V = keras.layers.Dense(1, activation=None)
        self.A = keras.layers.Dense(n_actions, activation=None)

    def call(self, state):
        x = self.dense1(state)
        x = self.dense2(x)
        # Value function -> value of the current state
        V = self.V(x)
        # Advantage function -> relative value of each action
        A = self.A(x)

        # Aggregate the streams; subtracting the mean advantage keeps the decomposition well-defined
        Q = (V + (A - tf.math.reduce_mean(A, axis=1, keepdims=True)))

        return Q
    
    def advantage(self, state):
        x = self.dense1(state)
        x = self.dense2(x)
        A = self.A(x)

        return A
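
A quick smoke test of the network above; the dimensions are assumptions for illustration (e.g. LunarLander-v2's 8-dimensional observation and 4 actions):

import numpy as np

net = DuelingDeepQNetwork(n_actions=4, fc1_dims=128, fc2_dims=128)
dummy_state = np.random.rand(1, 8).astype(np.float32)  # batch of one observation

print(net(dummy_state).shape)            # (1, 4) -> one Q value per action
print(net.advantage(dummy_state).shape)  # (1, 4) -> one advantage per action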

 

2. Replay Buffer

  • Using a Replay Buffer has the following advantages:
    • Reduces correlation between consecutive samples when updating the network
    • Speeds up learning through mini-batch updates
    • Reuses past transitions, preventing the network from forgetting them

import numpy as np

class ReplayBuffer():
    def __init__(self, max_size, input_shape):
        self.mem_size = max_size
        self.mem_cntr = 0

        self.state_memory = np.zeros((self.mem_size, *input_shape),
                                        dtype=np.float32)
        self.new_state_memory = np.zeros((self.mem_size, *input_shape),
                                        dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.bool_)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = done

        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)

        states = self.state_memory[batch]
        new_states = self.new_state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        dones = self.terminal_memory[batch]

        return states, actions, rewards, new_states, dones
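
A short usage sketch of the buffer; the observation size and transition values are made up for illustration:

buffer = ReplayBuffer(max_size=1000, input_shape=(8,))

# Store a few dummy transitions (state, action, reward, next state, done).
for i in range(64):
    s = np.random.rand(8).astype(np.float32)
    s_ = np.random.rand(8).astype(np.float32)
    buffer.store_transition(s, action=i % 4, reward=1.0, state_=s_, done=False)

states, actions, rewards, states_, dones = buffer.sample_buffer(batch_size=32)
print(states.shape, actions.shape, dones.dtype)  # (32, 8) (32,) bool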

 

3. Agent

import tensorflow as tf
from dueling_ddqn_lunar import DuelingDeepQNetwork
from dueling_ddqn_replay_buffer_lunar import ReplayBuffer
from tensorflow.keras.optimizers import Adam
import numpy as np


class Agent():
    def __init__(self, lr, gamma, n_actions, epsilon, batch_size,
                 input_dims, epsilon_dec=1e-3, eps_end=0.01, 
                 mem_size=100000, fc1_dims=128,
                 fc2_dims=128, replace=100):
        self.action_space = [i for i in range(n_actions)]

        # gamma = discount factor
        self.gamma = gamma
        # epsilon = exploration rate (probability of choosing a random action)
        self.epsilon = epsilon
        self.eps_dec = epsilon_dec
        self.eps_min = eps_end
        self.replace = replace
        self.batch_size = batch_size

        self.learn_step_counter = 0
        self.memory = ReplayBuffer(mem_size, input_dims)

# ์ด action์„ ํ‰๊ฐ€ํ•˜๋Š” network
        self.q_eval = DuelingDeepQNetwork(n_actions, fc1_dims, fc2_dims)
# ํ˜„์žฌ ์ƒํ™ฉ์— ๋”ฐ๋ฅธ action์„ ์„ ํƒํ•˜๋Š” network
        self.q_next = DuelingDeepQNetwork(n_actions, fc1_dims, fc2_dims)

        self.q_eval.compile(optimizer=Adam(learning_rate=lr),
                            loss='mean_squared_error')
                            
        self.q_next.compile(optimizer=Adam(learning_rate=lr),
                            loss='mean_squared_error')

    def store_transition(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])
# action ํ‰๊ฐ€
            actions = self.q_eval.advantage(state)
# ํ‰๊ฐ€ ๊ฐ’ ์ค‘ ์ตœ๊ณ 
            action = tf.math.argmax(actions, axis=1).numpy()[0]

        return action

    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return

        if self.learn_step_counter % self.replace == 0:
            self.q_next.set_weights(self.q_eval.get_weights())

        states, actions, rewards, states_, dones = \
                                    self.memory.sample_buffer(self.batch_size)

        # current Q estimates from the online network and next-state Q values
        # from the target network
        q_pred = self.q_eval(states)
        q_next = self.q_next(states_).numpy()
        q_target = q_pred.numpy()
        # Double DQN: the online network selects the best next action ...
        max_actions = tf.math.argmax(self.q_eval(states_), axis=1).numpy()

        # ... and the target network evaluates it; terminal states get no bootstrap term
        for idx, terminal in enumerate(dones):
            q_target[idx, actions[idx]] = rewards[idx] + \
                    self.gamma*q_next[idx, max_actions[idx]]*(1-int(terminal))
        self.q_eval.train_on_batch(states, q_target)

        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > \
                        self.eps_min else self.eps_min

        self.learn_step_counter += 1
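
For reference, a minimal training loop sketch that wires the Agent to OpenAI Gym's LunarLander-v2. The hyperparameters are placeholders, and the classic gym API (env.step returning 4 values) is assumed; newer gymnasium versions return 5 values instead.

import gym

env = gym.make('LunarLander-v2')
agent = Agent(lr=1e-3, gamma=0.99, n_actions=4, epsilon=1.0,
              batch_size=64, input_dims=[8])

for episode in range(500):
    observation = env.reset()
    done, score = False, 0.0
    while not done:
        action = agent.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        agent.store_transition(observation, action, reward, observation_, done)
        agent.learn()
        observation = observation_
        score += reward
    print(f'episode {episode}  score {score:.1f}  epsilon {agent.epsilon:.2f}')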