[Paper Implementation] Implementing the Transformer in TensorFlow


๋ชจ๋ธ์„ ๋Œ๋ฆฌ๊ธฐ ์œ„ํ•ด์„œ ๋งŒ๋“  ๊ฑด ์•„๋‹ˆ๊ณ , ์ด์ „์— ๊ณต๋ถ€ํ•œ ๋‚ด์šฉ์„ ์ฝ”๋“œ๋กœ ๊ตฌํ˜„ํ•ด๋ณด๋ฉด์„œ ์ดํ•ด๋ฅผ ๊นŠ์ดํ•˜๋Š” ์‹œ๊ฐ„์„ ๊ฐ–๊ณ ์ž ๋งŒ๋“ค์–ด๋ณด์•˜๋‹ค.
๋”ฐ๋ผ์„œ ์‹ค์ œ ๋ฐ์ดํ„ฐ ๋„ฃ๊ณ  ํ•™์Šตํ–ˆ์„ ๋•Œ ๊ตฌ๋ฐ๊ธฐ์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ์ฐธ๊ณ ํ•ด์ฃผ์‹œ๊ธธ...

1. Implementing Attention

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
We just implement this formula term by term. Here $d_{k}$ is the dimension of the key vectors.
The mask argument is applied when one is given; the masks themselves are built later on.

import tensorflow as tf
from tensorflow.keras import layers  # used by the layer classes below

def scaled_dot_product_attention(query, key, value, mask=None):
    # QK^T
    matmul_qk = tf.matmul(query, key, transpose_b=True)  # transpose the second argument

    # QK^T / sqrt(d_k)
    d_k = tf.cast(tf.shape(key)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(d_k)

    # apply the mask (1 = blocked): push masked logits toward -inf so softmax gives them ~0 weight
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    # multiply with V
    output = tf.matmul(attention_weights, value)

    return output, attention_weights
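
As a quick sanity check (a hypothetical example, not part of the original post), random tensors can be pushed through the function to confirm the output shapes:

# Hypothetical sanity check with random tensors
q = tf.random.uniform((2, 5, 8))   # (batch, seq_len, depth)
k = tf.random.uniform((2, 5, 8))
v = tf.random.uniform((2, 5, 8))
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # (2, 5, 8)
print(weights.shape)  # (2, 5, 5)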

2. Implementing Multi-Head Attention

Multi-head attention calls the attention implemented above so that several heads compute self-attention independently. Since this class is itself treated as a single layer, it inherits from tf.keras.layers.Layer.

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        # d_model % num_heads must always be 0: the model dimension is split evenly across heads
        assert self.d_model % self.num_heads == 0, "d_model must be divisible by num_heads"

        self.depth = self.d_model // self.num_heads

        # projection weights that produce q, k, v
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # split d_model across the heads so the input becomes multi-head
        # (batch_size, seq_len, num_heads, depth) -> (batch_size, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    # run the multi-head attention layer
    def call(self, query, key, value, mask=None):
        batch_size = tf.shape(query)[0]

        # compute Q, K, V
        query = self.wq(query)
        key = self.wk(key)
        value = self.wv(value)

        # split and reshape for the heads
        query = self.split_heads(query, batch_size)
        key = self.split_heads(key, batch_size)
        value = self.split_heads(value, batch_size)

        # compute the attention scores
        scaled_attention, attention_weights = scaled_dot_product_attention(query, key, value, mask)

        # restore the axes that were moved for easier computation, then concatenate the heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)

        return output, attention_weights
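
A similarly hedged shape check for the layer (hypothetical dimensions): with d_model=512 and 8 heads, a (64, 50, 512) input should come back as (64, 50, 512), with per-head attention weights of shape (64, 8, 50, 50).

# Hypothetical shape check: self-attention over a random batch
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = tf.random.uniform((64, 50, 512))  # (batch, seq_len, d_model)
out, attn = mha(x, x, x)
print(out.shape)    # (64, 50, 512)
print(attn.shape)   # (64, 8, 50, 50)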

3. Implementing the Feed-Forward Network

A feed-forward layer sits at the end of each encoder and decoder layer.

class PositionwiseFeedForward(tf.keras.layers.Layer):
    def __init__(self, d_model, d_ffn=2048):
        super(PositionwiseFeedForward, self).__init__()
        self.d_model = d_model
        self.d_ffn = d_ffn

        self.dense1 = layers.Dense(self.d_ffn, activation="relu")
        self.dense2 = layers.Dense(self.d_model)

    def call(self, x):
        x = self.dense1(x)
        x = self.dense2(x)
        return x
  • The feed-forward network is simply two fully connected layers applied one after the other.
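
The same two Dense layers are applied independently at every position, so the sequence length passes through unchanged; a quick check with assumed dimensions:

# Hypothetical shape check: position-wise application keeps (batch, seq_len, d_model) intact
ffn = PositionwiseFeedForward(d_model=512, d_ffn=2048)
x = tf.random.uniform((64, 50, 512))
print(ffn(x).shape)  # (64, 50, 512)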

4. Implementing the Encoder

4.1. Encoder layer

Looking at the encoder layer to implement, there is a multi-head attention block with layer normalization, followed by a feed-forward layer. A residual connection feeds each sub-layer's input back into its output.

Since multi-head attention and the feed-forward network were already implemented above, expressing the encoder layer in code looks like this.

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ffn, dropout_rate=0.1):
        super(EncoderLayer, self).__init__()
        # declare the sub-layers to use
        # Multi-Head Attention
        self.multi_head_attention = MultiHeadAttention(d_model, num_heads)

        # FeedForward Network
        self.feedforward = PositionwiseFeedForward(d_model, d_ffn)

        # Dropout
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

        # Layer Normalization
        self.layer_norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, mask, training):
        # run multi-head self-attention
        attn_output, _ = self.multi_head_attention(x, x, x, mask)
        # dropout
        attn_output = self.dropout1(attn_output, training=training)
        # residual connection: add the input x to the attention output, then normalize
        out1 = self.layer_norm1(x + attn_output)

        # feed-forward network
        ffn_output = self.feedforward(out1)
        # dropout
        ffn_output = self.dropout2(ffn_output, training=training)
        # residual connection: add out1 to the feed-forward output, then normalize
        out2 = self.layer_norm2(out1 + ffn_output)

        return out2

์ด์ œ ์ด encoder layer๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ๋ถ™์—ฌ์„œ encoder๋ฅผ ๊ตฌ์„ฑํ•œ๋‹ค.

๋จผ์ € encoder ์ž…๋ ฅ ๋ฒกํ„ฐ๋ฅผ embeding ํ•ด์•ผํ•œ๋‹ค. ์ž…๋ ฅ ๋ฒกํ„ฐ๋ฅผ ๊ณ ์ • ํฌ๊ธฐ์˜ dense ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์ด๋‹ค.
๊ทธ๋ฆฌ๊ณ  positional encoding์˜ ์ˆ˜ํ–‰๊ฒฐ๊ณผ์™€ embedding์„ ๋จผ์ € ํ•ฉ์ณ์ฃผ๊ณ , ์ด๋ฅผ encoder layer๋“ค์˜ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•˜๋„๋ก ํ•œ๋‹ค.

class Encoder(tf.keras.layers.Layer):
    def __init__(
        self, num_layers, d_model, num_heads, d_ffn, vocab_size, dropout_rate=0.1
    ):
        super(Encoder, self).__init__()

        # input embedding: token ids -> d_model-dimensional dense vectors
        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size, output_dim=d_model
        )

        # Positional Encoding (kept as a non-trainable weight, max sequence length 10000)
        self.pos_encoding = self.add_weight(
            "pos_encoding", shape=[1, 10000, d_model], trainable=False
        )

        self.encoder_layers = [
            EncoderLayer(d_model, num_heads, d_ffn, dropout_rate)
            for _ in range(num_layers)
        ]

        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, mask, training):
        # embedding + positional encoding
        x = self.embedding(x) + self.pos_encoding[:, : tf.shape(x)[1], :]
        x = self.dropout(x, training=training)
        for layer in self.encoder_layers:
            x = layer(x, mask, training)
        return x
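
Note that pos_encoding above is just a non-trainable weight acting as a placeholder table; the sinusoidal encoding from the paper is not computed here. A minimal sketch of the sinusoidal version (my addition, following the PE(pos, 2i) = sin, PE(pos, 2i+1) = cos formula) could be assigned to self.pos_encoding instead:

import numpy as np

def sinusoidal_pos_encoding(max_len, d_model):
    # angle(pos, i) = pos / 10000^(2 * (i // 2) / d_model)
    positions = np.arange(max_len)[:, np.newaxis]              # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                   # (1, d_model)
    angle_rads = positions / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])          # even indices -> sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])          # odd indices  -> cos
    return tf.cast(angle_rads[np.newaxis, ...], tf.float32)    # (1, max_len, d_model)

# e.g. inside Encoder.__init__: self.pos_encoding = sinusoidal_pos_encoding(10000, d_model)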

4.2 Encoder Padding Mask

def create_padding_mask(seq):
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # shape = (batch_size, 1, 1, seq_len)

Sequence ๊ธธ์ด๋ฅผ ํ†ต์ผํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋งŒ๋“ค์–ด์ง„ padding์ด๋‹ค. Transformer๋Š” ๊ณ ์ •๊ธธ์ด sequence๋ฅผ ์ž…์ถœ๋ ฅ์œผ๋กœ ๋‘๊ธฐ ๋•Œ๋ฌธ์— ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ ๋“ค์ญ‰๋‚ ์ญ‰ํ•ด์„œ๋Š” ์•ˆ๋œ๋‹ค.

5. Decoder ๊ตฌํ˜„ํ•˜๊ธฐ

5.1. Decoder Layer

The decoder layer consists of two multi-head attention blocks and one feed-forward network. The first multi-head attention is called masked multi-head attention: since the decoder's input is the target sequence, the mask keeps each position from using information about words that come after the one the model currently has to predict. If it could see those future words, it would effectively already know the answer.

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ffn, dropout_rate=0.1):
        super(DecoderLayer, self).__init__()

        # Masked Multi-head Attention
        self.masked_multi_head_attention = MultiHeadAttention(
            d_model=d_model, num_heads=num_heads
        )

        # Multi-Head Attention
        self.multi_head_attention = MultiHeadAttention(
            d_model=d_model, num_heads=num_heads
        )

        # FeedForward Network
        self.feedforward = PositionwiseFeedForward(d_model=d_model, d_ffn=d_ffn)

        # Dropout
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)
        self.dropout3 = layers.Dropout(dropout_rate)

        # Layer Normalization
        self.layer_norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm3 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, encoder_output, look_ahead_mask, padding_mask, training):
        # Masked Multi-Head Attention
        attn1, _ = self.masked_multi_head_attention(x, x, x, mask=look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layer_norm1(x + attn1)

        attn2, _ = self.multi_head_attention(
            out1, encoder_output, encoder_output, mask=padding_mask
        )
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layer_norm2(out1 + attn2)

        ffn_output = self.feedforward(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layer_norm3(out2 + ffn_output)

        return out3

5.2. Decoder padding mask

def create_padding_mask(seq):
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # shape = (batch_size, 1, 1, seq_len)

์ด ๋ถ€๋ถ„์€ Encoder padding mask์™€ ๊ฐ™๋‹ค. ๋‹ค๋งŒ decoder sequence๋Š” encoder sequence๋ณด๋‹ค ๊ธธ์ด๊ฐ€ ํ•˜๋‚˜ ์งง๋‹ค๋Š” ์‚ฌ์‹ค์„ ์—ผ๋‘์— ๋‘์–ด์•ผํ•œ๋‹ค.

5.3. Look-ahead mask

def create_look_ahead_mask(size):
    """
    tf.ones((4, 4)) -> [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
    tf.linalg.band_part with
      num_lower = -1 : keep every element below the diagonal
      num_upper = 0  : zero out every element above the diagonal
      -> [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
    1 - tf.linalg.band_part(...) : subtract from 1 to flip the values
      -> [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
    Positions marked 1 are masked: in the attention they get a large negative
    value added (mask * -1e9), so they receive ~0 weight after the softmax.
    """
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), num_lower=-1, num_upper=0)
    return mask

look-ahead ๋งˆ์Šคํฌ๋Š” ํ•™์Šต ์ค‘์— decoder๊ฐ€ ์ถ”๋ก ํ•ด์•ผํ•  ๊ฐ’๋ณด๋‹ค ๋” ๋ฏธ๋ž˜์˜ ์ •๋ณด๋“ค์„ ๋ณด์ง€ ๋ชปํ•˜๋„๋ก ๋ง‰๋Š” ์—ญํ• ์„ ํ•œ๋‹ค. decoder์—์„œ ์ฒซ ๋ฒˆ์งธ multi-head attention์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ๋ฏธ๋ž˜์˜ ๊ฐ’์„ ๋ณด์ง€ ๋ชปํ•˜๋„๋ก, ์ •ํ™•ํžˆ๋Š” ๋ฏธ๋ž˜์˜ ์ •๋ณด๊ฐ’์ด ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ๋ชปํ•˜๋„๋ก ์•„์ฃผ ์ž‘๊ฒŒ ๋งŒ๋“ค์–ด ํ•™์Šต์˜ ์˜ํ–ฅ๊ถŒ์—์„œ ๋นผ๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

6. Transformer

class Transformer(tf.keras.Model):
    def __init__(
        self,
        num_enc_layers,
        num_dec_layers,
        d_model,
        num_heads,
        d_ffn,
        input_vocab_size,
        target_vocab_size,
        max_pos_enc,
        dropout_rate=0.1,
    ):
        super(Transformer, self).__init__()

        # Encoder
        self.encoder = Encoder(
            num_enc_layers, d_model, num_heads, d_ffn, input_vocab_size, dropout_rate
        )

        # Decoder
        self.decoder = Decoder(
            num_dec_layers, d_model, num_heads, d_ffn, target_vocab_size, dropout_rate
        )

        # output
        self.dense = layers.Dense(target_vocab_size)

    def call(
        self,
        enc_input,
        dec_input,
        encoder_padding_mask,
        look_ahead_mask,
        decoder_padding_mask,
        training,
    ):
        # Encoder
        encoder_output = self.encoder(enc_input, encoder_padding_mask, training)

        # Decoder
        decoder_output = self.decoder(
            dec_input, encoder_output, look_ahead_mask, decoder_padding_mask, training
        )

        # output
        output = self.dense(decoder_output)

        return output

์ด์ œ๊ป ๊ตฌํ˜„ํ•œ ๋ ˆ์ด์–ด๋“ค์„ ์ˆœ์ฐจ์ ์œผ๋กœ ์Œ“์•„์ฃผ๊ธฐ๋งŒ ํ•˜๋ฉด ๋œ๋‹ค.

To be honest, I haven't really tested (...) the model; the point was to express the architecture in code.
If I get around to finding data and training it, I'll add that as well.
