[Paper Review] Mamba - Linear Time Sequence Modeling with Selective State Spaces 2

2024. 12. 11. 14:23 · ML & Data / Paper & Model Reviews

3. Selective State Space Models

3.1 Selection as a Means of Compression

Figure 2 (left): the standard version of the Copying task keeps a constant distance between input and output, so time-invariant models such as global convolutions or linear recurrences solve it easily. In Selective Copying the distance between inputs is random, so the task needs a time-varying model that can decide, based on content, whether to remember or ignore each input. The Induction Heads task relates to a key ability of LLMs: retrieving an answer based on context.

Two running examples of synthetic tasks:

  • Selective Copying: modifies the Copying task by varying the positions of the tokens to memorize. Memorizing the relevant tokens and filtering out the irrelevant ones requires content-aware reasoning.
  • Induction Heads: knowing when to produce the right output in the appropriate context requires content-aware reasoning. It is a well-known mechanism hypothesized to explain much of the in-context learning ability of LLMs.
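To make the difference between the two copying tasks concrete, here is a minimal sketch of how such data might be generated. The function names, the vocabulary size, and the `NOISE` filler token are hypothetical choices for illustration, not the paper's actual data pipeline.

```python
import random

NOISE = 0  # hypothetical filler token; data tokens are 1..vocab_size


def make_copying(num_tokens, gap, vocab_size=8, rng=random):
    """Vanilla Copying: data tokens come first, followed by a fixed-length gap."""
    tokens = [rng.randint(1, vocab_size) for _ in range(num_tokens)]
    inputs = tokens + [NOISE] * gap
    return inputs, tokens


def make_selective_copying(num_tokens, seq_len, vocab_size=8, rng=random):
    """Selective Copying: data tokens sit at random positions among noise."""
    tokens = [rng.randint(1, vocab_size) for _ in range(num_tokens)]
    positions = sorted(rng.sample(range(seq_len), num_tokens))
    inputs = [NOISE] * seq_len
    for pos, tok in zip(positions, tokens):
        inputs[pos] = tok
    return inputs, tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_copying(4, gap=6, rng=rng))
    print(make_selective_copying(4, seq_len=10, rng=rng))
```

In the vanilla task the target positions are fixed, so knowing the time offset is enough. In the selective variant the model has to look at each token's content to tell data from noise.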

These tasks reveal the failure mode of LTI models. From the recurrent view, constant dynamics (here $\bar{A}, \bar{B}$) cannot select the correct information from the context, nor can they affect the hidden state passed along the sequence in an input-dependent way.
From the convolutional view, global convolutions are known to solve the vanilla Copying task because it only requires time-awareness. They struggle with the Selective Copying task, however, because they lack content-awareness.
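The two views of an LTI model can be checked numerically. The sketch below assumes a scalar state and hypothetical names (`a_bar`, `b_bar`, `c`). It computes the same output once as a recurrence with constant dynamics and once as a causal convolution with the kernel $K_k = c\,\bar{a}^k\,\bar{b}$.

```python
import numpy as np


def lti_recurrent(x, a_bar, b_bar, c):
    """Recurrent view: h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = c * h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        ys.append(c * h)
    return np.array(ys)


def lti_convolution(x, a_bar, b_bar, c):
    """Convolutional view: y = x * K with K_k = c * a_bar**k * b_bar."""
    length = len(x)
    kernel = c * b_bar * a_bar ** np.arange(length)
    # Full convolution has length 2L - 1; the first L entries are the causal part.
    return np.convolve(x, kernel)[:length]


if __name__ == "__main__":
    x = np.array([1.0, 0.0, 2.0, -1.0, 0.5])
    print(np.allclose(lti_recurrent(x, 0.9, 0.5, 2.0),
                      lti_convolution(x, 0.9, 0.5, 2.0)))
```

The kernel depends only on the time offset `k`, never on the values in `x`. That is the time-awareness without content-awareness described above.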

The trade-off between efficiency and effectiveness in sequence models (how cheaply the goal is reached vs. how well it is reached) is characterized by how well they compress their state. An efficient model must have a small state, whereas an effective model must have a state that contains all the necessary information from the context.

The paper proposes selectivity as the fundamental principle for building sequence models: the context-aware ability to focus on inputs or to filter them out of the sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension.

3.2 Improving SSMs with Selection

One way to incorporate a selection mechanism into a model is to make the parameters that affect interactions along the sequence depend on the input.

The main difference from a standard SSM is that the parameters $\Delta, B, C$ are defined as functions of the input, and the tensor shapes change accordingly. In particular, these parameters now have a length dimension L, which means the model has changed from time-invariant to time-varying.

$$\begin{matrix}
s_{B}(x) &=& Linear_{N}(x) \\
s_{C}(x) &=& Linear_{N}(x) \\
s_{\Delta}(x) &=& Broadcast_{D}(Linear_{1}(x)) \\
\tau_{\Delta} &=& softplus
\end{matrix}$$

$Linear_{d}$ is a parameterized projection to dimension $d$. The choice of $s_{\Delta}$ and $\tau_{\Delta}$ comes from their connection to RNN gating mechanisms.
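The shapes involved may be easier to follow in code. Below is a minimal NumPy sketch of the selection functions under some assumptions: the weight names (`W_B`, `W_C`, `w_delta`) are hypothetical, the linear layers have no bias, and `delta_bias` stands in for the learned parameter that is added before applying $\tau_{\Delta}$.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def selection_params(x, W_B, W_C, w_delta, delta_bias):
    """x: (B, L, D) -> B_sel, C_sel: (B, L, N) and delta: (B, L, D).

    W_B, W_C: (D, N); w_delta: (D, 1); delta_bias: (D,). All names are
    illustrative assumptions, not identifiers from the paper's code.
    """
    B_sel = x @ W_B                               # s_B(x) = Linear_N(x)
    C_sel = x @ W_C                               # s_C(x) = Linear_N(x)
    s_delta = x @ w_delta                         # Linear_1(x): (B, L, 1)
    s_delta = np.broadcast_to(s_delta, x.shape)   # Broadcast_D: (B, L, D)
    delta = softplus(delta_bias + s_delta)        # tau_Delta = softplus
    return B_sel, C_sel, delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 5, 4))                # (B, L, D)
    out = selection_params(x, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)),
                           rng.normal(size=(4, 1)), np.zeros(4))
    print([a.shape for a in out])
```

Every output carries the length dimension L, so each position in the sequence gets its own $\Delta$, $B$, and $C$.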

3.3 Efficient Implementation of Selective SSMs

The selection mechanism is quite natural, and earlier work tried to incorporate special cases of selection, such as letting $\Delta$ vary over time in recurrent SSMs. However, the core limitation in using SSMs is computational efficiency. This is why S4 and its derivatives use LTI (non-selective) models, most commonly in the form of global convolutions.
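A sequential reference implementation shows why a time-varying model cannot simply reuse one global convolution kernel: the transition changes at every step, so the state has to be carried forward explicitly. The sketch below is an assumption-laden illustration, not the paper's hardware-aware algorithm. It assumes a diagonal `A` of shape (D, N) and the discretization $\bar{A} = \exp(\Delta A)$, $\bar{B} = \Delta B$, and the function name is hypothetical.

```python
import numpy as np


def selective_scan_reference(x, delta, A, B_sel, C_sel):
    """Naive sequential scan with input-dependent parameters.

    x, delta: (B, L, D); A: (D, N); B_sel, C_sel: (B, L, N) -> y: (B, L, D)
    """
    batch, length, d_model = x.shape
    d_state = A.shape[1]
    h = np.zeros((batch, d_model, d_state))       # running state (B, D, N)
    y = np.empty_like(x, dtype=float)
    for t in range(length):
        dt = delta[:, t, :, None]                 # (B, D, 1)
        A_bar = np.exp(dt * A)                    # (B, D, N), differs per step
        B_bar = dt * B_sel[:, t, None, :]         # (B, D, N), differs per step
        h = A_bar * h + B_bar * x[:, t, :, None]  # state update
        y[:, t] = np.einsum("bdn,bn->bd", h, C_sel[:, t])
    return y


if __name__ == "__main__":
    x = np.ones((1, 2, 1))
    A = np.zeros((1, 1))
    B_sel = np.full((1, 2, 1), 2.0)
    C_sel = np.full((1, 2, 1), 3.0)
    delta = np.array([[[1.0], [0.0]]])            # second step has delta = 0
    print(selective_scan_reference(x, delta, A, B_sel, C_sel))
```

In this toy case, setting `delta` to 0 at a step leaves the state untouched, so that step's input is ignored. This is one way to see how an input-dependent $\Delta$ can act as a gate.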

3.3.1 Motivation of Prior Models

  • Recurrent models such as SSMs must balance expressivity against speed.
  • Because the convolution mode is derived from the recurrent mode, the recurrent mode is the more flexible of the two. However, it requires computing and materializing the latent state h of shape (B, L, D, N), which is larger than the input x and output y of shape (B, L, D) by a factor of N, the SSM state dimension. The convolution mode was therefore introduced as the more efficient option: it bypasses the state computation and materializes only a convolution kernel of shape (B, L, D).
  • Prior LTI state space models use the dual recurrent-convolutional forms to increase the effective state dimension by a factor of N (roughly 10-100), far beyond traditional RNNs, without an efficiency penalty.
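A quick back-of-the-envelope calculation illustrates the size gap between the materialized state and the input/output tensors. The sizes below are hypothetical and chosen only so the numbers come out round.

```python
from math import prod

# Hypothetical sizes, for illustration only.
batch, length, d_model, d_state = 8, 2048, 1024, 16
bytes_per_float = 4  # fp32

io_bytes = prod((batch, length, d_model)) * bytes_per_float
state_bytes = prod((batch, length, d_model, d_state)) * bytes_per_float

print(f"x or y (B, L, D):    {io_bytes / 2**20:.0f} MiB")
print(f"h      (B, L, D, N): {state_bytes / 2**20:.0f} MiB")
print(f"ratio: {state_bytes // io_bytes}x")
```

The ratio is exactly the state dimension N. This is the overhead that the convolution mode avoids by never materializing h.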