[Paper Review] Mamba - Linear Time Sequence Modeling with Selective State Spaces 1

2024. 12. 11. 14:20 · 🍬 ML & Data / 📘 Paper & Model Reviews

It has already been more than a year since Mamba came out, but it feels like a million years since I last reviewed a recent paper, so I'm giving it a read... This may not be a very accurate review. It is honestly closer to a translation, and I'll revise the content as I understand it better.

 

1. Introduction

Recently, Structured State Space Sequence Models (SSMs) have emerged as a promising class of architectures for sequence modeling. Inspired by classical state space models, they can be interpreted as a combination of RNNs and CNNs.
Mamba proposes a new class of selective state space models. It improves on prior work along several axes in order to match the modeling power of Transformers while scaling linearly in sequence length.

Selection Mechanism

  • The key limitation of prior models is that they lack the ability to efficiently select data in an input-dependent manner.
  • Building on intuition from important synthetic tasks such as selective copying and induction heads, Mamba designs a simple selection mechanism by parameterizing the SSM parameters based on the input.
  • This lets the model filter out irrelevant information and remember relevant information.
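As a toy illustration of what "input-dependent" means here, the sketch below runs a one-dimensional recurrence whose step size is computed from each token. Everything in it is an assumption made for illustration, not the paper's implementation: the token layout `(value, marker)`, the names `selective_scan_toy`, `gain` and `bias`, the choice $A = -1$, $B = 1$, and the use of softplus to keep the step size positive.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))

def selective_scan_toy(tokens, gain=20.0, bias=-10.0):
    """Toy scalar recurrence with an input-dependent step size (hypothetical).

    Each token is (value, marker). The step size depends on the marker:
    marker = 1 gives a large step (state is overwritten by the value),
    marker = 0 gives a tiny step (state is kept, value is ignored).
    """
    h = 0.0
    for value, marker in tokens:
        delta = softplus(gain * marker + bias)   # step size chosen from the input
        keep = math.exp(-delta)                  # exp(delta * A) with A = -1
        h = keep * h + (1.0 - keep) * value      # zero-order hold update with B = 1
    return h

# One relevant token followed by two irrelevant ones.
tokens = [(5.0, 1), (3.0, 0), (-2.0, 0)]
final_state = selective_scan_toy(tokens)
print(final_state)
```

With these assumed settings the final state stays close to the marked value 5.0, because the unmarked tokens are almost entirely filtered out. A model with fixed parameters cannot make this per-token choice.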

Hardware-aware Algorithm

Architecture

Mamba combines the SSM architecture published in 2023 with the MLP block of Transformers into a single block, giving a simple and homogeneous architecture design that incorporates selective state spaces.
Selective SSMs are fully recurrent models, and they have key properties that make them suitable as the backbone of general foundation models operating on sequences.

  1. high quality
  2. fast training and inference
  3. long context

The paper evaluates Mamba's potential as a general sequence FM backbone, in terms of both pretraining quality and domain-specific task performance, across several types of modalities and settings.

2. SSM(State Space Model)

Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are related to RNNs, CNNs, and classical state space models. They are inspired by a particular continuous system that maps a 1-dimensional function or sequence $x(t) \in \mathbb{R} \mapsto y(t) \in \mathbb{R}$ through an implicit latent state $h(t) \in \mathbb{R}^{N}$.
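For reference, the continuous system described above is usually written as the pair of equations below. It uses the same symbols as the text, with $A$, $B$, $C$ being the parameters introduced in the next paragraph.

$$
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
$$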

An S4 model is defined by four parameters ($\Delta, A, B, C$), which define a sequence-to-sequence transformation in two stages.

Discretization

The continuous parameters ($\Delta, A, B$) are transformed into discrete parameters ($\bar{A}, \bar{B}$). The transformation rule used here is the zero-order hold (ZOH).
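Under the zero-order hold rule, the discrete parameters are $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$. The following is a minimal sketch of that rule for the special case of a diagonal $A$, where everything reduces to element-wise scalar arithmetic. The function name `zoh_discretize_diagonal` and the example values are assumptions for illustration.

```python
import math

def zoh_discretize_diagonal(delta, a_diag, b):
    """Zero-order hold for a diagonal A, applied one state dimension at a time.

    delta  : positive step size (scalar)
    a_diag : the N diagonal entries of A
    b      : the N entries of B
    Returns (a_bar, b_bar), each a list of N numbers.
    """
    a_bar, b_bar = [], []
    for a_n, b_n in zip(a_diag, b):
        da = delta * a_n
        a_bar.append(math.exp(da))
        if da == 0.0:
            # Limit of (exp(da) - 1) / da as da -> 0 is 1.
            b_bar.append(delta * b_n)
        else:
            # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b; expm1 is accurate for small da.
            b_bar.append(math.expm1(da) / da * delta * b_n)
    return a_bar, b_bar

a_bar, b_bar = zoh_discretize_diagonal(delta=0.1, a_diag=[-1.0, -2.0], b=[1.0, 1.0])
print(a_bar, b_bar)
```

For a non-diagonal $A$ the same formula needs a matrix exponential and a matrix inverse, which this sketch deliberately avoids.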

Discretization has deep connections to continuous-time systems, which can give the model additional properties such as resolution invariance and automatically ensure that the model is properly normalized. It is also connected to the gating mechanisms of RNNs. From a mechanical point of view, however, discretization can simply be seen as the first step of the computation graph in the forward pass of an SSM. Alternate flavors of SSMs can skip the discretization step and parameterize ($\bar{A}, \bar{B}$) directly, which may be easier to reason about.

Computation

Once the parameters have been transformed into discrete parameters, the model can be computed in two ways: as a linear recurrence or as a global convolution.

Commonly, the model uses the convolutional mode for efficient parallelizable training and then switches to the recurrent mode for efficient autoregressive inference.
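The two modes can be sketched as follows for a diagonal $\bar{A}$. The recurrence is $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$, and the convolution uses the kernel $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \dots, C\bar{A}^{L-1}\bar{B})$. The function names and the example values are assumptions for illustration, and the convolution is written as a direct $O(L^2)$ sum for readability rather than the faster methods a real implementation would use.

```python
def ssm_recurrent(a_bar, b_bar, c, xs):
    """Recurrent mode: update the state one time step at a time."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [a * h_n + b * x for a, b, h_n in zip(a_bar, b_bar, h)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return ys

def ssm_kernel(a_bar, b_bar, c, length):
    """Kernel entries K_k = C * A_bar^k * B_bar for k = 0 .. length-1."""
    return [sum(c_n * (a ** k) * b for a, b, c_n in zip(a_bar, b_bar, c))
            for k in range(length)]

def ssm_convolutional(a_bar, b_bar, c, xs):
    """Convolutional mode: y_t = sum_{k=0..t} K_k * x_{t-k} (causal convolution)."""
    kernel = ssm_kernel(a_bar, b_bar, c, len(xs))
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1))
            for t in range(len(xs))]

# Small example with N = 2 state dimensions and a length-4 input.
a_bar, b_bar, c = [0.9, 0.5], [0.1, 0.5], [1.0, -1.0]
xs = [1.0, 0.0, 2.0, -1.0]
y_rec = ssm_recurrent(a_bar, b_bar, c, xs)
y_conv = ssm_convolutional(a_bar, b_bar, c, xs)
print(y_rec)
print(y_conv)
```

Both functions return the same outputs up to floating-point rounding. The equivalence relies on $\bar{A}$, $\bar{B}$ and $C$ being the same at every time step, since otherwise no single kernel exists.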

Linear Time Invariance (LTI)

Here, $(\Delta, A, B, C)$ and consequently $(\bar{A}, \bar{B})$ are fixed for all time steps. This property is called Linear Time Invariance, and it is deeply connected to recurrence and convolutions. Informally, LTI SSMs can be thought of as equivalent to any linear recurrence or convolution, and LTI can be used as an umbrella term for this class of models.

Because of fundamental efficiency constraints, all structured SSMs so far have been LTI. However, a core insight of the paper is that LTI models have fundamental limitations in modeling certain types of data. The paper's technical contributions involve removing the LTI constraint while overcoming the efficiency bottlenecks...

Structure and Dimensions

The most popular form of structure is diagonal. In this case, the matrices $A \in \mathbb{R}^{N \times N}, B \in \mathbb{R}^{N \times 1}, C \in \mathbb{R}^{1 \times N}$ can all be represented by $N$ numbers. To operate over an input sequence $x$ with batch size $B$, length $L$, and $D$ channels, the SSM is applied independently to each channel. In this case, the total hidden state has dimension $DN$ per input, and computing it over the sequence length requires $O(BLDN)$ time and memory.
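The per-channel application and the $BLDN$ count can be made concrete with a small sketch. It assumes diagonal, already discretized parameters stored as $D \times N$ nested lists and an input stored as a $B \times L \times D$ nested list. The function name `apply_per_channel` and the `updates` counter are hypothetical and exist only to show where the cost comes from.

```python
def apply_per_channel(a_bar, b_bar, c, x):
    """Run an independent diagonal SSM on every channel of every input.

    x                : nested list of shape (B, L, D)
    a_bar, b_bar, c  : nested lists of shape (D, N)
    Returns (y, updates), where y has shape (B, L, D) and updates counts
    the scalar state updates that were performed.
    """
    batch, length, channels = len(x), len(x[0]), len(x[0][0])
    n_state = len(a_bar[0])
    y = [[[0.0] * channels for _ in range(length)] for _ in range(batch)]
    updates = 0
    for i in range(batch):
        # One hidden state of D * N numbers per input sequence.
        h = [[0.0] * n_state for _ in range(channels)]
        for t in range(length):
            for d in range(channels):
                for n in range(n_state):
                    h[d][n] = a_bar[d][n] * h[d][n] + b_bar[d][n] * x[i][t][d]
                    updates += 1
                y[i][t][d] = sum(c[d][n] * h[d][n] for n in range(n_state))
    return y, updates

B, L, D, N = 2, 3, 2, 4
a_bar = [[0.5] * N for _ in range(D)]
b_bar = [[1.0] * N for _ in range(D)]
c = [[1.0] * N for _ in range(D)]
x = [[[1.0] * D for _ in range(L)] for _ in range(B)]
y, updates = apply_per_channel(a_bar, b_bar, c, x)
print(updates)
```

This sketch keeps only the current state, so it illustrates the time cost. The $O(BLDN)$ memory figure in the text corresponds to storing the state for every time step.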

General State Space Models

  • Markov Decision Process(MDP)
  • Dynamic Causal Modeling(DCM)
  • Kalman Filters
  • Hidden Markov Models(HMM)
  • Linear Dynamical Systems(LDS)
  • Recurrent models at large in deep learning

SSM Architectures

  • Linear attention
  • H3
  • Hyena
  • RetNet
  • RWKV