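Speculative Decoding Guide

A practical guide to speculative decoding, a technique for accelerating LLM inference.

Overview

Speculative decoding speeds up inference by letting a small draft model generate several tokens ahead of time, which the large target model then verifies in a single forward pass.

Item	Description
Core idea	Draft-then-verify paradigm
Speedup	2-3x (depends on the task and model pair)
Quality	Lossless (the output distribution is guaranteed to match the target model's)
Applies to	Autoregressive LLMs in general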

How It Works

┌─────────────────────────────────────────────────────────┐
│                  Speculative Decoding                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. Draft Phase                                         │
│  ┌─────────────┐                                        │
│  │ Draft Model │ ──→ [t1, t2, t3, t4, t5]               │
│  │   (Small)   │      γ drafted tokens                  │
│  └─────────────┘                                        │
│                                                         │
│  2. Verify Phase                                        │
│  ┌─────────────┐                                        │
│  │Target Model │ ──→ parallel verify (1 forward pass)   │
│  │   (Large)   │                                        │
│  └─────────────┘                                        │
│                                                         │
│  3. Accept/Reject                                       │
│  [t1 ✓] [t2 ✓] [t3 ✓] [t4 ✗] [t5 -]                     │
│   accepted=3; next token = target's t4'                 │
│                                                         │
└─────────────────────────────────────────────────────────┘
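The accept/reject step in the diagram can be sketched for the greedy (argmax) case as follows. This is a minimal illustration, not any library's API: `accept_greedy` and the token ids are hypothetical, and it assumes the target's single verification pass has already produced its own preferred token at each of the γ + 1 positions.

```python
from typing import List, Sequence


def accept_greedy(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one draft-then-verify round (greedy case).

    draft  : the gamma tokens proposed by the draft model
    target : the target model's argmax token at each of the gamma + 1
             positions, all obtained from the single verification pass
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must cover gamma + 1 positions")
    n = 0
    while n < len(draft) and draft[n] == target[n]:
        n += 1
    # Keep the matching prefix, then append the target's token at the first
    # mismatch (or its extra token when every draft token matched).
    return list(draft[:n]) + [target[n]]


# Mirrors the diagram: t1..t3 match, t4 is rejected, t5 is never used.
draft = [11, 12, 13, 14, 15]
target = [11, 12, 13, 99, 15, 16]
print(accept_greedy(draft, target))  # [11, 12, 13, 99]
```

Every round therefore emits at least one token, so the loop never does worse than one target pass per token in terms of target calls.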

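Mathematical Background

Let $q(x)$ be the draft model's probability distribution and $p(x)$ the target model's:

Acceptance probability: $$\alpha = \min\left(1, \frac{p(x)}{q(x)}\right)$$

Rejection sampling:
- $q(x) \leq p(x)$: always accept
- $q(x) > p(x)$: accept with probability $p(x)/q(x)$, otherwise reject
- On rejection, resample the token from the residual distribution $\max(0, p(x) - q(x))$, renormalized, and discard the remaining draft tokens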

This scheme reproduces the target model's output distribution exactly (lossless).
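The lossless property can be checked numerically on a toy vocabulary. The sketch below computes, in closed form, the distribution of the token emitted at one position: the mass that is drafted and accepted, plus the rejected mass redistributed over the renormalized residual $\max(0, p - q)$. The function name and the example distributions are made up for illustration.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted at one speculative position.

    p: target distribution, q: draft distribution (same vocabulary, each sums to 1).
    """
    # P(x is drafted and accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x))
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, the token is resampled from max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # p and q are identical, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / norm for a, r in zip(accepted, residual)]


p = [0.6, 0.3, 0.1]  # target (toy values)
q = [0.3, 0.5, 0.2]  # draft (toy values)
out = speculative_output_distribution(p, q)
acceptance = sum(min(pi, qi) for pi, qi in zip(p, q))
print([round(v, 6) for v in out])  # equals p up to rounding
print(round(acceptance, 6))        # overall chance that a draft token is accepted
```

The second number is the expected acceptance rate for this pair of distributions, $\sum_x \min(p(x), q(x))$: the closer the draft is to the target, the closer it gets to 1.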

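Main Variants

1. Standard Speculative Decoding

# Basic structure (sketch: generate(), forward() and verify_and_accept() are placeholder interfaces)
def speculative_decode(draft_model, target_model, prompt, gamma=5, max_new_tokens=128):
    tokens = []
    while len(tokens) < max_new_tokens:
        # Condition on the prompt plus everything accepted so far
        context = prompt + tokens

        # Draft: generate γ tokens
        draft_tokens = draft_model.generate(context, num_tokens=gamma)

        # Verify: score every drafted position in a single forward pass
        logits = target_model.forward(context + draft_tokens)

        # Accept/Reject: keep the accepted prefix plus one token from the target
        accepted = verify_and_accept(draft_tokens, logits)
        tokens.extend(accepted)

    return tokens[:max_new_tokens]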
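For a version that actually runs, here is a toy end-to-end sketch with rejection sampling. Everything in it is an assumption made for illustration: the "models" are plain functions that map a context to a next-token distribution over a 4-token vocabulary, and the verify phase calls the target once per prefix, whereas a real system obtains all γ + 1 distributions from one batched forward pass.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]
Model = Callable[[Sequence[int]], Dist]  # context -> next-token distribution


def sample(dist: Dist, rng: random.Random) -> int:
    # random.choices accepts unnormalized, non-negative weights
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(draft: Model, target: Model, context: List[int],
                     gamma: int, rng: random.Random) -> List[int]:
    """One draft-then-verify round; returns between 1 and gamma + 1 new tokens."""
    # Draft phase: sample gamma tokens autoregressively from the draft model.
    drafted: List[int] = []
    q_dists: List[Dist] = []
    for _ in range(gamma):
        q = draft(context + drafted)
        drafted.append(sample(q, rng))
        q_dists.append(q)
    # Verify phase (toy): target distribution at each of the gamma + 1 positions.
    p_dists = [target(context + drafted[:i]) for i in range(gamma + 1)]
    out: List[int] = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted
            continue
        # Rejected: resample from the residual max(0, p - q), then stop the round.
        residual = [max(0.0, a - b) for a, b in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one extra token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out


def toy_target(context: Sequence[int]) -> Dist:
    last = context[-1] if context else 0
    dist = [0.1, 0.1, 0.1, 0.1]
    dist[(last + 1) % 4] = 0.7  # strongly prefers "last token + 1"
    return dist


def toy_draft(context: Sequence[int]) -> Dist:
    last = context[-1] if context else 0
    dist = [0.15, 0.15, 0.15, 0.15]
    dist[(last + 1) % 4] = 0.55  # same preference, less confident
    return dist


def generate(draft: Model, target: Model, prompt: List[int],
             max_new_tokens: int, gamma: int = 4, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(draft, target, tokens, gamma, rng))
    return tokens[len(prompt):len(prompt) + max_new_tokens]


print(generate(toy_draft, toy_target, [0], max_new_tokens=12))
```

Sampling the first generated token over many seeds gives frequencies that track `toy_target`, not `toy_draft`, which is the lossless property in action.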

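2. Self-Speculative Decoding

Uses the model's own early layers (early exit) as the draft:

Pros	Cons
No separate draft model needed	Requires modifying the model
Memory-efficient	Hard to apply to every architecture
KV cache can be shared

3. Staged Speculative Decoding

Uses draft models in several stages: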

Tiny Draft → Small Draft → Target
    ↓           ↓            ↓
  ~10M       ~1B          ~70B

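4. DistillSpec / Online Speculative Decoding

Both use knowledge distillation to align the draft model with the target; Online Speculative Decoding does so continuously while serving:

- Adapts the draft model at runtime
- Raises the token acceptance rate
- Online Speculative Decoding is UC Berkeley work; DistillSpec comes from Google (both first published in 2023)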

5. Mirror Speculative Decoding

Apple research (2026). Parallel draft paths remove the serial bottleneck:

      ┌─ Draft Path A ─┐
Input ┼─ Draft Path B ─┼→ Parallel Verify
      └─ Draft Path C ─┘

Performance Analysis

Speedup Formula

\[\text{Speedup} = \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)(c \cdot \gamma + 1)}\]

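- $\alpha$: average acceptance rate
- $\gamma$: speculation length (number of draft tokens per round)
- $c$: cost ratio, the time for one draft-model step divided by the time for one target-model step

The factor $(1 - \alpha^{\gamma+1})/(1-\alpha)$ is the expected number of tokens produced per round, and $c \cdot \gamma + 1$ is the cost of that round measured in target-model steps.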
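As a quick worked example, the formula can be evaluated directly. The helper name and the sample values ($\alpha = 0.8$, $\gamma = 5$, $c = 0.1$) are chosen here purely for illustration.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup from the formula above.

    alpha: average acceptance rate, 0 <= alpha < 1
    gamma: number of draft tokens per round
    c:     cost of one draft step relative to one target step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # tokens per round
    cost_per_round = c * gamma + 1                               # in target-step units
    return expected_tokens / cost_per_round


print(round(expected_speedup(alpha=0.8, gamma=5, c=0.1), 3))  # 2.46
```

With these values a round yields about 3.69 tokens at a cost of 1.5 target steps, hence roughly 2.46x.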

Choosing the Optimal γ

Acceptance Rate	Recommended γ
> 0.9	8-12
0.7-0.9	5-8
0.5-0.7	3-5
< 0.5	2-3
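One way to sanity-check such a table for a specific deployment is to sweep γ through the speedup formula. The sketch below does this under an assumed cost ratio of $c = 0.1$; `best_gamma` and the search bound of 16 are illustrative choices, and measured acceptance rates should take precedence over the model.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    # Same speedup formula as in the performance analysis section
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (c * gamma + 1))


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Return the gamma in [1, max_gamma] that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: expected_speedup(alpha, g, c))


for alpha in (0.5, 0.8, 0.9):
    g = best_gamma(alpha, c=0.1)
    print(f"alpha={alpha}: gamma={g}, speedup={expected_speedup(alpha, g, 0.1):.2f}x")
```

A cheaper draft (smaller $c$) shifts the optimum toward longer speculation, so the recommended ranges are best read as starting points.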

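Performance by Task

Task	Expected Speedup	Notes
Translation	2.5-3.0x	Highly predictable
Summarization	2.0-2.5x
Code generation	2.0-2.8x	Structural patterns
Dialogue	1.5-2.0x	Varied responses
Creative writing	1.3-1.8x	Low predictability

Implementation Guide

vLLM Implementation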

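from vllm import LLM, SamplingParams

# Configure the draft and target models.
# Note: these keyword arguments follow older vLLM releases; newer releases take a
# speculative_config dict instead, so check the docs for the installed version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    speculative_model="meta-llama/Meta-Llama-3-8B",
    num_speculative_tokens=5,
    speculative_draft_tensor_parallel_size=1,
)

# Generate
prompts = ["Explain speculative decoding in one paragraph."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7))
for output in outputs:
    print(output.outputs[0].text)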

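HuggingFace Implementation

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", device_map="auto")

# Tokenize the prompt and move it to the target model's device
inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)

# Assisted generation: the draft model proposes tokens, the target verifies them
outputs = target.generate(
    **inputs,
    assistant_model=draft,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))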

TensorRT-LLM Implementation

# Configuration (illustrative: exact option names depend on the TensorRT-LLM version)
config = {
    "speculative_decoding_mode": "draft_tokens_external",
    "max_draft_tokens": 5,
    "draft_model_path": "/path/to/draft",
}

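Choosing a Draft Model

Selection Criteria

- Vocabulary compatibility: the draft must share the target's tokenizer, or the two vocabularies must be aligned
- Speed ratio: a draft around 10-20% of the target's size is ideal
- Domain fit: pick a draft suited to the task domain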
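The vocabulary criterion can be checked before anything is deployed. The sketch below compares two token-to-id dictionaries; `vocab_overlap` and the tiny vocabularies are hypothetical, and with HuggingFace tokenizers such dictionaries can be obtained from `tokenizer.get_vocab()`.

```python
from typing import Dict


def vocab_overlap(draft_vocab: Dict[str, int], target_vocab: Dict[str, int]) -> float:
    """Fraction of draft tokens that exist in the target vocabulary with the same id."""
    if not draft_vocab:
        return 0.0
    same = sum(1 for tok, idx in draft_vocab.items() if target_vocab.get(tok) == idx)
    return same / len(draft_vocab)


# Toy vocabularies: three of the four draft tokens line up with the target.
draft_vocab = {"<s>": 0, "hello": 1, "world": 2, "!": 3}
target_vocab = {"<s>": 0, "hello": 1, "world": 2, "?": 3}
print(vocab_overlap(draft_vocab, target_vocab))  # 0.75
```

A value of 1.0 means draft token ids can be passed to the target as-is; anything lower calls for an alignment step.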

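Resolving Vocabulary Mismatch

ICML 2025 research on supporting heterogeneous vocabularies:

# Vocabulary alignment layer (sketch: maps draft token ids to target ids by token string)
class VocabAligner:
    def __init__(self, draft_vocab, target_vocab):
        # draft_vocab / target_vocab: dict mapping token string -> token id
        self.mapping = self._build_alignment(draft_vocab, target_vocab)

    @staticmethod
    def _build_alignment(draft_vocab, target_vocab):
        # Only tokens present in both vocabularies get a mapping
        return {i: target_vocab[tok] for tok, i in draft_vocab.items() if tok in target_vocab}

    def align(self, draft_tokens):
        # Ids without a counterpart pass through unchanged; a real system must handle them explicitly
        return [self.mapping.get(t, t) for t in draft_tokens]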

Production Considerations

Memory Management

KV cache requirements (relative to target-only serving):
- Target only: 1x
- Speculative (separate draft model): 1.1-1.2x
- Self-speculative: 1x (shared)
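The overhead of a separate draft model can be estimated from the model shapes. The sketch below uses the usual KV cache size estimate (keys plus values, per layer, per KV head); the layer counts, head counts and head dimensions are assumed values for a large target and a small draft, not measurements of any specific checkpoint.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size; bytes_per_elem=2 corresponds to fp16/bf16."""
    # Leading 2 = one key tensor plus one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed shapes, for illustration only
target = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
draft = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=64, seq_len=4096)

print(f"target: {target / 2**30:.3f} GiB")
print(f"draft:  {draft / 2**30:.3f} GiB")
print(f"total:  {(target + draft) / target:.2f}x")  # 1.10x
```

The ratio depends entirely on how the draft's depth and KV width compare with the target's, so a larger draft can land well above the 1.1-1.2x range.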

Batching

- Homogeneous batch: identical prompt lengths, so batching is efficient
- Heterogeneous batch: incurs padding overhead
- Continuous batching: vLLM recommended

Dynamic γ Adjustment

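class AdaptiveSpeculator:
    def __init__(self, initial_gamma=5, min_gamma=2, max_gamma=12):
        self.gamma = initial_gamma
        self.min_gamma = min_gamma
        self.max_gamma = max_gamma
        self.acceptance_history = []

    def adjust_gamma(self, accepted_count, total_count):
        """Update γ from the latest round's acceptance rate and return it."""
        if total_count <= 0:
            return self.gamma  # nothing was drafted, so there is no rate to act on
        rate = accepted_count / total_count
        self.acceptance_history.append(rate)
        if rate > 0.85:
            self.gamma = min(self.max_gamma, self.gamma + 1)  # drafts are reliable: speculate further
        elif rate < 0.5:
            self.gamma = max(self.min_gamma, self.gamma - 1)  # too many rejections: speculate less
        return self.gamma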

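Application Scenarios

Good Fit

- Long-form generation (summarization, translation, document writing)
- Predictable patterns (code, templates)
- Latency-sensitive applications
- Single-request serving

Poor Fit

- Very short responses (< 20 tokens)
- Highly creative tasks
- High temperature (> 1.0)
- Environments with limited GPU memory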

Comparison with Other Inference Acceleration Techniques

Technique	Speedup	Quality Loss	Complexity
Speculative Decoding	2-3x	None	Medium
Quantization (INT8)	1.5-2x	Slight	Low
Continuous Batching	2-4x (throughput)	None	Medium
KV Cache Optimization	1.3-1.5x	None	Low
Model Pruning	1.5-2x	Yes	High

Key Papers

Paper	Year	Contribution
Fast Inference from Transformers via Speculative Decoding	2022	Original proposal (Google)
SpecInfer	2024	Tree-based speculation
DistillSpec / Online Speculative Decoding	2023	Draft distillation, online adaptation
Mirror Speculative Decoding	2026	Parallel draft paths
Heterogeneous Vocabulary SD	2025	Cross-vocab support

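Summary

Speculative decoding is a powerful technique that speeds up LLM inference by 2-3x while fully preserving output quality. With careful draft model selection, γ tuning, and memory management, it can be used effectively in production.

Example target/draft pairings:

Target Model	Draft Model	Notes
Llama 3 70B	Llama 3 8B	Same family
Mistral 7B	Mistral 7B (4-bit)	Quantized version
GPT-4	GPT-3.5	API-based
Qwen2 72B	Qwen2 7B	Same family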