Mechanistic Interpretability

Metadata

Item	Details
Category	AI Safety / Interpretability / Reverse Engineering
Key paper	"Mechanistic Interpretability for AI Safety -- A Review" (Bereska & Gavves, 2024)
Key researchers	Chris Olah (Anthropic), Neel Nanda (Google DeepMind), Tom Conerly (Anthropic), Aaron Mueller (Boston U), Lee Sharkey (Goodfire)
Core idea	Reverse-engineer the internals of a neural network to extract its learned algorithms (circuits) and representations (features) in a human-understandable form
Related systems	TransformerLens, SAELens, Neuronpedia, Gemma Scope, Attribution Graphs
Related fields	Sparse Autoencoders, Representation Engineering, Causal Inference, AI Alignment, Probing

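Definition

Mechanistic Interpretability (MI) is the research field that reverse-engineers the internal workings of neural networks to extract the algorithms and representations a model has learned in a form humans can understand. It goes beyond behavioral interpretability, which only observes input-output behavior, and aims to pin down the specific computations carried out by individual neurons, attention heads, and MLP layers.

Core framework:

Neural networks are analyzed at three levels:

1. Features -- the basic unit of representation
   - Meaningful directions learned by the model
   - Distributed representations rather than single neurons
   - Examples: "word starting with a capital letter", "negative sentiment", "code syntax"

2. Circuits -- the basic unit of computation
   - Connection patterns among neurons/attention heads that implement a specific behavior
   - Information flow from input -> intermediate representations -> output
   - Examples: Induction Head, IOI Circuit, Greater-Than Circuit

3. Superposition -- the central obstacle for representation
   - Encodes more features than there are neurons
   - Sparse activation + near-orthogonal arrangement
   - The cause of polysemanticity

Key assumptions:
  Linear Representation Hypothesis: high-level concepts are represented as linear directions in activation space
  Superposition Hypothesis: the model encodes more features than it has dimensions by superposing sparsely active features
  Universality: similar features and circuits emerge independently in different models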

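Background: A History of MI

Phase 1: The Vision Model Era (2017--2020)

Early MI research began with feature visualization in CNNs:

1. Feature Visualization (Olah et al., 2017)
   - Uses optimization to generate the input patterns a neuron responds to
   - Showed that InceptionV1 neurons detect curves, textures, and object parts

2. Circuits in CNNs (Olah et al., Distill 2020)
   - "Zoom In" -- analyzes neural networks in terms of Features + Circuits + Universality
   - Found a circuit in InceptionV1 leading from curve detectors to circle/spiral detectors
   - Established the methodological foundation of MI

Key insights:
  - Neurons can be interpretable units (in vision models)
  - Connections between neurons form meaningful algorithms
  - Similar features also appear in other models (Universality)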

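Phase 2: Discovering Transformer Circuits (2021--2023)

Toward a mechanistic understanding of the Transformer architecture:

1. Mathematical Framework (Elhage et al., 2021)
   - "A Mathematical Framework for Transformer Circuits"
   - Treats the residual stream as the central channel of information flow
   - Decomposes attention heads into QK and OV circuits

2. Induction Heads (Olsson et al., 2022)
   - Identified a key mechanism behind in-context learning
   - Copies the pattern [A][B] ... [A] -> [B]
   - Two-layer structure: Previous Token Head + Induction Head

3. Theory of Superposition (Elhage et al., 2022)
   - "Toy Models of Superposition"
   - Characterizes superposition mathematically in synthetic models
   - Quantifies the relationship between feature sparsity and the degree of superposition
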
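4. IOI Circuit (Wang et al., ICLR 2023)
   - End-to-end analysis of the indirect object identification circuit in GPT-2 Small
   - Groups the roles of 26 attention heads into 7 functional classes
   - Isolates the relevant circuit within GPT-2 Small (roughly 124M parameters in total)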

Timeline:
  2021: Mathematical framework for Transformers established
  2022: Induction heads discovered + superposition theorized
  2023: Full IOI circuit analysis + first automated circuit discovery

Phase 3: Sparse Autoencoders and Scaling (2023--2024)

The SAE era, aimed at resolving superposition:

1. Towards Monosemanticity (Bricken et al., Anthropic 2023)
   - Extracts interpretable features from a 1-layer Transformer with an SAE
   - Decomposes a 512-neuron MLP layer into thousands of features (4,096 in the most-studied run)
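   - Demonstrates the decomposition of polysemantic neurons into monosemantic features

2. Scaling Monosemanticity (Templeton et al., Anthropic 2024)
   - Identifies millions of features in Claude 3 Sonnet
   - Finds high-level concepts such as the "Golden Gate Bridge" feature
   - Shows that manipulating features (steering) can change model behavior

3. OpenAI GPT-4 SAE (Gao et al., 2024)
   - Applies an SAE with 16M latents to GPT-4
   - Validates that SAEs scale to large language models

Key advances:
  - Shift from single-neuron analysis to distributed-feature analysis
  - SAEs became the standard tool for tackling superposition
  - Demonstrated that feature extraction is feasible on production-scale models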

Phase 4: Circuit Tracing and the Present (2025--2026)

Circuit tracing is industrialized, and the field confronts its limits:

1. Circuit Tracing / Attribution Graphs (Anthropic, 2025.03)
   - Visualizes the internal computation of Claude 3.5 Haiku as a graph
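   - Nodes are interpretable features of a replacement model (a cross-layer transcoder, a close relative of the SAE); edges are the influence of one feature on another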
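   - Uncovers mechanisms for multi-step reasoning, poetry writing, language-independent abstraction, and more
   - Circuit tracing succeeds on roughly 25% of the prompts tried
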
2. Open Problems consensus (Sharkey et al., 2025.01)
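   - Joint paper by 29 researchers from 18 institutions (commissioned by Schmidt Sciences)
   - Classifies problems as methodological, application-related, or socio-technical
   - Plays the role of an official roadmap for the MI field

3. DeepMind's strategic pivot (2025)
   - Found that SAEs underperform simple linear probes on practical tasks
   - Shifted from "ambitious reverse engineering" to "pragmatic interpretability"
   - Gemma Scope 2: released SAE infrastructure for models from 270M to 27B parameters

4. Theoretical limits
   - NP-hardness proofs: many circuit-discovery queries are computationally intractable (ICLR 2025)
   - Chaotic dynamics: in deep networks the effect of a steering vector becomes
     unpredictable after O(log(1/ε)) layers (Lyapunov exponents, 2025.12)
   - Regulatory Impossibility Theorem: complete interpretability + unrestricted capability +
     negligible error cannot all be achieved at once (2025)

Current status (2026):
  - Named a "2026 Breakthrough Technology" by MIT Technology Review
  - ICML 2026 Mechanistic Interpretability Workshop held
  - Anthropic: goal of "detecting most model problems by 2027"
  - Neel Nanda: "the most ambitious vision of MI is probably dead"
  - Consensus: "Swiss cheese model" -- MI is one of several safety layers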

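Core Concepts in Detail

1. Features

A feature is a meaningful unit of representation learned by the network. It exists as a direction in activation space and is encoded in distributed form rather than in a single neuron.

Feature hierarchy:

Low-level Features:
  - Token position, capitalization, grammatical category
  - Vision: edges, textures, colors

Mid-level Features:
  - Syntactic structure, semantic roles, entity types
  - Vision: object parts, patterns

High-level Features:
  - Abstract concepts, emotions, intent, topics
  - Vision: scenes, styles

Key properties of a feature:
  Monosemantic: corresponds to a single concept (the ideal)
  Polysemantic: corresponds to several concepts (the reality)
  Direction: represented as a unit vector in activation space
  Activation pattern: defined by which inputs activate it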

2. Circuits

A circuit is a computational subgraph that implements a specific behavior.

Circuit type	Reported by	Description	Components
Induction Head	Olsson et al., 2022	Copies the pattern [A][B]...[A]->[B]	Previous Token Head + Induction Head
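IOI Circuit	Wang et al., 2023	Indirect object identification	26 attention heads, 7 functional classes
Greater-Than	Hanna et al., 2023	Numerical comparison circuit	Combination of MLPs + attention heads
Copy Suppression	McDougall et al., 2023	Suppresses the prediction of tokens that already appear in the context	Negative (copy-suppression) head, L10H7 in GPT-2 Small
Docstring Circuit	Heimersheim & Janiak, 2023	Code documentation (Python docstring) pattern	Argument-name copying + formatting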
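
The [A][B]...[A]->[B] rule in the Induction Head row can be written down as an ordinary algorithm. The sketch below is only the input-output behavior that an induction circuit approximates, not how a Transformer implements it; the function name `induction_predict` is made up for illustration.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Predict the next token with the [A][B] ... [A] -> [B] rule.

    Finds the most recent earlier occurrence of the current (last) token and
    returns the token that followed it, or None if there is no earlier match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions right to left so the most recent match wins.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_predict(["Mr", "Dursley", "was", "proud", ".", "Mr"]))
```

In the two-head picture from the table, the Previous Token Head makes "which token came before me" available at each position, and the Induction Head uses that information to find the earlier [A] and copy the [B] after it.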

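3. Superposition

Superposition is the phenomenon in which a model encodes far more features than it has dimensions (d_model).

Mathematical structure of superposition:

A model with d dimensions encodes m >> d features:

  x_reconstructed = sum_{i=1}^{m} a_i * f_i

  f_i: feature vector (d-dimensional)
  a_i: activation strength (mostly 0 -- sparse)

Conditions under which it works:
  1. Sparsity: only a few features are active on any given input
  2. Near-orthogonality: inner products between distinct feature vectors are small (|<f_i, f_j>| << 1 for i != j)
  3. Tolerated interference: a small reconstruction error is accepted

Connection to Johnson-Lindenstrauss:
  O(exp(d)) nearly orthogonal vectors can be placed in d dimensions
  -> In theory a very large number of features can be superposed
  -> In practice this depends on the sparsity condition

Role of SAEs:
  Undoing superposition = Sparse Dictionary Learning
  encoder: activations -> sparse feature vector
  decoder: sparse feature vector -> reconstructed activations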
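
The three conditions above can be checked numerically. The following is a minimal NumPy sketch, assuming random unit vectors as stand-ins for learned feature directions; the sizes `d`, `m`, `k` and the top-k readout are illustrative choices, not values from any paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 512, 1024, 4  # dimensions, number of features, active features per input

# m random unit vectors in d dimensions (m > d), standing in for features f_i.
F = rng.standard_normal((m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Pairwise interference: absolute off-diagonal cosine similarities.
cos = F @ F.T
off_diag = np.abs(cos[~np.eye(m, dtype=bool)])

# Sparse input: only k of the m features are active (a_i = 1), the rest are 0.
active = rng.choice(m, size=k, replace=False)
a = np.zeros(m)
a[active] = 1.0
x = a @ F  # x = sum_i a_i * f_i

# Naive linear readout: project x onto every feature direction, keep the top k.
readout = F @ x
recovered = np.argsort(readout)[-k:]

print(f"max |cos| = {off_diag.max():.3f}, mean |cos| = {off_diag.mean():.3f}")
print("active set recovered:", set(recovered.tolist()) == set(active.tolist()))
```

Raising `k` toward `d` makes the interference terms add up and the readout degrades, which is the sense in which superposition depends on sparsity.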

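4. Attribution Graphs (2025)

A circuit-tracing methodology developed by Anthropic. It builds a computational graph whose nodes are interpretable features of a replacement model (cross-layer transcoder features, which play the role SAE features play elsewhere) and whose edges are the influence of one feature on another.

Attribution graph structure:

Input tokens -> [Layer 0 features] -> [Layer 1 features] -> ... -> Output logits

Nodes: interpretable features identified by the SAE-style dictionary
Edges: causal influence of one feature on the activation of another
Weights: attribution scores

Construction procedure:
  1. Apply the feature dictionary (SAE/transcoder) at every layer of the model
  2. Record the activation of each feature
  3. Connect features with edges weighted by (gradient-based) attribution
  4. Prune edges below a threshold
  5. Visualize/analyze the resulting graph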
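
Steps 3 and 4 of the procedure can be illustrated on a single linear step. This is a toy sketch, not Anthropic's implementation: it assumes a hypothetical matrix `weights` giving the linear effect of each upstream feature on each downstream feature, and scores an edge as activation times weight.

```python
import numpy as np


def attribution_edges(source_acts, weights, threshold=0.1):
    """Return (source, target, score) edges for one linear step.

    source_acts: shape [n_source], activations of upstream features.
    weights:     shape [n_target, n_source], assumed linear effect of each
                 upstream feature on each downstream feature's pre-activation.
    Edge score = activation * weight; edges with |score| < threshold are pruned.
    """
    scores = weights * source_acts[None, :]  # [n_target, n_source]
    edges = []
    for tgt, src in zip(*np.nonzero(np.abs(scores) >= threshold)):
        edges.append((int(src), int(tgt), float(scores[tgt, src])))
    # Strongest edges first, which is how a graph is usually inspected.
    return sorted(edges, key=lambda e: -abs(e[2]))


source_acts = np.array([2.0, 0.0, 1.0])
weights = np.array([[0.5, 0.9, -0.05],
                    [0.0, 0.3, 0.8]])
edges = attribution_edges(source_acts, weights)
print(edges)
```

Because the step is linear, the scores into a target sum to that target's pre-activation, so pruning small edges discards only a small share of the explained signal. An inactive source feature (activation 0) contributes no edge regardless of its weight.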

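Findings (Claude 3.5 Haiku):
  - Multi-step reasoning: "Dallas is in Texas" + "the capital of Texas is Austin"
    -> identified a circuit that combines the two facts
  - Poetry writing: rhyme-planning features activate before the actual word is chosen
  - Language-independent abstraction: English and French inputs activate the same intermediate features

Limitations:
  - Successful tracing on only about 25% of prompts
  - For the remaining 75% the computational path stays opaque
  - Manual analysis takes several hours per prompt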

Main Analysis Techniques

1. Activation Patching

Activation patching replaces the activation at a specific location with the activation from a different input in order to establish causal relationships.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

def activation_patching(model, clean_text, corrupted_text, layer, position):
    """
    Activation patching: identify causal relationships.

    Inserts the clean_text activation at one position into a corrupted_text run
    and measures the causal effect of that position on the final output.
    Assumes both texts tokenize to the same number of tokens.
    """
    # Cache activations from the clean run
    _, clean_cache = model.run_with_cache(clean_text)

    # Define the patch hook
    def patch_hook(activation, hook):
        activation[:, position, :] = clean_cache[hook.name][:, position, :]
        return activation

    # Insert the clean activation into the corrupted run
    patched_logits = model.run_with_hooks(
        corrupted_text,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_hook)]
    )

    # Baseline corrupted output
    corrupted_logits = model(corrupted_text)

    # Causal effect: difference between patched and corrupted logits
    effect = (patched_logits - corrupted_logits).abs().mean().item()
    return effect


# Build a causal map over all layers/positions
def causal_scan(model, clean_text, corrupted_text):
    """Run activation patching for every layer x position."""
    n_layers = model.cfg.n_layers
    tokens = model.to_tokens(clean_text)
    seq_len = tokens.shape[1]

    results = torch.zeros(n_layers, seq_len)
    for layer in range(n_layers):
        for pos in range(seq_len):
            results[layer, pos] = activation_patching(
                model, clean_text, corrupted_text, layer, pos
            )
    return results
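
The effect above is a mean absolute difference over all logits, which is hard to compare across prompts. A common alternative is to track the logit difference between a correct and an incorrect answer token and rescale it between the corrupted and clean baselines. A minimal sketch, with hypothetical helper names and plain Python numbers so that it runs without a model:

```python
def logit_diff(logits, correct_id: int, wrong_id: int) -> float:
    """Logit difference at the final position of one prompt.

    logits: nested list or array of shape [seq_len][vocab].
    """
    last = logits[-1]
    return float(last[correct_id] - last[wrong_id])


def normalized_patching_effect(clean: float, corrupted: float, patched: float) -> float:
    """Rescale a metric so 0 = corrupted baseline and 1 = clean behavior restored."""
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same metric")
    return (patched - corrupted) / gap


# Toy numbers: the patch recovers three quarters of the clean-vs-corrupted gap.
print(normalized_patching_effect(clean=3.0, corrupted=-1.0, patched=2.0))
```

With this normalization, a value near 1 at some (layer, position) means that patching that single activation is enough to restore the clean answer.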

2. Logit Lens / Tuned Lens

The logit lens projects the residual stream of an intermediate layer through the unembedding matrix to observe which tokens each layer "predicts".

import torch

def logit_lens(model, text, top_k=5):
    """
    Logit lens: visualize per-layer predictions.

    Projects each layer's residual stream through the unembedding to observe
    which tokens the model "predicts" at each stage.
    """
    _, cache = model.run_with_cache(text)

    results = {}
    for layer in range(model.cfg.n_layers):
        # Layer output -> final LayerNorm -> unembedding
        residual = cache["resid_post", layer][:, -1, :]
        normed = model.ln_final(residual)
        logits = normed @ model.W_U

        probs = torch.softmax(logits, dim=-1)
        top_probs, top_indices = probs.topk(top_k)

        results[layer] = [
            (model.to_string(idx.item()), prob.item())
            for idx, prob in zip(top_indices[0], top_probs[0])
        ]

    return results


def tuned_lens(model, text, translators):
    """
    Tuned lens: a logit lens improved with learned affine transformations.

    Applies a learned per-layer translator (translators[layer], a callable on
    residual-stream tensors) to improve the accuracy of the logit lens.
    """
    _, cache = model.run_with_cache(text)

    results = {}
    for layer in range(model.cfg.n_layers):
        residual = cache["resid_post", layer][:, -1, :]
        # Apply the learned transformation
        translated = translators[layer](residual)
        normed = model.ln_final(translated)
        logits = normed @ model.W_U
        results[layer] = logits
    return results
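
The `translators` argument is not defined in the text. In the tuned-lens approach, each translator is trained so that the lens prediction matches the model's final output distribution, typically by minimizing a KL divergence. That objective, which also serves as a per-layer quality metric for the plain logit lens, can be sketched in NumPy as follows (logits are assumed to have been converted to arrays; the function names are illustrative):

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_to_final(layer_logits: np.ndarray, final_logits: np.ndarray) -> np.ndarray:
    """KL(final || layer) per row: how far a lens prediction is from the output."""
    log_p = log_softmax(final_logits)  # model's actual output distribution
    log_q = log_softmax(layer_logits)  # lens prediction at an intermediate layer
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


final = np.array([[2.0, 0.0, -1.0]])
early = np.array([[0.0, 0.0, 0.0]])
print(kl_to_final(early, final), kl_to_final(final, final))
```

Plotting this quantity against layer index shows where the intermediate predictions start to agree with the final output.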

3. Sparse Autoencoders (SAE)

The core tool for undoing superposition by decomposing activations into interpretable, monosemantic features.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """
    Sparse Autoencoder for Mechanistic Interpretability

    Encodes an activation vector x into a higher-dimensional sparse space and
    then reconstructs it. The goal is for each sparse feature to correspond to
    an interpretable concept.

    Architecture:
      encoder: d_model -> n_features (ReLU, sparse)
      decoder: n_features -> d_model (reconstruction)

    Loss:
      L = ||x - x_hat||^2 + lambda * ||features||_1
    """
    def __init__(self, d_model, n_features, sparsity_coef=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features, bias=True)
        self.decoder = nn.Linear(n_features, d_model, bias=False)
        self.sparsity_coef = sparsity_coef

        # Normalize decoder weights: each feature direction (a column of the
        # [d_model, n_features] weight matrix) becomes a unit vector
        with torch.no_grad():
            self.decoder.weight.data = nn.functional.normalize(
                self.decoder.weight.data, dim=0
            )

    def encode(self, x):
        return torch.relu(self.encoder(x))

    def decode(self, features):
        return self.decoder(features)

    def forward(self, x):
        features = self.encode(x)
        reconstruction = self.decode(features)
        return features, reconstruction

    def loss(self, x):
        features, reconstruction = self.forward(x)
        recon_loss = nn.functional.mse_loss(reconstruction, x)
        # L1 penalty: summed over features, averaged over the batch
        sparsity_loss = self.sparsity_coef * features.abs().sum(dim=-1).mean()
        return recon_loss + sparsity_loss, features


# SAE training pipeline
def train_sae(
    model,
    dataloader,
    hook_point="blocks.5.hook_resid_post",
    d_model=768,
    expansion_factor=32,
    lr=3e-4,
    epochs=5
):
    """
    Train an SAE on the activations of a specific layer.

    Parameters:
      model: HookedTransformer model
      dataloader: yields batches of token ids (or strings) accepted by the model
      hook_point: hook name of the layer to extract activations from
      d_model: model dimension
      expansion_factor: number of features = d_model * expansion_factor
      lr: learning rate
      epochs: number of training epochs
    """
    n_features = d_model * expansion_factor
    # Keep the SAE on the same device as the model's activations
    sae = SparseAutoencoder(d_model, n_features).to(model.cfg.device)
    optimizer = torch.optim.Adam(sae.parameters(), lr=lr)

    for epoch in range(epochs):
        total_loss = 0
        for batch in dataloader:
            # Extract activations from the model (cache only the hook we need)
            with torch.no_grad():
                _, cache = model.run_with_cache(batch, names_filter=hook_point)
                activations = cache[hook_point]
                # [batch, seq_len, d_model] -> [batch*seq_len, d_model]
                activations = activations.reshape(-1, d_model)

            loss, features = sae.loss(activations)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Keep decoder feature directions at unit norm
            with torch.no_grad():
                sae.decoder.weight.data = nn.functional.normalize(
                    sae.decoder.weight.data, dim=0
                )

            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss / len(dataloader):.4f}")
    return sae


# Feature analysis
def analyze_features(sae, model, texts, hook_point, top_k=10):
    """Analyze the feature activation patterns of a trained SAE."""
    all_features = []
    for text in texts:
        with torch.no_grad():
            _, cache = model.run_with_cache(text)
            activations = cache[hook_point][:, -1, :]  # last token only
            features = sae.encode(activations)
            all_features.append(features)

    features_matrix = torch.cat(all_features, dim=0)

    # Most frequently active features
    activation_freq = (features_matrix > 0).float().mean(dim=0)
    top_features = activation_freq.topk(top_k)

    return {
        "top_feature_indices": top_features.indices.tolist(),
        "top_feature_frequencies": top_features.values.tolist(),
        "mean_active_features": (features_matrix > 0).sum(dim=1).float().mean().item()
    }
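
Beyond activation frequency, SAEs are usually judged on a few summary numbers: how many features fire per input (L0), how much of the activation variance is left unexplained, and how many features never fire. A minimal NumPy sketch, assuming the activations, reconstructions, and feature activations have already been collected as arrays (the function name and dictionary keys are illustrative):

```python
import numpy as np


def sae_metrics(x: np.ndarray, x_hat: np.ndarray, features: np.ndarray) -> dict:
    """Summary statistics for an SAE on one batch.

    x, x_hat: [n, d_model] activations and their reconstructions.
    features: [n, n_features] non-negative feature activations.
    """
    active = features > 0
    l0 = active.sum(axis=1).mean()               # mean active features per input
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    fvu = resid / total                          # fraction of variance unexplained
    dead = (~active.any(axis=0)).mean()          # features that never fire here
    return {"l0": float(l0), "fvu": float(fvu), "dead_fraction": float(dead)}


x = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]])
feats = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])
print(sae_metrics(x, x, feats))
```

A low L0 with a low unexplained-variance fraction is the target; a large dead fraction suggests the dictionary is bigger than the training run could use.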

4. Probing (Linear Probes)

A lightweight technique for testing whether a given layer encodes a given piece of information.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_probe(model, texts, labels, layer_idx, position=-1):
    """
    Linear probe: test whether information is encoded.

    Trains a linear classifier on the activations of one layer to check
    whether that layer encodes a given property.

    Parameters:
      model: HookedTransformer
      texts: list of input texts
      labels: binary/multi-class labels
      layer_idx: index of the layer to analyze
      position: token position (-1: last token)
    """
    activations = []
    for text in texts:
        with torch.no_grad():
            _, cache = model.run_with_cache(text)
            act = cache["resid_post", layer_idx][:, position, :]
            activations.append(act.cpu().numpy())

    X = np.vstack(activations)
    y = np.array(labels)

    probe = LogisticRegression(max_iter=1000, C=1.0)
    scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")

    return {
        "mean_accuracy": scores.mean(),
        "std_accuracy": scores.std(),
        "per_fold": scores.tolist()
    }


def probe_all_layers(model, texts, labels):
    """Run the probe on every layer to locate where the information lives."""
    results = {}
    for layer in range(model.cfg.n_layers):
        results[layer] = linear_probe(model, texts, labels, layer)
    return results
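
High probe accuracy alone does not show that a layer encodes the property, because a flexible enough probe can fit noise. A common sanity check is to rerun the probe with shuffled labels and compare. The sketch below uses synthetic activations and a simple difference-of-means probe in NumPy instead of the logistic regression above; all names and sizes are illustrative.

```python
import numpy as np


def diff_of_means_probe(X_train, y_train, X_test, y_test) -> float:
    """Fit a one-direction linear probe on binary labels; return test accuracy."""
    mu1 = X_train[y_train == 1].mean(axis=0)
    mu0 = X_train[y_train == 0].mean(axis=0)
    w = mu1 - mu0                     # probe direction
    b = -w @ (mu1 + mu0) / 2          # threshold halfway between the class means
    pred = (X_test @ w + b > 0).astype(int)
    return float((pred == y_test).mean())


rng = np.random.default_rng(0)
n, d = 400, 32
y = rng.integers(0, 2, size=n)

# Synthetic "activations": Gaussian noise plus a label-dependent shift on one axis.
direction = np.zeros(d)
direction[0] = 1.0
X = rng.standard_normal((n, d)) + 3.0 * np.outer(2 * y - 1, direction)

train, test = slice(0, 300), slice(300, n)
acc = diff_of_means_probe(X[train], y[train], X[test], y[test])

# Control: same activations, shuffled labels. A probe should stay near chance.
y_shuf = rng.permutation(y)
control = diff_of_means_probe(X[train], y_shuf[train], X[test], y_shuf[test])

print(f"probe accuracy = {acc:.2f}, shuffled-label control = {control:.2f}")
```

The gap between the two numbers, rather than the raw accuracy, is the evidence that the information is linearly available in the activations.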

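5. Attention Pattern Analysis

Analyzes the attention-weight patterns of each head to classify heads by function.

import torch

def classify_attention_heads(model, input_ids):
    """
    Classify attention heads by function.

    Scores each head's attention pattern against simple templates:
    - previous_token: attends to the immediately preceding token
    - bos_attention: attends to a fixed position (the first token, e.g. BOS)
    - diagonal: attends to the current token itself
    Entropy is reported as a diagnostic but is not used to pick the type.
    Induction and duplicate-token heads need repeated-token inputs and are
    not detected by these scores.
    """
    with torch.no_grad():
        _, cache = model.run_with_cache(input_ids)

    head_types = {}
    for layer in range(model.cfg.n_layers):
        attn = cache["pattern", layer][0]  # [n_heads, seq, seq]
        for head in range(model.cfg.n_heads):
            pattern = attn[head]

            scores = {
                "previous_token": pattern.diagonal(-1).mean().item(),
                "bos_attention": pattern[:, 0].mean().item(),
                "diagonal": pattern.diagonal().mean().item(),
            }
            entropy = -(pattern * (pattern + 1e-10).log()).sum(-1).mean().item()

            # Assign the template with the highest score (entropy is on a
            # different scale, so it is kept out of the comparison)
            head_type = max(scores, key=scores.get)
            head_types[(layer, head)] = {
                "type": head_type,
                "scores": {**scores, "entropy": entropy}
            }

    return head_types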
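
Induction heads are usually detected on a sequence made of a random block of tokens repeated twice: a query in the second copy should attend to the token that followed its earlier occurrence, which puts the attention mass on a single off-diagonal stripe. A minimal NumPy sketch of that score, taking one head's `[seq, seq]` attention pattern as input (the function name is illustrative):

```python
import numpy as np


def induction_score(pattern: np.ndarray, rep_len: int) -> float:
    """Mean attention on the induction stripe.

    pattern: [seq, seq] attention of one head (rows = query positions) on a
             sequence consisting of a block of rep_len tokens repeated twice.
    A query in the second copy should attend to the token right after its
    earlier occurrence, i.e. rep_len - 1 positions back.
    """
    return float(np.diagonal(pattern, offset=-(rep_len - 1)).mean())


# Synthetic head that puts all attention rep_len - 1 positions back.
rep_len = 4
seq = 2 * rep_len
ideal = np.zeros((seq, seq))
for q in range(seq):
    key = q - (rep_len - 1) if q >= rep_len - 1 else 0
    ideal[q, key] = 1.0

print(induction_score(ideal, rep_len))
```

The offset is the same whether or not a BOS token is prepended, since BOS shifts the query and the key position equally.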

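6. Feature Steering (Activation Manipulation)

Manipulates the strength of specific features to change model behavior in a chosen direction.

def steer_with_sae(model, sae, text, feature_idx, strength=3.0, hook_point="blocks.10.hook_resid_post"):
    """
    Steer model behavior through SAE features.

    Amplifies or suppresses the activation of one feature to push the
    model output in the desired direction.

    Parameters:
      feature_idx: index of the SAE feature to manipulate
      strength: positive = amplify, negative = suppress
    """
    def steering_hook(activation, hook):
        features = sae.encode(activation)
        edited = features.clone()
        # Feature activations stay non-negative, as after the ReLU encoder
        edited[:, :, feature_idx] = (edited[:, :, feature_idx] + strength).clamp(min=0)
        # Add only the change in reconstruction, so the part of the activation
        # the SAE fails to reconstruct is left untouched
        return activation + sae.decode(edited) - sae.decode(features)

    steered_logits = model.run_with_hooks(
        text,
        fwd_hooks=[(hook_point, steering_hook)]
    )
    return steered_logits


def contrastive_steering(model, sae, text, amplify_features, suppress_features, strength=2.0, hook_point="blocks.10.hook_resid_post"):
    """
    Contrastive steering: amplify some features while suppressing others.

    Example: amplify a "politeness" feature + suppress an "aggression" feature
    """
    def hook(activation, h):
        features = sae.encode(activation)
        edited = features.clone()
        for idx in amplify_features:
            edited[:, :, idx] += strength
        for idx in suppress_features:
            edited[:, :, idx] = (edited[:, :, idx] - strength).clamp(min=0)
        # Apply only the change in reconstruction (preserves the SAE error term)
        return activation + sae.decode(edited) - sae.decode(features)

    return model.run_with_hooks(text, fwd_hooks=[(hook_point, hook)])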
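
Whether a hook returns the SAE reconstruction or edits the original activation matters, because an SAE never reconstructs its input perfectly. The NumPy sketch below uses a small random, untrained SAE (all weights are made up for illustration) to show that the two approaches differ by exactly the SAE's reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.standard_normal((d_model, n_features)) * 0.1
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions


def encode(x):
    return np.maximum(x @ W_enc, 0.0)  # ReLU encoder without bias


def decode(f):
    return f @ W_dec


x = rng.standard_normal(d_model)
feature_idx, strength = 3, 2.0

# (a) Replace the activation with the edited reconstruction.
f = encode(x)
f[feature_idx] += strength
replaced = decode(f)

# (b) Add the feature's decoder direction to the original activation.
added = x + strength * W_dec[feature_idx]

# The two results differ by exactly what the SAE failed to reconstruct.
sae_error = x - decode(encode(x))
print(np.allclose(added - replaced, sae_error))
```

Option (b) changes the activation only along the chosen feature direction, so any downstream change can be attributed to the steering rather than to reconstruction loss.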

Major Research Results and Timeline

Anthropic

Work	Key content	Year/Venue
Toy Models of Superposition	Mathematical modeling of superposition	2022
Towards Monosemanticity	Early demonstration that SAEs extract interpretable features	2023
Scaling Monosemanticity	Identifies millions of features in Claude 3 Sonnet	2024
Circuit Tracing	Circuit tracing with attribution graphs	2025.03
On the Biology of a Large Language Model	Analysis of the internal mechanisms of Claude 3.5 Haiku	2025.03

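Google DeepMind

Work	Key content	Year
Gemma Scope	Open-source SAE infrastructure for Gemma models	2024
Gemma Scope 2	270M--27B parameters, 110PB of activations, 1T+ SAE parameters	2025.12
Strategic pivot	"Ambitious reverse engineering" -> "pragmatic interpretability"	2025

OpenAI

Work	Key content	Year
Language Models Can Explain Neurons in Language Models	Automatically generates neuron explanations with GPT-4	2023
Extracting Concepts from GPT-4	SAE with 16M latents	2024
Emergent Misalignment detection	Identifies and corrects a "misaligned persona" feature with SAEs	2025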

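Community/Academia

Work	Key content	Year
IOI Circuit	Full analysis of the indirect object circuit in GPT-2	ICLR 2023
Automated Circuit Discovery	ACDC algorithm	NeurIPS 2023
MIB Benchmark	Standard benchmark for MI methods	ICML 2025
Open Problems in MI	Consensus roadmap for the field	2025.01
NP-Hardness of Circuit Finding	Proof of the computational complexity of circuit discovery	ICLR 2025
Stream Algorithm	Linear-time attention analysis (100K tokens, consumer hardware)	2025.10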

Tools and Libraries

TransformerLens

from transformer_lens import HookedTransformer

# Load a model (hook points are added automatically)
model = HookedTransformer.from_pretrained("gpt2-small")

# Run inference while caching activations
logits, cache = model.run_with_cache("Hello, world!")

# Examples of accessing activations
residual = cache["resid_post", 5]    # layer 5 residual stream
attn = cache["pattern", 3]           # layer 3 attention pattern
mlp_out = cache["mlp_out", 7]        # layer 7 MLP output
attn_out = cache["attn_out", 2]      # layer 2 attention output

# Hook-based intervention
def my_hook(activation, hook):
    # Zero out one position; the last one is used here because a short
    # prompt such as "test input" has only a few tokens
    activation[:, -1, :] = 0
    return activation

patched = model.run_with_hooks(
    "test input",
    fwd_hooks=[("blocks.3.hook_resid_post", my_hook)]
)

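SAELens

from sae_lens import SAE, LanguageModelSAERunnerConfig

# SAE training configuration (illustrative: field names and their nesting
# differ between SAELens releases, so check them against the installed version)
config = LanguageModelSAERunnerConfig(
    model_name="gpt2-small",
    hook_name="blocks.5.hook_resid_post",  # called hook_point in early releases
    d_in=768,
    expansion_factor=32,        # 768 * 32 = 24,576 features
    lr=3e-4,
    l1_coefficient=1e-3,
    training_tokens=1_000_000,
)

# Load a pretrained SAE (weights on HuggingFace, features browsable on Neuronpedia).
# from_pretrained takes a release name and an SAE id; some releases return a
# (sae, cfg_dict, sparsity) tuple instead of the SAE alone.
sae = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.6.hook_resid_pre",  # same activation as blocks.5.hook_resid_post
)

# Analyze feature activations
# `activations` is assumed to be a [..., 768] tensor cached at the SAE's hook point
features = sae.encode(activations)
top_features = features.topk(k=10)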

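Neuronpedia

Online tool (neuronpedia.org):

Features:
  - Search and browse trained SAE features
  - Visualize feature activation patterns
  - Read automatically generated feature explanations
  - Compare features across models/layers

Supported models:
  - GPT-2 (Small, Medium, Large)
  - Pythia family
  - Gemma family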

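Current Limitations and Open Problems

1. Lack of definitions
   - No rigorous mathematical definition of a "feature"
   - Activation direction? Concept? Interpretable unit? -- the senses are used interchangeably
   - Different definitions lead to mutually incompatible methodologies

2. Limits of the linear representation hypothesis
   - Csordás et al. (2024): found non-linear "onion" representations
   - Chaotic dynamics arise in deep networks (positive Lyapunov exponents)
   - Mathematical limits on linear interventions (steering)

3. Unsolved problems with SAEs
   - Reconstruction error: 10-40% performance degradation
   - Data dependence: concepts absent from the SAE training data (e.g., "refusal behavior") are missed
   - Feature splitting/absorption: produces features that divide concepts illogically
   - Mechanism vs. activation: SAEs decompose activations but do not explain the computation performed by the weights

4. Scalability
   - Compute cost: the Gemma 2 SAEs alone required 20PB of storage + GPT-3-level compute
   - Applying SAEs in production could double model cost
   - Manual analysis: several hours per prompt

5. Verification (Faithfulness)
   - Hard to confirm that a discovered circuit faithfully reflects the actual computation
   - "Interpretability illusions": convincing but wrong interpretations
   - No ground truth

6. Self-repair (Hydra Effect)
   - When one component is removed, other components compensate
   - Distorts causal attribution

7. The practical gap
   - DeepMind finding: SAEs underperform simple linear probes on practical tasks
   - No demonstrated advantage over existing techniques on safety-relevant tasks
   - "Regulatory Impossibility Theorem": complete interpretation is impossible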

Practical Applications

1. Detecting Emergent Misalignment

import torch

def detect_misalignment(model, sae, text, misalignment_features, threshold=0.5):
    """
    Detect potentially misaligned model behavior using SAE features.

    Inspired by OpenAI (2025) research:
    - Fine-tuning on a narrow task (e.g., writing insecure code)
      induces broad misalignment
    - SAEs can identify a "misaligned persona" feature
    - Correctable with roughly 100 corrective samples

    misalignment_features: dict mapping a feature name to its SAE feature index
    """
    with torch.no_grad():
        _, cache = model.run_with_cache(text)
        activations = cache["blocks.10.hook_resid_post"][:, -1, :]
        features = sae.encode(activations)

    # Check the activation of misalignment-related features
    risk_scores = {}
    for name, idx in misalignment_features.items():
        score = features[0, idx].item()
        risk_scores[name] = score

    max_risk = max(risk_scores.values())
    is_risky = max_risk > threshold

    return {
        "is_risky": is_risky,
        "risk_scores": risk_scores,
        "max_risk": max_risk
    }

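2. Model Debugging

def debug_prediction(model, text, expected_token, actual_token):
    """
    Layer-by-layer analysis of a wrong prediction.

    Decomposes the residual stream into per-layer contributions to identify
    the layer at which the prediction turns away from the expected token.
    Assumes the TransformerLens default of folded LayerNorm weights, so the
    final LayerNorm only centers and rescales.
    """
    _, cache = model.run_with_cache(text)

    expected_idx = model.to_single_token(expected_token)
    actual_idx = model.to_single_token(actual_token)

    # Scale used by the final LayerNorm at the last position, shape [batch, 1]
    ln_scale = cache["ln_final.hook_scale"][:, -1, :]

    contributions = []
    for layer in range(model.cfg.n_layers):
        # Layer contribution = resid_post - resid_pre
        layer_contrib = cache["resid_post", layer][:, -1, :] - \
                        cache["resid_pre", layer][:, -1, :]

        # Logit contribution for each token: apply the final LayerNorm as a
        # linear map (center, then divide by the frozen scale) so that the
        # per-layer contributions add up to the final logits
        centered = layer_contrib - layer_contrib.mean(dim=-1, keepdim=True)
        logit_contrib = (centered / ln_scale) @ model.W_U

        contributions.append({
            "layer": layer,
            "expected_logit": logit_contrib[0, expected_idx].item(),
            "actual_logit": logit_contrib[0, actual_idx].item(),
            "diff": (logit_contrib[0, actual_idx] - logit_contrib[0, expected_idx]).item()
        })

    return contributions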

References

Key Papers

Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits." Distill.
Elhage, N. et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic.
Elhage, N. et al. (2022). "Toy Models of Superposition." Anthropic.
Olsson, C. et al. (2022). "In-context Learning and Induction Heads." Anthropic.
Wang, K. et al. (2023). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." ICLR 2023.
Bricken, T. et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic.
Conmy, A. et al. (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability." NeurIPS 2023.
Templeton, A. et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic.
Bereska, L. & Gavves, E. (2024). "Mechanistic Interpretability for AI Safety -- A Review." arXiv:2404.14082.

2025--2026 Research

Sharkey, L. et al. (2025). "Open Problems in Mechanistic Interpretability." arXiv preprint.
Anthropic (2025). "Circuit Tracing: Revealing Computational Graphs in Language Models." transformer-circuits.pub.
Anthropic (2025). "On the Biology of a Large Language Model." transformer-circuits.pub.
Marks, S. et al. (2025). "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models." ICLR 2025.
Geiger, A. et al. (2025). "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability." JMLR 2025.

Tools

TransformerLens: https://github.com/TransformerLensOrg/TransformerLens
SAELens: https://github.com/jbloomAus/SAELens
Neuronpedia: https://neuronpedia.org
Gemma Scope: https://github.com/google-deepmind/gemma-scope