Representation Engineering

Metadata

Item	Content
Category	Interpretability / AI Safety / LLM Control
Key papers	"Representation Engineering: A Top-Down Approach to AI Transparency" (Zou et al., 2023), "Activation Addition: Steering Language Models Without Optimization" (Turner et al., 2023), "Steering Llama 2 via Contrastive Activation Addition" (Rimsky et al., 2024), "Analyzing the Generalization and Reliability of Steering Vectors" (Tan et al., NeurIPS 2024)
Key authors	Andy Zou, Dan Hendrycks (RepE); Alex Turner (ActAdd); Nina Rimsky (CAA)
Core idea	A top-down approach that reads and manipulates a neural network's internal representations (activations) to interpret and control model behavior
Related fields	Mechanistic Interpretability, Probing, Sparse Autoencoders, AI Alignment

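Definition

Representation Engineering (RepE) is a family of techniques that identify high-level cognitive concepts in the internal activation space of a large language model (LLM) and then read or control them. Unlike the bottom-up approach of mechanistic interpretability, which analyzes individual neurons and circuits, RepE is a top-down approach that focuses on population-level representations.

Bottom-Up vs Top-Down Interpretability:

Bottom-Up (Mechanistic Interpretability):
  Individual neurons --> circuits --> functional mapping
  Strength: fine-grained understanding of mechanisms
  Weakness: limited scalability to large models

Top-Down (Representation Engineering):
  High-level concept --> direction in activation space --> reading/control
  Strength: scalable, practical control
  Weakness: focuses on "what" is represented rather than "why" the mechanism works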

Core Structure: Representation Reading + Control

RepE has two main components.

Representation Engineering framework:

+------------------------------------------------------------+
|  Representation Reading                                    |
|  "Read what the model is thinking"                         |
|                                                            |
|  Input --> forward pass --> extract mid-layer activations  |
|  --> identify the concept direction                        |
|                                                            |
|  Methods: Linear Probing, PCA, Contrastive Pairs           |
+------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------+
|  Representation Control                                    |
|  "Steer the model's behavior"                              |
|                                                            |
|  Add or subtract a concept vector at inference time        |
|  --> change the model output (no fine-tuning)              |
|                                                            |
|  Methods: Activation Addition, Steering Vectors, CAA       |
+------------------------------------------------------------+

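Representation Reading

Linear Representation Hypothesis

The theoretical foundation of RepE is the Linear Representation Hypothesis: the claim that high-level concepts are encoded as linear directions in activation space.

Linear representation hypothesis:

For a concept C in the activation space R^d:
  there exists a direction vector v_C in R^d such that
  <h, v_C> > threshold  <==>  the input expresses concept C

Example:
  "honesty" concept --> v_honesty
  h = the model's activation at layer l
  score = h . v_honesty
  high score --> the model is producing an "honest" response
  low score  --> the model is producing a deceptive response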
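
The reading rule above can be sketched on synthetic data. This is only an illustration under stated assumptions, not the procedure from Zou et al.: the activations, the direction `v_honesty`, the hidden size, and the midpoint threshold are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# Hypothetical unit-norm concept direction
v_honesty = rng.normal(size=d)
v_honesty /= np.linalg.norm(v_honesty)

def make_activations(n, shift):
    """Synthetic activations: isotropic noise plus a shift along v_honesty."""
    return rng.normal(size=(n, d)) + shift * v_honesty

h_honest = make_activations(200, shift=+3.0)
h_deceptive = make_activations(200, shift=-3.0)

def concept_score(h, v):
    """score = <h, v> for each row of h."""
    return h @ v

# Threshold halfway between the two class means (an assumption for this toy)
threshold = 0.5 * (concept_score(h_honest, v_honesty).mean()
                   + concept_score(h_deceptive, v_honesty).mean())

pred_honest = concept_score(h_honest, v_honesty) > threshold
pred_deceptive = concept_score(h_deceptive, v_honesty) > threshold
accuracy = 0.5 * (pred_honest.mean() + (~pred_deceptive).mean())
print(f"threshold={threshold:.2f}, accuracy={accuracy:.3f}")
```

In a real model the direction is not known in advance; it has to be estimated from data, which is what the extraction methods in the next section do.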

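Extracting Concept Vectors

1. Contrastive Pair Method (most common):

   Positive input: "I will always tell the truth"    --> h+
   Negative input: "I will lie whenever convenient"  --> h-

   concept_vector = mean(h+) - mean(h-)

   Over N pairs:
   V = (1/N) * sum_{i=1}^{N} [h_i^+ - h_i^-]

2. PCA-based:

   Collect positive/negative activations and run PCA
   First principal component (PC1) = concept direction

   Strength: needs no labels; finds the direction of greatest variance
   Weakness: the top-variance direction is not guaranteed to be the concept, and several concepts can be mixed into PC1

3. Linear Probing:

   Input: activation h; target: concept label y
   Logistic regression: y = sigma(w^T h + b)
   Learned w = concept direction

   Strength: yields a classifier directly
   Weakness: risk of overfitting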
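
The three extraction methods can be compared side by side on toy data. The sketch below assumes synthetic paired activations with a known ground-truth direction (`true_dir`), so the recovered directions can be checked; the sizes, noise levels, learning rate, and step count are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 300  # hypothetical hidden size and number of contrastive pairs

true_dir = np.zeros(d)
true_dir[0] = 1.0  # ground-truth concept direction for the toy data

# Synthetic paired activations: shared per-pair content plus a concept shift
base = rng.normal(size=(n, d))
h_pos = base + 2.0 * true_dir + 0.3 * rng.normal(size=(n, d))
h_neg = base - 2.0 * true_dir + 0.3 * rng.normal(size=(n, d))

def unit(v):
    return v / np.linalg.norm(v)

# 1. Contrastive pairs: mean of the per-pair differences
v_diff = unit((h_pos - h_neg).mean(axis=0))

# 2. PCA: first principal component of the pooled, centered activations
pooled = np.vstack([h_pos, h_neg])
pooled = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled, full_matrices=False)
v_pca = vt[0]
if v_pca @ v_diff < 0:  # the sign of a principal component is arbitrary
    v_pca = -v_pca

# 3. Linear probe: logistic regression fitted by plain gradient descent
X = np.vstack([h_pos, h_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * (p - y).mean()
v_probe = unit(w)

for name, v in [("diff-of-means", v_diff), ("pca", v_pca), ("probe", v_probe)]:
    print(f"{name:14s} cosine with true direction: {v @ true_dir:.3f}")
```

PCA recovers the concept here only because the concept shift was made the largest source of variance in the toy data; when other factors dominate the variance, PC1 can point elsewhere.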

Readable Concepts (Zou et al., 2023 experiments)

+-------------------+------------------------------------------+
| Category          | Detectable concepts                      |
+-------------------+------------------------------------------+
| Safety            | Honesty, harmfulness, bias               |
| Cognitive         | Uncertainty, knowledge, reasoning steps  |
| Behavioral        | Power-seeking, obedience, sycophancy     |
| Emotional         | Emotional tone, empathy, anger           |
| Factual           | Fact vs. hallucination, confidence       |
+-------------------+------------------------------------------+

Detection accuracy (Llama-2 13B, per Zou et al.):
  Honesty:       AUROC ~0.95
  Harmfulness:   AUROC ~0.92
  Power-seeking: AUROC ~0.88
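
For reference, AUROC for a reading direction is computed from the projection scores h . v_C of labeled examples. The sketch below uses the pairwise (Mann-Whitney) definition; the scores are hypothetical and are not taken from any paper.

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half, which matches the usual Mann-Whitney formulation.
    """
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

# Hypothetical projection scores h . v_C for labeled examples
honest_scores = [2.1, 1.4, 0.9, 0.3]
deceptive_scores = [0.5, -0.2, -1.0, -1.7]

print(auroc(honest_scores, deceptive_scores))  # 0.9375 (15 of 16 pairs ranked correctly)
```

An AUROC of 0.5 means the direction carries no information about the label, and 1.0 means every positive example scores above every negative one.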

Representation Control

Activation Addition (ActAdd)

The simplest control method, proposed by Turner et al. (2023).

Activation Addition:

Standard inference:
  x --> Layer_1 --> ... --> Layer_l: h_l --> ... --> Layer_L --> output

Steered inference:
  x --> Layer_1 --> ... --> Layer_l: h_l + alpha * v_C --> ... --> output

alpha: steering coefficient
  alpha > 0: amplify concept C (e.g., more honest)
  alpha < 0: suppress concept C (e.g., less honest)
  |alpha|: strength of the effect (too large and coherence collapses)

Choosing the layer:
  Middle layers (L/3 to 2L/3) are usually the most effective
  Early layers: token-level changes
  Late layers: directly distort the output distribution
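
The steered forward pass can be illustrated on a toy residual network. This is a sketch under assumptions: the network, its random weights, and the vector `v_c` are invented stand-ins, and only the step `h + alpha * v_c` corresponds to the method described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 6  # hypothetical toy sizes

# Toy residual network: h <- h + tanh(W h), one W per layer
weights = [0.05 * rng.normal(size=(d, d)) for _ in range(n_layers)]
v_c = rng.normal(size=d)
v_c /= np.linalg.norm(v_c)  # hypothetical unit-norm concept vector

def forward(x, steer_layer=None, alpha=0.0):
    """Run the toy network, adding alpha * v_c to the output of one layer."""
    h = x
    for i, w in enumerate(weights):
        h = h + np.tanh(w @ h)
        if i == steer_layer:
            h = h + alpha * v_c  # the activation-addition step
    return h

x = rng.normal(size=d)
baseline = forward(x)
shifts = {}
for alpha in (-4.0, 0.0, 4.0):
    steered = forward(x, steer_layer=2, alpha=alpha)
    shifts[alpha] = float((steered - baseline) @ v_c)
    print(f"alpha={alpha:+.1f}  shift along v_c at the output: {shifts[alpha]:+.3f}")
```

The sign of alpha sets the direction of the shift. The later layers transform the perturbation, so the shift measured at the output is generally not exactly alpha.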

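Contrastive Activation Addition (CAA)

A method systematized by Rimsky et al. (2024) that uses steering vectors extracted from contrastive pairs.

CAA pipeline:

1. Build the dataset (N contrastive pairs):
   {(prompt_i, response_i^+, response_i^-)}_{i=1}^N

2. Extract the steering vector:
   For each pair, compute the activation difference at layer l
   v_l = (1/N) * sum_{i=1}^{N} [h_l(prompt_i, response_i^+) - h_l(prompt_i, response_i^-)]

3. Apply at inference time:
   h_l' = h_l + alpha * v_l / ||v_l||

Limitations (Tan et al., NeurIPS 2024):
  - Unstable out-of-distribution generalization
  - Interference when steering several behaviors at once
  - Sensitive to alpha: the usable range is narrow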
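
Steps 2 and 3 of the pipeline can be sketched as follows. The arrays `acts_pos` and `acts_neg` are synthetic stand-ins for activations recorded from a model, and all sizes are hypothetical; only the mean-difference and normalized-addition formulas come from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_layers, d = 50, 4, 24  # hypothetical sizes

# Stand-ins for h_l(prompt_i, response_i^+) and h_l(prompt_i, response_i^-),
# shape (n_pairs, n_layers, d). A real pipeline would record these from a model.
behavior_dir = rng.normal(size=(n_layers, d))
shared = rng.normal(size=(n_pairs, n_layers, d))
acts_pos = shared + 0.5 * behavior_dir + 0.1 * rng.normal(size=shared.shape)
acts_neg = shared - 0.5 * behavior_dir + 0.1 * rng.normal(size=shared.shape)

# Step 2: the mean difference per layer gives one steering vector v_l per layer
steering = (acts_pos - acts_neg).mean(axis=0)  # shape (n_layers, d)

def apply_caa(h_l, layer, alpha):
    """Step 3: h_l' = h_l + alpha * v_l / ||v_l||."""
    v = steering[layer]
    return h_l + alpha * v / np.linalg.norm(v)

h = rng.normal(size=d)
h_steered = apply_caa(h, layer=2, alpha=3.0)
# Because v_l is normalized, the size of the edit is |alpha| up to rounding
print(np.linalg.norm(h_steered - h))
```

Normalizing v_l makes alpha comparable across layers and behaviors, but a suitable value still depends on the typical activation norm at the chosen layer.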

Advanced Control Methods

Method comparison:

+---------------------------+------------+-----------+---------------+
| Method                    | Complexity | Precision | Multi-concept |
+---------------------------+------------+-----------+---------------+
| Activation Addition       | Low        | Medium    | Limited       |
| CAA                       | Medium     | High      | Interference  |
| Conceptor-based Steering  | High       | High      | Supported     |
| Affine Steering           | High       | Highest   | Supported     |
| Representation Finetuning | High       | High      | Supported     |
+---------------------------+------------+-----------+---------------+

Conceptor-based Steering (ICLR 2025):
  - Multiple concepts can be composed with Boolean operations
  - NOT, AND, and OR remove or combine concepts
  - Finer control than purely additive methods
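
The Boolean operations can be illustrated with the standard conceptor definitions (a conceptor is a soft projection matrix built from an activation correlation matrix). This is a sketch assuming those textbook formulas; it is not claimed to match the exact steering procedure of the cited work, and the aperture value, sizes, and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden size

def correlation(acts):
    """Uncentered correlation matrix R = X^T X / n of activations (n, d)."""
    return acts.T @ acts / len(acts)

def conceptor(R, aperture=10.0):
    """C = R (R + aperture^-2 I)^-1, a soft projection onto the data subspace."""
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(len(R)))

def c_not(C):
    return np.eye(len(C)) - C

def c_and(C1, C2):
    # Valid when C1 and C2 are invertible, which holds for full-rank R
    return np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2) - np.eye(len(C1)))

def c_or(C1, C2):
    # De Morgan: C1 OR C2 = NOT(NOT C1 AND NOT C2)
    return c_not(c_and(c_not(C1), c_not(C2)))

# Synthetic activations for two hypothetical concepts with different scales
scale_a = np.linspace(2.0, 0.1, d)
scale_b = np.linspace(0.1, 2.0, d)
R_a = correlation(rng.normal(size=(500, d)) * scale_a)
R_b = correlation(rng.normal(size=(500, d)) * scale_b)
C_a, C_b = conceptor(R_a), conceptor(R_b)

h = rng.normal(size=d)
h_either = c_or(C_a, C_b) @ h  # keep what either concept's subspace retains
h_not_a = c_not(C_a) @ h       # suppress the subspace associated with concept A
print(np.linalg.norm(h), np.linalg.norm(h_either), np.linalg.norm(h_not_a))
```

With these definitions, OR of two conceptors equals the conceptor of the summed correlation matrices, which is one way to sanity-check an implementation.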

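Affine Steering:
  Apply an affine transformation to the activation: h' = Ah + b
  Generalizes plain addition (h + v), which is the case A = I, b = v
  A and b are learned by optimization --> finer-grained control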
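
A small sketch of the affine form, assuming a least-squares objective as the "optimization" (the source does not say which objective is used). The target activations `h_tgt` are hypothetical and constructed so that the true map is known.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 400  # hypothetical sizes

v = np.zeros(d)
v[0] = 1.0
h_src = rng.normal(size=(n, d))  # activations to steer
h_tgt = 0.5 * h_src + 2.0 * v    # hypothetical desired activations

# Fit h_tgt ~= A h_src + b by least squares (a column of ones carries the bias)
X = np.hstack([h_src, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, h_tgt, rcond=None)  # shape (d + 1, d)
A, b = coef[:d].T, coef[d]

def affine_steer(h, A, b):
    """h' = A h + b."""
    return A @ h + b

h = rng.normal(size=d)
# Plain addition is the special case A = I, b = alpha * v
added = affine_steer(h, np.eye(d), 1.5 * v)
print(np.allclose(added, h + 1.5 * v))
# The fitted map reproduces the transformation used to build h_tgt
print(np.allclose(affine_steer(h, A, b), 0.5 * h + 2.0 * v))
```

Unlike addition, the matrix A can rescale or remove existing components of h, which is where the extra control comes from.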

Use Cases

1. AI Safety: Eliciting Honest Responses

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

# 1. Extract an honesty vector from contrastive pairs
honest_prompts = [
    "Tell me the truth about",
    "Honestly speaking,",
    "The factual answer is",
]
deceptive_prompts = [
    "Let me make something up about",
    "I'll pretend that",
    "A convincing lie would be",
]

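def get_activations(prompts, layer_idx=15):
    """Return the mean last-token activation at one layer, shape (hidden_size,)."""
    activations = []

    def hook_fn(module, inputs, output):
        # Decoder layers return a tuple in some transformers versions and a
        # bare tensor in others; the hidden states come first either way.
        hidden = output[0] if isinstance(output, tuple) else output
        activations.append(hidden[:, -1, :].detach())

    hook = model.model.layers[layer_idx].register_forward_hook(hook_fn)
    try:
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                model(**inputs)
    finally:
        hook.remove()  # always detach the hook, even if a forward pass fails

    # Each entry is (1, hidden_size); concatenate and average over prompts
    return torch.cat(activations, dim=0).mean(dim=0)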

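# 2. Compute the steering vector
h_honest = get_activations(honest_prompts)
h_deceptive = get_activations(deceptive_prompts)
steering_vector = h_honest - h_deceptive
steering_vector = steering_vector / steering_vector.norm()

# 3. Activation Addition at inference time
# The vector is unit-norm, so alpha is in absolute activation units and has
# to be tuned against the typical activation norm at this layer.
alpha = 1.5  # steering strength

def steering_hook(module, inputs, output):
    # Build a new output instead of mutating in place; handle tuple or tensor
    hidden = output[0] if isinstance(output, tuple) else output
    vec = steering_vector.to(device=hidden.device, dtype=hidden.dtype)
    hidden = hidden + alpha * vec
    if isinstance(output, tuple):
        return (hidden,) + tuple(output[1:])
    return hidden

hook = model.model.layers[15].register_forward_hook(steering_hook)
try:
    # Steered generation
    prompt = "What are the side effects of this supplement?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
finally:
    hook.remove()  # restore the unmodified model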

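2. Hallucination Detection

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_activation(model, text, layer_idx):
    """Last-token hidden state at the output of decoder layer `layer_idx`.

    Helper assumed by the functions below; it relies on the `tokenizer`
    loaded in the previous example.
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer i is at index i + 1
    return out.hidden_states[layer_idx + 1][0, -1, :].float().cpu()

# Train a factual/hallucinated classifier on activations from one layer
def build_hallucination_detector(model, factual_data, hallucinated_data, layer_idx=20):
    """
    factual_data: [(prompt, factual_response), ...]
    hallucinated_data: [(prompt, hallucinated_response), ...]
    """
    X, y = [], []

    for prompt, response in factual_data:
        h = extract_activation(model, prompt + response, layer_idx)
        X.append(h.numpy())
        y.append(0)  # factual

    for prompt, response in hallucinated_data:
        h = extract_activation(model, prompt + response, layer_idx)
        X.append(h.numpy())
        y.append(1)  # hallucinated

    X = np.array(X)
    y = np.array(y)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)

    # clf.coef_[0] is the learned hallucination direction
    return clf

# Estimate the hallucination probability at inference time
def detect_hallucination(model, clf, prompt, response, layer_idx=20):
    h = extract_activation(model, prompt + response, layer_idx)
    prob = clf.predict_proba(h.numpy().reshape(1, -1))[0, 1]
    return prob  # probability that the response is hallucinated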

3. Sentiment/Tone Control (Sentiment Steering)

# Sentiment steering vector (positive - negative)
positive_prompts = ["I'm happy because", "Great news:", "I love"]
negative_prompts = ["I'm sad because", "Bad news:", "I hate"]

h_pos = get_activations(positive_prompts, layer_idx=15)
h_neg = get_activations(negative_prompts, layer_idx=15)

sentiment_vector = h_pos - h_neg
sentiment_vector = sentiment_vector / sentiment_vector.norm()

# alpha > 0: strengthens a positive tone
# alpha < 0: strengthens a negative tone
# The same model can then produce responses in different tones

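Representation Reading in Depth: Multi-Layer Analysis

Concept encoding by layer (empirical observations; layer indices assume a 32-layer model):

Layer position     | Information encoded                   | RepE use
------------------ | ------------------------------------- | ---------------------------------
Early (1-8)        | Token embeddings, syntactic structure | Detecting linguistic features
Middle (9-20)      | Semantics, factual knowledge          | Separating fact from hallucination
Mid-late (21-28)   | Reasoning, intent, emotion            | Steering safety and honesty
Final (29-32)      | Output token selection                | Direct control (unstable)

Strategy for choosing the layer:
  1. Measure probing accuracy at every layer
  2. Identify the range of layers where accuracy peaks
  3. Pick the middle layer of that range as the steering target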
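
The three-step strategy can be sketched as below. Everything here is an assumption made for illustration: the per-layer activations are synthetic, a difference-of-means classifier stands in for a trained probe, and the 0.02 tolerance that defines the "peak range" is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n, d = 12, 400, 16  # hypothetical sizes

# Synthetic setup in which class separation peaks in the middle of the network
separation = np.exp(-0.5 * ((np.arange(n_layers) - 6) / 2.0) ** 2)
labels = rng.integers(0, 2, size=n)
direction = np.zeros(d)
direction[0] = 1.0

def probe_accuracy(acts, y):
    """Held-out accuracy of a difference-of-means probe (a cheap linear probe)."""
    half = len(y) // 2
    tr, te = slice(0, half), slice(half, None)
    mu1 = acts[tr][y[tr] == 1].mean(axis=0)
    mu0 = acts[tr][y[tr] == 0].mean(axis=0)
    w, mid = mu1 - mu0, 0.5 * (mu1 + mu0)
    pred = ((acts[te] - mid) @ w > 0).astype(int)
    return float((pred == y[te]).mean())

# Step 1: probing accuracy at every layer
accs = []
for layer in range(n_layers):
    signal = 2.0 * separation[layer] * np.outer(2 * labels - 1, direction)
    acts = rng.normal(size=(n, d)) + signal
    accs.append(probe_accuracy(acts, labels))
accs = np.array(accs)

# Steps 2-3: layers within 0.02 of the best accuracy, then the middle one
good = np.flatnonzero(accs >= accs.max() - 0.02)
target_layer = int(good[len(good) // 2])
print(np.round(accs, 2), "steering target:", target_layer)
```

High probing accuracy shows that a concept is readable at a layer, not that steering there will work, so the selected layer is a starting point to validate by generation.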

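Theoretical Background

Why Do Linear Directions Work?

Hypothesis 1: Superposition Hypothesis (Elhage et al., 2022)
  - The model encodes more features than it has dimensions, along nearly orthogonal directions
  - Each concept corresponds to a linear direction
  - Because the number of concepts far exceeds the number of dimensions, the directions interfere

Hypothesis 2: Linear Representation from Training Dynamics
  - SGD-based training naturally induces linear structure
  - Despite nonlinear activation functions, representations across layers are close to linear

Empirical evidence:
  - "king - man + woman ≈ queen", known since the Word2Vec era
  - Linearity tends to strengthen in larger models (scaling)
  - Reproduced across a range of model architectures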
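
The geometric intuition behind Hypothesis 1 can be checked numerically: in high dimension, many more random directions than dimensions can coexist with only small pairwise overlap. The sketch below uses random unit vectors as a stand-in for learned features, which is an assumption; it is not a reproduction of the toy models in Elhage et al.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 2048  # more features than dimensions

# Random unit vectors are nearly orthogonal in high dimension
features = rng.normal(size=(n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

cos = features @ features.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.3f}, max |cos| = {off_diag.max():.3f}")

# Encode a sparse set of active features as a single activation vector
active = rng.choice(n_features, size=5, replace=False)
h = features[active].sum(axis=0)

# Linear readout: active features score near 1, inactive ones near 0,
# and the deviations from those values are the interference terms
readout = features @ h
recovered = np.argsort(readout)[-5:]
print(sorted(active.tolist()), sorted(recovered.tolist()))
```

Recovery works here because only a few features are active at once; as more features become active together, the interference terms grow and linear readout degrades.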

Relationship to Mechanistic Interpretability

+----------------------------------------------------+
| Mechanistic Interpretability                       |
| (Bottom-Up)                                        |
|                                                    |
| Neurons --> circuits --> functions                 |
| Microscopic, precise, hard to scale                |
| "Why does it work this way?"                       |
+----------------------------------------------------+
          |  complementary
          v
+----------------------------------------------------+
| Representation Engineering                         |
| (Top-Down)                                         |
|                                                    |
| Concept --> activation direction --> read/control  |
| Macroscopic, practical, scalable                   |
| "What is represented, and how to control it?"      |
+----------------------------------------------------+
          |
          v
+----------------------------------------------------+
| Sparse Autoencoders (Bridge)                       |
|                                                    |
| Decompose activations into interpretable features  |
| Link RepE concept vectors to SAE features          |
| Bridge between the two approaches                  |
+----------------------------------------------------+

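Limitations and Open Problems

1. Limited generalization:
   - Steering vectors learned in-distribution are
     unstable in OOD settings (Tan et al., NeurIPS 2024)
   - Transfer across domains and tasks is not guaranteed

2. Multi-concept interference:
   - Steering "honesty + harmlessness" together causes interference between the vectors
   - Partially addressed by conceptor-based approaches

3. Choosing alpha:
   - Too small has little effect; too large collapses coherence
   - The best alpha depends on the input and context
   - No established mechanism for tuning alpha automatically

4. Causation vs correlation:
   - Unclear whether an activation direction causally determines the concept
   - Correlated concepts are hard to disentangle

5. Dependence on model size:
   - Linear structure tends to be weaker in small models (<7B)
   - More effective in larger models (scaling)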
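
The interference in item 2 can be made concrete by measuring the overlap between two steering vectors. The sketch below uses hypothetical vectors and shows one simple mitigation, projecting out the shared component; this orthogonalization step is an illustrative choice and is not a method taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

def unit(v):
    return v / np.linalg.norm(v)

# Two hypothetical steering vectors that share a common component
common = rng.normal(size=d)
v_honesty = unit(rng.normal(size=d) + 0.8 * common)
v_harmless = unit(rng.normal(size=d) + 0.8 * common)
cos_before = float(v_honesty @ v_harmless)
print(f"cosine before: {cos_before:.3f}")

# Remove from v_harmless its component along v_honesty (one Gram-Schmidt step)
v_harmless_orth = unit(v_harmless - (v_harmless @ v_honesty) * v_honesty)
print(f"cosine after:  {v_honesty @ v_harmless_orth:.3f}")

# Steering along the orthogonalized vector leaves the honesty score unchanged
h = rng.normal(size=d)
alpha = 4.0
delta_raw = (h + alpha * v_harmless) @ v_honesty - h @ v_honesty
delta_orth = (h + alpha * v_harmless_orth) @ v_honesty - h @ v_honesty
print(f"honesty shift: raw={delta_raw:+.3f}, orthogonalized={delta_orth:+.3f}")
```

Orthogonalizing removes the linear overlap at the steered layer only. It does not settle whether the two concepts are causally separable, which is the open question in item 4.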

References

Zou, A. et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv:2310.01405.
Turner, A. et al. (2023). "Activation Addition: Steering Language Models Without Optimization." arXiv:2308.10248.
Tan, D. et al. (2024). "Analyzing the Generalization and Reliability of Steering Vectors." NeurIPS 2024.
Rimsky, N. et al. (2024). "Steering Llama 2 via Contrastive Activation Addition." arXiv:2312.06681.
Wehner, J. et al. (2025). "Taxonomy, Opportunities, and Challenges of Representation Engineering." arXiv:2502.17601.
Park, K. et al. (2024). "The Linear Representation Hypothesis and the Geometry of Large Language Models." arXiv:2311.03658.
Elhage, N. et al. (2022). "Toy Models of Superposition." Anthropic.
Templeton, A. et al. (2024). "Scaling Monosemanticity." Anthropic.