Weak-to-Strong Generalization

Meta Information

Item	Details
Paper	Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Authors	Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
Affiliation	OpenAI Superalignment Team
Venue	ICML 2024
arXiv	2312.09390
Code	github.com/openai/weak-to-strong
Area	AI Alignment, Superalignment, Weak Supervision

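Overview

Weak-to-strong generalization is the phenomenon in which a strong model, trained on labels produced by a weak model (the weak supervisor), ends up outperforming that supervisor. It offers an empirical line of attack on a central problem in aligning future superintelligent AI: how can humans supervise an AI that is far smarter than they are?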

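Key Insight

Today's RLHF (Reinforcement Learning from Human Feedback) relies on the assumption that humans can evaluate model behavior. A superintelligent model, however, will exhibit complex behavior that humans struggle to evaluate. For example, if such a model generates millions of lines of intricate code, humans cannot readily judge whether that code is safe.

This work studies the "human -> superintelligent AI" supervision problem by analogy, using "weak model -> strong model" supervision as a stand-in.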

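Problem Definition

Superalignment Challenge

[Today]   Human -> AI model supervision
          (the human is smarter)

[Future]  Human -> superintelligent AI supervision
          (the AI is far smarter)

[Analogy] Weak model -> strong model supervision
          (the strong model is more capable)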

Key Questions

Can a weak supervisor elicit the full capability of a strong model?
Will the strong model simply imitate the weak supervisor's errors, or will it generalize from its own knowledge?

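Methodology

Experimental Framework

[1] Create the weak supervisor
    - Finetune a small pretrained model on ground truth
    - Generate weak labels with it

[2] Train the strong student
    - Finetune a large pretrained model on the weak labels
    - Measure weak-to-strong performance

[3] Measure the strong ceiling
    - Finetune the large pretrained model on ground truth
    - Serves as the ideal-performance baseline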
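
The three steps above can be sketched as a small, model-agnostic harness. This is a minimal illustration under stated assumptions, not the paper's code: `run_protocol`, `accuracy`, and the trainer callables are hypothetical names, models are plain Python callables, and the choice to generate weak labels on a separate student split is an assumption of this sketch.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical types for this sketch: a model maps an input to a label,
# and a trainer maps (inputs, labels) to a fitted model.
Model = Callable[[object], int]
Trainer = Callable[[Sequence[object], Sequence[int]], Model]
Split = Tuple[Sequence[object], Sequence[int]]


def accuracy(model: Model, xs: Sequence[object], ys: Sequence[int]) -> float:
    """Fraction of inputs on which the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def run_protocol(
    train_weak: Trainer,
    train_strong: Trainer,
    weak_train: Split,
    student_train: Split,
    test: Split,
) -> Dict[str, float]:
    """Run steps [1]-[3] and report test accuracy for each model."""
    (xa, ya), (xb, yb), (xt, yt) = weak_train, student_train, test

    # [1] Weak supervisor: fit on ground truth, then label the student split.
    weak = train_weak(xa, ya)
    weak_labels = [weak(x) for x in xb]

    # [2] Strong student: fit on weak labels only (ground truth yb is unused).
    student = train_strong(xb, weak_labels)

    # [3] Strong ceiling: same trainer and inputs, but ground-truth labels.
    ceiling = train_strong(xb, yb)

    return {
        "weak": accuracy(weak, xt, yt),
        "weak_to_strong": accuracy(student, xt, yt),
        "strong_ceiling": accuracy(ceiling, xt, yt),
    }
```

The three returned accuracies are exactly the quantities needed to compute how much of the weak-to-ceiling gap the student recovers.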

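Key Metric: Performance Gap Recovered (PGR)

PGR = (weak-to-strong accuracy - weak accuracy) / (strong ceiling accuracy - weak accuracy)

PGR = 0%: the strong model does no better than the weak supervisor (it merely imitates it)
PGR = 100%: the strong model recovers its full capability (it matches the strong ceiling)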
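
A worked example makes the metric concrete. The helper below is a hypothetical `raw_pgr` that applies the formula above without clamping; the accuracies are made-up illustrative numbers, not results from the paper.

```python
def raw_pgr(weak_acc: float, w2s_acc: float, strong_ceiling_acc: float) -> float:
    """PGR as a fraction, exactly as in the formula above (no clamping)."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError(
            "PGR is undefined unless the strong ceiling beats the weak supervisor"
        )
    return (w2s_acc - weak_acc) / gap


# Hypothetical accuracies: weak supervisor 60%, strong ceiling 90%.
print(f"{raw_pgr(0.60, 0.75, 0.90):.0%}")  # student recovers part of the gap
print(f"{raw_pgr(0.60, 0.60, 0.90):.0%}")  # student merely matches the supervisor
print(f"{raw_pgr(0.60, 0.57, 0.90):.0%}")  # student ends up below the supervisor
```

A negative value means the student did worse than its supervisor. Whether to clamp such values into the 0-100% range is a reporting choice.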

Experimental Setup

Task	Weak Model	Strong Model	Dataset
NLP	GPT-2 level	GPT-4	22 benchmarks
Chess	GPT-2 level	GPT-4	Chess puzzles
Reward Modeling	GPT-2 level	GPT-3.5 level	ChatGPT RM data

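Key Findings

1. Naive Finetuning Results

Simply finetuning the strong model on weak labels gives:

NLP Tasks:
- PGR of about 50% when GPT-4 is trained on GPT-2-level labels
- The strong model consistently outperforms its weak supervisor

Chess Puzzles:
- Weak generalization (low PGR)
- The nature of the task makes the student prone to imitating errors

Reward Modeling:
- The lowest PGR of the three settings
- Naive methods are insufficient on this alignment-critical task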

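2. Why Generalization Happens

A strong pretrained model already has good representations for alignment-relevant tasks:

A model that can generate complex code already "knows" whether that code matches the user's intent
The weak supervisor does not teach new capabilities; it elicits knowledge the strong model already has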

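Improvement Methods

Method 1: Auxiliary Confidence Loss

Add a term that pushes the strong model toward confident predictions:

L_total = L_CE(weak_labels) + alpha * L_conf

L_conf = E_x[H(p(y|x))]
       = -E_x[sum_y p(y|x) * log(p(y|x))]

Minimizing L_conf lowers the entropy of the predictive distribution, that is, it raises confidence. This entropy form is a simplification: the paper's loss mixes cross-entropy against the weak labels with cross-entropy against the strong model's own hardened (thresholded) predictions, which has a related confidence-raising effect.

Intuition: it lets the strong model disagree confidently with the weak supervisor

Result: on NLP tasks, PGR improves from about 50% to about 80%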
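
To make the objective concrete, here is a minimal pure-Python sketch for a single example. It assumes a simplified form in which the auxiliary term is the predictive entropy, so that minimizing it raises confidence; it is not the paper's exact loss. `softmax`, `entropy`, and `total_loss` are illustrative helpers, and `alpha=0.5` is an arbitrary choice.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy H(p) = -sum(p * log p), skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def total_loss(logits: Sequence[float], weak_label: int, alpha: float = 0.5) -> float:
    """Cross-entropy against the weak label plus alpha times predictive entropy."""
    probs = softmax(logits)
    ce = -math.log(probs[weak_label])
    return ce + alpha * entropy(probs)


# Two hypothetical binary predictions: one hedging, one confident.
hedging = softmax([0.2, 0.0])
confident = softmax([3.0, -3.0])
print(f"entropy (hedging):   {entropy(hedging):.3f}")
print(f"entropy (confident): {entropy(confident):.3f}")
```

The confident prediction has the lower entropy, so the auxiliary term penalizes it less. That holds whether or not the prediction agrees with the weak label, which is what leaves room for confident disagreement.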

Method 2: Bootstrapping

Transfer knowledge step by step through intermediate-size models:

Weak Model -> Medium Model -> Strong Model

[GPT-2 level] -> [GPT-3 level] -> [GPT-4]

Result: effective on chess puzzles

Method 3: Unsupervised Generative Finetuning

Improve the strong model's representations using task-specific data:

[1] Generatively finetune the strong model on task data (unsupervised, no labels)
[2] Then run supervised finetuning on the weak labels

Result: most effective on reward modeling

Summary of Experimental Results

NLP Tasks (22 benchmarks)

Method	PGR (GPT-2 -> GPT-4)
Naive Finetuning	~50%
+ Confidence Loss	~80%
Strong Ceiling	100%

Chess Puzzles

Method	PGR
Naive Finetuning	Low
+ Bootstrapping	Moderate
Strong Ceiling	100%

Reward Modeling (ChatGPT RM)

Method	PGR (GPT-2 -> GPT-3.5)
Naive Finetuning	Very low
+ Generative FT	Improved, but a gap remains
Strong Ceiling	100%

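Limitations and Open Problems

Limitations of the Current Work

Task dependence: no single method works consistently across all settings
RM gap: a substantial gap remains on reward modeling
Disanalogy: the setup differs from the real superalignment problem

Key Disanalogies

Factor	Current Experiments	Real Superalignment
Weak Supervisor	Small model	Human
Error Type	Lack of capability	Complex; deliberate deception possible
Strong Model	Larger model	Superintelligent AI
Ground Truth	Available	Unavailable

Future Research Directions

Scalable Oversight: techniques that extend human supervisory ability
Deception Detection: detecting deception by strong models
Iterative Refinement: methodologies for incremental alignment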

Python Implementation Example

Confidence Loss Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeakToStrongTrainer:
    """Weak-to-Strong Generalization trainer with auxiliary confidence loss."""

    def __init__(
        self,
        strong_model: nn.Module,
        confidence_weight: float = 0.5,
        temperature: float = 1.0
    ):
        self.model = strong_model
        self.confidence_weight = confidence_weight
        self.temperature = temperature

    def confidence_loss(self, logits: torch.Tensor) -> torch.Tensor:
        """
        Auxiliary confidence loss (mean predictive entropy).
        Minimizing it encourages the model to make confident predictions.

        Args:
            logits: Model output logits [batch_size, num_classes]

        Returns:
            Mean entropy over the batch (scalar)
        """
        probs = F.softmax(logits / self.temperature, dim=-1)
        log_probs = F.log_softmax(logits / self.temperature, dim=-1)

        # Entropy: H(p) = -sum(p * log(p))
        # Lower entropy = higher confidence
        entropy = -(probs * log_probs).sum(dim=-1)

        return entropy.mean()  # Minimized during training to raise confidence

    def compute_loss(
        self,
        logits: torch.Tensor,
        weak_labels: torch.Tensor
    ) -> dict:
        """
        Compute total loss with weak supervision and confidence auxiliary.

        Args:
            logits: Model predictions [batch_size, num_classes]
            weak_labels: Labels from weak supervisor [batch_size]

        Returns:
            Dictionary with loss components
        """
        # Cross-entropy with weak labels
        ce_loss = F.cross_entropy(logits, weak_labels)

        # Confidence loss (mean entropy; lower = more confident)
        conf_loss = self.confidence_loss(logits)

        # Total loss
        total_loss = ce_loss + self.confidence_weight * conf_loss

        return {
            'total': total_loss,
            'ce': ce_loss,
            'confidence': conf_loss
        }


class BootstrappingPipeline:
    """Iterative bootstrapping for weak-to-strong generalization."""

    def __init__(
        self,
        model_sizes: list,  # e.g., ['small', 'medium', 'large']
        model_factory: callable
    ):
        self.model_sizes = model_sizes
        self.model_factory = model_factory

    def bootstrap_train(
        self,
        train_data: torch.utils.data.Dataset,
        initial_labels: torch.Tensor
    ) -> nn.Module:
        """
        Progressively train larger models using smaller model labels.

        Args:
            train_data: Training dataset
            initial_labels: Labels from weakest supervisor

        Returns:
            Final (strongest) trained model
        """
        # Guard: with no sizes the loop never runs and `model` would be unbound
        if not self.model_sizes:
            raise ValueError("model_sizes must contain at least one entry")

        current_labels = initial_labels

        for i, size in enumerate(self.model_sizes):
            print(f"Training {size} model (step {i+1}/{len(self.model_sizes)})")

            # Create model of current size
            model = self.model_factory(size)

            # Train on current labels
            model = self._train_model(model, train_data, current_labels)

            # Generate new labels for next iteration
            if i < len(self.model_sizes) - 1:
                current_labels = self._generate_labels(model, train_data)

        return model

    def _train_model(
        self,
        model: nn.Module,
        data: torch.utils.data.Dataset,
        labels: torch.Tensor
    ) -> nn.Module:
        """Train model on given labels."""
        # Implementation details omitted
        return model

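    def _generate_labels(
        self,
        model: nn.Module,
        data: torch.utils.data.Dataset,
        batch_size: int = 32
    ) -> torch.Tensor:
        """Generate hard pseudo-labels from a trained model.

        Assumes each dataset item is an input tensor, or a tuple/list whose
        first element is the input tensor.
        """
        model.eval()
        labels = []
        # Batch through a DataLoader; shuffle=False keeps labels in dataset order
        loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=False)
        with torch.no_grad():
            for batch in loader:
                inputs = batch[0] if isinstance(batch, (list, tuple)) else batch
                logits = model(inputs)
                preds = logits.argmax(dim=-1)
                labels.append(preds)
        return torch.cat(labels)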


def compute_pgr(
    weak_acc: float,
    w2s_acc: float,
    strong_ceil_acc: float
) -> float:
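    """
    Compute Performance Gap Recovered (PGR).

    Args:
        weak_acc: Weak supervisor accuracy
        w2s_acc: Weak-to-strong trained model accuracy
        strong_ceil_acc: Accuracy of the strong model trained on ground truth

    Returns:
        PGR as a percentage, clamped to [0, 100]. The unclamped ratio can be
        negative (student worse than supervisor) or exceed 100.
    """
    # Degenerate case: there is no gap to recover, so the ratio is undefined.
    # Convention used here: 100 if the student at least matches the supervisor.
    if strong_ceil_acc == weak_acc:
        return 100.0 if w2s_acc >= weak_acc else 0.0

    pgr = (w2s_acc - weak_acc) / (strong_ceil_acc - weak_acc)
    return max(0.0, min(100.0, pgr * 100))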


# Example usage
if __name__ == "__main__":
    # Simulated, illustrative accuracies (not figures from the paper)
    results = {
        'nlp': {
            'weak': 0.65,
            'w2s_naive': 0.78,
            'w2s_confidence': 0.88,
            'strong_ceiling': 0.95
        },
        'chess': {
            'weak': 0.45,
            'w2s_naive': 0.52,
            'w2s_bootstrap': 0.68,
            'strong_ceiling': 0.85
        },
        'reward_modeling': {
            'weak': 0.55,
            'w2s_naive': 0.58,
            'w2s_gen_ft': 0.72,
            'strong_ceiling': 0.90
        }
    }

    print("Performance Gap Recovered (PGR) Analysis")
    print("=" * 50)

    for task, metrics in results.items():
        print(f"\n{task.upper()}:")

        pgr_naive = compute_pgr(
            metrics['weak'],
            metrics['w2s_naive'],
            metrics['strong_ceiling']
        )

        best_key = [k for k in metrics if k.startswith('w2s_') and k != 'w2s_naive'][0]
        pgr_best = compute_pgr(
            metrics['weak'],
            metrics[best_key],
            metrics['strong_ceiling']
        )

        print(f"  Naive PGR: {pgr_naive:.1f}%")
        print(f"  Best Method PGR: {pgr_best:.1f}%")

Example Output

Performance Gap Recovered (PGR) Analysis
==================================================

NLP:
  Naive PGR: 43.3%
  Best Method PGR: 76.7%

CHESS:
  Naive PGR: 17.5%
  Best Method PGR: 57.5%

REWARD_MODELING:
  Naive PGR: 8.6%
  Best Method PGR: 48.6%

Key Takeaways

Aspect	Summary
Problem	Can a weak supervisor elicit the full capability of a strong model?
Finding	Weak-to-strong generalization emerges even with naive finetuning
Limitation	A substantial gap remains on alignment-critical tasks (RM)
Improvements	Confidence loss, bootstrapping, generative finetuning
Significance	Provides an empirical research framework for the superalignment problem

References

Burns, C., et al. (2023). Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. arXiv:2312.09390
Christiano, P., et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS
OpenAI. (2023). Superalignment. openai.com/superalignment