Curriculum Learning

Meta Information

Item	Details
Category	Training Strategy / Optimization
Original paper	"Curriculum Learning" (ICML 2009)
Authors	Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston
Core concept	Present training data in easy-to-hard order to improve training efficiency and generalization
Related areas	Self-Paced Learning, Data Ordering, LLM Pretraining, Active Learning

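Definition

Curriculum Learning (CL) is a training strategy inspired by how humans and animals learn. Instead of presenting training data in randomly shuffled order, it ranks samples by difficulty and introduces them gradually, starting with easy samples and moving toward hard ones. This can improve training speed, convergence quality, and generalization.

Core assumption: if easy examples lay the groundwork for a good representation early in training, harder examples can be learned more effectively later.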

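Core Idea

Analogy with Human Learning

Human education:
  Arithmetic -> Algebra -> Calculus -> Real analysis

  Easy concepts are learned first, and harder concepts are built on top of them

Curriculum Learning:
  Easy samples -> Medium samples -> Hard samples

  The model learns easy patterns first, then extends to complex patterns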

Random Training vs Curriculum Learning

Random Training:
  Epoch 1: [hard, easy, medium, hard, easy, ...]
  Epoch 2: [medium, hard, easy, easy, medium, ...]
  -> random order; hard examples can cause noisy gradients early in training

Curriculum Learning:
  Phase 1: [easy, easy, easy, ...]              -- learn basic patterns
  Phase 2: [easy, medium, medium, easy, ...]    -- gradual expansion
  Phase 3: [easy, medium, hard, hard, ...]      -- full dataset
  -> structured learning, more stable gradients

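Theoretical Intuition

Bengio et al. (2009) explain the benefit of CL from two perspectives:

Connection to continuation methods: from an optimization standpoint, starting from a "smoothed" objective built from easy examples and gradually moving to the full objective resembles a continuation method. The smoothed objective has fewer local minima, which makes early convergence easier.
Increasing entropy: training starts from a low-entropy distribution (easy samples only) and gradually widens to the higher-entropy full data distribution, which can help the optimizer explore the more complex regions of the loss surface.

Loss landscape view:

Random:
  start -> [complex landscape] -> local minimum (sub-optimal)

Curriculum:
  start -> [smooth landscape]      -> enters a good basin
        -> [gradually more complex] -> better minimum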
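
The two perspectives above can be stated a little more formally. The block below is a sketch of the curriculum definition as it is usually summarized from Bengio et al. (2009). The symbols z, P, Q_lambda, and W_lambda are introduced here for illustration and are not used elsewhere on this page.

```latex
% z: a training example, P(z): the target training distribution
% W_\lambda(z) \in [0, 1]: weight given to example z at curriculum stage \lambda \in [0, 1]
Q_\lambda(z) \propto W_\lambda(z)\, P(z), \qquad Q_1(z) = P(z)

% The sequence Q_\lambda is a curriculum if entropy and weights both grow with \lambda:
H(Q_\lambda) < H(Q_{\lambda + \epsilon}), \qquad
W_{\lambda + \epsilon}(z) \ge W_\lambda(z) \quad \forall z,\ \forall \epsilon > 0
```

Read this way, "easy first" means that early stages put weight on a small, low-entropy subset, and later stages only ever add weight, never remove it.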

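Components of Curriculum Learning

CL consists of two main components:

1. Difficulty Measurer

The criterion that defines the difficulty of each sample.

Method	Description	Typical use
Loss-based	Loss of a pretrained model on the sample	General
Confidence-based	Prediction confidence of a teacher model	Classification
Sentence length/complexity	Text length, lexical diversity	NLP
Image resolution/noise	Visual complexity	CV
Data noise level	Estimated label noise	Noisy data
Perplexity	Perplexity under a language model	LLM
Age-of-Acquisition	Age at which a word is typically acquired	NLP

Difficulty measurement pipeline:

Dataset D = {(x_1, y_1), ..., (x_N, y_N)}
        |
        v
Difficulty Scorer:  d(x_i) -> score_i
        |
        v
Sort:   D_sorted = sort(D, key=d)
        |
        v
Easy [====|=======|==========] Hard
     Phase 1  Phase 2  Phase 3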
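
The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming difficulty scores are already available as an array; `build_phases` and the cutoff values are hypothetical names and numbers chosen for the example.

```python
import numpy as np


def build_phases(scores: np.ndarray, cutoffs=(0.33, 0.66, 1.0)) -> list:
    """Return cumulative index sets, easiest samples first (hypothetical helper)."""
    order = np.argsort(scores, kind="stable")  # ascending score: easy -> hard
    n = len(order)
    # Each phase contains every sample up to its cutoff, so phases are nested.
    return [order[: max(1, int(round(n * c)))] for c in cutoffs]


scores = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])
phases = build_phases(scores)
for k, idx in enumerate(phases, start=1):
    print(f"Phase {k}: sample indices {idx.tolist()}")
```

Because the phases are cumulative, easy samples stay in the pool when harder ones are added, which matches the Phase 1 to Phase 3 diagram.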

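2. Training Scheduler

Decides when, and in what proportion, the sorted data is introduced.

(a) Discrete Curriculum (staged):
    Step 1: train on the easiest 33%
    Step 2: expand to the easiest 66%
    Step 3: use the full 100%

(b) Continuous Curriculum:
    t=0: sample only from the easiest 20%
    t -> T: gradually move to uniform sampling over the full dataset

    lambda(t) = min(1, lambda_0 * (1 + t/T_grow))
    (lambda(t): fraction of the sorted data available at step t; lambda_0 = 0.2 above)

(c) Competence-Based:
    Measure the model's current competence c(t) and
    include only samples with difficulty d <= c(t)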
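
The three scheduler types above can be written as small functions that return how much of the sorted data is available. This is a sketch under the formulas given in this section; the function names and the example step counts are assumptions made for illustration.

```python
def discrete_fraction(step: int, total_steps: int) -> float:
    """(a) Three-stage schedule: 33% -> 66% -> 100% of the sorted data."""
    progress = step / total_steps
    if progress < 1 / 3:
        return 0.33
    if progress < 2 / 3:
        return 0.66
    return 1.0


def continuous_fraction(step: int, t_grow: int, lambda_0: float = 0.2) -> float:
    """(b) lambda(t) = min(1, lambda_0 * (1 + t / T_grow))."""
    return min(1.0, lambda_0 * (1.0 + step / t_grow))


def competence_mask(difficulties: list, competence: float) -> list:
    """(c) Keep only samples whose difficulty is at most the current competence."""
    return [d <= competence for d in difficulties]


for step in (0, 100, 400):
    print(step, continuous_fraction(step, t_grow=100))
```

Note that with lambda_0 = 0.2 the continuous schedule only reaches the full dataset at t = 4 * T_grow, so T_grow should be chosen with the total step budget in mind.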

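Main Variants

1. Self-Paced Learning (SPL)

Proposed by Kumar et al. (NIPS 2010). Instead of an external criterion, it uses the model's own loss as the difficulty signal.

Self-Paced Learning:

  min_{w, v} E(w, v) = sum_i v_i * L(y_i, f(x_i, w)) + g(v; lambda)

  v_i in {0, 1}: whether sample i is selected
  lambda: pace parameter (increased gradually)
  g(v; lambda): self-paced regularizer

Interpretation:
  - Small-loss samples (v_i=1): handled well by the current model -> "easy" samples
  - Large-loss samples (v_i=0): handled poorly by the current model -> "hard" samples
  - As lambda grows, more samples are included

Aspect	Curriculum Learning	Self-Paced Learning
Difficulty criterion	Predefined (external)	Model's own loss (internal)
Adaptivity	Static	Dynamic (changes during training)
Design cost	Requires domain knowledge	Automatic
Drawback	Difficulty is hard to define	Can be biased toward easy samples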
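
To see why the SPL objective selects low-loss samples, it helps to fix one concrete regularizer. The derivation below assumes the hard-selection choice g(v; lambda) = -lambda * sum_i v_i (an assumption; other regularizers give soft weights) and writes L_i for L(y_i, f(x_i, w)).

```latex
E(w, v) = \sum_i v_i L_i - \lambda \sum_i v_i = \sum_i v_i \,(L_i - \lambda)

% For fixed w, each v_i \in \{0, 1\} can be minimized independently:
v_i^{*} =
\begin{cases}
1 & \text{if } L_i < \lambda \\
0 & \text{otherwise}
\end{cases}
```

Training then alternates between this closed-form update of v and gradient steps on w, and raising lambda admits samples with progressively larger loss.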

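2. Self-Paced Curriculum Learning (SPCL)

Jiang et al. (AAAI 2015). Combines the strengths of CL and SPL.

SPCL:
  min_{w, v} sum_i v_i * L(y_i, f(x_i, w)) + g(v; lambda) - lambda * sum_i f_i * v_i

  f_i: predefined curriculum weight of sample i (external prior; distinct from the model f(x_i, w)); larger values favor earlier selection
  g(v; lambda): self-paced regularizer (internal difficulty)

  Note: this penalty form is a simplification. In Jiang et al. the curriculum is encoded as a constraint region on v rather than as an extra penalty term.

Combined effect:
  - External knowledge (curriculum) defines the rough ordering
  - The model's training state (self-paced) fine-tunes it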
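
Under the penalty form written above, and assuming the hard-selection regularizer g(v; lambda) = -lambda * sum_i v_i (an assumption, not stated in the text), minimizing over each v_i gives the rule v_i = 1 when L_i < lambda * (1 + f_i). The sketch below applies that rule; `spcl_select` and the example numbers are hypothetical.

```python
import numpy as np


def spcl_select(losses: np.ndarray, prior: np.ndarray, lam: float) -> np.ndarray:
    """Return a 0/1 selection vector v for the penalty-form objective.

    prior holds the predefined curriculum weights f_i (larger = favored earlier).
    """
    threshold = lam * (1.0 + prior)  # per-sample loss threshold
    return (losses < threshold).astype(int)


losses = np.array([0.4, 0.9, 0.9, 2.0])
prior = np.array([0.0, 0.0, 1.0, 1.0])
v = spcl_select(losses, prior, lam=0.5)
print(v)
```

Samples 2 and 3 have the same loss as each other's neighbors but a higher prior, so sample 2 is admitted at a loss that excludes sample 1: the external curriculum shifts the self-paced threshold instead of replacing it.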

3. Anti-Curriculum Learning

A reverse curriculum that starts from hard samples instead of easy ones.

Anti-Curriculum:
  Phase 1: hard samples first  (hard-mining effect)
  Phase 2: add medium samples
  Phase 3: full dataset

When it helps:
  - The dataset is large enough and has little noise
  - Tasks where hard negative mining matters (e.g. metric learning)
  - The model has sufficient capacity

Some studies report that anti-curriculum outperforms a standard curriculum on certain tasks (notably contrastive learning and object detection). The best ordering depends on the task and the data.

4. Transfer Teacher

A teacher model decides the student's training curriculum.

Transfer Teacher approaches:

(a) Pretrained teacher:
    Teacher(x_i) -> confidence_i
    Sort samples from highest to lowest confidence

(b) RL-based teacher:
    Teacher (policy network) -> selects the next batch
    Change in student performance -> reward signal
    -> the teacher learns the best curriculum

(c) Reward-based teacher:
    The teacher predicts the learning value (reward) of each sample
    High-reward samples are trained on first
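5. Competence-Based Curriculum

Platanios et al. (NAACL 2019). Explicitly tracks the model's competence and feeds it data of suitable difficulty.

Competence Function:
  c(t) = min(1, sqrt(t * (1 - c_0^2) / T + c_0^2))

  c(0) = c_0: initial competence (a small value)
  c(T) = 1: final competence (full dataset)

Data selection:
  At step t, include only samples with difficulty d(x_i) <= c(t)
  (d is mapped to [0, 1], e.g. via its empirical CDF, so that it is comparable to c(t))

NMT results (Platanios et al.):
  - Training time reduced by up to 70%
  - BLEU improved by up to 2.2 points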
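
The competence function above is easy to evaluate numerically. The following is a minimal sketch of the formula as written in this section; the helper `available` and the step values are assumptions added for illustration.

```python
import math


def competence(t: int, total_steps: int, c0: float = 0.01) -> float:
    """Square-root competence: c(t) = min(1, sqrt(t * (1 - c0^2) / T + c0^2))."""
    return min(1.0, math.sqrt(t * (1.0 - c0 ** 2) / total_steps + c0 ** 2))


def available(difficulty_cdf: list, c: float) -> list:
    """Indices of samples whose difficulty (already mapped to [0, 1]) is at most c."""
    return [i for i, d in enumerate(difficulty_cdf) if d <= c]


T = 1000
for t in (0, 250, 500, 1000):
    print(f"t={t:4d}  c(t)={competence(t, T):.3f}")

print(available([0.1, 0.5, 0.9], competence(250, T)))
```

Because of the square root, roughly half of the data is already available after a quarter of the warmup, so new samples arrive quickly at first and more slowly later.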

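Curriculum Learning in LLM Pretraining

LLM research in 2024-2025 has brought renewed attention to CL.

Why Data Order Matters

Characteristics of LLM pretraining:
  - Large-scale data on the order of trillions of tokens
  - Typically only 1-2 epochs (most data is seen once)
  - Data order is therefore one of the few remaining optimization levers

Random Shuffle:
  [Wikipedia, Code, Reddit, Books, Code, Web, ...]
  -> all domains and difficulty levels are mixed from the start of training

Curriculum Ordering:
  Phase 1: Clean text (Wikipedia, Books) -- learn grammar/structure
  Phase 2: + Code, Q&A               -- learn logic/reasoning
  Phase 3: + Web crawl, Forums       -- broaden diversity

Difficulty Measures (LLM)

Metric	How it is measured	Pros	Cons
Perplexity	Measured with a reference LM	Intuitive, general	Depends on the reference model
Sentence length	Token count	Negligible compute cost	Crude signal
Lexical diversity	Unique token ratio	Simple	Ignores content
Data source quality	Quality tier per source	Reflects domain	Requires manual work
Compression ratio	Compression rate with gzip or similar	Measures information density	Ignores meaning
Word frequency	Based on word frequency	Linguistically grounded	NLP-specific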
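
Three of the cheap metrics in the table (length, unique token ratio, compression ratio) need nothing beyond the standard library. The sketch below uses whitespace splitting as a stand-in for a real tokenizer and zlib as the compressor; both choices, and the function names, are assumptions made for illustration.

```python
import zlib


def token_length(text: str) -> int:
    """Number of whitespace-separated tokens (stand-in for a real tokenizer)."""
    return len(text.split())


def unique_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens; 0.0 for empty text."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def compression_ratio(text: str) -> float:
    """Compressed size over raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


repetitive = "the cat sat on the mat. " * 20
varied = ("Curriculum learning orders training samples by difficulty, "
          "from simple sentences to rare technical jargon.")

for name, text in (("repetitive", repetitive), ("varied", varied)):
    print(name, token_length(text),
          round(unique_token_ratio(text), 2),
          round(compression_ratio(text), 2))
```

Each metric lives on a different scale, so they would need to be normalized over the whole corpus before being combined into one difficulty score.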

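Applications

Computer Vision

Image classification:
  Phase 1: clean images, prototypical examples
  Phase 2: mild transformations/augmentations
  Phase 3: hard examples (occlusion, rare viewpoints)

Object Detection:
  Phase 1: large objects, simple backgrounds
  Phase 2: medium-sized objects, complex backgrounds
  Phase 3: small objects, overlap, occlusion

NLP / NMT

Machine translation:
  Difficulty criteria: sentence length, rare-word ratio, syntactic complexity

  Phase 1: short, simple sentences
  Phase 2: medium length, common constructions
  Phase 3: long sentences, complex constructions, technical terms

Results (Platanios et al., 2019):
  - Training time reduced by up to 70%
  - BLEU improved by up to 2.2 points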
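
One of the difficulty criteria listed above, word rarity, can be scored as the negative log-probability of a sentence under a unigram model, then mapped to [0, 1] through the empirical CDF so it can be compared with a competence value. The sketch below estimates unigram frequencies from the toy corpus itself; the function names and corpus are hypothetical.

```python
import bisect
import math
from collections import Counter


def rarity_scores(sentences: list) -> list:
    """Score each sentence as -sum(log p(word)) under corpus unigram frequencies."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    return [
        -sum(math.log(counts[w] / total) for w in s.lower().split())
        for s in sentences
    ]


def to_cdf(scores: list) -> list:
    """Map raw scores to (0, 1]: the fraction of samples that are at most this hard."""
    ranked = sorted(scores)
    return [bisect.bisect_right(ranked, s) / len(scores) for s in scores]


corpus = [
    "the cat sat",
    "the cat",
    "quantum chromodynamics perturbation theory",
]
raw = rarity_scores(corpus)
cdf = to_cdf(raw)
for sentence, r, c in zip(corpus, raw, cdf):
    print(f"{c:.2f}  {r:6.2f}  {sentence}")
```

Longer sentences and sentences with rarer words both receive higher scores, so this single number blends two of the criteria from the list.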

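Reinforcement Learning

Adjusting environment difficulty:
  Phase 1: simple environment (sparse obstacles)
  Phase 2: medium complexity
  Phase 3: full environment

OpenAI's approach (domain randomization from 2017; Automatic Domain Randomization introduced in 2019):
  - Automatic Domain Randomization
  - Gradually adjusts environment parameters
  - Automatically increases environment difficulty according to agent performance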
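
The idea of raising environment difficulty when the agent performs well can be sketched with a single difficulty range and two thresholds. This is a deliberately simplified illustration, not OpenAI's implementation; `DifficultyRange`, its thresholds, and the step size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DifficultyRange:
    """Upper bound on sampled environment difficulty, adapted from performance."""

    low: float = 0.0
    high: float = 0.2
    step: float = 0.1
    max_high: float = 1.0

    def update(self, success_rate: float,
               expand_at: float = 0.8, shrink_at: float = 0.4) -> None:
        """Widen the range when the agent succeeds often, narrow it when it struggles."""
        if success_rate >= expand_at:
            self.high = min(self.max_high, self.high + self.step)
        elif success_rate <= shrink_at:
            self.high = max(self.low, self.high - self.step)
        # Between the two thresholds the range is left unchanged.


difficulty = DifficultyRange()
for success_rate in (0.9, 0.85, 0.3, 0.6, 0.95):
    difficulty.update(success_rate)
    print(f"success={success_rate:.2f} -> difficulty in "
          f"[{difficulty.low:.1f}, {difficulty.high:.1f}]")
```

Using two separate thresholds leaves a dead band in between, which keeps the range from oscillating on noisy success estimates.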

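Noisy Label Learning

Advantages of CL under label noise:

Observation: the loss of cleanly labeled samples decreases quickly,
             while the loss of noisily labeled samples decreases slowly or increases

Usage:
  Phase 1: train only on low-loss samples (likely to be clean)
  Phase 2: gradually relax the threshold

  -> Self-paced learning acts as a natural noise filter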
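
The two phases above amount to keeping the lowest-loss fraction of samples and raising that fraction over time. The sketch below assumes per-sample losses are already computed; `select_small_loss`, `keep_ratio_at`, and the example losses are hypothetical.

```python
import numpy as np


def select_small_loss(losses: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return indices of the keep_ratio fraction of samples with the smallest loss."""
    n_keep = max(1, int(len(losses) * keep_ratio))
    return np.argsort(losses, kind="stable")[:n_keep]


def keep_ratio_at(epoch: int, num_epochs: int,
                  start: float = 0.5, end: float = 1.0) -> float:
    """Linearly relax the selection from `start` to `end` over training."""
    t = epoch / max(num_epochs - 1, 1)
    return start + (end - start) * t


# Samples 1 and 3 have unusually high loss, as mislabeled samples often do.
losses = np.array([0.1, 2.5, 0.2, 3.0, 0.05, 0.3])
for epoch in (0, 2, 4):
    ratio = keep_ratio_at(epoch, num_epochs=5)
    kept = select_small_loss(losses, ratio)
    print(f"epoch {epoch}: keep {ratio:.2f} -> indices {sorted(kept.tolist())}")
```

In real training the losses would be recomputed every epoch, so a sample that was excluded early can re-enter once the model fits it.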

Python Implementation

Basic Curriculum Learning

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, Sampler
import numpy as np
from typing import List, Callable, Optional


class DifficultyScorer:
    """Compute a per-sample difficulty score."""

    def __init__(self, method: str = 'loss'):
        """
        Args:
            method: 'loss', 'confidence', 'length', 'perplexity'
        """
        self.method = method

    def score_by_loss(
        self,
        model: nn.Module,
        dataset: Dataset,
        device: str = 'cuda'
    ) -> np.ndarray:
        """Measure difficulty as the per-sample loss of a pretrained model."""
        model.eval()
        scores = []
        loader = DataLoader(dataset, batch_size=128, shuffle=False)
        criterion = nn.CrossEntropyLoss(reduction='none')

        with torch.no_grad():
            for data, targets in loader:
                data, targets = data.to(device), targets.to(device)
                outputs = model(data)
                losses = criterion(outputs, targets)
                scores.extend(losses.cpu().numpy())

        return np.array(scores)

    def score_by_confidence(
        self,
        model: nn.Module,
        dataset: Dataset,
        device: str = 'cuda'
    ) -> np.ndarray:
        """Teacher-model prediction confidence (higher confidence = easier)."""
        model.eval()
        scores = []
        loader = DataLoader(dataset, batch_size=128, shuffle=False)

        with torch.no_grad():
            for data, _ in loader:
                data = data.to(device)
                outputs = model(data)
                probs = torch.softmax(outputs, dim=1)
                confidence = probs.max(dim=1).values
                # Low confidence = high difficulty
                scores.extend((1 - confidence).cpu().numpy())

        return np.array(scores)

    def score_by_length(self, texts: List[str]) -> np.ndarray:
        """Length-based text difficulty (for NLP)."""
        lengths = np.array([len(t.split()) for t in texts])
        # Min-max normalize to [0, 1]
        return (lengths - lengths.min()) / (lengths.max() - lengths.min() + 1e-8)


class CurriculumSampler(Sampler):
    """Data sampler for curriculum learning."""

    def __init__(
        self,
        difficulty_scores: np.ndarray,
        num_epochs: int,
        current_epoch: int = 0,
        strategy: str = 'linear',
        initial_fraction: float = 0.3
    ):
        """
        Args:
            difficulty_scores: difficulty score of each sample (lower = easier)
            num_epochs: total number of epochs
            current_epoch: current epoch
            strategy: 'linear', 'sqrt', 'step'
            initial_fraction: fraction of the data used at the start
        """
        self.difficulty_scores = difficulty_scores
        self.num_epochs = num_epochs
        self.current_epoch = current_epoch
        self.strategy = strategy
        self.initial_fraction = initial_fraction

        # Indices sorted by difficulty (easiest first)
        self.sorted_indices = np.argsort(difficulty_scores)
        self.n_samples = len(difficulty_scores)

    def _compute_fraction(self) -> float:
        """Fraction of the data to use in the current epoch."""
        t = self.current_epoch / max(self.num_epochs - 1, 1)

        if self.strategy == 'linear':
            frac = self.initial_fraction + (1 - self.initial_fraction) * t
        elif self.strategy == 'sqrt':
            frac = self.initial_fraction + (1 - self.initial_fraction) * np.sqrt(t)
        elif self.strategy == 'step':
            # Three-stage discrete curriculum
            if t < 0.33:
                frac = self.initial_fraction
            elif t < 0.66:
                frac = (1 + self.initial_fraction) / 2
            else:
                frac = 1.0
        else:
            frac = 1.0

        return min(frac, 1.0)

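    def __iter__(self):
        fraction = self._compute_fraction()
        n_selected = max(1, int(self.n_samples * fraction))

        # Take the n_selected easiest samples and shuffle a copy of them.
        # np.random.permutation returns a new array, so sorted_indices keeps
        # its difficulty order (an in-place shuffle of the slice would mutate it).
        selected = np.random.permutation(self.sorted_indices[:n_selected])

        return iter(selected.tolist())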

    def __len__(self):
        fraction = self._compute_fraction()
        return max(1, int(self.n_samples * fraction))

    def set_epoch(self, epoch: int):
        self.current_epoch = epoch


class CurriculumTrainer:
    """Trainer for curriculum learning."""

    def __init__(
        self,
        model: nn.Module,
        train_dataset: Dataset,
        val_dataset: Dataset,
        difficulty_scores: np.ndarray,
        num_epochs: int = 100,
        batch_size: int = 64,
        learning_rate: float = 1e-3,
        strategy: str = 'linear',
        initial_fraction: float = 0.3,
        device: str = 'cuda'
    ):
        self.model = model.to(device)
        self.train_dataset = train_dataset
        self.device = device
        self.num_epochs = num_epochs
        self.batch_size = batch_size

        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

        # Curriculum Sampler
        self.sampler = CurriculumSampler(
            difficulty_scores=difficulty_scores,
            num_epochs=num_epochs,
            strategy=strategy,
            initial_fraction=initial_fraction
        )

        self.val_loader = DataLoader(val_dataset, batch_size=batch_size)

    def train(self):
        """Full training loop."""
        best_acc = 0
        history = []

        for epoch in range(self.num_epochs):
            self.sampler.set_epoch(epoch)

            train_loader = DataLoader(
                self.train_dataset,
                batch_size=self.batch_size,
                sampler=self.sampler
            )

            # Train
            self.model.train()
            total_loss = 0
            n_batches = 0

            for data, targets in train_loader:
                data, targets = data.to(self.device), targets.to(self.device)

                self.optimizer.zero_grad()
                outputs = self.model(data)
                loss = self.criterion(outputs, targets)
                loss.backward()
                self.optimizer.step()

                total_loss += loss.item()
                n_batches += 1

            # Evaluate
            val_acc = self._evaluate()
            avg_loss = total_loss / max(n_batches, 1)
            fraction = self.sampler._compute_fraction()

            history.append({
                'epoch': epoch,
                'loss': avg_loss,
                'val_acc': val_acc,
                'data_fraction': fraction,
                'n_samples': len(self.sampler)
            })

            if val_acc > best_acc:
                best_acc = val_acc

            if (epoch + 1) % 10 == 0:
                print(
                    f"Epoch {epoch+1}/{self.num_epochs} | "
                    f"Loss: {avg_loss:.4f} | "
                    f"Val Acc: {val_acc:.4f} | "
                    f"Data: {fraction:.1%} ({len(self.sampler)} samples)"
                )

        print(f"\nBest validation accuracy: {best_acc:.4f}")
        return history

    def _evaluate(self):
        self.model.eval()
        correct = 0
        total = 0

        with torch.no_grad():
            for data, targets in self.val_loader:
                data, targets = data.to(self.device), targets.to(self.device)
                outputs = self.model(data)
                _, predicted = outputs.max(1)
                correct += predicted.eq(targets).sum().item()
                total += targets.size(0)

        return correct / total

Self-Paced Learning

class SelfPacedLearning:
    """Self-Paced Learning implementation."""

    def __init__(
        self,
        model: nn.Module,
        train_dataset: Dataset,
        lambda_init: float = 0.1,
        lambda_growth: float = 1.3,
        lambda_max: float = 10.0,
        batch_size: int = 64,
        device: str = 'cuda'
    ):
        """
        Args:
            lambda_init: initial pace parameter
            lambda_growth: multiplicative growth of lambda per epoch
            lambda_max: maximum lambda value
        """
        self.model = model.to(device)
        self.train_dataset = train_dataset
        self.device = device
        self.batch_size = batch_size

        self.lambda_param = lambda_init
        self.lambda_growth = lambda_growth
        self.lambda_max = lambda_max

        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        # Sample weights (v_i)
        self.n_samples = len(train_dataset)
        self.sample_weights = torch.ones(self.n_samples)

    def _update_weights(self, losses: torch.Tensor):
        """
        Update the sample weights according to the self-paced regularizer.

        Hard thresholding: v_i = 1 if L_i < lambda, else 0
        """
        self.sample_weights = (losses < self.lambda_param).float()

    def _grow_lambda(self):
        """Raise the pace parameter so the next epoch admits more samples."""
        self.lambda_param = min(self.lambda_param * self.lambda_growth, self.lambda_max)

    def train_epoch(self):
        """Train for one epoch; returns (mean loss, number of selected samples)."""
        # Compute the loss of every sample with the current model (for the weight update)
        all_loader = DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=False
        )

        all_losses = []
        self.model.eval()
        with torch.no_grad():
            for data, targets in all_loader:
                data, targets = data.to(self.device), targets.to(self.device)
                outputs = self.model(data)
                losses = self.criterion(outputs, targets)
                all_losses.extend(losses.cpu().tolist())

        # Update the weights
        self._update_weights(torch.tensor(all_losses))

        # Train on the selected samples
        selected_indices = torch.where(self.sample_weights > 0)[0].tolist()

        if len(selected_indices) == 0:
            # No sample is below the threshold yet. Lambda must still grow,
            # otherwise every later epoch would select nothing as well.
            self._grow_lambda()
            return 0.0, 0

        selected_subset = torch.utils.data.Subset(self.train_dataset, selected_indices)
        train_loader = DataLoader(selected_subset, batch_size=self.batch_size, shuffle=True)

        self.model.train()
        total_loss = 0

        for data, targets in train_loader:
            data, targets = data.to(self.device), targets.to(self.device)

            self.optimizer.zero_grad()
            outputs = self.model(data)
            loss = self.criterion(outputs, targets).mean()
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

        # Grow lambda (more samples are included in the next epoch)
        self._grow_lambda()

        return total_loss / len(train_loader), len(selected_indices)

Curriculum for LLM Pretraining

import math
from dataclasses import dataclass

import numpy as np


@dataclass
class TextDifficulty:
    """Multiple metrics for measuring text difficulty."""

    text: str
    index: int
    perplexity: float = 0.0
    token_length: int = 0
    vocab_diversity: float = 0.0
    compression_ratio: float = 0.0

    @property
    def composite_score(self) -> float:
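        """Composite difficulty score (weighted average of the metrics)."""
        # Each metric should be normalized to [0, 1] before the weighted average
        # (in practice this requires normalizing over the whole dataset)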
        return (
            0.4 * self.perplexity +
            0.2 * self.token_length +
            0.2 * self.vocab_diversity +
            0.2 * self.compression_ratio
        )


class LLMCurriculumDataLoader:
    """Curriculum data loader for LLM pretraining."""

    def __init__(
        self,
        texts: list,
        difficulty_scores: np.ndarray,
        total_steps: int,
        batch_size: int,
        warmup_fraction: float = 0.1,
        strategy: str = 'competence'
    ):
        """
        Args:
            texts: list of training texts
            difficulty_scores: difficulty score of each text
            total_steps: total number of training steps
            batch_size: batch size
            warmup_fraction: fraction of total_steps after which the full dataset is available
            strategy: 'competence', 'pacing', 'interleaved'
        """
        self.texts = texts
        self.batch_size = batch_size
        self.total_steps = total_steps
        # At least 1, so the 'pacing' strategy never divides by zero
        self.warmup_steps = max(1, int(total_steps * warmup_fraction))
        self.strategy = strategy

        # Sort by difficulty
        self.sorted_indices = np.argsort(difficulty_scores)
        self.difficulty_scores = difficulty_scores[self.sorted_indices]
        self.n_samples = len(texts)

        self.current_step = 0

    def competence(self, step: int) -> float:
        """Competence function of Platanios et al. (2019)."""
        c_0 = 0.01  # initial competence
        if step >= self.warmup_steps:
            return 1.0
        t = step / self.warmup_steps
        return min(1.0, math.sqrt(t * (1 - c_0**2) + c_0**2))

    def get_batch(self) -> list:
        """Return a batch suited to the current step."""
        if self.strategy == 'competence':
            c = self.competence(self.current_step)
            n_available = max(1, int(self.n_samples * c))
            available_indices = self.sorted_indices[:n_available]
        elif self.strategy == 'pacing':
            # Linear pacing
            t = min(1.0, self.current_step / self.warmup_steps)
            n_available = max(1, int(self.n_samples * (0.2 + 0.8 * t)))
            available_indices = self.sorted_indices[:n_available]
        else:  # interleaved
            # Mix easy and hard data within each batch
            c = self.competence(self.current_step)
            easy_n = max(1, int(self.n_samples * c))
            n_easy = int(self.batch_size * (1 - c * 0.5))
            hard_pool = self.sorted_indices[easy_n:]
            n_hard = min(self.batch_size - n_easy, len(hard_pool))
            # Fill any shortfall in the hard pool from the easy pool, so the
            # batch does not shrink once competence reaches 1.0
            n_easy = min(self.batch_size - n_hard, easy_n)

            easy_batch = np.random.choice(
                self.sorted_indices[:easy_n],
                size=n_easy,
                replace=False
            )
            hard_batch = np.random.choice(
                hard_pool,
                size=n_hard,
                replace=False
            ) if n_hard > 0 else np.array([], dtype=int)

            selected = np.concatenate([easy_batch, hard_batch]).astype(int)
            self.current_step += 1
            return [self.texts[i] for i in selected]

        # Sample uniformly from the currently available pool
        selected = np.random.choice(
            available_indices,
            size=min(self.batch_size, len(available_indices)),
            replace=False
        )

        self.current_step += 1
        return [self.texts[i] for i in selected]


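def compute_text_difficulty(
    texts: list,
    reference_model=None,
    tokenizer=None
) -> np.ndarray:
    """
    Compute a composite text difficulty score.

    Args:
        texts: list of texts
        reference_model: reference model for perplexity (optional; unused by this heuristic version)
        tokenizer: tokenizer (optional; unused by this heuristic version)

    Returns:
        Array of difficulty scores normalized to [0, 1]
    """
    def _minmax(values: np.ndarray) -> np.ndarray:
        """Min-max normalize to [0, 1]; all zeros if every value is equal."""
        span = values.max() - values.min()
        return (values - values.min()) / span if span > 0 else np.zeros_like(values)

    lengths, ttrs, word_lens = [], [], []

    for text in texts:
        words = text.split()

        # 1. Length
        lengths.append(len(words))

        # 2. Lexical diversity (type-token ratio)
        unique_words = len(set(w.lower() for w in words))
        ttrs.append(unique_words / max(len(words), 1))

        # 3. Average word length (complexity proxy)
        word_lens.append(np.mean([len(w) for w in words]) if words else 0.0)

    # Normalize each feature before combining, so raw length (an unbounded
    # count) does not dominate the ratio-valued features. Higher lexical
    # diversity is treated as harder here.
    composite = (
        0.3 * _minmax(np.array(lengths, dtype=float))
        + 0.4 * _minmax(np.array(ttrs, dtype=float))
        + 0.3 * _minmax(np.array(word_lens, dtype=float))
    )

    return _minmax(composite)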

Complete Training Example

import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


def run_curriculum_experiment():
    """Experiment comparing curriculum learning with random training."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Prepare the data
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST('./data', train=False, transform=transform)

    # Simple CNN model
    def create_model():
        return nn.Sequential(
            nn.Conv2d(1, 32, 3, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(9216, 128), nn.ReLU(),
            nn.Linear(128, 10)
        ).to(device)

    # Step 1: compute difficulty scores
    print("Computing difficulty scores...")
    scorer = DifficultyScorer(method='loss')

    # Train a simple probe model for 1 epoch, then measure difficulty by its loss
    probe_model = create_model()
    probe_optimizer = torch.optim.Adam(probe_model.parameters(), lr=1e-3)
    probe_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

    probe_model.train()
    criterion = nn.CrossEntropyLoss()
    for data, targets in probe_loader:
        data, targets = data.to(device), targets.to(device)
        probe_optimizer.zero_grad()
        loss = criterion(probe_model(data), targets)
        loss.backward()
        probe_optimizer.step()

    difficulty_scores = scorer.score_by_loss(probe_model, train_dataset, device)

    # Step 2: Curriculum Training
    print("\n--- Curriculum Training ---")
    curriculum_model = create_model()
    curriculum_trainer = CurriculumTrainer(
        model=curriculum_model,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        difficulty_scores=difficulty_scores,
        num_epochs=30,
        batch_size=64,
        strategy='sqrt',
        initial_fraction=0.3,
        device=device
    )
    curriculum_history = curriculum_trainer.train()

    # Step 3: Random Training (baseline for comparison)
    print("\n--- Random Training ---")
    random_model = create_model()
    random_optimizer = torch.optim.Adam(random_model.parameters(), lr=1e-3)
    random_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=100)

    for epoch in range(30):
        random_model.train()
        for data, targets in random_loader:
            data, targets = data.to(device), targets.to(device)
            random_optimizer.zero_grad()
            loss = criterion(random_model(data), targets)
            loss.backward()
            random_optimizer.step()

        if (epoch + 1) % 10 == 0:
            random_model.eval()
            # Evaluate without tracking gradients
            with torch.no_grad():
                correct = sum(
                    random_model(d.to(device)).argmax(1).eq(t.to(device)).sum().item()
                    for d, t in test_loader
                )
            total = len(test_dataset)
            print(f"Epoch {epoch+1}/30 | Val Acc: {correct/total:.4f}")

    print("\nExperiment complete.")


if __name__ == "__main__":
    run_curriculum_experiment()

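Practical Guidelines

Recommended Hyperparameters

Parameter	Recommended range	Description
Initial fraction	0.2 - 0.4	Fraction of the data used at the start
Warmup fraction	0.3 - 0.5	Fraction of training after which the full dataset is in use
Pacing strategy	sqrt, linear	sqrt is generally stable
Difficulty metric	Loss, Perplexity	Choose according to the task
Lambda growth (SPL)	1.1 - 1.5	Self-paced rate

When to Use Curriculum Learning

Effective when:
  + The dataset is large and spans a wide range of difficulty
  + Label noise is present
  + Training is unstable or converges slowly
  + LLM pretraining (data is seen only once)
  + Domain knowledge makes difficulty definable

Ineffective or unnecessary when:
  - The dataset is small and homogeneous
  - Training already converges quickly
  - Difficulty is ill-defined
  - Hard example mining matters more for the task

Cautions

1. Reliability of the difficulty measure
   - A poorly chosen difficulty definition can hurt performance
   - Combining several metrics is safer
   - Review by a domain expert is recommended

2. Initial data fraction
   - Too small (< 10%): the model learns a biased representation
   - Too large (> 50%): the CL effect is negligible
   - 20-30% is a good starting point

3. When to reach the full dataset
   - Too early: the benefit of CL is lost
   - Too late: hard data is under-trained
   - Reaching it 30-50% of the way through training is recommended

4. Self-Paced vs Predefined
   - With domain knowledge: predefined curriculum
   - Without it: self-paced learning
   - Best: SPCL (a combination of the two)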

References

Key Papers

Bengio, Y. et al. (2009). Curriculum Learning. ICML 2009.
Kumar, M. P. et al. (2010). Self-Paced Learning for Latent Variable Models. NIPS 2010.
Jiang, L. et al. (2015). Self-Paced Curriculum Learning. AAAI 2015.
Platanios, E. et al. (2019). Competence-based Curriculum Learning for Neural Machine Translation. NAACL 2019.
Soviany, P. et al. (2022). Curriculum Learning: A Survey. IJCV 2022.